
· 5 min read
Barun Acharya

LSM hooks in the Linux kernel mediate access to internal kernel objects such as inodes, tasks, files, devices, and IPC. "LSM" in general refers to these generic hooks added in the core kernel code; security modules can then use these generic hooks to implement enhanced access control as independent kernel modules. AppArmor, SELinux, Smack, and TOMOYO are examples of such independent kernel security modules.

LSM seeks to allow security modules to answer the question "May a subject S perform a kernel operation OP on an internal kernel object OBJ?"

LSMs can drastically reduce the attack surface of a system if appropriate policies using security modules are implemented.

DACs vs. MACs

DAC (Discretionary Access Control) restricts access to objects based on the identity of subjects or groups. For decades, Linux only had DAC-based access controls in the form of user and group permissions. One problem with DAC is that its primitives are transitive in nature: a privileged user can create other privileged users, and those users in turn get access to restricted objects.

With MAC (Mandatory Access Control), the subjects (e.g., users, processes, threads) and objects (e.g., files, sockets, memory segments) each have a set of security attributes. These security attributes are centrally managed through MAC policies. Under MAC, the user/group does not make the access decision; the decision is governed by the security attributes.

LSMs are a form of MAC-based controls.

LSM Hooks

LSM mediates access to kernel objects by placing hooks in the kernel code just before the access.

The LSM hooks are applied after the DAC and other sanity checks have been performed.

The hooks are placed in the code paths that operate on core kernel objects, and they are dereferenced through a global hooks table. These global hooks are registered (e.g., the AppArmor hooks) when the corresponding security module is initialized.
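For reference, a simplified sketch of what such a kernel-side wrapper expands to (the real code lives in security/security.c and uses the call_int_hook() macro over the global security_hook_heads table; this is illustrative, not a verbatim copy):

/* Simplified: the real security_file_open() also performs fsnotify checks. */
int security_file_open(struct file *file)
{
    struct security_hook_list *p;
    int rc;

    /* Walk every registered module's file_open hook (capability, AppArmor,
     * SELinux, bpf, ...); the first non-zero return value denies the access. */
    hlist_for_each_entry(p, &security_hook_heads.file_open, list) {
        rc = p->hook.file_open(file);
        if (rc != 0)
            return rc;
    }
    return 0;
}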

TOCTOU problem handling

LSMs are typically used for a system's policy enforcement. One school of thought is that the enforcement can be handled in an asynchronous fashion, i.e., the kernel audit events could pass the alert to userspace, and then the userspace could enforce the decision asynchronously.

Such an approach has several issues: because of the asynchronous nature, the malicious actor might cause the actual damage before it can be identified and stopped. For example, if the unlink() of a file object is to be blocked, asynchronous handling might let the unlink succeed before the attack can be blocked.

LSM hooks are applied inline in the kernel code path, so the kernel has the security context and other details of the object available while making the decision. Enforcement is therefore inline with the access attempt, and any blocking/denial action can be performed without TOCTOU problems.

Security Modules currently defined in Linux kernel

$ grep -Hnrw "DEFINE_LSM" LINUX-KERNEL-SRC-CODE/

./security/smack/smack_lsm.c:4926:DEFINE_LSM(smack) = {
./security/tomoyo/tomoyo.c:588:DEFINE_LSM(tomoyo) = {
./security/loadpin/loadpin.c:246:DEFINE_LSM(loadpin) = {
./security/commoncap.c:1468:DEFINE_LSM(capability) = {
./security/selinux/hooks.c:7387:DEFINE_LSM(selinux) = {
./security/bpf/hooks.c:30:DEFINE_LSM(bpf) = {
./security/safesetid/lsm.c:264:DEFINE_LSM(safesetid_security_init) = {
./security/lockdown/lockdown.c:163:DEFINE_LSM(lockdown) = {
./security/integrity/iint.c:174:DEFINE_LSM(integrity) = {
./security/yama/yama_lsm.c:485:DEFINE_LSM(yama) = {
./security/apparmor/lsm.c:1905:DEFINE_LSM(apparmor) = {

In the above list, AppArmor and SELinux are undoubtedly the most widely used. AppArmor is relatively easier to use, but SELinux provides more extensive and fine-grained policy specification. The Linux POSIX.1e capabilities logic is also implemented as a security module.

Multiple security modules can be used at the same time, and this is true in most setups: the capability module is always loaded alongside SELinux or any other LSM, and it is always ordered first in execution (controlled using the .order = LSM_ORDER_FIRST field).

Stackable vs Non-Stackable LSMs

Note that the AppArmor, SELinux, and Smack security modules initialize themselves as exclusive (LSM_FLAG_EXCLUSIVE) security modules. There cannot be two security modules with the LSM_FLAG_EXCLUSIVE flag set in the same system, which means that no two of SELinux, AppArmor, and Smack can be registered simultaneously.
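For reference, this is roughly how the ordering and exclusivity appear in the kernel sources (abridged; the exact field lists vary across kernel versions):

/* security/commoncap.c: the capability module is ordered first */
DEFINE_LSM(capability) = {
    .name  = "capability",
    .order = LSM_ORDER_FIRST,
    .init  = capability_init,
};

/* security/apparmor/lsm.c: AppArmor registers as an exclusive "major" LSM */
DEFINE_LSM(apparmor) = {
    .name  = "apparmor",
    .flags = LSM_FLAG_LEGACY_MAJOR | LSM_FLAG_EXCLUSIVE,
    .init  = apparmor_lsm_init,
};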

BPF-LSM is a stackable LSM and thus can be used alongside AppArmor or SELinux.

Permissive hooks in LSMs

Certain POSIX-compliant filesystems depend on the ability to grant accesses that would ordinarily be denied at the coarse (DAC) level of granularity (see the capabilities man page entry for CAP_DAC_OVERRIDE). LSM supports DAC override (a.k.a. permissive hooks) for particular objects such as POSIX-compliant filesystems, where the security module can grant access the kernel was about to deny.

Security Modules: A general critique

LSMs, as generic MAC-based security primitives, are very powerful. Security modules allow the administrator to impose additional restrictions on the system to reduce its attack surface. However, if a security module's policy specification language is hard to understand or debug, administrators often end up disabling it altogether, which creates friction in adoption.

References

  1. Linux Security Modules: General Security Support for the Linux Kernel, Wright & Cowan et al., 2002
  2. https://www.kernel.org/doc/html/v5.8/security/lsm.html

· 8 min read
Rudraksh Pareek

Benchmarking data

Config

  • Node: 4
  • Platform: AKS
  • Workload: Sock-shop
  • Replica: 1
  • Tool: Apache-bench (requests at the front-end service)
  • VM: DS_v2

| VM | CPU | RAM | Data disks | Temp storage |
| --- | --- | --- | --- | --- |
| DS2_v2 | 2 | 7 GiB | 8 | 14 GiB |

Without Kubearmor

Average

| Scenario | Requests | Concurrent Requests | KubeArmor CPU (m) | KubeArmor Memory (Mi) | Throughput (req/s) | Average time per req. (ms) | # Failed requests | Micro-service CPU (m) | Micro-service Memory (Mi) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| no kubearmor | 50000 | 5000 | - | - | 2205.502 | 0.4534 | 0 | 401.1 | 287.3333333 |
Readings
| Scenario | Requests | Concurrent Requests | KubeArmor CPU (m) | KubeArmor Memory (Mi) | Throughput (req/s) | Average time per req. (ms) | # Failed requests | Micro-service CPU (m) | Micro-service Memory (Mi) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| no kubearmor | 50000 | 5000 | - | - | 2246.79 | 0.445 | 0 | 380 | 239 |
| no kubearmor | 50000 | 5000 | - | - | 2187.22 | 0.457 | 0 | 378 | 358 |
| no kubearmor | 50000 | 5000 | - | - | 2244.16 | 0.446 | 0 | 451 | 258 |
| no kubearmor | 50000 | 5000 | - | - | 2213.37 | 0.452 | 0 | 351 | 304 |
| no kubearmor | 50000 | 5000 | - | - | 2131.19 | 0.469 | 0 | 380 | 251 |
| no kubearmor | 50000 | 5000 | - | - | 2215.89 | 0.451 | 0 | 400 | 326 |
| no kubearmor | 50000 | 5000 | - | - | 2172.19 | 0.46 | 0 | 428 | 332 |
| no kubearmor | 50000 | 5000 | - | - | 2195.73 | 0.455 | 0 | 444 | 240 |
| no kubearmor | 50000 | 5000 | - | - | 2206.41 | 0.453 | 0 | 385 | 278 |
| no kubearmor | 50000 | 5000 | - | - | 2242.07 | 0.446 | 0 | 414 | 318 |
| Average | | | | | 2205.502 | 0.4534 | 0 | 401.1 | 287.3333333 |

Kubearmor with discovered Policy Applied

Average

| Scenario | Requests | Concurrent Requests | KubeArmor CPU (m) | KubeArmor Memory (Mi) | Throughput (req/s) | Average time per req. (ms) | # Failed requests | Micro-service CPU (m) | Micro-service Memory (Mi) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| with Policy | 50000 | 5000 | 141.2 | 111.9 | 2169.358 | 0.4609 | 0 | 438.2 | 435.1 |
Readings
| Scenario | Requests | Concurrent Requests | KubeArmor CPU (m) | KubeArmor Memory (Mi) | Throughput (req/s) | Average time per req. (ms) | # Failed requests | Micro-service CPU (m) | Micro-service Memory (Mi) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| with Policy | 50000 | 5000 | 131 | 113 | 2162.86 | 0.462 | 0 | 542 | 446 |
| with Policy | 50000 | 5000 | 139 | 111 | 2190.72 | 0.456 | 0 | 457 | 458 |
| with Policy | 50000 | 5000 | 145 | 112 | 2103.46 | 0.475 | 0 | 445 | 395 |
| with Policy | 50000 | 5000 | 149 | 108 | 2155.55 | 0.464 | 0 | 440 | 454 |
| with Policy | 50000 | 5000 | 129 | 113 | 2177.68 | 0.459 | 0 | 395 | 394 |
| with Policy | 50000 | 5000 | 160 | 122 | 2198.53 | 0.455 | 0 | 435 | 503 |
| with Policy | 50000 | 5000 | 156 | 117 | 2179.89 | 0.459 | 0 | 391 | 451 |
| with Policy | 50000 | 5000 | 134 | 119 | 2196.78 | 0.455 | 0 | 408 | 429 |
| with Policy | 50000 | 5000 | 129 | 114 | 2178.07 | 0.459 | 0 | 424 | 435 |
| with Policy | 50000 | 5000 | 140 | 112 | 2150.04 | 0.465 | 0 | 445 | 386 |
| Average | | | 141.2 | 111.9 | 2169.358 | 0.4609 | 0 | 438.2 | 435.1 |

BPF LSM benchmarking data

| Scenario | Requests | Concurrent Requests | KubeArmor CPU (m) | KubeArmor Memory (Mi) | Throughput (req/s) | Average time per req. (ms) | # Failed requests | Micro-service CPU (m) | Micro-service Memory (Mi) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| with kubearmor | 50000 | 5000 | 130 | 99 | 1889.81 | 0.529 | 0 | 407 | 324 |
| with kubearmor | 50000 | 5000 | 120 | 104 | 1955.26 | 0.511 | 0 | 446 | 423 |
| with kubearmor | 50000 | 5000 | 122 | 101 | 1952.94 | 0.512 | 0 | 433 | 448 |
| with kubearmor | 50000 | 5000 | 152 | 104 | 1931.71 | 0.518 | 0 | 474 | 405 |
| with kubearmor | 50000 | 5000 | 142 | 108 | 1896.01 | 0.527 | 0 | 564 | 413 |
| with kubearmor | 50000 | 5000 | 110 | 107 | 1896.95 | 0.527 | 0 | 416 | 375 |
| with kubearmor | 50000 | 5000 | 115 | 106 | 1868.77 | 0.535 | 0 | 354 | 383 |
| with kubearmor | 50000 | 5000 | 114 | 109 | 1877.29 | 0.533 | 0 | 461 | 355 |
| with kubearmor | 50000 | 5000 | 130 | 105 | 1962.81 | 0.509 | 0 | 552 | 380 |
| with kubearmor | 50000 | 5000 | 102 | 110 | 1966.19 | 0.509 | 0 | 351 | 297 |
| Average | | | 123.7 | 105.3 | 1919.774 | 0.521 | 0 | 445.8 | 380.3 |
| Scenario | Requests | Concurrent Requests | KubeArmor CPU (m) | KubeArmor Memory (Mi) | Throughput (req/s) | Average time per req. (ms) | # Failed requests | Micro-service CPU (m) | Micro-service Memory (Mi) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| with policy | 50000 | 5000 | 103 | 110 | 1806.06 | 0.529 | 0 | 431 | 330 |
| with policy | 50000 | 5000 | 122 | 111 | 1836.04 | 0.511 | 0 | 432 | 348 |
| with policy | 50000 | 5000 | 123 | 108 | 1871.02 | 0.512 | 0 | 505 | 393 |
| with policy | 50000 | 5000 | 118 | 111 | 1915.07 | 0.518 | 0 | 599 | 331 |
| with policy | 50000 | 5000 | 121 | 110 | 1896.34 | 0.527 | 0 | 405 | 310 |
| with policy | 50000 | 5000 | 126 | 113 | 1896.7 | 0.527 | 0 | 450 | 430 |
| with policy | 50000 | 5000 | 117 | 110 | 1915.79 | 0.535 | 0 | 408 | 382 |
| with policy | 50000 | 5000 | 128 | 111 | 1885.77 | 0.533 | 0 | 482 | 321 |
| with policy | 50000 | 5000 | 122 | 114 | 1900.96 | 0.509 | 0 | 433 | 359 |
| with policy | 50000 | 5000 | 124 | 104 | 1887.87 | 0.509 | 0 | 448 | 393 |
| Average | | | 120.4 | 110.2 | 1881.162 | 0.5318 | 0 | 459.3 | 359.7 |

· 3 min read
Rahul Jadhav

Introduction

Oracle Container Engine for Kubernetes (OKE) is a managed Kubernetes service for operating containerized applications at scale while reducing the time, cost, and operational burden of managing the complexities of Kubernetes infrastructure. Container Engine for Kubernetes enables you to deploy Kubernetes clusters instantly and ensure reliable operations with automatic updates, patching, scaling, and more.

Oracle Linux is a distribution of Linux developed and maintained by Oracle and is primarily used on OKE. It is based on the Red Hat Enterprise Linux (RHEL) distribution and is designed to provide a stable and secure environment for running enterprise-level applications. Oracle Linux includes Unbreakable Enterprise Kernel (UEK) which delivers business-critical performance and security optimizations for cloud and on-premises deployment.

Supporting KubeArmor on Oracle Linux

Oracle Linux

While UEK (Unbreakable Enterprise Kernel) is a heavily fortified kernel image, the security of the pods and containers is still the responsibility of the application developer. KubeArmor, a CNCF (Cloud Native Computing Foundation) sandbox project, is a runtime security engine that leverages extended Berkeley Packet Filter (eBPF) and BPF-LSM (Berkeley Packet Filter Linux Security Module) to protect pods and containers.

With version 0.5, KubeArmor integrates with BPF-LSM for pod and container-based policy enforcement. BPF-LSM is a new LSM (Linux Security Module) introduced in newer kernels (version 5.7 and above). It allows KubeArmor to attach BPF bytecode, containing user-specified policy controls, at the LSM hooks.

Linux Kernel

KubeArmor provides enhanced security by using BPF-LSM to protect k8s pods hosted on OKE by limiting system behavior with respect to processes, files, and the use of network primitives. For example, a k8s service access token that’s mounted within the pod is accessible by default across all the containers within that pod. KubeArmor can restrict access to such tokens only for certain processes. Similarly, KubeArmor is used to protect other sensitive information (e.g., k8s secrets, x509 certificates) within the container. You can specify policy rules in KubeArmor such that any attempts to update the root certificates in any of the certificate’s folders (i.e., /etc/ssl/, /etc/pki/, or /usr/local/share/ca-certificates/) can be blocked. Moreover, KubeArmor can restrict the execution of certain binaries within the containers.

To Summarize

KubeArmor, a cloud-native solution, now supports OKE to secure pods and containers using BPF-LSM for inline attack mitigation/prevention. In the case of k8s, the pods are the execution units and are usually exposed to external entities. Thus, it's imperative to have a layer of defense within the pods so that an attacker is limited in their ability to use system primitives to exploit a vulnerability. KubeArmor is a k8s-native solution that uses Linux kernel primitives on Unbreakable Enterprise Kernel (UEK) to harden the pods, further fortifying the K8s engine.

· 3 min read
Achref Ben Saad

KubeArmor annotation controller

Starting from version 0.5, KubeArmor leverages admission controllers to support policy enforcement on a wide range of Kubernetes workloads such as individual pods, Jobs, StatefulSets, and so on.

What is an admission controller?

An admission controller is a piece of code that intercepts requests to the Kubernetes API server prior to persistence of the object(1). Admission controllers can be one of two types:

  • Validating admission controllers: used to either accept or reject an action on a resource, e.g., reject creation of pods in the default namespace. Kubernetes ships with many validating controllers, such as the NodeRestriction controller that limits what a kubelet can modify.
  • Mutating admission controllers: used to apply modifications to requests prior to persistence, e.g., add default resource requests if they are not defined by the user. AWS EKS uses mutating controllers to add environment variables (region, node name, …) to each created container.

The order of admission controller execution is as follows:

  • All mutations are performed on the original request and then merged; if a conflict occurs, an error is returned. Only the schema of the resulting merge is validated.
  • All validating controllers are then called; the request is rejected if any one validating controller rejects it.

(Figure: Admission controller)

KubeArmor leverages mutating controllers to enable policy enforcement on Kubernetes workloads.

What are the benefits of the annotation controller?

Before v0.5, policies were enforced by applying the appropriate annotations to pods by patching their parent Deployment. This meant that policies could only be applied to pods controlled by Deployments.

By using mutating controllers, we are able to extend KubeArmor's capabilities to support essentially all types of workloads, as the annotations are applied to any pod prior to its creation. As a result, KubeArmor also sends far fewer requests to the API server; previously, patch operations were executed in parallel, and often concurrently, by all KubeArmor pods.

(Figure: Admission controller)

What if the controller fails?

KubeArmor maintains the old annotation logic as a fallback so that our users can continue to benefit from KubeArmor policy enforcement, albeit at a degraded level, in case of a failure. Details can be found in the events section of the newly created pods.

How can I install it?

The controller comes bundled with KubeArmor; you can install it via the karmor CLI tool or via our installation manifests under /deployments.

What are the Kubernetes versions that can support the new controller?

The controller can run on Kubernetes clusters from v1.9 onwards. Please keep in mind that Kubernetes only supports the three most recent minor versions.

References

  1. Kubernetes documentation

· 3 min read
Barun Acharya

KubeArmor BPF LSM Integration

High Level Module Changes

(Figures: Now vs. Proposed)

Module Design

(Figure: Module Design)

Map Design

(Figure: Map Design)

Outer Map details

struct outer_hash {
    __uint(type, BPF_MAP_TYPE_HASH_OF_MAPS);
    __uint(max_entries, X);
    __uint(key_size, sizeof(struct outer_key)); // 2*u32
    __uint(value_size, sizeof(u32));            // Inner Map File Descriptor
    __uint(pinning, LIBBPF_PIN_BY_NAME);        // Created in Userspace, Identified in Kernel Space using pinned name
};
  • Key
    • Identifier for Containers

struct outer_key {
    u32 pid_ns;
    u32 mnt_ns;
};

Inner Map details

&ebpf.MapSpec{
    Type:       ebpf.Hash,
    KeySize:    4,    // Hash Value of Entity
    ValueSize:  8,    // Decision Values
    MaxEntries: 1024,
}
  • Value
struct data_t {
    bool owner;     // owner only flag
    bool read;      // read only flag
    bool dir;       // policy directory flag
    bool recursive; // directory recursive flag
    bool hint;      // policy directory hint
};
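Putting the two maps together, a minimal sketch of the two-level lookup performed in kernel space (the instance name kubearmor_containers is assumed here from the pinned-map name, and the usual vmlinux.h/bpf_helpers.h headers plus the struct definitions above are taken as given; this illustrates the idea rather than the exact implementation):

/* Assumed instance of the outer map defined above, reused via its pinned name. */
struct outer_hash kubearmor_containers SEC(".maps");

static __always_inline struct data_t *lookup_rule(struct outer_key *okey, u32 entity_hash)
{
    /* Outer lookup: {pid_ns, mnt_ns} -> this container's inner rule map. */
    void *inner = bpf_map_lookup_elem(&kubearmor_containers, okey);
    if (!inner)
        return NULL;                 /* no policies loaded for this container */

    /* Inner lookup: hashed file/process entity -> decision flags (struct data_t). */
    return bpf_map_lookup_elem(inner, &entity_hash);
}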

Handling of Events

(Figure: Handling of Events)

Deeper Dive with Examples

  1. Example 1

  2. Example 2: but what if it's not a match? (Figure: Match) We explore how directory matching works in the next example.

  3. Example 3: notice how we split the directory policy into a set of hints in the map (Figure: Map); this helps in efficient matching of directory paths in kernel space.
    What if we try to access a file in a different directory? (Figure: Route)
    The absence of a hint lets us break out of the iteration early, optimizing the process.

Directory Matching

#pragma unroll
for (int i = 0; i < MAX_STRING_SIZE; i++) {
    if (path[i] == '\0')
        break;

    if (path[i] == '/') {
        __builtin_memset(&dir, 0, sizeof(dir));
        bpf_probe_read_str(&dir, i + 2, path);

        fp = jenkins_hash(dir, i + 1, 0);

        struct data_t *val = bpf_map_lookup_elem(inner, &fp);
        if (val) {
            if (val->dir) {
                matched = true;
                goto decisionmaker;
            }
            if (val->hint == 0) { // If we match a non directory entity somehow
                break;
            }
        } else {
            break;
        }
    }
}

Hashing

File and source (process) names can be huge; in the worst case the two together add up to 8192 bytes, which is far too large to use as a map key. So we hash that value down to a u32 key: we store hashed values from userspace and look up hashed values from kernel space when making a decision.

We plan to use a Jenkins hash algorithm modified for use in eBPF land, with a matching implementation in userspace; a sketch follows.
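A sketch of a seeded one-at-a-time variant of the Jenkins hash, shaped like the jenkins_hash(buf, len, seed) call used in the snippet above (the actual KubeArmor implementation may differ in detail; the usual eBPF headers are assumed):

static __always_inline u32 jenkins_hash(const char *key, u32 len, u32 seed)
{
    u32 hash = seed;

    /* In eBPF this loop must be bounded (e.g. capped at MAX_STRING_SIZE and
     * unrolled); the userspace counterpart can loop freely. */
    for (u32 i = 0; i < len; i++) {
        hash += key[i];
        hash += hash << 10;
        hash ^= hash >> 6;
    }
    hash += hash << 3;
    hash ^= hash >> 11;
    hash += hash << 15;

    return hash;
}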

Based on Event Auditor Implementation

Inspirations

TODO/ToCheck

  1. List out LSM Hooks to be integrated with
  2. Explore Hashing
  3. Analyse Performance Impact
  4. ...

Miscellaneous Notes

· 5 min read
Barun Acharya

A few months back I presented at Cloud Native eBPF Day Europe 2022 about Armoring Cloud Native Workloads with BPF LSM and planted a thought about building a holistic tool for runtime security enforcement leveraging BPF LSM. I have spent the past few weeks collaborating with the rest of the team at KubeArmor to realize that thought. This blog post will explore the why’s and how’s of implementing security enforcement as part of KubeArmor leveraging BPF LSM superpowers at its core.

Why❓

Linux Security Modules (LSM) provides the security hooks necessary to set up the least-permissive perimeter for various workloads. A nice introduction to LSMs can be found here.

KubeArmor is a cloud-native runtime security enforcement system that leverages these LSMs to secure the workloads.

LSMs are really powerful, but they weren't built with modern workloads, including containers and orchestrators, in mind. Also, the learning curve of their policy languages tends to be steep, which creates friction in adoption.

eBPF has provided us with the ability to safely and efficiently extend the kernel’s capabilities without requiring changes to kernel source code or loading kernel modules.

BPF LSM leverages the powerful LSM framework while providing us with the ability to load our custom programs with decision-making into the kernel seamlessly helping us protect modern workloads with enough context while we can choose to keep the interface easy to understand and user-friendly.

KubeArmor already integrates with AppArmor and SELinux and has a set of tools and utilities providing a seamless experience for enforcing security, but these integrations come with their own set of complexities and limitations. Integrating with BPF LSM instead gives us fine-grained control over the LSM hooks.

How✍️

The Implementation can be conveyed by the following tales:

  1. Map to cross boundaries - Establishing the interface between KubeArmor daemon (Userspace) and BPF Programs (KernelSpace)
  2. Putting Security on the Map - Handling Policies in UserSpace and Feeding them into the Map
  3. Marshal Law - Enforcing Policies in the KernelSpace

Map to cross boundaries🗺️

There's a clear boundary between the territories of KernelSpace and UserSpace, so how do we establish routes between the two?

We leverage eBPF Maps for establishing the interface between the KubeArmor daemon (Userspace) and BPF Programs (KernelSpace). As described in the kernel doc,

‘maps’ is a generic storage of different types for sharing data between kernel and userspace.

It seems apt to help us navigate here 😁

For each container that KubeArmor needs to protect (or, as I like to term it, 'Armor Up'), we create an entry in the global BPF Hash of Maps pinned to the BPF filesystem under /sys/fs/bpf/kubearmor_containers. Each entry points to another BPF hashmap which holds all the details of the policies that need to be enforced.

(Figure: Map Interface)

Putting Security on the Map📍

KubeArmor security policies have a lot of metadata; we cannot put all of it into maps and let the BPF program navigate those complexities.

For instance, we usually map security policies to workloads through the labels associated with Pods and Containers. But we can't send labels to the BPF program, and eBPF programs wouldn't handle container names/IDs well either. So KubeArmor extracts information from the Kubernetes and CRI (Docker/containerd/CRI-O) APIs and simplifies it to something we can also extract in the eBPF program.

struct key {
    u32 pid_ns;
    u32 mnt_ns;
};

struct containers {
    __uint(type, BPF_MAP_TYPE_HASH_OF_MAPS);
    __uint(max_entries, X);
    __uint(key_size, sizeof(struct key)); // 2*u32
    __uint(value_size, sizeof(u32));      // Rule Map File Descriptor
    __uint(pinning, LIBBPF_PIN_BY_NAME);  // Created in Userspace, Identified in Kernel Space using pinned name
};

Similarly, KubeArmor can receive conflicting policies; we need to handle and resolve them in the KubeArmor userspace program before putting them on the map.

After all the rule simplification, conflict resolution, and handling of policy updates, we send the data to the eBPF map, prepping the BPF programs for enforcement.
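KubeArmor itself does this in Go; the same idea, sketched in C with libbpf (error handling and the real rule encoding omitted, and the struct name container_key simply mirrors the key struct shown above):

#include <bpf/bpf.h>

struct container_key { __u32 pid_ns; __u32 mnt_ns; };   /* mirrors "struct key" above */

static void armor_up(struct container_key key)
{
    /* Open the outer map pinned by name under the BPF filesystem. */
    int outer_fd = bpf_obj_get("/sys/fs/bpf/kubearmor_containers");

    /* Create this container's rule map (the inner hashmap). */
    int inner_fd = bpf_map_create(BPF_MAP_TYPE_HASH, "rules",
                                  sizeof(__u32), /* hashed entity  */
                                  8,             /* decision flags */
                                  1024, NULL);

    /* ... simplify policies, resolve conflicts, and fill inner_fd here ... */

    /* Point the container's {pid_ns, mnt_ns} entry at its rule map. */
    bpf_map_update_elem(outer_fd, &key, &inner_fd, BPF_ANY);
}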

Marshal Law

We finally marshal all the data in the kernel space and impose MAC (Military Access Control? 🔫 Pun intended :P ).

In the kernel space, where our BPF LSM programs reside, each program extracts the main entity, i.e., the file path, process path, or network socket/protocol, pairs it with its parent process path, and looks it up in the respective maps. We make the decision based on these lookup values.
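To make this concrete, here is a minimal sketch of a BPF LSM program attached to the file_open hook under this scheme. It is illustrative rather than KubeArmor's actual bytecode: the map declaration mirrors the layout described above, and hash_file_path() is a hypothetical placeholder for the real path extraction and hashing logic.

// Sketch only, not KubeArmor's actual program.
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>
#include <bpf/bpf_core_read.h>

#define EPERM 1

struct {
    __uint(type, BPF_MAP_TYPE_HASH_OF_MAPS);
    __uint(max_entries, 256);            /* stand-in for X above */
    __uint(key_size, sizeof(u32) * 2);   /* pid_ns + mnt_ns */
    __uint(value_size, sizeof(u32));     /* fd of the per-container rule map */
    __uint(pinning, LIBBPF_PIN_BY_NAME); /* reuse the map pinned by userspace */
} kubearmor_containers SEC(".maps");

static __always_inline u32 hash_file_path(struct file *file)
{
    return 0; /* placeholder: real code reads the path and hashes it */
}

SEC("lsm/file_open")
int BPF_PROG(armor_file_open, struct file *file)
{
    struct { u32 pid_ns; u32 mnt_ns; } key;   /* same layout as the key struct above */
    struct task_struct *t = (struct task_struct *)bpf_get_current_task();

    key.pid_ns = BPF_CORE_READ(t, nsproxy, pid_ns_for_children, ns.inum);
    key.mnt_ns = BPF_CORE_READ(t, nsproxy, mnt_ns, ns.inum);

    void *rules = bpf_map_lookup_elem(&kubearmor_containers, &key);
    if (!rules)
        return 0;                        /* container not "armored up": allow */

    u32 id = hash_file_path(file);
    if (bpf_map_lookup_elem(rules, &id))
        return -EPERM;                   /* a block rule matched: deny the open */

    return 0;                            /* default: allow */
}

char LICENSE[] SEC("license") = "GPL";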

There are a fair bit of complexities involved here which I have skipped, if you are interested in them, check out Design Doc and Github Pull Request.

I would also like to credit How systemd extended security features with BPF LSM which acted as an inspiration for the implementation design.

Armoring Up

The environment requires a kernel >= 5.8 configured with CONFIG_BPF_LSM, CONFIG_DEBUG_INFO_BTF and the BPF LSM enabled (via CONFIG_LSM="...,bpf" or the "lsm=...,bpf" kernel boot parameter).

You can use the daemon1024/kubearmor:bpflsm image and follow the Deployment Guide to try it out.

A sample alert here shows enforcement through BPF LSM: (Figure: sample alert)

Next Steps

We have unraveled just a drop of what BPF LSM is capable of, and we plan to extend our security features to lots of other use cases and BPF LSM would play an important role in it.

Near future plans include supporting wild cards in Policy Rules and doing in-depth performance analysis and optimizing the implementation.

👋
That sums up my journey to implement security enforcement leveraging BPF LSM at its core. It was a lot of fun and I learned a lot. Hope I was able to share my learnings 😄

If you have any feedback about the design and implementation feel free to comment on the Github PR. If you have any suggestions/thoughts/questions in general or just wanna say hi, my contact details are here ✌️

· 3 min read
Rahul Jadhav

High Level Design

Policy Mapping

KubeArmorPolicy for seccomp

apiVersion: security.kubearmor.com/v1
kind: KubeArmorSeccompPolicy
metadata:
  name: ksp-wordpress-block-process
  namespace: wordpress-mysql
spec:
  severity: 3
  selector:
    matchLabels:
      app: wordpress
  seccomp:
    arch: [x86_64, x86, x32] #OPTIONAL
    syscalls: [accept4, epoll_wait, pselect6, futex, madvise]
  action: Allow

Following is the mapped seccomp profile:

{
  "defaultAction": "SCMP_ACT_ERRNO",
  "architectures": [
    "SCMP_ARCH_X86_64",
    "SCMP_ARCH_X86",
    "SCMP_ARCH_X32"
  ],
  "syscalls": [
    {
      "names": [
        "accept4",
        "epoll_wait",
        "pselect6",
        "futex",
        "madvise"
      ],
      "action": "SCMP_ACT_ALLOW"
    }
  ]
}

Rationale for a separate Kind: KubeArmorSeccompPolicy (kscmp for short)

K8s, Docker, or any other container runtime allows the user to set just a single seccomp policy per pod. The seccomp policy has to be referenced via annotations on the pod/container, and the pod needs to be restarted for any change in the policy to take effect. This is the primary reason for having a separate kscmp kind rather than reusing the existing ksp with seccomp rules.

Handling multiple kscmp policies per pod

There can be at most one seccomp policy per container. Thus, when multiple seccomp policies are pushed for the same pod, the seccomp rules have to be merged. The general rule of thumb is that a syscall deny always takes precedence when multiple seccomp rules cover the same syscall.

Consider an example:

  1. Policy 1: kscmp1 - allow{s1, s2, s4}, deny{s3}
  2. Policy 2: kscmp2 - allow{s1, s3}, deny{s2}

Here s1, s2, s3, and s4 are four different syscalls, and kscmp1/kscmp2 are two KubeArmorSeccompPolicy objects. The final seccomp profile after merging the kscmp policies will be: allow{s1, s4}, deny{s2, s3}.

Let's assume that policy kscmp2 gets deleted at a later point in time. The final seccomp profile in that case would be: allow{s1, s2, s4}, deny{s3}.

Do we need to use any third-party library (such as libseccomp)?

In case of k8s, the underlying container runtime (such as docker, containerd) does the job of enforcement. The enforcement can be done only by the process that forks the workloads and in case of k8s that process is runc.

Kubearmor needs to create the JSON profile and set the annotations appropriately for the container runtime to use it. For VM/bare-metal workloads, we need to do the same with systemd, i.e., instruct systemd to use a given seccomp profile (example here). For VM/bare-metal with containers (but without k8s), runc/docker come into play again, since they spawn the workload processes and are in charge of setting the seccomp profile.

libseccomp is primarily needed if someone wants to attach an eBPF filter with seccomp. A seccomp-eBPF filter won't be needed for whitelist/blacklist-based rules. Also, the Docker profile allows us to specify "parameterized syscalls" using the JSON profile as input, so that takes care of even advanced use cases. (I checked that docker/runc do not internally link to libseccomp.)

TODOs

  • KubeArmor restart: Reload seccomp profiles
  • What happens when the workload moves from one node to another and that the architectures (x86, arm,...) of these two nodes differ?
  • Volume mount for /etc/seccomp ... The /etc/seccomp has to be created when kubearmor is deployed.
  • Test: Even if apparmor/selinux is not available, seccomp should still work
  • Prepare DevSecOps flow for managing/handling seccomp policies with auto discovery.
  • Handle creation of /var/lib/kubelet/seccomp/ folder during KubeArmor init time.
  • Check location of seccomp folder on different k8s implementations (k3s/minikube/GKE/EKS/AKS)

· 11 min read
Rahul Jadhav

What problem are we solving here?

Kernel system calls and other events are audited by various tools to detect malicious behavior of a process. For example, if a process that is not part of a "set of processes / process spec" attempts to access a particular path using the open() system call, the module will raise an alert, since it does not expect any process outside of that process spec to access the file or filesystem path. Event monitoring/auditing systems are also used by various compliance frameworks (such as PCI-DSS, SOC2), hardening standards (such as STIGs), and attack frameworks (such as MITRE) that provide guidelines for setting up defense rules. Falco is one such event monitoring/auditing system; it uses eBPF or a kernel module to filter system events at runtime in kernel space and checks for malicious behavior based on rules passed from the userspace monitor process.

(Figure: Process)

Consider an example scenario where, as per the policy, only processes invoked from the /usr/bin/ folder are allowed to access the /etc/ folder. The allowed process spec in this case is any process from the /usr/bin/ path. Now let's assume that at runtime there is a process XYZ which does not match the process spec and tries to access the file /etc/crontab. As per the above figure, the following steps happen:

  1. Process XYZ does an open() call on /etc/crontab. This results in syscall(open) getting invoked in kernel space.
  2. The eBPF instruction set inserted for monitoring purposes detects the syscall(open) event.
  3. It verifies that the filter does not match, i.e., it finds that the process XYZ attempting to open /etc/crontab does not match the process spec.
  4. It forwards the event to the monitor process in userspace.

Event monitoring systems can take spatial conditions into account for filtering and then raise an event that can be used for further analysis. The spatial condition in the above example is that when a file open is attempted, the process context is additionally checked to verify whether it belongs to a process spec before raising an event. Thus, the process context (name, pid, namespace, process path) forms the spatial conditions on which the open() event can be further filtered.

Quality of such monitoring/filtering/auditing systems is dependent on:

  1. How well can the filters represent the rules mentioned in the compliance/hardening standards?
  2. How much performance overhead is added by the filtering system?

Problems with monitoring/filtering/auditing systems:

There are two problems with such systems:

  1. There is no option to apply conditions based on rate limits. For example, generate an audit event only when a certain system event is detected more than 10 times per unit time (say 1 minute).
  2. There is no option to apply temporal correlation. Currently the filters operate only on the context available at that event instance; temporal correlation is not possible. For example, there is no way to set a filter which says: if the network send() syscall is invoked more than 100 times in 1 minute and the file read() syscall is invoked more than 100 times per second, then raise an audit event.

Problems addressed by this design:

To overcome the problems mentioned above, this idea makes the following major changes:

  1. It allows rate-limit filters and temporal-correlation filters to be specified from userspace, while the filter is handled entirely in-kernel and only the final result is emitted to userspace. This prevents unnecessary context switches.

  2. It provides an improved schematic/design for implementing the rate-limit/temporal-correlation filters such that the memory overhead and the in-kernel processing overhead are kept to a minimum.

  3. By using the policy constructs defined in this idea, a policy engine can avoid a lot of false positives in real environments, making the security engine robust.

Sample Use Cases

Sample policy for rate-limited events

apiVersion: security.accuknox.com/v1
kind: KubeArmorPolicy
metadata:
  name: ksp-wordpress-config-block
  namespace: wordpress-mysql
spec:
  severity: 10
  selector:
    matchLabels:
      app: wordpress
  - process: *, -*/bash, -*/sh
    msg: "readdir limit exceeded"
    severity: 5
    - syscall: readdir
      param1: /*, -/home/*, -/var/log/*
      rate: 10p1s

Sample policy allowing temporal correlation of events

apiVersion: security.accuknox.com/v1
kind: KubeArmorPolicy
metadata:
  name: ksp-wordpress-config-block
  namespace: wordpress-mysql
spec:
  severity: 10
  selector:
    matchLabels:
      app: wordpress
  - process: *, -*/bash, -*/sh
    msg: "readdir limit exceeded"
    severity: 5
    - syscall: readdir
      param1: /*, -/home/*, -/var/log/*
      rate: 10p1s

Design expectations & Limitations

Design Expectations

The design should sufficiently explain:

  1. How will the process filter work?

    1. How do we ensure that the least amount of overhead is incurred while handling processes that are not of interest?
    2. How do we ensure that events the policies are not interested in do not induce additional control overhead?
  2. What eBPF bytecodes have to be loaded, both statically and dynamically?

  3. How will event parameter handling be done? Event parameter handling must incur the least overhead.

  4. How will rate-limiting work?

Limitations & Assumptions

  1. Works only on systems supporting eBPF (Linux kernel >= 4.18).

  2. Different policies could induce different amounts of overhead. Thus, the syscalls chosen for monitoring must be properly reviewed and their performance implications understood. In the future, we could have a system that identifies the approximate overhead added by a policy and informs/alerts the user.

  3. This design assumes Linux kernel >= 4.18.

Sample reference policy

apiVersion: security.accuknox.com/v1
kind: KubeArmorPolicy
metadata:
  name: detect-active-network-scanning
  namespace: multiubuntu
spec:
  - process: *
    msg: "local reconn attempt with TCP scan"
    severity: 5
    - syscall: connect  //FD1
      proto: *P
      ip4addr: 192.168.10.10/25 0xffffff80, 10.*.*.* 0xff0000000, 192.168.*.1 0xffff00ff
      rate: 20p1s

    - syscall: connect  //FD2
      proto: FILE
      path: /tmp/*
      rate: 20p1s

  - process: *, /bin/*sh, -*ssh
    msg: "consecutive RAW sends"
    severity: 5
    - syscall: raw_sendto
      param2: 192.168.*.*, 10.*.*.*
      rate: 20p1s

  - process: *, /bin/*sh, -*ssh
    msg: "consecutive RAW sends"
    severity: 5
    - syscall: raw_send
      param2: 192.168.*.*, 10.*.*.*
      rate: 20p1s

  - process: *
    msg: "outbound probes detected"
    severity:
    - kprobe: tcp_rst
      Rate: 10p1s

  - process: *
    msg: "inbound probes detected"
    severity:
    - kprobe: tcp_rst_send
      Rate: 10p1s

Note: Not every event might be associated with a process spec. There are events that are generated which may not have any associated task structure.

What design constraints do we have to live with?

For example: the number of BPF programs, the instruction limits, memory…

Module Design

(Figure: Module Design)

Handling of events

On New Policy

When a new policy is provided as an input the policy might be either a

  1. Container based policy

  2. Host based policy

In either case, a new entry would be added in the process_spec_table containing the pid-ns of the container. In case of host-based policy the pid-ns would be 0.

process-spec-table

| Container pid-ns | process-spec | event-filter-spec |
| --- | --- | --- |
| 12345 | * | [event1-fd1, event2-fd2, …] |
| 53678 | /usr/bin/*sh | [event3-fd3, ...] |
| 12312 | *, -/*sh | [event4-fd4, event5-fd5, …] |
| 5235 | [NA] | ... |
| 0 (host-based) | ... | ... |

Points to note:

  1. There could be several event-filter-specs for the same [pid-ns, process-spec] tuple.

  2. 0 pid-ns indicates host-based rules

  3. The event-filter-spec contains eBPF bytecode that is compiled on demand. The event-filter-spec has the event type/info for which the corresponding event/kprobe/tracepoint would be loaded.

  4. Every event-filter-spec's compiled bytecode is pre-loaded into the BPF_MAP_TYPE_PROG_ARRAY for tail-call processing, and its file descriptor is noted in the event-filter-spec column (see the sketch below).
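A minimal sketch of that program-array plumbing (names, sizes, and the attach point are assumed for illustration, not taken from the event auditor's actual code):

// Sketch only: pre-loaded event-filter programs reached via tail calls.
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

struct {
    __uint(type, BPF_MAP_TYPE_PROG_ARRAY);
    __uint(max_entries, 64);            /* assumed upper bound on filter FDs */
    __uint(key_size, sizeof(u32));
    __uint(value_size, sizeof(u32));    /* program fds, populated from userspace */
} event_filters SEC(".maps");

SEC("tracepoint/syscalls/sys_enter_connect")
int match_process(void *ctx)
{
    u32 idx = 0;   /* in the real design this index comes from the process-filter-table */

    /* Jump into the pre-loaded event-filter program for this slot;
     * on success bpf_tail_call() does not return. */
    bpf_tail_call(ctx, &event_filters, idx);

    return 0;      /* no filter loaded at that slot: nothing to do */
}

char LICENSE[] SEC("license") = "GPL";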

(Figure: process spec)

On New Process

The process-filter-table is a bpf map that stores the mapping of {pid-ns, pid, event-id} to the corresponding set of { event-filter-fds }.

process-filter-table

| Pid-ns, pid, Event-ID | Event-filter-FD | Opaque Data |
| --- | --- | --- |
| { 0xcafebabe, 0xdeadface, SYSCALL-CONNECT } | [FD1, FD2] | [...event-handler can keep rate-info and other event specific data…] |

[TODO]: The process wildcard matching has to be done in the kernel space. Write a prototype code to validate the wildcard matching can be implemented effectively in kernel space.

onNewProcess pseudo-code (keyed by {Pid-ns, pid, Event-ID}):
Input: event_info_t (check the next section for details)
1. Check the process-spec-table to see whether the container pid-ns matches.
    a. If there is no match, ignore the new-process event.
2. If there is a match, add a new entry into the process-filter-table.
3. Note that the event-filter-fd map has to be populated.

(Figure: New Process)

On Kernel Event

(Figure: On Kernel Event)

Event Structure

event-info-structure

Note that this is not a bpf-map. This is an internal data-structure used to pass between tail-calls.

typedef struct event_info {
    uint32_t id;                      // updated by kernel-event bytecode
    uint32_t fdset[MAX_FD_PER_EVENT]; // updated by matchProcess bytecode
    void *context;                    // updated by kernel-event bytecode
} event_info_t;

where:
  • id is the event ID, such as SYSCALL-CONNECT or KPROBE-TCP_RST
  • fdset is the set of event handlers for the given kernel event
  • context is the kernel context available for the kernel event
onKernelEvent pseudo-code
1. A kernel event of interest (i.e., one which is enabled based on policy-event-filter) is called. Note that an event handler bytecode for a kernel event is inserted only if there exists a corresponding policy that operates on that kernel event.
2. The primary task of kernel_event_bytecode is to create an event_info { event_id, context } and then call the matchProcess bytecode.
3. The matchProcess matches the process. Once the process-filter-table entry is identified, the logic gets a list of tail-call FDs to call. The list of FDs are called one after another in the same sequence in which they appear in the policy spec.
4. The tail-call FDs are called one after another based on the FD set.
5. The event-handler might want to update the runtime state in the opaque-data of the process-filter-table.

Notes:

  1. It is possible that we receive a kernel event that does not have an associated process, for example kprobe:tcp_rcv_reset. Such events can only be added for host-based audit rules.

  2. Note that the new-process event from the kernel needs a special handler, because it needs to fill the process-filter-table and might have to process the event filters. [TODO]

(Figure: onKernelEvent)

Overall Event Processing Logic

(Figure: Overall Event Processing Logic)

On Process Terminate

[TODO] cleanup process-filter-table

On Policy Delete

Handle update of process-spec-table. This may lead to removal of loaded event-filter-spec ebpf bytecode and deletion of corresponding descriptors.

On delete container

Remove entry from the process-spec-table

Handling Rate-limit

Problem with handling rate-limit

(Figure: problem with handling rate-limit)

Consider the case where an event is to be observed with a rate of 10 per one second. The Period here is 1 sec. The dotted box shown in the figure above shows 1 second time period. The circles on the timeline show the occurrence of the events.

Approach 1: Fine-grained approach

This approach allows one to calculate the precise rate-limit but requires more memory to be maintained since every event observed in the time quantum has to be stored. There is also more processing time required because of the store and cycle operations.

Approach 2: Coarse-grained approach

This approach reduces the memory requirement by using adjoining time quantums, but in some cases the rate limit may not be detected exactly.

Approach Preference

Approach 2 results in much less memory and processing overhead. Also consider that in real-world cases we do not expect the user to specify an exact rate, i.e., the user will in general provide a lower bound for the rate. For example, the active-scanning policy scenario depicted in this document uses a rate limit of 10p1s, but in reality the scanning speed will be much faster, so Approach 2 should easily be able to detect it. A minimal sketch of this approach follows.
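The sketch below shows the coarse-grained check from Approach 2 (names are illustrative, not the actual implementation); the rate_state would live in the opaque data of a process-filter-table entry, and the usual eBPF headers are assumed:

struct rate_state {
    u64 window_start_ns;   /* start of the current time quantum */
    u32 count;             /* events seen in the current quantum */
};

/* Returns true once more than `limit` events are seen within `period_ns`,
 * e.g. limit = 10, period_ns = 1000000000 for a "10p1s" rule. */
static __always_inline bool rate_exceeded(struct rate_state *rs,
                                          u32 limit, u64 period_ns)
{
    u64 now = bpf_ktime_get_ns();

    if (now - rs->window_start_ns > period_ns) {
        /* New quantum: reset instead of remembering every event (Approach 2). */
        rs->window_start_ns = now;
        rs->count = 0;
    }

    rs->count++;
    return rs->count > limit;
}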

Performance considerations

  1. If an event is not attached by any policy then there should not be any runtime overhead associated with that event handling.

  2. Minimum runtime overhead should be incurred when an event is attached but the process is not of interest: we need to matchProcess and discard the event. This currently results in one map lookup and one tail call before the event is discarded. It may be possible to remove the tail call, but that would add memory overhead, since handleEvent() and matchProcess() would have to be bundled together.

Tasklist

  1. Prototype code: eBPF bytecode to match process wildcard pattern [OPTION1]
  2. Prototype code: Auto generate event-filter bytecode. Merge multiple event-filter bytecodes into single code.
  3. Prototype code: Handling tail call and corresponding argument call. For a detailed tasklist check ref.