Introducing System Call Integrity Layer

Making syscalls explain themselves.

TLDR

I propose a kernel subsystem, the System Call Integrity Layer (SCIL), that lets a user-mode EDR mediate selected syscalls via Alt Syscalls, pausing dispatch and handing a Pending Syscall Object (PSO) to an EDR process for inspection before the syscall continues.

You can find this project on GitHub!

Preface:

This mini-project (albeit not mini in terms of complexity) is a thought experiment and a series of architectural posits which could potentially be implemented by Microsoft or security groups in the future to combat techniques bad actors use to evade detection at runtime.

Some of what I wish to articulate is not possible to build in its entirety as a third-party developer. It is also not possible for me to build this technique on top of both Alt Syscalls and secure enclaves at the same time, as the Alt Syscall technique requires HVCI to be disabled.

What I will build will therefore be constructed in VTL0. The next best option to VTL1 would be ELAM and Protected Process Light, but to save on fiddly work (and to ease debugging) I am going to build a standard user-mode application, since this is a POC. If this were ever to be considered by actual teams, then a better application design would be in order.

I enjoy Windows Internals, and this work is not intended to be a finished solution with well defended trust boundaries, polished communication procedures, or robust threat models. I had an idea of something I thought was cool, and I wanted to try to build a subsystem in the operating system, as though I was one of the hugely talented researchers working at Microsoft. Microsoft larping, if you will! Nevertheless, I do believe there to be the beginnings of an interesting concept for security software to move from the kernel into userland through what I wish to present as the System Call Integrity Layer.

Be bold. Be brave. Experiment.

Introduction

System calls are the backbone of Windows: they are how user and operating system applications interact with the computer’s hardware, or reach into internal structures within the system, in order for the computer to function.

Because system calls broker user applications with the kernel, and Microsoft prevents EDRs from directly hooking the System Service Dispatch Tables (enforced by PatchGuard), EDR vendors decided it was a good idea to hook the .text sections of user-mode applications in order to help them determine what is going on in an application (one of many mechanisms used on modern Windows).

Sadly for the good guys, threat actors and security researchers came along with techniques such as Hell's Gate (and the rest) which are capable of bypassing the hooks EDRs place on the NTDLL stubs that wrap the system call. I have previously proposed a method called Ghost Hunting which uses Alt Syscalls and a few other signals to determine whether a process has attempted to evade the EDR when making a system call.

Whilst this was successful, it relies upon EDRs making use of Alt Syscalls, which I believe is a totally closed source, proprietary, Microsoft-internal system. Hence, it is most likely only a ‘Gedankenexperiment’.

I am personally an advocate for EDRs being in the Kernel as a driver; it makes perfect sense given the level of scrutiny they need to have over a system. Of course, there is always the occasional argument about drivers causing a shutdown of half the planet (ahem), however I do not accept this argument, as you could probably make the same one about any driver or Msft kernel release. It won't be the first driver to cause outages, nor will it likely be the last.

I have been thinking a lot lately about how to take EDRs out of the kernel (for fun), and how to beat threat actors who bypass EDR .text section hooking. At first glance, this may seem like two separate issues, however, I would like to propose a single solution to this.

As I said in the preface, this idea is more of an architectural design for the Windows Operating System, which would have to be implemented by Microsoft, and this isn’t something which can be taken away (I don’t think?) by security vendors now. Nevertheless, I would like to posit this as a foundational concept which other people in the community may take away, think about, and maybe one day - build on!

System Call Integrity Layer

The System Call Integrity Layer (SCIL) is designed to be a subsystem within the Kernel which allows an EDR from Userland to hook System Calls via Alt Syscalls. The EDR can mark which processes are to be hooked, and can designate only particular System Service Numbers to hook.
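To make the marking idea concrete, here is a minimal sketch in Rust of how such a policy could be tracked. Everything in it is hypothetical: the `HookPolicy` type, its method names, and the PID / SSN values are placeholders for illustration, not part of any Windows API, and a real kernel implementation would need proper synchronisation around the table.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical policy table: which System Service Numbers (SSNs)
/// are hooked for which process. Not a real Windows structure.
#[derive(Default)]
struct HookPolicy {
    // pid -> set of hooked SSNs
    marked: HashMap<u32, HashSet<u32>>,
}

impl HookPolicy {
    /// Mark a process and designate the SSNs to hook for it.
    fn mark(&mut self, pid: u32, ssns: &[u32]) {
        self.marked
            .entry(pid)
            .or_default()
            .extend(ssns.iter().copied());
    }

    /// Stop hooking a process entirely. Returns true if it was marked.
    fn unmark_process(&mut self, pid: u32) -> bool {
        self.marked.remove(&pid).is_some()
    }

    /// The check the dispatch path would make: is this (pid, ssn) hooked?
    fn should_hook(&self, pid: u32, ssn: u32) -> bool {
        self.marked
            .get(&pid)
            .map_or(false, |ssns| ssns.contains(&ssn))
    }
}

fn main() {
    let mut policy = HookPolicy::default();
    // Placeholder values: real SSNs differ between Windows builds.
    policy.mark(4242, &[0x26, 0x18]);
    println!("hook pid 4242, ssn 0x26? {}", policy.should_hook(4242, 0x26));
    println!("hook pid 4242, ssn 0x50? {}", policy.should_hook(4242, 0x50));
}
```

Keeping the lookup to a single map and set query matters here, because this check would sit on the syscall dispatch path for every call made by a marked process.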

Architecturally the ideal secure solution to this would look as follows:

System Call Integrity Layer VTL1 Architecture

The SCIL subsystem then has two main functions when it is in motion:

Log system calls and parameters (this is essentially a similar feed to Event Tracing for Windows: Threat Intelligence).
For processes / system calls which require deep inspection:
1. Suspend the system call temporarily via a synchronisation object.
2. Communicate with the userland EDR application (EDR no longer in the kernel for this) notifying of a Pending Syscall Object (PSO). I haven’t yet designed exactly what PSOs will contain / point to.
3. The user-mode EDR application can then do EDR things it would ordinarily do in ntdll etc before allowing a syscall to dispatch.
4. If the EDR ok’s it, signal back to the SCIL subsystem to release the synchronisation object, which allows the syscall to continue dispatching.

The subsystem in practice would also need short-circuits in the event of the EDR user-land handler malfunctioning / taking too long. Any such cases can then still have telemetry ingested by the EDR via the logging function (the first of the two functions above) with the signal emission. This process in practice would look as follows (recall for this I am not using VBS):

System Call Integrity Layer Architecture
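The round trip described above (hand a PSO to the EDR, wait, and short-circuit if the EDR is slow or gone) can be simulated entirely in user mode. The sketch below uses standard library threads and channels as stand-ins for the kernel subsystem and the EDR process. The `PendingSyscall` fields are assumptions, since the PSO contents are not designed yet, and `mediate`, `spawn_edr` and `Outcome` are hypothetical names.

```rust
use std::sync::mpsc::{self, Sender};
use std::thread;
use std::time::Duration;

/// Hypothetical PSO; the real contents are not yet designed.
struct PendingSyscall {
    pid: u32,
    ssn: u32,
    reply: Sender<()>, // the EDR signals here to release the syscall
}

#[derive(Debug, PartialEq)]
enum Outcome {
    ReleasedByEdr,
    ShortCircuited, // EDR too slow or not there: log and carry on
}

/// Stand-in for the SCIL side: hand the PSO to the EDR, then wait.
fn mediate(to_edr: &Sender<PendingSyscall>, pid: u32, ssn: u32, timeout: Duration) -> Outcome {
    let (reply, released) = mpsc::channel();
    if to_edr.send(PendingSyscall { pid, ssn, reply }).is_err() {
        return Outcome::ShortCircuited; // nobody is listening
    }
    match released.recv_timeout(timeout) {
        Ok(()) => Outcome::ReleasedByEdr,
        // Timeout, or the EDR dropped the PSO without replying.
        Err(_) => Outcome::ShortCircuited,
    }
}

/// Stand-in for the user-mode EDR: inspect each PSO, then release it.
fn spawn_edr(inspection_time: Duration) -> Sender<PendingSyscall> {
    let (tx, rx) = mpsc::channel::<PendingSyscall>();
    thread::spawn(move || {
        for pso in rx {
            let _inspected = (pso.pid, pso.ssn); // "EDR things" go here
            thread::sleep(inspection_time);
            let _ = pso.reply.send(()); // ignore errors if the waiter gave up
        }
    });
    tx
}

fn main() {
    let fast = spawn_edr(Duration::from_millis(5));
    println!("{:?}", mediate(&fast, 1234, 0x26, Duration::from_millis(500)));

    let slow = spawn_edr(Duration::from_millis(500));
    println!("{:?}", mediate(&slow, 1234, 0x26, Duration::from_millis(20)));
}
```

In this model a short-circuit is indistinguishable from a release as far as the calling thread is concerned; the difference only shows up in telemetry, which is the behaviour the short-circuit paragraph asks for.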

The recommendation would be for only a certain subset of system calls to be hooked via SCIL, so as not to exhaust the system with constant waiting, context switching etc. Some good candidates could be:

NtOpenProcess
NtAllocateVirtualMemory
NtWriteVirtualMemory
NtContinue
NtCreateThread
NtCreateProcess
NtProtectVirtualMemory
NtMapViewOfSection

The advantage of a security subsystem aimed at EDR vendors / Security Vendors is that Microsoft could provide additional APIs, such as GetThreadContext, which could be used directly in a supported way for the EDR to do things like detecting Vectored Exception Handler abuse.

It would not be recommended to allow syscall hooking on machines which require near-real-time operation (such as SCADA), but you could consider still using the logging feature of the SCIL subsystem and read from that as though it were Event Tracing for Windows.

Suspending System Calls

I think the primary section I need to explore is the syscall dispatch suspension. Luckily, I have already done a good bit of research in tampering with Syscalls, as I outlined in my post regarding my Hells Hollow technique. I would recommend a read through of that to understand exactly where in the dispatch sequence we can inject ourselves. Taking the image I added to that post, we can achieve this in the callback handler box (ignore the stack stuff - as @sixtyvividtails pointed out to me we do not need to do that to read / modify the trap):

Windows 11 KTRAP_FRAME Alt Syscalls Hells Hollow modify return value

So, my train of thinking is we can use KeWaitForSingleObject if we meet the conditions to hook, allow the thread to do other things by making it an alertable wait, then, whichever happens first:

Wait timeout -> Log & continue execution
EDR user-mode process replied and the syscall can continue execution. We will have to track the event synchronisation objects such that when we are ready to release it we get the right one. I built the wdk-mutex crate which will allow a thread-safe way of tracking such structures.
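As a rough user-mode model of the "whichever happens first" logic and of tracking the synchronisation objects, here is a sketch using the Rust standard library. It is not kernel code: `Condvar::wait_timeout_while` stands in for `KeWaitForSingleObject` with a timeout, a plain `Mutex<HashMap>` stands in for whatever wdk-mutex would protect in the driver, and `PendingTable`, `Gate` and `WaitResult` are hypothetical names.

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::{Arc, Condvar, Mutex};
use std::thread;
use std::time::Duration;

/// One gate per suspended syscall: a "released" flag plus a condvar.
type Gate = Arc<(Mutex<bool>, Condvar)>;

#[derive(Debug, PartialEq)]
enum WaitResult {
    Released, // the EDR replied in time
    TimedOut, // log and continue execution
}

#[derive(Default)]
struct PendingTable {
    next_id: AtomicU64,
    gates: Mutex<HashMap<u64, Gate>>,
}

impl PendingTable {
    /// Track a new suspended syscall; the id is what the EDR echoes back.
    fn register(&self) -> (u64, Gate) {
        let id = self.next_id.fetch_add(1, Ordering::Relaxed) + 1;
        let gate: Gate = Arc::new((Mutex::new(false), Condvar::new()));
        self.gates.lock().unwrap().insert(id, Arc::clone(&gate));
        (id, gate)
    }

    /// Release exactly the gate with this id. False if it is unknown.
    fn release(&self, id: u64) -> bool {
        let gate = self.gates.lock().unwrap().remove(&id);
        match gate {
            Some(gate) => {
                *gate.0.lock().unwrap() = true;
                gate.1.notify_all();
                true
            }
            None => false,
        }
    }

    /// Block until released or until the timeout, whichever is first.
    fn wait(&self, id: u64, gate: &Gate, timeout: Duration) -> WaitResult {
        let (flag, cvar) = &**gate;
        let guard = flag.lock().unwrap();
        let (guard, _) = cvar
            .wait_timeout_while(guard, timeout, |released| !*released)
            .unwrap();
        let released = *guard;
        drop(guard);
        self.gates.lock().unwrap().remove(&id); // no stale entry after a timeout
        if released { WaitResult::Released } else { WaitResult::TimedOut }
    }
}

fn main() {
    let table = Arc::new(PendingTable::default());
    let (id, gate) = table.register();
    let waiter = {
        let table = Arc::clone(&table);
        thread::spawn(move || table.wait(id, &gate, Duration::from_secs(2)))
    };
    table.release(id); // the EDR said OK
    println!("{:?}", waiter.join().unwrap());
}
```

Because the flag is checked under the lock, a release that arrives before the thread starts waiting is not lost, and removing the entry on timeout means a late reply from the EDR simply finds nothing to release.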

Alternatively: I could consider a non-alertable wait and hope the scheduler deals with threads being suspended pending the transaction.. I will probably trial this too.

And remember, this wouldn’t be designed for every syscall and every process due to exhaustion bottlenecks & responsiveness issues.

Technical Challenges

Naturally, this proposal does not come without some technical problems. Namely:

A secure threat model and trust boundaries such that adversaries cannot intervene in syscall dispatch from userland (Rootkits always gonna rootkit) or subscribe to the signals (protection similar to that given to ETW:TI would be needed).
Suspending threads mid-syscall feels.. dangerous.
Performance (objectively, this needs measuring. It would be good to test under real-world workloads).
Thread / system resource starvation.
Deadlocks.
‘Time of check to time of use’ (TOCTOU) issues with regard to processes / SSNs being marked.
Fast copies into PSOs (we do not want to waste cycles by copying data into buffers unnecessarily).
Efficient communication and operation when PSOs are raised - does the EDR element need to open processes, inspect memory, etc.? How acceptable is this?
Could these changes affect legacy applications? (my gut tells me no)
Would putting a thread in an alertable state cause recursion issues (given it's smack bang in the middle of a hot syscall dispatch path), and could it introduce other weird side effects / undefined behaviour? What happens if a suspended thread needs to service an APC which itself makes a hooked syscall..?
Can a threat actor abuse this to DoS a system?

I think these are not impossible to solve, but serious profiling and testing is needed. I am only at the beginning of this POC project; I have started the outline of the system and it is now on GitHub.

I’m going to begin development on this once I nail down a few of the more transactional APIs, and I need to do some serious testing of suspending threads (alertable) mid-syscall. I think this surface could be API’ified by Microsoft, made more stable by their kernel engineers than I could ever dream of, and provided to security vendors. I do not work for a security vendor and I don’t know how fully they can integrate into the OS (if anything is shared behind the scenes), but I would love to see an effort in the future into providing more integration into the OS without relying on undocumented APIs. Threat actors aren’t slowing down; if anything, there are more than ever.

Next steps

I am not underestimating the complexity of this. But to break this down:

Define a threat model and trust boundaries. State explicitly how the system design would prevent malware from taking over the subsystem’s intentions.
Start with the logging module - apply this system wide and log all syscalls to a file via the user-mode EDR signal receiver.
Experiment: Hook NtOpenProcess in a test process and test the synchronisation object methodology, having the user-mode process signal back to resume execution.
Trial non-alertable waits.
Define what PSOs should look like.
Properly implement the PSO dispatch routine.
Add callable APIs (would have to be via IOCTLs, or maybe I can make up my own SSNs and use AltSyscalls to dispatch them, that would be rad) that security vendors may be interested in the kernel extending.
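If the callable APIs go down the IOCTL route, each one needs a control code. The sketch below reproduces the bit layout of the Windows `CTL_CODE` macro as a Rust `const fn`. The `IOCTL_SCIL_*` names and the function numbers 0x800 and 0x801 are hypothetical, picked only because function codes from 0x800 upwards are the range left for non-Microsoft drivers.

```rust
// Values from the Windows DDK headers.
const FILE_DEVICE_UNKNOWN: u32 = 0x22;
const METHOD_BUFFERED: u32 = 0;
const FILE_ANY_ACCESS: u32 = 0;

/// Same layout as the CTL_CODE macro:
/// device type in bits 16..31, access in bits 14..15,
/// function in bits 2..13, transfer method in bits 0..1.
const fn ctl_code(device_type: u32, function: u32, method: u32, access: u32) -> u32 {
    (device_type << 16) | (access << 14) | (function << 2) | method
}

// Hypothetical SCIL control codes, for illustration only.
const IOCTL_SCIL_MARK_PROCESS: u32 =
    ctl_code(FILE_DEVICE_UNKNOWN, 0x800, METHOD_BUFFERED, FILE_ANY_ACCESS);
const IOCTL_SCIL_RELEASE_PSO: u32 =
    ctl_code(FILE_DEVICE_UNKNOWN, 0x801, METHOD_BUFFERED, FILE_ANY_ACCESS);

fn main() {
    println!("IOCTL_SCIL_MARK_PROCESS = {:#010x}", IOCTL_SCIL_MARK_PROCESS);
    println!("IOCTL_SCIL_RELEASE_PSO  = {:#010x}", IOCTL_SCIL_RELEASE_PSO);
}
```

A real design would likely want something stricter than `FILE_ANY_ACCESS` on these codes, given the trust boundary concerns raised earlier.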

Until next time, ciao xo!