Improving the Ghost Hunting implementation for flexibility and speed
Making some improvements.
Intro
This is more of a ‘devlog’ type post - if you have a vested interest in this technique, or in Rust generally, then read on! The project can be found over on GitHub. Seeing as ‘Ghost Hunting’ is my own research, and (so far) makes up the core detection logic of the EDR around process injection and obfuscation techniques, I figured this would be worth a short post!
The changes that relate to this post can be found on this commit, with a bugfix commit here and a performance commit here.
I’ll show you how I solved the problem I had, and then how I optimised it for speed!
Problem case
I have previously talked about Ghost Hunting Open Process, and the code worked just fine for that - however, I have now gotten around to implementing Ghost Hunting for the syscall hook around VirtualAllocEx, and the Event Tracing for Windows Threat Intelligence notification for a VirtualAllocEx event.
The problem: my previous code wasn’t flexible enough to allow for any event type other than an OpenProcess!
Another problem: the ETW event doesn’t occur immediately (as the syscall and kernel notifications do), so the Ghost Hunt timers need a little adjustment to account for this delay.
How I first solved this
When an event happens that we think may be malicious, we add it to our Ghost Hunting state so we can detect bad behaviour relating to tampering with hooked syscalls. This was fine when we only had events from the driver and hooked syscalls, but now we have a third source of telemetry: the Windows ETW Threat Intelligence system.
So, instead, I created a new field on the GhostHuntingTimers called cancellable_by. This contains a vector of the events that can cancel the timer. So, say you get:
- A syscall hook notification that API X was called;
- Then the driver tells you a callback was triggered for API X; and
- ETW tells us that PID 1234 made an API call X
If we want to make sure we receive all parts of this, then instead of the above logic we create a vector (the field cancellable_by) of each event type we expect, which will nullify the live timer. As we receive and process events from those sources, we remove them from the vector.
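To make this concrete, here is a minimal sketch of the Vec-based approach. The field name cancellable_by comes from the post; the surrounding type and method names are my own illustration, not the project's exact API:

```rust
use std::time::Instant;

// Sketch only: `EventSource` and `GhostHuntingTimer` are illustrative names.
#[derive(Debug, Clone, Copy, PartialEq)]
enum EventSource {
    Kernel,
    SyscallHook,
    Etw,
}

struct GhostHuntingTimer {
    started: Instant,
    /// Every source listed here must report in before the timer is cancelled.
    cancellable_by: Vec<EventSource>,
}

impl GhostHuntingTimer {
    /// Remove a source from the vector once its notification arrives.
    fn notify(&mut self, source: EventSource) {
        self.cancellable_by.retain(|s| *s != source);
    }

    /// The timer is fully cancelled when no sources remain outstanding.
    fn is_cancelled(&self) -> bool {
        self.cancellable_by.is_empty()
    }
}
```

With this shape, a timer registered as cancellable by the kernel and the syscall hook only clears once both notifications have been processed.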
However, I wasn’t entirely satisfied by this approach.
Performance of the heap approach
Say we have 300 processes constantly calling hooked APIs, opening process handles, allocating memory, and so on. With the above approach we are doing a lot of heap allocations, by virtue of using a Vec. Heap allocations are slow and carry overhead. For an isolated program, or something where a few microseconds isn’t the bottleneck - a web server or GUI, for example - this is fine. However, we are doing systems development here with an EDR, and when it takes malware only a fraction of a second to steal all your data, time matters. Not only that, but slowing down the filtering of all this data may have knock-on effects for overall system performance, because we are intercepting and interacting with live processes. This will be a bigger deal in the future once we start freezing processes based on EDR decision making.
So, I decided to profile the vector allocations. Here is what I got from adding a vector of cancellable events:
- Elapsed: 11.4µs
- Elapsed: 12.3µs
- Elapsed: 3.9µs
- Elapsed: 2.5µs
- Elapsed: 4.1µs
- Elapsed: 2.8µs
- Elapsed: 3.4µs
- Elapsed: 3µs
- Elapsed: 5.4µs
- Elapsed: 2.7µs
- Elapsed: 4.1µs
- Elapsed: 2.1µs
- Elapsed: 5.1µs
- Elapsed: 10.2µs
There’s no rhyme or reason to it - the variations are small at our human macro scale of seconds, but a range of ~10µs is fairly big.
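For reference, numbers like the ones above can be captured with std::time::Instant. This is a hypothetical reproduction of the measurement (the project may time a larger region); each push can trigger a heap (re)allocation because the Vec starts with zero capacity:

```rust
use std::time::Instant;

fn main() {
    // Time a single timer registration using the Vec-based approach.
    let start = Instant::now();
    let mut cancellable_by: Vec<u8> = Vec::new();
    cancellable_by.push(0b0001); // kernel
    cancellable_by.push(0b0010); // syscall hook
    println!("Elapsed: {:.1?}", start.elapsed());
}
```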
The final solution
I then decided that instead of allocating on the heap, I can use bit flags to determine which event sources can cancel each API hook, where we define a number of constants:
/// The event source came from the kernel, intercepted by the driver
pub const EVENT_SOURCE_KERNEL: u8 = 0b0001; // 1
/// The event source came from a syscall hook
pub const EVENT_SOURCE_SYSCALL_HOOK: u8 = 0b0010; // 2
/// The event source came from the PPL Service receiving ETW:TI
pub const EVENT_SOURCE_ETW: u8 = 0b0100; // 4
And we can logically OR them together to build a mask of the event sources that are allowed to cancel the timer out.
For example, this is how it works:
let result = EVENT_SOURCE_KERNEL | EVENT_SOURCE_SYSCALL_HOOK;
// 0011 = 0001 | 0010
let result = EVENT_SOURCE_KERNEL | EVENT_SOURCE_ETW;
// 0101 = 0001 | 0100
When we register a new event that came from an API callback (whether kernel, syscall or ETW), we can XOR the event source to remove it from the set of flags, so that we are never left with a dangling flag. So, for example:
// notification comes in, from the driver (kernel) and for that API we expect a driver notification and
// a syscall notification. Thus, only those can clear the timer together.
let mask = EVENT_SOURCE_KERNEL | EVENT_SOURCE_SYSCALL_HOOK;
// 0011 = 0001 | 0010
// As the notification came from the driver, we need to remove the kernel flag, so we can do that via
// XOR:
let mask = EVENT_SOURCE_KERNEL ^ mask;
// 0010 = 0001 ^ 0011
// Thus we are now only waiting for the notification to come from the hooked syscall, as the remaining mask is the binary 0010 (aka 2u8).
Then, once the notification comes in from our syscall hook, we do the same XOR, but this time with the EVENT_SOURCE_SYSCALL_HOOK u8. Voilà - the mask is now 0 and it can be removed from the timers, meaning everything matched up perfectly.
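Putting the whole flow together, here is a minimal, self-contained sketch. The constants match the ones defined above; the clear_source helper is purely illustrative and not the project's actual function:

```rust
/// The event source came from the kernel, intercepted by the driver
pub const EVENT_SOURCE_KERNEL: u8 = 0b0001;
/// The event source came from a syscall hook
pub const EVENT_SOURCE_SYSCALL_HOOK: u8 = 0b0010;
/// The event source came from the PPL Service receiving ETW:TI
pub const EVENT_SOURCE_ETW: u8 = 0b0100;

/// Clear one source's bit from the mask via XOR. The caller must guarantee
/// the bit is currently set; XOR-ing an unset bit would *set* it instead.
fn clear_source(mask: u8, source: u8) -> u8 {
    mask ^ source
}

fn main() {
    // This API expects both a kernel and a syscall-hook notification.
    let mut mask = EVENT_SOURCE_KERNEL | EVENT_SOURCE_SYSCALL_HOOK; // 0b0011
    mask = clear_source(mask, EVENT_SOURCE_KERNEL);       // 0b0010 remains
    mask = clear_source(mask, EVENT_SOURCE_SYSCALL_HOOK); // 0b0000 remains
    // Mask is zero: every expected notification arrived, drop the timer.
    assert_eq!(mask, 0);
}
```

No Vec, no heap: the whole state is a single u8, which is why the timings below collapse into the nanosecond range.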
Final performance
Measuring the performance again with this approach, we consistently get results of:
- Elapsed: 400ns
- Elapsed: 200ns
- Elapsed: 700ns
- Elapsed: 200ns
- Elapsed: 300ns
- Elapsed: 300ns
Which works out to roughly a 10-60x speed improvement over the heap-allocation approach! Nice!