Improving the Ghost Hunting implementation for flexibility and speed
Making some improvements.
Intro
This is more of a ‘devlog’ type post - if you have a vested interest in this technique, or in Rust generally, then read on! The project can be found over on GitHub. Seeing as ‘Ghost Hunting’ is my own research, and (so far) makes up the core detection logic of the EDR around process injection and obfuscation techniques, I figured this would be worth a short post!
The changes that relate to this post can be found on this commit, with a bugfix commit here and a performance commit here.
I’ll show you how I solved the problem I had, and then how I optimised it for speed!
Problem case
I have previously talked about Ghost Hunting Open Process, and the code worked just fine for that - however, I have now gotten around to implementing Ghost Hunting for the syscall hook around VirtualAllocEx, and the Event Tracing for Windows Threat Intelligence notification for a VirtualAllocEx event.
The problem: my previous code wasn’t flexible enough to allow for any event type other than an OpenProcess!
Another problem: the ETW event doesn’t occur immediately (as the syscall and kernel notifications do), so the Ghost Hunt timers need a little adjustment to account for this delay.
How I first solved this
When an event happens that we think may be malicious, we add it to our Ghost Hunting state so we can detect bad behaviour relating to tampering with hooked syscalls. This was fine when we only had events from the driver and hooked syscalls, but now we have a third source of telemetry: the Windows ETW Threat Intelligence system.
So, instead, I created a new field on the GhostHuntingTimers called cancellable_by. This contains a vector of the events that can cancel the timer. So, say you get:
- A syscall hook notification that API X was called;
- Then the driver tells you a callback was triggered for API X; and
- ETW tells us that PID 1234 made an API call X
If we want to make sure we receive all parts of this, then instead of the above logic we create a vector (the field cancellable_by) of each event type we expect, which will nullify the live timer. As we receive and process events from those sources, we remove them from the vector.
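To make this concrete, here is a minimal sketch of the Vec-based approach. The field name cancellable_by comes from the post; the surrounding type and method names are my own illustration, not the project's exact API:

```rust
use std::time::Instant;

// Sketch only: `EventSource` and `GhostHuntingTimer` are illustrative names.
#[derive(Debug, Clone, Copy, PartialEq)]
enum EventSource {
    Kernel,
    SyscallHook,
    Etw,
}

struct GhostHuntingTimer {
    started: Instant,
    /// Every source listed here must report in before the timer is cancelled.
    cancellable_by: Vec<EventSource>,
}

impl GhostHuntingTimer {
    /// Remove a source from the vector once its notification arrives.
    fn notify(&mut self, source: EventSource) {
        self.cancellable_by.retain(|s| *s != source);
    }

    /// The timer is fully cancelled when no sources remain outstanding.
    fn is_cancelled(&self) -> bool {
        self.cancellable_by.is_empty()
    }
}
```

With this shape, a timer registered as cancellable by the kernel and the syscall hook only clears once both notifications have been processed.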
However, I wasn’t entirely satisfied by this approach.
Performance of the heap approach
Say we have 300 processes constantly calling hooked APIs, opening process handles, allocating memory, and so on. With the above approach we are doing a lot of heap allocations, by virtue of using a Vec. Heap allocations are slow and carry overhead. For an isolated program, or something where a few microseconds isn’t the bottleneck - a web server or GUI, for example - this is fine. However, we are doing systems development here with an EDR, and when it takes malware only a fraction of a second to steal all your data, time matters. Not only that, but slowing down the filtering of all this data may have knock-on effects for overall system performance, because we are intercepting and interacting with live processes. This will be a bigger deal in the future once we start freezing processes based on EDR decision making.
So, I decided to profile the vector allocations. Here is what I got from adding a vector of cancellable events:
- Elapsed: 11.4µs
- Elapsed: 12.3µs
- Elapsed: 3.9µs
- Elapsed: 2.5µs
- Elapsed: 4.1µs
- Elapsed: 2.8µs
- Elapsed: 3.4µs
- Elapsed: 3µs
- Elapsed: 5.4µs
- Elapsed: 2.7µs
- Elapsed: 4.1µs
- Elapsed: 2.1µs
- Elapsed: 5.1µs
- Elapsed: 10.2µs
There’s no rhyme or reason to it - the variations are small at our human macro scale of seconds, but a range of ~10µs is fairly big.
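For reference, numbers like the ones above can be captured with std::time::Instant. This is a hypothetical reproduction of the measurement (the project may time a larger region); each push can trigger a heap (re)allocation because the Vec starts with zero capacity:

```rust
use std::time::Instant;

fn main() {
    // Time a single timer registration using the Vec-based approach.
    let start = Instant::now();
    let mut cancellable_by: Vec<u8> = Vec::new();
    cancellable_by.push(0b0001); // kernel
    cancellable_by.push(0b0010); // syscall hook
    println!("Elapsed: {:.1?}", start.elapsed());
}
```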
The final solution
I then decided that instead of allocating on the heap, I can use bit flags to determine which event sources can cancel each API hook, where we define a number of constants:
/// The event source came from the kernel, intercepted by the driver
pub const EVENT_SOURCE_KERNEL: u8 = 0b0001; // 1
/// The event source came from a syscall hook
pub const EVENT_SOURCE_SYSCALL_HOOK: u8 = 0b0010; // 2
/// The event source came from the PPL Service receiving ETW:TI
pub const EVENT_SOURCE_ETW: u8 = 0b0100; // 4
And we can logically OR them together to build a mask of the event sources that are allowed to cancel the timer out.
For example, this is how it works:
let result = EVENT_SOURCE_KERNEL | EVENT_SOURCE_SYSCALL_HOOK;
// 0011 = 0001 | 0010
let result = EVENT_SOURCE_KERNEL | EVENT_SOURCE_ETW;
// 0101 = 0001 | 0100
When we register a new event that came from an API callback (whether kernel, syscall or ETW), we can XOR the event source to remove it from the set of flags, so that we are never left with a dangling flag. So, for example:
// notification comes in, from the driver (kernel) and for that API we expect a driver notification and
// a syscall notification. Thus, only those can clear the timer together.
let mask = EVENT_SOURCE_KERNEL | EVENT_SOURCE_SYSCALL_HOOK;
// 0011 = 0001 | 0010
// As the notification came from the driver, we need to remove the kernel flag, so we can do that via
// XOR:
let mask = EVENT_SOURCE_KERNEL ^ mask;
// 0010 = 0001 ^ 0011
// Thus we are now only waiting for the notification to come from the hooked syscall, as the remaining mask is the binary 0010 (aka 2u8).
Then, once the notification comes in from our syscall hook, we do the same XOR, but this time with the EVENT_SOURCE_SYSCALL_HOOK u8. Voilà - the mask is now 0 and it can be removed from the timers, meaning everything matched up perfectly.
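Putting the whole flow together, here is a minimal, self-contained sketch. The constants match the ones defined above; the clear_source helper is purely illustrative and not the project's actual function:

```rust
/// The event source came from the kernel, intercepted by the driver
pub const EVENT_SOURCE_KERNEL: u8 = 0b0001;
/// The event source came from a syscall hook
pub const EVENT_SOURCE_SYSCALL_HOOK: u8 = 0b0010;
/// The event source came from the PPL Service receiving ETW:TI
pub const EVENT_SOURCE_ETW: u8 = 0b0100;

/// Clear one source's bit from the mask via XOR. The caller must guarantee
/// the bit is currently set; XOR-ing an unset bit would *set* it instead.
fn clear_source(mask: u8, source: u8) -> u8 {
    mask ^ source
}

fn main() {
    // This API expects both a kernel and a syscall-hook notification.
    let mut mask = EVENT_SOURCE_KERNEL | EVENT_SOURCE_SYSCALL_HOOK; // 0b0011
    mask = clear_source(mask, EVENT_SOURCE_KERNEL);       // 0b0010 remains
    mask = clear_source(mask, EVENT_SOURCE_SYSCALL_HOOK); // 0b0000 remains
    // Mask is zero: every expected notification arrived, drop the timer.
    assert_eq!(mask, 0);
}
```

No Vec, no heap: the whole state is a single u8, which is why the timings below collapse into the nanosecond range.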
Final performance
Measuring the performance again with this approach, we consistently get results of:
- Elapsed: 400ns
- Elapsed: 200ns
- Elapsed: 700ns
- Elapsed: 200ns
- Elapsed: 300ns
- Elapsed: 300ns
Which works out to roughly a 10-60x speed improvement over the heap-allocation approach! Nice!