Improving consistency with EDR DLL Injection via APCs

Injecting your system with a high quality dose of Sanctum EDR!

Intro

You can check this project out on GitHub, and specifically you can see injection.rs which houses the code discussed in this blog post.

Previously I was using standard process injection from the EDR usermode engine to inject the Sanctum DLL into processes. This was fine for your average process, but I soon ran into difficulty injecting the DLL into processes that were protected or that used other Windows protection mechanisms.

My brain turned to doing the injection from the kernel. I started by trying to map the DLL into process memory, which I soon gave up on as it’s a lot of work. I then tried calling LoadLibraryW using an APC (you can find my attempt at implementing this here), but I struggled to get it working. Looking back, I think a small shellcode bootstrap for LoadLibraryW (or A) would work just fine (similar to the below).

After some conversations with @eversinc33, he very kindly advised me to look at bootstrapping LdrLoadDll with some shellcode and setting an APC off on that. He also kindly provided me with his C++ code which implements this, so it was a case of porting it to Rust. This in itself was a small challenge due to some FFI bugs I ran into, but I will talk about them below.

APCs

Put simply, APCs (Asynchronous Procedure Calls) are a Windows mechanism for running code in the context of a specific thread. A usermode APC only runs once its target thread enters an alertable state, for example by performing an alertable wait. This lets a caller queue work which doesn’t need to run immediately, to be processed when the thread is not busy.

In usermode, APCs are fully documented and supported. In the kernel, however, the APC functions are undocumented and not officially supported, meaning Microsoft could change them between kernel releases. That said, judging by some blogs from around 2020, the API seems to have been stable for a while.

APCs are commonly used by malware as a method of process injection, which is one tactic I am looking to detect with this EDR project.

For APCs set up in a driver via KeInitializeApc, the function definition is as follows (Rust):

pub fn KeInitializeApc(
    Apc: PKAPC,
    Thread: PKTHREAD,
    ApcStateIndex: KAPC_ENVIRONMENT,
    KernelRoutine: *const c_void,
    RundownRoutine: *const c_void,
    NormalRoutine: *const c_void,
    ApcMode: KPROCESSOR_MODE,
    NormalContext: PVOID,
);

In order:

Apc: A pointer to the KAPC non-paged pool object which contains information about the APC.
Thread: A pointer to the thread (KTHREAD) which will execute the APC.
ApcStateIndex: This defines what process / thread context to use. If you queue an APC with CurrentApcEnvironment, the APC will execute in whatever process context the thread is currently attached to.
KernelRoutine: A function pointer to a function which runs at APC_LEVEL in the kernel, before the APC is delivered. As this runs before the NormalRoutine, we are able to modify the NormalRoutine pointer.
RundownRoutine: A function pointer to a function which executes if a thread with a queued APC terminates.
NormalRoutine: A function pointer to a function which runs when the APC is delivered (AKA what you want the APC to execute, this may be kernel or user memory).
ApcMode: Determines how the APC is run. KernelMode causes the code to run in kernel mode (full access to memory, etc.) and is delivered near enough immediately, without the thread needing to be alertable. UserMode has the code execute in user mode once the thread becomes alertable. This parameter essentially determines how the APC is scheduled.
NormalContext: A parameter passed as the first argument of the NormalRoutine.
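The point about KernelRoutine being able to modify NormalRoutine follows from the callback receiving pointers to those values rather than the values themselves. Below is a minimal userland sketch of that idea. The type alias mirrors the commonly published, reverse-engineered callback shape and should be treated as an assumption (these APIs are undocumented); cancel_normal_routine and simulate_kernel_routine are hypothetical names used only for illustration.

```rust
use std::ffi::c_void;
use std::ptr::null_mut;

/// Assumed shape of a KernelRoutine callback: every value after the APC is
/// passed by pointer, so the callback can rewrite it before delivery.
pub type KernelRoutine = unsafe extern "C" fn(
    apc: *mut c_void,
    normal_routine: *mut *mut c_void,
    normal_context: *mut *mut c_void,
    system_argument1: *mut *mut c_void,
    system_argument2: *mut *mut c_void,
);

/// Hypothetical kernel routine that cancels the NormalRoutine by nulling it.
pub unsafe extern "C" fn cancel_normal_routine(
    _apc: *mut c_void,
    normal_routine: *mut *mut c_void,
    _normal_context: *mut *mut c_void,
    _system_argument1: *mut *mut c_void,
    _system_argument2: *mut *mut c_void,
) {
    if !normal_routine.is_null() {
        // Writing through the pointer changes what would run afterwards.
        unsafe { *normal_routine = null_mut() };
    }
}

/// Stand-in for the dispatcher: calls the kernel routine with pointers to
/// local copies, then returns whatever NormalRoutine is left afterwards.
pub fn simulate_kernel_routine(
    routine: KernelRoutine,
    initial_normal_routine: *mut c_void,
) -> *mut c_void {
    let mut normal: *mut c_void = initial_normal_routine;
    let mut context: *mut c_void = null_mut();
    let mut arg1: *mut c_void = null_mut();
    let mut arg2: *mut c_void = null_mut();
    unsafe { routine(null_mut(), &mut normal, &mut context, &mut arg1, &mut arg2) };
    normal
}
```

Passing a non-null placeholder address through simulate_kernel_routine with cancel_normal_routine returns null, which is the same mechanism a real kernel routine would use to redirect or cancel the NormalRoutine.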

The methodology here is to allocate memory for some shellcode which calls LdrLoadDll, allocate memory for the args required of LdrLoadDll, and then to use two APCs to run the show.

APC 1: This will be an APC executed in ApcMode: UserMode, which has the start address of the bootstrap shellcode.
APC 2: This will be a KernelMode APC which is delivered near enough immediately. Its kernel routine then causes the thread within the target process to become alertable, thus forcing APC 1 to run.

Visually, it looks as follows:

[Figure: Sanctum EDR Windows driver APC injection in Rust]
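The ordering of the two APCs can be modelled with a toy state machine. This is only an illustration in plain userland Rust and does not reflect how the kernel actually schedules APCs; ToyThread, Step and the method names are hypothetical.

```rust
/// What ran, in order, on the toy thread.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Step {
    KernelApcRan,
    UserApcRan,
}

/// A toy thread with at most one pending APC of each kind.
#[derive(Default)]
pub struct ToyThread {
    user_apc_pending: bool,
    kernel_apc_pending: bool,
    /// Stands in for the state the kernel APC arranges via KeTestAlertThread.
    alerted: bool,
    pub log: Vec<Step>,
}

impl ToyThread {
    pub fn queue_user_apc(&mut self) {
        self.user_apc_pending = true;
    }

    pub fn queue_kernel_apc(&mut self) {
        self.kernel_apc_pending = true;
    }

    /// One delivery pass: the kernel APC runs straight away and marks the
    /// thread as alerted; the user APC only runs once that has happened.
    pub fn run_once(&mut self) {
        if self.kernel_apc_pending {
            self.kernel_apc_pending = false;
            self.log.push(Step::KernelApcRan);
            self.alerted = true;
        }
        if self.user_apc_pending && self.alerted {
            self.user_apc_pending = false;
            self.alerted = false;
            self.log.push(Step::UserApcRan);
        }
    }
}
```

In this model, queuing only the user APC and calling run_once does nothing; once the kernel APC is queued as well, the next pass runs the kernel APC first and the user APC straight after it.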

Shellcode

First we need a function which will allocate some shellcode into the target process, which provides pointers in r8 and r9 for LdrLoadDll.

We can allocate in the process that is starting up by using the special handle value -1, which refers to the current process. Thanks to how the notify routine for image load events works (which is where this code lives), we don’t have to worry about getting process handles from process IDs.

The shellcode itself starts with zeroed out pointer slots for r8, r9, and rax. Rax will be the address of LdrLoadDll from ntdll.dll.

You can find the full code here, but some short relevant snippets (and thanks again to @eversinc33 for providing the shellcode):

let mut shellcode = [
    0x48u8, 0x83, 0xEC, 0x28, // sub rsp, 0x28
    0x48, 0x31, 0xD2, // xor rdx, rdx
    0x48, 0x31, 0xC9, // xor rcx, rcx
    0x49, 0xB8, 0, 0, 0, 0, 0, 0, 0, 0, // mov r8, <imm64> (patched later)
    0x49, 0xB9, 0, 0, 0, 0, 0, 0, 0, 0, // mov r9, <imm64> (patched later)
    0x48, 0xB8, 0, 0, 0, 0, 0, 0, 0, 0, // mov rax, <imm64> (address of LdrLoadDll)
    0xFF, 0xD0, // call rax
    0x48, 0x83, 0xC4, 0x28, // add rsp, 0x28
    0xC3, // ret
];

let mut remote_shellcode_memory = null_mut();
let mut shellcode_size = shellcode.len() as u64;
let cur_proc_handle: HANDLE = (-1isize) as HANDLE;

let status = unsafe {
    ZwAllocateVirtualMemory(
        cur_proc_handle,
        &mut remote_shellcode_memory,
        0,
        &mut shellcode_size,
        MEM_COMMIT,
        PAGE_READWRITE,
    )
};

We can then patch the various UNICODE_STRINGs and pointers into the shellcode array as needed, like so:

const OFF_R8_IMM: usize = 12;
const OFF_R9_IMM: usize = 22;
const OFF_RAX_IMM: usize = 32;
const PTR_WIDTH: usize = size_of::<usize>();

shellcode[OFF_R8_IMM..OFF_R8_IMM + PTR_WIDTH].copy_from_slice(&val_r8.to_le_bytes());
shellcode[OFF_R9_IMM..OFF_R9_IMM + PTR_WIDTH].copy_from_slice(&val_r9.to_le_bytes());
shellcode[OFF_RAX_IMM..OFF_RAX_IMM + PTR_WIDTH].copy_from_slice(&val_rax.to_le_bytes());
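As a self-contained sketch of the patching step, the snippet below wraps the same offsets in a function and checks that each slot really does follow the opcode bytes it is expected to. The names patch_shellcode, read_u64 and TEMPLATE are hypothetical and not taken from the project; the byte template is the one shown earlier, and the pointer width is assumed to be 8 bytes (x64 only).

```rust
const OFF_R8_IMM: usize = 12;
const OFF_R9_IMM: usize = 22;
const OFF_RAX_IMM: usize = 32;
const PTR_WIDTH: usize = 8; // assumption: x64 target

pub const SHELLCODE_LEN: usize = 47;

/// The unpatched bootstrap, with zeroed 8-byte immediates.
pub const TEMPLATE: [u8; SHELLCODE_LEN] = [
    0x48, 0x83, 0xEC, 0x28, // sub rsp, 0x28
    0x48, 0x31, 0xD2, // xor rdx, rdx
    0x48, 0x31, 0xC9, // xor rcx, rcx
    0x49, 0xB8, 0, 0, 0, 0, 0, 0, 0, 0, // mov r8, imm64
    0x49, 0xB9, 0, 0, 0, 0, 0, 0, 0, 0, // mov r9, imm64
    0x48, 0xB8, 0, 0, 0, 0, 0, 0, 0, 0, // mov rax, imm64
    0xFF, 0xD0, // call rax
    0x48, 0x83, 0xC4, 0x28, // add rsp, 0x28
    0xC3, // ret
];

/// Returns a copy of the template with the three immediates filled in.
pub fn patch_shellcode(val_r8: u64, val_r9: u64, val_rax: u64) -> [u8; SHELLCODE_LEN] {
    let mut sc = TEMPLATE;
    for &(off, opcode, val) in &[
        (OFF_R8_IMM, [0x49u8, 0xB8], val_r8),
        (OFF_R9_IMM, [0x49u8, 0xB9], val_r9),
        (OFF_RAX_IMM, [0x48u8, 0xB8], val_rax),
    ] {
        // Guard against the offsets drifting if the template is ever edited.
        assert_eq!(sc[off - 2..off], opcode);
        sc[off..off + PTR_WIDTH].copy_from_slice(&val.to_le_bytes());
    }
    sc
}

/// Reads a little-endian u64 back out of the shellcode at `off`.
pub fn read_u64(sc: &[u8], off: usize) -> u64 {
    let mut bytes = [0u8; 8];
    bytes.copy_from_slice(&sc[off..off + 8]);
    u64::from_le_bytes(bytes)
}
```

Asserting on the opcode bytes just before each slot is a cheap way to catch an off-by-one in the offsets before the bytes ever reach a target process.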

And finally, copy the shellcode into the allocated region within the process’s virtual address space:

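unsafe {
    // Copy the patched shellcode into the region allocated in the target process.
    // Use the array length rather than shellcode_size: ZwAllocateVirtualMemory
    // rounds that value up to a whole page, which is larger than the array.
    RtlCopyMemoryNonTemporal(
        remote_shellcode_memory,
        shellcode.as_ptr() as *const _,
        shellcode.len() as u64,
    );
}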

The same needs to happen with the various string & buffer allocations required for registers r8 and r9.
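For the string side of this, one possible layout is a single blob holding a UNICODE_STRING header followed directly by its UTF-16 buffer, so that one allocation and one copy are enough. The sketch below builds such a blob in userland. It assumes the x64 UNICODE_STRING layout (two u16 byte counts, 4 bytes of padding, then an 8-byte Buffer pointer); build_unicode_string_blob and remote_base are hypothetical names, and this is not necessarily how the project lays out its allocations.

```rust
/// Size of a UNICODE_STRING on x64: Length (u16), MaximumLength (u16),
/// 4 bytes of padding, then the 8-byte Buffer pointer.
pub const UNICODE_STRING_HEADER_SIZE: usize = 16;

/// Builds header + UTF-16 data as one contiguous blob. `remote_base` is the
/// address the blob will live at in the target process, so the Buffer field
/// can point just past the header. Returns None if the path is too long for
/// the u16 length fields or the address would overflow.
pub fn build_unicode_string_blob(remote_base: u64, path: &str) -> Option<Vec<u8>> {
    let wide: Vec<u16> = path.encode_utf16().collect();

    // Length is in bytes and excludes the terminator; MaximumLength includes it.
    let byte_len = wide.len().checked_mul(2)?;
    let max_len = byte_len.checked_add(2)?;
    if max_len > u16::MAX as usize {
        return None;
    }
    let buffer_addr = remote_base.checked_add(UNICODE_STRING_HEADER_SIZE as u64)?;

    let mut blob = Vec::with_capacity(UNICODE_STRING_HEADER_SIZE + max_len);
    blob.extend_from_slice(&(byte_len as u16).to_le_bytes());
    blob.extend_from_slice(&(max_len as u16).to_le_bytes());
    blob.extend_from_slice(&[0u8; 4]); // alignment padding before the pointer
    blob.extend_from_slice(&buffer_addr.to_le_bytes());
    for unit in wide {
        blob.extend_from_slice(&unit.to_le_bytes());
    }
    blob.extend_from_slice(&[0u8, 0u8]); // UTF-16 null terminator
    Some(blob)
}
```

With this layout, the value patched into r8 would simply be remote_base, since the header sits at the start of the blob.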

Setting up the APCs

Finally, we need to set up the APCs such that we queue a usermode APC on the executing thread to run the shellcode that loads in Sanctum’s DLL, and then trigger the KernelMode APC which uses the undocumented function KeTestAlertThread to cause the thread to become alertable.

As an interesting note, originally the DLL would suspend all threads except its own, patch NTDLL and perform other tasks, then resume the process threads. After changing to this methodology, the DLL receives STATUS_ACCESS_DENIED when trying to get a handle to thread(s) in the process. I am not 100% sure why, and this is where my internals knowledge gets murky. I will probably spend some time looking at this closer, as I am super interested in the internals of early process initialisation.

Before I begin, I did battle for about a day with a few blue screens and access violations thanks to some janky FFI on my part. As we have to link at compile time against some functions which are not readily available in the Rust WDK, we have to provide our own definitions.

The definition I was using for KeInitializeApc:

KernelRoutine: Option,
RundownRoutine: Option,
NormalRoutine: Option<*const c_void>,

I took this definition from some other GitHub project (I have no idea which). On inspecting memory around the access violations, I found that if NormalRoutine was set to None, the violation would occur, because Rust was aliasing this to 1. Clearly, this is not null, and not a valid memory address or machine code. I changed the definition to:

KernelRoutine: *const c_void,
RundownRoutine: *const c_void,
NormalRoutine: *const c_void,

And instead of setting None, simply make it null_mut().

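My understanding of Option<T> was that None would be represented by 0, and Some(T) would be represented by a pointer to the T in ordinary Rust. That layout is only guaranteed for types which can never be null, such as references and function pointers. A raw pointer like *const c_void can itself be null, so Option<*const c_void> needs a separate discriminant and is not a plain pointer at the ABI level. I only caught this error because of a Rust lint (I love cargo so much) explaining that using Option over an FFI function pointer can produce undefined behaviour. I don’t know how ‘1’ made its way into that address, but here is a screenshot I took after discovering the bug: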

1. Is the address of the KAPC object.
2. Is an address in NormalRoutine which I expected to be null.
3. Is what is located within that address, clearly not valid machine code!

[Figure: screenshot of the FFI bug in the Rust Windows driver]
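The layout difference behind this kind of bug can be checked in ordinary userland Rust. The sketch below compares an Option around a function pointer with an Option around a raw pointer. Callback is a hypothetical alias, and the raw pointer result reflects how current compilers lay the type out rather than a documented guarantee.

```rust
use std::ffi::c_void;
use std::mem::size_of;

/// Hypothetical callback type, used only to have a concrete fn pointer.
pub type Callback = unsafe extern "C" fn(*mut c_void);

/// Function pointers can never be null, so None reuses the null bit pattern
/// and the Option stays pointer-sized.
pub fn option_fn_ptr_is_pointer_sized() -> bool {
    size_of::<Option<Callback>>() == size_of::<usize>()
}

/// Raw pointers may be null, so None needs its own discriminant and the
/// Option ends up larger than a pointer.
pub fn option_raw_ptr_is_pointer_sized() -> bool {
    size_of::<Option<*const c_void>>() == size_of::<usize>()
}

/// Shows that None for an Option of fn pointer really is all zero bits.
pub fn none_callback_bits() -> usize {
    let none: Option<Callback> = None;
    // Sound because both types have the same size and None is the null value.
    unsafe { std::mem::transmute::<Option<Callback>, usize>(none) }
}
```

So an Option around a typed function pointer would be a reasonable FFI parameter, whereas an Option around *const c_void is not, and plain *const c_void with null is the simpler choice.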

With that fixed, we can go ahead and create the two APCs, their callback routines, and let them go! One important note is that the NormalRoutine of the UserMode APC is the address of the shellcode allocation in its Virtual Address space.

Again, rather than show you the full code (see above link to the code on GitHub), here is the snippet for the UserMode APC that runs the shellcode:

let apc = unsafe {
    ExAllocatePool2(
        POOL_FLAG_NON_PAGED,
        size_of::<KAPC>() as u64,
        u32::from_le_bytes(*b"sanc"),
    )
} as *mut KAPC;

unsafe {
    KeInitializeApc(
        &mut *apc,
        thread,
        crate::ffi::_KAPC_ENVIRONMENT::OriginalApcEnvironment,
        apc_callback_inject_sanctum as *const c_void,
        rundown as *const c_void,
        shellcode_addr,
        UserMode as i8,
        null_mut(),
    );
}

let status =
    unsafe { KeInsertQueueApc(&mut *apc, null_mut(), null_mut(), IO_NO_INCREMENT as _) };
if !nt_success(status as _) {
    bail!("Failed to insert APC for shellcode execution. Code: {status:#X}");
}

Here is the KernelMode APC that makes the thread alertable:


{
    // ...
    let kapc = unsafe {
        ExAllocatePool2(
            POOL_FLAG_NON_PAGED,
            size_of::<KAPC>() as u64,
            u32::from_le_bytes(*b"sanc"),
        )
    } as *mut KAPC;

    unsafe {
        KeInitializeApc(
            &mut *kapc,
            thread,
            crate::ffi::_KAPC_ENVIRONMENT::OriginalApcEnvironment,
            kernel_prepare_inject_apc as *const c_void,
            rundown as *const c_void,
            null_mut(),
            KernelMode as i8,
            null_mut(),
        );
    }

    let status =
        unsafe { KeInsertQueueApc(&mut *kapc, null_mut(), null_mut(), IO_NO_INCREMENT as _) };
    if !nt_success(status as _) {
        bail!("Failed to insert KAPC for shellcode execution. Code: {status:#X}");
    }
}

unsafe extern "C" fn kernel_prepare_inject_apc(
    apc: PRKAPC,
    _normal_routine: PKNORMAL_ROUTINE,
    _normal_context: *mut PVOID,
    _system_arg_1: *mut PVOID,
    _system_arg_2: *mut PVOID,
) {
    unsafe { KeTestAlertThread(UserMode as i8) };
    unsafe { rundown(apc) };
}

Interestingly, when you look at the return value after LdrLoadDll has completed, the NTSTATUS is not 0 (success); it is 0xC00000E5, which is STATUS_INTERNAL_ERROR. I did notice when adding an int3 instruction into the shellcode that it was triggering twice, and on the second run it was successful. I strongly suspect some loader lock / loader internals are to blame here, and it might be something I need to look back on (more from a learning point of view, as practically, it works and is controlled).

Ensuring the process does not continue without the DLL

Finally - how do we ensure the process is only allowed to run if Sanctum DLL is injected and has set up hooks in ntdll.dll?

Simple! First, when kernelbase.dll is mapped into the newly spawned process (we use the kernel callback to detect this), we block until we receive an IOCTL from the usermode EDR engine that the Sanctum DLL has done its thing. Once we receive that, we can release the process from the block and allow the image mapping to continue.

I have never reached a print to say the process is blocking, as it all happens that fast, so there is no realistic negative impact on performance from doing this.
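The gating logic can be sketched in userland with a channel standing in for the IOCTL. This is only an analogy: a driver would use kernel synchronisation primitives rather than std, the timeout is an added assumption which the approach above does not mention, and GateResult and wait_for_dll_ready are hypothetical names.

```rust
use std::sync::mpsc::Receiver;
use std::time::Duration;

/// Outcome of waiting for the "DLL is ready" signal.
#[derive(Debug, PartialEq, Eq)]
pub enum GateResult {
    /// The signal arrived, so the process may continue.
    Released,
    /// No signal arrived in time (or the sender went away).
    TimedOut,
}

/// Blocks until a ready signal is received or the timeout expires.
/// The receiver plays the role of the IOCTL from the usermode engine.
pub fn wait_for_dll_ready(ready: &Receiver<()>, timeout: Duration) -> GateResult {
    match ready.recv_timeout(timeout) {
        Ok(()) => GateResult::Released,
        Err(_) => GateResult::TimedOut,
    }
}
```

Bounding the wait is a design choice that keeps a process from hanging forever if the signal never arrives. Whether to let the process continue or terminate it at that point is a policy decision.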

Closing

I hope you enjoyed this post; the intricacies of process mapping, APCs and the kernel are always super interesting. There are a few questions, outlined above, that I need to brush up on. If anybody has any suggestions, corrections, or hints, I am all ears!

There are some really interesting side effects of this process documented in Dennis Babkin’s blog in the references below which I need to explore, particularly around having a module live in an early process environment before kernel32.dll is mapped into the process.

Ciao xo