VX Heaven

Library Collection Sources Engines Constructors Simulators Utilities Links Forum

Chomsky Hierarchy and the Word Problem in Code Mutation

October 2008

[Back to index] [Comments]
  1. Introduction
  2. Definitions
    1. Formal Grammar and connection to code mutation
    2. Word Problem and connection to virus detection
  3. Chomsky Hierarchy
  4. Chomsky Hierarchy in code mutation
    1. Type-3 Grammar in code mutation and MetaPHOR
    2. Type-2 Grammar as recursive mutation
    3. Type-1 Grammar as macro function mutation
    4. Type-0 Grammar as information annihilation
  5. Formal Grammar of MetaPHOR
  6. Conclusion
  7. References

1) Inroduction

Code Mutation (whether it is obfuscation of a decryption engine or full metamorphism) is a technique to prevent the detection of the computer virus using analytic scan algorithm. Formal grammar gives us the basis for a theoretical analysis of code mutation; therefore we must not neglect or ignore it.

In 1999, Qozah has written an article called "POLYMORPHISM AND GRAMMARS" [1], analysing polymorphism (exaclty: a garbage generator) for the first time with this theoretical tool. Even his results for non-regular grammars are not correct, his article opened the door for a new way analysing virus technologies. In 2007, Eric Filiol wrote an article called "Metamorphism, Formal grammars and Undecidable Code Mutation" [2], dealing with the word problem and a description of a extended MetaPHOR mutation engine using a "Tzeitzin semi-Thue system" (a Type-0 Grammar).

In this article you will first read shourtly about the formal grammar and word problem including the reason why this is important for virus authors. (Chapter 2)

Then you will learn about the hierarchy of classes of formal grammars, the Chomsky Hierarchy and about its connection to code mutation including an analyse of the formal grammar of The Mental Driller's MetaPHOR mutation engine. Especially for type-1 grammars, you will see a surprising result. (Chapter 3-4)

Later you will see the full formal grammar of MetaPHOR;a formalisation of all possible mutations rules. (Chapter 5)

Before reading this article, you should read and understand the idea of formal language and grammar - for example read Qozah's article.

2) Definitions

2.1) Formal Grammar and connection to code mutation

Definition 1:

A formal Grammar G is a 4-tuple G = (N, T, S, R) where:

Definition 2:

A word is a finite sequence of characters of an alphabeth T.

Definition 3.1:

A universal language UL over an alphabeth T is a subset T* of all words over that alphabeth.

Definition 3.2:

A formal language L is a subset of UL, contains all words generated by a formal grammar G := (N, T, S, R)

Formal language in code mutation:
Contains all functions of the computer virus.
Word in code mutation:
A word is a finite sequence of characters of the x86 instruction set (any possible function generatable with the x86 instructions).

If the word is an element of L: It has a specific genotype (behaviour), but no specific phenotype (kernel code); it is a specific micro-function.

Non-terminal Symbol in code mutation:
Function, which return (recursively) a specific terminal symbol depending on its behaviour information. Can not be replaced by rewriting system alone. ("a::= ..." does not exist; but "ab::=..." does)
Alphabeth of terminal symbols
x86 instructions set
Rewriting system R in code mutation:
The rules of mutation; how the mutation engine can change specific code or the decryption engine.

Standard Example:

	G = (N, T, S, R)
	N = {A, B}
	T = {a, b}
	R = {
		S ::= bS | aA
		A ::= bS | aB | a
		B ::= bB | aB | a | b


S => bS => bbS => bbaA => bbaaB => bbaabB ==> bbaaba

Polymorphical Example:

	G = (N, T, S, R)
	N = {A, B}
	T = {a, b, c, x, y}
		a = "nop"
		b = "sub edx, 0"
		c = "push ebx"+"pop ebx"
		x = "XOR [EDI], AL"
		y = "INC AL"
		e = ""
	R = {
		S::=aS | bS | cS | xA
		A::=aA | bA | cA | yB
		B::=aB | bB | cB | e


S => aS => acS => acxA => acxcA => acxcyB => acxcybB => acxcybe
        push    ebx
        pop     ebx
        XOR     [EDI], AL
        push    ebx
        pop     ebx
        INC     AL
        sub     edx, 0

Metamorphical Example:

	G = (N, T, S, R)
	N = {A, B, C}
	T = {a, b, c, d, e, f}
		a = "mov reg, 0"
		b = "pop reg"
		c = "push 0"
		d = "mov reg2, 0"+"push reg2"
		e = "xor reg, reg"
		f = "sub reg, reg"
	R = {
		S ::= A | Bb | C | D
		A ::= a
		B ::= c | d
		C ::= e
		D ::= f


S => Bb => db
        mov     reg2, 0
        push    reg2
        pop     reg

The polymorphical example have been already used in [1]. The metamorphic example is very trivial, but even very complex viruses as MetaPHOR use grammars like this (See Chapter 5).

2.2) Word Problem and connection to virus detection

The word problem can be formulate in mathematics with group theory, but we will use the formulation of computability theory.

The problem (a decidion problem) is to construct an algorithm to decide for a given word if it belongs to a formal language or not.

In connection with computer virus detection that means: An anti virus program scans a file, and has to decide whether a file is infected or not. Or: It has to decide wether a specific code of the file is part of a viruscode.

OR: It has to decide wether a specific word from universal language UL belongs to a formal language L generated by a formal grammar G := (N, T, S, R), which represents a polymorphic or metamorphic engine.


	G = ({A, B}, {a, b, c}, S, T)
	T = {
		S ::= aA | bA | c
		A ::= aA | B
		B ::= bB | c

Is "aaabbbc" part of L? Yes, because:

S => aA => aaA => aaaA => aaaB => aaabB => aaabbB => aaabbbB => aaabbbc

Is "aabbca" part of L? No, because:

L={ (a|b)(a)*(b)*c, c }
, and the given word has no "c" at the end.

3) Chomsky Hierarchy

The Chomsky Hierarchy is a containment hierarchy of classes of formal grammars.[3] All possible formal grammars are divided in Type-0 ... Type-3 grammars, where Type-3 is subset of Type-2 (is subset of Type-1 (is subset of Type-0)).

Type 0: Recursively enumerable

Restriction: no restrictions


	S ::= aBBA
	AB ::= BA | Sc | C
	cBc ::= aaa | ab | c
	A ::= CBC
	B ::= cBca | cb | aB
	C ::= a | c


S => aBBA => acBcacbCBC => aabacbcBc => aabacbccbc
Type 1: Context-sensitive

Restriction: For all (w1 => w2): |w1|<=|w2|


	S ::= aBBA
	AB ::= BA | Sc | C
	cBc ::= aaa
	A ::= CBC
	B ::= cBca | cb | aB
	C ::= a | c


S => aBBA => acBcacbCBC => aaaaacbccbc
Type 2: Context-free

Restriction: For all (w1 => w2): w1 is an element of non-terminal symbols


	S ::= aBBA
	A ::= CBC
	B ::= cBca | cb | aB
	C ::= a | c


S => aBBA => acbcbCBC => acbcbacba
Type 3: Regular

Restriction: R = { A = aB, A = Ba, A = a }


	S ::= aS | aA
	A ::= Bc
	B ::= Bc | c


S => aS => aaA => aaBc => aaBcc => aaccc

4) Chomsky Hierarchy in code mutation

For an anti virus programm it is important that scanning a file does not monopolize CPU-speed for long time. We can describe the time usage of an algorithm with Landau notation O(n), where n is the amount of data or number of mutation.

A simple sort algorithm has a O(n) performance: For a list of 20 entries it needs t=c*20; for a list of 2.000 entries it needs t=c*2.000.

A usual algorithm for multiplication of a N x N matrix with a N Vector needs N(N-1)=N^2-N operations. The performance is O(n^2)

A usual algorithm for multiplication of a N x N matrix with a N x N matrix needs N(N(N-1) operations. The performance is O(n^3).

About the complexity of an algorithm, which solves the word problem for Chomsky Hierarchy:


Here we use an asymtotically description of the given problem:

lim n -> infinity

It's obvious that in practise this is not possible. Nevertheless a virus can reach very high n, i. e. with deep (infinity) recursive mutation. Virus Scanners have to detect all variants of a virus, so that asymtotically description can be used.

4.1) Type-3 Grammar in code mutation and MetaPHOR

This is the lowest type of possible code mutation, but unfortunatly used in most polymorphic or metamorphic viruses. In Chapter 5 you can see that most parts of MetaPHOR's grammar is Type-3.

For example:

	N02 ::= t003 | t004
	N32 ::= t078 | M21N32 => N32 ::= t078 | t079N32
	M43 ::= t125 | t126

A big disadvanage is, that the shrinker of MetaPHOR is the inverse of expander (obfuscator). "The expander is the part that undoes what the shrinker did. It performs exactly the reverse operation." [4]. This result presents the determinism in the engine. Code mutation engines have to be created in a way, that linear time complexity will be prevented.

4.2) Type-2 Grammar as recursive mutation

When this type of grammar is used in a metamorphic virus, the operations for deciding the word problem increases at least cubically. As (rarly) done in MetaPHOR, creating a Type-2 grammar requires good planning. The rules have to be recursively defined.

As example, the rewriting rule N24 has a recursion level > 5, maybe some rules have even higher recursive level. Because of that, Mental Driller has put a recursivity control for preventing the code to grow too much.

Such infitive recursive rules are our goal, for increasing the complexity of the word problem.

4.3) Type-1 Grammar as macro function mutation

Using this formal grammar in a virus, you can mutate macro functions, not just the micro-functions aka. x86 instructions.

For example:

	A ::= "PUSH [hWnd]" | ...
	B ::= "PUSH lpString" | ...
	C ::= "PUSH 0xFF" | ...
	D ::= "CALL GetWindowText | ...

	E ::= "PUSH [hWnd]" | ...
	F ::= "PUSH WM_GETTEXT" | ...
	G ::= "PUSH 0xFF" | ...
	H ::= "PUSH lpString" | ...
	I ::= "CALL SendMessage" | ...

Beside of the extremly increased complexity of the virus, the word problem will become not decideable in available time.

Combining this macro-function mutation with code integration to the host file, the virus can use parts of the host code and parts or its own code for mutation.

4.4) Type-0 Grammar as information annihilation

When you use shrinking-rules, a Type-0 Grammar may be created. Using the inverse rules of the obfuscator, these rules have no effect. For a valueable usage of deletion rules, the existence of big words is a condition, otherwise the virus code may loose information.

5) Formal Grammar of MetaPHOR

	G = (N, T, S, R)

	N = { N01, ..., N47, M01, ..., M49 }
	T = { code of MetaPHOR }
	R = {
		S ::= [ T, such that the hardcoded X86 instructions t000-t135
			are replaced by the behaviour functions N01..M49 ]

		N01 ::= t001 | t002 | M24N36
		N02 ::= t003 | t004
		N03 ::= t000 | t005 | N10 | t007 | t008 | t009 | t010 | t011
			t028 | M03M06 | M06M03 | t090 | t091 | M34M35
		N04 ::= t012 | t013
		N05 ::= t014 | t015
		N06 ::= t016 | N09 | t018 | t019 | t020
		N07 ::= t021 | t022 | t023
		N08 ::= t024 | t025 | t026 | t027
		N09 ::= t029 | t030 | M01M02
		N10 ::= t031 | t032
		N11 ::= t033 | t034
		N12 ::= t035 | t036
		N13 ::= t037 | t038
		N14 ::= t041 | M01M03
		N15 ::= t043 | t033 | M05M04 | N16M09
		N16 ::= t046 | M05M03
		N17 ::= t047 | M06M02
		N18 ::= t049 | M07M03
		N19 ::= t053 | N14N20 | N16M12N17 | N14M31N17
		N20 ::= t054 | N18M15
		N21 ::= t055 | N09N12 | N15N10
		N22 ::= t056 | N15M10
		N23 ::= t058 | N10N12 | N12N10
		N24 ::= t059 | N19M11 | M12M13
		N25 ::= t063 | N21M10
		N26 ::= t065 | M14N12
		N27 ::= t066 | N18M49
		N28 ::= t069 | N18M16
		N29 ::= t071 | N16M17
		N30 ::= t073 | N16M18 | M05N38
		N31 ::= t077 | N16N18
		N32 ::= t078 | M21N32
		N33 ::= t080 | M22M24 | M34M36
		N34 ::= t083 | N01M24 | N10
		N35 ::= t085 | N02M25
		N36 ::= t087 | M24N01 | N10
		N37 ::= t088 | M25M26
		N38 ::= t092 | M03M18
		N39 ::= t093 | N06M27 | M28M29
		N40 ::= t097 | N16M30N17
		N41 ::= t105 | t106 | N14M37 | N16M40
		N42 ::= t109 | N16M38
		N43 ::= t112 | N16M39
		N44 ::= t121 | N18M42
		N45 ::= t124 | N18M46
		N46 ::= t134 | M46M47M48
		N47 ::= t135 | M48M47M46 
		M01 ::= t039 | N14M06
		M02 ::= t040 | M03N17
		M03 ::= t042 | M08N18
		M04 ::= t044
		M05 ::= t045 | N16M06
		M06 ::= t048
		M07 ::= t050
		M08 ::= t051
		M09 ::= t052
		M10 ::= t057
		M11 ::= t060
		M12 ::= t061 | N18M33N18
		M13 ::= t062
		M14 ::= t067
		M15 ::= t068
		M16 ::= t070
		M17 ::= t072 | N18M19 | N17N26
		M18 ::= t074 | N18M20
		M19 ::= t075
		M20 ::= t076
		M21 ::= t079
		M22 ::= t081
		M24 ::= t084 | N01N34
		M25 ::= t086 | N02N35
		M26 ::= t089 | M25N37
		M27 ::= t094
		M28 ::= t095
		M29 ::= t096
		M30 ::= t098
		M31 ::= t099 | N18M32N18
		M32 ::= t100
		M33 ::= t101
		M34 ::= t102
		M35 ::= t103
		M36 ::= t104
		M37 ::= t107 | t108 | N18M41
		M38 ::= t110 | t111
		M39 ::= t113 | t114
		M40 ::= t115 | t116 | t117 | t118 | N18M44 | N18M45
		M41 ::= t119 | t120
		M42 ::= t122 | t123
		M43 ::= t125 | t126
		M44 ::= t127 | t128
		M45 ::= t129 | t130
		M46 ::= t131
		M47 ::= t132
		M48 ::= t133
		M49 ::= t064

	t001="XOR Reg,-1"
	t002="NOT Reg"
	t003="XOR Mem,-1"
	t004="NOT Mem"
	t005="MOV Reg,Reg"
	t007="ADD Mem,0"
	t008="OR Reg,0"
	t009="OR Mem,0"
	t010="AND Reg,-1"
	t011="AND Mem,-1"
	t012="SUB Reg,Imm"
	t013="ADD Reg,-Imm"
	t014="SUB Mem,Imm
	t015="ADD Mem,-Imm
	t016="XOR Reg,0"
	t018="AND Reg,0"
	t019="XOR Reg,Reg"
	t020="SUB Reg,Reg"
	t021="XOR Mem,0"
	t022="MOV Mem,0"
	t023="AND Mem,0"
	t024="CMP Reg,0"
	t025="OR Reg,Reg"
	t026="AND Reg,Reg"
	t027="TEST Reg,Reg"
	t028="MOV Mem,Mem"
	t029="MOV Reg,Imm"
	t030="LEA Reg,[Imm]"
	t031="ADD Reg,Imm"
	t032="LEA Reg,[Reg+Imm]"
	t033="LEA Reg2,[Reg]"
	t035="LEA Reg,[Reg+Reg2]"
	t036="ADD Reg,Reg2"
	t037="LEA Reg,[Reg2+Reg2+xxx]"
	t038="LEA Reg,[2*Reg2+xxx]"
	t039="PUSH Imm"
	t040="POP Reg"
	t041="MOV Mem,Imm"
	t042="POP Mem"
	t043="MOV Reg2,Reg"
	t044="POP Reg2"
	t045="PUSH Reg"
	t046="MOV Mem,Reg"
	t047="MOV Reg,Mem"
	t048="PUSH Mem"
	t049="MOV Mem2,Mem"
	t050="PUSH Mem2"
	t051="POP Mem2"
	t052="MOV Reg2,Mem
	t053="OP Reg,Imm"
	t054="OP Reg,Mem"
	t055="LEA Reg,[Reg2+Imm]"
	t056="LEA Reg,[Reg2+Reg3]"
	t057="ADD Reg,Reg3"
	t058="LEA Reg,[Reg+Reg2+Imm]"
	t059="OP Reg,(Imm OP Imm2)"
	t060="OP Reg,Imm2"
	t061="OP Mem,Imm"
	t062="OP Mem,Imm2"
	t063="LEA Reg,[Reg2+Reg3+Imm]"
	t064="LEA Reg,[(RegX+)Reg2+Imm]"
	t065="LEA Reg,[(RegX+)2*Reg2+Imm]"
	t066="MOV Mem3,Mem"
	t067="MOV Mem3,Mem2"
	t068="OP Reg,Mem2"
	t069="MOV Mem,xxx"
	t070="MOV Mem2,xxx"
	t071="CALL Reg"
	t072="CALL Mem"
	t073="JMP Reg"
	t074="JMP Mem"
	t075="CALL Mem2"
	t076="JMP Mem2"
	t077="MOV Mem2,Reg"
	t078="MOV Reg,yyy"
	t079="OP Reg,xxx"
	t080="JMP @xxx"
	t081="Jcc @xxx"
	t083="ADD Reg,1"
	t084="NEG Reg"
	t085="ADD Mem,1"
	t086="NEG Mem"
	t087="ADD Reg,-1"
	t088="ADD Mem,-1"
	t089="NOT Mem"
	t090="CMP X,Y " - without Jcc
	t091="TEST X,Y" - without Jcc
	t093="MOVZX Reg,byte ptr [Mem]"
	t094="MOV Reg8,[Mem]"
	t095="MOV Reg,[Mem]"
	t096="AND Reg,0FFh"
	t097="OP Reg,Reg2"
	t098="OP Mem,Reg2"
	t099="OP Mem,Reg"
	t100="OP Mem2,Reg"
	t101="OP Mem2,Imm"
	t102="CMP Reg,Reg"
	t103="JO/JB/JNZ/JA/JS/JNP/JL/JG @xxx" - without Jcc
	t104="JNO/JAE/JZ/JBE/JNS/JP/JGE/JLE @xxx" - without Jcc
	t105="TEST Reg,Imm" + Jcc
	t106="CMP Reg,Imm" + Jcc
	t107="TEST Reg,Mem" + Jcc
	t108="CMP Reg,Mem" + Jcc
	t109="CMP Reg,Reg2" + Jcc
	t110="CMP Mem,Reg2" + Jcc
	t111="SUB Mem,Reg2" + Jcc
	t112="TEST Reg,Reg2" + Jcc
	t113="TEST Mem,Reg2" + Jcc
	t114="AND Mem,Reg2" + Jcc
	t115="CMP Mem,Imm" + Jcc
	t116="SUB Mem,Imm" + Jcc
	t117="TEST Mem,Imm" + Jcc
	t118="AND Mem,Imm" + Jcc
	t119="TEST Reg,Mem2" + Jcc
	t120="CMP Reg,Mem2" + Jcc
	t121="TEST Mem,Reg" + Jcc
	t122="TEST Mem2,Reg" + Jcc
	t123="AND Mem2,Reg" + Jcc
	t124="CMP Mem,Reg" + Jcc
	t125="CMP Mem2,Reg" + Jcc
	t126="SUB Mem2,Reg" + Jcc
	t127="TEST Mem2,Imm" + Jcc
	t128="AND Mem2,Imm" + Jcc
	t129="CMP Mem2,Imm" + Jcc
	t130="SUB Mem2,Imm" + Jcc
	t131="PUSH EAX"
	t132="PUSH ECX"
	t133="PUSH EDX"

6) Conclusion

Efficient Code Mutation has always been a goal for virus writers. The analysis of these techniques on a mathematical background is an important venture to increase the complexity of mutation techniques.

The word problem and its connection to the antiviral detection problem gives just the right basis for our work.

Creating a metamorphic mutation engine needs very much time, and without using a type-2, type-1 or even type-0 grammar, the detection is still "trivial". Creating a concept for a valueable formal grammar of mutation engine does not need much time, but increases the complexity alot - the complexity may become cubical or even exponential.

My final advice: Invest some hours for planning a efficient grammar, and as result your engine will be harder to detect by analytic scanning.

- - - - - - - - - - - - - -
    Second Part To Hell

  written in October 2008
- - - - - - - - - - - - - -

7) References

  1. Qozah; "Polymorphism and Grammars"; 29A E-Zine issue 4; 1999.
  2. Filiol, Eric; "Metamorphism, Formal grammars and Undecidable Code Mutation"; International Journal of Computer Science, vol. 2, number 1, 2007, pp. 70-75; 2007.
  3. Chomsky, Noam; "Three models for the description of language"; IRE Transactions on Information Theory (2): 113-124; 1956.
  4. The Mental Driller; "Metamorphism in practice"; 29A E-Zine issue 6; 2002.
[Back to index] [Comments]
By accessing, viewing, downloading or otherwise using this content you agree to be bound by the Terms of Use! aka