# On the Time Complexity of Computer Viruses

Zhihong Zuo, Mingtian Zhou, Qing-xin Zhu
IEEE Transactions on Information Theory, Vol. 51, No. 8
ISSN 0018-9448
August 2005

3

Manuscript received July 13, 2004; revised December 26, 2004.
The authors are with the College of Computer Science and Engineering,
University of Electronic Science and Technology of China, Chengdu, Sichuan,
610054 China (e-mail: [email protected];[email protected]; [email protected]).
Communicated by T. Johansson, Associate Editor for Complexity and Cryptography.
Digital Object Identifier 10.1109/TIT.2005.851780

## Abstract

Computer viruses can disable computer systems not only by destroying data or modifying a system's configuration, but also by consuming most of the computing resources such as CPU time and storage. The latter effects are related to the computational complexity of computer viruses. In this correspondence, we investigate some issues concerning the time complexity of computer viruses, and prove some known experimental results mathematically. We prove that there exist computer viruses with arbitrarily long running time, not only in the infecting procedure but in the executing procedure. Moreover, we prove that there are computer viruses with arbitrarily large time complexity in the detecting procedure, and there are undecidable computer viruses that have no "minimal" detecting procedure.

Index Terms

Computational complexity, computer viruses, detection, infection, time complexity.

## I. Introduction

The first abstract theory of computer viruses is the viral set theory given by Cohen, based on the Turing machine [1], [2]. A viral set is defined by where is a Turing machine and is a nonempty set of programs on the Turing machine. Each is called a computer virus and satisfies the following condition: if it is contained in the tape at time , then there exist a time and a such that is not contained in the tape at time , but contained in the tape at time . The most important one of Cohen's theorems is about the undecidability of computer viruses [1], [2].

In a different approach, Adleman developed an abstract theory of computer viruses based on recursive functions [3]. In his definition, a virus is a total recursive function v which applies to all programs (the Gödel numberings of programs), such that has characteristic behaviors of computer viruses such as injury, infection, and imitation. Furthermore, Adleman proved that the set of computer viruses is -complete [3].

Although several abstract theories about computer viruses were established and many important results were derived from these theories, we know little about the computational complexity of computer viruses. Very little research has been done so far in this respect. Adleman [3] discussed some problems but failed to give any conclusion. Spinellis [4] proved that to reliably identify the bounded-length of computer viruses is NP-complete. To our best knowledge, there is no more results on computational complexity of computer viruses.

In the following, we investigate some issues on the time complexity of computer viruses. We focus on two kinds of time complexity: the time complexity of computer viruses in their computational environment, and the time complexity of detecting computer viruses. It is well known that there exist arbitrarily complex recursive functions [5], [6], but we cannot apply this fact to computer viruses directly, because computer viruses are some "fixpoints" of recursive operations [3]. On the other hand, theoretical results about the time complexity of detecting computer viruses are very important to antivirus practice.

The structure of the correspondence is as follows: In Section II, we introduce some notations; in Section III, we give definitions of computer viruses based on recursive functions. In Section IV, we prove some auxiliary lemmas and then derive main results about time complexity of computer viruses. In Section V, we give a brief summary and some suggestions for future research.

## II. Preliminaries

We describe some notations below.

Let be the set of all natural numbers and be the set of all finite sequences of natural numbers. For , let denote a computable injective function from to and its inverse is also computable. If is a partial function, for , we write instead of . Similarly, for denote a computable injective function from to , satisfying for all , and its inverse is also computable. For a partial function , let represent .

For a sequence , let denote its th element, and denote the sequence obtained by replacing with in , that is,

If is a computable function, is simply written as . If more than one elements in p are replaced or evaluated by some computable functions, we write the result as

respectively.

Adopting Adleman's notations [3], let denote a function computed by a computer program in the running environment where represents data (including clock, spaces of diskettes, and so on) and represents programs (including operating systems) stored on computers. If the index (the Gödel numbering) of is , the function is also denoted by .The domain and range are denoted by and , respectively. If is a recursive function, we also use the symbols and for its domain and range. It is worth noting that there is no essential distinction between and , as in the case of Von Neumann machine. In this correspondence, we use the symbol just for easy understanding.

For any program , let be the function defined as follows:

If is an index of , we write for . It is obvious that for all , and the predicate defined by [6].

We say that a predicate holds for almost all , if holds for all but finitely many numbers . Equivalently, there is a number such that holds for all . This property is simply denoted by a.e.(almost everywhere).

## III. Definitions of computer viruses

In this section, we extend Adleman's definitions about computer viruses [3] to comply with common understanding of computer viruses [7].

Defintion 3.1 (Nonresident Virus): A total recursive function is called a nonresident virus if for all , satisfies the following

1. and are two recursive functions. and are two recursive predicates such that there is at least one pair of in which holds, and that there is no in which both and hold simultaneously.
2. The set is infinite.

and are called injury condition (trigger) and infection condition, respectively. When holds, the virus executes the injury function , and when holds, the virus chooses a program by selection function , infects it first, then executes the original program . Conditions and together with functions and are called the kernel of a nonresident virus in the rest of this correspondence, because they determine a nonresident virus uniquely.

In definition 3.1, formula (i), (ii), and (iii) describe three typical types of behavior of computer viruses, namely, injury, infection, and imitation. Condition b) guarantees that an infected program would infect at least one other program, and the infection and injury action cannot be completed simultaneously. In other words, infectivity is an indispensable attribute of computer viruses. Condition c) requires that an infected program imitates the original program at infinitely many points. It is a quite strong condition and necessary for the important property that the set of one type of computer viruses with same kernel is -complete [7]. However, it is not needed in investigating time complexity about computer viruses.

We may define almost all kinds of computer viruses in this way. For example, we can give similar definitions for other kinds of computer viruses which are both important and interesting, theoretically and practically. In these definitions, we no longer list conditions b) and c) as in Definition 3.1, but assume they are included in all of these definitions.

Defintion 3.2 (Polymorphic Virus With Two Forms): The pair of two different total recursive functions and is called a polymorphic virus with two forms if for all , satisfies

Polymorphic viruses with forms can be defined as an -tuple of different total recursive functions, under similar conditions as in Definition 3.2. Polymorphic viruses have spread widely in the last ten years and caused a lot of trouble as they are very hard to detect. Polymorphic viruses have billions of forms and in general any two forms do not have three consecutive bytes in common. However, polymorphic viruses are not the hardest viruses to detect. Two other kinds of viruses, the polymorphic viruses with infinite forms and metamorphic viruses [8] are a real challenge.

Defintion 3.3 (Polymorphic Virus With Infinite Forms): A total recursive function is called a polymorphic virus with infinite forms if for all and , satisfies

and for all , .

Up to now polymorphic viruses with infinite forms have not been found in the real world, but their existence can be proved by the recursion theorem of [7] just like ordinary computer viruses.

Defintion 3.4 (Metamorphic Virus): The pair of two different total recursive functions and is called a metamorphic virus if for all , satisfies

where (resp., )is different from (resp., ).

A metamorphic virus seems to combine two different computer viruses and . But, there is a crucial distinction between a metamorphic virus and a pair of two viruses, that is, when infects a program, the program is indeed infected by (not ), and vice versa. The metamorphic virus can be defined in the similar way. The main difference between a metamorphic virus and a polymorphic virus is that each form of a polymorphic virus has the same kernel, but each component of a metamorphic virus has its own kernel [8].

More discussions about computer viruses can be found in [7].

## IV. Main results

In this section, we prove the main results of this correspondence, using the traditional notations and symbols in recursive function theory([9], [10]).

Lemma 4.1: Let be an infinite recursive set and be a total recursive function. Then there exists a recursive subset such that a.e., where is any index of the characteristic function of A.

Proof: Let be an increasing total recursive function with . Define

Since

and is a recursive predicate, there is an effective procedure to decide whether is defined, and to compute its value when it is defined. Consider the function

Since is an recursive set, is a well-defined total recursive function. Let , by diagonal construction of , whenever is defined.

Now, for any such that for infinitely many , let

( if does not exist). Choose such that and . If for some , there is nothing to prove. Assuming for all , we have

Hence, is the least index satisfying (11). From the definition (8) we know that is defined and equals . Thus, we have obtained that for any satisfying , for infinitely many , there exists a such that . Since whenever is defined, it follows that a.e.

Let . is a recursive set, and .

Roughly speaking, Lemma 4.1 implies that for any infinite recursive set, there exists a recursive subset whose decision procedures have arbitrarily large time complexity.

Corollary 4.1: For any total recursive function , there exists a recursive set such that a.e., where is any index of characteristic function defined by

Proof: Let , by Lemma 4.1 we have the conclusion.

Corollary 4.2: Let be a total recursive function and be recursively enumerable but not a simple set. Then there exists a recursive set such that a.e., where is any index of characteristic function of .

Proof: Since is recursively enumerable and not simple, its complement contains an infinite recursively enumerable set that again has an infinite recursive subset . Applying Lemma 4.1 to , there is a recursive subset such that a.e., where is any index of characteristic function of . Let , then is the set required.

Lemma 4.2: Let be a total recursive function. For any recursive function , there exists a total recursive function such that and a.e. for all , where is any index of .

Proof: By the s-m-n theorem, there is a total recursive function such that

and satisfies (if ) [6].

Define

where is given by (8) for and . From the proof of Lemma 4.1, is a total recursive function and a.e. for all , where is any index of .

Since

this completes the proof of the lemma.

Lemma 4.2 extends the s-m-n theorem in the sense that the index function could be any recursive function, not just the primitive recursive function as in the s-m-n theorem. Lemma 4.1 can be further extended as follows.

Lemma 4.3: For each , let and be a total recursive -ary function. There exists a total recursive -ary function such that

and a.e. for all , where is any index of .

Lemma 4.3 implies that for any type of computer viruses, there exists a computer virus whose infecting procedure has arbitrarily large time complexity. This conclusion is formulated in the following theorem.

Theorem 4.1: Let be a total recursive function. For any kind of computer viruses, there exists a computer virus such that a.e., where is any index of .

Proof: Let be a total recursive function. Consider

By Lemma 4.3, there is a total recursive function such that

and a.e. for all , where is any index of .

By the s-m-n theorem, there exists a total recursive function such that . By recursion theorem, there is an such that . Let , then

By Definition 3.1, the total recursive function is a nonresident virus and a.e., where is any index of . Similar results also hold for other kinds of computer viruses defined in Section III and [7]. This completes the proof of the theorem.

Intuitively, we can construct a virus with arbitrarily large time complexity in its infection procedure by adding time-consuming operations in the procedure. But Theorem 4.1 implies more than that. Since by our definition a virus is a recursive function mapping a program into its infected form , Theorem 4.1 shows that for any kind of computer viruses there is a virus such that any implementation of can have arbitrarily large time complexity in its infection procedure.

It is natural to ask whether there exists a computer virus such that the infected program has arbitrarily large time complexity for any program . In general the answer is NO, because may be a partial recursive function, and if so, by infinite imitation requirement (see c) in Definition 3.1) the function computed by the infected program is also a partial recursive function, as well as the function . Hence, the comparison with a total recursive function may be meaningless at infinitely many points. But for total recursive functions we have the following result.

Theorem 4.2: Let be a total recursive function. There exists a computer virus such that a.e. for any total recursive function .

Proof: For each , consider

where is defined in (8) with and . By the proof of Lemma 4.1, is a recursive function, and if is a total recursive function and is an index of , then for almost all .

From the proof of Theorem 4.1, there is a total recursive function satisfying

By Definition 3.1, the total recursive function is a nonresident virus. (In (20), condition (i) and (i'), (ii), (iii) denote injury, infection, and imitation of nonresident viruses, respectively.)

Since is an index of the recursive function , it follows that a.e. for total recursive function .

Virus detection is one of the most important issues in antivirus practice. It is well known that the set of all computer viruses is undecidable [1]. Furthermore, there exists a computer virus such that the set of its infected programs is undecidable [3], [11]. In other words, we can never find a procedure to pick up exactly all the programs infected by . Formally, let [3], then is a nonrecursive recursively enumerable set. To find all programs infected by , it is necessary to find a recursive set such that . This implies that for every detecting procedure of , if it has no false negatives, then it always has false positives.

Although most of the computer viruses in real world are decidable, there are still two unresolved questions. 1) If is decidable, what is its time complexity? 2) If is undecidable, what is the time complexity of the recursive set containing ? The following theorem gives a partial answer to the first question.

Theorem 4.3: Let be a total recursive function, then there exists a decidable computer virus such that for infinitely many , where is any index of characteristic function of .

Proof: Let be an infinite recursive set and be an increasing recursive function with . Consider

where is the identity function, and is defined by

From the proof of Theorem 4.1, there is a recursive function such that

and

Moreover, is an increasing function [10]. Let , then

By Definition 3.3, is a polymorphic virus with infinite forms.

Let . Since and are increasing functions, is also an increasing function. This implies that is a decidable virus and is recursive. Let , then . By Corollary 4.1, there is a set such that a.e., where is any index of the characteristic function of .

Now suppose is the characteristic function of and is one of its indices, since is the characteristic function of , a.e.. Let , this completes the proof of the theorem.

When is undecidable, we should consider the time complexity of the recursive sets containing . The following theorem shows that for any undecidable computer virus, there is one detecting procedure which has arbitrarily large time complexity.

Theorem 4.4: Suppose is a computer virus and is a nonrecursive recursively enumerable set. For any total recursive function there exists a recursive set such that a.e., where is any index of the characteristic function of .

Proof: Let be a computer virus and be the identity function. By Rice's theorem, is a recursively enumerable set. By Definition 3.1 b), is not an identify function for all . This implies if is any index of , then . Hence, and is not a simple set. By Corollary 4.2 we have the conclusion.

If is an undecidable computer virus, that is, is a nonrecursive recursive enumerable set, then is infinite for any recursive set containing . This implies that any of its detecting procedures which can pick up all of its infected program, will make infinitely many errors (false positives). However, it is often desired to find the detecting procedure that gives "minimal" errors. In other words, we want to find a minimal recursive set such that . Here a recursive set containing is called minimal means that for any recursive set such that , the set is finite. In the following theorem, we prove that this desired minimal recursive set does not exist for some computer viruses. The proof, inspired by Adleman [3], depends on the following lemma.

Lemma 4.4: Let be a creative set. Then there is no minimal recursive set containing .

Proof: Let be the productive function of . If satisfies , then clearly and . By the s-m-n theorem, let be a recursive function such that for all

Suppose is a recursive set such that , and let . Since is recursively enumerable, for some . Now let

Clearly is recursively enumerable. Since are pairwise distinct, is infinite and has some infinite recursive subset . Also, by the above properties, . Hence, and . Let , then is recursive and , but is infinite.

Theorem 4.5: There exists computer virus such that is nonrecursive and there is no minimal recursive set containing it.

Proof: Let be an increasing padding function, i.e., for all and , . Let and be the total recursive function such that . Consider the function

It is clear that . Let , since is a one-to-one function, it follows that

Now let be the one-to-one total recursive function such that

From the proof of Theorem 4.1, there exits a total function such that

Substituting by in (29), it follows that

Let , then is a nonresident virus by Definition 3.1. Since is a one-to-one total function, . Combined with (27) we have

i.e., . It means is 1-complete set, hence equivalently creative set. From Lemma 4.4, it follows that there is no minimal recursive set containing .

## V. Conclusion

In this correspondence, we discussed the time complexity of computer viruses. The main contributions are as follows.

1. We proved that there are computer viruses with arbitrarily large time complexity, not only in their infecting procedures, but also in their executing procedures.
2. We proved that there are computer viruses whose detecting procedures have sufficiently large time complexity.
3. We proved that there are undecidable viruses which have no minimal detecting procedure.

It is worth noting that in our discussions we used only the following two features of :

1. for all ; and
2. is decidable.

If we take these two conditions as axioms (as in [5]), all of the conclusions on time complexity of computer viruses can be extended directly to other computational complexity satisfying these two assumptions, such as space complexity of computer viruses, etc.

Contrary to the conclusion of Theorem 4.4, in antivirus practice we are concerned more about the existence of a recursive set satisfying and the characteristic function of have "low" time complexity, such as linear or polynomial time complexity. This problem is trivial when , but if we require that is infinite and is as small as possible, it is an open problem.

For example, typical pattern-based virus-detection software is usually based on the Boyer-Moore string-searching algorithm [12] (or similar algorithms), and can detect most of simple computer viruses with false positives in linear time. But it is not clear whether there are linear or polynomial algorithms which can work for all kinds of computer viruses, especially for undecidable computer viruses.

## Acknowledgment

The authors wish to thank the referees for their helpful comments that improved the correspondence greatly.

## References

1. F. Cohen, "Computational aspects of computer viruses," Computers & Security, vol. 8, no. 1, pp. 325-344, 1989.
2. F. Cohen, A Short Course on Computer Viruses. New York: Wiley, 1994.
3. L. M. Adleman, "An abtract theory of computer viruses," in Advances in Cryptology-CRYPTO'88 (Lecture Notes in Computer Science), S. Goldwasser, Ed. Berlin, Germany, 1988, vol. 403, pp. 354-374.
4. D. Spinellis, "Reliable identification of bounded-length viruses is NP-complete," IEEE Trans. Inf. Theory, vol. 49, no. 1, pp. 280-284, Jan. 2003.
5. M. Blum, "A machine-independent theory of the complexity of recursive function," J. ACM, vol. 14, no. 2, pp. 322-336, 1967.
6. N. Cutland, Computability: Introduction to Recursive Function Theory. Cambridge, U.K.: Cambridge Univ. Press, 1980.
7. Z. H. Zuo and M. H. Zhou, "Some further theoretical results about computer viruses," Comp. J., vol. 47, no. 6, pp. 625-633, 2004.
8. P. Ször and P. Ferrie. (2000) Hunting for Metamorphic. [Online]. Available: http://www.virusbtn.com
9. H. J. Rogers, Theory of Recursive Functions and Effective Computability. New York: McGraw-Hill, 1967.
10. R. I. Soare, Recursively Enumerable Sets and Degrees. New York: Springer-Verlag, 1987.
11. D. M. Chess and S. R. White, "An undetectable computer virus," in Proc. Virus Bulletin Conf., Orlando, FL, Sep. 2000.
12. R. S. Boyer and J. S. Moore, "A fast string searching algorithm," Commun. ACM, vol. 20, no. 10, pp. 262-272, Oct. 1977.