VX Heaven

Library Collection Sources Engines Constructors Simulators Utilities Links Forum

A Mathematical Theory for the Spread of Computer Viruses

Winfried Gleissner
Computers & Security, 8, 1989, pp.35-41
ISSN 0167-4048
February 1989

PDFDownload PDF (400.13Kb) (You need to be registered on forum)
[Back to index] [Comments]
\text{T_EX size}
Industrieanlagen-Betriebgesellschaft mbH, Ottobrunn, FRG

A model is introduced to treat the spread of computer viruses mathematically. A recurrence formula is given which allows a closed expression to be derived for the probability that, starting from an initial state, a given viral state will be reached after executing exactly k programs. In some special cases this recurrence formula can be used for numeric computations, It is shown that the infection process does not stop before all programs are infected, which are visible for any infected program in the initial state.

Keywords: Computer viruses, Mathematical model of virus infection

0167-4048/89/$3.50 © 1989, Elsevier Science Publishers Ltd.

1. The Situation

A computer virus is a program that can reproduce itself and modify other programs by including a possibly evolved copy of itself [1]. That means that whenever an infected program is called, the virus is implanted into another program, if there is any, of the account from where it was called. In refs 1 and 2 some experiments with this kind of program are described. The result was that within a few hours the whole computer system was infected. The aim of this paper is to develop a closed expression for the probability that any given program is infected after a given time. To achieve this aim it must be known which programs were initially infected and for each program one needs the probability that it is called. The use of a computer is regarded as a sequence of program calls.

For the purpose of the present paper there is no difference between the call of a system command, a standard program (editor, linkage editor, compiler), and a user-written program.

2. The Notation

In the computer system there are N accounts. The m_i programs in account i are denoted by P_1^i,P_2^i,\dots,P_{m_i}^i. Let {q_{i,j}}^{(k)} denote the probability that user i calls the program P_j^k in the account k. This gives


Let F denote the set of all N-vectors with integer components, which is defined as follows:

F = \{v \in N_0^N; 0 \le v_i \le m_i, 1 \le i \le N\}

For v \in F the computer system is said to be in viral state v, if in account i there are v_i infected programs. One can further assume that these are the first v_i programs according to the alphabetic order in which the programs are written down. For convenience the viral state in which all programs are infected is denoted by v^{(F)}, i.e.:

v^{(F)} = (m_1,\dots,m_N)^I

On F one defines a modulus by

|v| = \sum^N_{i=1}v_i

and an order by

v \ge \overline{v} \Leftrightarrow v_i \ge \overline{v}_i \hspace{50} 1 \le i \le N

Given the viral degree v \in F, one needs the probability s_{i,v} that an infected program will be called from account i

s_{i,v} = \sum^N_{k=1}\sum^{v_k}_{j=1}{q_{i,j}}^{(k)}

The probability t_v that any not infected program is executed is calculated as

t_v = \sum^N_{i=1}\sum^N_{k=1}\sum^{m_k}_{j=v_k+1}{q_{i,j}}^{(k)}

For purposes which will become clear later, one defines for v_i = m_i and for v=v^{(F)}

s_{i,v} = 0\hspace{50}t_{v^{(F)}}=1

Let {p_{\overline{v},k}}^{(v)} denote the probability that starting from viral state v viral stae \overline{v} will be reached after exactly the k program calls

3. The Recursion Formulae

The {p_{\overline{v},k}}^{(v)} can be calcualted recursively as follows:

	{p_{\overline{v},0}}^{(v)} = \{
		& 1 & v = \overline{v}\\
		& 0 & \text{otherwise}

For \overline{v} = (\overline{v}_1,\overline{v}_2,\dots,\overline{v}_N)^I \overline{v}^{(i)} denotes the vector whose ith component is reduced by unity

\overline{v}^{(i)} = (\overline{v}_1,\dots,\overline{v}_{i-1},\overline{v}_{i-1},\overline{v}_{i+1},\dots,\overline{v}_N)^I

Using this notation one obtains

{p_{\overline{v},k}}^{(v)} = t_{\overline{v}}\text{Prob}\{\overline{v}\text{ is reached after }k-1\text{ calls already}\}\\
	+ \text{Prob\{\text{the last program is infected by the }k\text{th call}\}

{p_{v,k}}^{(v)} = t_{\overline{v}}{p_{\overline{v},k-1}}^{(v)} + \sum^N_{i=1} s_{i,\overline{v}^{(i)}}{p_{\overline{v}^{(i),k-1}}}^{(v)}\hspace{100}(1)

The following recursion formula holds for the {p_{\overline{v},k}}^{(v)}.

Lemma 1:

{p_{\overline{v},k}}^{(v)} = \sum^N_{i=1}s_{i,\overline{v}^{(i)}} \sum^{k-1}_{j=0}{t_{\overline{v}}}^j{p_{\overline{v}^{(i),k-1-j}}}^{(v)}

Proof: The assertion is proven by induction on k For k=1

{p_{\overline{v},1}}^{(v)} = t_{\overline{v}}{p_{\overline{v},0}}^{(v)} + \sum^N_{i=1}s_{i,\overline{v}^{(i)}}{p_{\overline{v}^{(i)},0}}^{(v)}

and as {p_{\overline{v},0}}^{(v)} = 0

{p_{\overline{v},1}}^{(v)} = \sum^N_{i=1}s_{i,\overline{v}^{(i)}}\sum^0_{j=0}{t_{\overline{v}}}^j{p_{\overline{v}^{(i)},0-j}}^{(v)}

Induction from k to k+1

	&=& t_{\overline{v}}{p_{\overline{v},k}}^{(v)} + \sum^N_{i=1}s_{i,\overline{v}^{(i)}}{p_{\overline{v}^{(i)},k}}^{(v)}\\
	&=& t_{\overline{v}} \left( \sum^N_{i=1}s_{i,\overline{v}^{(i)}} \sum^{k-1}_{j=0} t_{\overline{v}}^j{p_{\overline{v}^{(i)},k-1-j}}^{(v)} \right) + \sum^N_{i=1}s_{i,\overline{v}^{(i)}}{p_{\overline{v}^{(i)},k}}^{(v)}\\
	&=& \sum^N_{i=1}s_{i,\overline{v}^{(i)}}\sum^k_{j=0}t_{\overline{v}}^j{p_{\overline{v}^{(i)},k-j}}^{(v)}

This ends the proof of the lemma.

4. The Calculation of the Infection Probabilities

For \overline{v} \ge v one defines

l = |\overline{v}| - |v| = \sum^N_{i=1}(\overline{v}_i - v_i)

Let F_{v,\overline{v}} denote the set of all sequences of length l of vectors w = (\overset{\rightarrow}{w}^{(i)})_{0\le i\le l} with

v = \overset{\rightarrow}{w}^{(0)} \le \overset{\rightarrow}{w}^{(1)} \le \dots \le \overset{\rightarrow}{w}^{(i)} \le \overset{\rightarrow}{w}^{(i+1)} \le \dots\\
\le \overset{\rightarrow}{w}^{(l-1)} \le \overset{\rightarrow}{w}^{(l)} = \overline{v}


|\overset{\rightarrow}{w}^{(i)}| + 1 = \overset{\rightarrow}{w}^{(i+1)}\hspace{100}0\le i\le l-1

if \overset{\rightarrow}{w}^{(\lambda)} and \overset{\rightarrow}{w}^{(\lambda+1)} differ in the i_\lambdath component then


For an element w \in F_{v,\overline{v}} one defines the Vandermond determinant V_w by

V_w = 
	1 & 1 & \dots & 1 & 1 \\
	t_{\overset{\rightarrow}{w}^{(0)}} & t_{\overset{\rightarrow}{w}^{(1)}} & & t_{\overset{\rightarrow}{w}^{(l-1)}} & t_{\overset{\rightarrow}{w}^{(l)}}\\
	\vdots & \vdots & \ddots & \vdots & \vdots \\
	{t_{\overset{\rightarrow}{w}^{(0)}}}^{l-1} & {t_{\overset{\rightarrow}{w}^{(1)}}}^{l-1} & \dots & {t_{\overset{\rightarrow}{w}^{(l-1)}}}^{l-1} & {t_{\overset{\rightarrow}{w}^{(l)}}}^{l-1}\\
	{t_{\overset{\rightarrow}{w}^{(0)}}}^l & {t_{\overset{\rightarrow}{w}^{(1)}}}^l & \dots & {t_{\overset{\rightarrow}{w}^{(l-1)}}}^l & {t_{\overset{\rightarrow}{w}^{(l)}}}^l\\

(For more details see ref [3] p.179.)

If the sequence w' is obtained from the sequence w by deleting the last element, the following recursion formula holds:

V_w = V_{w'}\prod^{l-1}_{\lambda=0}\left(t_{\overset{\rightarrow}{w}^{(l)}} - t_{\overset{\rightarrow}{w}^{(\lambda)}}\right)\hspace{100}(2)

For 0\le j \le 1 one obtains the determinant V_{w,(j)} deleting the last line and the jth column of V_w. The following determinant V_{w,k} is derived from the Vandermond form substituting the exponent k for exponent l in the last line:

V_{w,k} = 
	1 & 1 & \dots & 1 & 1 \\
	t_{\overset{\rightarrow}{w}^{(0)}} & t_{\overset{\rightarrow}{w}^{(1)}} & & t_{\overset{\rightarrow}{w}^{(l-1)}} & t_{\overset{\rightarrow}{w}^{(l)}}\\
	\vdots & \vdots & \ddots & \vdots & \vdots \\
	{t_{\overset{\rightarrow}{w}^{(0)}}}^{l-1} & {t_{\overset{\rightarrow}{w}^{(1)}}}^{l-1} & \dots & {t_{\overset{\rightarrow}{w}^{(l-1)}}}^{l-1} & {t_{\overset{\rightarrow}{w}^{(l)}}}^{l-1}\\
	{t_{\overset{\rightarrow}{w}^{(0)}}}^k & {t_{\overset{\rightarrow}{w}^{(1)}}}^k & \dots & {t_{\overset{\rightarrow}{w}^{(l-1)}}}^k & {t_{\overset{\rightarrow}{w}^{(l)}}}^k\\

and if l = 0

V_{w,k} = {t_{\overset{\rightarrow}{w}^{(0)}}}^k

The V_{w,k} can be calculated using the V_{w,(j)} by the following lemma:

Lemma 2:

		{t_{\overset{\rightarrow}{w}^{(l)}}}^k - 
		\prod^{l-1}_{i=0\\i\ne j}
			({t_{\overset{\rightarrow}{w}^{(l)}}}^k -
			{t_{\overset{\rightarrow}{w}^{(j)}}}^k) V_{w',(j)}

Proof: Adding the first line times {t_{\overset{\rightarrow}{w}^{(l)}}}^k to the negative of the last line one obtains

	&=& (-1)\left|\begin{array}{cc}
		1 & 1 & \dots & 1 & 1 \\
		t_{\overset{\rightarrow}{w}^{(0)}} & t_{\overset{\rightarrow}{w}^{(1)}} & & t_{\overset{\rightarrow}{w}^{(l-1)}} & t_{\overset{\rightarrow}{w}^{(l)}}\\
		\vdots & \vdots & \ddots & \vdots & \vdots \\
		{t_{\overset{\rightarrow}{w}^{(0)}}}^{l-1} & {t_{\overset{\rightarrow}{w}^{(1)}}}^{l-1} & \dots & {t_{\overset{\rightarrow}{w}^{(l-1)}}}^{l-1} & {t_{\overset{\rightarrow}{w}^{(l)}}}^{l-1}\\
		{t_{\overset{\rightarrow}{w}^{(l)}}}^k - 
		{t_{\overset{\rightarrow}{w}^{(0)}}}^k & 
		{t_{\overset{\rightarrow}{w}^{(l)}}}^k -
		{t_{\overset{\rightarrow}{w}^{(1)}}}^k & \dots &
		{t_{\overset{\rightarrow}{w}^{(l)}}}^k -
		{t_{\overset{\rightarrow}{w}^{(l-1)}}}^k & 0 \\
	&=& (-1)\sum^l_{j=1}(-1)^{2(l+1)-j}
			{t_{\overset{\rightarrow}{w}^{(l)}}}^k - 
		) V_{w,(l-j)}\\
	&=& \sum^{l-1}_{j=0}(-1)^{l+1-j}
			{t_{\overset{\rightarrow}{w}^{(l)}}}^k - 
		) V_{w,(j)}\\
	&=& \sum^{l-1}_{j=0}(-1)^{l+1-j}
			{t_{\overset{\rightarrow}{w}^{(l)}}}^k - 
			\prod^{l-1}_{i=0\\i\ne j}
				t_{\overset{\rightarrow}{w}^{(l)}} -
			) V_{w',(j)}

Now the following theorem can be proven without undue effort:


{p_{\overline{v},k}}^{(v)}=\sum_{w\in F_{v,v}}\prod^{l-1}_{\lambda=0}s_{i_\lambda,\overset{\rightarrow}{w}^{(\lambda)}}\frac{V_{w,k}}{V_w}

Proof: The assertion is proven by induction on the distance l of v and \overline{v}. For l=0




Induction from l-1 to l By Lemma 1

	&=& \sum^N_{i=1}s_{i,\overline{v}^{(i)}}\sum^{k-1}_{j=0}{t_{\overline{v}}}^j {p_{\overline{v}^{(j)},k-1-j}}^{(v)}\\
	&=& \sum^N_{i=1}s_{i,\overline{v}^{(i)}}\sum^{k-1}_{j=0}{t_{\overline{v}}}^j \sum_{w\in F_{v,\overline{v}^{(i)}}} \prod^{l-2}_{\lambda=0}s_{i_\lambda,\overset{\rightarrow}{w}^{(\lambda)}}\frac{V_w,k-1}{V_w}\\
	&=& \sum^N_{i=1}\sum_{w\in F_{v,\overline{v}^{(i)}}}\frac{s_{i,\overline{v}^{(i)}}}{V_w}\prod^{l-2}_{\lambda=0}s_{i_\lambda,\overset{\rightarrow}{w}^{(\lambda)}}\sum^{k-1}_{j=0}{t_{\overline{v}}}^j V_{w,k-1}

With Lemma 2 one infers

\sum^{k-1}_{j=0}{t_{\overline{v}}}^j V_{w,k-1} = \frac{V_{(w,\overline{v}),k}}{\prod^{l-1}_{j=0}\left(t_{\overset{\rightarrow}{w}^{(l)}}-t_{\overset{\rightarrow}{w}^{(j)}}\right)}

and with this and eqn (2)

	&=& \sum^N_{i=1}s_{i,\overline{v}^{(i)}}\sum_{w\in F_{v,\overline{v}^{(i)}}}\prod^{l-2}_{\lambda=0}s_{i_\lambda,\overset{\rightarrow}{w}^{(\lambda)}}\frac{V_{(w,v),k-1}}{V_{(w,v)}}\\
	&=& \sum_{w\in F_{v,\overline{v}}}\prod^{l-1}_{\lambda=0}s_{i_\lambda,\overset{\rightarrow}{w}^{(\lambda)}}\frac{V_{w,k}}{V_w}

This ends the proof of the theorem.

5. Conclusions

The conclusions will be stated in the form of three remarks.

Remark 1: Accounts that do not call any infected programs cannot be infected. If for v\in F there exists an i such that s_{i,v}=0 then for \overline{v}\ge v with \overline{v}_i > v_i

{p_{\overline{v},k}}^{(v)} = 0\ \ \text{for}\ all\ \ k\in N

Proof: As \overline{v}_i>v_i there must be an infection in the account of the ith user, i.e. for every w\in F_{v,\overline{v}} there must be an index \lambda such that i_\lambda=i. As

	{p_{\overline{v},k}}^{(v)}=\sum_{w\in F_{v,\overline{v}}}\prod^{l-1}_{\lambda=0}s_{i_\lambda,\overset{\rightarrow}{w}^{(\lambda)}}\frac{V_{w,k}}{V_w}=0

For v\in F one defines

	N_v=\{i; s_{i,\overline{v}} = 0, \overline{v} \ge v\}\\
	F_v=\{\overline{v}\in F; \overline{v}\ge v\text{ and }\overline{v}_i=v_i\text{ for }i\not\in N_v\}\\
	\overline{v}^{(v)}\in F_v\ with\ {v_i}^{(v)}=m_i\text{ for }i\not\in N_v

Remark 2: If v is the initial viral state, the infection process does not stop before the viral state \overline{v}^v is reached. One has to show that for \overline{v}\in F_v with \overline{v}\ne \overline{v}^v


Proof: As s_{i,v}>0,t_{\overset{\rightarrow}{w}^{(h)}}<1for all w\in F_{v,\overline{v}}. If all t_{\overset{\rightarrow}{w}^{(h)}} are distinct


If some t_{\overset{\rightarrow}{w}^{(h)}} are equal, once can choose sequences t_{{\overset{\rightarrow}{w}_\varepsilon}^{(h)}} with

	t_{\overset{\rightarrow}{w}^{(h)}}\ne t_{\overset{\rightarrow}{w}^{(k)}}\text{ for }h\ne k\text{ and }t_{\overset{\rightarrow}{w}^{(h)}}<1

Using de l'Hopital's rule one calculates


and hence


Remark 3: The maximum possible viral degree \overline{v}^{(F)} is actually reached in the limit

\text{If }\overline{v}=\overline{v}^{(F)}\\

Proof: The case where some t_{\overset{\rightarrow}{w}^{(h)} are equal can be treated as the limit of the case where all t_{\overset{\rightarrow}{w}^{(h)} are distinct. Hence only this case will be considered. The assertion shall be proven by induction on the distance l between v and \overline{v}. For l=1 t_{\overset{\rightarrow}{w}^{(0)}}=1 and




For the induction v^{(i)} denotes the viral degree which is obtained from v by adding 1 to the ith component. The distance between v^{(i)} and \overline{v}^{(F)} is assumed to be l. Then

	&=&\lim_{k\to\infty}\sum_{w\in F_{v,\overline{v}}}\prod^l_{\lambda=0}s_{i_\lambda,\overset{\rightarrow}{w}^{(\lambda)}}\frac{V_{w,k}}{V_w}\\
		\frac	{\prod^{l-1}_{\lambda=0}s_{i_\lambda,\overset{\rightarrow}{w}^{(\lambda)}}}
	&=&\sum^N_{i=1}\frac{s_{i,v}}{1-t_v}\lim_{k\to\infty}{p_{\overline{v},k}}^{(v^(i))} = \sum^N_{i=1}\frac{s_{i,v}}{1-t_v}=1

Fig.1 Infection process with the probability model

Fig.1 Infection process with the probability model

6. Numerical Examples

Number of infection paths: Using the notation introduced above one can calculate the number {I_N}^{(v)} of possible infection paths leading from an initial viral state v=(v_1,\dots,v_N) to the final state v^{(F)}=(m_1,\dots,m_N).

Theorem: The following formula holds for the number of infection paths:


Proof: Whenever a program is infected one writes down the number of the account in which it resides. Thus the infection process is represented as a finite sequence of integers with elements between 1 and N. The length of the sequence is


There are exactly m_i-v_i entries with value i in this sequence for 1\le i\le N. This gives for N=2


The rest follows by induction on N.

Fig.2 Infection process of a real system

Fig. 2 Infection process of a real system

This shows that it is not feasible to do calculations for more than once account and a sufficiently large number of programs.

One account with m programs: For convenience the assumption is made that all programs are called with equal frequency 1/m. The computation proceeds as follows. Initially only one program is infected. After each call of a program the expectation for the number of infected programs is calculated. As soon as it exceeds the number of initially infected programs by more than one, the process is started anew but with one more program, which is initially infected.

On the x axis the number of runs of a program is plotted and on the y axis the percentage of infected programs. Supposing that there are 80 programs of which only one is infected in the beginning the model shows that all of them are infected after 378 calls (Fig. 1).

This is contrasted with an infection process on a real system, where the number of the program to be called was determined by a random number generator. Figure 2 shows a typical infection process on a real system. The number of program calls, after which the whole account was infected, lay in the range from 230 and 700. The nature of the curve shows an exponential growth of the infection process for the real system as well as for the probability model. Assuming an average number of ten runs per hour this shows that after about 40 hours all programs are infected. The computations show that m programs will be infected after approximately 5m runs.

Takeover of users: The model can be used to say something when all users are infected instead of the takeover of all programs. This problem can be treated with similar arguments as above. Assuming that each account has only one program or that the virus infectes all programs of the account from which it was called, this case is reduced to the situation discussed earlier.


  1. F. Cohen, Computer Viruses, Theory and Experiments, Preprint, University of Southern California, 1984.
  2. F. Cohen, Computer Viruses, PhD Thesis, University of Southern California, 1985.
  3. S. Lang, Linear Algebra, Addison-Wesley, Reading, MA, 1970.

Dr. W. Gleissner Dr. W. Gleissner received a Masters degree in numerical analysis from Oxford University in 1972 and a Diploma in mathematics from the Technical University of Munich in 1973. His main fields of interest were interval arithmetic and approximation theory but later changed to pure mathematics and representation theory of infinite groups. He received a doctoral degree in 1978 from Technical University of Munich. He then joined the Munich Reinsurance Co. as a project leader for the computer-based administration of the life reinsurance business for client insurance companies from all over the world. Later he was in charge of introducing workstations and in 1985 he became responsible for guideline development and programming methodology in an institution produsing software for the administration of the Federal Republic of Germany. His main fields of interest now are computer viruses and security-related topics in Unix. He was published more than a dozen papers about mathematical models in economics (Industrieanlager-Betriebgesellschaft mbH, Einsteunstrasse 20, D-8012 Ottobrunn, F.R.G.)

[Back to index] [Comments]
By accessing, viewing, downloading or otherwise using this content you agree to be bound by the Terms of Use! aka