
Stripping down an AV engine

Igor Muttik
Virus Bulletin Conference
September 2000

Network Associates Inc (McAfee Division), Alton House, Gatehouse Way,
Aylesbury, Bucks, HP4 8YD, UK
Tel +44 1296 318700 • Fax +44 1296 318777 • Email [email protected]


The complexity of anti-virus software has grown enormously over the last five years. The methods used to detect viruses have evolved from dumb-grunt scanning of the whole file from top to bottom for a specific search string to very intelligent methods based on a combination of heuristic and specific detection methods. This paper discusses this evolution in detail: from old-fashioned methods to the most complex contemporary ones. When speaking to the people not directly involved in the AV business I found it rather amusing that they are usually surprised to find that these days we do not use scan strings as such any more. In fact, we do, but not frequently, because for contemporary malware better methods can be used. What are they?

Definitions and examples of various detection methods are given, including: search string detection, checksumming (CRCing), X-raying, elimination, static heuristic analysis, dynamic heuristic analysis, etc. The advantages and disadvantages of these individual methods and of their combinations are presented. A theoretical battle is constantly going on: which is better, specific, precise detection or generic ways of handling viruses?

There are pros and cons to both approaches.

Some AV products utilize both approaches. However, there are many different ways to combine two approaches. What is the best way to mix them? With the worm, Trojan and backdoor problem becoming more and more serious the scanners have to deal with a lot of new malware written mostly in high level languages. Old detection methods are not very suitable for those kinds of files. Which methods can be used effectively to enable reliable and generic detection of malware without giving false alarms? Can contemporary malware be detected generically? The effect of different detection methods on the speed of scanners is analysed.

The key requirement of an AV product is to ensure minimal disruption to the everyday operation of a user – regarding both the AV software itself and the malware it protects the user from. Pure heuristic methods are not perfect in that respect – they tend to give more false alarms than usual methods, and the quality of cleaning they can achieve is lower than that of specific methods.

The advantages and disadvantages of ‘generic vs. specific’ are discussed and applied to different stages of a scanner’s operation: detection, identification, reporting and cleaning. We analyse how different approaches affect the individual requirements of different users.

1 The evolution of scanner technology – A brief overview

When the first viruses appeared, programmers quickly researched them and came up with an idea of how to detect them easily and reliably. As viruses copy themselves from one executable file into another, all infected files have a virus body embedded somewhere in a host file. To find a virus one only has to scan the host file for a sequence of bytes that is specific to the virus, and cannot be found in a normal program. Such stable byte sequences are called ‘search strings’.
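
As a minimal sketch of this earliest approach (the signature bytes and virus name below are entirely fictitious), scanning amounts to looking for a stable byte sequence anywhere in the file:

```python
# Naive search-string scanning of a whole file, as the earliest scanners did.
# The signature bytes and virus name below are invented for illustration.
SIGNATURES = {
    b"\xb4\x09\xba\x00\x01\xcd\x21": "Example.Virus.A",
}

def scan_for_strings(data):
    """Return the name of the first matching search string, or None if clean."""
    for pattern, name in SIGNATURES.items():
        if pattern in data:          # scan the file from top to bottom
            return name
    return None
```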

Then, it was realised that scanning files from top to bottom is not really necessary, because viruses always affected the entry point of executable files, and so viral code could be found more quickly by following the entry point and then looking for a search string. Then encrypted and polymorphic viruses appeared. They either had very short stable byte sequences or none at all. In response to such viruses two techniques were developed – X-raying and emulation. X-raying is brute-force decryption of the virus body based on knowledge of the encrypted plain text (the search string). Emulation is simulated execution of the viral code: the code runs under program control rather than on the real processor (and so is much slower than real execution).
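
X-raying can be sketched as a known-plaintext attack. The single-byte XOR scheme and the plaintext fragment below are simplifying assumptions – real virus encryption is far more varied:

```python
# Brute-force "X-raying": try every key of an assumed cipher (1-byte XOR here)
# and check whether a known fragment of the decrypted virus body appears.
KNOWN_PLAINTEXT = b"VIRUS"   # fictitious stable fragment of the decrypted body

def xray(data):
    """Return the XOR key that reveals the known plaintext, or None."""
    for key in range(256):
        decrypted = bytes(b ^ key for b in data)
        if KNOWN_PLAINTEXT in decrypted:
            return key
    return None
```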

Also, for detecting polymorphic viruses, scanners sometimes use a method of frequency analysis of the processor’s opcodes found in the program. It is rather effective for some polymorphic viruses when they use only a limited set of opcodes in their decryptors or have strong preferences for particular opcodes. For example, a lot of complex polymorphic viruses never use the main DOS service interrupt (21h) in their decryptors, while most legitimate programs use it frequently and near the beginning. So, an opcode frequency table for many viruses would have 0 for CD 21. Thus we can eliminate a lot of legitimate programs from being unnecessarily analysed and emulated.
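
A sketch of such an elimination, with an assumed window size: if CD 21 appears near the start of the code, the file looks like an ordinary DOS program and the expensive analysis can be skipped:

```python
# Opcode-based elimination: many polymorphic decryptors never issue the DOS
# service interrupt (CD 21), while legitimate programs use it early and often.
INT21 = b"\xcd\x21"

def worth_deep_analysis(code, window=256):
    """Return False for code that looks like a typical legitimate DOS program."""
    if INT21 in code[:window]:
        return False    # early INT 21h: skip costly emulation for this file
    return True         # no early INT 21h: candidate for deeper analysis
```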

Later, we saw a few viruses that used so called entry obfuscation techniques. That means the entry point of the host program is not affected but the virus modifies the code of the host program far from the entry point. The viruses that are most difficult to detect properly are those which combine polymorphism and entry point obfuscation. Fortunately, entry-obfuscating viruses are not very reliable and so do not present many problems for AV developers. They are also more difficult to write, so they are rare.

When the first scanners appeared, nobody knew that the number of viruses would grow so fast and for so long. The sheer volume of viruses required some techniques to optimise the performance so that the scanning time would not grow as quickly as the number of viruses.

One of the most difficult types of objects for scanners to deal with are viruses written in high level languages (HLL viruses) – Pascal, C, Basic, etc. The problem is that HLL programs have lots of common byte sequences and it is not easy to select a search string that is specific to a virus and cannot be found in a normal HLL program. This also explains why detection for HLL viruses produces more false alarms. The solution is either to base the detection on the data areas of the file (such as relocation table or text strings) or checksum significant areas of the file.

Sometimes it is more efficient to use checksums than search strings. The reason is simple – a 32-bit checksum takes four bytes, while a decent search string for a HLL program can hardly be shorter than 8–12 bytes. Of course, scanners cannot afford to checksum each and every file (for speed reasons), so they have to apply at least some sort of additional checking (elimination) before going into checksumming loops. For example, the file size and/or file type (COM/EXE/NE/PE/etc.) can be used for such elimination purposes. Before starting to checksum (which is a time-consuming operation) the scanner may also check for the presence of a short search string which ought to be present.
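
A sketch of this ordering of checks, from cheapest to most expensive (the record fields here are invented):

```python
import zlib

# Apply cheap eliminations (size, then a short mandatory string) before the
# expensive CRC32 over a significant area of the file.
def matches_record(data, size, short_string, area, expected_crc):
    if len(data) != size:           # cheapest elimination first
        return False
    if short_string not in data:    # short search string that ought to be present
        return False
    return zlib.crc32(data[area]) == expected_crc   # costly check last
```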

Recently, the virus menace has been overtaken by the Trojan threat [1]. For example, in USENET newsgroup postings, backdoor Trojans are as common as viruses [2] (particularly common is Backdoor-G2, also known as SubSeven – it was posted to newsgroups 835 times between May and June 2000). As backdoor Trojans are reused so frequently, scanners are very effective against them, and these days most decent scanners detect many popular Trojans. When detection of Trojan horses was first implemented in scanners, the techniques developed for HLL viruses proved very useful, because most contemporary Trojans are HLL programs.

Over the last five years scanner technology has changed dramatically, mainly because new threats have emerged – macro viruses, password-stealing and backdoor Trojans, network-aware worms, script viruses. Also, new, complex containers have appeared – CAB files, OLE containers, MS-Install (MSI) files, RAR archives, PE-packed files, etc. To offer sufficient protection, scanners have to be able to go into these objects and scan the contents (at least in on-demand mode; for an on-access scanner, going into packed objects does not make much sense and would cause unacceptable delays). Modern OSes have complicated what should have been relatively simple scanning processes by introducing new file types and associations between data files and executables (like VBS scripts associated with WSCRIPT.EXE). All these factors have led to a huge growth in the complexity of AV software. The first scanners could be written in Pascal (e.g. FindVirus v.1) or assembly language (TbScan); these days, all of them are developed in C or C++ and have modular sources portable to many different platforms and OSes.

The development of the Internet brought email-borne viruses and worms. The most significant shift is that scanners were always designed to check for viruses in files, yet some contemporary viruses do not have to exist in file form to pose a danger. For example, script viruses that live in the body of emails (such as JS/Kak) can activate from within an email. Only after such a script is activated in Outlook will the virus attempt to modify the local file system to install itself. Before that, only an email gateway scanner has the ability to catch the virus. This is a rather important change that should be acknowledged.

2 An engine and an AV database

Any AV scanner comprises the engine and the virus detection database. They work together and are truly inseparable. In cases when bits of the database are in machine code, the distinction between where the engine stops and the database starts is nearly lost. In some implementations the engine only serves as a loader for the database and all the functionality is implemented in the database. In other cases, the engine serves as a library of commonly used functions and the database simply exercises these functions to perform the task of finding a virus. Generally, we can say that the engine is less volatile, while the AV database is very volatile. That would reflect the habit of people to upgrade the engine rarely, but update the database religiously and as frequently as possible. That strategy is poor – updating the engine is also very important. It goes without saying that to achieve the best detection rates both the latest engine and the latest database update should be used. That is particularly true for the scanners where the engine carries a lot of functionality.

Deciding on how to distribute scanner functionality between the engine and the database is not easy. The greatest flexibility is achieved when the engine is simple and the database carries huge chunks of the executable code. However, such setups may have stability problems because frequent updates affect the executable code. When the engine is updated less frequently the stability is better but the flexibility in covering new threats is reduced. The optimal solution is to combine the best of the two approaches. It is good to have an infrequently updated scanning engine with no active code in the database, but with the ability to implement tasks of whatever complexity in the database (via an interpreted p-code or via some sort of scripting language). That gives the necessary flexibility whilst not compromising the stability of the software.
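
As a toy illustration of the last arrangement, the engine below interprets a tiny, invented p-code, so that arbitrarily complex detection logic can ship in database updates without shipping native code:

```python
# The database record is data (a list of p-code instructions), not executable
# machine code; the engine interprets it. Opcodes are invented for this sketch.
def run_record(program, data):
    """Interpret (opcode, argument) pairs; True means the record matches."""
    for op, arg in program:
        if op == "size_is" and len(data) != arg:
            return False
        if op == "contains" and arg not in data:
            return False
    return True
```

Because the record is pure data, a buggy update cannot crash the engine the way freshly shipped machine code could, which is the stability argument made above.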

3 Heuristic scanning

When the number of viruses reached several hundred, it was realised that to be more effective scanners have to be able to detect new, unknown viruses. There are two ways of catching new viruses. The first is to select very generic search strings that ensure detection of huge groups of related viruses. The second is to build a rule-based system that applies heuristic rules and produces an overall score: if the program does lots of suspicious things, the score is high and the program is likely to be a virus. There are two different ways of applying heuristic rules: static and dynamic. The static method checks for the presence of suspicious code fragments (whether they are executed or not). The dynamic method emulates the program and checks which actions are really performed (that is, simulation of the virus’s execution in a virtual environment, frequently called a ‘sandbox’ or an emulator buffer).
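
A minimal sketch of the static, score-based variant; the rules, weights and threshold are invented for illustration:

```python
# Static heuristic scoring: each suspicious trait found in the file adds its
# weight, and the total is compared with an alarm threshold.
RULES = [
    (b"\xcd\x13", 2, "direct disk access via INT 13h"),
    (b"*.COM",    2, "searches for executables to infect"),
    (b"\xcd\x21", 1, "uses DOS services"),   # weak evidence on its own
]
THRESHOLD = 4

def heuristic_score(data):
    """Return (score, suspicious?) for the given program image."""
    score = sum(weight for pattern, weight, _ in RULES if pattern in data)
    return score, score >= THRESHOLD
```

Note that no single weak rule can trigger the alarm; only a combination of traits pushes the score over the threshold, which is what keeps false alarms down.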

Neither the static nor dynamic method is ‘better’ – both have advantages and disadvantages. For example, the static method can trigger on the remnants of viral code that are not really executed, or it can miss some suspicious actions that become visible only as execution occurs. Dynamic methods, however, are simply slow. The best results are achieved by combining both methods. With the growing flexibility of AV databases, heuristic detection can now be implemented as part of a database so that it can be updated more easily. That is certainly a good thing as it enables the bugs and false alarms to be fixed quickly.

4 Detection and cleaning: Generic detection and cleaning

Should AV scanners clean out infections or not? The answer seems obvious because why would users want to know if they ‘have’ a problem? Most would much rather know that they ‘had’ a problem. Of course, restoring files from backups is optimal, but what if a backup policy has not been established or the tape has tangled? It is much better if a user has a choice of whether to employ a backup or the ability of a scanner to clean the infection(s). The ability of a scanner to perform disinfection may also improve user efficiency if the alternative is to call the IT department in order to restore from backups.

The question for an AV producer is where to stop in distinguishing between different viruses. One may say that if a particular virus is detected heuristically, no more work needs to be done on it: detecting and stopping a virus at the point of entry means the scanner does not have to perform disinfection at all. Such an approach, of course, makes some sense, and there are scanners on the market (some reasonably successful) that do not offer, say, cleaning of infected programs. Such a producer would have a huge chunk of viruses detected but not cleanable, because the ability to distinguish between these viruses and clean them appropriately has not been built into the database. This approach means the detection resolution is very crude and in most cases inadequate for good disinfection.

Let us imagine a family of two viruses differing by only one bit in their body. An AV producer may want to distinguish between such variants (for the sake of scientific classification, for example), although such differentiation is not, of course, necessary to ensure perfect disinfection. The product would have a very fine variant resolution but would not be able to clean new variants before they are added to the AV database. In other words, cleaning in such a product would be based on a particular virus variant rather than on generic family cleaning.

Even if different variants in the database of such a scanner are mapped onto a single repair routine, users do not get the full advantage of this generic repair, as variant checking would prevent the scanner from cleaning a new variant. And if generic repair is automatic for the whole family, there is not much benefit in having exact variant identification (see more on this topic in section 6).

A third producer may say ‘We want to have generic detection for as many families with similar cleaning as possible’. In practice, the decision is made easily, because generic cleaning is almost always acceptable and desirable, so this last alternative becomes the only viable one. Its advantage is that a great many new viruses are detected and cleaned, which means users have a proactive scanner – something that was not available a few years ago. For binary infectors the achievable detection and cleaning rate is above 50%. For macro viruses and script viruses that level can be as high as 90% (because they are much simpler and easier to handle). As macro and script viruses are the most common in the field, the overall cleaning rate for a new, ‘typical’ field virus would be above 80%.

5 Why do different scanners disagree?

We frequently see samples submitted to our research facility with the question: ‘Why do two different scanners report the same sample differently, or one reports it while the other does not?’ There is a plethora of reasons. Let us try to analyse them:

  1. Misses. This is obviously the case when one scanner is behind in detection (to compare the detection rates properly, of course, both scanners should have been updated at the same time).
  2. False alarms. This is when one scanner has a poor design or the database has not undergone sufficient testing.
  3. Ghost positives. If a virus was in the file and has been subsequently removed, one scanner may consider the file clean while another can still trigger on some unremoved fragments.
  4. Virus name discrepancies. This is when two different vendors use different classification schemes or naming standards (some vendors never change the names of viruses they detect). To resolve most of these discrepancies go to Project Vgrep at
  5. Classification problems. This is when, say, one producer calls a Trojan what another producer classifies as a joke and refuses to detect because it is harmless.
  6. Technological differences.
    1. For example, a viral macro has a wrong name and cannot replicate: one engine can ignore the macro name (and so detect the crippled virus) while another can check the name (and rightfully decide that this macro is not dangerous).
    2. For example, the code for a PE-infector is found at the entry point of a DOS program (this cannot run, of course).
    3. When a file has appended the virus body with no control transferred to that body (so the file runs and never infects but has some ‘baggage’ at the end).
  7. Different capabilities. That is when two engines can be equally good in detecting a particular virus, but one can have better handling of some obscure or unusual file formats. For example, not all engines may be able to unpack MS-Compress file format, or UPX-packed PE executables, or something embedded into .PPT files. In this case, the on-demand scanner that is lacking such functionality would miss a virus inside a non-trivial container.

6 What is best for the user?

This question is really rhetorical, as the major selling point for a user is not the design of an AV engine, but its ability to detect and clean viruses. If two products detect and clean viruses equally well, then the better design would mainly affect the speed of operation and the probability of false alarms. The better design could also be seen by the platform coverage – if the same AV engine can be used on a PC, Dec-Alpha, Macintosh and AS400 that means something about the portability and stability of the engine.

However, let us imagine for a while that all scanners have equally good detection rates (there is a grain of truth in this assumption: according to the tests carried out by the Virus Test Centre at Hamburg University, there are more and more scanners in their ‘very good’ category [4]). In other words, the detection rates of most scanners are high and good enough for practical purposes. The tests mentioned, however, are based on scanning a collection of known viruses, so they do not tell us much about how well products detect and protect from new threats. The natural assumption, of course, would be that the best overall detection is achieved by scanners with the best combination of heuristic, generic and specific detection methods. This could be tested by running slightly outdated scanners over contemporary virus collections; however, no such test results are available.

Of course, on top of decent detection rates (of known viruses), users would want the best proactive protection (heuristic detection), plus the automatic cleaning of new (and known) viruses. They would like precise variant reporting so that in case of an outbreak they could check with their vendor if they need to expect anything nasty (like when the virus corrupts data or sends any data out). However, for viruses that do not do nasty things the necessity of knowing the exact variant suffix is questionable.

The total number of viruses known at the moment exceeds 54,000 and this number grows at a rate of over 100 a week. It is not realistic to expect AV producers to research all of them and document their findings; only those viruses which are found in the field, or are unusual in some way, get analysed and described. Also, different producers may assign different names to virus variants (the most common being the CARO naming standard and the macro virus classification maintained by the ‘Vmacro’ group). This complicates precise identification of a virus at a user site and, in many cases, may result in an AV vendor requesting a sample of the virus just to be sure that there is no mix-up with the variant naming.

For different users, scanner requirements can be very different. Some people actively surf the Internet, downloading and running various files; others may only use a PC for word processing. However, for any category, provided the AV database is kept updated, the ability to detect and clean new viruses is the most important, otherwise the scanner would quite simply fail in its main goal – to protect from malware. The argument can be about where to have a scanner – on a desktop, on a mail gateway, or on an Exchange server. All in all, though, the scanner always has to be effective at finding known and new viruses and cleaning them. It also has to be reasonably quick. Speed is particularly important for on-access scanning: with on-demand or gateway scanning a minor delay is acceptable, whereas for on-access scanning a 10-second delay, for example, is too long.

7 New kinds of malware

Modern malware is written mostly in HLL or script languages. Old detection methods are not very suitable for these kinds of files – neither HLL viruses nor BAT infectors were common in the past. A common problem encountered with modern macro viruses and worms written in script languages is that their source is readily available for modification. These viruses carry their own source so we see many more variants produced accidentally (when people modify a virus using the Visual Basic Editor) or deliberately (when people want to produce a new virus, perhaps to see how AV software would react to the modification).

As an example, just after the original VBS/LoveLetter outbreak we saw about 40 rewrites of the original worm within a month. Since we employed generic detection of these scripts, we only had to update our VBS/LoveLetter detection three or four times to accommodate all the new variants (some of which were apparently written in an attempt to defeat it!).

From a scientific point of view, any modification may be a new virus because what travels would be a different object. However, the ability of scanners to detect and clean on a ‘family basis’, ignoring unimportant changes, is becoming extremely important these days.

With backdoor and password-stealing Trojans becoming more common, and developing rapidly over the last couple of years [1], the necessity to develop heuristic and generic approaches against these threats also increases. This is, however, a complex task: developing good generic detection requires much manual analysis, while automated analysis for the purpose of heuristics is far from simple and very time consuming for programs written in HLLs. The first steps down that route, however, have been taken, and a lot of malware is caught using generic and heuristic drivers these days.

8 Scanner speed

To say scanner speed is important is an understatement. Scanner speed is probably as important as the ability to detect viruses. Slow scanners simply cannot be used effectively – they do not reduce the cost of computer ownership (and that is the goal of any service software) and should not be used. Just imagine the problem of a network administrator who is to schedule a daily server scan that cannot complete within 24 hours because the scanner is not able to scan the server quickly enough. Which AV engine peculiarities are most important to achieve good performance? I believe there are four most important points: database design, file access, utilisation of the emulator and computing checksums.

The virus detection database should be designed properly. The scanner should be smart enough to have the virus definitions sorted somehow, so that only a few are applied to any particular file. The first, and probably the most important, step is to ensure that virus definitions are applied to the right type of files: it does not make sense to look for a virus infecting only COM files in EXE files. Likewise, a scanner that scans memory for the WM/Npad virus is an example of poor design, as macro viruses do not infect memory. Even within the right file type, database records should be organised correctly (e.g. in a binary tree, or with hashes and quick lookup tables).
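
A sketch of keying definitions by file type, with fictitious signatures and a deliberately simplified type check:

```python
# Only the definitions relevant to a file's type are ever applied, so a
# COM-only virus is never looked for in a PE file, and vice versa.
DEFS_BY_TYPE = {
    "COM": [b"\xe9\x00\x01"],    # fictitious COM-infector signatures
    "PE":  [b"\x60\xe8\x00"],    # fictitious PE-infector signatures
}

def file_type(data):
    # Real engines also parse the MZ/NE/PE headers; this is simplified.
    return "PE" if data[:2] == b"MZ" else "COM"

def relevant_definitions(data):
    """Return only the signatures that make sense for this file type."""
    return DEFS_BY_TYPE.get(file_type(data), [])
```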

The second important factor in achieving a high data processing rate is to read as little from the file as possible before starting to analyse it seriously. Optimally, a scanner would read just the first cluster of the file (you cannot really read less without accessing the disk hardware directly, which is virtually impossible under modern OSes) and determine whether the file is clean; that would be exceptionally quick. It is not usually possible to decide whether a file is clean after analysing just the first cluster, but certainly the less disk I/O performed during the analysis, the better. The reason is obvious – disk operations are relatively slow and amount to approximately 50% of the total time spent performing a scan.

The third, but certainly not the least important, component that causes ‘slowdown’ is the emulator. Emulation is used to decrypt polymorphic and encrypted viruses in order to be able to detect some constant parts of the virus body. As some viruses require very long emulation it is important to ensure that clean, legitimate, common programs are not emulated for long. Emulation should therefore be avoided when it is not essential. So the AV database should consist of rules that describe differences between viruses and clean programs. These rules (let us call them ‘eliminations’) should be applied as soon as possible and preferably before the emulation is started to prevent as many legitimate programs from being emulated as possible. However, it sometimes happens that an innocent file passes all the eliminations and enters a long emulation loop – simply because it happened to be similar enough to some other virus and there was no rule defined to distinguish this file from a virus early enough.
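
A sketch of this control flow, with two invented elimination rules standing in for the real ones:

```python
# Eliminations: quick predicates proving a virus cannot be present, applied
# before the emulator so that most legitimate files never enter it.
def eliminated(data):
    rules = [
        lambda d: len(d) < 1024,           # too small to hold the virus body
        lambda d: b"\xcd\x21" in d[:64],   # early INT 21h: typical clean code
    ]
    return any(rule(data) for rule in rules)

def emulate(data):
    return "emulated"                      # stand-in for the real emulator

def scan(data):
    if eliminated(data):
        return "clean"                     # the cheap, common path
    return emulate(data)                   # the expensive path, reached rarely
```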

The fourth slow component is computing checksums. Any checksum calculation is a loop that runs over the whole area being checksummed: the longer the area, the longer the computation. And you cannot optimise that away – calculating a checksum over 10,000 bytes takes twice as long as over 5,000 bytes.

When creating an AV database record it is important to write a virus definition in a way that would minimise the disk accesses and find good rules to eliminate virus definitions that may require a lot of emulation and checksumming. Finding good elimination rules is a complex and time-consuming operation and it requires deep analysis of malware.

Also, when a scanner is analysing the file, it should make sure that before it starts checksumming it performs some kind of elimination. For example, if a module of a macro virus has a specific CRC, the engine should first check if the module size and/or name is what should be expected for this virus. Otherwise lots of macros would be unnecessarily checksummed, thereby reducing the scan speed and increasing the potential for false alarms.

9 Combining heuristic and virus-specific approaches

As we just saw, good elimination is important but achieving this is sometimes tricky. However, the brute-force method can be used here too – heuristic elimination. Any heuristic analysis produces a lot of information about the file in question. If the scanner runs heuristics over all files anyway, it would certainly make sense to store all this data (which heuristic rules were satisfied, how many times, where and when etc) and match it to corresponding data collected over known viruses. If there is a mismatch, the scanner can be certain that some viruses cannot be present in the analysed file. That would be a good heuristic elimination.
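
A sketch of this matching, with invented rule names and a single made-up virus profile:

```python
# Heuristic elimination: the heuristic pass yields a profile of the file
# (which rules fired); a known virus remains possible only if every rule in
# its recorded profile also fired for this file.
KNOWN_PROFILES = {
    "Example.Poly.A": {"writes_self", "no_early_int21"},
}

def possible_viruses(file_profile):
    """Return the known viruses not ruled out by the heuristic profile."""
    return [name for name, required in KNOWN_PROFILES.items()
            if required <= file_profile]   # subset test: all rules must fire
```

The profile data comes for free if heuristics run over every file anyway, which is what makes this elimination cheap.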

One easy way to speed up a scanner is to turn on all heuristic capabilities and exclude all specific database records that are covered by the heuristics. That may well increase the speed (if heuristic analysis is quick enough, of course). However, people prefer to know which particular virus they have detected, so for a scanner it would be advantageous if, after finding a virus heuristically, it could switch to a specific detection and report the name. In essence, this approach is very similar to specific detection using the heuristic elimination.

Combining a heuristic approach with specific detection methods also has the advantage of reducing false alarm rates. For example, if a script file invokes Outlook and goes through its address book, it is suspicious. However, some legitimate programs may do that too. If, however, such a script also has a string ‘barok’ (which is a nickname of the VBS/LoveLetter author), then that is extremely suspicious.
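
The paragraph’s example can be sketched as follows (the substring tests are a crude stand-in for real script analysis):

```python
# Combine a generic heuristic (script walks the Outlook address book) with a
# specific marker ('barok', from the VBS/LoveLetter source) to cut false alarms.
def classify_script(source):
    generic = "Outlook" in source and "AddressBook" in source
    specific = "barok" in source
    if generic and specific:
        return "VBS/LoveLetter variant"    # both signs: very high confidence
    if generic:
        return "suspicious"                # generic alone: flag, do not name
    return "clean"
```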


10 Conclusion

Clever engine design, careful programming of an AV database, and the combination of generic and heuristic detection methods can achieve stunning results, detecting and cleaning the majority (up to 80–90%) of new field viruses. A few years ago, heuristic detection rates for DOS viruses were at that level! These days, when macro viruses and script worms are the most common items of malware, the achievable cleaning rates can be much higher because such viruses/worms are far simpler.


References

  1. Dr I. Muttik, ‘Trojans – The New Threat’, Proceedings of IVPC’98 International Conference: ‘Protecting the Workplace of the Future’, 28–29 April 1998, Orlando, Florida.
  2. D. Gryaznov and P. Nolan, personal communications.
  4. ‘Scanner test April 2000’, Hamburg University Virus Test Center Report, April 2000.