Creating a custom known file type for R-Studio. Determining a file type by signature. What is a file signature?

The term "magic number" has three meanings in programming:

  • Data signature
  • Distinctive unique values that must not coincide with any other values (e.g., UUIDs)
  • Bad programming practice.

Data signature

A magic number, or signature, is an integer or text constant used to uniquely identify a resource or a piece of data. Such a number has no meaning of its own and can cause confusion if it appears in program code without appropriate context or a comment, while changing it to another value, even a close one, can have completely unpredictable consequences. For this reason such numbers were ironically nicknamed magic numbers, and the name has since become an established term. For example, every compiled Java class file starts with the hexadecimal magic number 0xCAFEBABE. Another widely known example: every Microsoft Windows executable with the .exe extension begins with the byte sequence 0x4D 0x5A (the ASCII characters MZ, the initials of Mark Zbikowski, one of the creators of MS-DOS). A lesser-known example is an uninitialized local pointer in a Microsoft Visual C++ debug build, which holds the fill value 0xCCCCCCCC.

In UNIX-like operating systems, the file type is usually determined by the file's signature rather than by its name extension. These systems provide the standard file utility for interpreting file signatures.
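The idea of looking at leading bytes instead of the extension can be sketched in a few lines of Python. This is only an illustration: the table `MAGIC_NUMBERS` and the function names are made up for this example, and the real file utility knows far more signatures and rules than the four listed here.

```python
# Illustrative table only; names and contents are assumptions for this sketch.
MAGIC_NUMBERS = [
    (b"\xCA\xFE\xBA\xBE", "Java class file"),
    (b"MZ", "DOS/Windows executable"),
    (b"\x89PNG\r\n\x1a\n", "PNG image"),
    (b"\xFF\xD8", "JPEG image"),
]


def identify(data: bytes) -> str:
    """Return a description based on the leading bytes only."""
    for magic, description in MAGIC_NUMBERS:
        if data.startswith(magic):
            return description
    return "unknown"


def identify_file(path: str) -> str:
    """Read the first bytes of a file and identify it, ignoring its name."""
    with open(path, "rb") as f:
        return identify(f.read(16))
```

Calling `identify_file("photo.exe")` on a renamed JPEG would still report a JPEG image, because the name never enters the decision.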

Bad programming practice

The term “magic number” also refers to a bad programming practice in which a numeric value appears in source code with no indication of what it means. For example, a Java snippet like this would be bad:

drawSprite(53, 320, 240);

A version with named constants is easier to read:

final int SCREEN_WIDTH = 640;
final int SCREEN_HEIGHT = 480;
final int SCREEN_X_CENTER = SCREEN_WIDTH / 2;
final int SCREEN_Y_CENTER = SCREEN_HEIGHT / 2;
final int SPRITE_CROSSHAIR = 53;
...
drawSprite(SPRITE_CROSSHAIR, SCREEN_X_CENTER, SCREEN_Y_CENTER);

Now it is clear: this line draws a sprite, the crosshair of a gunsight, at the center of the screen. In most programming languages, the values of such constants are computed at compile time and substituted wherever they are used, so this change to the source does not degrade the program's performance.

In addition, magic numbers are a potential source of errors in a program:

  • If the same magic number is used more than once in a program (or could potentially be used), then changing it will require edits to each occurrence (instead of just one edit to the value of the named constant). If not all occurrences are corrected, at least one error will occur.
  • In at least one of the occurrences, the magic number may be misspelled initially, and this is quite difficult to detect.
  • The magic number may depend on an implicit parameter or another magic number. If these dependencies, not explicitly identified, are not satisfied, at least one error will occur.
  • When modifying occurrences of one magic number, it is possible to mistakenly change another magic number that is independent but has the same numerical value.

Magic numbers and cross-platform code

Magic numbers sometimes harm cross-platform code. In C, on common 32- and 64-bit platforms the sizes of char, short and long long are effectively fixed, while the sizes of int, long, size_t and ptrdiff_t can vary (the first two at the discretion of the compiler developers, the last two depending on the word size of the target system). Old or poorly written code may contain magic numbers that stand for the size of a type; when the code is moved to a machine with a different word size, they can lead to subtle errors.

For example:

const size_t NUMBER_OF_ELEMENTS = 10;
long a[NUMBER_OF_ELEMENTS];

memset(a, 0, 10 * 4);                             // incorrect: long is assumed to be 4 bytes, and a magic element count is used
memset(a, 0, NUMBER_OF_ELEMENTS * 4);             // incorrect: long is assumed to be 4 bytes
memset(a, 0, NUMBER_OF_ELEMENTS * sizeof(long));  // not entirely correct: the type name is duplicated (if the type changes, this line must change too)
memset(a, 0, NUMBER_OF_ELEMENTS * sizeof(a[0]));  // correct; optimal for dynamic arrays of non-zero size
memset(a, 0, sizeof(a));                          // correct; optimal for static arrays

Numbers that are not magic

Not all numbers need to be converted to constants. For example, the code for

Searching by scanning for known file types (often called searching for files by signature) is one of the most effective methods used in the R-Studio data recovery utility. Using a given signature makes it possible to restore files of a certain type when information about the directory structure and file names is partially or completely missing or damaged.

Typically, the disk partition table is used to determine the location of files. If you compare a disk with a book, the partition table is like its table of contents. When scanning, R-Studio searches the disk for known file types using specified signatures. This is possible because virtually every file type has a unique signature or data pattern. File signatures are found at a specific location at the beginning of the file and, in many cases, also at the end of the file. When scanning, R-Studio matches the data it finds against the signatures of known file types, which allows the files to be identified and their data recovered.

By scanning for known file types, R-Studio can recover data from disks that have been reformatted and whose partition tables have been overwritten. Moreover, if a disk partition has been overwritten, damaged or deleted, scanning for known file types is the only option.

But almost everything has its drawbacks, and scanning for known file types in R-Studio is no exception. This kind of scan can recover only unfragmented files but, as already mentioned, in most cases it is the method of last resort.

R-Studio already includes signatures for the most common file types (the full list of known file types can be found in the R-Studio online Help).

If necessary, the user can add new file types to R-Studio. For example, if you need to find files of a unique type, or of a type developed after the latest release of R-Studio, you can add your own signatures to the known file types. This process is discussed next.

Custom Files of Known Types
Custom signatures for known file types are stored in an XML file specified in the Settings dialog box. Adding a signature consists of two parts:

  1. Determining the file signature located at the beginning of the file and, if present, at the end of the file.
  2. Creating an XML file containing the file signature and other information about the file type.

All of this can be done with R-Studio. You do not need to be an expert in writing or editing XML documents or in hexadecimal editing: this guide is aimed at entry-level users, and every stage of the process is covered in detail.

Example: Adding a signature for an MP4 file (XDCam-EX Codec)
Let's look at adding a file signature using the example of an .MP4 file created using Sony XDCAM-EX. You can use it, for example, in case of damage to the SD card for files that you have not yet managed to save on your computer’s hard drive.

First Stage: Determining File Signature
To determine the file signature, consider examples of files of the same format.

Let these be four video files from Sony XDCAM-EX:
ZRV-3364_01.MP4
ZRV-3365_01.MP4
ZRV-3366_01.MP4
ZRV-3367_01.MP4

For ease of consideration, let these be small files. Larger files are more difficult to view in hexadecimal.

1. Open the files in R-Studio. To do this, right-click each file and select View/Edit from the context menu.

2. Compare the files. We are looking for a pattern that occurs in all four files; this will be the file signature. File signatures are typically found at the beginning of the file, and sometimes at the end as well.

3. Determine the file signature at the beginning of the file. In our example it is located at the very beginning of the file. Note that this is not always the case: the signature is often near the beginning of the file but not at the very first offset.

From the images below, it appears that the contents of all four files are different, but they all start with the same file signature.



The highlighted area in the images is the file signature for files of this type. It is shown in both text and hexadecimal form.

In text form, the file signature looks like this:
....ftypmp42....mp42........free

Dots (“.”) indicate characters that cannot be represented in text form. Therefore, it is also necessary to provide the hexadecimal form of the file signature:
00 00 00 18 66 74 79 70 6D 70 34 32 00 00 00 00 6D 70 34 32 00 00 00 00 00 00 00 08 66 72 65 65

4. In the same way, determine the file signature at the very end of the file. It may be a different signature with a different length.

The images below highlight the file signature at the end of the file:



Please note that the data before the selected area (the file signature) is the same in all four files. This is technical information that is not part of the file signature; it only shows that all four files were recorded with the same camera using the same settings. It is usually possible to tell such matching technical information apart from a file signature. In our example, the last line before the start of the file signature contains the text ‘RecordingMode type="normal"’, which clearly indicates a file parameter rather than a signature. Always pay special attention to this line so that you do not mistakenly include technical information in the file signature.

In our case, the file signature is the following text:
</NonRealTimeMeta>...
Recall that dots indicate characters that cannot be represented in text form.

In hexadecimal, the file signature looks like this:
3C 2F 4E 6F 6E 52 65 61 6C 54 69 6D 65 4D 65 74 61 3E 0D 0A 00
Please note: the signature will not always be at the end of the file.
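The comparison done by eye in the hex viewer can also be approximated programmatically. The following is a minimal sketch, not something R-Studio provides: the function names are hypothetical, and the result still needs manual review, because bytes that merely happen to match in all samples (such as the shared camera settings mentioned above) will be included in the suffix.

```python
def common_prefix(samples: list[bytes]) -> bytes:
    """Longest run of leading bytes shared by every sample."""
    if not samples:
        return b""
    shortest = min(len(s) for s in samples)
    first = samples[0]
    n = 0
    while n < shortest and all(s[n] == first[n] for s in samples):
        n += 1
    return first[:n]


def common_suffix(samples: list[bytes]) -> bytes:
    """Longest run of trailing bytes shared by every sample."""
    reversed_samples = [s[::-1] for s in samples]
    return common_prefix(reversed_samples)[::-1]


def read_samples(paths: list[str]) -> list[bytes]:
    """Load sample files fully; fine for the small files suggested above."""
    samples = []
    for path in paths:
        with open(path, "rb") as f:
            samples.append(f.read())
    return samples
```

For the four example files, `common_prefix(read_samples([...])).hex(" ")` would print the candidate beginning signature in the same hexadecimal form used in this guide.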

Second Stage: Creating an XML file describing a Known File Type
Now, having defined the file signature, you can create an XML file and include the corresponding file type in R-Studio. This can be done in two ways:

2.1 Using the built-in graphical file signature editor:
Select Settings from the Tools menu. In the Settings dialog box that opens, click the Known File Types tab and then click the Edit User’s File Types button.


Click the Create File Type button in the Edit User's File Types dialog box.
Set the following options:

  • Id - a unique numeric identifier. The number can be chosen arbitrarily; the only requirement is that it must not match the identifier of any other file type.
  • Group Description - the group in which the found files will be placed in R-Studio. You can either create a new group or choose an existing one. In our case it is the group “Multimedia Video (Multimedia: Video)”.
  • Description - a short description of the file type. In our example you could use "Sony cam video, XDCam-EX".
  • Extension - extension of files of this type. In our case - mp4.

The Features parameter is optional, in our case we do not need to use it.


Next, you need to enter the beginning and ending file signatures. To do this, select Begin and then choose the Add Signature command from the context menu.


Then double-click the <empty signature> field and enter the appropriate text.


Then create the ending file signature. Be sure to enter 21 (the length of the ending signature in bytes) in the From column.


You have successfully created your own signature for known file types.

Now you need to save it. There are two ways: click the Save button to save it to the default file specified on the Main tab of the Settings dialog box, or click the Save As... button to save the signature to a different file.

2.2 Manually creating an XML file describing a Known File Type:
To create this file we will use XML version 1.0 and UTF-8 encoding. Don't worry if you don't know what these are: just open any text editor (for example, Notepad.exe) and enter the following text on the first line:
<?xml version="1.0" encoding="utf-8"?>

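Next, we will create an XML tag that defines the file type (FileType). Using the attributes described earlier, the tag will look like this:
<FileType id="..." group="Multimedia Video" description="Sony cam video, XDCam-EX" features="" extension="mp4">

Let's insert it right after <FileTypeList>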

Next, we define the file signature (the Signature tag). The beginning signature (at the start of the file) goes inside a Signature tag without any attributes. We use the text form of the signature, but replace the characters that cannot be represented as text with their hexadecimal values, prefixing each hexadecimal value with "\x". The tag with the file signature will therefore look like this:
<Signature>\x00\x00\x00\x18ftypmp42\x00\x00\x00\x00mp42\x00\x00\x00\x00\x00\x00\x00\x08free</Signature>

If present, the ending signature (at the end of the file) must also be defined. This uses the same tag, but with a "from" attribute set to "end". It will look like this:
<Signature from="end" offset="21">

Remember that the text part of the ending signature contains a slash and angle brackets. To avoid confusion and XML syntax errors, we will replace the characters "/", "<" and ">" in the signature with their hexadecimal values.

At the end, after the file signatures, come the closing FileType and FileTypeList tags:
</FileType>
</FileTypeList>

So the entire file should look like this:


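<?xml version="1.0" encoding="utf-8"?>
<FileTypeList>
<FileType id="..." group="Multimedia Video" description="Sony cam video, XDCam-EX" features="" extension="mp4">
<Signature>\x00\x00\x00\x18ftypmp42\x00\x00\x00\x00mp42\x00\x00\x00\x00\x00\x00\x00\x08free</Signature>
<Signature from="end" offset="21">\x3C\x2FNonRealTimeMeta\x3E\x0D\x0A\x00</Signature>
</FileType>
</FileTypeList>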

Remember: XML syntax is case sensitive, so the correct tag is <FileType>, not <filetype>.

Let's save the file in text format with the extension .xml. For example: SonyCam.xml.
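Typing the "\x" escapes by hand is error-prone, so the conversion from raw bytes to the escaped text form can be scripted. This is a sketch only: `escape_signature` is a hypothetical helper, and the exact set of characters it escapes in addition to non-printable bytes (angle brackets, slash, ampersand, double quote, backslash) is an assumption chosen to be safe inside XML, not a documented R-Studio requirement.

```python
def escape_signature(raw: bytes, also_escape: bytes = b'<>/&"\\') -> str:
    """Render bytes as text, writing unsafe bytes as \\xNN escapes."""
    parts = []
    for byte in raw:
        printable = 0x20 <= byte < 0x7F
        if printable and byte not in also_escape:
            parts.append(chr(byte))
        else:
            parts.append(f"\\x{byte:02X}")
    return "".join(parts)


# The two signatures from this example, given as hexadecimal strings.
BEGIN_HEX = ("00 00 00 18 66 74 79 70 6D 70 34 32 00 00 00 00 "
             "6D 70 34 32 00 00 00 00 00 00 00 08 66 72 65 65")
END_HEX = "3C 2F 4E 6F 6E 52 65 61 6C 54 69 6D 65 4D 65 74 61 3E 0D 0A 00"

if __name__ == "__main__":
    print(escape_signature(bytes.fromhex(BEGIN_HEX)))
    print(escape_signature(bytes.fromhex(END_HEX)))
```

`bytes.fromhex` ignores the spaces between byte pairs, so the hexadecimal strings can be pasted exactly as they appear in the hex viewer.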

We have successfully created our own signature for known file types. This example is quite sufficient to understand the basic principles of creating a custom file type. More experienced users can use XML version 2.0. You can read more about this in the R-Studio online Help section.

Third Stage: Checking and Adding a File Describing a Known File Type
The next step is to add (upload) your XML file to R-Studio. In this case, it will be automatically checked.

1. Load the XML file created at the previous stage into R-Studio. To do this, select Settings from the Tools menu. In the User’s file types area of the Main tab of the Settings dialog box, add the XML file we created (SonyCam.xml). Click the Apply button.


2. Answer Yes to the prompt to load the new file type.


3. To verify that the file type was successfully loaded, click on the Known File Types tab of the Settings dialog box. Remember that we added the file type to the Multimedia Video group (Multimedia: Video). Having expanded this group (folder), we should see an element with the description we specified when creating the XML file: Sony cam video, XDCam-EX (.mp4).


If there are any errors in the file syntax, you will see a corresponding message:


In this case, check your XML file again for errors. Remember: XML syntax is case sensitive and every tag must have a closing tag at the end.

Fourth Stage: Testing the File Describing a Known File Type
To check the correctness of the custom file type we created, let's try to find our .mp4 files on a removable USB flash drive.

1. Under Windows Vista or Windows 7, perform a full (not quick) format of the disk, or use a disk wiping utility (for example, R-Wipe & Clean) to completely remove all data on the disk. Let the USB disk be formatted as FAT32 (the files being searched for do not exceed 2 GB).

2. Copy the test files to the disk and reboot the computer so that the contents of the cache are written to the disk. Alternatively, you can disconnect the external drive and then connect it again.

3. In the OS, the drive will be defined as, for example, logical drive F:\.

4. Launch R-Studio, select our drive (F:\) and click the Scan button.


5. In the Scan dialog box, in the File System area, click the Change... button and clear all the checkboxes. This disables searching for file systems and for files using the partition table.

6. Check the Extra Search for Known File Types checkbox. This will allow R-Studio to search for known file types when scanning.

7. To start scanning, click the Scan button.

8. Wait while R-Studio scans the disk. The Scan Information tab shows the scan progress.



9. After scanning is complete, select the Extra Found Files element and double-click on it.



10. Our test files will be located in the Sony cam video, XDCam-EX folder (or in a folder with another name corresponding to the file type description specified in the Second Stage).



As you can see, the file names, dates and locations (folders) were not restored, because this information is stored in the file system. R-Studio therefore automatically gives each file a new name.

However, it is clear that the contents of the files are not damaged. To verify this, let’s open them in the appropriate program, for example VLC media player.



Conclusion
R-Studio's ability to scan for known file types allows you to recover data even from a disk whose file systems have been overwritten. Searching for files by their signatures is quite effective, and it is especially useful when you know exactly which type of file you are restoring, as in our example. The ability to create custom file types lets you add any file that has a specific file signature to the list of known file types.

Many people have heard of rarjpeg files. This is a special kind of file: a JPEG image and a RAR archive concatenated back to back. It makes an excellent container for concealing the fact that information is being transmitted. You can create a rarjpeg with the following commands:

UNIX: cat image1.jpg archive.rar > image2.jpg
WINDOWS: copy /b image1.jpg+archive.rar image2.jpg

Or with a hex editor, if you have one.

Of course, to conceal the fact that information is being transmitted you can use not only JPEG but many other formats as well. Each format has its own characteristics that make it more or less suitable as a container. I will describe how to find merged files in the most popular formats, or at least to establish that merging has taken place.

Methods for detecting merged files can be divided into three groups:

  1. Checking the area after the EOF marker. Many popular file formats have a so-called end-of-file marker that delimits the data to be displayed. For example, photo viewers read all bytes up to this marker and ignore the area after it. This method is ideal for the following formats: JPEG, PNG, GIF, ZIP, RAR, PDF.
  2. Checking the file size. The structure of some formats (audio and video containers) makes it possible to calculate the expected file size and compare it with the actual size. Formats: AVI, WAV, MP4, MOV.
  3. Checking CFB files. CFB, or Compound File Binary Format, is a document format developed by Microsoft; it is a container with its own file system. This method is based on detecting anomalies in the file.

Is there life after the end of a file?

JPEG

To answer this question, we need to dig into the specification of the format that is the “ancestor” of merged files and understand its structure. Every JPEG starts with the signature 0xFF 0xD8.

After this signature come service information, optionally a thumbnail of the image and, finally, the compressed image itself. In this format, the end of the image is marked with the two-byte signature 0xFF 0xD9.

PNG

The first eight bytes of the PNG file are occupied by the following signature: 0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A. End signature that ends the data stream: 0x49, 0x45, 0x4E, 0x44, 0xAE, 0x42, 0x60, 0x82.

RAR

Common signature for all rar archives: 0x52 0x61 0x72 0x21 (Rar!). After it comes information about the archive version and other related data. It was experimentally determined that the archive ends with the signature 0x0A, 0x25, 0x25, 0x45, 0x4F, 0x46.

Table of formats and their signatures:
The algorithm for checking for gluing in these formats is extremely simple:

  1. Find the initial signature;
  2. Find the final signature;
  3. If there is no data after the final signature, your file is clean and does not contain attachments! Otherwise, it is necessary to look for other formats after the final signature.
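The three steps above can be sketched as follows for JPEG and PNG, using only the signatures quoted in the text. The names `END_MARKERS` and `trailing_data` are made up for this example. One simplification to be aware of: the sketch stops at the first end signature, so a JPEG with an embedded thumbnail (which carries its own 0xFF 0xD9) would be reported as having trailing data.

```python
# (start signature, end signature) per format, as quoted in the text.
END_MARKERS = {
    "jpeg": (b"\xFF\xD8", b"\xFF\xD9"),
    "png": (b"\x89PNG\r\n\x1a\n", b"IEND\xAE\x42\x60\x82"),
}


def trailing_data(data: bytes, fmt: str) -> bytes:
    """Return whatever follows the first end signature (empty if clean)."""
    start_sig, end_sig = END_MARKERS[fmt]
    if not data.startswith(start_sig):
        raise ValueError(f"not a {fmt} file: start signature missing")
    end = data.find(end_sig, len(start_sig))
    if end == -1:
        raise ValueError(f"{fmt} end signature not found")
    return data[end + len(end_sig):]
```

If `trailing_data` returns a non-empty value, the returned bytes are what should be searched for the start signatures of other formats.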

GIF and PDF

A PDF document may have more than one EOF marker, for example because the document was generated incorrectly. The number of end signatures in a GIF file equals the number of frames in it. Given these features, the algorithm for detecting attached files can be refined:
  1. Step 1 is the same as in the previous algorithm.
  2. Step 2 is the same as in the previous algorithm.
  3. When you find an end signature, remember its location and keep searching;
  4. If you reach the last EOF marker this way and the file ends there, the file is clean.
  5. If the file does not end with an end signature, go to the location of the last end signature found.
A large difference between the file size and the position just after the last end signature indicates an appended attachment. A threshold of ten bytes can be used for this difference, although other values can be chosen.
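A minimal sketch of this refined check, assuming the PDF end marker is the ASCII text %%EOF and using the ten-byte threshold mentioned above as a default. The function names are hypothetical. The approach is less reliable for GIF, whose end marker is a single byte that can easily occur inside appended data.

```python
def slack_after_last_marker(data: bytes, end_sig: bytes) -> int:
    """Number of bytes between the last end signature and the end of file."""
    pos = data.rfind(end_sig)
    if pos == -1:
        raise ValueError("end signature not found")
    return len(data) - (pos + len(end_sig))


def looks_merged(data: bytes, end_sig: bytes = b"%%EOF", max_slack: int = 10) -> bool:
    """True when more than max_slack bytes follow the last end signature."""
    return slack_after_last_marker(data, end_sig) > max_slack
```

A few bytes of slack are tolerated because generators commonly write a line break after the final marker.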

ZIP

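A peculiarity of ZIP archives is the presence of three different signatures: the local file header (0x50 0x4B 0x03 0x04), the central directory file header (0x50 0x4B 0x01 0x02) and the end of central directory record (0x50 0x4B 0x05 0x06). The structure of the archive is as follows: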
Local File Header 1
File Data 1
Data Descriptor 1
Local File Header 2
File Data 2
Data Descriptor 2
...
Local File Header n
File Data n
Data Descriptor n
Archive decryption header
Archive extra data record
Central directory
The most interesting part is the central directory, which contains metadata about the files in the archive. The central directory always starts with the signature 0x50 0x4b 0x01 0x02 and ends with the signature 0x50 0x4b 0x05 0x06, followed by 18 bytes of metadata. Interestingly, an empty archive consists only of the end signature and 18 zero bytes. After these 18 bytes comes the archive comment area, which is an ideal container for hiding a file.

To check a ZIP archive, find the end signature of the central directory, skip 18 bytes and look for signatures of known formats in the comment area. A large comment is also a sign of merging.
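The ZIP check can be sketched like this. The helper name `zip_comment_area` is hypothetical. The sketch assumes the first occurrence of the end signature belongs to the outer archive, which can be wrong if the same four bytes happen to occur inside stored file data. The last two of the 18 metadata bytes hold the declared comment length (little-endian), which is returned alongside the bytes that actually follow.

```python
import struct

EOCD_SIG = b"PK\x05\x06"  # 0x50 0x4B 0x05 0x06


def zip_comment_area(data: bytes) -> tuple[int, bytes]:
    """Return (declared comment length, all bytes after the 18 metadata bytes)."""
    pos = data.find(EOCD_SIG)
    if pos == -1:
        raise ValueError("end of central directory signature not found")
    metadata = data[pos + 4 : pos + 22]
    if len(metadata) < 18:
        raise ValueError("truncated end of central directory record")
    (declared_len,) = struct.unpack("<H", metadata[16:18])
    return declared_len, data[pos + 22 :]
```

A tail that is longer than the declared comment length, or a tail that begins with the signature of another format, is worth a closer look.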

Size matters

AVI

The structure of an AVI file is as follows: each file begins with the RIFF signature (0x52 0x49 0x46 0x46). At byte 8 there is the AVI signature that specifies the format (0x41 0x56 0x49 0x20). The 4-byte field at offset 4 contains the size of the data block (little-endian byte order). To find the offset of the block containing the next size, add the header size (8 bytes) to the size read from bytes 4-8. This is how the total file size is calculated. The calculated size may be smaller than the actual file size; in that case the rest of the file after the calculated size must contain only zero bytes (padding needed to align the file to a 1 KB boundary).

Example of size calculation:
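As a rough illustration, the calculation can be written in Python as below. The function names are made up for this sketch, and it handles only a file consisting of a single RIFF block: the 8-byte header plus the little-endian size stored at offset 4.

```python
import struct


def riff_declared_size(data: bytes) -> int:
    """8-byte RIFF header plus the little-endian size stored at offset 4."""
    if len(data) < 12 or data[:4] != b"RIFF":
        raise ValueError("RIFF signature not found")
    (block_size,) = struct.unpack_from("<I", data, 4)
    return 8 + block_size


def has_appended_data(data: bytes, form_type: bytes, allow_zero_padding: bool) -> bool:
    """Compare the declared size with the real one for a given RIFF form type."""
    if data[8:12] != form_type:
        raise ValueError("unexpected format signature at byte 8")
    extra = data[riff_declared_size(data):]
    if allow_zero_padding:
        # Zero bytes used for alignment are tolerated; anything else is suspicious.
        return any(extra)
    return len(extra) > 0
```

For an AVI file the call would be `has_appended_data(data, b"AVI ", allow_zero_padding=True)`, which tolerates alignment zeros but flags any non-zero byte after the calculated size.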


WAV

Like AVI, a WAV file begins with the RIFF signature, but the signature at byte 8 is WAVE (0x57 0x41 0x56 0x45). The file size is calculated in the same way as for AVI. The actual size must match the calculated size exactly.

MP4

MP4 or MPEG-4 is a media container format used to store video and audio streams, also providing for the storage of subtitles and images.
At offset 4 there are two signatures: the file type ftyp (66 74 79 70) (QuickTime Container File Type) and the file subtype mmp4 (6D 6D 70 34). For detecting hidden files, what interests us is the ability to calculate the file size.

Let's look at an example. The size of the first block is at offset zero, and it is 28 (00 00 00 1C, Big Endian byte order); it also indicates the offset where the size of the second data block is located. At offset 28 we find the next block size equal to 8 (00 00 00 08). To find the next block size, you need to add the sizes of the previous blocks found. Thus, the file size is calculated:
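A sketch of this block-by-block walk is shown below. The function name is hypothetical, and the handling of the special size values 1 (a 64-bit size follows the type) and 0 (the block runs to the end of the file) is based on the ISO base media file format rather than on the text above. The walk stops at the first thing that does not look like a well-formed block, so the returned value is the end of the last valid block.

```python
import struct


def mp4_declared_size(data: bytes) -> int:
    """Sum the sizes of consecutive top-level blocks, starting at offset 0."""
    offset = 0
    while offset + 8 <= len(data):
        size, block_type = struct.unpack_from(">I4s", data, offset)
        if size == 0:
            # Size 0: the block extends to the end of the file.
            return len(data)
        if size == 1:
            # Size 1: the real size is a 64-bit value after the type field.
            if offset + 16 > len(data):
                break
            (size,) = struct.unpack_from(">Q", data, offset + 8)
        type_is_text = all(0x20 <= b <= 0x7E for b in block_type)
        if size < 8 or not type_is_text or offset + size > len(data):
            break  # not a well-formed block: stop here
        offset += size
    return offset
```

Comparing the result with `len(data)` gives the number of bytes that do not belong to any block.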

MOV

This widely used format is also an MPEG-4 container. MOV uses a proprietary data compression algorithm, has a structure similar to MP4 and is used for the same purposes - for storing audio and video data, as well as related materials.
Like MP4, every MOV file has the 4-byte ftyp signature at offset 4, but the next signature has the value qt__ (71 74 20 20). The rule for calculating the file size is unchanged: starting from the beginning of the file, we read the size of each successive block and add the sizes up.

The way to check this group of formats for appended files is to calculate the size according to the rules given above and compare it with the size of the file being checked. If the calculated size is much smaller than the actual file size, this indicates merging. When checking AVI files, the calculated size is allowed to be smaller than the file size because of zero bytes added for alignment; in that case, check that only zeros follow the calculated size.

Checking Compound File Binary Format

This file format, developed by Microsoft, is also known as OLE (Object Linking and Embedding) or COM (Component Object Model). DOC, XLS, PPT files belong to the group of CFB formats.

A CFB file consists of a 512-byte header and sectors of equal length that store data streams or service information. Each sector has its own non-negative number, with the exception of two special numbers: “-1” marks a free sector and “-2” marks the sector that ends a chain. All sector chains are defined in the FAT table.

Let's assume that an attacker has modified a certain .doc file and appended another file to the end of it. There are several ways to detect this or to point out an anomaly in the document.

Abnormal file size

As mentioned above, any CFB file consists of a header and sectors of equal length. To find the sector size, read the two-byte number at offset 30 from the beginning of the file and raise 2 to the power of this number. This number must be either 9 (0x0009) or 12 (0x000C), giving a sector size of 512 or 4096 bytes respectively. Once the sector size is known, check the following equality:

(FileSize - 512) mod SectorSize = 0

If this equality does not hold, the files have evidently been merged. However, this method has a significant drawback: an attacker who knows the sector size only needs to append his file plus n more bytes so that the size of the appended data is a multiple of the sector size.
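A minimal sketch of the size check follows. Two details in it do not come from the text above and should be treated as assumptions taken from Microsoft's published description of the format: the 8-byte signature at the start of the header, and the rule that with 4096-byte sectors the 512-byte header is padded out to a full sector. For 512-byte sectors the test is identical to the equality given above. The function names are hypothetical.

```python
import struct

# Assumption: the standard CFB header signature (not mentioned in the text).
CFB_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")


def cfb_sector_size(data: bytes) -> int:
    """Read the sector shift at offset 30 and return 2 ** shift."""
    if len(data) < 512 or data[:8] != CFB_MAGIC:
        raise ValueError("not a CFB file")
    (shift,) = struct.unpack_from("<H", data, 30)
    if shift not in (9, 12):
        raise ValueError(f"unexpected sector shift {shift}")
    return 1 << shift


def size_is_consistent(data: bytes) -> bool:
    """Check that the data after the header area is a whole number of sectors."""
    sector_size = cfb_sector_size(data)
    header_area = max(512, sector_size)  # assumption for 4096-byte sectors
    return (len(data) - header_area) % sector_size == 0
```

As the text notes, a passing check proves little on its own, since padding the appended data to a sector boundary defeats it.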

Unknown sector type

If the attacker knows how to bypass the previous check, this method can still detect sectors of undefined type.

Let's define equality:

FileSize = 512 + CountReal * SectorSize, where FileSize is the file size, SectorSize is the sector size, CountReal is the number of sectors.

We also define the following variables:

  1. CountFAT – number of FAT sectors. Located at offset 44 from the beginning of the file (4 bytes);
  2. CountMiniFAT – number of MiniFAT sectors. Located at offset 64 from the beginning of the file (4 bytes);
  3. CountDIFAT – number of DIFAT sectors. Located at offset 72 from the beginning of the file (4 bytes);
  4. CountDE – number of Directory Entry sectors. To find this variable, locate the first DE sector, whose number is stored at offset 48, then follow the complete DE chain in the FAT and count its sectors;
  5. CountStreams – number of sectors with data streams;
  6. CountFree – number of free sectors;
  7. CountClassified – number of sectors with a defined type;
CountClassified = CountFAT + CountMiniFAT + CountDIFAT + CountDE + CountStreams + CountFree

Obviously, if CountClassified and CountReal are unequal, we can conclude that files may be merged.
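The header fields listed above can be read with a few `struct` calls. This sketch covers only the values stored directly in the header; CountDE, CountStreams and CountFree require walking the FAT chains and are passed in as arguments here. The function names are hypothetical, and the padding of the header to a full sector for 4096-byte sectors is an assumption from Microsoft's format description, not from the text (for 512-byte sectors CountReal matches the equality given above).

```python
import struct


def cfb_header_counts(data: bytes) -> dict:
    """Read the sector counts stored in the CFB header (little-endian)."""
    (shift,) = struct.unpack_from("<H", data, 30)
    sector_size = 1 << shift
    count_fat, first_de_sector = struct.unpack_from("<II", data, 44)  # offsets 44 and 48
    (count_minifat,) = struct.unpack_from("<I", data, 64)
    (count_difat,) = struct.unpack_from("<I", data, 72)
    header_area = max(512, sector_size)  # assumption for 4096-byte sectors
    return {
        "SectorSize": sector_size,
        "CountFAT": count_fat,
        "CountMiniFAT": count_minifat,
        "CountDIFAT": count_difat,
        "FirstDESector": first_de_sector,
        "CountReal": (len(data) - header_area) // sector_size,
    }


def has_unclassified_sectors(counts: dict, count_de: int,
                             count_streams: int, count_free: int) -> bool:
    """True when CountClassified differs from CountReal."""
    classified = (counts["CountFAT"] + counts["CountMiniFAT"] + counts["CountDIFAT"]
                  + count_de + count_streams + count_free)
    return classified != counts["CountReal"]
```

A result of True from `has_unclassified_sectors` means some sectors are not accounted for by any known structure, which is the anomaly described above.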

My boss gave me a rather interesting task: in a short time, write an executable file analyzer that could find virus bodies by their signatures and identify the packer or cryptor used. A finished prototype appeared within a couple of hours.

Author's word

Signature analysis

Searching for a malicious object by signature is something every antivirus can do. In general, a signature is a formalized description of certain characteristics by which it can be determined that the file being scanned is a virus, and a specific virus at that.

There are various techniques here. One option is to use a signature made up of N bytes of the malicious object. The comparison need not be a naive byte-for-byte match; it can use a mask (for example, searching for the bytes EB ?? ?? CD 13). Additional conditions can also be set, such as “such and such bytes must be at the program's entry point”, and so on. A malware signature is just one particular case.
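A masked comparison of this kind is easy to sketch. In the example below, `parse_mask` and `find_masked` are hypothetical helper names, and "??" is taken to mean "any single byte", as in the pattern quoted above. This is a straightforward scan, not an optimized matcher.

```python
def parse_mask(pattern: str) -> list:
    """Turn 'EB ?? ?? CD 13' into [0xEB, None, None, 0xCD, 0x13]."""
    return [None if token == "??" else int(token, 16) for token in pattern.split()]


def find_masked(data: bytes, pattern: str) -> int:
    """Return the offset of the first match, or -1 if there is none."""
    mask = parse_mask(pattern)
    for start in range(len(data) - len(mask) + 1):
        if all(expected is None or data[start + i] == expected
               for i, expected in enumerate(mask)):
            return start
    return -1
```

An "entry point" condition like the one mentioned above would amount to requiring that the returned offset equal a particular position instead of accepting any match.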

In the same way, one can describe the features that show an executable file has been packed with a particular cryptor or packer (for example, the well-known ASPack). If you read our magazine carefully, you have certainly heard of PEiD, a tool that can identify the most commonly used packers, cryptors and compilers for the PE file handed to it (its database contains a large number of signatures). Alas, no new versions of the program have been released for a long time, and a message recently appeared on the official website saying that the project will not be developed further. That is a pity, because the capabilities of PEiD (especially given its plugin system) could have been very useful to me. After a short analysis it became clear that this was not an option. But after digging through English-language blogs, I quickly found something that suited me: the YARA project (code.google.com/p/yara-project).

What is YARA?

From the very beginning, I was convinced that somewhere on the Internet there was already an open source development that would take on the task of determining the correspondence between a certain signature and the file being examined. If I could find such a project, then it could easily be put on the rails of a web application, add different signatures there and get what was required of me. The plan began to seem even more realistic when I read the description of the YARA project.

The developers themselves position it as a tool to help malware researchers identify and classify malicious samples. The researcher can create descriptions for different types of malware using text or binary patterns that capture formalized characteristics of the malware. This is how signatures are obtained. In fact, each description consists of a set of strings and a logical expression that determines when the analyzer triggers.

If the conditions of one of the rules are met for the file being examined, it is determined accordingly (for example, such and such a worm). A simple example of a rule to understand what we are talking about:

rule silent_banker : banker
{
    meta:
        description = "This is just an example"
        threat_level = 3
        in_the_wild = true
    strings:
        $a = {6A 40 68 00 30 00 00 6A 14 8D 91}
        $b = {8D 4D B0 2B C1 83 C0 27 99 6A 4E 59 F7 F9}
        $c = "UVODFRYSIHLNWPEJXQZAKCBGMT"
    condition:
        $a or $b or $c
}

In this rule we tell YARA that any file that contains at least one of the sample strings described in the variables $a, $b, $c should be classified as a silent_banker trojan. And this is a very simple rule. In reality, rules can be much more complex (we'll talk about this below).
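
To make the logic of this rule concrete, here is a minimal Python sketch of what the condition `$a or $b or $c` boils down to. It is only an illustration: the function name matches_silent_banker is made up for this example, and real YARA compiles its rules and scans far more efficiently than a naive substring search.

```python
# Toy illustration only: this is not how YARA itself is implemented.
PATTERNS = {
    "$a": bytes.fromhex("6A 40 68 00 30 00 00 6A 14 8D 91"),
    "$b": bytes.fromhex("8D 4D B0 2B C1 83 C0 27 99 6A 4E 59 F7 F9"),
    "$c": b"UVODFRYSIHLNWPEJXQZAKCBGMT",
}


def matches_silent_banker(data):
    """Mimic the condition `$a or $b or $c`: true if any pattern occurs in data."""
    return any(pattern in data for pattern in PATTERNS.values())
```

Each string identifier behaves like a boolean ("was this pattern found anywhere in the data?"), and the condition simply combines those booleans.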
The list of projects that use YARA speaks for its credibility:

  • VirusTotal Malware Intelligence Services (vt-mis.com);
  • jsunpack-n (jsunpack.jeek.org);
  • We Watch Your Website (wewatchyourwebsite.com).

All code is written in Python, and the user is offered both the module itself for use in their development, and simply an executable file to use YARA as a stand-alone application. As part of my work, I chose the first option, but for simplicity in this article we will simply use the analyzer as a console application.

After some digging, I quickly figured out how to write rules for YARA, as well as how to plug in virus signatures from a free antivirus and packer signatures from PEiD. But we'll start with the installation.

Installation

As I already said, the project is written in Python, so it can easily be installed on Linux, Windows, and Mac. To begin with, you can simply take the prebuilt binary. If we call the application in the console without arguments, it prints its usage line:

$ yara
usage: yara [OPTION]... [RULEFILE]... FILE | PID

That is, the call format is as follows: first comes the name of the program, then a list of options, then the file with the rules, and at the very end the name of the file being examined (or the directory containing the files) or a process identifier. It would be proper to explain right away how these rules are written, but I don't want to burden you with dry theory from the start. So we will do things differently and borrow other people's signatures so that YARA can perform one of the tasks we have set: full-fledged signature-based virus detection.

Your own antivirus

The most important question: where do we get a database of signatures of known viruses? Antivirus companies actively share such databases among themselves (some more generously, others less). To be honest, at first I even doubted that anyone would openly post such things on the Internet. But, as it turned out, there are good people. A suitable database from the popular ClamAV antivirus is available to everyone (clamav.net/lang/en). In the "Latest Stable Release" section you can find a link to the latest version of the antivirus product, as well as links for downloading the ClamAV virus databases. We will be primarily interested in the main.cvd (db.local.clamav.net/main.cvd) and daily.cvd (db.local.clamav.net/daily.cvd) files.

The first contains the main signature database; the second is the most complete database at the moment, with various additions. Daily.cvd, which contains more than 100,000 malware signatures, is quite sufficient for our purpose. However, the ClamAV database is not a YARA database, so we need to convert it to the desired format. But how? After all, we don't yet know anything about either the ClamAV format or the YARA format. Someone has already taken care of this problem for us by preparing a small script that converts the ClamAV virus signature database into a set of YARA rules. The script is called clamav_to_yara.py and was written by Matthew Richard (bit.ly/ij5HVs). Download the script and convert the database:

$ python clamav_to_yara.py -f daily.cvd -o clamav.yara

As a result, the clamav.yara file will contain a signature database that is immediately ready for use. Let's now try the combination of YARA and the ClamAV database in action. Scanning a folder against the signatures takes a single command:

$ yara -r clamav.yara /pentest/msf3/data

The -r option specifies that the scan should be performed recursively across all subfolders of the specified folder. If there are any virus bodies in the /pentest/msf3/data folder (at least those that are in the ClamAV database), YARA will immediately report them. In principle, this is a ready-made signature scanner. For greater convenience, I wrote a simple script that checked for ClamAV database updates, downloaded new signatures and converted them to the YARA format. But these are details. One part of the task is completed, and now we can start drawing up rules for identifying packers and cryptors. To do this, though, I had to understand the rules a little better.

Play by the rules

So, a rule is the main mechanism of the program, the thing that lets you assign a given file to a certain category. The rules are described in a separate file (or files) and in appearance closely resemble a struct declaration in C/C++.

rule BadBoy
{
    strings:
        $a = "win.exe"
        $b = "http://foo.com/badfile1.exe"
        $c = "http://bar.com/badfile2.exe"
    condition:
        $a and ($b or $c)
}

In principle, there is nothing complicated in writing rules. In this article I touch only on the main points, and you will find the details in the manual. For now, the most important points:

1. Each rule begins with the keyword rule, followed by the rule identifier. Identifiers follow the same naming rules as variables in C/C++, that is, they can consist of letters and digits, and the first character cannot be a digit. The maximum length of an identifier is 128 characters.

2. Typically, rules consist of two sections: a definition section (strings) and a condition section (condition). The strings section specifies data on the basis of which the condition section will decide whether a given file satisfies certain conditions.

3. Each string in the strings section has its own identifier, which begins with the $ sign, much like a variable declaration in PHP. YARA supports text strings enclosed in double quotes ("") and hexadecimal strings enclosed in braces ({}), as well as regular expressions:

$my_text_string = "text here"
$my_hex_string = { E2 34 A1 C8 23 FB }

4. The condition section contains all the logic of the rule. This section must contain a Boolean expression that determines when a file or process matches the rule. Typically, this section refers to the previously declared strings. A string identifier is treated as a boolean variable that is true if the string was found in the file or process memory, and false otherwise. The BadBoy rule above specifies that files and processes containing the string win.exe and one of the two URLs should be categorized as BadBoy (by the rule name).

5. Hexadecimal strings allow three constructs that make them more flexible: wildcards, jumps, and alternatives. Wildcards are places in a string whose value is unknown and can be anything. They are indicated by the symbol "?":

$hex_string = { E2 34 ?? C8 A? FB }

This approach is very convenient when specifying strings whose length is known but whose content may vary. If part of a string can be of different lengths, it is convenient to use jumps, which are written as ranges:

$hex_string = { F4 23 [4-6] 62 B4 }

This entry means that in the middle of the string there can be from 4 to 6 arbitrary bytes. You can also specify alternatives:

$hex_string = { F4 23 (62 B4 | 56) 45 }

This means that the third position can hold either 62 B4 or 56, so the pattern matches the byte sequences F42362B445 and F4235645.
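
The three constructs map naturally onto regular expressions over bytes. The sketch below is only an illustration of that idea, not YARA's own matcher; the function name hex_pattern_to_regex is hypothetical, and the pattern is passed without the surrounding braces.

```python
import re

# Tokens: a jump like [4-6], a grouping character, or a two-character byte.
TOKEN = re.compile(r"\[\d+-\d+\]|[()|]|[0-9A-Fa-f?]{2}")


def hex_pattern_to_regex(pattern):
    """Translate a YARA-style hex pattern (without braces) into a bytes regex."""
    parts = []
    for token in TOKEN.findall(pattern):
        if token == "(":
            parts.append(b"(?:")
        elif token in ("|", ")"):
            parts.append(token.encode("ascii"))
        elif token.startswith("["):        # jump, e.g. [4-6]
            low, high = token[1:-1].split("-")
            parts.append(b".{%d,%d}" % (int(low), int(high)))
        elif token == "??":                # any byte
            parts.append(b".")
        elif token[1] == "?":              # high nibble fixed, e.g. A?
            base = int(token[0], 16) << 4
            parts.append(b"[\\x%02x-\\x%02x]" % (base, base | 0x0F))
        elif token[0] == "?":              # low nibble fixed, e.g. ?A
            nibble = int(token[1], 16)
            members = b"".join(b"\\x%02x" % (h << 4 | nibble) for h in range(16))
            parts.append(b"[" + members + b"]")
        else:                              # literal byte
            parts.append(b"\\x%02x" % int(token, 16))
    # DOTALL so that "." also matches the newline byte 0x0A.
    return re.compile(b"".join(parts), re.DOTALL)
```

For example, hex_pattern_to_regex("F4 23 (62 B4 | 56) 45") finds both F42362B445 and F4235645 in a byte buffer, but not F4236245.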

6. To check that a given string is at a specific offset in a file or process address space, the at operator is used:

$a at 100 and $b at 200

If the string can be within a certain address range, the in operator is used:

$a in (0..100) and $b in (100..filesize)

Sometimes you need to specify that a file should contain a certain number of strings from a given set. This is done using the of operator:

rule OfExample1
{
strings:
$foo1 = "dummy1"
$foo2 = "dummy2"
$foo3 = "dummy3"
condition:
2 of ($foo1,$foo2,$foo3)
}

The above rule requires that the file contain any two strings from the set ($foo1,$foo2,$foo3). Instead of giving a specific number of strings, you can use the keywords any (at least one string from the set) and all (every string from the set).
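
As a rough model of the at, in and of operators, here is a short Python sketch. The helper names (find_offsets, at, in_range, n_of) are invented for this illustration and are not part of YARA or its Python module.

```python
def find_offsets(data, needle):
    """Return every offset at which needle occurs in data."""
    offsets = []
    start = data.find(needle)
    while start != -1:
        offsets.append(start)
        start = data.find(needle, start + 1)
    return offsets


def at(data, needle, offset):
    """Model of `$a at offset`: needle starts exactly at offset."""
    return data.startswith(needle, offset)


def in_range(data, needle, low, high):
    """Model of `$a in (low..high)`: some occurrence starts within the range."""
    return any(low <= off <= high for off in find_offsets(data, needle))


def n_of(data, needles, n):
    """Model of `n of (...)`: at least n of the strings occur in data."""
    return sum(1 for needle in needles if needle in data) >= n
```

With data = b"xx dummy1 yy dummy3", the call n_of(data, [b"dummy1", b"dummy2", b"dummy3"], 2) is true, just like the condition of OfExample1 would be for such a file.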

7. The last interesting feature worth considering is applying one condition to many strings. It is very similar to the of operator, only more powerful: the for..of operator.

for expression of string_set: (boolean_expression)

This entry should be read like this: of the strings specified in string_set, at least expression of them must satisfy the boolean_expression condition. Or, in other words: boolean_expression is evaluated for each string in string_set, and at least expression of those evaluations must return True. Next we will look at this construction using a specific example.
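
The semantics of for..of can be modeled in a few lines of Python. This is a sketch under my own naming (for_of and its parameters are hypothetical), where the count is either a number or one of the words "any" and "all":

```python
def for_of(count, needles, data, predicate):
    """Model of `for count of needles : (predicate)`.

    count is an integer, "any" or "all"; predicate(data, needle) is the
    boolean expression evaluated once per string.
    """
    hits = sum(1 for needle in needles if predicate(data, needle))
    if count == "all":
        return hits == len(needles)
    if count == "any":
        return hits >= 1
    return hits >= count
```

For instance, with predicate = lambda d, n: d.startswith(n, 0), the call for_of("any", signatures, data, predicate) is true when at least one signature sits at offset 0 of the data.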

Making PEiD

So, when everything has become more or less clear with the rules, we can begin to implement a detector of packers and cryptors in our project. At first, as source material, I borrowed the signatures of well-known packers from the same PEiD. In the plugins folder there is a file userdb.txt, which contains what we need. There were 1850 signatures in my database.

That is quite a lot, so to import them all I advise you to write a script. The format of this database is simple: it is an ordinary text file that stores records like this:


[Name of the Packer v1.0]
signature = 50 E8 ?? ?? ?? ?? 58 25 ?? F0 FF FF 8B C8 83 C1 60 51 83 C0 40 83 EA 06 52 FF 20 9D C3
ep_only = true

The first line specifies the name of the packer, which PEiD displays and which for us will become the rule identifier. The second is the signature itself. The third is the ep_only flag, which indicates whether to search for the signature only at the entry point address or throughout the entire file.
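
An import script along these lines might look like the sketch below. It assumes that every record starts with a bracketed name line such as [Name of the Packer v1.0], followed by signature = ... and ep_only = ... lines; the function names parse_userdb and to_yara_rule are my own. It does not handle duplicate packer names or signature constructs that YARA does not accept, so treat it as a starting point rather than a finished converter.

```python
import re


def parse_userdb(text):
    """Yield (name, signature, ep_only) tuples from PEiD userdb-style text."""
    name = signature = None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("[") and line.endswith("]"):
            name, signature = line[1:-1], None
        elif line.lower().startswith("signature") and "=" in line:
            signature = line.split("=", 1)[1].strip()
        elif line.lower().startswith("ep_only") and name and signature:
            flag = line.split("=", 1)[1].strip().lower() if "=" in line else ""
            yield name, signature, flag == "true"
            name = signature = None


def to_yara_rule(name, signature, ep_only):
    """Render one record as the text of a YARA rule."""
    # Rule identifiers: letters, digits, underscore; no leading digit; max 128.
    ident = re.sub(r"[^0-9A-Za-z_]", "_", name)
    if not ident or ident[0].isdigit():
        ident = "_" + ident
    ident = ident[:128]
    condition = "$a at entrypoint" if ep_only else "$a"
    return (
        "rule %s\n{\n    strings:\n        $a = { %s }\n"
        "    condition:\n        %s\n}\n" % (ident, signature, condition)
    )
```

Feeding the whole file through parse_userdb and writing the result of to_yara_rule for each record gives one rule per signature; merging several signatures of the same packer into a single rule is a natural next step.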

Well, let's try to create a rule, say, for ASPack? As it turns out, there is nothing complicated about this. First, let's create a file to store the rules and call it, for example, packers.yara. Then we search the PEiD database for all signatures that include ASPack in their names and transfer them to the rule:

rule ASPack
{
    strings:
        $ = { 60 E8 ?? ?? ?? ?? 5D 81 ED ?? ?? (43 | 44) ?? B8 ?? ?? (43 | 44) ?? 03 C5 }
        $ = { 60 EB ?? 5D EB ?? FF ?? ?? ?? ?? ?? E9 }
        [.. cut..]
        $ = { 60 E8 03 00 00 00 E9 EB 04 5D 45 55 C3 E8 01 }
    condition:
        for any of them : ($ at entrypoint)
}

All the records found have the ep_only flag set to true, that is, these strings must be located at the entry point address. Therefore, we write the following condition: "for any of them : ($ at entrypoint)".

Thus, the presence of at least one of the given strings at the entry point address means that the file is packed with ASPack. Note also that in this rule all strings are declared with just the $ sign, without an identifier. This is possible because the condition section does not refer to any specific string but uses the entire set.

To check the functionality of the resulting system, just run the command in the console:

$ yara -r packers.yara somefile.exe

Having fed it a couple of applications packed with ASPack, I was convinced that everything worked!

Ready prototype

YARA turned out to be an extremely clear and transparent tool. It wasn't difficult for me to write a web admin panel for it and set it up to work as a web service. With a little creativity, the dry results of the analyzer are now highlighted in different colors that indicate the degree of danger of the detected malware. After a small update of the database, a short description is available for many of the cryptors, and sometimes even unpacking instructions. The prototype has been created and works perfectly, and the bosses are dancing with delight!

The function code (FC) in the telegram header identifies the telegram type, such as Request telegram (Request or Send/Request) and Acknowledgement or Response telegram (Acknowledgement frame, Response frame). In addition, the function code contains the actual transmission function and control information that prevents loss and duplication of messages, or the station type with FDL status.

FC: Function Code, Request (bits 7 6 5 4 3 2 1 0)

Bit   Value   Meaning
b6    1       Request telegram
b5    X       FCB = Alternating bit (from frame count)
b4    X       FCV = Alternating bit switched on

b7   b3..b0     Function
1    0 (0x0)    CV = Clock Value
1    other      Reserved
0    0 (0x0)    TE = Time Event (Clock synchronization)
0    3 (0x3)    SDA_LOW = Send Data Acknowledged - low priority
0    4 (0x4)    SDN_LOW = Send Data Not acknowledged - low priority
0    5 (0x5)    SDA_HIGH = Send Data Acknowledged - high priority
0    6 (0x6)    SDN_HIGH = Send Data Not acknowledged - high priority
0    7 (0x7)    MSRD = Send Request Data with Multicast Reply
0    9 (0x9)    Request FDL Status
0    12 (0xC)   SRD low = Send and Request Data
0    13 (0xD)   SRD high = Send and Request Data
0    14 (0xE)   Request Ident with reply
0    15 (0xF)   Request LSAP Status with reply 1)
0    other      Reserved

1) this value is in the last version of the standard not defined anymore but only reserved

FC: Function Code, Response (bits 7 6 5 4 3 2 1 0)

Bit   Value   Meaning
b7    0       Reserved
b6    0       Response telegram

b5   b4   Station type
0    0    Slave
0    1    Master not ready
1    0    Master ready, without token
1    1    Master ready, in token ring

b3..b0     Function
0 (0x0)    OK
1 (0x1)    UE = User Error
2 (0x2)    RR = No resources
3 (0x3)    RS = SAP not enabled
8 (0x8)    DL = Data Low (normal case with DP)
9 (0x9)    NR = No response data ready
10 (0xA)   DH = Data High (DP diagnosis pending)
12 (0xC)   RDL = Data not received and Data Low
13 (0xD)   RDH = Data not received and Data High
other      Reserved
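
The two tables can be turned into a small decoder. The Python sketch below is an illustration under stated assumptions rather than a reference implementation: it assumes the bit layout b6 = request/response, b5 = FCB, b4 = FCV (or b5..b4 = station type in a response) and b3..b0 = function, and that the leading 1 in the CV row of the request table is bit 7. The function name decode_fc and the dictionary keys are my own.

```python
REQUEST_FUNCTIONS = {
    0x0: "TE", 0x3: "SDA_LOW", 0x4: "SDN_LOW", 0x5: "SDA_HIGH",
    0x6: "SDN_HIGH", 0x7: "MSRD", 0x9: "Request FDL Status",
    0xC: "SRD low", 0xD: "SRD high", 0xE: "Request Ident with reply",
    0xF: "Request LSAP Status with reply",
}
RESPONSE_FUNCTIONS = {
    0x0: "OK", 0x1: "UE", 0x2: "RR", 0x3: "RS", 0x8: "DL",
    0x9: "NR", 0xA: "DH", 0xC: "RDL", 0xD: "RDH",
}
STATION_TYPES = {
    0: "Slave", 1: "Master not ready",
    2: "Master ready, without token", 3: "Master ready, in token ring",
}


def decode_fc(fc):
    """Decode one function code byte into a dictionary of its fields."""
    code = fc & 0x0F                       # b3..b0
    if fc & 0x40:                          # b6 = 1: request telegram
        if fc & 0x80:                      # assumed: b7 = 1 selects CV
            function = "CV" if code == 0 else "Reserved"
        else:
            function = REQUEST_FUNCTIONS.get(code, "Reserved")
        return {
            "type": "request",
            "fcb": bool(fc & 0x20),        # b5
            "fcv": bool(fc & 0x10),        # b4
            "function": function,
        }
    return {
        "type": "response",
        "station": STATION_TYPES[(fc >> 4) & 0x03],   # b5..b4
        "function": RESPONSE_FUNCTIONS.get(code, "Reserved"),
    }
```

Under these assumptions, decode_fc(0x6D) describes a request with FCB=1, FCV=0 and function SRD high, and decode_fc(0x08) describes a response from a slave with function DL.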

Frame Count Bit

The frame count bit FCB (b5) prevents message duplication by the acknowledging or responding station (responder) and any loss by the calling station (initiator). Excluded from this are requests without acknowledgement (SDN) and FDL Status, Ident and LSAP Status requests.

For the security sequence, the initiator must carry an FCB for each responder. When a Request telegram (Request or Send/Request) is sent to a responder for the first time, or if it is re-sent to a responder currently marked as non-operational, the FCB must be set as defined in the responder. The initiator achieves this in a Request telegram with FCV=0 and FCB=1. The responder must assess a telegram of this kind as the first message cycle and store the FCB=1 together with the initiator's address (SA) (see following table). This message cycle will not be repeated by the initiator.

In subsequent Request telegrams to the same responder, the initiator must set FCV=1 and change the FCB with each new Request telegram. Any responder that receives a Request telegram addressed to it with FCV=1 must evaluate the FCB. If the FCB has changed when compared with the last Request telegram from the same initiator (same SA), this is valid confirmation that the preceding message cycle was concluded properly. If the Request telegram originates from a different initiator (different SA), evaluation of the FCB is no longer necessary. In both cases, the responder must save the FCB with the source SA until receipt of a new telegram addressed to it.

In the case of a lost or impaired acknowledgement or response telegram, the FCB must not be changed by the initiator in the request retry: this will indicate that the previous message cycle was faulty. If the responder receives a Request telegram with FCV=1 and the same FCB as the last Request telegram from the same initiator (same SA), this will indicate a request retry. The responder must in turn retransmit the acknowledgement or response telegram held in readiness. Until the above-mentioned confirmation or receipt of a telegram with a different address (SA or DA) that is not acknowledged (Send Data with No Acknowledge, SDN) the responder must hold the last acknowledgement or response telegram in readiness for any possible request retry.

In the case of Request telegrams that are not acknowledged and with Request FDL Status, Ident, and LSAP Status, FCV=0 and FCB=0; evaluation by the responder is no longer necessary.

FCB    FCV    Condition       Meaning                                 Action
(b5)   (b4)
0      0      DA = TS/127     Request without acknowledgement;        Delete last acknowledgement
                              Request FDL Status/Ident/LSAP Status
0/1    0/1    DA#TS           Request to another responder
1      0      DA = TS         First request                           FCBM:=1; SAM:=SA;
                                                                      delete last acknowledgement/response
0/1    1      DA = TS,        New request                             Delete last acknowledgement/response;
              SA = SAM,                                               FCBM:=FCB;
              FCB#FCBM                                                hold acknowledgement/response in readiness for retry
0/1    1      DA = TS,        Retry request                           FCBM:=FCB;
              SA = SAM,                                               repeat acknowledgement/response and
              FCB = FCBM                                              continue to hold in readiness
0/1    1      DA = TS,        New initiator                           FCBM:=FCB; SAM:=SA;
              SA#SAM                                                  hold acknowledgement/response in readiness for retry

FCBM: stored FCB in memory
SAM: stored SA in memory
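
The responder side of this table can be sketched as a tiny state machine. The Python below is a simplified model for telegrams already addressed to this station (DA = TS); the class name Responder, the make_reply callback and the attribute names are hypothetical, and details such as timing or SDN telegrams that produce no reply are left to the caller.

```python
class Responder:
    """Simplified model of the FCB/FCV handling for requests with DA = TS."""

    def __init__(self):
        self.fcbm = None   # FCBM: stored FCB
        self.sam = None    # SAM: stored source address
        self.held = None   # last acknowledgement/response held for a retry

    def on_request(self, sa, fcb, fcv, make_reply):
        """Process one request; make_reply() builds a fresh reply on demand."""
        if not fcv:
            # FCV=0: delete the held reply; FCB=1 marks a first request.
            self.held = None
            if fcb:
                self.fcbm, self.sam = 1, sa
            return make_reply()            # not held: the cycle is not repeated
        if sa == self.sam and fcb == self.fcbm:
            # Retry request: repeat the reply and keep holding it.
            return self.held
        # New request (FCB changed) or new initiator (different SA).
        self.fcbm, self.sam = fcb, sa
        self.held = make_reply()
        return self.held
```

A new request from the known initiator and a request from a new initiator end up in the same branch here, because in both cases the table stores the FCB (and the SA) and holds the fresh reply for a possible retry.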



