Document Grinding Definition

Document grinding is the process of analyzing documents to extract meaningful data. The term is often associated with computer hacking, since hackers may “grind” documents to reveal confidential data. However, document grinding is also used for nonmalicious purposes. Examples include identifying unknown file types and viewing file metadata.

It is possible to perform document grinding on both plain text and binary files.

Text Files

Grinding text files is a simple process since they store data as plain text. You can search for characters and strings within a text document using a tool like grep or another search utility. Since text processing is a relatively fast computer operation, it may be possible to grind several large documents in less than a second.
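As a rough illustration of the grep-style search described above, the following Python sketch scans lines of text for a pattern and reports the matching line numbers. The function name grind_text and the sample log lines are made up for this example; a real log file would be opened and read line by line in the same way.

```python
import re
from typing import Iterable, Iterator, Tuple


def grind_text(lines: Iterable[str], pattern: str) -> Iterator[Tuple[int, str]]:
    """Yield (line_number, line) for every line that matches the pattern."""
    regex = re.compile(pattern, re.IGNORECASE)
    for number, line in enumerate(lines, start=1):
        if regex.search(line):
            yield number, line.rstrip("\n")


# Hypothetical log content, used here instead of a real file.
sample_log = [
    "2024-01-01 12:00:00 INFO service started\n",
    "2024-01-01 12:00:05 WARN login failed for user=alice\n",
    "2024-01-01 12:00:09 INFO request served\n",
]

matches = list(grind_text(sample_log, r"login failed"))
for number, line in matches:
    print(f"{number}: {line}")
```

Because the function accepts any iterable of lines, an open file object can be passed in place of the sample list, so large files are processed without loading them fully into memory.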

Common text file types targeted for document grinding include log files (.LOG, .TXT) and configuration files (.CONF, .CNF). If a hacker gains access to a web server, for example, he may search these files for usernames, passwords, and other confidential data.
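The same kind of search is useful for auditing your own configuration files. The sketch below assumes a simple "key = value" or "key: value" layout and flags lines whose key looks sensitive. The key names in SENSITIVE_KEYS, the function name, and the sample settings are illustrative assumptions, not a standard; real configuration formats vary widely.

```python
import re
from typing import Iterable, List, Tuple

# Hypothetical key names chosen for illustration only.
SENSITIVE_KEYS = {"user", "username", "password", "passwd", "secret", "token"}

# Matches "key = value" or "key: value" (an assumed, simplified format).
SETTING = re.compile(r"^\s*([A-Za-z_][\w.-]*)\s*[=:]\s*(.+?)\s*$")


def find_sensitive_settings(lines: Iterable[str]) -> List[Tuple[int, str]]:
    """Return (line_number, key) for settings whose key looks sensitive."""
    found = []
    for number, line in enumerate(lines, start=1):
        if line.lstrip().startswith(("#", ";")):
            continue  # skip comment lines
        match = SETTING.match(line)
        if match and match.group(1).lower() in SENSITIVE_KEYS:
            found.append((number, match.group(1)))
    return found


# Made-up configuration content, used here instead of a real file.
sample_conf = [
    "# database settings\n",
    "host = db.example.internal\n",
    "username = appuser\n",
    "password = example-only\n",
    "port: 5432\n",
]

flagged = find_sensitive_settings(sample_conf)
for number, key in flagged:
    print(f"line {number}: {key}")
```

The sketch reports only the line number and key, not the value, so the audit output does not itself become another place where confidential data is stored.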

Binary Files

Binary files may contain some plain text, but most of their content is binary data: bytes that do not correspond to readable characters. It is more difficult to grind binary data since a standard text search tool cannot interpret it. Additionally, many binary files are saved in a proprietary file format, which is difficult to parse without the corresponding application. Therefore, binary document grinding typically focuses on the header and footer of a document, which may contain plain text. It may also aim to extract file metadata.
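One common way to find the plain text inside a binary file is to pull out runs of printable characters, similar in spirit to the Unix strings utility. The following is a minimal sketch that handles ASCII only; the function name extract_strings, the minimum length of four characters, and the sample bytes are assumptions made for this example.

```python
import re
from typing import List


def extract_strings(data: bytes, min_length: int = 4) -> List[str]:
    """Return runs of printable ASCII characters at least min_length long."""
    # Bytes 0x20 through 0x7e are the printable ASCII characters.
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_length)
    return [run.decode("ascii") for run in pattern.findall(data)]


# Made-up binary content with two readable fragments embedded in it.
sample = (
    b"\x00\x01\x02Author: J. Doe\x00\xff\xfeabc\x00\x10"
    b"Created with ExampleApp 1.0\x00"
)

found = extract_strings(sample)
for text in found:
    print(text)
```

Short runs such as "abc" are skipped because random binary data often contains a few printable bytes in a row by chance. Text stored in other encodings, such as UTF-16, would need a different pattern.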

Many binary files contain information about the file type in the header of the file. For example, in the sample image, the letters “PNG” in the header indicate the file is a PNG image. This information is useful for identifying the file type when, as in this case, the file does not have a file extension. Similarly, digital photos often contain hidden EXIF data saved when the photo was taken. An image-viewing program or a document grinding script may be able to detect and extract this information.
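Header-based identification can be sketched by comparing the first bytes of a file against known signatures. The short table below covers only a few well-known formats, and the names SIGNATURES and identify_file_type are chosen for this example. In practice, the header would be read from a file opened in binary mode.

```python
from typing import List, Tuple

# A small, incomplete table of well-known file signatures.
SIGNATURES: List[Tuple[bytes, str]] = [
    (b"\x89PNG\r\n\x1a\n", "PNG image"),
    (b"\xff\xd8\xff", "JPEG image"),
    (b"GIF87a", "GIF image"),
    (b"GIF89a", "GIF image"),
    (b"%PDF-", "PDF document"),
    (b"PK\x03\x04", "ZIP archive"),
]


def identify_file_type(header: bytes) -> str:
    """Return a file type name based on the leading bytes, or 'unknown'."""
    for signature, name in SIGNATURES:
        if header.startswith(signature):
            return name
    return "unknown"


# The first bytes of a PNG file: the 8-byte signature, then the IHDR chunk.
png_header = b"\x89PNG\r\n\x1a\n\x00\x00\x00\rIHDR"
print(identify_file_type(png_header))
print(identify_file_type(b"plain text content"))
```

Reading the first 16 bytes of a file is enough for every signature in this table. Extracting EXIF data requires parsing the image's internal structure and is usually done with a dedicated library instead of a byte comparison like this one.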