
Approximate matching can help find needles in haystacks

Finding malicious code is not too difficult if you have a fingerprint or signature to look for. Traditional signature-based antivirus tools have been doing this effectively for years. But malware often morphs, adapts and evolves to hide itself, and a simple one-to-one match is no longer adequate.

The National Institute of Standards and Technology is developing guidance for a technique called approximate matching to help automate the task of identifying suspicious code that otherwise would fall to human analysts. The draft document is based on the work of NIST’s Approximate Matching Working Group.

“Approximate matching is a promising technology designed to identify similarities between two digital artifacts,” the draft of Special Publication 800-168 says. “It is used to find objects that resemble each other or to find objects that are contained in another object.” 

The technology can be used to filter data for security monitoring and for digital forensics, when analysts are trying to spot potential bad actors either before or after a security incident.

Approximate matching is a generic term describing any method for automating the search for similarities between two digital artifacts or objects. An “object” is an “arbitrary byte sequence, such as a file, which has some meaningful interpretation.”

Humans can understand the concept of similarity intuitively, but defining the aspects of similarity for algorithms can be challenging. In approximate matching, similarity is defined for algorithms in terms of the characteristics of artifacts being examined. These characteristics can include byte sequences, internal syntactic structures or more abstract semantic attributes similar to what human analysts would look for.

Different methods for approximate matching operate at different levels of abstraction. These range from generic, low-level techniques that detect common byte sequences to more abstract analyses that approach the level of human evaluation. “The overall expectation is that lower level methods would be faster, and more generic in their applicability, whereas higher level ones would be more targeted and require more processing,” the document explains.

Approximate matching uses two types of queries: resemblance and containment. Two successive versions of a piece of code are likely to resemble each other, and a resemblance query simply identifies two pieces of code that are substantially similar. With a containment query, two objects of substantially different size, such as a file and a whole-disk image, are examined to determine whether the smaller object, or something similar to it, is contained in the larger one.
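To make the difference between the two query types concrete, here is a minimal Python sketch that works at the byte-sequence level. It is an illustration only, not the method described in the NIST draft or used by any particular tool: the function names, the choice of 4-byte substrings and the use of plain set overlap are all assumptions made for the example.

```python
from __future__ import annotations


def ngrams(data: bytes, n: int = 4) -> set[bytes]:
    """Return the set of all n-byte substrings of data (assumed feature type)."""
    return {data[i:i + n] for i in range(len(data) - n + 1)}


def resemblance(a: bytes, b: bytes, n: int = 4) -> float:
    """Resemblance query: shared features divided by all features (Jaccard index)."""
    ga, gb = ngrams(a, n), ngrams(b, n)
    if not ga and not gb:
        return 1.0  # two inputs too short to have features are treated as identical
    return len(ga & gb) / len(ga | gb)


def containment(small: bytes, large: bytes, n: int = 4) -> float:
    """Containment query: fraction of the smaller object's features found in the larger."""
    gs, gl = ngrams(small, n), ngrams(large, n)
    if not gs:
        return 0.0
    return len(gs & gl) / len(gs)


# Hypothetical data: two versions of a snippet, and a fragment buried in a larger blob.
version_1 = b"print('hello world')"
version_2 = b"print('hello, world')"
fragment = b"malicious_payload_xyz"
disk_image = bytes(range(256)) + fragment + bytes(range(256))

print("resemblance of versions:", resemblance(version_1, version_2))
print("resemblance of fragment and image:", resemblance(fragment, disk_image))
print("containment of fragment in image:", containment(fragment, disk_image))
```

The fragment and the larger blob score low on resemblance because most of the blob has nothing in common with the fragment, yet the containment score is high because every feature of the fragment appears in the blob. That asymmetry is why objects of very different sizes call for a containment query.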

As described in the document, approximate matching is usually used to filter data, as in blacklisting known malicious artifacts or anything closely resembling them. “However, approximate matching is not nearly as useful when it comes to whitelisting artifacts, as malicious content can often be quite similar to benign content,” NIST warns.
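A blacklist filter of the kind described above might look roughly like the following sketch. The similarity score here comes from Python's standard-library difflib.SequenceMatcher, which is only a convenient stand-in and not an approximate matching algorithm from the NIST draft. The 0.8 threshold, the function names and the sample data are likewise assumptions chosen for illustration.

```python
from difflib import SequenceMatcher


def similarity(a: bytes, b: bytes) -> float:
    """Stand-in similarity score between 0 and 1 (not a forensic-grade measure)."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def flag_blacklisted(candidates: dict, blacklist: dict, threshold: float = 0.8) -> list:
    """Return (candidate, blacklist entry) pairs whose score meets the assumed threshold."""
    hits = []
    for cand_name, cand_data in candidates.items():
        for bad_name, bad_data in blacklist.items():
            if similarity(cand_data, bad_data) >= threshold:
                hits.append((cand_name, bad_name))
    return hits


# Hypothetical artifacts: one known-bad sample, one slightly altered variant, one unrelated file.
blacklist = {
    "dropper_v1": b"connect(host); download(payload); execute(payload)",
}
candidates = {
    "variant": b"connect(host2); download(payload); execute(payload);",
    "report": b"The quarterly report is attached for review.",
}

print(flag_blacklisted(candidates, blacklist))
```

An exact-match signature would miss the altered variant, while the threshold-based filter still flags it. The same logic run against a whitelist would be risky for the reason NIST gives: a malicious file that closely resembles an approved one would also clear the threshold.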

The publication lays out essential requirements of approximate matching functions as well as the factors (including sensitivity and robustness, precision and recall, and security) that determine the reliability of the results.
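Precision and recall have standard definitions that are easy to show in code: precision is the share of flagged objects that really are matches, and recall is the share of real matches that were flagged. The short Python sketch below assumes an analyst already knows which objects are truly similar. The function name and the sample labels are hypothetical, and returning 0.0 for an empty set is a convention chosen for this example.

```python
def precision_recall(flagged: set, truly_similar: set) -> tuple:
    """Compute (precision, recall) for a set of flagged objects against ground truth."""
    true_positives = len(flagged & truly_similar)
    # Precision: of everything the tool flagged, how much was a real match?
    precision = true_positives / len(flagged) if flagged else 0.0
    # Recall: of all real matches, how many did the tool flag?
    recall = true_positives / len(truly_similar) if truly_similar else 0.0
    return precision, recall


# Hypothetical result: the tool flags four files, and three files are real matches.
flagged = {"a.bin", "b.bin", "c.bin", "d.bin"}
truly_similar = {"a.bin", "b.bin", "e.bin"}

precision, recall = precision_recall(flagged, truly_similar)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this made-up case two of the four flagged files are real matches, and two of the three real matches were found. Lowering a match threshold typically raises recall at the cost of precision, which is the trade-off an analyst has to tune.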

Comments on the publication are due by March 21 and should include “Comments on SP 800-168” in the subject line.

Posted by William Jackson on Feb 07, 2014 at 10:23 AM


Reader Comments

Thu, Mar 6, 2014 Don O'Neill

As stated, approximate matching is a promising technology designed to identify similarities between two digital artifacts. 1. The domains of usefulness identified include filtering data for security monitoring, digital forensics, and other applications. 2. Two types of similarity are identified: resemblance and containment. 3. Four use cases are postulated: object similarity detection, cross correlation, embedded object detection, and fragment detection.

First, I offer a suggestion to expand the domains of utility to include the detection of unauthorized use or reuse of proprietary information, copyrighted material, and trade secrets. Here the fragment detection use case would be especially applicable, and the type of similarity expected may be either resemblance or containment.

Second, I offer the suggestion to expand the universe of deep detection by employing Carnegie Mellon University's Function Extraction (FX) methods to reveal intended functions that may then be subject to approximate matching algorithms to determine similarity or identity.

Third, I offer the suggestion to expand the universe of deep semantics through cognitive computing, for example, IBM Watson.
