Using File Compression to Measure Similarity

Buck Shlegeris is CEO of Redwood Research.

See Ming Li, Xin Chen, Xin Li, Bin Ma, and Paul M. B. Vitányi The Similarity Metric

A new class of distances appropriate for measuring similarity relations between sequences, say one type of similarity per distance, is studied. We propose a new “normalized information distance,” based on the noncomputable notion of Kolmogorov complexity, and show that it is in this class and it minorizes every computable distance in the class (that is, it is universal in that it discovers all computable similarities). We demonstrate that it is a metric and call it the similarity metric 

See also Rudi L Cilibrasi and Paul M B Vitányi Fast Phylogeny of SARS-CoV-2 by Compression.

The compression method to assess similarity, in the sense of having a small normalized compression distance (NCD), was developed based on algorithmic information theory to quantify the similarity in files ranging from words and languages to genomes and music pieces. ... The compression method is simpler and possibly faster than any other whole-genome method, which makes it the ideal tool to explore phylogeny. Here, we use it for the complex case of determining this similarity between the COVID-19 virus, SARS-CoV-2 and many other viruses. The resulting phylogeny and taxonomy closely resemble earlier results from by alignment-based methods and a machine-learning method, providing the most compelling evidence to date for the compression method, showing that one can achieve equivalent results both simply and quickly. 

Subscribe to Computerphile

Comments

Popular posts from this blog

Steven Johnson - So You Think You Know How to Take Derivatives?

Welsh Republic Podcast Talking With Kars Collective on Armenia Azerbaijan Conflict

Hitachi HD44780U LCD Display Fonts