Problem put to rest

Published : Jul 08, 2015 12:30 IST

Comparing the genomes of different species,or different members of the same species,is the basis of a great deal of modern biology. DNA sequences that are conserved across species are likely to be functionally important, while variations between members of the same species can indicate different susceptibilities to disease.

The basic algorithm for determining how much two sequences of symbols in the genomic sequences have in common—called the “edit distance” between them—is now more than 40 years old. Equivalently, it is the minimum number of edits—deletions, insertions, and substitutions—required to turn one string into another. And for more than 40 years, computer science researchers have been trying to improve upon it, without much success.

At the ACM Symposium on Theory of Computing (STOC), researchers from the Massachusetts Institute of Technology (MIT) reported that, in all likelihood, the algorithm is as good as it gets. The proof that this 40 year-old algorithm is the best possible will come as a relief to computer scientists.

This implies that if the widely held assumption about computational complexity is correct, then the problem of measuring the difference between two genomes—or texts, or speech samples, or anything else that can be represented as a string of symbols—cannot be solved more efficiently.

In a sense, that is disappointing, since a computer running the existing algorithm would take 1,000 years to exhaustively compare two human genomes. But it also means that computer scientists can stop agonising about whether they can do better.

“This edit distance is something that I’ve been trying to get better algorithms for since I was a graduate student, in the mid-’90s,” said Piotr Indyk, a professor of computer science and engineering at the MIT and co-author of the STOC paper. “I certainly spent lots of late nights on that—without any progress whatsoever. So at least now there’s a feeling of closure.”

The standard algorithm for determining edit distance is known as the Wagner-Fischer algorithm. Computer scientists measure algorithmic efficiency as computation time relative to the number of elements the algorithm manipulates. For the Wagner-Fischer algorithm the running time is proportional to the product of the lengths of the two strings it is considering. Double the lengths of the strings, and the running time quadruples. In computer parlance, the algorithm runs in quadratic time.

Compiled by R. Ramachandran