Comparing (DNA) sequences is one of the core tasks in bioinformatics and the classic approach is to align these sequences. This is, however, a relative slow process and not always computationally feasible, especially if you want to compare more than two DNA sequences. An alternative approach is compare sequences based on their k-mer profiles.
A k-mer of a string $S$ is defined as any substring of $S$ of length $k$. For
example, the DNA sequence
AGCGTATCGATTCA has the following k-mers if $k=6$:
AGCGTATCGATTCA -------------- AGCGTA GCGTAT CGTATC ... GATTCA
As you can see, obtaining all k-mers is easy: slide a window of size $k$ along your sequence, yielding a k-mer at each position. A sequence of length $L$ has $L - k + 1$ k-mers. A common task is to count how often each k-mer occurs and compare genomes based on these counts. The main idea is that similar genomes have similar k-mer counts1.
When dealing with the scale of genomes, storing counts for all these different k-mers can take up quite a lot of memory. First, the number of distinct k-mers grows exponentially with the length of $k$. In the case of DNA sequences, our alphabet size is 4: A, C, T, G. Then there are $4^k$ possibilities of length $k$. The value of $k$ depends on your application and organism, but values ranging from 5 to 32 are common.
Next, think how we would store the k-mer itself. We could store each letter as ASCII character, requiring 8-bits per character. This is a bit wasteful, however, because in DNA we only have 4 different characters. An optimisation would be to use 2 bits per character: A=00, C=01, T=10, G=11. This would allow us to store a k-mer of length 32 in a 64-bit integer. Still, this may not be good enough. I’ve seen cases for $k=23$ where it went up to more than 100 GB, and that’s quite a lot of memory even if you have access to a decent compute cluster.
This post will explain a technique described in the paper by Cleary et al.2 to reduce the memory consumption for storing k-mers. The main insight is that we often don’t need the exact count of each k-mer, and can take some shortcuts by missing some k-mers. Because of the exponential number of different k-mers and because genomes are often large, missing a few k-mers will not have a huge impact. Furthermore, when dealing with whole genome sequencing datasets, you also have to deal with sequencing error, and expect some k-mers to be false. In a lot of cases using approximate k-mer counts is appropriate.