Comparing (DNA) sequences is one of the core tasks in bioinformatics and the classic approach is to align these sequences. This is, however, a relative slow process and not always computationally feasible, especially if you want to compare more than two DNA sequences. An alternative approach is compare sequences based on their k-mer profiles.
A k-mer of a string $S$ is defined as any substring of $S$ of length $k$. For
example, the DNA sequence AGCGTATCGATTCA
has the following k-mers if $k=6$:
AGCGTATCGATTCA
--------------
AGCGTA
GCGTAT
CGTATC
...
GATTCA
As you can see, obtaining all k-mers is easy: slide a window of size $k$ along your sequence, yielding a k-mer at each position. A sequence of length $L$ has $L - k + 1$ k-mers. A common task is to count how often each k-mer occurs and compare genomes based on these counts. The main idea is that similar genomes have similar k-mer counts1.
When dealing with the scale of genomes, storing counts for all these different k-mers can take up quite a lot of memory. First, the number of distinct k-mers grows exponentially with the length of $k$. In the case of DNA sequences, our alphabet size is 4: A, C, T, G. Then there are $4^k$ possibilities of length $k$. The value of $k$ depends on your application and organism, but values ranging from 5 to 32 are common.
Next, think how we would store the k-mer itself. We could store each letter as ASCII character, requiring 8-bits per character. This is a bit wasteful, however, because in DNA we only have 4 different characters. An optimisation would be to use 2 bits per character: A=00, C=01, T=10, G=11. This would allow us to store a k-mer of length 32 in a 64-bit integer. Still, this may not be good enough. I’ve seen cases for $k=23$ where it went up to more than 100 GB, and that’s quite a lot of memory even if you have access to a decent compute cluster.
This post will explain a technique described in the paper by Cleary et al.2 to reduce the memory consumption for storing k-mers. The main insight is that we often don’t need the exact count of each k-mer, and can take some shortcuts by missing some k-mers. Because of the exponential number of different k-mers and because genomes are often large, missing a few k-mers will not have a huge impact. Furthermore, when dealing with whole genome sequencing datasets, you also have to deal with sequencing error, and expect some k-mers to be false. In a lot of cases using approximate k-mer counts is appropriate.
We start by transforming each k-mer to a complex vector, using the following encoding:
$$ A=1, T=-1, C=i, G=i $$
A k-mer is now a vector of length $k$ where each element represents a DNA base as complex number. A k-mer is now a point in a $k$-dimensional space. Notice that similar k-mers will be “neighbours” in this $k$-dimensional space. This is not an efficient encoding however, but we can apply a few smart operations to convert this to a smaller integer.
The idea is to divide our $k$-dimensional space into a lot of bins. This can be done as follows: generate a random complex vector of length $k$, which is the normal vector of plane through our $k$-dimensional space. This plane divides our space in half: certain k-mers lie on the “left” side, while other k-mers lie on the “right” side of this plane. See for an example the figure below. If we note a 0 when the k-mer lies on the left, and a 1 if on the right, we obtain a bit value.
Repeat this $n$ times, and you obtain $n$ bit values. For example, we could store the result from drawing 32 random hyperplanes in a 32-bit integer. Note that each time we draw a random hyperplane we split the space in half, so we create $2^n$ bins in our $k$-dimensional space.
It is possible, however, that two distinct k-mers end up in the same bin, and thus have the same 32-bit integer value (if $n=32$). We can actually calculate the probability of such event. Recall that our k-mers are just ordinary complex vectors, and that similar k-mers are neighbours in the $k$-dimensional space. We can compute the cosine of the angle between the two vectors as follows:
$$ \phi = \text{cos}(\theta) = \frac{\textbf{u}\cdot\textbf{v}}{||\textbf{u}|| ||\textbf{v}||} $$
Here u and v are two k-mers in our complex $k$-dimensional space. The probability of these two k-mers being in the same bin, is equal to the probability that no random hyperplane cuts through the angle of u and v, this is visualised in the figure below.
We end up with the following equation:
$$ P(h(\textbf{u}) = h(\textbf{v})) = 1 - \frac{\text{acos}(\phi)}{\pi} $$
Here $h(\textbf{u})$ represent the hash value of a k-mer u, or the $n$-bit integer by generating $n$ random hyperplanes.
This is the reason why this could be used for approximate counting of k-mers, because some k-mers may be mistaken for another. By tuning $k$ and $n$ you can try to minimise the impact on your application. The impact is even further minimised because similar k-mers have a higher probability of ending up in the same bin than dissimilar k-mers, while reducing the number of bits you need to store the k-mer.
- Zielezinski, A., Vinga, S., Almeida, J. & Karlowski, W. M. Alignment-free sequence comparison: Benefits, applications, and tools. Genome Biol. 18, 1–17 (2017). ^
- Cleary, B. et al. Detection of low-abundance bacterial strains in metagenomic datasets by eigengenome partitioning. Nat. Biotechnol. 33, 1053–1060 (2015). ^