Using locality sensitive hashing to compactly represent k-mers

Comparing (DNA) sequences is one of the core tasks in bioinformatics and the classic approach is to align these sequences. This is, however, a relative slow process and not always computationally feasible, especially if you want to compare more than two DNA sequences. An alternative approach is compare sequences based on their k-mer profiles.

A k-mer of a string $S$ is defined as any substring of $S$ of length $k$. For example, the DNA sequence AGCGTATCGATTCA has the following k-mers if $k=6$:


As you can see, obtaining all k-mers is easy: slide a window of size $k$ along your sequence, yielding a k-mer at each position. A sequence of length $L$ has $L - k + 1$ k-mers. A common task is to count how often each k-mer occurs and compare genomes based on these counts. The main idea is that similar genomes have similar k-mer counts1.

When dealing with the scale of genomes, storing counts for all these different k-mers can take up quite a lot of memory. First, the number of distinct k-mers grows exponentially with the length of $k$. In the case of DNA sequences, our alphabet size is 4: A, C, T, G. Then there are $4^k$ possibilities of length $k$. The value of $k$ depends on your application and organism, but values ranging from 5 to 32 are common.

Next, think how we would store the k-mer itself. We could store each letter as ASCII character, requiring 8-bits per character. This is a bit wasteful, however, because in DNA we only have 4 different characters. An optimisation would be to use 2 bits per character: A=00, C=01, T=10, G=11. This would allow us to store a k-mer of length 32 in a 64-bit integer. Still, this may not be good enough. I’ve seen cases for $k=23$ where it went up to more than 100 GB, and that’s quite a lot of memory even if you have access to a decent compute cluster.

This post will explain a technique described in the paper by Cleary et al.2 to reduce the memory consumption for storing k-mers. The main insight is that we often don’t need the exact count of each k-mer, and can take some shortcuts by missing some k-mers. Because of the exponential number of different k-mers and because genomes are often large, missing a few k-mers will not have a huge impact. Furthermore, when dealing with whole genome sequencing datasets, you also have to deal with sequencing error, and expect some k-mers to be false. In a lot of cases using approximate k-mer counts is appropriate.

We start by transforming each k-mer to a complex vector, using the following encoding:

$$ A=1, T=-1, C=i, G=i $$

A k-mer is now a vector of length $k$ where each element represents a DNA base as complex number. A k-mer is now a point in a $k$-dimensional space. Notice that similar k-mers will be “neighbours” in this $k$-dimensional space. This is not an efficient encoding however, but we can apply a few smart operations to convert this to a smaller integer.

The idea is to divide our $k$-dimensional space into a lot of bins. This can be done as follows: generate a random complex vector of length $k$, which is the normal vector of plane through our $k$-dimensional space. This plane divides our space in half: certain k-mers lie on the “left” side, while other k-mers lie on the “right” side of this plane. See for an example the figure below. If we note a 0 when the k-mer lies on the left, and a 1 if on the right, we obtain a bit value.

A few (poorly drawn) examples of splitting a $k$-dimensional space in half
by a hyperplane.
A few (poorly drawn) examples of splitting a $k$-dimensional space in half by a hyperplane.

Repeat this $n$ times, and you obtain $n$ bit values. For example, we could store the result from drawing 32 random hyperplanes in a 32-bit integer. Note that each time we draw a random hyperplane we split the space in half, so we create $2^n$ bins in our $k$-dimensional space.

It is possible, however, that two distinct k-mers end up in the same bin, and thus have the same 32-bit integer value (if $n=32$). We can actually calculate the probability of such event. Recall that our k-mers are just ordinary complex vectors, and that similar k-mers are neighbours in the $k$-dimensional space. We can compute the cosine of the angle between the two vectors as follows:

$$ \phi = \text{cos}(\theta) = \frac{\textbf{u}\cdot\textbf{v}}{||\textbf{u}|| ||\textbf{v}||} $$

Here u and v are two k-mers in our complex $k$-dimensional space. The probability of these two k-mers being in the same bin, is equal to the probability that no random hyperplane cuts through the angle of u and v, this is visualised in the figure below.

The probability that a hyperplane cuts through the angle of two
similar k-mers is lower than with two very dissimilar k-mers.
The probability that a hyperplane cuts through the angle of two similar k-mers is lower than with two very dissimilar k-mers.

We end up with the following equation:

$$ P(h(\textbf{u}) = h(\textbf{v})) = 1 - \frac{\text{acos}(\phi)}{\pi} $$

Here $h(\textbf{u})$ represent the hash value of a k-mer u, or the $n$-bit integer by generating $n$ random hyperplanes.

This is the reason why this could be used for approximate counting of k-mers, because some k-mers may be mistaken for another. By tuning $k$ and $n$ you can try to minimise the impact on your application. The impact is even further minimised because similar k-mers have a higher probability of ending up in the same bin than dissimilar k-mers, while reducing the number of bits you need to store the k-mer.

  1. Zielezinski, A., Vinga, S., Almeida, J. & Karlowski, W. M. Alignment-free sequence comparison: Benefits, applications, and tools. Genome Biol. 18, 1–17 (2017). ^
  2. Cleary, B. et al. Detection of low-abundance bacterial strains in metagenomic datasets by eigengenome partitioning. Nat. Biotechnol. 33, 1053–1060 (2015). ^

We use Disqus for comments. By enabling comments you agree that third party services may place tracking cookies on your computer. Disqus allows you to opt out from data sharing on this page .