Using locality sensitive hashing to compactly represent k-mers
K-mers are frequently used in bioinformatics to profile and compare genomes. This article describes a technique for compactly representing long k-mers with locality-sensitive hashing.
Comparing (DNA) sequences is one of the core tasks in bioinformatics, and the classic approach is to align these sequences. This is, however, a relatively slow process and not always computationally feasible, especially if you want to compare more than two DNA sequences. An alternative approach is to compare sequences based on their k-mer profiles.
A k-mer of a string $S$ is defined as any substring of $S$ of length $k$. For example, the DNA sequence AGCGTATCGATTCA has the following k-mers if $k=6$:
As you can see, obtaining all k-mers is easy: slide a window of size $k$ along your sequence, yielding a k-mer at each position. A sequence of length $L$ has $L−k+1$ k-mers. A common task is to count how often each k-mer occurs and compare genomes based on these counts. The main idea is that similar genomes have similar k-mer counts.
When dealing with the scale of genomes, storing counts for all these different k-mers can take up quite a lot of memory. First, the number of distinct k-mers grows exponentially with the length of $k$. In the case of DNA sequences, our alphabet size is 4: A, C, T, G. Therefore, there are $4^k$ possibilities of length $k$. The value of $k$ depends on your application and organism, but values ranging from 5 to 32 are common.
Next, think how we would store the k-mer itself. We could store each letter as ASCII character, requiring 8-bits per character. However, this is a bit wasteful because we only have four characters in DNA. An optimisation would be to use 2 bits per character: A=00, C=01, T=10, G=11. This would allow us to store a k-mer of length 32 in a 64-bit integer. Still, this may not be good enough. I’ve seen cases for $k=23$ where it went up to more than 100 GB, and that’s quite a lot of memory even if you have access to a decent compute cluster.
This post will explain a technique described in the paper by Cleary et al. for to reduce the memory consumption for storing k-mers. The main insight is that we often don’t need the exact count of each k-mer, and can take some shortcuts by missing some k-mers. Because of the exponential number of different k-mers and because genomes are often large, missing a few k-mers will not have a huge impact. Furthermore, when dealing with whole genome sequencing datasets, we must also deal with sequencing errors and expect some k-mers to be false. In a lot of cases, using approximate k-mer counts is appropriate.
We start by transforming each k-mer to a complex vector, using the following encoding: $$A=1,T=−1,C=i,G=i$$
A k-mer is now a vector of length $k$, where each element represents a DNA base as a complex number. A k-mer is now a point in a k-dimensional space. Notice that similar k-mers will be “neighbours” in this k-dimensional space. This is not an efficient encoding, but we can apply a few smart operations to convert this to a smaller integer.
The idea is to divide our k-dimensional space into a lot of bins. This can be done as follows: generate a random complex vector of length k, which can then be interpreted as the normal vector of a plane through our k-dimensional space. This plane divides our space in half: certain k-mers lie on the “left” side, while other k-mers lie on the “right” side of this plane. See the figure below for an example. If we note a zero when the k-mer lies on the left and a one if on the right, we obtain a bit value.
Figure 1. A few (poorly drawn) examples of splitting a k-dimensional space in half by a hyperplane.
Repeat this n times, and you obtain n bit values. For example, we could store the result from drawing 32 random hyperplanes in a 32-bit integer. Note that we split the space in half each time we draw a random hyperplane, creating $2^n$ bins in our $k$-dimensional space.
It is possible, however, that two distinct k-mers end up in the same bin and thus have the same 32-bit integer value (if $n$=32). We can actually calculate the probability of such an event. Recall that our k-mers are just ordinary complex vectors and that similar k-mers are neighbours in the k-dimensional space. We can compute the cosine of the angle between the two vectors as follows: $$ \phi = \cos(\theta) = \frac{\textbf{u}\cdot\textbf{v}}{||\textbf{u}||||\textbf{v}||} $$
Here u and v are two k-mers in our complex k-dimensional space. The probability of these two k-mers being in the same bin, is equal to the probability that no random hyperplane cuts through the angle of u and v. This is visualised in the figure below.
Figure 2. The probability that a hyperplane cuts through the angle of two similar k-mers is lower than with two very dissimilar k-mers.
We end up with the following equation: $$ P(h(\textbf{u}) = h(\textbf{v})) = 1 - \frac{\arccos(\phi)}{\pi} $$
Here, $h(u)$ represents the hash value of a k-meru, or the n-bit integer by generating $n$ random hyperplanes.
This is the reason why this could be used for approximate counting of k-mers, because some k-mers may be mistaken for another. By tuning $k$ and $n$ you can try to minimize the impact on your application. The effect is even further minimized because similar k-mers have a higher probability of ending up in the same bin than dissimilar k-mers, while reducing the number of bits you need to store the k-mer.
Partial order alignment is a common method to compute multiple sequence alignments through iterative sequence-to-graph alignment. In this paper, we introduce a new algorithm that accelerates alignment by exploiting graph topology and shared sequence between the query and the graph.
More than half of the women in the US get a urinary tract infection (UTI) in their lifetime, which frequently becomes recurrent. In this paper, we investigated the role of the gut microbiome in facilitating recurrence.
StrainGE is a toolkit for tracking and characterizing low-abundance strains in complex microbial communities. It enables detailed insights into the bacterial strain-level diversity of whole metagenomic sequencing samples.
Comparing (DNA) sequences is one of the core tasks in bioinformatics, and the classic approach is to align these sequences. This is, however, a relatively slow process and not always computationally feasible, especially if you want to compare more than two DNA sequences. An alternative approach is to compare sequences based on their k-mer profiles.
A k-mer of a string $S$ is defined as any substring of $S$ of length $k$. For example, the DNA sequence
AGCGTATCGATTCA
has the following k-mers if $k=6$:As you can see, obtaining all k-mers is easy: slide a window of size $k$ along your sequence, yielding a k-mer at each position. A sequence of length $L$ has $L−k+1$ k-mers. A common task is to count how often each k-mer occurs and compare genomes based on these counts. The main idea is that similar genomes have similar k-mer counts.
When dealing with the scale of genomes, storing counts for all these different k-mers can take up quite a lot of memory. First, the number of distinct k-mers grows exponentially with the length of $k$. In the case of DNA sequences, our alphabet size is 4: A, C, T, G. Therefore, there are $4^k$ possibilities of length $k$. The value of $k$ depends on your application and organism, but values ranging from 5 to 32 are common.
Next, think how we would store the k-mer itself. We could store each letter as ASCII character, requiring 8-bits per character. However, this is a bit wasteful because we only have four characters in DNA. An optimisation would be to use 2 bits per character: A=00, C=01, T=10, G=11. This would allow us to store a k-mer of length 32 in a 64-bit integer. Still, this may not be good enough. I’ve seen cases for $k=23$ where it went up to more than 100 GB, and that’s quite a lot of memory even if you have access to a decent compute cluster.
This post will explain a technique described in the paper by Cleary et al. for to reduce the memory consumption for storing k-mers. The main insight is that we often don’t need the exact count of each k-mer, and can take some shortcuts by missing some k-mers. Because of the exponential number of different k-mers and because genomes are often large, missing a few k-mers will not have a huge impact. Furthermore, when dealing with whole genome sequencing datasets, we must also deal with sequencing errors and expect some k-mers to be false. In a lot of cases, using approximate k-mer counts is appropriate.
We start by transforming each k-mer to a complex vector, using the following encoding: $$A=1,T=−1,C=i,G=i$$
A k-mer is now a vector of length $k$, where each element represents a DNA base as a complex number. A k-mer is now a point in a k-dimensional space. Notice that similar k-mers will be “neighbours” in this k-dimensional space. This is not an efficient encoding, but we can apply a few smart operations to convert this to a smaller integer.
The idea is to divide our k-dimensional space into a lot of bins. This can be done as follows: generate a random complex vector of length k, which can then be interpreted as the normal vector of a plane through our k-dimensional space. This plane divides our space in half: certain k-mers lie on the “left” side, while other k-mers lie on the “right” side of this plane. See the figure below for an example. If we note a zero when the k-mer lies on the left and a one if on the right, we obtain a bit value.
Repeat this n times, and you obtain n bit values. For example, we could store the result from drawing 32 random hyperplanes in a 32-bit integer. Note that we split the space in half each time we draw a random hyperplane, creating $2^n$ bins in our $k$-dimensional space.
It is possible, however, that two distinct k-mers end up in the same bin and thus have the same 32-bit integer value (if $n$=32). We can actually calculate the probability of such an event. Recall that our k-mers are just ordinary complex vectors and that similar k-mers are neighbours in the k-dimensional space. We can compute the cosine of the angle between the two vectors as follows: $$ \phi = \cos(\theta) = \frac{\textbf{u}\cdot\textbf{v}}{||\textbf{u}||||\textbf{v}||} $$
Here u and v are two k-mers in our complex k-dimensional space. The probability of these two k-mers being in the same bin, is equal to the probability that no random hyperplane cuts through the angle of u and v. This is visualised in the figure below.
We end up with the following equation: $$ P(h(\textbf{u}) = h(\textbf{v})) = 1 - \frac{\arccos(\phi)}{\pi} $$
Here, $h(u)$ represents the hash value of a k-mer u, or the n-bit integer by generating $n$ random hyperplanes.
This is the reason why this could be used for approximate counting of k-mers, because some k-mers may be mistaken for another. By tuning $k$ and $n$ you can try to minimize the impact on your application. The effect is even further minimized because similar k-mers have a higher probability of ending up in the same bin than dissimilar k-mers, while reducing the number of bits you need to store the k-mer.
Read Next
Fast and exact gap-affine partial order alignment with POASTA
Partial order alignment is a common method to compute multiple sequence alignments through iterative sequence-to-graph alignment. In this paper, we introduce a new algorithm that accelerates alignment by exploiting graph topology and shared sequence between the query and the graph.
Gut microbiome dysbiosis linked with recurrent UTIs
More than half of the women in the US get a urinary tract infection (UTI) in their lifetime, which frequently becomes recurrent. In this paper, we investigated the role of the gut microbiome in facilitating recurrence.
StrainGE: Strain Genome Explorer
StrainGE is a toolkit for tracking and characterizing low-abundance strains in complex microbial communities. It enables detailed insights into the bacterial strain-level diversity of whole metagenomic sequencing samples.
Visualizing the height of the Netherlands