Comparing (DNA) sequences is one of the core tasks in bioinformatics and the classic approach is to align these sequences. This is, however, a relative slow process and not always computationally feasible, especially if you want to compare more than two DNA sequences. An alternative approach is compare sequences based on their k-mer profiles.

A k-mer of a string $S$ is defined as any substring of $S$ of length $k$. For example, the DNA sequence AGCGTATCGATTCA has the following k-mers if $k=6$:


As you can see, obtaining all k-mers is easy: slide a window of size $k$ along your sequence, yielding a k-mer at each position. A sequence of length $L$ has $L - k + 1$ k-mers. A common task is to count how often each k-mer occurs and compare genomes based on these counts. The main idea is that similar genomes have similar k-mer counts1.

When dealing with the scale of genomes, storing counts for all these different k-mers can take up quite a lot of memory. First, the number of distinct k-mers grows exponentially with the length of $k$. In the case of DNA sequences, our alphabet size is 4: A, C, T, G. Then there are $4^k$ possibilities of length $k$. The value of $k$ depends on your application and organism, but values ranging from 5 to 32 are common.

Next, think how we would store the k-mer itself. We could store each letter as ASCII character, requiring 8-bits per character. This is a bit wasteful, however, because in DNA we only have 4 different characters. An optimisation would be to use 2 bits per character: A=00, C=01, T=10, G=11. This would allow us to store a k-mer of length 32 in a 64-bit integer. Still, this may not be good enough. I’ve seen cases for $k=23$ where it went up to more than 100 GB, and that’s quite a lot of memory even if you have access to a decent compute cluster.

This post will explain a technique described in the paper by Cleary et al.2 to reduce the memory consumption for storing k-mers. The main insight is that we often don’t need the exact count of each k-mer, and can take some shortcuts by missing some k-mers. Because of the exponential number of different k-mers and because genomes are often large, missing a few k-mers will not have a huge impact. Furthermore, when dealing with whole genome sequencing datasets, you also have to deal with sequencing error, and expect some k-mers to be false. In a lot of cases using approximate k-mer counts is appropriate.


On this date 65 years ago, February 1st 1953, the Netherlands experienced its greatest flood till date, the North Sea Flood of 1953. This is still one of the biggest disasters the Netherlands has ever experienced, with thousands of casualties and lots of people who lost their homes.

The Netherlands earns its name from the fact that large parts of the country lie below sea level. To make sure our country doesn’t flood, we have built lots of barriers, dams and dykes to keep the water out, and we want to prevent anything like the flood of 1953 from happening ever again.

In the beginning of this year, several of these barriers and dams were put to the test when a heavy storm reached the Netherlands which resulted in very high water levels. Our five biggest dams and barriers needed to be closed at the same time, a first since their construction. We can ask ourselves the question whether this will happen more often now that sea levels are rising due to global warming 1. A higher base line sea level increases the chance for even higher water levels when it storms. To get an idea what areas would be affected the most by a possible flood, we have created a visualisation project that shows the height of the Netherlands in comparison to the sea level.


Part of my Google Summer of Code project involves porting several arrow heads from Glumpy to Vispy. I also want to make a slight change to them: the arrow heads in Glumpy include an arrow body, I want to remove that to make sure you can put an arrow head on every type of line you want.

Making a change like that requires that you understand how those shapes are drawn. And for someone without a background in computer graphics this took some thorough investigation of the code and the techniques used. This article is aimed at people like me: good enough programming skills and linear algebra knowledge, but almost no former experience with OpenGL or computer graphics in general.


More Posts


PHASM: Haplotype-aware de novo genome assembly

A de novo genome assembler written in Python that leverages the assembly graph to output DNA sequences for each haplotype.

Raspberry Pi TLC5940 library

A C++ library to control the TLC5940 LED driver from your Raspberry Pi