Tutorial

Using locality sensitive hashing to compactly represent k-mers

Comparing (DNA) sequences is one of the core tasks in bioinformatics and the classic approach is to align these sequences. This is, however, a relative slow process and not always computationally feasible, especially if you want to compare more than two DNA sequences. An alternative approach is compare sequences based on their k-mer profiles.

A k-mer of a string $S$ is defined as any substring of $S$ of length $k$. For example, the DNA sequence AGCGTATCGATTCA has the following k-mers if $k=6$:

AGCGTATCGATTCA
--------------
AGCGTA
GCGTAT
CGTATC
...
GATTCA


As you can see, obtaining all k-mers is easy: slide a window of size $k$ along your sequence, yielding a k-mer at each position. A sequence of length $L$ has $L - k + 1$ k-mers. A common task is to count how often each k-mer occurs and compare genomes based on these counts. The main idea is that similar genomes have similar k-mer counts1.

When dealing with the scale of genomes, storing counts for all these different k-mers can take up quite a lot of memory. First, the number of distinct k-mers grows exponentially with the length of $k$. In the case of DNA sequences, our alphabet size is 4: A, C, T, G. Then there are $4^k$ possibilities of length $k$. The value of $k$ depends on your application and organism, but values ranging from 5 to 32 are common.

Next, think how we would store the k-mer itself. We could store each letter as ASCII character, requiring 8-bits per character. This is a bit wasteful, however, because in DNA we only have 4 different characters. An optimisation would be to use 2 bits per character: A=00, C=01, T=10, G=11. This would allow us to store a k-mer of length 32 in a 64-bit integer. Still, this may not be good enough. I’ve seen cases for $k=23$ where it went up to more than 100 GB, and that’s quite a lot of memory even if you have access to a decent compute cluster.

This post will explain a technique described in the paper by Cleary et al.2 to reduce the memory consumption for storing k-mers. The main insight is that we often don’t need the exact count of each k-mer, and can take some shortcuts by missing some k-mers. Because of the exponential number of different k-mers and because genomes are often large, missing a few k-mers will not have a huge impact. Furthermore, when dealing with whole genome sequencing datasets, you also have to deal with sequencing error, and expect some k-mers to be false. In a lot of cases using approximate k-mer counts is appropriate.

Drawing arbitrary shapes with OpenGL points

Part of my Google Summer of Code project involves porting several arrow heads from Glumpy to Vispy. I also want to make a slight change to them: the arrow heads in Glumpy include an arrow body, I want to remove that to make sure you can put an arrow head on every type of line you want.

Making a change like that requires that you understand how those shapes are drawn. And for someone without a background in computer graphics this took some thorough investigation of the code and the techniques used. This article is aimed at people like me: good enough programming skills and linear algebra knowledge, but almost no former experience with OpenGL or computer graphics in general.

Controlling a servo with an AVR microcontroller

Welcome to my first AVR tutorial on this site! We’ll be doing something basic today, namely controlling a servo. There are a lot of tutorials on how to control it with an Arduino, but less tutorials using only a bare AVR chip. In this tutorial we’ll be using the ATTiny44, a small and cheap microprocessor, which also contains a 16 bit timer, which will make our life a bit easier.

Servos are often used to move robot arms and things alike, because they can rotate a specific amount of degrees very precisely, depending on the pulsewidth you feed it with the microcontroller. They can also be used as a motor (you’ll need special ‘continuous rotation’ servos for that), you’ll often find them in RC cars.

So lets get started, and see how you actually control a servo!