Tutorial | Lucas van Dijk

Using locality sensitive hashing to compactly represent k-mers

Tutorial Mar 26, 2018 5 min read

Comparing (DNA) sequences is one of the core tasks in bioinformatics and the classic approach is to align these sequences. This is, however, a relative slow process and not always computationally feasible, especially if you want to compare more than two DNA sequences. An alternative approach is compare sequences based on their k-mer profiles.

A k-mer of a string $S$ is defined as any substring of $S$ of length $k$. For example, the DNA sequence AGCGTATCGATTCA has the following k-mers if $k=6$:

AGCGTATCGATTCA
--------------
AGCGTA
 GCGTAT
  CGTATC
   ...
        GATTCA

As you can see, obtaining all k-mers is easy: slide a window of size $k$ along your sequence, yielding a k-mer at each position. A sequence of length $L$ has $L - k + 1$ k-mers. A common task is to count how often each k-mer occurs and compare genomes based on these counts. The main idea is that similar genomes have similar k-mer counts¹.

When dealing with the scale of genomes, storing counts for all these different k-mers can take up quite a lot of memory. First, the number of distinct k-mers grows exponentially with the length of $k$. In the case of DNA sequences, our alphabet size is 4: A, C, T, G. Then there are $4^k$ possibilities of length $k$. The value of $k$ depends on your application and organism, but values ranging from 5 to 32 are common.

Next, think how we would store the k-mer itself. We could store each letter as ASCII character, requiring 8-bits per character. This is a bit wasteful, however, because in DNA we only have 4 different characters. An optimisation would be to use 2 bits per character: A=00, C=01, T=10, G=11. This would allow us to store a k-mer of length 32 in a 64-bit integer. Still, this may not be good enough. I’ve seen cases for $k=23$ where it went up to more than 100 GB, and that’s quite a lot of memory even if you have access to a decent compute cluster.

This post will explain a technique described in the paper by Cleary et al.² to reduce the memory consumption for storing k-mers. The main insight is that we often don’t need the exact count of each k-mer, and can take some shortcuts by missing some k-mers. Because of the exponential number of different k-mers and because genomes are often large, missing a few k-mers will not have a huge impact. Furthermore, when dealing with whole genome sequencing datasets, you also have to deal with sequencing error, and expect some k-mers to be false. In a lot of cases using approximate k-mer counts is appropriate.

Drawing arbitrary shapes with OpenGL points

Tutorial Jun 13, 2015 10 min read

Part of my Google Summer of Code project involves porting several arrow heads from Glumpy to Vispy. I also want to make a slight change to them: the arrow heads in Glumpy include an arrow body, I want to remove that to make sure you can put an arrow head on every type of line you want.

Making a change like that requires that you understand how those shapes are drawn. And for someone without a background in computer graphics this took some thorough investigation of the code and the techniques used. This article is aimed at people like me: good enough programming skills and linear algebra knowledge, but almost no former experience with OpenGL or computer graphics in general.

Controlling a servo with an AVR microcontroller

Tutorial Oct 16, 2012 5 min read

Welcome to my first AVR tutorial on this site! We’ll be doing something basic today, namely controlling a servo. There are a lot of tutorials on how to control it with an Arduino, but less tutorials using only a bare AVR chip. In this tutorial we’ll be using the ATTiny44, a small and cheap microprocessor, which also contains a 16 bit timer, which will make our life a bit easier.

Servos are often used to move robot arms and things alike, because they can rotate a specific amount of degrees very precisely, depending on the pulsewidth you feed it with the microcontroller. They can also be used as a motor (you’ll need special ‘continuous rotation’ servos for that), you’ll often find them in RC cars.

So lets get started, and see how you actually control a servo!

Multithreading with C++11: protecting data

Tutorial Jun 14, 2012 9 min read

Welcome to this second part in a series of articles about multithreading with C++11. In the previous part, I briefly explained what a thread is, and how to create one with the new C++ thread library. This time, we will be writing a lot more code, so open up your favourite IDE if you want to try the examples while you’re reading.

In the previous article we also saw that sometimes, the output wasn’t completely right when running multiple threads simultaneously. Today, we’ll see that there are some other problems with sharing a resource between threads, and of course, provide some solutions to these problems.

Introduction to threads with C++11

Tutorial May 3, 2012 6 min read

The free lunch is over. The time that our complex algorithm was running extremely slow on a computer, but ran extremely fast a few years later, because the processor speeds exploded, is gone. The trend with current processors is to add more cores, rather than increasing the clock frequency.

As programmer, you should be aware of this. Of course, processors will always perform a bit better each year, but the growth in performance is slowing down. Currently, a lot of programs can benefit the most by using multiple threads, because of today’s multicore processors.

In this article I’ll briefly explain what a thread is, and how you can create them with the new threading library in C++11. I’m planning to write multiple articles about this topic, each going a little bit more in depth.