The major component at this stage is a utility to align short reads to a reference genome, a common first step when analysing short-read data. An alignment can contain any number of SNPs, insertions, and deletions, up to a user-specified cutoff.
One cute thing I'm doing with this is some extreme bit-parallelism. A single alignment can be written as a logical expression: either two things align, or they don't (if they do, you go back and work out how, but this is infrequent and is therefore allowed to be slow). We could evaluate this logical expression separately for each pair of sequences we want to align, but it is possible to do much better. Instead, we can pack 32 (or 64, or 128, depending on architecture) independent alignments into a machine word and perform the operations on that. Bingo, 32x parallelism!
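As a minimal sketch of the idea (the data and names here are invented for illustration, not taken from any particular implementation): let bit j of a word record whether sequence pair j mismatches at a given position, and a single OR-reduction over positions then decides every alignment at once. Only four pairs are packed here, where a real run would use 32 or 64 per word.

```python
# Bit-parallel exact-match test: one alignment per bit of a word.
# Toy data: 4 read/reference pairs (a real word would hold 32+).
reads      = ["ACGT", "ACGA", "TCGT", "ACGT"]
references = ["ACGT", "ACGT", "ACGT", "ACCT"]

length = len(reads[0])

# For each position k, build a word whose bit j is 1 iff pair j
# mismatches at position k.
mismatch_words = []
for k in range(length):
    word = 0
    for j, (r, g) in enumerate(zip(reads, references)):
        if r[k] != g[k]:
            word |= 1 << j
    mismatch_words.append(word)

# A single OR-reduction evaluates all alignments at once:
# bit j of `any_mismatch` is 1 iff pair j fails to align exactly.
any_mismatch = 0
for word in mismatch_words:
    any_mismatch |= word

aligned = [not (any_mismatch >> j) & 1 for j in range(len(reads))]
print(aligned)  # pair 0 matches exactly; pairs 1-3 each mismatch somewhere
```

The inner loops over positions and pairs run once per batch to build the packed words; the payoff is that the reduction itself touches each word only once, regardless of how many alignments it holds.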
We may be able to do better still with vector operations. Even with bit-parallelism, we can still get pipeline stalls waiting for operations to complete. It may actually be better to work with many words' worth of alignments at once. Performing a vector logical operation isn't going to stall the pipeline, and the CPU will easily predict what data it needs to load ahead of time. Any housekeeping gets amortised across more alignments. At the very least, it will be no worse to do this.
This assumes we are able to pump data in and out of memory as fast as we can operate on it. From what I have read on the memory hierarchy (someone please correct me if I am wrong), while latency gets progressively worse the further down the cache hierarchy one goes, throughput only starts to suffer once you hit main memory. And with vector operations, it's throughput we're interested in: when your vectors are large enough, latency doesn't matter.
Suddenly, Python and Numpy become a competitive option.
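To illustrate why (the batch sizes and the single planted mismatch below are invented for the example), here is the same OR-reduction done over many packed words at once with NumPy. One `np.bitwise_or.reduce` call sweeps the whole batch, and the Python interpreter overhead is amortised across tens of thousands of alignments.

```python
import numpy as np

# A hypothetical batch: 36 positions x 1024 uint64 words, i.e. 65,536
# independent alignments packed one-per-bit. Filled with zeros
# (all-match) except one planted mismatch, purely for illustration.
read_length, n_words = 36, 1024
mismatch = np.zeros((read_length, n_words), dtype=np.uint64)
mismatch[17, 3] = 1 << 5  # the alignment at (word 3, bit 5) mismatches at position 17

# One vectorized OR-reduction over the position axis evaluates
# the entire batch; element w holds the combined flags for the
# 64 alignments packed into word w.
any_mismatch = np.bitwise_or.reduce(mismatch, axis=0)

# Any bit still zero after the reduction is an exact match.
print(int(any_mismatch[3]))  # only bit 5 of word 3 is set
```

The inner loop over positions and words happens in compiled NumPy code, which is exactly where the "Python becomes competitive" claim comes from.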
The other cute thing: It really screams on a PlayStation 3. I'm trying to convince my department that it needs a cluster of them.