<?xml version="1.0" encoding="utf-8"?>
<feed version="0.3" xmlns="http://purl.org/atom/ns#">
<link rel="alternate" type="text/html" href="https://logarithmic.net/pfh/blog_code"/>

<title>pfh's blog: code section</title>
<modified>2012-10-28T20:30:14Z</modified>
<tagline>Code, largely Pythonic</tagline>
<author><name>Paul Harrison</name><email>pfh@logarithmic.net</email></author>
<entry>
<title>Demakein 0.2 release: shawms</title>
<issued>2012-10-28T20:30:14Z</issued>
<modified>2012-10-28T20:30:14Z</modified>
<id>https://logarithmic.net/pfh/blog/01351456214</id>
<link rel="alternate" type="text/html" href="https://logarithmic.net/pfh/blog/01351456214"/>
<content type="text/html" mode="escaped">

I've just released version 0.2 of my woodwind instrument maker. This version adds a shawm designer/maker. The bore shape optimization for the shawms was rather tricky, so the numerical optimzer has also undergone a complete overhaul.

&lt;p&gt;&lt;ul&gt;&lt;li&gt;&lt;a href=&quot;https://logarithmic.net/pfh/design&quot;&gt;Demakein homepage&lt;/a&gt;&lt;/li&gt;&lt;/ul&gt;
&lt;ul&gt;&lt;li&gt;&lt;a href=&quot;http://pypi.python.org/pypi/demakein&quot;&gt;Demakein PyPI page&lt;/a&gt;&lt;/li&gt;&lt;/ul&gt;

&lt;p&gt;Having printed a shawm, you can use my drinking-straw reed making method to make a reed for it:

&lt;p&gt;&lt;ul&gt;&lt;li&gt;&lt;a href=&quot;http://www.instructables.com/id/Practice-double-reed-from-a-drinking-straw/?ALLSTEPS&quot;&gt;Drinking straw reed making on Instructable&lt;/a&gt;&lt;/li&gt;&lt;/ul&gt;

&lt;p&gt;
&lt;div style=&quot;text-align: left&quot;&gt;
&lt;a href=&quot;http://www.flickr.com/photos/pfh/8130136303/&quot; title=&quot;Alto tapered pflute, wood by Paul Francis Harrison, on Flickr&quot;&gt;&lt;img src=&quot;http://farm9.staticflickr.com/8189/8130136303_4e5b2c146f_n.jpg&quot; width=&quot;320&quot; height=&quot;214&quot; alt=&quot;Alto tapered pflute, wood&quot;&gt;&lt;/a&gt;

&lt;a href=&quot;http://www.flickr.com/photos/pfh/8130135299/&quot; title=&quot;Alto tapered pflute, wood by Paul Francis Harrison, on Flickr&quot;&gt;&lt;img src=&quot;http://farm9.staticflickr.com/8465/8130135299_07c99db17b_n.jpg&quot; width=&quot;320&quot; height=&quot;214&quot; alt=&quot;Alto tapered pflute, wood&quot;&gt;&lt;/a&gt;

&lt;a href=&quot;http://www.flickr.com/photos/pfh/8130156180/&quot; title=&quot;Alto straight folk flute, ABS plastic by Paul Francis Harrison, on Flickr&quot;&gt;&lt;img src=&quot;http://farm9.staticflickr.com/8329/8130156180_977fb3b343_n.jpg&quot; width=&quot;320&quot; height=&quot;214&quot; alt=&quot;Alto straight folk flute, ABS plastic&quot;&gt;&lt;/a&gt;

&lt;a href=&quot;http://www.flickr.com/photos/pfh/8130131027/&quot; title=&quot;Alto straight folk flute, ABS plastic by Paul Francis Harrison, on Flickr&quot;&gt;&lt;img src=&quot;http://farm9.staticflickr.com/8053/8130131027_46505bfa24_n.jpg&quot; width=&quot;320&quot; height=&quot;214&quot; alt=&quot;Alto straight folk flute, ABS plastic&quot;&gt;&lt;/a&gt;

&lt;a href=&quot;http://www.flickr.com/photos/pfh/8130217419/&quot; title=&quot;Alto shawm, ABS plastic by Paul Francis Harrison, on Flickr&quot;&gt;&lt;img src=&quot;http://farm9.staticflickr.com/8050/8130217419_0d6171ec50_n.jpg&quot; width=&quot;320&quot; height=&quot;214&quot; alt=&quot;Alto shawm, ABS plastic&quot;&gt;&lt;/a&gt;

&lt;a href=&quot;http://www.flickr.com/photos/pfh/8130244686/&quot; title=&quot;Alto shawm, ABS plastic by Paul Francis Harrison, on Flickr&quot;&gt;&lt;img src=&quot;http://farm9.staticflickr.com/8323/8130244686_9428e41904_n.jpg&quot; width=&quot;320&quot; height=&quot;214&quot; alt=&quot;Alto shawm, ABS plastic&quot;&gt;&lt;/a&gt;

&lt;a href=&quot;http://www.flickr.com/photos/pfh/8130238830/&quot; title=&quot;Soprano tapered folk flute, black strong and flexible by Paul Francis Harrison, on Flickr&quot;&gt;&lt;img src=&quot;http://farm9.staticflickr.com/8196/8130238830_346c0cdb85_n.jpg&quot; width=&quot;320&quot; height=&quot;214&quot; alt=&quot;Soprano tapered folk flute, black strong and flexible&quot;&gt;&lt;/a&gt;

&lt;/div&gt;

</content>
</entry>
<entry>
<title>Announcing Demakein</title>
<issued>2012-10-02T11:59:56Z</issued>
<modified>2012-10-02T11:59:56Z</modified>
<id>https://logarithmic.net/pfh/blog/01349179196</id>
<link rel="alternate" type="text/html" href="https://logarithmic.net/pfh/blog/01349179196"/>
<content type="text/html" mode="escaped">

I have finally gotten the latest iteration of my woodwind design software to a state where I can release it.

&lt;p&gt;The notable feature of this iteration is that it is able to produce shapes for 3D printing or CNC milling.

&lt;p&gt;&lt;ul&gt;&lt;li&gt;&lt;a href=&quot;https://logarithmic.net/pfh/design&quot;&gt;Demakein&lt;/a&gt;&lt;/li&gt;&lt;/ul&gt;
&lt;ul&gt;&lt;li&gt;&lt;a href=&quot;http://pypi.python.org/pypi/demakein&quot;&gt;Demakein on PyPI&lt;/a&gt;&lt;/li&gt;&lt;/ul&gt;
&lt;ul&gt;&lt;li&gt;&lt;a href=&quot;http://www.thingiverse.com/pfh/things&quot;&gt;Some instrument designs on Thingiverse&lt;/a&gt;&lt;/li&gt;&lt;/ul&gt;

&lt;p&gt;Demakein can currently design flutes in a number of styles. I will be adding other instruments such as shawms and reedpipes in the next few weeks.

&lt;p&gt;Demakein uses the &lt;a href=&quot;http://www.vicbioinformatics.com/software.nesoni.shtml&quot;&gt;the nesoni tool writing framework&lt;/a&gt; to provide a collection of tools that can either be run from the command line or from Python, and that can be easily customized by subclassing in Python. For example you could subclass one of the flute designer tools and adjust some of its parameters.

&lt;p&gt;
&lt;a href=&quot;http://www.flickr.com/photos/pfh/8046710845/&quot; title=&quot;Soprano pflute, 3D printed by Paul Francis Harrison, on Flickr&quot;&gt;&lt;img src=&quot;http://farm9.staticflickr.com/8037/8046710845_dffb11d3d5.jpg&quot; width=&quot;500&quot; height=&quot;333&quot; alt=&quot;Soprano pflute, 3D printed&quot;&gt;&lt;/a&gt;
</content>
</entry>
<entry>
<title>Talking to C++ for the incurably lazy</title>
<issued>2012-09-23T00:13:24Z</issued>
<modified>2012-09-23T00:13:24Z</modified>
<id>https://logarithmic.net/pfh/blog/01348359204</id>
<link rel="alternate" type="text/html" href="https://logarithmic.net/pfh/blog/01348359204"/>
<content type="text/html" mode="escaped">

&lt;div style=&quot;float: right; padding: 1em;&quot;&gt;&lt;img src=&quot;https://logarithmic.net/pfh-files/blog/01348359204/segment-1-of-3-sketch.png&quot;&gt;&lt;/div&gt;
So I had been eyeing the 
&lt;a href=&quot;http://www.cgal.org/&quot;&gt;Computational Geometry Algorithms Library&lt;/a&gt; for some time.
It is a very beautiful thing, it's the kind of thing C++ was designed to allow and it really shows C++'s full potential. 
It's the library that powers the popular CAD program &lt;a href=&quot;http://www.openscad.org/&quot;&gt;OpenSCAD&lt;/a&gt;, amongst other things.
However, I rather like Python.

&lt;p&gt;I saw that the makers of PyPy had come out with a 
&lt;a href=&quot;http://cffi.readthedocs.org/en/latest/index.html&quot;&gt;C Foreign Function Interface&lt;/a&gt;. Again the stuff they're doing with PyPy is very beautiful, and this is characteristically elegant, it takes Python to first-class citizen status in the programming world, and under PyPy has the potential to be blisteringly fast.

&lt;p&gt;But CGAL is not written in C, could never be written in C, is deeply, inextricably a thing of C++. C might sometimes be enough for code that internally uses C++, but here I really needed a way to talk C++ like a native. Now there are &lt;a href=&quot;http://morepypy.blogspot.com.au/2010/07/cern-sprint-report-wrapping-c-libraries.html&quot;&gt;plans&lt;/a&gt; to collide Python and C++ using libraries used in high energy physics, but these are not yet in an easily usable state. So in the meantime I came up my own. 

&lt;p&gt;It is not beautiful.

&lt;p&gt;But it gets the job done. It compiles snippets of code to interface C to C++ on the fly, using cmake. The compiled code is loaded on the fly using CFFI. Return types are inferred using typeid() and a utility called &amp;quot;c++filt&amp;quot;. Simple types are converted to python equivalents, more complex types are boxed.

&lt;p&gt;&lt;ul&gt;&lt;li&gt;&lt;a href=&quot;http://bazaar.launchpad.net/~paul-francis-harrison/+junk/design-instrument/view/head:/make-instrument/cpp.py&quot;&gt;cpp.py&lt;/a&gt;&lt;/li&gt;&lt;/ul&gt;

&lt;p&gt;Goodbye Blender, with your inexact and fallible floating point numbers and your rough-and-ready meshes. Hello mind-melting perfect world of CGAL!

&lt;p&gt;&lt;img src=&quot;https://logarithmic.net/pfh-files/blog/01348359204/mill-lower2-sketch.png&quot;&gt;
</content>
</entry>
<entry>
<title>Software to position the finger holes on a wind instrument</title>
<issued>2010-03-27T15:02:55Z</issued>
<modified>2010-03-27T15:02:55Z</modified>
<id>https://logarithmic.net/pfh/blog/01269702175</id>
<link rel="alternate" type="text/html" href="https://logarithmic.net/pfh/blog/01269702175"/>
<content type="text/html" mode="escaped">

I wrote some software to calculate the positions of finger holes on a cylindrical wind instrument. If this is something you want to do, &lt;a href=&quot;https://logarithmic.net/pfh/design&quot;&gt;check it out&lt;/a&gt;.

&lt;p&gt;This is to accompany a lesson I am giving on &lt;a href=&quot;https://logarithmic.net/pfh-files/design/Medieval_Wind_Instruments.pdf&quot;&gt;medieval and renaissance wind instruments&lt;/a&gt; at Suth Moot II this easter.


&lt;p&gt;&lt;br&gt;&lt;b&gt;Update 10/4/2010:&lt;/b&gt; New version, with cross-fingering optimization.</content>
</entry>
<entry>
<title>(melodic] clojure</title>
<issued>2009-12-27T22:48:41Z</issued>
<modified>2009-12-27T22:48:41Z</modified>
<id>https://logarithmic.net/pfh/blog/01261954121</id>
<link rel="alternate" type="text/html" href="https://logarithmic.net/pfh/blog/01261954121"/>
<content type="text/html" mode="escaped">


&lt;p&gt;Lisp! An elegant language from a more civilized age. 
Yet there has always been the matter of having to
count brackets, the difficulty of seeing
which bracket goes with which.

&lt;p&gt;Users of Lisp and its children cope with this in various ways:

&lt;p&gt;&lt;ul&gt;&lt;li&gt;Indentation.&lt;/li&gt;&lt;/ul&gt;

&lt;p&gt;&lt;ul&gt;&lt;li&gt;Writing only small functions.&lt;/li&gt;&lt;/ul&gt;

&lt;p&gt;&lt;ul&gt;&lt;li&gt;Turning common nested chains of function calls into a single function call. 
For example, &lt;span class=&quot;mono&quot;&gt;cond&lt;/span&gt; can stand for a chain of &lt;span class=&quot;mono&quot;&gt;if&lt;/span&gt;s, and
&lt;span class=&quot;mono&quot;&gt;let&lt;/span&gt; for a series of assignments.&lt;/li&gt;&lt;/ul&gt;

&lt;p&gt;&lt;ul&gt;&lt;li&gt;Arranging the order of parameters to a function so that the most complex
parameter is the final one.&lt;/li&gt;&lt;/ul&gt;

&lt;p&gt;This final method is effective, but still requires counting back the brackets at
the end of a function. Text editors help with this, but would it not be nice if you
did not need the text editor's help to understand your code?

&lt;p&gt;Suppose we add a tiny bit of syntactic sugar that lets us express a list

&lt;p&gt;&lt;span class=indent&gt;&lt;span class=&quot;mono&quot;&gt;(a b c)&lt;/span&gt;&lt;/span&gt;

&lt;p&gt;as

&lt;p&gt;&lt;span class=indent&gt;&lt;span class=&quot;mono&quot;&gt;(a b] c&lt;/span&gt;&lt;/span&gt;

&lt;p&gt;with the final element of the list &lt;i&gt;after&lt;/i&gt; the closing bracket.

&lt;p&gt;I tried this in &lt;a href=&quot;http://clojure.org/&quot;&gt;Clojure&lt;/a&gt;, 
and found the result pleasing. 
Consider quicksort (from &lt;a href=&quot;http://rosettacode.org/wiki/Quicksort#Clojure&quot;&gt;Rosetta Code&lt;/a&gt;). 

&lt;p&gt;&lt;span class=indent&gt;&lt;pre&gt;
(defn qsort [L]
  (if (empty? L) 
      '()
      (let [[pivot &amp;amp; L2] L]
           (lazy-cat (qsort (filter (fn [y] (&amp;lt; y pivot)) L2))
                     (list pivot)
                     (qsort (filter (fn [y] (&amp;gt;= y pivot)) L2))))))

&lt;/pre&gt;&lt;/span&gt;

&lt;p&gt;See how it drifts right. 
See the final &lt;span class=&quot;mono&quot;&gt;))))))&lt;/span&gt;.
See how the simpler case is handled first in the &lt;span class=&quot;mono&quot;&gt;if&lt;/span&gt;,
putting the meat at the end. Now let's apply sugar:

&lt;p&gt;&lt;span class=indent&gt;&lt;pre&gt;
(defn qsort [L]]
(if (empty?] L '()]
(let[ [pivot &amp;amp; L2] L
    ]]
(lazy-cat (qsort] (filter (fn[y]] (&amp;lt; y pivot)] L2
          (list] pivot
          (qsort] (filter (fn[y]] (&amp;gt;= y pivot)] L2)
&lt;/pre&gt;&lt;/span&gt;

&lt;p&gt;Note how the function is revealed to be a series of qualifications -- this is &lt;span class=&quot;mono&quot;&gt;qsort&lt;/span&gt;, what to do in the
base case, some variable definitions -- to a final statement of how to decompose sorting
into two smaller sorting problems. Within each line, you can also see similar transformation
into series of qualifications. No more &lt;span class=&quot;mono&quot;&gt;))))))&lt;/span&gt;, the structure can be seen at a glance.

&lt;p&gt;The &lt;span class=&quot;mono&quot;&gt;let&lt;/span&gt; here is a little ugly, it would be nicer to be able to write:

&lt;p&gt;&lt;span class=indent&gt;&lt;span class=&quot;mono&quot;&gt;(let variable-name value]&lt;/span&gt;&lt;/span&gt;

&lt;p&gt;and simply use multiple let statements for multiple assignments.

&lt;p&gt;You'll also note I left the comparisons unsugared. I feel the order of parameters would ideally be
reversed for numerical operators. For example, &lt;span class=&quot;mono&quot;&gt;(&amp;lt; pivot] y&lt;/span&gt;. Or
for an average &lt;span class=&quot;mono&quot;&gt;(/ (count] values] (sum] values&lt;/span&gt;. The denominator is usually
less meaty than the numerator.


&lt;p&gt;&lt;br&gt;Another example, the &lt;a href=&quot;http://rosettacode.org/wiki/Mandelbrot_set#Clojure&quot;&gt;Mandelbrot Set task from Rosetta Code&lt;/a&gt;:

&lt;p&gt;&lt;span class=indent&gt;&lt;pre&gt;
(ns mandelbrot
  (:refer-clojure :exclude [+ * &amp;lt;])
  (:use (clojure.contrib complex-numbers)
        (clojure.contrib.generic [arithmetic :only [+ *]]
                                 [comparison :only [&amp;lt;]]
                                 [math-functions :only [abs]])))

(defn mandelbrot? [c]]
(loop [ iters 1
        z c
      ]]
(if (&amp;lt; 20] iters  
    true]
(if (&amp;lt; 2] (abs] z
    false]
(recur (inc] iters] (+ c] (* z] z

(defn mandelbrot []]
(apply str]
(interpose \newline]
(for [y (range 1 -1 -0.05)]]
(apply str]
(for [x (range -2 0.5 0.0315)]]
(if (mandelbrot?] (complex x y)
    &amp;quot;#&amp;quot;] 
&amp;quot; &amp;quot;

(println] (mandelbrot)
&lt;/pre&gt;&lt;/span&gt;


&lt;p&gt;&lt;br&gt;&lt;h2&gt;&lt;a name=&quot;1&quot;&gt;Download&lt;/a&gt;&lt;/h2&gt;

&lt;p&gt;Here is the patch to Clojure:

&lt;p&gt;&lt;ul&gt;&lt;li&gt;&lt;a href=&quot;https://logarithmic.net/pfh-files/blog/01261954121/melodic.patch&quot;&gt;melodic.patch&lt;/a&gt;&lt;/li&gt;&lt;/ul&gt;

&lt;p&gt;Here is a Clojure jar with the patch applied (compiled from git repository as at 19 December 2009):

&lt;p&gt;&lt;ul&gt;&lt;li&gt;&lt;a href=&quot;https://logarithmic.net/pfh-files/blog/01261954121/melodic-clojure.jar&quot;&gt;melodic-clojure.jar&lt;/a&gt;&lt;/li&gt;&lt;/ul&gt;

&lt;p&gt;If you have installed Java, you can use it with:

&lt;p&gt;&lt;span class=indent&gt;&lt;span class=&quot;mono&quot;&gt;java -cp melodic-clojure.jar clojure.main&lt;/span&gt;&lt;/span&gt;

&lt;p&gt;&lt;h2&gt;&lt;a name=&quot;2&quot;&gt;Why &lt;i&gt;Melodic&lt;/i&gt; Clojure?&lt;/a&gt;&lt;/h2&gt;
My initial application for this notation was writing
musical melodies. For example, this is the music for the 
&lt;a href=&quot;https://logarithmic.net/pfh/blog/01254176998&quot;&gt;Buffens&lt;/a&gt;:

&lt;p&gt;&lt;span class=indent&gt;&lt;pre&gt;
F ((G] A] (A] Bb 
A ((G] A] (F] G 
F ((G] A] (A] Bb 
A ((F] G] (G] F

c (d] (c] Bb
A ((G] A] (Bb] c
c (d] (c] Bb
A ((F] G] (G] F
&lt;/pre&gt;&lt;/span&gt;

&lt;p&gt;After desugaring, the brackets indicate the division of time before each note.  
Before desugaring, the brackets indicate the emphasis pattern, the prosody.
The deeper bracketed notes are given reduced emphasis.


&lt;p&gt;&lt;br&gt;--&amp;gt; See &lt;a href=&quot;http://en.wikipedia.org/wiki/Metrical_phonology&quot;&gt;Metrical Phonology&lt;/a&gt;. For Melodic Clojure perhaps follow these rules: 
&lt;ul&gt;&lt;li&gt;&lt;span class=&quot;mono&quot;&gt;(a b c]&lt;/span&gt; is a compound, the Compound Stress Rule applies, giving emphasis to &lt;span class=&quot;mono&quot;&gt;a&lt;/span&gt;. Example: &lt;span class=&quot;mono&quot;&gt;(apply +]&lt;/span&gt;&lt;/li&gt;&lt;/ul&gt;
&lt;ul&gt;&lt;li&gt;&lt;span class=&quot;mono&quot;&gt;(a] (b] c&lt;/span&gt; is a phrase, the Nuclear Stress Rule applies, giving emphasis to &lt;span class=&quot;mono&quot;&gt;c&lt;/span&gt;. Example: &lt;span class=&quot;mono&quot;&gt;(println] (reverse] x&lt;/span&gt;&lt;/li&gt;&lt;/ul&gt;
&lt;ul&gt;&lt;li&gt;Where a-b-c is more like a compound than a phrase, prefer sugarless &lt;span class=&quot;mono&quot;&gt;(a b c)&lt;/span&gt; to &lt;span class=&quot;mono&quot;&gt;(a b] c&lt;/span&gt;. Example: accessing a member of an object, &lt;span class=&quot;mono&quot;&gt;(table :leg)&lt;/span&gt;&lt;/li&gt;&lt;/ul&gt;
</content>
</entry>
<entry>
<title>New bioinformatics software: nesoni, and updated treemaker key-value store</title>
<issued>2009-08-05T11:14:02Z</issued>
<modified>2009-08-05T11:14:02Z</modified>
<id>https://logarithmic.net/pfh/blog/01249470842</id>
<link rel="alternate" type="text/html" href="https://logarithmic.net/pfh/blog/01249470842"/>
<content type="text/html" mode="escaped">

I just put some code I've been working on at work up on the web. Nesoni is Python / Cython software for analysing high-throughput sequencing data. If you have huge piles of high-throughput sequencing data, especially of bacterial genomes, you might find it useful.

&lt;p&gt;&lt;ul&gt;&lt;li&gt;&lt;a href=&quot;http://vicbioinformatics.com/nesoni.shtml&quot;&gt;Nesoni&lt;/a&gt;&lt;/li&gt;&lt;/ul&gt;

&lt;p&gt;It's still fairly alpha. There's not a lot of documentation, but it's getting everyday use.


&lt;p&gt;&lt;br&gt;Perhaps of more interest to those of you who don't have to deal with huge steaming piles of high-throughput sequencing data, nesoni includes an updated version of the &amp;quot;treemaker&amp;quot; key value store I described in a previous &lt;a href=&quot;https://logarithmic.net/pfh/blog/01244327900&quot;&gt;blog post&lt;/a&gt;. Treemaker is useful for huge steaming piles of data of any kind! Treemaker excels at fast random writes. A simple example application would be an always live, always up-to-date &lt;a href=&quot;http://en.wikipedia.org/wiki/Inverted_index&quot;&gt;inverted index&lt;/a&gt;.

&lt;p&gt;This new version can perform merges in the background, using the multiprocessing module. It can happily keep several CPUs busy when under heavy load.

&lt;p&gt;Treemaker is my secret sauce. For example, I've recently been using it to map paired end sequencing data onto a de Bruijn graph layout, using an inverted index from k-mers to read-pairs. This isn't yet part of nesoni, but should hopefully be part of a future version. The &lt;a href=&quot;http://www.ebi.ac.uk/~zerbino/velvet/&quot;&gt;Velvet Assembler&lt;/a&gt; could probably be rewritten using treemaker, to remove its current excessive memory demands. I am, kind of by accident, about a quarter of the way to having done this.
</content>
</entry>
<entry>
<title>A key-value data warehouse</title>
<issued>2009-06-06T22:38:20Z</issued>
<modified>2009-06-06T22:38:20Z</modified>
<id>https://logarithmic.net/pfh/blog/01244327900</id>
<link rel="alternate" type="text/html" href="https://logarithmic.net/pfh/blog/01244327900"/>
<content type="text/html" mode="escaped">



&lt;p&gt;&lt;center&gt;Grand sale! Grand sale!&lt;p&gt;Welcome to Paul's crazy data warehouse.&lt;/p&gt;&lt;p&gt;Everything reduced!&lt;/p&gt;&lt;/center&gt;

&lt;p&gt;So here we are in 2009. Disks got big, but they didn't get fast. 
We have so much memory now that it takes a titanic amount of data to even start hitting the disk.
But when you hit that wall, oh boy do you feel it.

&lt;p&gt;I happen to be working with sequencing data that's just brushing up against that wall. Annoying.

&lt;p&gt;Scattered reads and writes are agonizingly slow now. It's like the era of tape all over again.
Flash storage solves this for the read side of things, but you still want coherent writes.

&lt;p&gt;So, I've written myself a key-value store with this in mind. 
It's the flavour of the month thing to do.

&lt;p&gt;&lt;b&gt;Basic concept:&lt;/b&gt;

&lt;p&gt;Merge sort can be performed with serial reads and writes. We'll maintain a collections of sorted lists, progressively merging lists together in the background.
Writing data is simply a matter of adding a new list to this mix.
Reading is a matter of checking each list in turn, from latest to earliest, for the key you want.

&lt;p&gt;&lt;b&gt;Refining things a little:&lt;/b&gt;

&lt;p&gt;For random access to a database, you want items that are accessed with about the same
likelihood to be stored near each other (ie in the same disk block). This way, your memory
cache mostly contains frequently accessed items.

&lt;p&gt;So I store each list as a balanced binary tree, with each level of the tree stored in
a separate file. The list merging operation only ever needs to append to the end of each of these files, so this is still coherent writing.

&lt;p&gt;Levels of the tree nearer the root will be accessed more often, and should end up in the memory
cache. In-order traversal of the list remains fairly nice, since each level remains in order.

&lt;p&gt;This scheme is &lt;a href=&quot;http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.44.5650&quot;&gt;cache oblivious&lt;/a&gt;, it works well with any size disk block. Which is handy, because &lt;a href=&quot;http://portal.acm.org/citation.cfm?id=864056.864058&quot;&gt;you don't know jack about disks&lt;/a&gt;.

&lt;p&gt;There is some room for improvement here, perhaps a trie.

&lt;p&gt;&lt;b&gt;Some nice properties:&lt;/b&gt;

&lt;p&gt;Writing is fast. Chuck it on disk, merge at your leisure. In parallel, because this is 2009 and
CPUs have stopped getting faster but there are plenty of them.

&lt;p&gt;Multi-Version Concurrency Control is trivially easy. With only a little work, it will be impossible
for the database to become corrupted. And it's backup-friendly.

&lt;p&gt;You don't need to look up a record in order to modify it.
For example, if you want to append to a record or add to a counter, you can set things up
so this is deferred until merge-time. Each key effectively becomes its own little &lt;i&gt;reduce&lt;/i&gt;.
For example, you can implement a very nice word counting system this way.
Which is exactly what I need for my sequencing data.

&lt;p&gt;&lt;b&gt;Implementation:&lt;/b&gt;

&lt;p&gt;My implementation is not terribly optimized, but it is written in Cython,
and is at least within the ballpark of reasonable. You'll need a recent Cython, and
for Python 2.5 you'll also need the &amp;quot;processing&amp;quot; library.

&lt;p&gt;I batch up key-value pairs in an in-memory dict until I have one million of them, before sorting them and writing them to disk. Python hashtables are fast, it would be silly not to use them until we are forced not to. One million items seems to be about optimal. This is a long way
from filling available memory... something to do with cache size perhaps?

&lt;p&gt;The performance on my machine is about 10x faster for writing than bsddb, 2x faster for reading, but still about 10-20x slower than an in-memory Python dict.

&lt;p&gt;It's far from polished. Just proof of concept for the moment.

&lt;p&gt;&lt;ul&gt;&lt;li&gt;&lt;a href=&quot;https://logarithmic.net/pfh-files/blog/01244327900/treemaker-0.2.tar.gz&quot;&gt;treemaker-0.2.tar.gz&lt;/a&gt;&lt;/li&gt;&lt;/ul&gt;

&lt;p&gt;As examples, version 0.2 contains:

&lt;p&gt;&lt;ul&gt;&lt;li&gt;General_store, supporting set, delete, and append operations. {set x} + {append y} -&amp;gt; {set xy}, {append x} + {set y} -&amp;gt; {set y}, {set/append x} + {delete} -&amp;gt; {delete}, {delete} + {set/append y} -&amp;gt; {set y}&lt;/li&gt;&lt;/ul&gt;
&lt;ul&gt;&lt;li&gt;Counting_store, for word counting.&lt;/li&gt;&lt;/ul&gt;



&lt;p&gt;&lt;br&gt;&lt;br&gt;&lt;b&gt;Update 2/8/2009:&lt;/b&gt; &lt;a href=&quot;http://papersincomputerscience.org/2009/04/27/efficient-online-index-construction-for-text-databases/&quot;&gt;This paper&lt;/a&gt; appears to be describing the same idea.

&lt;p&gt;&lt;b&gt;Update 5/8/2009:&lt;/b&gt; &lt;a href=&quot;http://vicbioinformatics.com/nesoni.shtml&quot;&gt;nesoni&lt;/a&gt; includes an updated version of treemaker.
</content>
</entry>
<entry>
<title>Querying intervals efficiently in SQL + Python revisited</title>
<issued>2009-02-21T06:24:34Z</issued>
<modified>2009-02-21T06:24:34Z</modified>
<id>https://logarithmic.net/pfh/blog/01235197474</id>
<link rel="alternate" type="text/html" href="https://logarithmic.net/pfh/blog/01235197474"/>
<content type="text/html" mode="escaped">


&lt;a href=&quot;https://logarithmic.net/pfh/blog/01234937824&quot;&gt;My previous post&lt;/a&gt; on querying intervals generated quite a lot of interest, but my solution was a hideous thing involving trinary numbers. I now have a better solution.

&lt;p&gt;To recap: I am interested in efficiently finding, from a set of intervals, all intervals that contain a given value. I have quite a lot of intervals, so I want to use an SQL database to store and query these intervals.

&lt;p&gt;By efficiently, I mean faster than O(n). Simply scanning the list is too slow, and indexing the upper and/or lower bound still yields an O(n) search.

&lt;p&gt;It would furthermore be nice to be able to find all intervals that intersect a given query interval. My new solution allows this.

&lt;p&gt;&lt;b&gt;My new solution:&lt;/b&gt;

&lt;p&gt;For each interval, store the upper and lower bounds, and the size of the interval &lt;i&gt;rounded down to the nearest power of two&lt;/i&gt; (I'll call this value &lt;span class=&quot;mono&quot;&gt;size&lt;/span&gt;). Create indexes on &lt;span class=&quot;mono&quot;&gt;(size,lower)&lt;/span&gt; and &lt;span class=&quot;mono&quot;&gt;(size,upper)&lt;/span&gt;.

&lt;p&gt;To perform a query, we consider each possible value of &lt;span class=&quot;mono&quot;&gt;size&lt;/span&gt; (up to the maximum size interval you expect). Since &lt;span class=&quot;mono&quot;&gt;size&lt;/span&gt; is at least half the true size of the interval, intervals that contain a certain value &lt;i&gt;must&lt;/i&gt; satisfy either

&lt;p&gt;&lt;span class=&quot;mono&quot;&gt;    lower &amp;lt;= value &amp;lt;= lower+size&lt;/span&gt;

&lt;p&gt;or

&lt;p&gt;&lt;span class=&quot;mono&quot;&gt;    upper-size &amp;lt;= value &amp;lt;= upper&lt;/span&gt;

&lt;p&gt;or possibly both. We can efficiently find intervals satisfying these two conditions using the two indexes we have created. No mess, no fuss!

&lt;p&gt;If your query itself is an interval, the two conditions become:

&lt;p&gt;&lt;span class=&quot;mono&quot;&gt;    lower &amp;lt;= query_upper and query_lower &amp;lt;= lower+size&lt;/span&gt;

&lt;p&gt;or

&lt;p&gt;&lt;span class=&quot;mono&quot;&gt;    upper-size &amp;lt;= query_upper and query_lower &amp;lt;= upper&lt;/span&gt;


&lt;p&gt;&lt;br&gt;Implementation in Python/SQLite:

&lt;p&gt;&lt;ul&gt;&lt;li&gt;&lt;a href=&quot;https://logarithmic.net/pfh-files/blog/01235197474/interval_query.py&quot;&gt;interval_query.py&lt;/a&gt;&lt;/li&gt;&lt;/ul&gt;

&lt;p&gt;SQLite needs a little coddling when implementing this. You need to convert &lt;span class=&quot;mono&quot;&gt;value &amp;lt;= lower+size&lt;/span&gt; to &lt;span class=&quot;mono&quot;&gt;value-size &amp;lt;= lower&lt;/span&gt;, etc, for it to use the indexes properly. &lt;span class=&quot;mono&quot;&gt;explain query plan&lt;/span&gt; is your friend.


&lt;p&gt;&lt;br&gt;&lt;b&gt;Update:&lt;/b&gt; &lt;a href=&quot;http://www.personal.psu.edu/iua1&quot;&gt;Istvan Albert&lt;/a&gt; pointed me to this paper on &lt;a href=&quot;http://bioinformatics.oxfordjournals.org/cgi/content/abstract/btl647v1?papetoc&quot;&gt;Nested Containment Lists&lt;/a&gt;, which solve the same problem and are also amenable to SQL database implementation, and also &lt;a href=&quot;http://groups.google.com/group/pygr-dev/browse_thread/thread/cc52498e17326fca?hl=enQb921b2ca84de79&quot;&gt;this discussion of python interval database/query implementations&lt;/a&gt; on the pygr mailing list.


&lt;p&gt;&lt;br&gt;&lt;b&gt;Performance notes:&lt;/b&gt;
&lt;br&gt;The python file above generates a test data set of 1,000,000 intervals and performs 100 queries on it. On my machine, my method is 20 times faster than just using an index on &lt;span class=&quot;mono&quot;&gt;(upper, lower)&lt;/span&gt;.

&lt;p&gt;I've also tried this on set of 15,000,000 alignments of Illumina short reads to a reference bacterial genome. On my machine, with my method a query retrieving 500 alignments takes 0.009 seconds, as compared to 27 seconds just using an index on &lt;span class=&quot;mono&quot;&gt;(lower, upper)&lt;/span&gt;, or 49 seconds with no index at all.</content>
</entry>
<entry>
<title>Querying intervals efficiently in SQL + Python</title>
<issued>2009-02-18T06:17:04Z</issued>
<modified>2009-02-18T06:17:04Z</modified>
<id>https://logarithmic.net/pfh/blog/01234937824</id>
<link rel="alternate" type="text/html" href="https://logarithmic.net/pfh/blog/01234937824"/>
<content type="text/html" mode="escaped">


&lt;b&gt;Update 21/2/2009:&lt;/b&gt; See my new solution to this problem &lt;a href=&quot;https://logarithmic.net/pfh/blog/01235197474&quot;&gt;here&lt;/a&gt;.


&lt;p&gt;&lt;br&gt;I have a problem where I need to be able to efficiently[1] find all of the intervals that contain a particular (integer) value. There are data-structures to do this efficiently, for example the &lt;a href=&quot;http://en.wikipedia.org/wiki/Interval_tree&quot;&gt;interval tree&lt;/a&gt;. However, I have &lt;i&gt;quite a lot&lt;/i&gt; of intervals, and I don't want to have to code up a complicated disk-based data structure, so I am constrained to use a database.

&lt;p&gt;The solution I came up with is to assign each interval a particular numerical code, such that a query of a small set of such codes will guarantee finding each interval that contains a given value.

&lt;p&gt;I'm not sure if this is novel. It seems a fairly obvious thing to do, so probably not. If not, can anyone tell me what it is called?


&lt;p&gt;&lt;br&gt;The code is this: I produce a trinary number from the binary representation of the lower and upper bounds of the interval. Starting with the most significant digit I map 0-&amp;gt;1 and 1-&amp;gt;2 until the first digit that varies between the lower and upper bounds. From there on I just produce 0s.

&lt;p&gt;An example may make this clear:

&lt;p&gt;&lt;pre&gt;
Lower  00010101010
Upper  00010101111

Output 00021212000&lt;/pre&gt;

&lt;p&gt;To perform a query, I similarly convert the binary query number to trinary. The query set is then this trinary number with each least significant digit converted to zero in turn.

&lt;p&gt;An example:

&lt;p&gt;&lt;pre&gt;
Query  00010101100

Output 00021212211
       00021212210
       00021212200
       00021212000
       00021210000
       00021200000
       00021000000
       00020000000
       00000000000&lt;/pre&gt;

&lt;p&gt;Here's the source code:

&lt;p&gt;&lt;ul&gt;&lt;li&gt;&lt;a href=&quot;https://logarithmic.net/pfh-files/blog/01234937824/range_codes.py&quot;&gt;range_codes.py&lt;/a&gt;&lt;/li&gt;&lt;/ul&gt;

&lt;p&gt;The source code includes code to test this idea using SQLite.


&lt;p&gt;&lt;br&gt;&lt;b&gt;Update:&lt;/b&gt; This is hideous. Sorry. I am working on a &lt;i&gt;much&lt;/i&gt; nicer scheme, stay tuned.

&lt;p&gt;[1] Clarification: By efficiently I mean faster than O(n).</content>
</entry>
<entry>
<title>Sketchifier</title>
<issued>2009-01-18T20:50:23Z</issued>
<modified>2009-01-18T20:50:23Z</modified>
<id>https://logarithmic.net/pfh/blog/01232311823</id>
<link rel="alternate" type="text/html" href="https://logarithmic.net/pfh/blog/01232311823"/>
<content type="text/html" mode="escaped">
&lt;a href=&quot;https://logarithmic.net/pfh-files/blog/01232311823/bird.jpg&quot;&gt;&lt;img src=&quot;https://logarithmic.net/pfh-files/blog/01232311823/sketch_bird.jpg&quot;&gt;&lt;/a&gt;

&lt;p&gt;Click image to see original.

&lt;p&gt;&lt;ul&gt;&lt;li&gt;&lt;a href=&quot;https://logarithmic.net/pfh-files/blog/01232311823/sketch.py&quot;&gt;sketch.py&lt;/a&gt;&lt;/li&gt;&lt;/ul&gt;

&lt;p&gt;Requires Python 2, Python Imaging Library, Numpy.

&lt;p&gt;The basic principle is to draw lines but not corners or dots. Lines, and corners and dots can be detected by the eigenvalues and determinant of the Hessian respectively. I also need to blur the image for various reasons, which I attempt to do in as scale-free a way as possible by using a long tailed convolution kernel (of course, you know I'm obsessed by these things), a t-distribution to be precise.
</content>
</entry>
</feed>
