21 April 2008

Left-leaning red-black trees are hard to implement

Back in 2002, I needed balanced trees for a project I was working on, so I used the description and pseudo-code in Introduction to Algorithms to implement red-black trees. I vaguely recall spending perhaps two days on implementation and testing. That implementation uses C preprocessor macros in order to make it possible to link data structures into one or more red-black trees without requiring container objects.

About the same time, Niels Provos added a similar implementation to OpenBSD, which was imported into FreeBSD, so when I imported jemalloc into FreeBSD, I switched from my own red-black tree implementation to the standard one. Unfortunately, both implementations use nodes that include four pieces of information: parent, left child, right child, and color (red or black). That typically adds up to 16 or 32 bytes on 32- and 64-bit systems, respectively. A few months ago I fixed some scalability issues Stuart Parmenter found in jemalloc by replacing linear searches with tree searches, but that meant adding more tree links. These trees now take up ~2% of all mapped memory, so I have been contemplating ways to reduce the overhead.

A couple of weeks ago, I came across some slides for a talk that Robert Sedgewick recently gave on left-leaning red-black trees. His slides pointedly disparage the use of parent pointers, and they also make left-leaning red-black trees look simple to implement. Left-leaning red-black trees maintain a logical 1:1 correspondence with 2-3-4 B-trees, which is a huge help in understanding seemingly complex tree transformations.

Last Monday, I started implementing left-leaning red-black trees, expecting to spend perhaps 15 hours on the project. I'm here more than 60 hours of work later to tell you that left-leaning red-black trees are hard to implement, and contrary to Sedgewick's claims, their implementation appears to require approximately the same amount of code and complexity as standard red-black trees. Part of the catch is that although standard red-black trees have additional cases to deal with due to 3-nodes that can lean left or right, left-leaning red-black trees have a universal asymmetry between the left and right versions of the algorithms.

If memory overhead weren't my primary concern for this project, I would have dropped red-black trees in favor of treaps. Unfortunately, treaps require either recursive implementation or parent pointers, and they also require an extra "priority" field, whereas red-black trees can be implemented without recursion or parent pointers, and it is possible to stuff the red-black bit in the least significant bit of one of the left/right pointers.

For the curious or those in need of such a beast, here is my left-leaning red-black tree implementation. One point of interest is that my benchmarks show it to be ~25% slower than my standard red-black tree implementation. The red-black bit twiddling overhead only accounts for about 1/5 of the slowdown. I attribute the other 4/5 to the overhead of transforming the tree on the down pass, rather than lazily fixing up tree structure violations afterward.

[26 April 2008] I did some further experimentation to understand the performance disparity between implementations. The benchmarks mentioned above were flawed, in that they always searched for the most recently inserted item. Since top-down insertion/deletion is more disruptive than lazy fixup, the searches significantly favored the old implementation. I fixed the benchmarks to compute the times for random searches, random insertions/deletions, and in-order tree traversal.

The old rb.h and sys/tree.h perform essentially the same for all operations. The new rb.h takes almost twice as long for insertion/deletion, is the same speed for searches, and is slightly faster for iteration. Red/black bit twiddling overhead accounts for ~6% of insertion/deletion time, and <3% of search time.

I am actually quite pleased with these benchmark results, because they show that for random inputs, left-leaning red-black trees do not noticeably suffer from the fact that tree height is O(3h) rather than O(2h), where h is the height of an equivalent fully balanced tree.

03 April 2008

Using Mercurial patch queues for daily development

I recently watched a video (slides) of Bryan O'Sullivan speaking about Mercurial. The presentation was mainly a (great) introduction to Mercurial, but I was surprised to learn that Mercurial patch queues could be useful even when using a repository that I have full commit access to. In a nutshell, Bryan described how he uses patch queues to checkpoint his work without cluttering the permanent revision history. Checkpointing is mainly useful to me when I am about to try a risky programming solution on top of reasonable code that only partially implements a feature. Historically, I have archived my entire sandbox at such critical points, but patch queues are a much cleaner solution; they make it possible to separate work into distinct patches and checkpoint regularly without performing heavyweight archiving operations. Note that reverting to an earlier state is much easier with patch queues, which makes failed experiments much less costly. This all sounds great, but it took me several hours and a lot of mistakes to actually figure out how to use patch queues in this fashion, so I'm recording the solution here with the hope that it will be useful to others.

The first step is to enable the mq extension (see Configuration directions), though it is enabled by default on my Ubuntu 7.10 systems, and in fact following the standard configuration directions blindly causes some strange warnings.

Following is a terse example of how to perform every operation that I find useful when using patch queues for daily development:
~> hg init pizza # Create repository.
~> cd pizza
~/pizza> echo "pepperoni" > ingredients
~/pizza> echo "black olives" >> ingredients
~/pizza> echo "hand-tossed" > crust
~/pizza> hg add ingredients crust
~/pizza> hg commit -m "Initial pizza recipe."
~/pizza> hg qinit # Initialize unversioned patch queue repository.
~/pizza> hg qnew more-ingredients.patch # Create new working patch.
~/pizza> echo "mushrooms" >> ingredients
~/pizza> hg qrefresh # Checkpoint before creating a new patch.
~/pizza> hg qnew specify-sauce.patch
~/pizza> hg qseries # Look at the patch queue.
more-ingredients.patch
specify-sauce.patch
~/pizza> hg qapplied # See which patches are applied.
more-ingredients.patch
specify-sauce.patch
~/pizza> echo "tomato" > sauce
~/pizza> hg add sauce
~/pizza> hg qrefresh
~/pizza> hg qpop # Pop patch, in order to work on more-ingredients.patch again.
Now at: more-ingredients.patch
~/pizza> hg qapplied
more-ingredients.patch
~/pizza> hg qunapplied
specify-sauce.patch
~/pizza> echo "green peppers" >> ingredients
~/pizza> hg diff
diff -r 4f3f2d833e6f ingredients
--- a/ingredients Thu Apr 03 15:38:01 2008 -0700
+++ b/ingredients Thu Apr 03 15:40:55 2008 -0700
@@ -1,3 +1,4 @@ pepperoni
pepperoni
black olives
mushrooms
+green peppers
~/pizza> hg qrefresh
~/pizza> hg qpush # Go back to specify-sauce.patch.
applying specify-sauce.patch
Now at: specify-sauce.patch
~/pizza> echo "marinara" > sauce
~/pizza> hg qdiff
diff -r aadd3ecd3c8e sauce
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/sauce Thu Apr 03 15:43:00 2008 -0700
@@ -0,0 +1,1 @@
+marinara
~/pizza> hg qrefresh
~/pizza> hg qpop
Now at: more-ingredients.patch
~/pizza> hg qrefresh -e # Edit commit message.
~/pizza> hg qpush
applying specify-sauce.patch
Now at: specify-sauce.patch
~/pizza> hg qrefresh -e
~/pizza> hg log
changeset: 2:6714598e1ccc
tag: qtip
tag: tip
tag: specify-sauce.patch
user: Jason Evans
date: Thu Apr 03 15:44:32 2008 -0700
summary: Specify which sauce to use.

changeset: 1:cc9c1fdf1038
tag: qbase
tag: more-ingredients.patch
user: Jason Evans
date: Thu Apr 03 15:44:01 2008 -0700
summary: Specify more ingredients.

changeset: 0:d3ee82132d36
tag: qparent
user: Jason Evans
date: Thu Apr 03 15:34:29 2008 -0700
summary: Initial pizza recipe.

~/pizza> hg qseries
more-ingredients.patch
specify-sauce.patch
~/pizza> hg qdelete -r more-ingredients.patch # Commit patch.
~/pizza> hg qdelete -r specify-sauce.patch
~/pizza> hg log
changeset: 2:6714598e1ccc
tag: tip
user: Jason Evans
date: Thu Apr 03 15:44:32 2008 -0700
summary: Specify which sauce to use.

changeset: 1:cc9c1fdf1038
user: Jason Evans
date: Thu Apr 03 15:44:01 2008 -0700
summary: Specify more ingredients.

changeset: 0:d3ee82132d36
user: Jason Evans
date: Thu Apr 03 15:34:29 2008 -0700
summary: Initial pizza recipe.

~/pizza> hg qnew modify-crust.patch
~/pizza> echo "deep dish" > crust
~/pizza> hg qrefresh
~/pizza> hg qnew experiment.patch
~/pizza> echo "pesto" > sauce
~/pizza> hg qrefresh
~/pizza> hg log
changeset: 4:173178d3d17d
tag: qtip
tag: tip
tag: experiment.patch
user: Jason Evans
date: Thu Apr 03 16:16:10 2008 -0700
summary: [mq]: experiment.patch

changeset: 3:4961f95336c5
tag: modify-crust.patch
tag: qbase
user: Jason Evans
date: Thu Apr 03 16:14:55 2008 -0700
summary: [mq]: modify-crust.patch

changeset: 2:6714598e1ccc
tag: qparent
user: Jason Evans
date: Thu Apr 03 15:44:32 2008 -0700
summary: Specify which sauce to use.

changeset: 1:cc9c1fdf1038
user: Jason Evans
date: Thu Apr 03 15:44:01 2008 -0700
summary: Specify more ingredients.

changeset: 0:d3ee82132d36
user: Jason Evans
date: Thu Apr 03 15:34:29 2008 -0700
summary: Initial pizza recipe.

~/pizza> hg qpop
Now at: modify-crust.patch
~/pizza> hg qdelete experiment.patch # Discard horrible experiment.
~/pizza> hg log
changeset: 3:4961f95336c5
tag: qtip
tag: tip
tag: modify-crust.patch
tag: qbase
user: Jason Evans
date: Thu Apr 03 16:14:55 2008 -0700
summary: [mq]: modify-crust.patch

changeset: 2:6714598e1ccc
tag: qparent
user: Jason Evans
date: Thu Apr 03 15:44:32 2008 -0700
summary: Specify which sauce to use.

changeset: 1:cc9c1fdf1038
user: Jason Evans
date: Thu Apr 03 15:44:01 2008 -0700
summary: Specify more ingredients.

changeset: 0:d3ee82132d36
user: Jason Evans
date: Thu Apr 03 15:34:29 2008 -0700
summary: Initial pizza recipe.

~/pizza> cat sauce
marinara
The trickiest parts of the above are committing/deleting with the qdelete command, and editing the commit message with qrefresh. I omitted the many ways of messing up the order of operations, so tread lightly and experiment with a toy repository before you use this mode of operation for real.