18 April 2010

Red-black trees revisited

A couple of months ago I was looking at CPU usage profiles for one of the server applications at work, trying to figure out how to reduce the 28% of run time attributed to jemalloc. For the application under scrutiny, red-black tree insertion/deletion accounted for 6% of total run time, and I knew from previous benchmarking that jemalloc's left-leaning red-black trees are approximately half as fast as the traditional implementations that utilize parent pointers. Therefore I set out to create a balanced tree implementation that is both fast and compact.

I previously claimed that:
Insertion/deletion is fastest for the red-black tree implementations that do lazy fixup (rb_old and RB). rb_new uses a single-pass algorithm, which requires more work.
Well, in fact it isn't quite that simple. While it is true that the single-pass algorithm requires some extra node rotations, the overhead of traversing down/up the tree is also significant. A recursive implementation I experimented with performed approximately the same number of node rotations as rb_old and RB, but the traversal overhead of full down/up passes dominated.

A major advantage of the rb_old and RB insertion/deletion implementations is that as soon as balance is restored, they terminate. Recursive implementation makes early termination infeasible, so I developed a non-recursive implementation, rb_newer, that uses a stack-allocated array to store the path from the root to the current position. The essence of the algorithm is the same as for recursive implementation, with the advantage that early termination requires no call stack unwinding.

The following table shows benchmark results that directly correspond to those reported in a previous blog post (Ubuntu 9.10 instead of 8.10 though):

NNODES,
NSETS,
NSEARCH,
NITER
(focus)
rb_newerrb_newrb_oldRB
1500,25,0,0
(ins/del)
5.95
7.87
3.93
4.13
125,25,10,0
(search)
17.62
17.14
18.16
16.94
250,25,0,100
(iteration)
18.73
18.88
20.48
19.47

The main result to note is that rb_newer is substantially faster than rb_new at insertion/deletion, though it only closes half the performance gap between rb_new and rb_old. Despite my best efforts, rb_old (a direct translation of the pseudo-code in Introduction to Algorithms, by Cormen et al.) remains markedly faster, thanks to the simplicity afforded by parent pointers.

jemalloc uses rb_newer now, though the total memory savings as compared to rb_old is only about 0.4%. Nonetheless this seems like a worthwhile space savings, given how typical applications use malloc.

11 April 2010

Stand-alone jemalloc 1.0.0

Stand-alone jemalloc 1.0.0 is finally released. There are many interesting features to talk about at some point (thread-local caching, heap profiling, introspection, yet another red-black tree implementation, etc.), but in the meanwhile, enjoy!