18 April 2010

Red-black trees revisited

A couple of months ago I was looking at CPU usage profiles for one of the server applications at work, trying to figure out how to reduce the 28% of run time attributed to jemalloc. For the application under scrutiny, red-black tree insertion/deletion accounted for 6% of total run time, and I knew from previous benchmarking that jemalloc's left-leaning red-black trees are approximately half as fast as the traditional implementations that utilize parent pointers. Therefore I set out to create a balanced tree implementation that is both fast and compact.

I previously claimed that:
Insertion/deletion is fastest for the red-black tree implementations that do lazy fixup (rb_old and RB). rb_new uses a single-pass algorithm, which requires more work.
Well, in fact it isn't quite that simple. While it is true that the single-pass algorithm requires some extra node rotations, the overhead of traversing down/up the tree is also significant. A recursive implementation I experimented with performed approximately the same number of node rotations as rb_old and RB, but the traversal overhead of full down/up passes dominated.

A major advantage of the rb_old and RB insertion/deletion implementations is that as soon as balance is restored, they terminate. Recursive implementation makes early termination infeasible, so I developed a non-recursive implementation, rb_newer, that uses a stack-allocated array to store the path from the root to the current position. The essence of the algorithm is the same as for recursive implementation, with the advantage that early termination requires no call stack unwinding.

The following table shows benchmark results that directly correspond to those reported in a previous blog post (Ubuntu 9.10 instead of 8.10 though):


The main result to note is that rb_newer is substantially faster than rb_new at insertion/deletion, though it only closes half the performance gap between rb_new and rb_old. Despite my best efforts, rb_old (a direct translation of the pseudo-code in Introduction to Algorithms, by Cormen et al.) remains markedly faster, thanks to the simplicity afforded by parent pointers.

jemalloc uses rb_newer now, though the total memory savings as compared to rb_old is only about 0.4%. Nonetheless this seems like a worthwhile space savings, given how typical applications use malloc.


At May 15, 2010 at 7:30 AM , Blogger ned14 said...

Dear Jason,

I'm the author of nedmalloc (http://www.nedprod.com/programs/portable/nedmalloc/). Firstly, congratulations on getting jemalloc ported over to Windows - nedmalloc may well have lost its crown as the fastest free portable memory allocator, but once v1.10 is out I'll run some benchmarks and we'll find out! (though I should add that Microsoft have very substantially improved the system allocator in Windows 7, and nedmalloc's advantage has been slashed - so much so that in good conscience I could no longer recommend the added inconvenience of its use).

Secondly, many thanks for all your effort invested into red-black tree C implementations - you have saved me, and I am sure many others a lot of work. Right now I am in the process of adding a user mode page allocator to nedmalloc which effectively moves the problem of page mapping into userspace. This lets you solve one of the big performance killers in modern code - repeated large array extension - by spacing out large array allocations in address space and extending them by simply appending pages, thus avoiding memory copies and address relocation (which would require array member copy construction in C++). This has been enabled by the development of two new malloc APIs, one for C and one for C++, which I intend to submit for standardisation. I will of course ask you for your comments beforehand, so expect that sometime in June after my current contract is completed.

Obviously this raises the problem of how best to partition that address space, and this is how I found your code. Iteration performance is probably less important for a virtual address space allocator than a normal allocator, so that left me comparing old traditional R+B sys/tree.h, your rb_old and your RB latest under the assumption that my VA allocator would do 33% searching, 33% inserting and 33% deleting.

Running your tests on a 1.6Ghz Intel Atom 220 (my standard benchmarking CPU, I always try to aim low) on Debian, I obtain the following:

RB Tree (sys/tree.h):
ins/del: 8.91
search: 31.68
iteration: 41.03

LLRB Tree old:
ins/del: 8.20
search: 35.40
iteration: 45.45

LLRB Tree latest:
ins/del: 13.15
search: 34.18
iteration: 42.46

In other words, on Atom, your latest implementation is considerably slower at insertion/deletion, much more so than your benchmarks above.

I then wondered if RB_COMPACT was causing this:

LLRB Tree latest (no RB_COMPACT):
ins/del: 12.58
search: 32.61
iteration: 41.61

Hmm, it would seem not.

Any idea what might be going wrong? There must be something in the new implementation which is causing the Atom to pipeline stall.


At May 22, 2010 at 7:04 PM , Blogger Jason said...

Wow, the performance results are strikingly different on Atom than on the systems I used during development (Opteron and Core Duo). I don't know much about Atom, but if it has a relatively weak branch predictor, that might partially explain the difference.


Post a Comment

Subscribe to Post Comments [Atom]

<< Home