25 July 2008

Treaps versus red-black trees

In a previous blog entry, I discussed the difficulties associated with implementing left-leaning red-black trees. A couple of readers commented that treaps might be superior to red-black trees, and as part of some recent jemalloc optimization work, I had occasion to implement treaps in order to measure tree operation overhead.

Most of this article discusses performance, but let me first mention implementation difficulty. It took me about 90 hours to design/implement/test/benchmark left-leaning red-black trees, and less than 10 hours for treaps. Search/insert/delete for red-black trees is O(lg n), versus O(n) for treaps. However, the average case for treaps is (lg n), and the chances of worst case behavior are vanishingly small, thanks to (pseudo-)randomness. Thus, real-world performance differences are only incremental. To be fair, I made red-black trees harder by avoiding recursion. Regardless however, treaps are way easier to implement than red-black trees.

As for benchmarking, I wrote functionally identical benchmark programs for three red-black tree implementations and two treap implementations. The tree implementations are:
  • rb_new: Left-leaning red-black trees.
  • rb_old: Standard red-black trees.
  • RB: Standard red-black trees, as implemented by the *BSD sys/tree.h.
  • trp_hash: Treaps, with priorities computed via pointer hashing.
  • trp_prng: Treaps, with priorities computed via pseudo-random number generation (PRNG).
The benchmark programs iteratively generate permutations of NNODES nodes, for NSETS node sets. For each node set, the programs iteratively build and tear down a tree using the first [1..NNODES] nodes in the set. Each insert/remove operation is accompanied by NSEARCH rounds of searching for every object in the tree, and NITER rounds of iterating over every object in the tree. Don't worry too much about the details; in short the benchmark programs can be configured to predominantly benchmark insertion/deletion, searching, and/or iteration.

The following table summarizes benchmark results as measured on a 2.2 GHz amd64 Ubuntu 8.10 system. The benchmarks were all compiled with "gcc -O3", and the times are user+system time (fastest of three runs):

NNODES,
NSETS,
NSEARCH,
NITER
(focus)
rb_newrb_oldRBtrp_hashtrp_prng
1500,25,0,0 (ins/del)7.603.994.2517.577.58
125,25,10,0 (search)17.7418.6116.6017.8417.77
250,25,0,100 (iteration)18.4521.0619.1920.4520.40

Insertion/deletion is fastest for the red-black tree implementations that do lazy fixup (rb_old and RB). rb_new uses a single-pass algorithm, which requires more work. trp_prng is about the same speed as rb_new, but trp_hash is way slower, due to the repeated hash computations that are required to avoid explicitly storing node priorities.

Search performance is similar for all implementations, which indicates that there are no major disparities in tree balance.

Iteration performance is similar for all implementations, even though they use substantially different algorithms. If tree size were much larger, rb_old and RB would suffer, since they use an O(n lg n) algorithm, whereas rb_new and trp_* use O(n) algorithms. rb_new uses a rather complicated iterative algorithm, but trp_* use recursion and callback functions due to the weak upper bound on tree depth.

Sadly, there is no decisive winner, though any of the five tree implementations is perfectly adequate for the vast majority of applications. The winners according to various criteria are:
  • Space: rb_new and trp_hash.
  • Speed (insertion/deletion): rb_old and RB.
  • Ease of implementation: trp_prng.

24 July 2008

Overzealous use of my red-black tree hammer

When Firefox 3 was released, jemalloc was left disabled for the OS X version, essentially because OS X's malloc implementation did as good a job as jemalloc (in terms of both speed and memory usage), and we didn't think it worth introducing potential regressions due to changed memory layout. Recently I have been working on a memory reserve system that allows Firefox to simplify its error handling with regard to out-of-memory errors. Since the memory reserve is necessarily deeply integrated with the allocator, we need to use jemalloc on all platforms in order to take advantage of this new facility. This prompted me to take a closer look at jemalloc performance on OS X. In summary:
  1. Replacing the system malloc on OS X is Really Hard(TM).
  2. jemalloc was doing way too much red-black tree manipulation.
On ELF-based systems (pretty much all modern Unix and Unix-like systems except OS X), it is possible to cleanly replace the system malloc, either by directly implementing the appropriate functions (malloc, realloc, free, etc.), or by using the LD_PRELOAD environment variable to preload a dynamic library that contains a malloc implementation. For Windows, replacing malloc is much harder; it is necessary to create a custom CRT. On the bright side, at least it is possible to create a custom CRT, since source code is included with MS Visual Studio.

OS X uses the Mach-O format, and in order to completely replace the system malloc, it would be necessary to compile a custom libSystem. As far as I know, that has not been possible outside the confines of Apple since version 10.3 (2+ years ago). Even if it were possible, there would be all sorts of undesirable aspects to shipping a custom libSystem with Firefox; libSystem is a huge library, and binary compatibility issues would be a constant problem. So, the only remaining viable option is to subvert the malloc zone machinery. There is no supported method for changing the default zone, and furthermore, CoreFoundation directly accesses the default zone. Enough about that though; suffice it to say that I did find ways to subvert the malloc zone machinery.

Once Firefox was successfully using jemalloc for all memory allocation, I started doing performance tests. Memory usage differences were minor, but jemalloc was consistently slower than OS X's allocator. It took a lot of profiling for me to finally accept the hard truth: jemalloc was spending way too much time manipulating red-black trees. My first experimental solution was to replace red-black trees with treaps. However, this made little overall difference. So, the problem was too many tree operations, not slow tree operations.

After a bit of code review, it became clear that when I fixed a page allocation bottleneck earlier this year, I was overzealous with the application of red-black trees. It is possible to use constant-time algorithms based on linear page map data structures for splitting/coalescing sequential runs of pages, but I had re-coded these operations entirely using red-black trees. So, I enhanced the page map data structures to support splitting/coalescing, and jemalloc became markedly faster. For example, Firefox sped up by as much as ~10% on JavaScript-heavy benchmarks. (As a side benefit, memory usage went down by 1-2%).

In essence, my initial failure was to disregard the difference between a O(1) algorithm and a O(lg n) algorithm. Intuitively, I think of logarithmic-time algorithms as fast, but constant factors and large n can conspire to make logarthmic time not nearly good enough.