Overzealous use of my red-black tree hammer
When Firefox 3 was released, jemalloc was left disabled for the OS X version, essentially because OS X's malloc implementation did as good a job as jemalloc (in terms of both speed and memory usage), and we didn't think it worth introducing potential regressions due to changed memory layout. Recently I have been working on a memory reserve system that allows Firefox to simplify its error handling with regard to out-of-memory errors. Since the memory reserve is necessarily deeply integrated with the allocator, we need to use jemalloc on all platforms in order to take advantage of this new facility. This prompted me to take a closer look at jemalloc performance on OS X. In summary:
- Replacing the system malloc on OS X is Really Hard(TM).
- jemalloc was doing way too much red-black tree manipulation.
OS X uses the Mach-O format, and in order to completely replace the system malloc, it would be necessary to compile a custom libSystem. As far as I know, that has not been possible outside the confines of Apple since version 10.3 (2+ years ago). Even if it were possible, there would be all sorts of undesirable aspects to shipping a custom libSystem with Firefox; libSystem is a huge library, and binary compatibility issues would be a constant problem. So, the only remaining viable option is to subvert the malloc zone machinery. There is no supported method for changing the default zone, and furthermore, CoreFoundation directly accesses the default zone. Enough about that though; suffice it to say that I did find ways to subvert the malloc zone machinery.
Once Firefox was successfully using jemalloc for all memory allocation, I started doing performance tests. Memory usage differences were minor, but jemalloc was consistently slower than OS X's allocator. It took a lot of profiling for me to finally accept the hard truth: jemalloc was spending way too much time manipulating red-black trees. My first experimental solution was to replace red-black trees with treaps. However, this made little overall difference. So, the problem was too many tree operations, not slow tree operations.
In essence, my initial failure was to disregard the difference between a O(1) algorithm and a O(lg n) algorithm. Intuitively, I think of logarithmic-time algorithms as fast, but constant factors and large n can conspire to make logarthmic time not nearly good enough.