19 June 2007

Unicode strings for Lyken

Lyken, a programming language I am currently developing, uses Unicode for all strings. Lyken is just one of many languages that have to overcome the design challenges associated with Unicode, though at least Lyken itself has no legacy support requirements. However, since Lyken's runtime library is written in C, I still have to devise a way to provide pure Unicode string support in Lyken without making runtime library development overly cumbersome.

A couple of years ago I decided to use a simplistic internal representation for strings in Lyken. The idea was to keep an ASCII representation of each string whose contents were purely ASCII, and to also keep a UCS-4 representation of every string (created lazily for pure ASCII strings). This had a critical problem though: C library interfaces take (char *) strings, so non-ASCII strings, which existed only as UCS-4, could not be passed to them and were unusable for many purposes. This problem made it clear that I needed to somehow support UTF-8 in Lyken's runtime library.

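One possible approach would be to internally store each string as both UTF-8 and UCS-4, but that is a tremendous waste of memory for ASCII and non-ASCII strings alike, since the UCS-4 copy alone costs four bytes per character. Instead, I have decided to store strings only as UTF-8, but that has performance issues for indexed access: characters occupy anywhere from one to four bytes, so finding the nth character means scanning from the start of the string.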

To mitigate the indexed access performance issue for UTF-8, I store a lazily initialized table that records the byte offset of every nth character (n=32 for now), so an indexed access has to scan at most n-1 characters past a recorded position. Because strings are immutable, lazy table initialization is safe in multi-threaded programs, with no need for synchronization. The table is only needed for non-ASCII strings, and is known to be present just past the end of the string itself if and only if the string's byte length and character length differ.
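To make the idea concrete, here is a minimal sketch of such a table in C. It is not Lyken's actual code: the ustr struct, the function names, and the STRIDE constant are all made up for illustration. For simplicity it keeps the table in a separate allocation rather than just past the end of the string, it assumes the input is valid UTF-8, and it leaves out everything to do with threads.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define STRIDE 32  /* one table entry per 32 characters (the n above) */

/* Hypothetical string record; the field names are illustrative only. */
typedef struct {
    const char *bytes;  /* UTF-8 data, assumed valid */
    size_t nbytes;      /* length in bytes */
    size_t nchars;      /* length in characters (code points) */
    size_t *table;      /* table[i] = byte offset of character i*STRIDE;
                           NULL until first needed */
} ustr;

/* A byte starts a character unless it is a 10xxxxxx continuation byte. */
int is_lead(unsigned char b) { return (b & 0xC0) != 0x80; }

ustr ustr_make(const char *s) {
    ustr u;
    u.bytes = s;
    u.nbytes = strlen(s);
    u.nchars = 0;
    for (size_t i = 0; i < u.nbytes; i++)
        u.nchars += is_lead((unsigned char)s[i]);
    u.table = NULL;
    return u;
}

/* Record the byte offset of characters 0, STRIDE, 2*STRIDE, ... */
void ustr_build_table(ustr *u) {
    size_t entries = (u->nchars + STRIDE - 1) / STRIDE;
    u->table = malloc(entries * sizeof *u->table);
    if (u->table == NULL)
        return;  /* caller falls back to scanning from the start */
    size_t ch = 0;
    for (size_t i = 0; i < u->nbytes; i++) {
        if (!is_lead((unsigned char)u->bytes[i]))
            continue;
        if (ch % STRIDE == 0)
            u->table[ch / STRIDE] = i;
        ch++;
    }
}

/* Byte offset of character idx. Requires idx < u->nchars. */
size_t ustr_offset(ustr *u, size_t idx) {
    if (u->nbytes == u->nchars)
        return idx;  /* pure ASCII: character index equals byte index */
    if (u->table == NULL)
        ustr_build_table(u);  /* lazy initialization */

    size_t off = 0, ch = 0;
    if (u->table != NULL) {
        off = u->table[idx / STRIDE];  /* jump to the nearest recorded position */
        ch = idx - idx % STRIDE;
    }
    /* Walk forward one character at a time: at most STRIDE-1 steps
       when the table is available. */
    while (ch < idx) {
        off++;
        while (!is_lead((unsigned char)u->bytes[off]))
            off++;
        ch++;
    }
    return off;
}
```

With this layout, ustr_offset(&s, i) costs one table lookup plus a scan of at most 31 characters for non-ASCII strings, and is a plain identity for pure ASCII strings, which never allocate a table at all.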

I have searched for information on better approaches to solving the indexed access problem for UTF-8 strings, but have found nothing. If you know of anything better, please let me know.