19 June 2007

Unicode strings for Lyken

Lyken, a programming language I am currently developing, uses Unicode for all strings. Lyken is just one of many languages that have to overcome the design challenges associated with Unicode, though at least Lyken itself has no legacy support requirements. However, since Lyken's runtime library is written in C, I still have to devise a way to provide pure Unicode string support in Lyken without making runtime library development overly cumbersome.

A couple of years ago I decided to use a simplistic internal representation for strings in Lyken. The idea was to maintain an ASCII representation for each purely ASCII string, along with a UCS-4 representation of every string (created lazily for pure ASCII strings). This had a critical problem though: C library interfaces use (char *) strings, thus making it impossible to pass non-ASCII strings to them for many purposes. This problem made it clear that I needed to somehow support UTF-8 in Lyken's runtime library.

One possible approach would be to internally store each string both as UTF-8 and UCS-4, but that is a tremendous waste of memory both for ASCII and non-ASCII strings. Instead, I have decided to just store strings as UTF-8, but that has performance issues for indexed access.

In order to mitigate the indexed access performance issue for UTF-8, I store a lazily initialized table that records the byte offset of every nth character (n=32 for now). Immutable strings make lazy table initialization safe for multi-threaded programs, with no need for synchronization. The table is only needed for non-ASCII strings and is known to be present just past the end of the string itself iff the string's byte/character lengths differ.
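The skip-table scheme can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not Lyken's actual runtime code: the names (`build_skip_table`, `utf8_byte_offset`) and the standalone `malloc`'d table are mine, whereas Lyken stores the table inline past the end of the string. The idea is the same: record the byte offset of every SKIP-th character, so an indexed access costs one table lookup plus at most SKIP-1 forward steps.

```c
#include <stdlib.h>
#include <assert.h>  /* for the usage assertions below */

#define SKIP 32  /* record the byte offset of every 32nd character */

/* Length in bytes of the UTF-8 sequence that starts with byte b
   (assumes well-formed input; continuation bytes never start one). */
static size_t utf8_seq_len(unsigned char b) {
    if (b < 0x80)           return 1;
    if ((b & 0xE0) == 0xC0) return 2;
    if ((b & 0xF0) == 0xE0) return 3;
    return 4;
}

/* Build the table: slot i holds the byte offset of character i*SKIP. */
static size_t *build_skip_table(const char *s, size_t nchars) {
    size_t nslots = nchars / SKIP + 1;
    size_t *tab = malloc(nslots * sizeof(size_t));
    size_t byte = 0, ch;
    tab[0] = 0;
    for (ch = 0; ch < nchars; ch++) {
        if (ch % SKIP == 0)
            tab[ch / SKIP] = byte;
        byte += utf8_seq_len((unsigned char)s[byte]);
    }
    return tab;
}

/* Byte offset of character idx: jump via the table, then walk
   forward over at most SKIP-1 sequences. */
static size_t utf8_byte_offset(const char *s, const size_t *tab,
                               size_t idx) {
    size_t byte = tab[idx / SKIP];
    size_t ch;
    for (ch = 0; ch < idx % SKIP; ch++)
        byte += utf8_seq_len((unsigned char)s[byte]);
    return byte;
}
```

With n=32 the table adds one `size_t` per 32 characters, a small fixed overhead compared to keeping a parallel UCS-4 copy, which quadruples the storage for ASCII text.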

I have searched for information on better approaches to solving the indexed access problem for UTF-8 strings, but have found nothing. If you know of anything better, please let me know.

3 Comments:

At June 21, 2007 at 3:35 AM , Blogger self said...

Will this help any?

 
At June 21, 2007 at 8:09 AM , Blogger Jason said...

I don't see anything in the Plan 9 libraries that does quite what I'm looking for. Plan 9 does serve as a good example for ubiquitous UTF-8 support though. I found this paper to be a really good read.

 
At May 3, 2010 at 1:41 PM , Blogger Nico said...

Sounds to me like you're indexing strings by codepoint, which is, admittedly, useful enough. However, you also need to consider indexing by character, and indexing by glyph. In Unicode "character" != "codepoint" (I know, you're probably well aware). Although if you always use composed codepoints and never use combining codepoints, then character == codepoint -- but note that NFC is closed to new compositions, so never using combining codepoints != NFC.
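Nico's distinction can be made concrete with a small sketch (the helper name is mine, for illustration only): the precomposed and decomposed encodings of "é" render as the same character but contain different numbers of codepoints, so indexing by codepoint and indexing by character can disagree.

```c
#include <stddef.h>

/* Count codepoints in a UTF-8 string by counting non-continuation
   bytes (every byte not of the form 10xxxxxx starts a codepoint). */
static size_t utf8_codepoints(const char *s) {
    size_t n = 0;
    for (; *s != '\0'; s++)
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    return n;
}

/* Both of these display as "é":
   precomposed: U+00E9                          (one codepoint)
   decomposed:  U+0065 U+0301, 'e' + combining acute (two codepoints) */
static const char precomposed[] = "\xC3\xA9";
static const char decomposed[]  = "e\xCC\x81";
```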

 
