Erlang String Issue

9 September 2008

Posted by Oliver Mason in erlang, programming.
I could never really understand people complaining about Erlang’s lack of a string data structure; having strings represented as lists seemed fine, especially as there is no problem with representing Unicode characters that go beyond 255.  The increased size didn’t bother me: people worried about that when Unicode first arrived and it didn’t really turn out to be a big issue.  And I do a lot of processing of large texts.

However, working on a basic YAML library for Erlang (see code repository) I ran into a problem where I least expected there to be one: generating a YAML representation of a data structure.

Parsing basic YAML (just lists and mappings) was a piece of cake in the end (ignoring the more, erm, arcane bits of the YAML spec), and writing it seemed to be trivial.  However, when it comes to writing out a list the problems start.  I cannot differentiate between a list of numbers and a string.  Hence, I cannot decide which way to represent it in YAML, as a string or a list of numbers.
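To make the ambiguity concrete, here is a tiny sketch (the module name `string_demo` is made up purely for illustration). A string literal is just shorthand for a list of code points, so nothing at runtime tells the two apart:

```erlang
-module(string_demo).
-export([same/0]).

%% A string literal is syntactic sugar for a list of integer code points,
%% so both sides of the comparison are the very same term.
same() ->
    "abc" =:= [97, 98, 99].
```

Calling `string_demo:same()` returns `true`, which is exactly why a YAML writer cannot tell which of the two the programmer meant.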

This is only a human-factor issue: writing it as a list of numbers and then reading it in again obviously gives identical results, so for serialisation purposes it’s fine.  But when editing the file I don’t really fancy looking up the ASCII codes for all the letters I want to change.  A purely ‘presentational’ issue, but a tricky one.

Solutions? I could check whether all the elements of a list are in the range of readable characters, which I assume is what the Erlang shell does when printing something.  But that’s a hack, really: a genuine list of small numbers would still come out as a string.  Another solution is to go the whole hog and introduce a string type, presumably something like ‘{string, [45,64,…]}’.  Then I can easily decide how to write it out.  And parsing would be easy as well.
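A minimal sketch of the first option, assuming the standard `io_lib:printable_list/1` is an acceptable test for "readable". The module and function names (`yaml_guess`, `classify/1`) are hypothetical and not part of the YAML library, and treating the empty list as a sequence is my own arbitrary choice:

```erlang
-module(yaml_guess).
-export([classify/1]).

%% Hypothetical helper: guess whether a list should be written out
%% as a YAML string or as a YAML sequence of numbers.
%% The empty list is ambiguous; calling it a sequence is an assumption.
classify([]) ->
    sequence;
classify(L) when is_list(L) ->
    %% io_lib:printable_list/1 is true only for flat lists of
    %% printable characters.
    case io_lib:printable_list(L) of
        true  -> string;
        false -> sequence
    end.
```

The weakness shows up immediately: `yaml_guess:classify([1, 2, 3])` gives `sequence`, but `yaml_guess:classify([72, 105])` gives `string`, even if those two values were meant as plain numbers.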

Now, that looks like a good idea, but it would interfere with the way most other programs use strings.  And the string library wouldn’t work.  So I guess I have to write my own.

My current plan is to allow various representations of strings, so that the data structure will actually be ‘{string, TYPE, DATA}’, where TYPE could be ‘list’ (the current way), ‘binary’ (a binary in UTF-8 format), ‘rope’ (a different approach to strings, see description with Java code).  Other representations could be added at a later stage.  There would be a function estring:to_list(string), which would convert the string into a list for use with the string library, and estring:to_string(list) would create a string from the given list.  The default representation would probably be a binary, as that can in turn be used as a component of a rope easily once something gets added to it.
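A rough sketch of what the two conversion functions could look like, covering only the ‘list’ and ‘binary’ types and leaving ropes out entirely. It assumes an OTP release that ships the `unicode` module; the extra `to_string/2` with an explicit type argument is my own addition for illustration, not something settled in the plan above:

```erlang
-module(estring).
-export([to_string/1, to_string/2, to_list/1]).

%% Sketch of the tagged representation {string, Type, Data}.
%% Only the 'list' and 'binary' types are handled here.

%% Default representation: a UTF-8 encoded binary.
to_string(List) ->
    to_string(binary, List).

%% to_string/2 is a hypothetical extension that picks the type explicitly.
to_string(list, List) when is_list(List) ->
    {string, list, List};
to_string(binary, List) when is_list(List) ->
    %% characters_to_binary/1 encodes code points as UTF-8; it returns
    %% an error tuple instead of a binary if the list is not valid text.
    case unicode:characters_to_binary(List) of
        Bin when is_binary(Bin) -> {string, binary, Bin};
        _Error -> erlang:error(badarg)
    end.

%% Convert back to a plain list for use with the string library.
to_list({string, list, List}) ->
    List;
to_list({string, binary, Bin}) ->
    case unicode:characters_to_list(Bin) of
        List when is_list(List) -> List;
        _Error -> erlang:error(badarg)
    end.
```

With this, `estring:to_string([104, 233])` yields `{string, binary, <<104, 195, 169>>}`, and passing that to `estring:to_list/1` gives back `[104, 233]`. A YAML writer can then match on the `string` tag and never has to guess.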

Another alternative would be to use Starling, but I’m not too keen on it as it’s not pure Erlang, using a C/C++ library under the hood.

How do you deal with strings?

UPDATE: Here’s a discussion about the same problem in the context of JSON: http://www.lshift.net/blog/2007/09/13/how-should-json-strings-be-represented-in-erlang

Comments»

1. thomas lackner - 12 September 2008

Why not use a binary for strings and lists for.. lists? That is what some of the JSON libraries do for the same problem, and it seems like a lot of Erlang is going that way – especially with the new faster binary comprehensions. Just normalize for UTF8.

2. ojmason - 12 September 2008

I find binaries awkward to deal with. And the standard set of string functions doesn’t work with them, so what I’m basically planning now is a library that can handle strings stored internally as binaries. For compatibility it will also be able to deal with lists, and for potential performance benefits I’ll include some version of ropes.

You get your set of string functions, and it doesn’t matter which internal representation is used.

3. hamish mcdonnell - 23 September 2008

if you’re planning on reimplementing unicode in erlang, you have a lot of pain ahead of you. even if you only do portions of it.

there is a good reason most people use icu (starling included)…

4. Sujan - 12 June 2009

Hey Mason,

I was just wondering if you found any solution for string handling?

Thank you,
Sujan

5. ojmason - 15 June 2009

Not at the moment; I’ve been very busy recently, and Erlang coding has fallen by the wayside a bit. Especially since I now need to catch up on Objective-C.

But with the summer holidays coming up I hope to be able to spend a bit more time on Erlang stuff.

