
Using Neotoma to parse PEG in Erlang 25 February 2011

Posted by Oliver Mason in erlang, programming.

For a project I need some easy and simple way to read structured data from a text file. Initially I considered JSON, and found a JSON parser for Erlang, but then decided that this was just overkill for what I needed. Ideally there would be a better match between the data structures I needed (lists, atoms, strings) and the file format.

I then decided to use Lisp-like S-expressions; at least a simplified version thereof. The data I read from the file is basically a list which can contain other lists, strings (which technically are also just lists), and atoms. A while ago I wrote a simple Erlang module to process something similar, but that had made certain assumptions that didn’t hold anymore, and I felt something more maintainable was required. And what better way to do that than by using a formal grammar to describe the file format and a tool to generate a parser from that?
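To make the intended mapping concrete, here is a hypothetical input and the Erlang term it should become (the file contents and names are invented for illustration):

```erlang
%% S-expression-style input file:
%%   (config (name "myproject") (tags (foo bar)))
%%
%% The Erlang data structure it should parse into:
%% a nested list of atoms, strings, and sub-lists:
[config, [name, "myproject"], [tags, [foo, bar]]]
```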

A simple and straightforward grammar formalism is PEG (Parsing Expression Grammar), and there is already an Erlang parser generator available for it: Neotoma, by Sean Cribbs. Installation was easy, and so was writing a grammar:

list <- open elem* close;
elem <- list / atom / sstring / dstring;
atom <- [a-z0-9_]+ space*;
dstring <- '"' [^"]* '"' space*;
sstring <- "'" [^']* "'" space*;
open <- '(' space* ;
close <- ')' space* ;
space <- ' ' / '\t' / eol;
eol <- '\r\n' / '\n' / '\r';

A list is something (or nothing) enclosed in brackets (with optional spaces). An element is a choice of things; atoms are lower-case letters and digits (at least one), and for strings I allow both double and single quotes. This grammar is saved in a file “terms.peg”, from which Neotoma generates the parser module:
Eshell V5.7.3 (abort with ^G)
1> neotoma:file("terms.peg").
2> c(terms).

and you’re ready to go. I created four short one-line test files, with the following content:

  1. (atom)
  2. ( "string" )
  3. (foo bar)
  4. (())

This is the output:
3> terms:file("test1").
4> terms:file("test2").
[["(",[" "]],[["\"","string","\"",[" "]]],[")",["\n"]]]
5> terms:file("test3").
[["(",[]],[["foo",[" "]],["bar",[]]],[")",["\n"]]]
6> terms:file("test4").

Not all that helpful, as there is a lot of noise in there, such as the spaces in “test2”, and all the line-breaks. So I need to go back to the AST and extract just those bits from the parse tree that I actually want. In Neotoma you can do this by adding bits of Erlang code to the grammar definition, like so:
list <- open elem* close
`[Open, Elem, Close] = Node, Elem`
atom <- [a-z0-9_]+ space*
`[Atom, Space] = Node, list_to_atom(Atom)`
dstring <- '"' [^"]* '"' space*
`[Quote, Str, Quote, Space] = Node, Str`
sstring <- "'" [^']* "'" space*
`[Quote, Str, Quote, Space] = Node, Str`

(All other rules are unchanged from the grammar listed above.)

What I do here is to split the Node into its component parts, and then discard the bits I don’t want. In the ‘list’ rule I am only interested in the elements, but not in the enclosing brackets, so I just return ‘Elem’. For the ‘atom’ I ignore the spaces and convert the matched character sequence into an atom. Now the output looks like this:
7> neotoma:file("terms.peg").
8> c(terms).
9> terms:file("test1").
10> terms:file("test2").
11> terms:file("test3").
12> terms:file("test4").

Much better, and just what I wanted. The ‘terms.erl’ file that Neotoma generated is 7 KB in size, just over 220 lines, and just under 8 KB compiled.
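The generated module can also be called on a string directly, which is handy for quick experiments. A hypothetical shell session (assuming the parse/1 function that Neotoma-generated modules export alongside file/1):

```erlang
%% Hypothetical session; the exact term shapes depend on the
%% Neotoma version, but with the transformations above the
%% bracket and whitespace noise should be gone:
13> terms:parse("(foo 'baz' (bar))").
%% expected roughly: [foo, "baz", [bar]]
```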

The only issue is speed and memory consumption: on my 8 GB MacBook Pro, a file of less than 40 KB runs out of memory and crashes after more than 30 seconds. If I trim a part off the end to bring it down to 35 KB, the parser succeeds, but needs 35 seconds (hand-timed). So I think I will have to revisit my hand-made parser after all… :(

I had an email exchange about this with Sean, who informs me that this is a limitation of the memoisation, which creates multiple duplicates as (unoptimised) lists. So it is not a fault of Neotoma, but of the algorithm in general. There are ways around this, but available time to implement them is, as always, a limiting factor!



1. Thomas Arts - 25 February 2011

In order to speed it up you need to change the grammar to get rid of ambiguities.
For example: eol should be any \r or \n combination.

2. ojmason - 25 February 2011

Mmh, I tried that, removing the ‘eol’ rule and changing ‘space’ to
space <- ' ' / '\t' / '\n' / '\r';

And I also changed the order of the ‘elem’ rule so that ‘list’ is at the end, but with a 35k file I cannot get that below (Erlang-timer-timed) 35 seconds.

I kind of hoped that there was something I did wrong, as I really want it to be fast, efficient, and solve my problem. So I don’t know whether that is a general issue with packrat-parsers, or whether it can be solved by improving the grammar (which does look quite straight forward to me, though).

3. Jachym - 25 February 2011

While I appreciate your explorative spirit, and given the description of your needs, I’m afraid I have to say you’re wasting your time. Just use file:consult/1 and be done with it.

The major piece of epiphany here is that you can use Erlang terms to describe anything from configuration data (see sys.config or *.app files) through a full-fledged programming language (see erl_parse and friends, or LFE) or all kinds of protocol datagrams (a Diameter message = a list of {Key, Value} pairs) to … almost anything really.
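For comparison, a minimal sketch of the file:consult/1 approach: the data file holds plain Erlang terms, each terminated by a full stop, and reading it back needs no custom parser at all (the file name and contents here are invented for illustration):

```erlang
%% Contents of a hypothetical "config.terms" file:
%%   {name, "myproject"}.
%%   {tags, [foo, bar]}.
%%
%% file:consult/1 reads and parses all terms in one call:
{ok, Terms} = file:consult("config.terms").
%% Terms is now [{name, "myproject"}, {tags, [foo, bar]}]
```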

Anyway, nice neotoma intro. :-)

Oliver Mason - 25 February 2011

Yes, I was aware of that, but–call me old-fashioned–I don’t like the native Erlang format. Too many commas, for example. So, yes, you could argue that it is a waste of time, but on the other hand the S-expression-style format is less specific to Erlang, and as I use a mix of several languages this is also an important consideration.

Glad you like the intro, at least, as that was also a good reason for me to explore neotoma. New tools that look useful always need some smallish project to play around with… and I still hope I made a major stupid mistake with the grammar design that somebody will point out!
