jump to navigation

Using Neotoma to parse PEG in Erlang 25 February 2011

Posted by Oliver Mason in erlang, programming.

For a project I need some easy and simple way to read structured data from a text file. Initially I considered JSON, and found a JSON parser for Erlang, but then decided that this was just overkill for what I needed. Ideally there would be a better match between the data structures I needed (lists, atoms, strings) and the file format.

I then decided to use Lisp-like S-expressions; at least a simplified version thereof. The data I read from the file is basically a list which can contain other lists, strings (which technically are also just lists), and atoms. A while ago I wrote a simple Erlang module to process something similar, but that had made certain assumptions that didn’t hold anymore, and I felt something more maintainable was required. And what better way to do that than by using a formal grammar to describe the file format and a tool to generate a parser from that?

A simple and straight forward grammar formalism is PEG, Parsing Expression Grammar, and there is already an Erlang parser available for it, Neotoma by Sean Cribbs. Installation was easy, and so was writing a grammar:

list <- open elem* close;
elem <- list / atom / sstring / dstring;
atom <- [a-z0-9_]+ space*;
dstring <- '"' [^"]* '"' space*;
sstring <- "'" [^']* "'" space*;
open <- '(' space* ;
close <- ')' space* ;
space <- ' ' / '\t' / eol;
eol <- '\r\n' / '\n' / '\r';

A list is something (or nothing) enclosed in quotes (with optional spaces). An element is a choice of things, atoms are lower case letters and digits (at least one), and with strings I allow both double and single quotes. This grammar is saved in a file “terms.peg”:
Eshell V5.7.3 (abort with ^G)
1> neotoma:file("terms.peg").
2> c(terms).

and you’re ready to go. I created four short one-line test files, with the following content:

  1. (atom)
  2. ( “string” )
  3. (foo bar)
  4. (())

This is the output:
3> terms:file("test1").
4> terms:file("test2").
[["(",[" "]],[["\"","string","\"",[" "]]],[")",["\n"]]]
5> terms:file("test3").
[["(",[]],[["foo",[" "]],["bar",[]]],[")",["\n"]]]
6> terms:file("test4").

Not all that helpful, as there is a lot of noise in there, such as the spaces in “test2”, and all the line-breaks. So I need to go back to the AST and extract just those bits from the parse tree that I actually want. In Neotoma you can do this by adding bits of Erlang code to the grammar definition, like so:
list <- open elem* close
`[Open, Elem, Close] = Node, Elem`
atom <- [a-z0-9_]+ space*
`[Atom, Space] = Node, list_to_atom(Atom)`
dstring <- '"' [^"]* '"' space*
`[Quote, Str, Quote, Space] = Node, Str`
sstring <- "'" [^']* "'" space*
`[Quote, Str, Quote, Space] = Node, Str`

(All other lines are unchanged as in the grammar listed above)

What I do here is to split the Node into its component parts, and then discard the bits I don’t want. In the ‘list’ rule I am only interested in the elements, but not in the enclosing brackets, so I just return ‘Elem’. For the ‘atom’ I ignore the spaces and convert the matched character sequence into an atom. Now the output looks like this:
7> neotoma:file("terms.peg").
8> c(terms).
9> terms:file("test1").
10> terms:file("test2").
11> terms:file("test3").
12> terms:file("test4").

Much better, and just what I wanted. The ‘terms.elr’ file that neotoma generated is 7kb in size, just over 220 lines, and just under 8kb compiled.

The only issue is speed and memory consumption: on my 8GB MacBook Pro a file of less than 40k runs out of memory and crashes after 30+ seconds. If I take a part off at the end to make it 35k, the parser succeeds, but needs 35 seconds (hand-timed). So I think I will have to revisit my hand-made parser again after all… :(

I had an email exchange about this with Sean, who informs me that this is a limitation of the memoisation, which creates multiple duplicates as (unoptimised) lists. So, not a fault of neotoma, but of the algorithm in general. There are ways around this, but available time to implement is as always a limiting factor!