jump to navigation

Building the database without glare 5 May 2009

Posted by ojmason in Apple, erlang, iphone, objective-c, programming.
1 comment so far

DAY 7 sounds like a lot, but again I’m only working a few hours in the evening after the kids have gone to bed. I have the feeling that there is a lot of ‘boilerplate’ code to write in Objective-C, or Cocoa at least. But then, that might be the problem with GUI-related programming. Erlang doesn’t nearly need as many lines to accomplish something, anything really! But then, the kind of Erlang programs I’ve been working on are basic R&D text-only affairs, not MVC-style user interface programming.

I have today learned how to turn off the iPhone program icon glare (add UIPrerenderIcon=true to info.plist in the bundle), which I think looks better on my icon which had a horizontal line just about where the glare-line was. I also created a SQLite database from a text file. Next I will need to integrate the DB with the table view, for the first part of actual functionality. I keep switching between chapters in the textbook, the one which builds tables and the one which deals with persistent storage. Why did nobody think of doing a database-backed table view? Need to check Apple’s sample code, they’ve got a book-storage one which might do that.

The depressing bit is that I write a little Noddy-program, and it takes me ages. Mainly getting used to Cocoa, Xcode, and Objective-C, but also going back to no garbage collection, and shifting data between classes I know not well at all (NSString, NSArray, etc). I think my niche will be little utilities, rather than glorious games!

Thinking Erlang, or Creating a Random Matrix without Loops 26 February 2009

Posted by ojmason in erlang, misc, programming.
9 comments

For a project, my Erlang implementation of a fast PFNET algorithm, I needed to find a way to create a random matrix of integers (for path weights), with the diagonal being filled with zeroes.  I was wondering how best to do that, and started off with two loops, an inner one for each row, and an outer one for the full set of rows.  Then the problem was how to tell the inner loop at what position the ‘0′ should be inserted.  I was thinking about passing a row-ID, when it suddenly clicked: lists:seq/2 was what I needed!  This method, which I previously thought was pretty useless, creates a list with a sequence of numbers (the range is specified in the two parameters).  For example,

1> lists:seq(1,4).
[1,2,3,4]
2> lists:seq(634,637).
[634,635,636,637]
3> lists:seq(1000,1003).
[1000,1001,1002,1003]

Now I would simply generate a list with a number for each row, and then send the inner loop off to do its thing, filling the slot given by the sequence number with a zero, and others with a random value.

But now it gets even better.  Using a separate (tail-)recursive function for the inner loop didn’t quite seem right, so I thought a bit more about it and came to the conclusion that this is simply a mapping; mapping an integer to a list (a vector of numbers, one of which (given by the integer) is a zero).  So instead of using a function for filling the row, I call lists:seq again and then map the whole thing.  This is the final version I arrived at, and I’m sure it can still be improved upon using list comprehensions:

random_matrix(Size, MaxVal) ->
  random:seed(),
  lists:map(
    fun(X) ->
      lists:map(
          fun(Y) ->
              case Y of
                 X -> 0;
                 _ -> random:uniform(MaxVal)
                 end
              end,
          lists:seq(1,Size))
      end,
    lists:seq(1,Size)).

This solution seems to be far more idiomatic, and I am beginning to think that I finally no longer think in an imperative way of loops, but more in the Erlang-way of list operations.  Initially this is hard to achieve, but with any luck it will become a lot easier once one is used to it.  Elegance, here I come!

Example run:

4> random_matrix(6,7).
[[0,1,4,6,7,4],
 [3,0,5,7,5,4],
 [5,1,0,2,5,2],
 [4,2,4,0,3,1],
 [4,4,3,3,0,1],
 [5,7,3,2,2,0]]

Note: I have used random:seed/0 above, as I am happy for the function to return identical matrices on subsequent runs with the same parameters. To get truly random results, that would have to be left out. However, for my benchmarking purposes it saved me having to save the matrix to a file and read it in, as I can easily generate a new copy of the same matrix I used before.

Fast PFNETs in Erlang 14 February 2009

Posted by ojmason in algorithm, erlang, programming.
1 comment so far

Introduction

Pathfinder Networks (PFNETs) are networks derived from a graph representing proximity data.  Basically, each node is connected to (almost) every other node by a weighted link, and that makes it hard to see what’s going on.  The Pathfinder algorithm prunes the graph by removing links which are weighted higher than another path between the same nodes.

For example: A links to B with weight 5.  A also links to C with weight 2, and C links to B with weight 2.  Adding up the weights, A to B direct is 5, A to C to B is 4.  Result: we remove the link from A to B, as the route via C is shorter.  There are different ways to calculate the path lengths (using the Minkowski r-metric), but you get the general idea.  The resulting PFNET has fewer links and is easier to analyse.

In Schvaneveldt (1990) an algorithm for computing PFNETs is given, but it is rather complex and computationally intensive.  There is an improved algorithm called Binary Pathfinder, but that is apparently more memory intensive.  Not very promising so far, but then along comes A new variant of the Pathfinder algorithm to generate large visual science maps in cubic time, by Quirin, Cordón, Santamaría, Vargas-Quesada, and Moya-Anegón.  This algorithm is a lot faster (by about 450 times), but has one disadvantage: speed is traded in for flexibility.  The original Pathfinder algorithm has two parameters, r (the value of the Minkowski metric to be used) and q (the maximum length of paths to be considered).  The fast algorithm only has r, and always uses the maximum value for q (which is n-1).  I know too little about the application of PFNETs to say whether this is important at all; for the uses I can envisage it does not seem to matter.

As added bonus, the algorithm in pseudo code in Quirin et al. is very short and straightforward.  They’re using a different shortest-path algorithm to identify, erm, shorter paths.  And then it’s very simple to prune the original network.

A picture tells more than 1K words, so here instead of 2000 words the before and after, graph layout courtesy of the graphviz program:

A network example (from Schvaneveldt 1990)

Fig 1: A network example (from Schvaneveldt 1990)

Here, each node is linked to every other node.  Running the PFNET algorith on it, we get the output shown in the second figure.

A PFNET generated from the previous graph

Fig 2: A PFNET generated from the previous graph

If you compare the output with the actual result from Schvaneveldt’s book (p.6/7), you’ll realise that it is not identical, and the reason for that is that the example there limits the path-length, using the parameters (r = 1, q = 2) rather than (r = 1, q = n-1) as in the example shown here.  As a consequence, the link from N1 to N4 (with a weight of 5) disappears, because of the shorter path (N1-N2-N3-N4, weight 4).  But that path is too long if q is just 2, and so it is kept in Schvaneveldt’s example.

Implementation

It is not possible to implement the pseudo-code given in Quirin et al directly, as they use destructive updates of the link matrix, which we obviously cannot do in Erlang.  But the first, naïve implementation, is still quite short.  The input graph is represented as a matrix (n x n) where each value stands for the link weight, with zero being used to indicate non-existing links.  I have written a function that creates a dot file from a matrix, which is then fed into graphviz for generating images as the ones shown above.

There are basically two steps: creating a matrix of shortest paths from the input matrix, and then generating the PFNET by comparing the two matrices; if a certain cell has the same value in both matrices, then it is a shortest path and is kept, otherwise there is a shorter path and it’s pruned. Here is the main function:

find_path(Matrix) ->
    Shortest = loop1(1,length(Matrix),Matrix),
    generate_pfnet(Matrix, Shortest, []).

Next we have the three loops (hence ‘cubic time’!) of the Floyd-Warshall shortest path algorithm to create the shortest path matrix:

loop1(K,N,Matrix) when K > N ->
    Matrix;
loop1(K,N,Matrix) ->
    NewMatrix = loop2(K,1,Matrix,
        Matrix,[]),
    loop1(K+1,N,NewMatrix).

loop2(_,_,_,[],Acc) ->
    lists:reverse(Acc);
loop2(K,I,D,[Row|Rest],Acc) ->
    NewRow = loop3(K,I,1,D, Row, []),
    loop2(K,I+1,D,Rest,[NewRow|Acc]).

loop3(_,_,_,_, [], Acc) ->
    lists:reverse(Acc);
loop3(K,I,J,D, [X|Rest], Acc) ->
    loop3(K,I,J+1,D, Rest,
    [min(X, get(I,K,D) + get(K,J,D))|Acc]).

The final line implements the Minkowski metric with r = 1; this could be expanded to include other values as well, eg r = 2 for Euclidean, or r = ∞ (which seem to be the most common values in use; the latter means using the maximum weight of any path component along the full path).

And here are two utility methods, one to find the smaller of two values, and one to retrieve an element from a matrix (which is a list of rows).  There is something of a hack to deal with the fact that zero does not mean a very small weight, but refers to a non-existing link:

min(X, Y) when X < Y -> X;
min(_X, Y) -> Y.

get(Row, Col, Matrix) ->
    case lists:nth(Col,
      lists:nth(Row, Matrix)) of
        0 -> 999999999;
        X -> X
    end.

And finally, generating the PFNET by comparing the two matrices (again, awful hack included):

generate_pfnet([],[],Result) ->
    lists:reverse(Result);
generate_pfnet([R1|Rest1], [R2|Rest2], Acc) ->
    Row = generate_pfrow(R1,R2,[]),
    generate_pfnet(Rest1, Rest2, [Row|Acc]).

generate_pfrow([],[],Result) ->
    lists:reverse(Result);
generate_pfrow([C|Rest1], [C|Rest2], Acc) ->
    case C of
        999999999 -> C1 = 0;
        _ -> C1 = C
    end,
    generate_pfrow(Rest1, Rest2, [C1|Acc]);
generate_pfrow([C1|Rest1], [C2|Rest2], Acc) ->
    generate_pfrow(Rest1, Rest2, [0|Acc]).

Discussion

So this is the basic code.  It works, but there is scope for improvement.

  • it only currently generates PFNETs with (r = 1, q = n-1)
  • there is no parallelism, hence it’s not really making use of Erlang’s strengths.
  • the three loops don’t look very elegant, and could probably be replaced by list comprehensions

Because of the matrix being updated, it doesn’t look that easy to parallelise the processing, but it would work at the level of updating the individual rows.  If that can be done in parallel, it would probably provide some speed-up (provided the matrix is not of a trivial size).  So the first step would be to change the matrix processing using lists:map/2, and replacing this then by pmap, as described in this article by Dave Thomas.

Once the code is up to scratch with full parallelism, and tested on larger matrices I will probably put it up on Google Code in case other people are interested in using it.  If you have any suggestions, tell me in the comments!

References

  • Schvaneveldt, R. W. (Ed.) (1990) Pathfinder Associative Networks: Studies in Knowledge Organization. Norwood, NJ: Ablex. The book is out of print. A copy can be downloaded from the Wikipedia page at http://en.wikipedia.org/wiki/Pathfinder_Networks
  • Quirin, A; Cordón, O; Santamaría, J; Vargas-Quesada, B; Moya-Anegón, F (2008) “A new variant of the Pathfinder algorithm to generate large visual science maps in cubic time”, Information Processing and Management, 44, p.1611-1623.

Erlang String Issue 9 September 2008

Posted by ojmason in erlang, programming.
5 comments

I could never really understand people complaining about Erlang’s lack of a string data structure; having them represented as lists seemed fine, especially as there is no problem with representing unicode characters that go beyond 255.  The increased size didn’t bother me: people worried about that when unicode first arrived and it didn’t really turn out as a big issue.  And I do a lot of processing of large texts.

However, working on a basic YAML library for Erlang (see code repository) I ran into a problem where I least exected to be one: generating a YAML representation of a data structure.

Parsing basic YAML (just lists and mappings) was a piece of cake in the end (ignoring the more, erm, arcane bits of the YAML spec), and writing it seemed to be trivial.  However, when it comes to writing out a list the problems start.  I cannot differenciate between a list of numbers and a string.  Hence, I cannot decide which way to represent it in YAML, as a string or a list of numbers.

This is only a human-factor issue: writing it as a list of numbers and then reading it in again obviously gives identical results, so for serialisation purposes it’s fine.  But when editing the file I don’t really fancy looking up the ASCII codes for all the letters I want to change.  A purely ‘presentational’ issue, but a tricky one.

Solutions? I could check any list if all the elements are in the range of readable characters, which I assume is what the Erlang VM does when printing something.  But that’s a hack, really.  Another solution is to go the whole hog and introduce a string type, presumably something like ‘{string, [45,64,...]}’.  Then I can easily make the decision how to write it out.  And parsing would be easy as well.

Now, that looks like a good idea, but it would interfere with the way most other programs use strings.  And the string library wouldn’t work.  So I guess I have to write my own.

My current plan is to allow various representations of strings, so that the data structure will actually be ‘{string, TYPE, DATA}’, where TYPE could be ‘list’ (the current way), ‘binary’ (a binary in UTF-8 format), ‘rope’ (a different approach to strings, see description with Java code).  Other representations could be added at a later stage.  There would be a function estring:to_list(string), which would convert the string into a list for use with the string library, and estring:to_string(list) would create a string from the given list.  The default representation would probably be a binary, as that can in turn be used as a component of a rope easily once something gets added to it.

Another alternative would be to use Starling, but I’m not too keen on it as it’s not pure Erlang, using a C/C++ library under the hood.

How do you deal with strings?

UPDATE: Here’s a discussion about the same problem in the context of JSON: http://www.lshift.net/blog/2007/09/13/how-should-json-strings-be-represented-in-erlang

Replacing a stack with concurrency 23 April 2008

Posted by ojmason in NLP, erlang, programming.
2 comments

For some language processing task I needed a reasonably powerful parser (a program to identify the syntactic structure of a sentence). So I dug out my copy of Winograd (1983) (“Language as a Cognitive Process”) and set about implementing an Augmented Transition Network parser in Erlang.

Now, the first thing you learn about natural language is that it is full of ambiguities, and so there will always be several alternatives available, several possible paths through the network which defines the grammar. The traditional solution is to dump all the alternatives on a stack, and look at them when the current path has been finished with. You can either go depth-first, where you complete the current path before you get the next one off the stack, or breadth-first, where you advance all paths by one step at a time, kind of pseudo-parallel.

Having to deal with a stack is tedious, as you need to keep track of the current configuration: which network are you at, what node, what position in the sentence, etc. But then, it occurred to me, there’s an easier way to do it (at least it’s easier in Erlang!): every time you come to a point where you have multiple alternatives, you spawn a new process and pursue all of them in parallel.

The only overhead you need is a loop which keeps track of all the processes currently running. This loop receives the results of successful paths, and gets notified of unsuccessful ones (where the process terminates without having found a valid structure). No need for a stack, and hopefully very efficient processing on multi-core machines as a free side-effect.

I’m still amazed how easy it was to implement. I wouldn’t have fancied doing that in Java or even C. For my test sentences I had about 8 to 10 processes running in parallel most of the time, but it depends on the size of the grammar and the length of the sentence really. What I liked about this was that it seemed the natural way to do in Erlang, where working with processes is just so easy.

And also, another nail in the coffin for the claim that you can’t use Erlang for handling texts easily!

Safe receive in Erlang 20 April 2008

Posted by ojmason in erlang.
1 comment so far

I’m currently working on a system which involves a cascade of processes, and ran into some problems with communication between them. Each of the ’service’ processes spends most of the time in a receive loop, and then performs some action. Communication is done via a simple rpc function that sends off a message and then receives the reply and returns.

This is fairly standard, it seems, and is explained in Armstrong’s Programming Erlang (p.145). However, a problem arises if the process (B) sending a message to (C) also receives a request from (A):

A => B <= C

The message from (A) interferes with the response from (C). Chaos ensues.

A possible solution makes rpc more robust, but has a little overhead: any replies sent include the process id of the replying process. The rpc function can then just react to messages coming from the process it called.

This could look like:


rpc(Pid, Msg, Arg) ->
Pid ! {self(), Msg, Arg},
receive
{Pid, Reply} ->
Reply
end.

Problem solved!

Erlang on the Tipping Point? 2 April 2008

Posted by ojmason in erlang.
add a comment

Yariv wants to rename the Erlang web framework ErlyWeb to ‘Erlang on Rails’, as he believes Erlang is on the tipping point, and such a name change would swing more programmers towards adopting Erlang. Mmmh, I should have read the comments rather than just the article via RSS: it appears to be an April Fool. But that doesn’t change the original point of this post.

While that might be a good thing to do (April Fool or not), for me the more important move would be to unify/restructure the various libraries and their documentation. I find it still tricky to find details about some functionality because I don’t remember whether it’s in stdlib or kernel, and the web-based documentation browser makes it very hard to navigate. It also seems to be rather random to the average programmer without too much insight into the underlying implementation details.

But otherwise I agree with him, in that I too think Erlang might be going somewhere. I will definitely stick with it, and hope for the best that the libraries will sort themselves out. Though I’m not too hopeful because of backwards compatibility issues…

Tags for Erlang 17 March 2008

Posted by ojmason in erlang.
Tags:
2 comments

The ‘ctags’ program creates a ‘tags’ file, which indexes routines/methods/functions in C source files (and a variety of other languages. The editor vile (my ‘IDE’) can make use of those: you position the cursor on the name of a function, press ^], and the editor jumps to its definition (loading the appropriate source file). The tags file consists of the function name, the file name, and a search pattern.

Running ctags on an Erlang source file does not produce any usable results, unsurprisingly. However, the file format is so simple that a straightforward shell-script will do:

#!/bin/sh
rm -f tags
for i in *.erl
do
cat $i | grep "^[a-z0-9_]*(" |
sed "s/^\([^(]*\)(.*$/\1/" | uniq |
awk  -v file=$i '{printf("%s    %s\\
   /^%s(/\n",$0,file,$0);}' >> tags
done

(The double backslash indicates a line break, as the end of the line would otherwise be outside the box and invisible)

This file processes all *.erl files in a directory, extracts the function names (a sequence of letters, numbers, underscores beginning on the first column and followed by an open round bracket), and creates the correct lines for the tags file.

Works fine so far.

Progress – More Erlang 8 March 2008

Posted by ojmason in erlang.
1 comment so far

Erlang practice is slowly moving along. At a speed of about one per day I am currently porting my Java NLP tools over to Erlang. It still takes quite some time getting used to it, especially the flexibility with types.
Everything is a list or a tupel or an atom, so it is sometimes tricky to keep track of what you’re dealing with. Ad-hoc solution: annotate the type.

My implementation of a trie is simply a list, so when looking something up in a trie I pass a list as the first parameter. But what if it’s not a trie-list? Bad things happen within the trie module, far from the place where the actual error is caused. So the exported functions, insert/3 and retrieve/2 now take a tupel rather than a list. The tupel is {trie List}, where the List is what I used to pass previously. But now it is a lot easier to check that the error is actually that the wrong element is being passed.

Learning Erlang 2 March 2008

Posted by ojmason in erlang.
add a comment

After messing about with Scala for a while I found the syntax rather confusing. It’s just too much. Somehow I then had a look at Erlang, can’t quite remember why. But Erlang promises to be robust and easily concurrent, and it is an industrial strength language, being used for telephony programming at Ericsson. And it looks remarkably like Prolog. Now I’m glad I did do Prolog at uni all those years ago, even though it seemed like a waste of time at the time.

Erlang is not exactly famous for its string processing facilities: a string is just a list of numbers. But it also means you can use list processing stuff for dealing with strings. By way of learning I now port/rewrite my corpus access tool in Erlang. That’s the fourth language already, after C, C++, and Java. The Scala version didn’t get started properly, and I think I’ll abandon that now.

There are some pitfalls to bear in mind when working with Erlang. Mainly it is getting used to the way Erlang works. A very useful feature is that functions can return tupels, eg {ok, Result}, or {error, Reason}. One only needs to have a common way of doing that, so that some functions don’t return a tupel while others don’t. Or if they do, keep in mind what function returns what.

Sending messages between processes: Creating processes is dead easy, and I hope it will really mean speed when I run the system on the university’s multi-core machine. Just be careful to remember what messages each process accepts, and don’t include a catch-all one. Especially intermediate processes which get a request from some process, and send one out to another process can sometimes get muddled up with their messages… And don’t open files in ‘raw’ mode without knowing what it means!

So far it is going well. After about a week I have the basic concordancing functionality ported. Of course most of that time was spent debugging and wondering why it didn’t work the way I wanted it. But the files are a lot shorter, and perhaps Erlang does indeed allow a tenfold increase in productivity. One thing I’m unlikely to port is the indexing stuff: that works alright in Java and is a one-off procedure anyway.

I cannot make any judgments on speed: no profiling yet. But once I can do a few more things with the system, I will test-drive it on the Blue Bear computer. Each corpus will be in its own process, so processing them in parallel looks promising.