jump to navigation

Solving Combinatoric Problems with List Comprehensions 10 May 2015

Posted by Oliver Mason in algorithm, erlang.
Tags: , ,
add a comment

My daughter had some maths homework the other day:

You have 4 bags, each full of the numbers 1, 3, 5, and 7 respectively. Take 10 of the numbers that when added up make 37. What numbers are they?

So far so good: that sounds easy enough. But a bit of trial and error quickly leads nowhere. Something can’t be right. So, let’s get the computer to work it out.

As I haven’t done much Erlang recently I thought I’d give it a go. And, during a casual glance at Armstrong’s Programming in Erlang I thought I’d finally understood list comprehensions, so I wrote the following program:
-module(comb).
-export([result/0]).
result() ->
[{A+B+C+D+E+F+G+H+I+J,A,B,C,D,E,F,G,H,I,J}||
A <- [1,3,5,7],
B <- [1,3,5,7],
C <- [1,3,5,7],
D <- [1,3,5,7],
E <- [1,3,5,7],
F <- [1,3,5,7],
G <- [1,3,5,7],
H <- [1,3,5,7],
I <- [1,3,5,7],
J <- [1,3,5,7],
A+B+C+D+E+F+G+H+I+J =:= 37].

I declare a module with one function, `result/0`. This finds me ten variables that can take any of the four specified values and add up to 37. Simples!

The list comprehension has ten generators, and one filter; it will return a tuple with the sum and the individual variables’ values.

Erlang R16B01 (erts-5.10.2) [64-bit] [smp:4:4] [async-threads:10] [hipe] [kernel-poll:false]
Eshell V5.10.2 (abort with ^G)
1> comb:result().
[]
2>

WTF???! An empty list?! So I try changing the 37 to another value, like 36.
3> comb:result().
[{36,1,1,1,1,1,3,7,7,7,7},
{36,1,1,1,1,1,5,5,7,7,7},
{36,1,1,1,1,1,5,7,5,7,7},
{36,1,1,1,1,1,5,7,7,5,7},
{36,1,1,1,1,1,5,7,7,7,5},
[etc, etc].

So it does work! Only, there doesn’t seem to be an answer to the question. And with a bit of logical reasoning it is obvious: when adding two odd numbers, you get an even number. So adding ten odd numbers also yields an even number, but 37 is odd.

What I learnt from this exercise: thinking about the problem beforehand can save you time, as there was no need to write a program at all. But then, I did get to use list comprehensions, and have learnt how powerful they are. And it neatly shows Erlang’s Prolog roots as well.

Update/Correction to “Elegant IR with Erlang” 14 October 2010

Posted by Oliver Mason in algorithm, erlang, programming.
add a comment

When I tried to actually use my implementation of tf-idf that I described in the previous post, I realised that it’s not quite what I wanted: as it is, I get a different tf-idf value for each token and each document. So with a collection of 1000 documents I get 1000 dictionaries containing the tokens in each text. However, what I really want is ONE dictionary with all the tokens in, and ONE tf-idf value for each token.

Merging the values is tricky, as it involves relative frequencies, so I needed to make some subtle changes. First, the term_freq/1 method now deals with absolute frequencies, and returns a tuple containing the frequency values and the document size in tokens, so that the relative frequencies can easily be computed if required:

term_freq(Text) ->
    term_freq(Text, 0, dict:new()).

term_freq([], Sum, Dict) ->
    {Dict, Sum};

term_freq([Token|Rest], Sum, Dict) ->
    term_freq(Rest, Sum+1,
           dict:update_counter(Token,1,Dict)).

No change really, only the terminating clause of term_freq/3 has dropped its dict:map to compute the relative values, and instead returns the tuple with the frequency dictionary and the document size.

This also requires a minor change in the inv_doc_freq/3 function, where we need to deal with the tuple and extract the dictionary from it in the second and final clause:

old

inv_doc_freq([Doc|Rest], DocNum, Dict) ->

new

inv_doc_freq([{Doc, _Sum}|Rest], DocNum, Dict) ->

The biggest change, however, is in the combined tf_idf/1 function, as the algorithm has somewhat changed. Originally the function was a full screen in the editor, but I have extracted two functions to make them easier to follow; the gain in clarity will surely outweigh the minute performance penalty…

tf_idf(Docs) ->
    Idfs = inv_doc_freq(Docs),
    DocLen = total_doc_size(Docs),
    DocTotalFreqs = total_token_freqs(Docs),
    dict:map(
        fun(Key, Value) ->
            dict:fetch(Key, Idfs) * Value / DocLen
            end,
        DocTotalFreqs).

I need to calculate the overall size (in tokens) of the full document collection, and then add up the token frequency over all documents. These have been factored out into separate functions. Then all is left is a map over all tokens to calculate the tf-idf value from the relative frequency in the document collection multiplied by the idf value as calculated earlier.

Computing the total document size is trivial: we loop over the list of term frequency dictionaries and this time extract the lengths, ignoring the actual dictionaries:

total_doc_size(Docs) ->
    lists:foldl(
        fun({_Doc, DocSum}, Total) -> Total + DocSum end,
        0,
        Docs).

And finally, that leaves computing the total frequencies of all tokens.

total_token_freqs(Docs) ->
    lists:foldl(
        fun({Doc, _Sum}, Current) ->
            dict:fold(
                fun(Key, Value, AccIn) ->
                    dict:update_counter(Key,Value,AccIn)
                    end,
                Current,
                Doc)
            end,
        dict:new(),
        Docs).

Here we process the document list (as there are likely to be fewer documents than tokens) and fold each dictionary, adding the tokens with their respective frequencies to our accumulator dictionary.

Apologies for this correction; but sometimes you only really realise that a particular interpretation of an algorithm is not the right one when you actually need to use it. The curse of developing libraries without proper specification of the requirements…

Fast PFNETs in Erlang 14 February 2009

Posted by Oliver Mason in algorithm, erlang, programming.
2 comments

Introduction

Pathfinder Networks (PFNETs) are networks derived from a graph representing proximity data.  Basically, each node is connected to (almost) every other node by a weighted link, and that makes it hard to see what’s going on.  The Pathfinder algorithm prunes the graph by removing links which are weighted higher than another path between the same nodes.

For example: A links to B with weight 5.  A also links to C with weight 2, and C links to B with weight 2.  Adding up the weights, A to B direct is 5, A to C to B is 4.  Result: we remove the link from A to B, as the route via C is shorter.  There are different ways to calculate the path lengths (using the Minkowski r-metric), but you get the general idea.  The resulting PFNET has fewer links and is easier to analyse.

In Schvaneveldt (1990) an algorithm for computing PFNETs is given, but it is rather complex and computationally intensive.  There is an improved algorithm called Binary Pathfinder, but that is apparently more memory intensive.  Not very promising so far, but then along comes A new variant of the Pathfinder algorithm to generate large visual science maps in cubic time, by Quirin, Cordón, Santamaría, Vargas-Quesada, and Moya-Anegón.  This algorithm is a lot faster (by about 450 times), but has one disadvantage: speed is traded in for flexibility.  The original Pathfinder algorithm has two parameters, r (the value of the Minkowski metric to be used) and q (the maximum length of paths to be considered).  The fast algorithm only has r, and always uses the maximum value for q (which is n-1).  I know too little about the application of PFNETs to say whether this is important at all; for the uses I can envisage it does not seem to matter.

As added bonus, the algorithm in pseudo code in Quirin et al. is very short and straightforward.  They’re using a different shortest-path algorithm to identify, erm, shorter paths.  And then it’s very simple to prune the original network.

A picture tells more than 1K words, so here instead of 2000 words the before and after, graph layout courtesy of the graphviz program:

A network example (from Schvaneveldt 1990)

Fig 1: A network example (from Schvaneveldt 1990)

Here, each node is linked to every other node.  Running the PFNET algorith on it, we get the output shown in the second figure.

A PFNET generated from the previous graph

Fig 2: A PFNET generated from the previous graph

If you compare the output with the actual result from Schvaneveldt’s book (p.6/7), you’ll realise that it is not identical, and the reason for that is that the example there limits the path-length, using the parameters (r = 1, q = 2) rather than (r = 1, q = n-1) as in the example shown here.  As a consequence, the link from N1 to N4 (with a weight of 5) disappears, because of the shorter path (N1-N2-N3-N4, weight 4).  But that path is too long if q is just 2, and so it is kept in Schvaneveldt’s example.

Implementation

It is not possible to implement the pseudo-code given in Quirin et al directly, as they use destructive updates of the link matrix, which we obviously cannot do in Erlang.  But the first, naïve implementation, is still quite short.  The input graph is represented as a matrix (n x n) where each value stands for the link weight, with zero being used to indicate non-existing links.  I have written a function that creates a dot file from a matrix, which is then fed into graphviz for generating images as the ones shown above.

There are basically two steps: creating a matrix of shortest paths from the input matrix, and then generating the PFNET by comparing the two matrices; if a certain cell has the same value in both matrices, then it is a shortest path and is kept, otherwise there is a shorter path and it’s pruned. Here is the main function:

find_path(Matrix) ->
    Shortest = loop1(1,length(Matrix),Matrix),
    generate_pfnet(Matrix, Shortest, []).

Next we have the three loops (hence ‘cubic time’!) of the Floyd-Warshall shortest path algorithm to create the shortest path matrix:

loop1(K,N,Matrix) when K > N -> 
    Matrix;
loop1(K,N,Matrix) ->
    NewMatrix = loop2(K,1,Matrix,
        Matrix,[]),
    loop1(K+1,N,NewMatrix).

loop2(_,_,_,[],Acc) ->
    lists:reverse(Acc);
loop2(K,I,D,[Row|Rest],Acc) ->
    NewRow = loop3(K,I,1,D, Row, []),
    loop2(K,I+1,D,Rest,[NewRow|Acc]).

loop3(_,_,_,_, [], Acc) ->
    lists:reverse(Acc);
loop3(K,I,J,D, [X|Rest], Acc) ->
    loop3(K,I,J+1,D, Rest,
    [min(X, get(I,K,D) + get(K,J,D))|Acc]).

The final line implements the Minkowski metric with r = 1; this could be expanded to include other values as well, eg r = 2 for Euclidean, or r = ∞ (which seem to be the most common values in use; the latter means using the maximum weight of any path component along the full path).

And here are two utility methods, one to find the smaller of two values, and one to retrieve an element from a matrix (which is a list of rows).  There is something of a hack to deal with the fact that zero does not mean a very small weight, but refers to a non-existing link:

min(X, Y) when X < Y -> X;
min(_X, Y) -> Y.

get(Row, Col, Matrix) ->
    case lists:nth(Col,
      lists:nth(Row, Matrix)) of
        0 -> 999999999;
        X -> X
    end.

And finally, generating the PFNET by comparing the two matrices (again, awful hack included):

generate_pfnet([],[],Result) ->
    lists:reverse(Result);
generate_pfnet([R1|Rest1], [R2|Rest2], Acc) ->
    Row = generate_pfrow(R1,R2,[]),
    generate_pfnet(Rest1, Rest2, [Row|Acc]).

generate_pfrow([],[],Result) ->
    lists:reverse(Result);
generate_pfrow([C|Rest1], [C|Rest2], Acc) ->
    case C of
        999999999 -> C1 = 0;
        _ -> C1 = C
    end,
    generate_pfrow(Rest1, Rest2, [C1|Acc]);
generate_pfrow([C1|Rest1], [C2|Rest2], Acc) ->
    generate_pfrow(Rest1, Rest2, [0|Acc]).

Discussion

So this is the basic code.  It works, but there is scope for improvement.

  • it only currently generates PFNETs with (r = 1, q = n-1)
  • there is no parallelism, hence it’s not really making use of Erlang’s strengths.
  • the three loops don’t look very elegant, and could probably be replaced by list comprehensions

Because of the matrix being updated, it doesn’t look that easy to parallelise the processing, but it would work at the level of updating the individual rows.  If that can be done in parallel, it would probably provide some speed-up (provided the matrix is not of a trivial size).  So the first step would be to change the matrix processing using lists:map/2, and replacing this then by pmap, as described in this article by Dave Thomas.

Once the code is up to scratch with full parallelism, and tested on larger matrices I will probably put it up on Google Code in case other people are interested in using it.  If you have any suggestions, tell me in the comments!

References

  • Schvaneveldt, R. W. (Ed.) (1990) Pathfinder Associative Networks: Studies in Knowledge Organization. Norwood, NJ: Ablex. The book is out of print. A copy can be downloaded from the Wikipedia page at http://en.wikipedia.org/wiki/Pathfinder_Networks
  • Quirin, A; Cordón, O; Santamaría, J; Vargas-Quesada, B; Moya-Anegón, F (2008) “A new variant of the Pathfinder algorithm to generate large visual science maps in cubic time”, Information Processing and Management, 44, p.1611-1623.