jump to navigation

Update/Correction to “Elegant IR with Erlang” 14 October 2010

Posted by Oliver Mason in algorithm, erlang, programming.
trackback

When I tried to actually use my implementation of tf-idf that I described in the previous post, I realised that it’s not quite what I wanted: as it is, I get a different tf-idf value for each token and each document. So with a collection of 1000 documents I get 1000 dictionaries containing the tokens in each text. However, what I really want is ONE dictionary with all the tokens in, and ONE tf-idf value for each token.

Merging the values is tricky, as it involves relative frequencies, so I needed to make some subtle changes. First, the term_freq/1 method now deals with absolute frequencies, and returns a tuple containing the frequency values and the document size in tokens, so that the relative frequencies can easily be computed if required:

term_freq(Text) ->
    term_freq(Text, 0, dict:new()).

term_freq([], Sum, Dict) ->
    {Dict, Sum};

term_freq([Token|Rest], Sum, Dict) ->
    term_freq(Rest, Sum+1,
           dict:update_counter(Token,1,Dict)).

No change really, only the terminating clause of term_freq/3 has dropped its dict:map to compute the relative values, and instead returns the tuple with the frequency dictionary and the document size.

This also requires a minor change in the inv_doc_freq/3 function, where we need to deal with the tuple and extract the dictionary from it in the second and final clause:

old

inv_doc_freq([Doc|Rest], DocNum, Dict) ->

new

inv_doc_freq([{Doc, _Sum}|Rest], DocNum, Dict) ->

The biggest change, however, is in the combined tf_idf/1 function, as the algorithm has somewhat changed. Originally the function was a full screen in the editor, but I have extracted two functions to make them easier to follow; the gain in clarity will surely outweigh the minute performance penalty…

tf_idf(Docs) ->
    Idfs = inv_doc_freq(Docs),
    DocLen = total_doc_size(Docs),
    DocTotalFreqs = total_token_freqs(Docs),
    dict:map(
        fun(Key, Value) ->
            dict:fetch(Key, Idfs) * Value / DocLen
            end,
        DocTotalFreqs).

I need to calculate the overall size (in tokens) of the full document collection, and then add up the token frequency over all documents. These have been factored out into separate functions. Then all is left is a map over all tokens to calculate the tf-idf value from the relative frequency in the document collection multiplied by the idf value as calculated earlier.

Computing the total document size is trivial: we loop over the list of term frequency dictionaries and this time extract the lengths, ignoring the actual dictionaries:

total_doc_size(Docs) ->
    lists:foldl(
        fun({_Doc, DocSum}, Total) -> Total + DocSum end,
        0,
        Docs).

And finally, that leaves computing the total frequencies of all tokens.

total_token_freqs(Docs) ->
    lists:foldl(
        fun({Doc, _Sum}, Current) ->
            dict:fold(
                fun(Key, Value, AccIn) ->
                    dict:update_counter(Key,Value,AccIn)
                    end,
                Current,
                Doc)
            end,
        dict:new(),
        Docs).

Here we process the document list (as there are likely to be fewer documents than tokens) and fold each dictionary, adding the tokens with their respective frequencies to our accumulator dictionary.

Apologies for this correction; but sometimes you only really realise that a particular interpretation of an algorithm is not the right one when you actually need to use it. The curse of developing libraries without proper specification of the requirements…

Comments»

No comments yet — be the first.

Leave a comment