Elegant IR with Erlang
11 October 2010

Posted by Oliver Mason in erlang, programming.

I am currently working on a project that requires processing documents. As part of that I wanted to use term weighting as used in information retrieval (IR); the individual texts I’m working with are of course of different lengths and contain different sets of words, and I didn’t want those differences to skew the results, as they did when I initially worked with raw token frequencies only.

What I actually wanted is tf-idf, the product of term frequency (tf) and inverse document frequency (idf); essentially you count how often a word/term/token occurs in a text, and multiply that by a measure of how ‘bursty’ it is, ie how unevenly it is distributed across documents. The idea is that common words (the, of, and, etc) occur in pretty much every document and are thus useless for categorisation of the content. In a way it is a more sophisticated alternative to a stop word list: sophisticated because you don’t have to create such a list, and because it is not a binary include/exclude decision, but assigns each token a continuous weight depending on its distribution.

Term Frequency

This is simply the relative frequency of occurrence, the number of times a token occurs in the text divided by the text length. As input I assume that the text has already been tokenised and is represented as a list of tokens. The output should be a dictionary (ie a set of key/value tuples) with each token as a key and its tf as the value:

%% Compute relative term frequencies for a tokenised text:
%% returns a dict mapping each token to its count divided by the text length.
term_freq(Text) ->
    term_freq(Text, 0, dict:new()).

%% No tokens left: divide every absolute count by the text length.
term_freq([], Sum, Dict) ->
    dict:map(
        fun(_Key, Value) -> Value / Sum end,
        Dict);

%% Next token: increment its count and the running text length.
term_freq([Token|Rest], Sum, Dict) ->
    term_freq(Rest, Sum+1,
        dict:update_counter(Token, 1, Dict)).

If another token is available, I simply increment its count by one, add one to the text size, and re-run the function on the rest of the text. Once no more tokens are left, I map the dictionary (which at this point contains absolute frequencies) to another dictionary by dividing each value by the text size; this new dictionary is then returned.
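
To get a feel for the output, here is a hypothetical shell session; the module name tfidf is purely an assumption for this sketch (the post does not wrap the functions in a named module), and the printed values are rounded for readability:

1> TF = tfidf:term_freq(["the","cat","sat","on","the","mat"]).
2> dict:fetch("the", TF).
0.333
3> dict:fetch("cat", TF).
0.167

The token "the" occurs twice in six tokens (2/6), "cat" once (1/6), which is exactly the relative frequency the function is meant to return.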

Inverse Document Frequency

For the idf I count how many documents each token occurs in, and divide the total number of documents by that number; so the rarer the token, the larger the resulting value. The token ‘the’ should just give a result of 1.0, as it occurs in every document. We then take the logarithm (base-10) of this ratio, so that the final value is greater than or equal to zero; a token like ‘the’ that occurs in every document thus ends up with an idf of exactly zero.
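
As a quick worked example (numbers chosen purely for illustration): in a collection of 1,000 documents, a token that occurs in only 10 of them gets an idf of log10(1000/10) = 2.0, while a token that occurs in all 1,000 gets log10(1000/1000) = 0.0.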

This time the input is a list of dictionaries, one for each document. The dictionary representing each document is the output of our term_freq/1 function, ie the keys are the tokens, and the values the term frequencies. We don’t really care about the frequencies here, as they all will be greater than zero – a word that does not occur in a text will not be a key in the respective dictionary. As output we will have a single dictionary of all tokens that occur in our document collection, with the values being the idf of each token.

%% Compute the inverse document frequency of every token in a list of
%% term frequency dictionaries (one dictionary per document).
inv_doc_freq(Docs) ->
    inv_doc_freq(Docs, 0, dict:new()).

%% No documents left: turn the per-token document counts into idf values.
inv_doc_freq([], DocNum, Dict) ->
    dict:map(
        fun(_Key, Value) -> math:log10(DocNum/Value) end,
        Dict);

%% Next document: add 1 to the document count of every token it contains.
inv_doc_freq([Doc|Rest], DocNum, Dict) ->
    inv_doc_freq(Rest, DocNum+1,
        dict:fold(
            fun(Key, _Value, AccIn) ->
                dict:update_counter(Key, 1, AccIn) end,
            Dict,
            Doc)
    ).

Again we iterate over all elements of our input list (ie the documents); this time we also iterate over all tokens of each document using dict:fold/3, adding 1 to the count of each token of the current document that we have already encountered, or entering it with a count of 1 if we haven’t yet. We also increment the document count by 1. Once we have reached the end of the document list, dict:map/2 performs the calculation of the idf value for each token.
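
Continuing the hypothetical shell session from above (module name tfidf assumed, output values rounded), we can feed a tiny three-document collection through term_freq/1 and then inv_doc_freq/1:

1> D1 = tfidf:term_freq(["the","cat","sat"]).
2> D2 = tfidf:term_freq(["the","dog","barked"]).
3> D3 = tfidf:term_freq(["the","cat","and","the","dog"]).
4> Idfs = tfidf:inv_doc_freq([D1, D2, D3]).
5> dict:fetch("the", Idfs).
0.0
6> dict:fetch("sat", Idfs).
0.477

"the" occurs in all three documents, so its idf is log10(3/3) = 0.0; "sat" occurs in only one, giving log10(3/1) ≈ 0.477.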

tf-idf

At this stage we have a dictionary for each document containing the term frequencies, and a dictionary for the whole document collection containing the inverse document frequencies for all the tokens. Combining the two we then get the value for the tf-idf, which is different for each document (so the output is a list of dictionaries, one per document).

To make things easier, the call to compute the idf is integrated into the tf_idf/1 function, so the input is the same as for the inv_doc_freq/1 function, a list of term frequency dictionaries:

%% Combine the term frequencies of each document with the collection-wide
%% inverse document frequencies into one tf-idf dictionary per document.
tf_idf(Docs) ->
    Idfs = inv_doc_freq(Docs),
    lists:map(
        fun(TFs) -> dict:map(
            fun(Key, Value) -> Value *
                dict:fetch(Key, Idfs) end,
            TFs) end,
        Docs).

Here we map the list of term frequency dictionaries (Docs) to a list of dictionaries containing the tf-idf values: each document’s term frequency dictionary is mapped to the corresponding tf-idf dictionary by multiplying each token’s term frequency by its idf value as computed by inv_doc_freq/1.
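
Using the same small collection as before (again with the assumed module name tfidf and rounded output), the whole pipeline looks like this:

1> Docs = [tfidf:term_freq(T) || T <- [["the","cat","sat"], ["the","dog","barked"], ["the","cat","and","the","dog"]]].
2> [First|_] = tfidf:tf_idf(Docs).
3> dict:fetch("the", First).
0.0
4> dict:fetch("sat", First).
0.159

Note how "the" gets a weight of 0.0 in every document, regardless of how often it occurs there: this is the stop-word effect described at the beginning, achieved without ever writing a stop word list.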

Summary

Calculating a set of values from texts is very concise in Erlang. In languages like C or Java one would have to write various (nested) loops, whereas here it can easily be accomplished with the map and fold functions that operate on lists and dictionaries. It does need a bit of mental acrobatics, but if you are familiar with Prolog, the basic structure of an Erlang program is not too difficult to follow. It’s the nested mappings that can sometimes be a little confusing.

The beauty of Erlang, of course, is that each map can be done in parallel: if you have a large list of documents and a processor with several cores, it is not hard to make use of its full power by simply using a parallel map function (a minimal sketch follows below). Doing the same in languages where nested loops take the place of map is far less trivial.
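
Since the standard library does not ship a parallel map, here is a minimal, illustrative sketch of one (not the code from the post); it spawns one process per list element and collects the results in the original order, with no attempt at error handling or limiting the number of processes:

pmap(Fun, List) ->
    Parent = self(),
    %% spawn one worker per element, tagged with a unique reference
    Refs = [begin
                Ref = make_ref(),
                spawn(fun() -> Parent ! {Ref, Fun(Elem)} end),
                Ref
            end || Elem <- List],
    %% collect the results in the original order
    [receive {Ref, Result} -> Result end || Ref <- Refs].

Swapping lists:map for a pmap like this in tf_idf/1 would compute the tf-idf dictionaries of all documents concurrently.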

So Erlang is not only very concise, but it can also be future-proof by allowing easy concurrency.

Comments»

1. Update/Correction to “Elegant IR with Erlang” « Language and Computation - 14 October 2010

[…] When I tried to actually use my implementation of tf-idf that I described in the previous post, I realised that it’s not quite what I wanted: as it is, I get a different tf-idf value for […]

