
A Unified Experience 7 October 2013

Posted by Oliver Mason in Apple.
1 comment so far

Today my friend Zettt posted the following on Twitter: “So many people don’t know the difference between their Mac and iOS device. ‘Do I have to buy app X again for iOS?’” I flippantly replied with my projection for Apple’s next step, which I thought I’d write up in a bit more detail:

Apple is all about the user experience. Things just work out of the box, and the user doesn’t have to think much. You buy an app, and via the cloud you can download updates and copies to your other iOS devices. But there is one jarring line: the distinction between iOS and Mac OS X. If I have an Apple computer and a tablet, why are they so different? There are a number of apps that co-exist on both systems, but I still have to buy them separately, and also link them separately. While my iPhone and my iPad are almost the same in user experience, my MacBook Air stands out like a sore thumb.

My projection is that the two platforms will converge sooner or later. Why confuse the everyday user with having two separate systems? Two distinct App Stores? Either there will be a way to run iOS apps on Mac OS X (a bit like the simulator, only for real), or, more likely, iOS and OS X will merge at some point.

A fully universal app will then not only run on both iPad and iPhone, but also on any OS X device. You buy it once, and it runs on all your machines within the Apple ecosystem. That almost sounds like the original Java promise of ‘write once, run anywhere’, except that here ‘everywhere’ means ‘any Apple device’.

But before it comes to that, my guess is that developers will be encouraged to offer OS X versions of their iOS apps, and presumably not to charge for them separately. Because Apple is about hardware; software is just a cheap commodity. If you want to be featured (and rake in the cash in the App Store) you’d better make sure you have an OS X version as well. You know it makes sense: users can then buy any Apple hardware and run your software. Don’t worry about switching from the iPad to a MacBook; you can take your software with you.

This is of course pure speculation. But I do think it would make sense to combine iOS and OS X in the long run. With the introduction of different screen sizes and Auto Layout, developers are already being weaned off hardcoding screen dimensions. And Apple has a ton of experience in transitioning between CPUs, as evidenced by the switch to Intel a while back. So why not switch from Intel to the A7? Or just have iOS apps running in an A7 interpreter. It would open up a new range of Apple computers to casual users, whose barrier to entry is dramatically lowered: they already have all the software they need, and no longer have to think about what OS they’re running.

Party Politics and the Emperor’s Plain Clothes 24 August 2013

Posted by Oliver Mason in Uncategorized.
add a comment


On Twitter I came across a link to the (German) CDU Wahlprogramm in “Leichter Sprache”. This is the conservative party’s election manifesto in “easy language”, the German counterpart of plain or Basic English, intended to make it more easily comprehensible to people who have difficulty understanding standard German.

At first I thought it was a parody, as it sounded truly ridiculous and patronising. As if you had to explain to your 4-year-old what the world was like. Then I found out about “Leichte Sprache” (literally: “easy/plain language”), which I had not heard of before. And in principle it is a noble thing to do, to make your manifesto more accessible. I’m sure George Orwell would approve!

But the real reason why it seems strange is rooted in the nature of political language: normally, language is used to obfuscate and manipulate, especially in complex areas such as politics. Fancy words are used to disguise true intentions and get people to vote for you. But take away the fancy words and complex phrases, and suddenly all is out there in the open, plain to see for everybody. And it turns out to be pretty meaningless. “Everybody should have a good job. And earn enough money. That needs to be written down.” But no mention of a minimum wage or anything concrete to turn those vacuous phrases into reality.

Some other people on Twitter reacted in the same way as I did, often, I guess, out of the same ignorance of the “easy language” concept. That’s a shame, and partly rooted in the patronising way the sentences come across. But it also shows that the (undesired) outcome of this publication is that people tend not to take the content seriously. Because there is not much of it.

Critical Discourse Analysis would lose a big part of its subject area if politics switched wholesale to “easy language”. No more subconscious manipulation between the lines, no more obfuscation about who does what to whom. So after some reflection I applaud the CDU for making their manifesto available in this way, as it unmasks them as the patronising right-wingers they are, with their overly simplistic world view about pretty much any subject area in current politics. No more hiding behind fancy words that foreigners are not welcome. No more vague claims about surveillance cameras stopping crime. It’s all there, in plain language. And other parties seem to have done the same.

Now the only thing that needs to go is the disclaimer at the beginning: that this plain version is not the real manifesto, and only the real one counts. I would prefer it to be the other way round.

More than eye candy 19 February 2013

Posted by Oliver Mason in Apple, iphone, objective-c.
add a comment

For an undergraduate module in Digital Humanities I am currently coding an iOS app designed by our students (see @iBrumApp for the related Twitter feed). This is a tourist-attraction app about Birmingham. The students collect data on a number of interesting locations (images, descriptions, coordinates, …) and encode it in XML, ready to be put into the app.

In the app so far there is a map screen, which shows the locations of the attractions. This can be swapped for a list view, which shows the attractions as a table in text form, possibly with a short description and a category label. A segmented control allows switching between the two views.

Under the hood there are two views on top of each other, with one of them hidden. Pressing the segmented control unhides the hidden one and hides the previously visible one; that works fine. There is a simple method in the view controller which swaps the two views:

-(void)switchViewsIn:(UIView*)toShow 
                 out:(UIView*)toHide {

    [toHide setHidden:YES];
    [toShow setHidden:NO];
}

However, it feels a bit sudden and ‘in your face’, the way the views are changing.

So, it might be better to soften the transition somewhat, for example with a cross-fade animation. Basically the visible view would become increasingly transparent, while the hidden view becomes increasingly opaque, until the old one is completely invisible. This is very easy to do with UIView animations (which sit on top of Core Animation):

-(void)switchViewsIn:(UIView*)toShow 
                 out:(UIView*)toHide {

    // make sure both views take part in rendering;
    // visibility is now controlled via alpha alone
    [toHide setHidden:NO];
    [toShow setHidden:NO];

    [toHide setAlpha:1.0];
    [toShow setAlpha:0.0];

    [UIView beginAnimations:nil context:NULL];
    [UIView setAnimationDuration:0.75];
    [UIView setAnimationCurve:
            UIViewAnimationCurveEaseInOut];
    // these property changes are animated as a cross-fade
    [toHide setAlpha:0.0];
    [toShow setAlpha:1.0];
    [UIView commitAnimations];
}

[Note: there is probably a more elegant way using the newer block-based animation API (such as +animateWithDuration:animations:), but this works just fine.]

This fading animation indeed has the desired effect, making the transition much smoother and less sudden; I feel this is an improvement in the feel of the app. It’s only a small thing, but if there is one thing you pick up as a developer in the Apple ecosystem it’s that small things matter. Attention to detail is important.

One thing I have not yet explored (as I’m only testing it with a small sample data set of two locations) is the performance: I have the suspicion that having two superimposed views with transparency might slow down the whole thing, as the iPhone tries to render the other view despite it being transparent. But in that case I can just add a line that sets the ‘hidden’ property to disable the view completely should that prove to be an issue.

Every Test is Valuable 17 July 2012

Posted by Oliver Mason in Uncategorized.
1 comment so far

After reading Graham Lee’s Test-Driven iOS Development (disclaimer: affiliate link) I have (again) adopted a test-driven approach to software development. Whenever I create a new class I write a bunch of tests exercising the class’s properties. One might question the value of this, because there is not really any reason why those should not work. However, having such a test in place just uncovered an as-yet unnoticed bug I introduced in a project.

Originally the class property in question was going to be set in the `init` method, so I tested for the presence of the property after creating an instance of the relevant class. Easy pass. Weeks later I did something (and forgot to run the test suite). Today I did something else, and this time I did run them. Hey presto, a failed test. And completely unexpected, because it was the property-exercising one. How on Earth did that happen?

Upon closer inspection I tracked it down to a refactoring, where I extracted some code from `init` as there were now multiple ways an object could be initialised. The failing code was part of that, and I realised that it was called from the new `initFromFile:` method, but *no longer from the default initialiser*. Easy mistake to make. And had I run my test suite more consistently, my application would not have been carrying a potential bug in the meantime.

What did I take home from this?

  • Even Mickey-Mouse-tests are valuable. No test is too small to be useful (provided it’s actually testing something valid).
  • The test-suite should really be run after every change. I’ll have to check if I can get Xcode to run them automatically after each compilation…
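The lesson carries over to any language and test framework. As a sketch, here is the same kind of trivial property test in Python (the class and all names are entirely hypothetical), which goes red exactly when a refactoring drops the setup call from one initialiser:

```python
import unittest

# Hypothetical re-creation of the scenario: a class whose default
# initialiser sets up a property, plus a second initialiser that
# was added later during a refactoring.

class Document:
    def __init__(self):
        self._setup_defaults()          # easy to lose in a refactoring

    @classmethod
    def from_file(cls, path):
        doc = cls.__new__(cls)
        doc._setup_defaults()           # ...and easy to forget here
        doc.path = path
        return doc

    def _setup_defaults(self):
        self.title = ""                 # the property under test

class DocumentTests(unittest.TestCase):
    # 'Mickey Mouse' tests: trivially green today, but they fail the
    # moment _setup_defaults() is dropped from one of the initialisers.
    def test_default_init_sets_title(self):
        self.assertEqual(Document().title, "")

    def test_from_file_sets_title(self):
        self.assertEqual(Document.from_file("test.xml").title, "")

# run with: python -m unittest <module>
```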

Using Neotoma to parse PEG in Erlang 25 February 2011

Posted by Oliver Mason in erlang, programming.
4 comments

For a project I need some easy and simple way to read structured data from a text file. Initially I considered JSON, and found a JSON parser for Erlang, but then decided that this was just overkill for what I needed. Ideally there would be a better match between the data structures I needed (lists, atoms, strings) and the file format.

I then decided to use Lisp-like S-expressions; at least a simplified version thereof. The data I read from the file is basically a list which can contain other lists, strings (which technically are also just lists), and atoms. A while ago I wrote a simple Erlang module to process something similar, but that had made certain assumptions that didn’t hold anymore, and I felt something more maintainable was required. And what better way to do that than by using a formal grammar to describe the file format and a tool to generate a parser from that?

A simple and straightforward grammar formalism is PEG (Parsing Expression Grammar), and there is already an Erlang parser generator available for it, Neotoma by Sean Cribbs. Installation was easy, and so was writing a grammar:

list <- open elem* close;
elem <- list / atom / sstring / dstring;
atom <- [a-z0-9_]+ space*;
dstring <- '"' [^"]* '"' space*;
sstring <- "'" [^']* "'" space*;
open <- '(' space* ;
close <- ')' space* ;
space <- ' ' / '\t' / eol;
eol <- '\r\n' / '\n' / '\r';

A list is something (or nothing) enclosed in brackets (with optional spaces). An element is a choice of things; atoms are lower-case letters and digits (at least one), and with strings I allow both double and single quotes. This grammar is saved in a file “terms.peg”:

Eshell V5.7.3 (abort with ^G)
1> neotoma:file("terms.peg").
ok
2> c(terms).
{ok,terms}

and you’re ready to go. I created four short one-line test files, with the following content:

  1. (atom)
  2. ( "string" )
  3. (foo bar)
  4. (())

This is the output:
3> terms:file("test1").
[["(",[]],[["atom",[]]],[")",["\n"]]]
4> terms:file("test2").
[["(",[" "]],[["\"","string","\"",[" "]]],[")",["\n"]]]
5> terms:file("test3").
[["(",[]],[["foo",[" "]],["bar",[]]],[")",["\n"]]]
6> terms:file("test4").
[["(",[]],[[["(",[]],[],[")",[]]]],[")",["\n"]]]

Not all that helpful, as there is a lot of noise in there, such as the spaces in “test2”, and all the line-breaks. So I need to go back to the grammar and extract just those bits from the parse tree that I actually want. In Neotoma you can do this by adding bits of Erlang code to the grammar definition, like so:
list <- open elem* close
`[Open, Elem, Close] = Node, Elem`
;
atom <- [a-z0-9_]+ space*
`[Atom, Space] = Node, list_to_atom(Atom)`
;
dstring <- '"' [^"]* '"' space*
`[Quote, Str, Quote, Space] = Node, Str`
;
sstring <- "'" [^']* "'" space*
`[Quote, Str, Quote, Space] = Node, Str`
;

(All other lines are unchanged as in the grammar listed above)

What I do here is to split the Node into its component parts, and then discard the bits I don’t want. In the ‘list’ rule I am only interested in the elements, but not in the enclosing brackets, so I just return ‘Elem’. For the ‘atom’ I ignore the spaces and convert the matched character sequence into an atom. Now the output looks like this:
7> neotoma:file("terms.peg").
ok
8> c(terms).
{ok,terms}
9> terms:file("test1").
[atom]
10> terms:file("test2").
["string"]
11> terms:file("test3").
[foo,bar]
12> terms:file("test4").
[[]]

Much better, and just what I wanted. The ‘terms.erl’ file that Neotoma generated is 7kb in size, just over 220 lines, and just under 8kb compiled.

The only issue is speed and memory consumption: on my 8GB MacBook Pro, a file of less than 40k runs out of memory and crashes after 30+ seconds. If I cut a part off the end to make it 35k, the parser succeeds, but needs 35 seconds (hand-timed). So I think I will have to revisit my hand-made parser again after all… :(
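For a format this small, a hand-rolled recursive-descent parser needs neither a grammar nor memoisation, and runs in linear time. As a rough illustration (in Python rather than Erlang, and slightly looser than the PEG above, e.g. on atom characters), it might look like this:

```python
def parse(text):
    """Parse the simplified s-expression format: nested lists,
    atoms and single- or double-quoted strings."""
    pos = 0

    def skip_space():
        nonlocal pos
        while pos < len(text) and text[pos] in " \t\r\n":
            pos += 1

    def parse_elem():
        nonlocal pos
        skip_space()
        c = text[pos]
        if c == "(":                       # nested list
            pos += 1
            items = []
            skip_space()
            while text[pos] != ")":
                items.append(parse_elem())
                skip_space()
            pos += 1                       # consume ')'
            return items
        if c in "'\"":                     # quoted string
            pos += 1
            start = pos
            while text[pos] != c:
                pos += 1
            s = text[start:pos]
            pos += 1                       # consume closing quote
            return s
        start = pos                        # atom
        # NB: the PEG above allows only [a-z0-9_]; isalnum() is looser
        while pos < len(text) and (text[pos].isalnum() or text[pos] == "_"):
            pos += 1
        return text[start:pos]

    result = parse_elem()
    skip_space()
    return result

# the four test files from above:
assert parse("(atom)") == ["atom"]
assert parse('( "string" )') == ["string"]
assert parse("(foo bar)") == ["foo", "bar"]
assert parse("(())") == [[]]
```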

UPDATE:
I had an email exchange about this with Sean, who informs me that this is a limitation of the memoisation, which creates multiple duplicates as (unoptimised) lists. So it is not a fault of Neotoma, but of the algorithm in general. There are ways around this, but the time available to implement them is, as always, a limiting factor!

The longest Chip in the World 18 December 2010

Posted by Oliver Mason in misc.
add a comment

In his Metamagical Themas: Questing for the Essence of Mind and Pattern, Douglas Hofstadter talks about numerical literacy, the ability to understand large numbers. This is especially important when state budgets are thrown around which deal with billions of pounds or euros. At some point you just lose all feeling for quantities, as they are all so unimaginably large and abstract. He then goes on to pose some rough calculation questions, such as “How many cigarettes are smoked in the US every day?” – you start with some estimates, and then work out an answer, which might be near the order of magnitude of the right answer (which we of course don’t know). Quite an interesting and useful mental exercise.

Yesterday we had Fish & Chips for dinner. The girls were comparing the sizes of their chips, and then we came on to the topic of the longest chip in the world. Obviously, with ‘proper’ chips this is limited by the size of the source potato. However, industrially produced chips are made of mashed potato formed into chip-shapes, so there is not really any fixed limit on the length. So, thinking of Hofstadter, I asked them how many potatoes we would need to make a chip that spans around the whole world.

Rough assumptions: one potato contains enough matter to produce 10cm worth of chip (the thickness is not specified). So, how many potatoes do we need for one metre of chip? This is also useful to practice basic primary-school-level maths… – 10. How many for a kilometre? 10,000. How many kilometres do we need to span the world? Roughly 40,000 km. So how many potatoes do we need? 400 million.

The next question is whether there are enough potatoes in the world to do this. Assuming a potato weighs 100g, how much do our 400 million potatoes weigh? 40 million kilogrammes, or 40,000 (metric) tons. What is the world’s potato production? According to Wikipedia, this is 315 million metric tons, so plenty enough. Now, if we were to turn the annual potato crop into one long chip, how many times would it go round the Earth? 7,875 times.
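The sums are easy to check; here they are as a few lines of Python, using the same assumptions as above (10 cm of chip per potato, 100 g per potato, roughly 40,000 km around the Earth):

```python
potatoes_per_metre = 10            # one potato yields ~10 cm of chip
earth_circumference_km = 40_000

# potatoes needed for a chip around the Earth
potatoes_needed = earth_circumference_km * 1000 * potatoes_per_metre
# 400,000,000 potatoes

# total mass in metric tons, at ~100 g per potato (1 t = 1,000,000 g)
potato_mass_g = 100
mass_tons = potatoes_needed * potato_mass_g // 1_000_000
# 40,000 t

# world potato production (Wikipedia figure quoted in the post)
world_production_tons = 315_000_000
times_round_earth = world_production_tons // mass_tons
# 7,875 times around the Earth
```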

So, with a bit of basic maths (and Wikipedia for the data) you can make maths exciting for kids, practice how to multiply and divide, teach problem-solving, and have fun at the same time. And they also get a feeling for numbers: 400 million – that’s how many potatoes you need for a chip to span the Earth.

On the trouble with “Global Warming” 6 December 2010

Posted by Oliver Mason in linguistics.
add a comment

‘Everybody is talking about the weather. Not us.’ – German Railways poster from 1966 (Source: Wikipedia)

Global warming is a real danger to life on the planet. As I write this, another extremely cold winter approaches, with snow and ice (and -9 degrees) already starting in late November. Global warming? WTF!?

The term “global warming” is obviously problematic, for several reasons, two of which I will discuss here: firstly, the climate is a complex system, and secondly, the climate is not the weather. Both reasons have links to linguistics, which is my justification to talk about them on this blog.

Climate is a complex system

Chaos theory was discovered by meteorologists running weather models on computers: small changes in the starting conditions result in vastly different outcomes. That is why nobody can reliably predict what is going to happen, as there is no clearly visible link from A to B, from the conditions today to the conditions at whatever time in the future. Reducing this to a simple statement such as “temperatures will increase globally” is dangerous, as the climate is not that simple, and you then get people who demolish your argument on the grounds of inaccuracy.

Language can be seen as a complex system as well; it is influenced by so many factors that it is not possible to make any predictions about how language will change. Any statements such as those on the bad influence of text messaging on the English language are clearly not appropriate; broadly general statements of this kind miss the point about the varieties and different language communities that make up the “English” language.

The climate is not the weather

This point is somewhat related: ‘weather’ is what we’ve got now, while ‘climate’ is a broader, more general tendency. So while we might indeed have a cold winter, if we have a correspondingly hotter summer the average annual temperature might still rise, even if it doesn’t feel like that as you shiver your way to work in the morning. And this year is a single event, which in context might turn out to be an outlier if next winter is really warm. Weather is somewhat unpredictable and chaotic; otherwise the Met Office would be out of work.

Global climate change would also mean that it could become colder in Western Europe while other regions of the Earth heat up, and the people of Tuvalu will have a different view about melting ice caps than American farmers in Arizona.

Michael Halliday compares langue and parole (or competence and performance) with climate and weather: while we can observe one (weather/parole/performance), the other can only be perceived indirectly (climate/langue/competence) through studying the former. But essentially they are different views on the same phenomenon, one short-term and one long-term.

A solution?

Coming to the point of this post, I would suggest abandoning the term “Global Warming” in favour of “Climate Change”. Change can go in different directions, and so it is harder for climate-change-deniers to win easy points whenever the weather is colder, and it also emphasises the climate as opposed to the weather. This might seem like a simplistic point similar to the political correctness debate, but lexical choices when representing reality in language are really important.

And thus we have moved from complex systems via Halliday’s view of langue and parole to Critical Discourse Analysis.

Sentiment Analysis & the English Language 12 November 2010

Posted by Oliver Mason in Sentiment Analysis.
1 comment so far

Currently I am working on Sentiment Analysis, so I will probably post a series of smaller posts on issues that I come across. Today I was looking at YouGov’s website and public opinion of Nick Clegg, the social anthropologist currently serving as deputy prime minister. Here is a snapshot of the opinions at the time, and you can see that most of them are classed as negative:

Most? I would say all… but apparently whatever system YouGov use for sentiment analysis cannot cope with idioms. And shooting yourself in the foot is not exactly a tricky one to identify I should think.

But this raises a more complex issue: there are many ways to express opinions, attitudes, judgements, etc. in language. This is a much larger problem than counting the number of ‘positive’ and ‘negative’ words in a text. To begin with, words in isolation rarely have a clear meaning; opinions are usually subjective; and then there’s irony and sarcasm.

Yes, Clegg did really well when he supported the Tories on tuition fees…

Continuing on this theme, here’s another issue: the assumption that the scope of a sentiment is the whole text. Here’s an opinion (positive) from the same site about student protests:

I completely support the right to protest; however, violence is unreasonable.

This is somewhat positive, supporting the students, but the second clause carries an additional judgement condemning the violent incidents that happened at the demonstration. This seems to suggest that the proper carrier of attitude should be the clause, rather than the sentence, let alone the text. Not everything is just black and white.
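As a toy illustration of clause-level scope (emphatically not how YouGov or any real system works; the tiny lexicon and the splitting on connectives are made up for this example), one can score each clause separately instead of the whole text:

```python
import re

# deliberately naive hand-made lexicon, for illustration only
POSITIVE = {"support", "good", "great"}
NEGATIVE = {"unreasonable", "violence", "bad"}

def clause_sentiments(text):
    # split on crude clause boundaries: ';' and contrastive connectives
    clauses = re.split(r";|\bhowever\b|\bbut\b", text.lower())
    scores = []
    for clause in clauses:
        words = re.findall(r"[a-z']+", clause)
        if not words:
            continue
        score = (sum(w in POSITIVE for w in words)
                 - sum(w in NEGATIVE for w in words))
        scores.append(score)
    return scores

opinion = ("I completely support the right to protest; "
           "however, violence is unreasonable.")
print(clause_sentiments(opinion))   # [1, -2]: one positive clause, one negative
```

A whole-text score would collapse this to a single number and lose exactly the contrast the sentence expresses.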

Update/Correction to “Elegant IR with Erlang” 14 October 2010

Posted by Oliver Mason in algorithm, erlang, programming.
add a comment

When I tried to actually use my implementation of tf-idf that I described in the previous post, I realised that it’s not quite what I wanted: as it is, I get a different tf-idf value for each token and each document. So with a collection of 1000 documents I get 1000 dictionaries containing the tokens in each text. However, what I really want is ONE dictionary with all the tokens in, and ONE tf-idf value for each token.

Merging the values is tricky, as it involves relative frequencies, so I needed to make some subtle changes. First, the term_freq/1 function now deals with absolute frequencies, and returns a tuple containing the frequency values and the document size in tokens, so that the relative frequencies can easily be computed if required:

term_freq(Text) ->
    term_freq(Text, 0, dict:new()).

term_freq([], Sum, Dict) ->
    {Dict, Sum};

term_freq([Token|Rest], Sum, Dict) ->
    term_freq(Rest, Sum+1,
           dict:update_counter(Token,1,Dict)).

No change really, only the terminating clause of term_freq/3 has dropped its dict:map to compute the relative values, and instead returns the tuple with the frequency dictionary and the document size.

This also requires a minor change in the inv_doc_freq/3 function, where we need to deal with the tuple and extract the dictionary from it in the second and final clause:

old

inv_doc_freq([Doc|Rest], DocNum, Dict) ->

new

inv_doc_freq([{Doc, _Sum}|Rest], DocNum, Dict) ->

The biggest change, however, is in the combined tf_idf/1 function, as the algorithm has somewhat changed. Originally the function was a full screen in the editor, but I have extracted two functions to make them easier to follow; the gain in clarity will surely outweigh the minute performance penalty…

tf_idf(Docs) ->
    Idfs = inv_doc_freq(Docs),
    DocLen = total_doc_size(Docs),
    DocTotalFreqs = total_token_freqs(Docs),
    dict:map(
        fun(Key, Value) ->
            dict:fetch(Key, Idfs) * Value / DocLen
            end,
        DocTotalFreqs).

I need to calculate the overall size (in tokens) of the full document collection, and then add up the token frequencies over all documents. These have been factored out into separate functions. Then all that is left is a map over all tokens to calculate the tf-idf value from the relative frequency in the document collection multiplied by the idf value as calculated earlier.

Computing the total document size is trivial: we loop over the list of term frequency dictionaries and this time extract the lengths, ignoring the actual dictionaries:

total_doc_size(Docs) ->
    lists:foldl(
        fun({_Doc, DocSum}, Total) -> Total + DocSum end,
        0,
        Docs).

And finally, that leaves computing the total frequencies of all tokens.

total_token_freqs(Docs) ->
    lists:foldl(
        fun({Doc, _Sum}, Current) ->
            dict:fold(
                fun(Key, Value, AccIn) ->
                    dict:update_counter(Key,Value,AccIn)
                    end,
                Current,
                Doc)
            end,
        dict:new(),
        Docs).

Here we process the document list (as there are likely to be fewer documents than tokens) and fold each dictionary, adding the tokens with their respective frequencies to our accumulator dictionary.
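For comparison, here is a compact sketch of the same collection-level computation in Python, with Counter standing in for Erlang’s dict (a restatement of the idea, not a line-by-line translation of the Erlang code):

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: list of token lists; returns ONE dict with a single
    tf-idf value per token over the whole collection."""
    doc_freqs = [Counter(doc) for doc in docs]      # per-document counts
    total_len = sum(len(doc) for doc in docs)       # total_doc_size
    totals = Counter()                              # total_token_freqs
    for freqs in doc_freqs:
        totals.update(freqs)
    n_docs = len(docs)
    # idf: log10(number of documents / number of docs containing the token)
    idf = {tok: math.log10(n_docs / sum(tok in f for f in doc_freqs))
           for tok in totals}
    # collection-wide relative frequency times idf
    return {tok: idf[tok] * count / total_len
            for tok, count in totals.items()}

docs = [["the", "cat"], ["the", "dog"]]
weights = tf_idf(docs)
print(weights["the"])   # 0.0 – 'the' occurs in every document
```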

Apologies for this correction; sometimes you only really realise that a particular interpretation of an algorithm is not the right one when you actually need to use it. The curse of developing libraries without a proper specification of the requirements…

Elegant IR with Erlang 11 October 2010

Posted by Oliver Mason in erlang, programming.
1 comment so far

I am currently working on a project that requires processing documents. As part of that I wanted to use term weighting as used in information retrieval (IR); the individual texts I’m working with are of course of different lengths and contain different sets of words, and I didn’t want that to mess things up as it did when I initially worked with raw token frequencies only.

What I actually wanted is tf-idf, the product of term frequency (tf) and inverted document frequency (idf); essentially you see how often a word/term/token occurs in a text, and multiply that with a measure of how ‘bursty’ it is. The idea being that common words (the, of, and, etc.) occur in pretty much every document and are thus useless for categorisation of the content. In a way it is a more sophisticated approach to using a stop word list: sophisticated because you don’t have to create such a list, and it is also not a binary include/exclude, but assigns each token a continuous weight depending on its distribution.

Term Frequency

This is simply the relative frequency of occurrence, the number of times a token occurs in the text divided by the text length. As input I assume that the text has already been tokenised and is represented as a list of tokens. The output should be a dictionary (ie a set of key/value tuples) with each token as a key and its tf as the value:

term_freq(Text) ->
    term_freq(Text, 0, dict:new()).

term_freq([], Sum, Dict) ->
    dict:map(
        fun(_Key, Value) -> Value / Sum end,
        Dict);

term_freq([Token|Rest], Sum, Dict) ->
    term_freq(Rest, Sum+1, 
        dict:update_counter(Token,1,Dict)).

In case another token is available, I simply update its frequency by one, add one to the text size, and re-run the function on the rest of the text. If no more tokens are left, then I map the dictionary (which at this point contains absolute frequencies) to another dictionary by way of dividing each value by the text size; this new dictionary is then returned.

Inverted Document Frequency

For the idf I count how many documents each token occurs in, and divide the total number of documents by that number; so the rarer the token, the larger the resulting value. The token ‘the’ should just give a result of 1.0; however, to make it a bit more complicated we then take the logarithm (base 10) of it, so that the final value will be greater than or equal to zero.

This time the input is a list of dictionaries, one for each document. The dictionary representing each document is the output of our term_freq/1 function, ie the keys are the tokens, and the values the term frequencies. We don’t really care about the frequencies here, as they all will be greater than zero – a word that does not occur in a text will not be a key in the respective dictionary. As output we will have a single dictionary of all tokens that occur in our document collection, with the values being the idf of each token.

inv_doc_freq(Docs) ->
    inv_doc_freq(Docs, 0, dict:new()).

inv_doc_freq([], DocNum, Dict) ->
    dict:map(
        fun(_Key, Value) -> math:log10(DocNum/Value) end,
        Dict);

inv_doc_freq([Doc|Rest], DocNum, Dict) ->
    inv_doc_freq(Rest, DocNum+1,
        dict:fold(
            fun(Key, _Value, AccIn) -> 
               dict:update_counter(Key,1,AccIn) end,
            Dict,
            Doc)
    ).

Again we iterate over all elements of our input list (ie the documents), and this time we iterate over all tokens of the document using a dict:fold/3 function, by adding 1 to the count for each token of the current document that we have already encountered, or entering it with a frequency of 1 if we haven’t yet. We also increment the document count by 1. This time the dict:map/2 function performs the calculation for the idf value as soon as we have reached the end of our document list.

tf-idf

At this stage we have a dictionary for each document containing the term frequencies, and a dictionary for the whole document collection containing the inverted document frequencies for all the tokens. Combining the two we then get the value for the tf-idf, which is different for each document (so the output is a list of dictionaries, one per document).

To make things easier, the call to compute the idf is integrated into the tf_idf/1 function, so the input is the same as for the inv_doc_freq/1 function, a list of term frequency dictionaries:

tf_idf(Docs) ->
    Idfs = inv_doc_freq(Docs),
    lists:map(
        fun(TFs) -> dict:map(
            fun(Key,Value) -> Value *
                dict:fetch(Key, Idfs) end,
            TFs) end,
        Docs).

Here we map the list of term frequency dictionaries (Docs) to a list of dictionaries containing the tf-idf values. For this mapping we map each (document) term frequency dictionary to the respective (document) tf-idf dictionary by multiplying each token’s term frequency by its idf value as computed by inv_doc_freq/1.
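The same per-document pipeline fits in a few lines of Python as well; this sketch mirrors the three functions above (one tf-idf dictionary per document), using Counter in place of Erlang’s dict:

```python
import math
from collections import Counter

def term_freq(tokens):
    """Relative frequency of each token in one document."""
    counts = Counter(tokens)
    n = len(tokens)
    return {tok: c / n for tok, c in counts.items()}

def inv_doc_freq(docs):
    """log10(total docs / docs containing the token); docs is a list
    of term-frequency dicts, as returned by term_freq()."""
    df = Counter()
    for doc in docs:
        df.update(doc.keys())       # count each token once per document
    n = len(docs)
    return {tok: math.log10(n / c) for tok, c in df.items()}

def tf_idf(docs):
    """One tf-idf dict per document (docs: list of tf dicts)."""
    idfs = inv_doc_freq(docs)
    return [{tok: tf * idfs[tok] for tok, tf in doc.items()}
            for doc in docs]

tfs = [term_freq(["the", "cat"]), term_freq(["the", "dog"])]
weights = tf_idf(tfs)
print(weights[0]["the"])   # 0.0: 'the' occurs in every document
```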

Summary

Calculating a set of values from texts is very concise in Erlang. In languages like C or Java one would have to code various (nested) loops, but here this can easily be accomplished using the map and fold functions that operate on lists and dictionaries in Erlang. It does need a bit of mental acrobatics, but if you are familiar with Prolog, the basic structure of an Erlang program is not too difficult to follow. It’s those nested mappings that can sometimes be a little confusing.

The beauty of Erlang, of course, is that each map can be done in parallel; if you have a large list of documents and a processor with several cores then it is not hard to make use of its full power by simply using a parallel map function. To do this in other languages where nested loops are used in place of the map function is not trivial.

So Erlang is not only very concise, but it can also be future-proof by allowing easy concurrency.
