jump to navigation

Elegant IR with Erlang 11 October 2010

Posted by Oliver Mason in erlang, programming.
1 comment so far

I am currently working on a project that requires processing documents. As part of that I wanted to use term weighting as used in information retrieval (IR); the individual texts I’m working with are of course of different lengths and contain different sets of words, and I didn’t want that to mess things up as it did when I initially worked with raw token frequencies only.

What I actually wanted is tf-idf, the product of term frequency (tf) and inverted document frequency (idf); essentially you see how often a word/term/token occurs in a text, and multiply that with a measure of how ‘bursty’ it is. The idea being that common words (the, of, and etc) occur in pretty much every document and are thus useless for categorisation of the content. In a way it is a more sophisticated approach to using a stop word list. Sophisticated because you don’t have to create such a list, and it is also not binary include/exclude, but assigns each token a continuous weight depending on its distribution.

Term Frequency

This is simply the relative frequency of occurrence, the number of times a token occurs in the text divided by the text length. As input I assume that the text has already been tokenised and is represented as a list of tokens. The output should be a dictionary (ie a set of key/value tuples) with each token as a key and its tf as the value:

term_freq(Text) ->
    term_freq(Text, 0, dict:new()).

term_freq([], Sum, Dict) ->
    dict:map(
        fun(_Key, Value) -> Value / Sum end,
        Dict);

term_freq([Token|Rest], Sum, Dict) ->
    term_freq(Rest, Sum+1, 
        dict:update_counter(Token,1,Dict)).

In case another token is available, I simply update its frequency by one, add one to the text size, and re-run the function on the rest of the text. If no more tokens are left, then I map the dictionary (which at this point contains absolute frequencies) to another dictionary by way of dividing each value by the text size; this new dictionary is then returned.

Inverted Document Frequency

For the idf I count how many documents each token occurs in, and divide the total number of documents by that number; so the rarer the token, the larger the resulting value. The token the should just give a result of 1.0; however, to make it a bit more complicated we then take the logarithm (base-10) of it, so that the final value will be greater than or equal to zero.

This time the input is a list of dictionaries, one for each document. The dictionary representing each document is the output of our term_freq/1 function, ie the keys are the tokens, and the values the term frequencies. We don’t really care about the frequencies here, as they all will be greater than zero – a word that does not occur in a text will not be a key in the respective dictionary. As output we will have a single dictionary of all tokens that occur in our document collection, with the values being the idf of each token.

inv_doc_freq(Docs) ->
    inv_doc_freq(Docs, 0, dict:new()).

inv_doc_freq([], DocNum, Dict) ->
    dict:map(
        fun(_Key, Value) -> math:log10(DocNum/Value) end,
        Dict);

inv_doc_freq([Doc|Rest], DocNum, Dict) ->
    inv_doc_freq(Rest, DocNum+1,
        dict:fold(
            fun(Key, _Value, AccIn) -> 
               dict:update_counter(Key,1,AccIn) end,
            Dict,
            Doc)
    ).

Again we iterate over all elements of our input list (ie the documents), and this time we iterate over all tokens of the document using a dict:fold/3 function, by adding 1 to the count for each token of the current document that we have already encountered, or entering it with a frequency of 1 if we haven’t yet. We also increment the document count by 1. This time the dict:map/2 function performs the calculation for the idf value as soon as we have reached the end of our document list.

tf-idf

At this stage we have a dictionary for each document containing the term frequencies, and a dictionary for the whole document collection containing the inverted document frequencies for all the tokens. Combining the two we then get the value for the tf-idf, which is different for each document (so the output is a list of dictionaries, one per document).

To make things easier, the call to compute the idf is integrated into the tf_idf/1 function, so the input is the same as for the inv_doc_freq/1 function, a list of term frequency dictionaries:

tf_idf(Docs) ->
    Idfs = inv_doc_freq(Docs),
    lists:map(
        fun(TFs) -> dict:map(
            fun(Key,Value) -> Value *
                dict:fetch(Key, Idfs) end,
            TFs) end,
        Docs).

Here we map the list of term frequency dictionaries (Docs) to a list of dictionaries containing the tf-idf values. For this mapping we map each (document) term frequency dictionary to the respective (document) tf-idf dictionary by multiplying each token’s term frequency by its idf value as computed by inv_doc_freq/1.

Summary

Calculating a set of values from texts is very concise with Erlang. In languages like C or Java one would have to code various (nested) loops, but this can easily be accomplished by using the map and fold functions that operate on lists and dictionaries in Erlang. It does need a bit of mental acrobatics, but if you are familiar with Prolog, then the basic structure of an Erlang program is not too difficult to follow. It’s those nested mappings that sometimes can be a little confusing.

The beauty of Erlang, of course, is that each map can be done in parallel; if you have a large list of documents and a processor with several cores then it is not hard to make use of its full power by simply using a parallel map function. To do this in other languages where nested loops are used in place of the map function is not trivial.

So Erlang is not only very concise, but it can also be future-proof by allowing easy concurrency.

On Planning and Reality 3 June 2010

Posted by Oliver Mason in Apple, iphone, objective-c, programming.
add a comment

When I got my iPhone a little more than a year ago, and started developing programs for it, I had a clear idea what my first program was going to be. However, as always, things turn out quite different from how you think they are going to be…

First, it did take me a bit to get used to Objective-C. Not because it is very different from Java (I used to program in C after all before Java came along), but because all the classes in the Cocoa framework need to be learned. There are subtle differences between those and their Java cousins, and after a bit more experience I believe that the Cocoa classes are actually more powerful and easier to use than their Java counterparts.

Some teething troubles, lack of automatic memory management on the iPhone, and a surfeit of squa brackets meant further delays. Finally I had a program written, but it needed more work on the graphics side, artwork and so on. The stuff that really makes a difference, but is very time-consuming and hard if you’re not used to using graphics software. So the easier way out was to write a different program, which is lighter on the artwork.

This then was a todo-list program, which is also suitable for planning small projects. I wanted a program like that, but didn’t want to fork out the money for Things, which also looked a bit like overkill. On the life hack blog I read an article by Dustin Wax on his moleskine setup, and that seemed like something usable, which I then went about implementing as an iPhone app. With a bit of help from a friend with the icon design, and thanks to freely available sound files and icons, ePlanner was born.

In ePlanner I tried out Core Data, which is really a lot easier than messing about with SQLite directly. It uses both tabs and navigation views, and a lot of tables. I found it rather tedious in that all the classes were almost identical, but only almost, not 100%, and it’s hard to see how that could be changed. The behaviour of those classes is ever so slightly different.

The submission procedure was very easy, thanks to a description I found on the web. My app did get rejected, due to a crash on a 3GS; but I don’t have a 3GS, so I could only test it on a 3G and an iPod touch. Thanks to Instruments I could track down the error, which was of course a memory management issue, but one without consequences on the machines I could test it on. After that was changed, the app went through, and has indeed been bought by people all over the world.

It is really a nice feeling to think that someone in Argentina is using my app, as is someone in Hong Kong, some people in the US, Sweden, etc. I used some free Google advertising at the beginning, but that is really expensive, though when I stopped it, sales began to trail off. But that could also have been an effect of it slipping out of the ‘newly released’ slots.

It is indeed not too hard coming up with a program that does sell. The overall process is not too hard, though there were some frustrating moments battling with the various code signing and certificate issues that Apple requires.

I since have bought an iPad, and am thinking of porting ePlanner to this; however, I’ll give it a while so that I get used to how the iPad works. Knowing your way round the platform makes it a lot easier to develop good software, and I am not yet sure how the UI design for the small iPhone screen can best be translated to the iPad’s larger display. But it will come, and I will describe the process on this blog…!

In the meantime, I will re-visit some of my previous program ideas, as it is really not hard to turn them into something that will end up in the App Store, and it is really satisfying to do so.

Single user vs Multi user: how will the iPad work? 17 March 2010

Posted by Oliver Mason in Apple, iPad, iphone.
1 comment so far

Note: this post is somewhat speculative, as I obviously do not have an iPad (yet!). It is simply an observation that got me thinking about how it is going to be used, and how that possible usage will influence the user experience.

I suspect that in our house the iPad will be a multi-user gadget.

I have an iPhone, my wife has got an iPod touch, and the kids currently use a clapped-out ancient Sony Vaio laptop whose disk is about to fail. Everybody has their own device to do things on. However, this is likely going to change when we acquire an iPad. I envisage this as being a general device that just lies on the sitting room table, to be picked up by whoever wants to use it for reading their email, checking something on Wikipedia, playing a quick game, adding an entry to a calendar, looking up a phone number, and so on.

When I check mail on my iPhone, it is set up to look at my email accounts. Similarly, the calendar is sync’d with my general calendar, and the address book is too. I don’t (and cannot) easily switch between identities (though it is possible to do so with mail and calendar). Some games that I play on my iPhone store their state in case of interruption, and I can resume them later. The same applies to other applications, which usually have one set of data they work with.

This is OK for a phone. Typically you have a phone, and you’re the only one using it, otherwise it’d lose some of its usefulness if you don’t know who you will reach when calling a particular number. But if the iPad is shared between people, how can I avoid reading my wife’s email, or swamping her address book with all my student email addresses (which Google puts into it automatically)? If I play a game, and then have to stop and go back to it later, what if one of my daughters wants to have a go in-between? She might not only end the game I’m playing, but also mess up my high-score records. My todo-list application only has one set of todos, so what if I look at it and suddenly find “Feed my puffle on Club Penguin” on top of my priorities?

On a Unix system you have different user accounts, and you log in and out; this avoids the problem on Mac OS X. But logging in and out is tricky if a system is shared. If I forget to log out when I interrupt my task, and the device is locked, nobody else can use it until I have come back. And how can this be made easy, without constantly having to remember a user name and password?

I somehow suspect that this will be an issue for which there is no satisfactory solution. But if the iPad is to become a general household item like a television or a radio, then there needs to be some non-intrusive way that allows easy sharing. I’m looking forward to finding out what Apple came up with here…!

Go – Went – Gone 30 December 2009

Posted by Oliver Mason in erlang, programming.
add a comment

I did play around with the unhelpfully named ‘go’ programming language, another output of the don’t-be-evil company. Trying to find any web resources for it is pretty much impossible, for one thing because it was too new, and then because of the name. I would have expected something more search-friendly from the number 1 web search engine!

There were a few things I liked about go. It’s smallish, C-like, has garbage collection, built-in support for concurrency, and unicode strings. Hash-tables (‘maps’) as a first-class data type. A nicely-looking set of libraries for all sorts of purposes. Not quite fast, but with lots of scope for performance improvements. No header files. First class support for unit tests.

This was looking attractive as opposed to Erlang, which is older and more mature/stable, but still not very high-performance, has slightly awkward string handling, and exactly three data types (list, tuple, atom). And a Prolog-style syntax with a number of inconveniences about the use of commas, semicolons, and full stops. Editing a clause is never straightforward.

I have since abandoned go again. It also has inconsistencies (the use of ‘new’ for some data types and ‘make’ for others), and worst of all, there was so much talk about wanting to add generics to the language that I fear they will become a feature of it. I don’t like generics: they seem to me to be more trouble than it’s worth. They make code really hard to read, and inflexible. They might make some kinds of bugs impossible, but in my view that is a feeble gain for wrecking a language. As Knuth (I think) said, part of writing programs is aesthetics. I cannot like Java code full of abstract type annotations. Objective-C is so much cleaner in comparison. And so was go, until now.

Another reason is the concurrency support. Go uses pipes for that, which seems awkward. I much prefer Erlang’s mailboxes, which neatly work together with pattern matching to respond to certain messages and ignore others. You do not need to worry about the order in which messages arrive as much, and the whole communication process is a lot easier with only the basic data types.

So I’m going back to Erlang. I will dig out the string library that I started, and get back into thinking recursively. At least I know where I am with it, and it is not suddenly going to change!

Sentence Disambiguation – Modality to the Rescue! 12 November 2009

Posted by Oliver Mason in linguistics.
add a comment

I’m currently reading a new book on iPhone development, iPhone Advanced Projects by Apress. I will probably talk about that book in a later post, but today I will just focus on one sentence I came across on page 212:

I also adore the capability that I have to flag articles from folks I follow on Twitter and save them to Instapaper.

This sentence has (at least) two readings, which are probably only obvious to a linguist (and who else would care?); I highlight the differences by adding commas:

  1. I adore the capability, that I have to flag articles…
  2. I adore the capability that I have, to flag articles…

In the first case you adore the capability. And the capability is that you have to do something (flag articles). Sounds rather odd, doesn’t it? The second case is more clear-cut and easy to understand: you can flag articles, and that’s the capability you have and adore.

So in terms of pattern grammar, you’re either looking at N that or N to-inf with capability. If you consult the Cobuild Dictionary, you’ll find that capability only occurs with the second pattern, the to-infinitive, so that you can rule out the first reading.

Another possibility would be to look at it in terms of modality: here we could argue that capability prospects a modality of ability, but have to expresses obligation; the two don’t go together. Hence the first reading sounds odd, as a capability does not usually force you to do anything, but rather enables you. It could, however, be used to signal sarcasm or irony, as in (the obviously made up) I really like that my new computer gives me the capability to have to save my work every five minutes. This is clearly an odd sentence, suggesting that modality works along similar lines as discourse prosody as described by Louw (1993) [for the full reference follow the previous link].

Here we have discussed two ways of disambiguating a sentence, one based on grammatical properties (or typical environments), and one on a non-syntactic phenomenon (modality). Pattern grammar allows us to identify what the typical usage would be, whereas modality explains to us why the first reading is at odds with the corresponding words. Now all we need is a ‘pattern grammar’ for modality!

Converting AMR to MP3 29 October 2009

Posted by Oliver Mason in misc.
add a comment

For a teaching project, students were required to audio record a group discussion meeting. For this they could either use a digital recorder, or any other recording equipment they had available. One group used a mobile phone, and we managed to transfer the audio file from the phone to my MacBook using Bluetooth.

Initially I was surprised at the small size of the file: 4.5 MB for a recording of about 50 mins. The quality is of course poor, but AMR does a pretty good job of keeping it small. The equivalent MP3 file comes in at 22.5 MB.

But there was a problem: dealing with AMR files. I cannot load it into my version of Audacity to convert it to MP3, but perhaps there’s a plug-in somewhere. I was not able to immediately find one. Then I found that Quicktime can play back AMR files, and save them in other formats, but for that you need to fork out money to get the Pro version. I then tried out iMovie, and managed to import the file as the sound track of a yet to complete movie. Attempts to extract the audio were fruitless; but then I accidentally noticed that the sound track was referred to as ‘aiff’ file. I opened the movie project package, and, hey presto, there was an aiff file. iMovie had converted it during the import.

So now I could import that into Audacity, and export from there to MP3.

What’s a UITextEffectsWindow? And why is it receiving messages? 17 September 2009

Posted by Oliver Mason in Apple, iphone, objective-c, programming.
1 comment so far

I just spent several hours (or at least it felt like several hours!) in frustration, searching a trivial bug. I’ve been testing a quick’n’easy prototype screen with an UIImageView and four UIButtons. The buttons are linked via an action to a view controller. And every time I press a button, my app conks out complaining that -[UITextEffectsWindow buttonPressed:] was an unrecognised selector. I checked the memory address, and it said it was my view controller, just before that exception was thrown.

I was ready to put the blame on some mistakes with Interface Builder, until I came across the solution (indirectly) in a blog: here the problem described related to properties, and the difference between vc = … and self.vc = …. I had another look at my code and quickly found the offending line: I had the view controller as a local variable in the app delegate’s ‘viewDidLoad’ method, and I autoreleased it. In other words, by the time the button was pressed the view controller no longer existed, and hence I got that weird error message.

This was not helped by the fact that ‘UITextEffectsWindow’ is not mentioned in the documentation anywhere, as it seems to be an internal UIKit class, but at least it appears to be consistent.

So, if your button presses send messages to ‘UITextEffectsWindow’, make sure to check that your view controller is still alive!

Application Promiscuity 7 September 2009

Posted by Oliver Mason in Apple, iphone, objective-c.
1 comment so far

I was getting a bit bored with the slow progress on my Esperanto dictionary app, and over the holidays I started work on a few other ideas I had. One was a Maths-drill program for kids, as the ones that are already out there (at least the ones I tried) don’t seem to be ideal (nothing ever is ideal, though!). So I tried writing another app so our kids could play and practice their maths skills.

That app is almost done, just the artwork and sound effects are missing. At the moment it looks pretty rubbish (but looks aren’t that important as long as it works and doesn’t crash!), and the sound effects are nicked from somewhere, so I have to replace them with free ones. Again, the purpose was to try things out.

That app was quite fun, and also easy to do. More on that later…

The next app is one that supports teaching and learning students’ names. This makes use of a navigation controller, which is slow going. I’m picking up loads of experience in Objective-C quirks along the way. For example: avoid using NSNumbers as the keys in a dictionary if you want to save it using writeToFile later on…

Overall it’s very exciting, and the iPhone is a fun platform. It’s really great to see your own stuff amongst all those polished apps, and provides great motivation to do better.

NSData Naughtiness 13 July 2009

Posted by Oliver Mason in Apple, objective-c, programming.
add a comment

Well, this is not exactly NSData’s fault, but I ran into a problem (for the second time; the first time I bypassed it with a short-cut) when reading text data from a file.

Occasionally there was random garbage at the end of a line, which I could not understand. Incidentally, I was reading a number of full files in one go each into an NSData instance, and converted that into an NSString with the correct encoding; this I would then tokenise and add to another file. So the garbage was actually at the end of each file. I then found that I can directly initialise an NSString with the contents of a file, and the problem disappeared.

Now I want to produce concordance lines, and I jump into the middle of the file to read a stretch. First I run into trouble with the encoding: as the data is UTF8-encoded, a random jump can end up in the middle of a multi-byte character. NSString does not like that… but here I can just test for that and skip the initial bytes. The same problem obviously also happens at the end, where the final multi-byte character could be incomplete. Again, truncation seems the easy way out.

But I also then had the issue with the occasional random garbage again! NSData seems to be at fault, and this time I can’t bypass it, as NSString can only read a full file. Quick websearch, and the solution crops up (in an aside) on stackoverflow.com: the data that NSData returns from the -bytes method is not zero-terminated, but NSString’s -stringWithUTF8String expects that, hence the random garbage of the unterminated data. In a way I’m surprised that it actually worked most of the time!

Dictionary Dangers 9 July 2009

Posted by Oliver Mason in Apple, objective-c, programming.
add a comment

I just spotted a bug that cost me several hours yesterday, without having a clue what was going on. I’m currently working on a program which indexes texts, and as I encounter words, I add them to a NSMutableDictionary with their positions in the text. All was working well, until I tested it on a bunch of texts, and it failed with some obscure message about key-value coding. I then added some NSLog messages, and discovered that the token it failed on was the at sign, @.

Today I had another look at the documentation of NSMutableDictionary, and especially the method valueForKey: where the problem seemed to occur. And there it was: “If key does not start with “@”, invokes objectForKey:. If key does start with “@”, strips the “@” and invokes [super valueForKey:] with the rest of the key.” Suddenly it dawned on me: I was using the wrong method. Instead of valueForKey: I should have used objectForKey: – the plain @-sign is discarded, and leaves an empty key (which did of course make the error message less comprehensible, as I couldn’t really tell it was empty).

Quick change in the code, and it works. Problem solved!

Lesson learned: always pay close attention to the available methods, and make sure it is the one you want, even if the one you’re using sounds like it’s the right one. And read the docs carefully!