Wednesday, November 13, 2013

On MOOCs

I just completed Keith Devlin's course "Introduction into Mathematical Thinking" on Stanford's coursera platform. It's a massive open online course or MOOC. It wasn't the first one I did. But this one was, based on my limited experience, so remarkable that I wanted to write about it here.

The course may not seem overly exciting at first glance and maybe even after looking at the syllabus: logical combinators, quantifiers, elementary number theory and proofs. The primary audience are first-year college students. What makes Devlin's course different from the two excellent courses on more advanced topics I have attended so far (one on machine learning, one on functional programming) is that he really made the most of the format and I learned a lot on the way.

I will try to explain my point by quickly summarising and comparing the different components of the MOOCs I have seen so far.

The Lecture Videos

The MOOCs I have signed up for consist usually of more or less well edited lecture videos accompanied by some form of assignment or homework.

I have seen quite elaborate technical solutions in regards to how the videos are presented. Martin Odersky used a quite fancy overlay technique during his functional programming course, which seemingly allowed him to "write" onto the slides he was using to support his presentation as if it was paper. You could even see his hands.

Keith Devlin's approach was more low tech: He used almost no slides just a piece of paper while the camera was looking from a 90 ° angle onto the table. This may seem like a lot of slow handwriting, but some clever editing makes for an impressively simple and effective presentation format: sequences where he has to write longer bits of text or mathematical formulae are sped up ever so slightly during editing to avoid the effect that the students have to "wait" until he finishes writing.

One big advantage of the MOOC format is that you can adjust the speed of the videos to your level of comprehension. If the teacher is very verbose and overly detailed in his explanations you speed up the video. I found that I can still follow when watching at 1.5 times the normal speed. On the other hand, if you do not understand something you can pause or watch it again until you get it.

The Assignments

Computer science lends itself well to the MOOC format because its "output" can in most cases be graded by a machine (of course not without some considerable effort to set this up). And indeed the assignments I submitted in the programming related courses were graded by running some form of automated testing on my code.

Mathematics, you might think, are even better: the output of your computation is typically highly formalised and either true or false. But this course was not about learning how to solve equations or learn new "recipes" how to compute things. It was focused on teaching a different way of thinking about mathematical problems and a way of communication that is highly subjective in its form of expression: the mathematical proof.

Devlin and his team had to come up with a way to evaluate the quality of proofs submitted by thousands of students. There is no easy way to do that without an army of human teaching assistants looking at every single submission. They came up with the following idea: instead of having the students submit proofs for grading by teachers let them grade the proofs themselves.

This is not only a brilliant solution to the technical problem but has a nice educational benefit. It turns out you learn a substantial amount by looking at other people's work and trying to grade them.

The proofs we got to grade as part of an assignment were carefully prepared by Devlin for the best educational effect: he added typical mistakes and imperfections based on his experience.

Exam And Peer Review

This course had a final exam (as opposed to the other courses). The exam was the only occasion where we had to submit proofs we actually did ourselves. All the grading was done via peer review. We had exercised our grading skills during the assignments and this was the first attempt to apply them to real world data.

I found this to be the most exciting part of the course. I graded the exams of three of my peers. First of all, it was a shock: real exams looked quite different compared to the artificial training proofs. Sometimes I reached the limits of my mathematical abilities while trying to judge whether the slightly unorthodox proof my fellow student had submitted was valid and if yes how good it was. The final step was to grade your own exam after grading your peers. I looked at my work with different eyes after grading the others. It turned out that the grade I gave myself came reasonably close to what my peers thought of my work.

And of course there were quite a few students complaining in the forums that they felt their peers had done a poor job at grading them. Anecdotal evidence in form of exam proofs posted by those students seemed to support their case. It is obvious that this format is still experimental and very much work in progress. But there is not much to lose: you don't get any credit towards a possible degree at any university I know of by doing Coursera's courses.

Final Thoughts

There are many questions unanswered about MOOCs: Who is the target audience (I do it just for fun, but what if you want to obtain a degree)? How does the business model work (Coursera has started to offer a so called "signature track" where you have to pay and get a verified certificate)? What is the hidden agenda (some courses can be seen as a form of clever advertisement for a specific technology e.g. a programming language)? How do you replace the missing personal contact and tutoring (discussion forums are an imperfect but valuable substitute)? Keith Devlin's course might point to a possible answer to one of the most fundamental problems of the format: how to make qualitative judgements about the work of tens of thousands of students. His answer might not be perfect but it is certainly very instructive.

Saturday, October 05, 2013

Tries, JSNI and GWT

It's been a long time since the last post. I have been busy developing a web app based on GWT recently. While it's not exactly cutting edge technology, there are still some interesting aspects.

For example, there is the SuggestBox implementation that comes with GWT. You can use it to implement predictive text or auto-suggestions for a text input.

The suggest box uses a so called SuggestOracle to calculate the suggestions for a given string entered by the user. The default implementation that comes with GWT is the MultiWordSuggestOracle, which itself uses a combination of hash tables and a prefix tree to do its job.

A Trie is a Tree

A prefix tree or trie (as in retrieval) is a search tree where the nodes do not contain the keys themselves. Instead, the key for a particular node is determined by its position in the tree. More specifically, the key for a particular node is the concatenation of all the prefixes on the edges above it.

     X(he)        > prefix:""
      \
       X(ll)      > prefix: "he"
        \
         X(o)     > prefix: "hell"

If you look a my attempt at visualising a prefix tree for the word "Hello" above, you will notice that the root node is special in that its prefix is the empty string. The other thing to notice is that my example tree uses a prefix size of two characters. This has certain consequences for lookup time and data storage as we will see shortly. (Note: I put the values on the nodes to make it easier to read.)

What happens if you want to store more than one word in your prefix tree? Let's say we want to add the word "hell" to our trie.

     X(he)                 X(h)       
      \                     \
   (ll)X(ll)      OR         X(e)
        \                     \
      (o)X                     X(l)
                                \
                                 O(l)
                                  \
                                   O(o)

First of all the word "hell" was already in the tree as it is a substring of "hello". But the tree did not "know" about it. The tree on the left is the same tree I introduced before but on the first child node the additional "ll" string has been added. This is how the prefix tree that comes with GWT works. When adding new words, in this case "hell", we walk down the tree and we check on every node whether the remaining string length is equal or less than the prefix size (2). If so, we store the value directly on the node. If not, we look for a matching prefix tree below or create one. The prefix size increases by a power of 2 for each layer. (To keep it simpler I left that bit out in the illustration.)

If we compare this to a prefix tree with a prefix length of one (illustrated above on the right), there are two observations to make. First, you need a way of marking nodes of interest that are not leaf nodes. In our case we want to make clear that the string "hell" is a meaningful value and not just a substring of "hello". The GWT trie uses a separate storage location to hold "ll". This means that when looking up all values starting with "he", you have to look at those values stored on the nodes below "he" directly ("ll") and all values on child nodes ("llo"). In the case of the single character prefix you need some way of marking nodes that contain suffixes (i.e. that are the end of a meaningful word). I highlighted those nodes in the graphic by using an "O".

Oh, performance!

The second observation is that using a prefix size greater than one reduces the depth of the tree considerably. But, of course, there is a trade-off. If you choose your prefix too long, you lose the property that made the trie the right choice for the task: efficient lookup. The time complexity for a lookup in a prefix tree is determined by the length of the query string n and not the size of the tree, it is therefore O(n).

By increasing the prefix size, the deeper we descend into the tree, we give up that property. The longer your prefix, the more values will be stored on a particular node. The time complexity is now dominated by the size of the tree and the data structure used to store values on the nodes. Because you have to check every suffix on a particular node for a partial match, time complexity is now O(m), where m is the number of values stored on that node. So the worst case performance for this modified prefix tree with a start prefix length of p is O((logp(n)m) on a degenerated tree with all values on one node.

I can only guess that the reason for this design decision was an assumption about typical usage patterns of the auto-suggest feature. If the user accepts a suggestion after typing only a couple of characters, the increasing prefix size makes the tree actually more efficient. The lookup is reasonably efficient for short queries, where the prefixes are still short and you do not have to descend into a very deep tree to find all suggestions.

This implementation also avoids deep trees consisting of many objects, because in a typical prefix tree you would find one (sub-)tree object per character. The creators seem to have traded lookup efficiency for easy retrieval, smaller memory footprint and a lower cost of object creation in JavaScript.

There is much more to be said about prefix trees, but what I want to try here is to go from the high level concepts down to the implementation details in Java and eventually to the resulting JavaScript all in one post. So let's look at the implementation now.

JSNI is the inline assembly of GWT

Well, sort of. JSNI allows you to include bits of "native" JavaScript in your Java code. By writing a "native" method in Java and including the JavaScript in a special comment block, you instruct the GWT compiler to include that bit of code with only minimal checking directly into the resulting JavaScript object.

In this example you can also see how the interaction with those parts of the PrefixTree class works that have not been implemented in JSNI but in Java. It's not exactly elegant or easy too read. Suddenly, the assembly comparison seems not so far off. Once you put JavaScript into a forced marriage with Java, it loses all elegance and succinctness. What you can see, though, is how all the members of the PrefixTree class are translated to JavaScript by referring to them in "mangled" syntax that uses fully-qualified names for everything.

If you try to find the method I just used as an example for the JSNI syntax in the output of the GWT compiler, you won't find it. At least not where you expect it. The compiler has inlined the "clear" method into the constructor of the prefix tree. The generated code is very readable compared to its JSNI counterpart.

If you turn on the emulated stack trace feature of GWT, the compiler keeps track of the original source on every call by storing the current location in a global variable. All the readability is gone. But it shows again how the code has been inlined. You can now see the line numbers from the original Java source file:

All non trivial methods of the PrefixTree class are implemented using JSNI in GWT.

What does it buy you?

First, there is a cost attached: you lose almost all tool support in your IDE. There is no refactoring tool that will pick up usages inside a JSNI block. It is easy to get your JavaScript wrong, introduce a memory leak and so forth.

What you're promised in return is performance. To test that I reimplemented the PrefixTree in pure Java, compiled it with GWT and compared the two implementations in a recent version of Firefox. I saw that the optimised method to calculate the possible suffixes for a given prefix was about four to six times faster than my reimplementation.

My simple test setup was to implement both versions using a small subset of about 2000 geographic names in the USA, run the same queries on both of the inputs and profile them using the build-in Firefox profiler.

The version without JSNI shown above spends about 20% of the overall runtime looking up suggestions in the trie. It also noteworthy that about 80% of the time is spent outside the prefix tree code.

Looking at the optimised version, the speed-up is striking but, at the same time, the ratio of tree lookup / time spent on things not related to the lookup is even more off-balance. The reason is to be found in the way the MultiWordSuggestOracle is implemented. As it is a multi-word suggest-box, it needs to map partial matches across word boundaries back to the original input strings (e.g. a query for "Mou" would match "Aquarius Mountains" etc.) It therefore keeps a separate copy of all "real" suggestions in a hash map. It then splits up the originals at the word boundaries and inserts them as a normalised, "fragmented" version of each suggestion into the tree. Looking up the original suggestions and sorting the final results takes about 50% of the time; the actual tree lookup amounts to only 4.5%.

Conclusion

I find it very rewarding to look at a piece of technology and to try to dig up layer after layer to understand its inner workings and the design decisions behind it. What I found in this case supports my general opinion about GWT in that there is a whole lot of cleverness and well thought-out solutions to be found under the hood: the right data structure with some well advised optimisations yields a convincing solution. But it has showed me again the ambiguity that comes with most optimisations. While the 5x performance gain justifies the use of JSNI, the component itself is only one small part in the greater context of the auto-suggest feature. In this bigger context the gain is much smaller and limited to the stock implementation that comes with GWT. While the PrefixTree and its optimisations are well isolated from other code by making the class package private, it also means you cannot reuse any of it easily in your auto-suggest implementation.