The Year of the Em Dash

I admit it. The first thing I thought of when I noticed what year was coming was—"em dash". Unicode character U+2014 is the em dash, and as I am wont to type it in manually by its code point, the number and the character are practically synonymous to me [1].

We symbolic creatures like to imbue all things with meaning. While I am not superstitious, I enjoy the Rorschach test–like nature of divination. If I ask myself to make a connection between the changing of the year right now and the character em dash, it reflects my state—and nothing more—but this does not diminish its value.

In contrast with last year's character—the en dash—which is supposed to be used in place of 'to' in spans such as 2010–2016, or as a hyphen between open compounds, as in "computational linguistics–machine learning conference", the em dash is used informally in place of commas, colons, or semicolons for emphasis, interruption, or an abrupt change of thought. So I'm expecting a year of diversion: a relevant but indirect route, with an ultimate return to the path previously intended.

We have been stuck in a run of dash characters since 2010, the year of the hyphen, and won't escape until 2016, the year of the double vertical line (‖), which, I suppose, will entail some sort of parallelism.
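The run of characters can be checked with a few lines of Python. This is only a small sketch: the helper name year_character is made up for illustration, and it simply reads the digits of the year as a hexadecimal code point and asks the standard unicodedata module for the character's official name.

```python
import unicodedata


def year_character(year):
    """Read the digits of `year` as a hexadecimal Unicode code point.

    Returns the character and its official Unicode name.
    `year_character` is a hypothetical helper, not a library function.
    """
    char = chr(int(str(year), 16))
    return char, unicodedata.name(char)


# Print the characters for the years discussed in this post.
for year in range(2010, 2017):
    char, name = year_character(year)
    print(f"{year}  {char}  {name}")
```

The loop covers 2010 through 2016, so the hyphen, the en dash, the em dash, and the double vertical line all appear in the listing.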

Happy New Year!

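[1] Ctrl-Shift-U followed by 2014 will do it in many applications. To enable this in Emacs, put "(global-set-key (kbd "C-S-u") 'quoted-insert)" in your .emacs file. Note that quoted-insert reads digits as octal by default, so you may also need "(setq read-quoted-char-radix 16)" for 2014 to be read as a hexadecimal code point.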

Malaprop v0.1.0

"...she's as headstrong as an allegory on the banks of Nile."

— Mrs. Malaprop, in Sheridan's The Rivals   

As a contribution to the adversarial evaluation paradigm, I have released my first version of Malaprop, a project involving transformations of natural text that result in some words being replaced by real-word near neighbours.

The Adversarial Evaluation Model for Natural Language Processing

Noah Smith recently proposed a framework for evaluating linguistic models based on adversarial roles ¹. In essence, if you have a sufficiently good linguistic model, you should be able to differentiate between a sample of natural language and an artificially altered sample. An entity that performs this differentiation is called a Claude. At the same time, having a good linguistic model should also enable you to transform a sample of natural language in a way that preserves its linguistic properties; that is, that makes it hard for a Claude to tell which was the original. An entity that performs this transformation is called a Zellig. These tasks are complementary.

This framework is reminiscent of the cryptographic indistinguishability property, in which an attacker chooses two plaintexts to give to an Oracle. The Oracle chooses one and encrypts it. The encryption scheme is considered secure if the attacker cannot guess, at better than chance, which of the two plaintexts corresponds to the Oracle's ciphertext.

Even though encryption schemes are constructed mathematically, questions of security are always empirical. The notion of Provable Security is regarded with skepticism (at least by some); schemes are considered tentatively secure based on withstanding attempts to break them. Similarly, it would take an array of independent Claudes, all unable to guess at better than chance, to support the claim that a given Zellig had hit upon a truly linguistic-structure-preserving transformation. Likewise, if an array of independent Zelligs can't fool a given Claude, that would support a strong claim about that Claude's ability to recognize natural language.
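To make the game concrete, here is a minimal sketch of one way to score a Claude against a Zellig. The function names and the calling convention (a Zellig is a function from text to altered text; a Claude is a function that returns 0 or 1 for the position it believes holds the original) are my own assumptions for illustration, not anything specified in Smith's paper.

```python
import random


def claude_accuracy(claude, zellig, samples, rng):
    """Return the fraction of trials in which `claude` spots the original.

    `zellig(text)` returns an altered copy of `text`.
    `claude(first, second)` returns 0 or 1, the position it believes
    holds the unaltered text. `samples` must be non-empty.
    All of these names are hypothetical, chosen for this sketch.
    """
    correct = 0
    for original in samples:
        altered = zellig(original)
        # Place the original at a random position so that position
        # itself carries no information.
        original_position = rng.randrange(2)
        pair = [altered, altered]
        pair[original_position] = original
        if claude(pair[0], pair[1]) == original_position:
            correct += 1
    return correct / len(samples)


def shouting_zellig(text):
    """A deliberately bad Zellig: its output is trivially detectable."""
    return text.upper()


def lowercase_claude(first, second):
    """A Claude that assumes the original is the text that is not all caps."""
    return 0 if first != first.upper() else 1


samples = ["the cat sat on the mat", "she is as headstrong as an allegory"]
print(claude_accuracy(lowercase_claude, shouting_zellig, samples, random.Random(0)))
```

An accuracy near 0.5 over many samples would mean this Claude cannot tell this Zellig's output from natural text; an accuracy near 1.0, as with the toy pair here, means the transformation is easy to detect.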

A Real-Word Error Corpus

In this spirit, I've just released Malaprop v0.1.0. It creates a corpus of real-word errors embedded in text. It was designed to work with Westbury Lab's Wikipedia corpora, but can be used with any text.

The code acts as a noisy channel, randomly inserting Damerau-Levenshtein errors at the character level as a word is passed through. If the resulting string is a real word — that is, a sufficiently frequent word in the original corpus — the new word replaces the original.
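The idea can be sketched in a few lines of Python. This is not the Malaprop code itself: the function names, the one-edit-per-word simplification, the error_rate parameter, and the use of a plain set as the "sufficiently frequent words" vocabulary are all assumptions made for illustration.

```python
import random
import string


def random_edit(word, rng, alphabet=string.ascii_lowercase):
    """Apply one random Damerau-Levenshtein edit to `word`.

    The four edit types are insertion, deletion, substitution, and
    transposition of two adjacent characters.
    """
    ops = ["insert"]
    if len(word) >= 1:
        ops += ["delete", "substitute"]
    if len(word) >= 2:
        ops.append("transpose")
    op = rng.choice(ops)

    if op == "insert":
        i = rng.randrange(len(word) + 1)
        return word[:i] + rng.choice(alphabet) + word[i:]
    if op == "delete":
        i = rng.randrange(len(word))
        return word[:i] + word[i + 1:]
    if op == "substitute":
        i = rng.randrange(len(word))
        return word[:i] + rng.choice(alphabet) + word[i + 1:]
    # transpose: swap the characters at positions i and i + 1
    i = rng.randrange(len(word) - 1)
    return word[:i] + word[i + 1] + word[i] + word[i + 2:]


def real_word_channel(words, vocabulary, error_rate, rng):
    """Pass words through a noisy channel, keeping only real-word errors.

    Simplification (an assumption of this sketch): at most one edit is
    attempted per word, with probability `error_rate`.
    """
    output = []
    for word in words:
        if rng.random() < error_rate:
            candidate = random_edit(word, rng)
            # Keep the corrupted form only if it is a different, real word.
            if candidate != word and candidate in vocabulary:
                word = candidate
        output.append(word)
    return output


vocabulary = {"the", "then", "than", "cat", "cart", "act", "sat", "on", "no", "mat", "man"}
print(real_word_channel("the cat sat on the mat".split(), vocabulary, 0.9, random.Random(0)))
```

Most random edits produce non-words and are discarded, which is why a high per-word error rate still leaves most of the text untouched.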

I intend to use this corpus to evaluate algorithms that correct orthographical errors. However, it could be used quite generally as just one Zellig in what I hope becomes a large body of such resources.


The term malapropism was first used in the context of the computational linguistics task of real-word error detection and correction by David St-Onge in 1995 in his Master's thesis, Detecting and Correcting Malapropisms with Lexical Chains.

¹ Noah A. Smith. Adversarial Evaluation for Models of Natural Language. CoRR abs/1207.0245, 2012.

Publishing a Paper without the Code is Not Enough

Shoulders to stand on

Here are a couple of anecdotes demonstrating that without access to the implementation of an experiment, scientific progress is halted.

  • When I first began my master's degree, I made serious starts on a few ideas that eventually were blocked for lack of access of one kind or another. In one case I was unable to make progress because my idea involved building on the work of another who would not release his code to me.

    (To give the benefit of the doubt, I now assume that the code was sloppy, missing pieces, or poorly documented, and he was simply embarrassed. Or perhaps there was some other valid reason. I don't know; he never gave one.)

    Regardless of his motivation, this was a suboptimal outcome for him as well as for me, because his work could have been extended and improved at that time, but it was not. I had already spent a lot of time working out design details of my project, under the assumption that the code would be available. Nonetheless, I felt it would be too much work to replicate his entire thesis, and so I moved on to another topic.

  • A colleague in my cohort had a related experience in which she did replicate the work of a renowned scientist in the field, in order to attempt some improvements. In her case, the code she wrote, guided by her reading of the relevant paper, didn't do what the paper claimed. It is not clear whether this was because she was missing some vital piece of the methodology, or whether the claims were not justified. Neither is an acceptable result.

Having talked with other students, I know that these are not isolated experiences.

Surely, you too have recently come across some fascinating scientific results that gave you an idea you wanted to implement right away, but to your dismay, you quickly realized that you would have to start from scratch. The algorithm was complex. Maybe the data was not available. With disappointment, you realized that it would take you weeks or months of error-prone coding just to get to the baseline. Then you would not be confident that you implemented it in exactly the same way as the original, and there would be no direct comparison. So you wrote it down in your Someday/Maybe file and forgot about it. This happens to me constantly, and it just seems tragic and unnecessary.

My own dog food

Later, I too did a replication study. In this case my advisor and a fellow student wanted to compare their work to known work on the same problem. However, neither the code nor the data was released (the data was proprietary), and the evaluations were not published in a form comparable to more modern evaluation. Luckily for us, the method worked beautifully, and now everyone can see that more clearly.

In an ironic turn, because I was a novice programmer at the time, my replication code was disorganized. Some of it was lost. It was not under version control until quite late and had few tests of correctness. I now had grounds to empathize with the colleague I had earlier felt slighted by. However, I am in the fortunate position of having had to take a forced break before graduating, during which I learned basic software engineering skills and had ample time to think about this issue.

I am now rewriting the entire code base so that the experiments can be completely replicated from scratch. Every step will be included in the code I release, including "trivial" command line glue. Although every module has tests, no code is invulnerable. Bugs may well turn up, and if they do, they will be there for analysis and repair by anyone who cares. Most importantly, if anyone wants to know exactly what I did, they will not have to scour the paper for hints. Similarly, all the data will be at their disposal, in case my own analysis lacks the answer to some unforeseen analytical question.

This should be standard

In 2013, there is no excuse for publishing a paper in applied computer science without releasing the code; a paper by itself is an unsubstantiated claim. Unlike in biology, or other physical fields, applied computer science results should be trivial to replicate by anyone with the right equipment. Moreover, not releasing the code gives your lab an unsportsmanlike advantage: you get to claim the results, perhaps state-of-the-art results, and you get to stay on the cutting edge, because no one else has time to catch up.

Many universities and labs now have projects under open licenses, but it is by no means a standard, and it is not a prerequisite for publication. We ought to change this.

Unit Test Fairies

Sometimes unit tests take more than a few seconds to run, especially if they involve training on a small test corpus in some way or other. This can be dangerous for my focus. "I'll just check my email while I'm waiting for this test to complete" are famous last words that can lead to a slow development cycle. See compiling.

So, I started queuing a sound file to play after each test finished. This way I would be alerted immediately, rather than relying on polling.

Recently Zooko helped me improve this system by adding the feature of switching the file to play based on the test results.

Currently, I have the following alias in my .bash_aliases file:

alias testbell='X=$? ; if [ $X = 0 ] ; then mplayer -really-quiet -volume 50 ~/chimes3.mp3 ;
else mplayer -really-quiet ~/gong.mp3 ; fi'

The -really-quiet flag is about verbosity, not volume, and the -volume 50 is just because my chimes file is much louder than my gong file.

So today, for example, I've been running:

time python -m unittest code.error_insertion.test.test_RealWordErrorChannel.RealWordErrorChannelTest ;

A friend told me it sounds like there is a fairy somewhere intermittently trying to escape.
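For anyone who would rather not rely on a shell alias, the same idea can be sketched as a small Python wrapper. This is an illustration only: the function names are made up, the sound file paths are the ones from my alias, the volume adjustment is left out, and it assumes mplayer is installed and on the PATH.

```python
import os
import subprocess
import sys


def pick_sound(returncode, success="~/chimes3.mp3", failure="~/gong.mp3"):
    """Choose the sound file from a process exit status (0 means success)."""
    return os.path.expanduser(success if returncode == 0 else failure)


def mplayer(path):
    """Play `path` quietly; assumes mplayer is installed and on PATH."""
    subprocess.run(["mplayer", "-really-quiet", path], check=False)


def run_then_ring(command, play=mplayer):
    """Run `command`, then play a sound that depends on how it exited.

    `play` is any callable taking a file path, so the player can be
    swapped out (or replaced with a stub when testing).
    """
    returncode = subprocess.run(command, check=False).returncode
    play(pick_sound(returncode))
    return returncode


if __name__ == "__main__" and len(sys.argv) > 1:
    # Everything after the script name is the command to run.
    sys.exit(run_then_ring(sys.argv[1:]))
```

Saved under a hypothetical name such as testbell.py, it would be invoked as "python testbell.py python -m unittest some.test.module", and it passes the exit status of the tests through as its own.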

xmonad on Unity 12.04

I love the xmonad window manager. It's tiling, so it maximizes real estate, and it is driven by keys instead of the mouse. It's written in Haskell -- enough said.

However, much as I love to tinker, I am reluctant to give up some things that "just work" when running Gnome and its successor, Unity. So I run xmonad as the replacement window manager within Unity.

I recently had the privilege of obtaining a new Zareason UltraLap laptop (Zareason has an open bootloader, and ships with any of a variety of linuxes that also "just work"), and so I got to configure xmonad from scratch.

My modus operandi for this kind of thing is to see if someone else has solved it first, and my final (so far) configuration came from merging a few sources.

First, I followed the advice of Elon Flegenheimer. Three of my config files came from him:


[Desktop Entry]


[GNOME Session]
Name=Xmonad Unity-2D Desktop


#! /bin/sh
exec gnome-session --session xmonad "$@"

Then I based my /home/amber/.xmonad/xmonad.hs on that of Arash Rouhani, but I changed the Unity panel option to "doFloat" instead of doIgnore:


import XMonad
import XMonad.Util.Run
import XMonad.Util.EZConfig (additionalKeys)
import XMonad.Hooks.ManageDocks
import XMonad.Hooks.ICCCMFocus
import XMonad.Config.Gnome

-- Float the Unity 2D panel and launcher instead of tiling them.
myManageHook = composeAll (
    [ className =? "Unity-2d-panel" --> doFloat
    , className =? "Unity-2d-launcher" --> doFloat
    ])

main =
    xmonad $ gnomeConfig { modMask = mod4Mask
        -- Manage hooks are combined with <+>.
        , manageHook = manageDocks <+> myManageHook <+> manageHook gnomeConfig
        , layoutHook = avoidStruts $ layoutHook defaultConfig
        , logHook = takeTopFocus
        } `additionalKeys` [ ((mod4Mask, xK_d), spawn "dmenu_run -b")
                           , ((mod4Mask, xK_Return), spawn "xfce4-terminal")
                           ]

All the configurations I looked at had something in /usr/share/xsessions/xmonad-gnome-session.desktop, but my old laptop, on which I also run xmonad on Unity, does not, so I left that out.

Finally, there is a problem with application menus which I fixed per advice in that link by uninstalling indicator-appmenu.

This keymap chart shows most everything you need to know to get going with xmonad, and if you get stuck, try joining the IRC channel #xmonad for help.