Publishing a Paper without the Code is Not Enough

Shoulders to stand on

Here are a couple of anecdotes demonstrating that without access to the implementation of an experiment, scientific progress is halted.

  • When I first began my master's degree, I made serious starts on a few ideas that were eventually blocked for lack of access of one kind or another. In one case I was unable to make progress because my idea involved building on the work of another researcher, who would not release his code to me.

    (To give the benefit of the doubt, I now assume that the code was sloppy, missing pieces, or poorly documented, and he was simply embarrassed. Or perhaps there was some other valid reason. I don't know; he never gave one.)

    Regardless of his motivation, this was a suboptimal outcome for him as well as for me, because his work could have been extended and improved at that time, but it was not. I had already spent a lot of time working out design details of my project, under the assumption that the code would be available. Nonetheless, I felt it would be too much work to replicate his entire thesis, and so I moved on to another topic.

  • A colleague in my cohort had a related experience in which she did replicate the work of a renowned scientist in the field, in order to attempt some improvements. In her case, the code she wrote, guided by her reading of the relevant paper, didn't do what the paper claimed. It is not clear whether this was because she was missing some vital piece of the methodology, or because the claims were not justified. Neither is an acceptable result.

Having talked with other students, I know that these are not isolated experiences.

Surely, you too have recently come across some fascinating scientific results that gave you an idea you wanted to implement right away, but to your dismay, you quickly realized that you would have to start from scratch. The algorithm was complex. Maybe the data was not available. With disappointment, you realized that it would take you weeks or months of error-prone coding just to get to the baseline. Then you would not be confident that you implemented it in exactly the same way as the original, and there would be no direct comparison. So you wrote it down in your Someday/Maybe file and forgot about it. This happens to me constantly, and it just seems tragic and unnecessary.

My own dog food

Later, I too did a replication study. In this case my advisor and a fellow student wanted to compare their work to known work on the same problem. However, neither the code nor the data had been released (the data was proprietary), and the evaluations were not published in a form comparable to more modern evaluations. Luckily for us, the method worked beautifully, and now everyone can see that more clearly.

In an ironic turn, because I was a novice programmer at the time, my replication code was disorganized. Some of it was lost. It was not under version control until quite late, and it had few tests of correctness. I now had grounds to empathize with the colleague I had earlier felt slighted by. However, I am in the fortunate position of having had to take a forced break before graduating, during which I learned basic software engineering skills and had ample time to think about this issue.

I am now rewriting the entire code base so that the experiments can be completely replicated from scratch. Every step will be included in the code I release, including "trivial" command line glue. Although every module has tests, no code is invulnerable. Bugs may well turn up, and if they do, they will be there for analysis and repair by anyone who cares. Most importantly, if anyone wants to know exactly what I did, they will not have to scour the paper for hints. Similarly, all the data will be at their disposal, in case my own analysis lacks the answer to some unforeseeable analytical question.
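To make "every step, including the glue" concrete, here is a minimal sketch of what a single entry point for a replication might look like. It is not the code from my project: the step names (`prepare_data`, `evaluate`), the toy data, and the `manifest.json` file are all hypothetical, chosen only to illustrate the idea. Each step is an ordinary function, the driver runs them in order, and a hash of every output is recorded so that a later run can be checked against the original.

```python
import hashlib
import json
from pathlib import Path
from typing import Callable, Dict, List, Tuple

# A step is a name plus a function that writes one output file into the
# working directory and returns its path. (Hypothetical convention.)
Step = Tuple[str, Callable[[Path], Path]]


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def run_pipeline(steps: List[Step], workdir: Path) -> Dict[str, str]:
    """Run every step in order and record a hash of each step's output."""
    workdir.mkdir(parents=True, exist_ok=True)
    manifest: Dict[str, str] = {}
    for name, step in steps:
        output = step(workdir)
        manifest[name] = sha256_of(output)
    # The manifest lets anyone compare a re-run against the original run.
    manifest_path = workdir / "manifest.json"
    manifest_path.write_text(json.dumps(manifest, indent=2, sort_keys=True))
    return manifest


def prepare_data(workdir: Path) -> Path:
    """Toy stand-in for the 'trivial' glue that builds the input data."""
    out = workdir / "data.txt"
    out.write_text("\n".join(str(n) for n in range(10)))
    return out


def evaluate(workdir: Path) -> Path:
    """Toy stand-in for the experiment: report the mean of the data."""
    lines = (workdir / "data.txt").read_text().splitlines()
    values = [int(line) for line in lines]
    out = workdir / "result.txt"
    out.write_text(f"mean={sum(values) / len(values):.3f}\n")
    return out


STEPS: List[Step] = [("prepare_data", prepare_data), ("evaluate", evaluate)]

if __name__ == "__main__":
    print(run_pipeline(STEPS, Path("replication_run")))
```

In this sketch, two runs in different directories should produce identical manifests as long as the steps are deterministic. A mismatch would signal that something in the pipeline depends on state that was not captured in the released code or data.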

This should be standard

In 2013, there is no excuse for publishing a paper in applied computer science without releasing the code; a paper by itself is an unsubstantiated claim. Unlike in biology, or other physical fields, applied computer science results should be trivial to replicate by anyone with the right equipment. Moreover, not releasing the code gives your lab an unsportsmanlike advantage: you get to claim the results, perhaps state-of-the-art results, and you get to stay on the cutting edge, because no one else has time to catch up.

Many universities and labs now have projects under open licenses, but it is by no means a standard, and it is not a prerequisite for publication. We ought to change this.

