hacker-school-friday-june-13th

Today it was quiet around here, which was nice. I got an actual repository started for my tagging project, which I dubbed protagonist. I found myself faced with some old procrastination behaviour, and overcame it. I had a successful and refreshing nap. All in all it was a good day for flow.

The day started with pouring rain, and now, when I need to leave, it is pouring again. I like the rain.

Here was my day:

Plan

Tahoe-LAFS symlinks

I'm still interested in this as a design problem. In fact, I love design problems. However, that is different from a coding project, and so I'm bumping this out of the privileged position. My goal for it now is to work with the devs on a spec that everyone can agree on.

Tagging

The tagging project is moving into the current priority slot.

In brief, it will be a program on top of my filesystem that organises files by multiple, non-hierarchical tags, that can be queried with Boolean operators.
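As a rough illustration of what Boolean queries over non-hierarchical tags could look like, here is a minimal sketch using Python sets. The tag names, file names, and the `tags` dictionary are all made up for the example; the real program may store things quite differently.

```python
# Hypothetical in-memory index: tag name -> set of files carrying that tag.
tags = {
    "linguistics": {"harris.pdf", "notes.txt"},
    "to-read": {"harris.pdf", "tahoe-spec.txt"},
    "tahoe": {"tahoe-spec.txt"},
}

# AND, OR, and NOT map directly onto set operations.
both = tags["linguistics"] & tags["to-read"]     # linguistics AND to-read
either = tags["linguistics"] | tags["tahoe"]     # linguistics OR tahoe
not_tahoe = tags["to-read"] - tags["tahoe"]      # to-read AND NOT tahoe

print(sorted(both))
print(sorted(either))
print(sorted(not_tahoe))
```

Because a file can sit in any number of sets at once, nothing here forces a hierarchy on the tags.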

Rather than fully specify it here, I'm making the specification my plan. This includes settling on using a sql database or links. I have the beginnings of an implementation of each, the links one using symlinks, but I'm thinking of switching to hardlinks, so that I can use it in conjunction with Tahoe-LAFS as it is.

This time I'm going to GitHub it.

Light Reading

I might go to the library today to read Zellig Harris.

Actual

Protagonist

Made a new GitHub repo.

Design

I started by considering two basic designs that I had already mocked up a little bit.

sql: The sql design would involve having a database tracking each file in the system, and a table for each tag.

Advantages:

The queries are already managed in sql, and would just need wrapping.

Recent experience with sql is probably marketable.

Disadvantages:

Wrapped sql is a bit hard to read.
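To make the sql design a little more concrete, here is a minimal sketch using Python's built-in `sqlite3` module with an in-memory database. The choice of SQLite, the tag names `draft` and `python`, and the sample paths are assumptions for illustration; only the `file_names` table and the one-table-per-tag idea come from the design above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# One table tracking every file in the system.
cur.execute("create table file_names (id integer primary key, path text unique)")
# One table per tag, holding the ids of the files with that tag.
for tag in ("draft", "python"):
    cur.execute("create table " + tag + " (id integer primary key)")

cur.executemany("insert into file_names (path) values (?)",
                [("a.py",), ("b.py",), ("c.txt",)])
cur.executemany("insert into draft (id) values (?)", [(1,), (3,)])
cur.executemany("insert into python (id) values (?)", [(1,), (2,)])

# "draft AND python" becomes an intersection of the two tag tables.
cur.execute("""
    select path from file_names
    where id in (select id from draft intersect select id from python)
""")
draft_and_python = [row[0] for row in cur.fetchall()]
print(draft_and_python)
```

This shows the advantage listed above: the Boolean logic is already in sql (`intersect`, `union`, `except`), so the program would mostly be wrapping it.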

links: The links design would involve having a special directory for tags called, maybe, .tags. Every new tag tag_x is represented by a directory .tags/tag_x/. Every file to be tagged with tag_x is linked to in that directory.

Advantages:

This is human readable, in that it is easy for a user to browse the directories and see exactly what is going on.

The code is easier to read and write just with os directives.

The design is easily extensible to key, value pairs, by making one more level of directories, and using tag/ for the ones that are just tags.

Disadvantages:

???
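Here is a small sketch of how the `.tags` layout might behave, built in a temporary directory. It uses symlinks named after the file's basename purely to keep the example short; the helper names `tag_file` and `files_with` and the sample files are hypothetical.

```python
import os
import tempfile

root = tempfile.mkdtemp()
tag_dir = os.path.join(root, ".tags")

def tag_file(tag, file_name):
    """Link file_name into .tags/<tag>/, creating the tag directory if needed."""
    os.makedirs(os.path.join(tag_dir, tag), exist_ok=True)
    os.symlink(file_name, os.path.join(tag_dir, tag, os.path.basename(file_name)))

def files_with(tag):
    """Return the names of the links found in .tags/<tag>/."""
    return set(os.listdir(os.path.join(tag_dir, tag)))

# Two throwaway files to tag.
paths = {}
for name in ("a.txt", "b.txt"):
    paths[name] = os.path.join(root, name)
    with open(paths[name], "w") as f:
        f.write(name)

tag_file("tag_x", paths["a.txt"])
tag_file("tag_x", paths["b.txt"])
tag_file("tag_y", paths["a.txt"])

# tag_x AND tag_y: intersect the two directory listings.
print(files_with("tag_x") & files_with("tag_y"))
```

Browsing `.tags/tag_x/` in a file manager or with `ls` shows the same information, which is the human-readable property mentioned above.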

Here is a comparison of code from the two explorations of the design space I had done previously:

sql

    def tag_file(self, tag, file_name):
        self.new_tag(tag)
        if not os.path.exists(file_name):
            raise TagNoFileError(file_name)
        # Look up the file's id, registering the file first if it is new.
        self.cursor.execute("select id from file_names where path='" + file_name + "'")
        primary_key_row = self.cursor.fetchone()
        if primary_key_row is None:
            self.new_file(file_name)
            self.cursor.execute("select id from file_names where path='" + file_name + "'")
            primary_key_row = self.cursor.fetchone()
        primary_key = primary_key_row[0]
        # Only insert into the tag's table if the file is not already there.
        self.cursor.execute("select id from " + tag + " where id='" + str(primary_key) + "'")
        keys = [row[0] for row in self.cursor.fetchall()]
        if not keys:
            self.cursor.execute("insert into " + tag + "(id) values('" + str(primary_key) + "')")

links

    def tag_file(self, tag, file_name):
        if not os.path.exists(file_name):
            raise TagNoFileError(file_name)
        if not os.path.exists(os.path.join(tag_dir, tag)):
            os.mkdir(os.path.join(tag_dir, tag))
        # Name the link after the md5 hash of the file's contents.
        with open(file_name, 'rb') as f:
            digest = md5(f.read()).hexdigest()
        os.symlink(file_name, os.path.join(tag_dir, tag, digest))

Unique ids

Both of these designs require a unique identifier for files. In the links mockup above, I am using the md5 hash for this purpose. (Well, actually I ought to be using BLAKE2: “Harder, Better, Faster, Stronger” Than MD5. I'll fix that this time around.) The reason I wanted to use content-based hashing as opposed to e.g. an integer, is that it might help identify multiple copies of the same file, recover from accidentally moving the original, and look for files that haven't been tagged. It might be overkill. That's ok.
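A content-based id along these lines could be sketched with `hashlib.blake2b` from the Python standard library (available since Python 3.6). The function name `content_id`, the chunk size, and the sample files are assumptions for the example, not protagonist's actual code.

```python
import hashlib
import os
import tempfile

def content_id(path, chunk_size=65536):
    """Return a hex BLAKE2b digest of the file's contents, read in chunks."""
    h = hashlib.blake2b()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# Two copies of the same bytes under different names, plus one different file.
d = tempfile.mkdtemp()
for name, data in (("orig.txt", b"hello"), ("copy.txt", b"hello"), ("other.txt", b"bye")):
    with open(os.path.join(d, name), "wb") as f:
        f.write(data)

same = content_id(os.path.join(d, "orig.txt")) == content_id(os.path.join(d, "copy.txt"))
different = content_id(os.path.join(d, "orig.txt")) != content_id(os.path.join(d, "other.txt"))
print(same, different)
```

Since the id depends only on the bytes, copies of a file share an id wherever they live, which is what makes finding duplicates or a moved original possible.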

But what about the case for mutable files?

In Tahoe-LAFS, references to mutable files (capabilities) are not content based, they merely securely identify slots from which to read or write. So I think the right analogy here would be to hash the pathname. This will not help with finding something that moved, of course.
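Hashing the pathname instead of the contents might look something like this. The name `path_id` and the example paths are hypothetical, and normalising with `os.path.abspath` is just one possible choice.

```python
import hashlib
import os

def path_id(path):
    """Identify a mutable file by hashing its absolute path, not its contents."""
    absolute = os.path.abspath(path)
    return hashlib.blake2b(absolute.encode("utf-8")).hexdigest()

# The id is the same however the contents change (the file need not even exist)...
stable = path_id("notes/todo.txt") == path_id("notes/todo.txt")
# ...but it changes as soon as the file is moved or renamed.
moved = path_id("notes/todo.txt") != path_id("archive/todo.txt")
print(stable, moved)
```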

I'm not sure I want to have to choose mutable or immutable every time I tag a file. So, I'll think about this part of the design more.

The links solution appeals to me most. I'm going to use hard links instead of symlinks, because they seem strictly better.
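One difference between the two kinds of link shows up when the original file is moved. A small sketch, run in a temporary directory with made-up file names:

```python
import os
import tempfile

d = tempfile.mkdtemp()
original = os.path.join(d, "original.txt")
with open(original, "w") as f:
    f.write("tagged content")

hard = os.path.join(d, "hard")
soft = os.path.join(d, "soft")
os.link(original, hard)      # hard link: a second name for the same inode
os.symlink(original, soft)   # symlink: a pointer to the original's path

# Simulate accidentally moving the original.
os.rename(original, os.path.join(d, "moved.txt"))

hard_ok = os.path.exists(hard)   # the hard link still reaches the data
soft_ok = os.path.exists(soft)   # the symlink now dangles
print(hard_ok, soft_ok)
```

One caveat worth keeping in mind for the design: hard links cannot cross filesystem boundaries, so the tag directory would have to live on the same filesystem as the files it tags.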

Wrote a bare unit test for creating the tagsystem set up, and made it pass (by making a module to correspond to the import).

Tahoe-LAFS symlinks design

Got instructions on where to collaborate on design docs (mailing list). Joined mailing list.

Reflections

I've been thinking about the overlapping batches. Obviously I can't directly compare what it is like to be introduced to Hacker School in different ways, but I was suddenly reminded of the way that kindergarten is stagger-started in Boulder. At Mesa Elementary, they divided each class into two halves, each half came alone for a few days, and then they merged. At Boulder Community Montessori, the class is a mix of two cohorts of preschool, and kindergarten. They start the kindergarteners first; those children have the most mentoring role. Then they bring in the youngest, new preschoolers, who can bond directly with the Ks, and finally the returning preschoolers join. It seems to work well.

I'm looking forward to the new people coming, 5 weeks from now. I hope I will have bonded more with the firsts by then, though, which means I need to start pairing a lot more.