Tuesday, July 28, 2009

Tokyo Cabinet and Me



There's been lots of talk about Tokyo Cabinet (TC) lately, so I had to get my hands on this shiny new toy. The appeal is that key-value stores like TC offer better performance than traditional RDBMSs and are also dramatically easier to set up. They're not right for every project, but I had a project come up that seemed like a good fit. The project was a background task that interacted with the Twitter API and had to keep track of a large number (millions) of key-value pairs. Fast existence tests for keys and basic get/set/iterate methods were all that was needed. Essentially, what I wanted was a persistent Ruby Hash.

I looked at Rufus Tokyo and the Ruby bindings written by Mikio Hirabayashi. I ended up using Hirabayashi's Ruby bindings because they were a little closer to the metal, and the less code that can break, the better. While there is no gem, installing from source was easy on both CentOS and Leopard. If you're on Leopard, don't use the MacPorts versions; they're broken. I'm not using Tokyo Tyrant: only one process needs to access this data, so I'm using Tokyo Cabinet directly.

The only things the Ruby bindings didn't give me were a simple open method and a way to serialize values so I could put any Ruby data structure into the cabinet.
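
To make the serialization half concrete, here is a minimal sketch of the YAML round trip using only Ruby's standard library. The record contents are made up for illustration; the point is just that a nested structure becomes a String (which a string-only store can hold) and comes back equal to the original.

```ruby
require 'yaml'

# Hypothetical record; the keys and values are illustrative only.
value = {
  "screen_name"  => "example_user",
  "follower_ids" => [101, 202, 303],
  "checked"      => true
}

serialized = YAML.dump(value)      # plain String, safe to hand to a string-only store
restored   = YAML.load(serialized) # back to a Hash with the same contents
```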

Hirabayashi's Ruby bindings already provide Ruby Hash-like access, including hsh['foo'] = "bar" and hsh.each, so it didn't make sense to create my own interface class when all I really needed was to extend a tiny subset of TokyoCabinet::HDB's existing methods. What I came up with is this:

require 'yaml'
require 'tokyocabinet'
include TokyoCabinet

# Duck punch of the TokyoCabinet hash database that uses YAML to serialize Ruby objects.
# Serialization is necessary because TC expects values to be strings or integers, so it
# won't handle other data types. Also provides a consistent, simpler open method.
class TokyoCabinet::HDB

  # Open the database file for writing, creating it if it doesn't already exist.
  # There is one db file per data structure, i.e. a new hash means a new database
  # object and a new database file, so call open again for each one.
  # Returns true on success and false on failure.
  alias original_open open
  def open(path_to_db)
    opened = original_open(path_to_db, HDB::OWRITER | HDB::OCREAT)
    unless opened
      # Interpolate rather than printf so a '%' in the path can't break the format string.
      STDERR.puts("open error: #{errmsg(ecode)} (tried to open file #{path_to_db})")
    end
    opened
  end

  # Fetch a value and deserialize it; returns nil when the key is missing.
  alias original_get_brackets []
  def [](key)
    result = original_get_brackets(key)
    result ? YAML.load(result) : nil
  end

  # Serialize any Ruby object to YAML before storing it.
  alias original_set_brackets []=
  def []=(key, value)
    original_set_brackets(key, YAML.dump(value))
  end

  # Iterate over all records, yielding each key with its deserialized value.
  alias original_each each
  def each
    original_each { |k, v| yield(k, YAML.load(v)) }
  end

end
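
For readers without Tokyo Cabinet installed, the alias-and-wrap pattern above can be tried against a stand-in. The sketch below is an assumption-laden illustration, not the real binding: StringOnlyStore is a hypothetical in-memory class invented here that, like TC, refuses non-String values. Reopening it and wrapping [], []= and each with YAML gives the same "persistent Hash" feel, minus the persistence.

```ruby
require 'yaml'

# Hypothetical stand-in for TokyoCabinet::HDB: in-memory, String values only.
class StringOnlyStore
  def initialize
    @data = {}
  end

  def [](key)
    @data[key]
  end

  def []=(key, value)
    raise TypeError, "value must be a String" unless value.is_a?(String)
    @data[key] = value
  end

  def each
    @data.each { |k, v| yield(k, v) }
  end
end

# Reopen the class and wrap the accessors, following the same pattern as the patch.
class StringOnlyStore
  alias original_get_brackets []
  def [](key)
    result = original_get_brackets(key)
    result ? YAML.load(result) : nil
  end

  alias original_set_brackets []=
  def []=(key, value)
    original_set_brackets(key, YAML.dump(value))
  end

  alias original_each each
  def each
    original_each { |k, v| yield(k, YAML.load(v)) }
  end
end

store = StringOnlyStore.new
store["alice"] = { "followers" => 42, "tags" => ["ruby", "tokyo"] }
store["bob"]   = [1, 2, 3]

# Collect what each yields so the deserialized values can be inspected.
seen = {}
store.each { |key, value| seen[key] = value }
```

With the actual patch loaded, the equivalent would presumably be db = HDB.new followed by db.open with a path of your choosing, then the same [], []= and each calls, and db.close when finished.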

Dear MySQL, I think we should see other people.

3 comments:

jmettraux said...

When I started on the rufus-tokyo gem, the bindings by Hirabayashi-san were not hash-like at all. So somehow, rufus-tokyo has had a good influence.

One other thing you overlook is that rufus-tokyo, being primarily an FFI binding, is usable on JRuby as well as on MRI/KRI Ruby.

Rufus-tokyo tyrant bindings are faster than the pure ruby tyrant binding provided by Hirabayashi-san as well.

As you have implied, rufus-tokyo includes a Rufus::Edo namespace where the C extension ruby bindings by Hirabayashi-san are "wrapped".

That's it for a more complete context.

+1 for the "less code". You're walking the last mile, you walk it the way you want.

Best regards, have fun with the Tokyo products!

Rafael Rosa Fu said...

I haven't had any opportunity to play with these new key-value DBs, but they seem really, really nice. Did you take a look at Wycats' Moneta [1]? It gives you a hash-like interface to a lot of different backends, which might be useful in the long run.

[1] http://github.com/wycats/moneta/tree/master

Cheers

Logan Henriquez said...

@Rafael: I briefly looked at several APIs into TC, including Moneta. In Moneta's case, it didn't support iteration, which was essential to my project. I like the backend-portability aspect; hopefully they'll offer a more complete interface in the future. In practice I've found backend portability to be very complex: it's not just about the API, but about migrating data. This makes any actual migration fairly unlikely, so I tend to look for simplicity first.