CouchDB and ORMs

Alex gave a good introductory talk on CouchDB at Scotland on Rails. Towards the end of the talk he did an overview of the current Ruby plugins/gems available for interfacing with CouchDB, one of which was my own CouchFoo. Alex's opinion was that any ORM for CouchDB should be as thin as possible, just wrapping the Ruby to JSON object translation. I raised my opinion in the question section at the end by saying that I didn't agree and thought the ORM should match the level of functionality available in ActiveRecord. This sparked a debate, both in the talk and via Twitter, about the best approach for a CouchDB ORM to take. As a result I agreed to write this blog post to outline my views.

CouchDB is a document orientated database with a HTTP interface amongst other features. When I first started using it I played with the database a lot via simple interactions through curl. In the same way I feel it is important to know SQL before using any higher level API to store and retrieve objects in a relational database, I feel it is important to understand how CouchDB works before using a library to interact with it. As with most areas of computing you will find a range of opinions over what level you interact with the database at - there are the purists who like to write SQL by hand for each database query performed and those who are willing to sacrifice a bit of performance (maybe not having the optimum query run each time) for the time efficiencies realized whilst developing. I align quite well with the Rails mantra on this one - I'm willing to sacrifice perfect SQL each time for the efficiency gains made whilst developing. Part of Alex's argument was that you should be as close to the database as possible because the Ruby to JSON conversion is much smaller than the Ruby to SQL conversion. Whilst I don't disagree that it's important to know how CouchDB works, I do disagree on the level at which any Ruby library should sit. I'm happy to pay a small price in terms of extra Ruby code executed because I want as clean a DSL as possible.

Whilst developing with CouchDB I tried all the existing Ruby libraries and as I worked through them I ran into several issues. After using ActiveRecord's save and find methods it was particularly annoying to use a library that used different method names for the same conceptual operations (eg get instead of find). This wasn't a major issue of course; I just forked the library and made changes. But as time went on there were features that I missed from ActiveRecord. Validations, callbacks, finders and associations were the prime contenders. Then dynamic finders and named scopes got added to the list. In the end changing the existing libraries became so much work I decided to start with ActiveRecord and work from there.

Of the features in ActiveRecord, associations are perhaps the most controversial when it comes to whether they should apply to document orientated databases or not. The argument goes that if you're trying to use associations you don't understand how CouchDB should be used. I disagree on this point - a simple counter argument is presented by having a document that allows comments. Those comments could be stored inline in the document itself or in separate documents that have a reference to their parent. This is an association whichever way you look at it. Which approach you decide to use will depend on your application and its characteristics. Incidentally Alex's gem did a great job of this, letting the user specify in the association whether they wanted the object stored inline or not. This has since been removed from his gem but is something that's definitely on the TODO list for CouchFoo.
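
To make the two storage options concrete, here is a rough sketch of what the documents could look like in each case. The field names (type, post_id and so on) are my own illustration rather than anything CouchDB or a particular gem requires, and the comments_for method is just a stand-in for a view keyed on the parent id:

```ruby
require 'json'

# Option 1: comments embedded in the parent document (one read fetches everything).
INLINE_POST = {
  '_id'      => 'post-1',
  'type'     => 'post',
  'title'    => 'CouchDB and ORMs',
  'comments' => [
    { 'author' => 'alice', 'body' => 'Nice write-up' },
    { 'author' => 'bob',   'body' => 'I disagree!' }
  ]
}

# Option 2: each comment is its own document pointing back at its parent.
SEPARATE_DOCS = [
  { '_id' => 'post-1', 'type' => 'post', 'title' => 'CouchDB and ORMs' },
  { '_id' => 'comment-1', 'type' => 'comment', 'post_id' => 'post-1',
    'author' => 'alice', 'body' => 'Nice write-up' },
  { '_id' => 'comment-2', 'type' => 'comment', 'post_id' => 'post-1',
    'author' => 'bob', 'body' => 'I disagree!' }
]

# Hypothetical stand-in for a view lookup keyed on post_id.
def comments_for(post_id, docs = SEPARATE_DOCS)
  docs.select { |doc| doc['type'] == 'comment' && doc['post_id'] == post_id }
end

puts JSON.pretty_generate(INLINE_POST)
puts comments_for('post-1').map { |c| c['author'] }.inspect
```

Either way the application ends up asking "which comments belong to this post?", which is why I'd call both of them associations.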

For me CouchDB lends itself well to two distinct domains. Firstly domains where documents are used - that is an object where the fields that are stored to the database change depending on the object. Secondly domains where you wish to take advantage of some of CouchDB's features not present (or poorly implemented) in relational databases - a HTTP interface, fantastic scaling ability due to bi-directional replication, and schema free nature (see this excellent article on FriendFeed's experience with MySQL) are just a few that spring to mind. People may use CouchDB for the second set of criteria even though their database design could be considered quite structured, and I fully expect this group of people to grow as CouchDB reaches 1.0. However that wasn't why I wrote CouchFoo; my project fell into the first domain. Whilst I provided a way to use ActiveRecord's higher level API I also provided access to a database object that allows simple storage and retrieval of documents by id. If that is all the functionality you require then I would expect CouchREST to be a better choice. However I believe in reality you will quickly find you need to add validations to a field, or maybe add an association or two. And as soon as you start on that slope I believe CouchFoo to be a better choice.

Ultimately I created CouchFoo as I missed the richness of the ActiveRecord API. Whilst I don't believe my library will be perfect for everyone it has received a lot of good feedback. To paraphrase DHH I didn't create the perfect framework for everyone else, I created it for me. I only hope that other people find it useful.

Using objects in models (with CouchFoo)

ActiveRecord allows you to serialize objects into text columns through YAML. This seems useful but in my experience is under-used. One of the primary reasons for this is it's not possible to use the data that the object encapsulates without the ruby model. For example it's not possible to find on the contents of that object or for that matter, modify the object with languages that lack YAML support. With CouchDB all data is stored in JSON so this is not an issue.

The project I wrote CouchFoo for used complex ACLs and I wanted to encapsulate this all in an object rather than use several many-to-many relationships and construct an ACL object based on their contents. So how do you do this with CouchFoo? Simple, any object can be assigned as a property in a CouchFoo model as long as it has a .to_json instance method and a .from_json class method. The methods do what you'd expect, for example:

class DataObjectAttributeList

  attr_accessor :attributes

  # Constructs the object from JSON
  def self.from_json(json)
    DataObjectAttributeList.new(json)
  end

  # Converts the object to JSON
  def to_json
    @attributes
  end

  def initialize(initials = {}, *args)
    @attributes = initials
  end
end

This is just a simple example storing a hash but the structure could be as complex as you'd like. In the future I plan to add inline associations to CouchFoo, so rather than have a one-to-many association where the many are accessed via a second database query you could have the objects stored as part of the parent contents. Performance wise, this is normally much more efficient (although not in all situations - eg heavy write and low read).
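
As a slightly richer sketch of the same to_json / from_json contract, here is roughly what an ACL-style object could look like. The AccessList class and its fields are hypothetical (this is not the class from my project), and I'm assuming from_json may be handed either a JSON string or an already parsed hash:

```ruby
require 'json'

# Hypothetical value object; not part of CouchFoo itself.
class AccessList
  attr_reader :readers, :writers

  # Rebuilds the object from a JSON string or an already parsed hash.
  def self.from_json(json)
    data = json.is_a?(String) ? JSON.parse(json) : json
    new(data['readers'], data['writers'])
  end

  def initialize(readers = [], writers = [])
    @readers = readers
    @writers = writers
  end

  # Serializes the object's state as a JSON string.
  def to_json(*args)
    { 'readers' => @readers, 'writers' => @writers }.to_json(*args)
  end

  def can_write?(user)
    @writers.include?(user)
  end
end

acl      = AccessList.new(%w[alice bob], %w[alice])
restored = AccessList.from_json(acl.to_json)

puts restored.can_write?('alice') # prints true
puts restored.can_write?('bob')   # prints false
```

Because the stored form is plain JSON, the readers and writers lists stay visible to views and to any other language that talks to the database.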

Overall, this becomes a very addictive way of developing and in the same way you start to question whether you need a relational database, you start to question whether you should store associated objects inline or separately.

CouchFoo: ActiveRecord styled API for CouchDB

CouchDB is an excellent database, designed especially for distributed applications. To quote the official site:

Apache CouchDB is a distributed, fault-tolerant and schema-free document-oriented database accessible via a RESTful HTTP/JSON API. Among other features, it provides robust, incremental replication with bi-directional conflict detection and resolution, and is queryable and indexable using a table-oriented view engine with JavaScript acting as the default view definition language.

Along with the knowledge that it's written in Erlang, you know it's going to be a winner in the future.

For one of my current freelance projects I needed to store data in a document fashion - ie unstructured. This made CouchDB an ideal candidate. There were several Ruby gems available: CouchPotato, CouchREST, ActiveCouch and RelaxDB. Each offered its own benefits and its own challenges. After hacking with each I couldn’t find a library I was happy with. So I started with ActiveRecord and modified it to work with CouchDB. And so CouchFoo was born.

In the end I ended up with a gem that mirrors ActiveRecord in all but a few minor places. In particular:

  • CouchDB is schema free so property definitions for the document are defined in the model (like DataMapper)
  • :select, :joins, :having, :group, :from and :lock are not available on find or associations as they don’t apply (locking is handled as conflict resolution at insertion time)
  • :conditions can only accept a hash and not an array or SQL. For example :conditions => {:user_name => "Georgio_1999"}
  • :offset is less efficient in CouchDB - there’s more on this in the rdoc
  • :order is applied after results are retrieved from the database. Therefore :order cannot be used with :limit without a new option :use_key. This is explained fully in the quick start guide and CouchFoo#find documentation
  • :include isn’t implemented yet but the finders and associations still accept the option so you won’t need to make any code changes
  • By default results are ordered by document key. The key uses a UUID scheme so these don’t auto-increment and are likely to come out in a different order to insertion. default_sort can be used on a model to sort by create date by default and overcome this
  • validates_uniqueness_of has had the :case_sensitive option removed
  • Because there’s no SQL there’s no SQL finder methods
  • Timezones, aggregations and fixtures are not yet implemented
  • The price of index updating is paid when next accessing the index rather than the point of insertion. This can be more efficient or less depending on your application. It may make sense to use an external process to do the updating for you - see CouchFoo#find for more on this
  • On that note, occasional compacting of CouchDB is required to recover space from old versions of documents and keep performance high. This can be kicked off in several ways (see quick start guide)

The RDoc for the gem contains more details on each of these differences, new features that I added, a quick start guide and additional areas of responsibility to think about when using CouchDB (in particular performance).

As a quick overview, basic operations are the same as ActiveRecord:

class Address < CouchFoo::Base
  property :number, Integer
  property :street, String
  property :postcode # Any generic type is fine as long as .to_json can be called on it
end

address1 = Address.create(:number => 3, :street => "My Street", :postcode => "secret") # Create address
address2 = Address.create(:number => 27, :street => "Another Street", :postcode => "secret")
Address.all # = [address1, address2] or maybe [address2, address1] depending on key generation
Address.first # = address1 or address2 depending on keys so probably isn't as expected
Address.find_by_street("My Street") # = address1

As key generation is through a UUID scheme, the order can't be predicted. However you can order the results by default:

class Address < CouchFoo::Base
  property :number, Integer
  property :street, String
  property :postcode # Any generic type is fine as long as .to_json can be called on it
  property :created_at, DateTime

  default_sort :created_at
end

Address.all # = [address1, address2]
Address.first # = address1 or address2, sorting is applied after results
Address.first(:use_key => :created_at) # = address1 but at the price of creating a new index

Note that there's an optimisation that will order results by created_at if there are no conditions so in the above case, the default_sort wasn't required. However when using with conditions it will be required so it makes sense to use at all times.
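
To see why sorting after retrieval and :limit don't mix, here is a small plain Ruby sketch that doesn't touch CouchDB at all. The Doc struct, the keys and the two helper methods are made up for illustration; the point is only the order of the two operations:

```ruby
# A stand-in for stored documents: a UUID-like key plus a creation order.
Doc = Struct.new(:key, :created_at)

# Documents as the database would return them: ordered by key,
# which has nothing to do with when they were created.
DOCS_IN_KEY_ORDER = [
  Doc.new('0a3f', 3),
  Doc.new('7c21', 1),
  Doc.new('e9b4', 2)
]

# What happens when the limit is applied first and the sort afterwards.
def limit_then_sort(docs, limit)
  docs.first(limit).sort_by(&:created_at)
end

# What you actually wanted: the oldest documents overall.
def sort_then_limit(docs, limit)
  docs.sort_by(&:created_at).first(limit)
end

puts limit_then_sort(DOCS_IN_KEY_ORDER, 1).first.key # prints 0a3f (created last!)
puts sort_then_limit(DOCS_IN_KEY_ORDER, 1).first.key # prints 7c21 (created first)
```

Getting the second behaviour without fetching every document needs an index ordered by the sort field, which is the job :use_key does in the example above.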

Conditions work slightly differently:

Address.find(:all, :conditions => {:street => "My Street"}) # = [address1], creates index on :street
Address.find(:all, :conditions => {:created_at => "sometime"}) # Uses same index as :use_key => :created_at
Address.find(:all, :use_key => :street, :startkey => 'p') # All streets from p in alphabet, reuses the index created 2 lines up

As well as providing support for people coming from relational databases, CouchFoo attempts to provide a library for those wanting to use CouchDB as a document-orientated database:

class Document < CouchFoo::Base
  property :number, Integer
  property :street, String

  view :number_ordered, "function(doc) {emit([doc.number , doc.street], doc); }", nil, :descending => true
end

Document.number_ordered(:limit => 75) # Will get the last 75 documents in the database ordered by number, street attributes

Associations work as expected but you must remember to add the properties required for an association (we’ll make this automatic soon):

class House < CouchFoo::Base
  has_many :windows
end

class Window < CouchFoo::Base
  property :house_id, String
  belongs_to :house
end

There are a few bits left to tidy up (as noted in the readme) but generally speaking it's now ready for use by others. Grab it on github and feel free to fork and send me pull requests.

And now to do something I've not been doing a lot of lately, spend some more time on the Couch...

Rails with Datamapper

With the recent announcement that Rails and Merb will merge, and my preference for DataMapper, I decided to plug DataMapper into Rails for my next freelance project. The theory goes this should make the upgrade path to Rails 3 a lot simpler!

It's currently possible to use DataMapper with Rails, heck even DHH himself commented so, but it's not quite as easy as using ActiveRecord. After a quick Google I only ran into questions about how to do it, no how-to guide. So I set out to make my own - it really was quite simple in the end:

sudo gem install addressable data_objects do_mysql # do_mysql can be changed for do_postgres or do_sqlite3 as appropriate
sudo gem install dm-core dm-more

In the dm-more github repository there's a folder called rails_datamapper which is a plugin for rails to add datamapper support. This doesn't install with the dm-more gem so it's a case of cloning the git repository and copying the folder to your rails project:

git clone git://github.com/sam/dm-more.git
cp -R dm-more/rails_datamapper /path/to/your/rails_project/vendor/plugins # substitute the path to your own rails project

Then edit your project environment.rb file and add the following lines:

# Load the required gems in the correct order 
config.gem "addressable", :lib => "addressable/uri" 
config.gem "data_objects" 
config.gem "do_mysql" 
config.gem "dm-core"  

# Make datamapper load first as some plugins have dependencies on it
config.plugins = [ :rails_datamapper, :all ]

# Remove ActiveRecord if you no longer need it
config.frameworks -= [ :active_record ]

The connection to the database will be made by the rails_datamapper plugin using your database.yml configuration file. You'll need to use a slightly different format for datamapper:

development: 
:repositories: 
:adapter: mysql 
:database: opnli_dev

Or alternately you can specify your own initializer and forgo the rails plugin:

hash = YAML.load(File.new(RAILS_ROOT + "/config/database.yml"))
DataMapper.setup(:default, hash[RAILS_ENV])

The only real gotcha in using datamapper is some rails plugins assume you're using ActiveRecord. Hopefully this won't be the case in the future, but for now you'll need to get forking!

Git

For those readers of my blog who don't live in the rails world I highly recommend checking out Git, a distributed version control system. It has been big in the rails world since early this year for several good reasons:

  • It has distributed and offline functionality
  • Making and merging branches is a breeze - encouraging you to try experiments in branches
  • It uses much less space than alternatives, such as Subversion, and only has one .git folder at the base of your project
  • It's in active development with constant releases of new features (but stable enough to be used for the linux kernel)

The terminology is slightly different from subversion and friends but once you've got used to it you never look back!

Merb was very quick to jump on the git bandwagon and rails followed not much later. Practically this made distributed development a hell of a lot easier, but it also had some nice knock on effects. Patching is now a lot quicker too - you simply fork the project, make a fix and inform the admin who can then choose to merge back into the master (if they see fit). It's made the process for fixing bugs a hell of a lot quicker.

Soon after git came along the fantastic github.com followed, making it easy to host remote repositories. And so to the reason for me writing this post - github just launched GitHub Pages, where you can upload your own page to front your repositories. It's a neat idea and naturally is all managed through a git repository. You simply create your site in a repository, push to github and the deployment is automatic. Although it's only simple HTML pages, it's a great proof of concept of other things that could be possible. My effort can be found here which, following the git ethos, I just forked from somebody else.

New plugins

I've just pushed two plugins to github. The first is an improvement on the standard Defensio plugin that only checks the validity of your API key when posting articles or comments. This is better than checking each time a model that uses the plugin is instantiated as it doesn't require contact with the Defensio API (so is faster) and also won't bring your site to a standstill if someone is just viewing a page and the Defensio service is down.

The second updates the highly useful timed fragment cache plugin by Richard Livsey to support Rails 2.1.

Rails 1.1 vs Rails 1.2 vs Rails 2.0

I'm really into rails performance (and now merb but that wasn't around when this post was written) so decided to look at a comparison between the standard rails versions - 1.1, 1.2 and 2.0. In these tests the excellent railsbench was used to drive the tests - I favor this over the standard rails benchmarking tools as it measures the raw performance of Rails request processing, ignoring the time spent passing the request from the web server to the Rails application. Additionally it is consistent across rails versions, whereas I'm aware there have been changes in the benchmarking area in the rails 2.0 release.

The hardware

The tests were run on an old machine of mine dedicated to performance testing. It's an AMD Athlon 1.4GHz, 256 MB RAM, Ubuntu 6.06 LTS server edition with latest security patches applied, Ruby 1.8.6 and Gems 1.0.1 (installed from source) and a dedicated crossover ethernet cable to a much more powerful desktop machine. The aim of these tests is not to find reasonable performance values for a modern server but rather to find the most efficient rails version to use.

The tests

The tests consist of several pages designed to stress different areas of rails. These are:

Page | Action | Area
front1 | Renders current time without a session | Designed to hit as little of rails as possible
front2 | Renders current time with a session | Designed to show the difference in using a session
users | List of users (100 in test database) | Designed to test the database driver
showuser | Shows the user details (also shows the user’s posts) | Designed to behave like a typical test page with a user who has_many posts
edituser | The edit user page (posts aren’t included) | Tests page rendering and form helpers
updateuser | Updates the user details via a POST | Accessing/Updating 1 database row

In all tests (unless otherwise noted) 5000 requests were performed 3 times and the average request time taken.

In the first tests I performed I noticed that creating actions in a web service style (responding to different MIME types) caused a significant performance hit. For example, in rails 1.1 the code:

def index   
  @users = User.find(:all) 
end

was changed in rails 1.2 to:

def index   
  @users = User.find(:all)    
  respond_to do |format|     
    format.html   
  end 
end

As a result two versions of the test application were created, one with respond_to blocks and one without.

Test name | Rails setup
performance11 | Rails 1.1
performance12 | Rails 1.2 without respond_to
performance12 | Rails 1.2 with respond_to
performance20 | Rails 2.0 without respond_to
performance20 | Rails 2.0 with respond_to

There were some changes in the edit view in each of the versions above. The first change was introduced in rails 1.2 due to a new way of using routes as a result of RESTful applications. So rather than:

<%= form_tag :action => 'update', :id => @user %>

in rails 1.2 we use:

<%= form_tag(user_url, :method => :put) %>

Testing showed this made no noticeable difference to performance. The second change was introduced in rails 2.0 as a result of a new way of constructing forms. So rather than:

<%= form_tag(user_url, :method => :put) %>
    <%= text_field(:user, :notes) %>
    <%= submit_tag %>
<%= end_form_tag %>

in rails 2.0 we use:

<% form_for(@user, :url => user_url, :method => :put) do |f| %>
    <%= f.text_field(:notes) %>
    <%= submit_tag %>
<% end %>

This is only really a style change from the rails 1.2 approach and testing showed it made little difference to performance.

The results

Showing requests per second:

Test | front1 | front2 | users | showuser | edituser | updateuser | all
rails11 | 390.49 | 179.45 | 111.62 | 122.61 | 180.11 | 107.24 | 149.53
rails12 | 306.51 | 192.66 | 127.29 | 129.27 | 160.43 | 106.72 | 151.33
rails12ws | 258.02 | 172.83 | 117.16 | 118.31 | 144.60 | 101.01 | 138.04
rails20 | 292.12 | 186.00 | 93.63 | 119.84 | 142.77 | 101.93 | 134.41
rails20ws | 261.40 | 173.68 | 89.94 | 113.96 | 134.22 | 97.03 | 127.03

[Graph of the results above: requests per second for each page and rails setup]

So what can we see from the results?

  • Rails 1.1 seems to excel when rendering a simple page with no session, but then you would sort of expect this as the codebase will be smallest. Still the margin was quite significant.
  • Rails 1.1 and rails 1.2 share the honors for fastest at processing the page depending on the test (indeed their averages are very close).
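  • The web service versions of the rails 1.2 and rails 2.0 applications are slower on average: the versions without respond_to handled 9.6% and 5.8% more requests per second respectively
  • If you're jumping from Rails 1.1 to Rails 2.0 with web services (as we are going to at idlasso.com at some point) then you could see a real performance decrease: Rails 1.1 handled 17.7% more requests per second on average, which works out as a drop of about 15% in throughput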
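
The percentages in the list above can be reproduced from the "all" column of the results table. The sketch below is just arithmetic on those published numbers; the method names are my own. Note the figures depend on which setup you treat as the baseline: the 9.6%, 5.8% and 17.7% values are relative to the slower setup, while the drop measured from the faster setup is a little smaller.

```ruby
# Average requests per second across all pages, taken from the results table.
ALL_PAGES = {
  'rails11'   => 149.53,
  'rails12'   => 151.33,
  'rails12ws' => 138.04,
  'rails20'   => 134.41,
  'rails20ws' => 127.03
}

# How many percent more requests per second the faster setup handles.
def percent_more(faster, slower)
  ((faster / slower - 1) * 100).round(1)
end

# How many percent of throughput is lost going from one setup to another.
def percent_drop(before, after)
  ((1 - after / before) * 100).round(1)
end

puts percent_more(ALL_PAGES['rails12'], ALL_PAGES['rails12ws']) # prints 9.6
puts percent_more(ALL_PAGES['rails20'], ALL_PAGES['rails20ws']) # prints 5.8
puts percent_more(ALL_PAGES['rails11'], ALL_PAGES['rails20ws']) # prints 17.7

puts percent_drop(ALL_PAGES['rails12'], ALL_PAGES['rails12ws']) # prints 8.8
puts percent_drop(ALL_PAGES['rails20'], ALL_PAGES['rails20ws']) # prints 5.5
puts percent_drop(ALL_PAGES['rails11'], ALL_PAGES['rails20ws']) # prints 15.0
```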

Conclusion

So does this mean rails 2.0 is the slowest release yet? Well no. In order to keep the comparison fair the session store in the rails 2.0 tests was changed to pstore, which is the default used by rails in versions 1.1 and 1.2. However in rails 2.0 the default was changed to use a cookie stored in the user's browser. A quick test using Apache Bench shows that rails 1.2 averaged 48req/s whereas rails 2.0 averaged 60req/s on the front2 page. However this was just that - a quick test. Only 1000 requests were sent, with no averaging or warmup. I'll explore sessions more fully in a later article but for now we can safely say: use cookie based sessions if using rails 2.0.

So from this series of performance tests it seems it's sensible to recommend:

  • Avoid the use of the respond_to statement unless your action actually responds in more than one way
  • Use client side sessions if using rails 2.0 or later.