Seed Data in Rails

Friday, November 16

Someone asked me about this the other day, so I thought I’d write about it. In some of my applications, I need to “seed” the database with data. This might be a list of categories, sections, or other defaults.

There are a couple of ways you can do this. One way is to use migrations. You create records in your migration via ActiveRecord as you normally would, and when you run your migrations, the data is inserted. This works OK, except it obscures the location of the data. By the time you have a lot of migrations, you’re unlikely to remember that 003_create_categories.rb is also the place where you’re adding your default categories.

I like to think of migrations as being transient. As your schema grows and your project evolves, the chances of your migrations running cleanly from top to bottom diminish. When bootstrapping a database, it’s a much better idea to load the entire schema via db:schema:load than to run through each transformation with migrations.

So, if we’re not using migrations for seed data, where do we keep it? I like to use YAML fixtures for this. You could use the test fixtures from test/fixtures, but this is an inappropriate location. If you were a new developer coming on to a project, why would you think to look in the test directory for seed data? Test fixtures are for your tests.
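As a rough illustration of what such a seed fixture might contain, here is a sketch built around a hypothetical categories.yml. The labels (news, tutorials) and the columns (id, name) are assumptions made up for this example, not taken from a real project. Each top-level key labels one row, and the nested keys are that row's column values. Parsing the text with Ruby's standard YAML library shows the structure:

```ruby
require 'yaml'

# Hypothetical contents of a seed fixture file such as categories.yml.
# The labels and column names are invented for illustration.
categories_yml = <<~YAML
  news:
    id: 1
    name: News
  tutorials:
    id: 2
    name: Tutorials
YAML

# Each top-level label maps to a hash of column values for one row.
rows = YAML.load(categories_yml)

rows.each do |label, attributes|
  puts "#{label}: #{attributes.inspect}"
end
```

Because the file is plain YAML, the defaults stay readable in an editor and produce meaningful diffs in source control.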

For seed data, I create a fixtures directory inside the existing db/ directory: db/fixtures. Then I use the following Rake task, called db:seed, to load them:

namespace :db do
  desc "Load seed fixtures (from db/fixtures) into the current environment's database."
  task :seed => :environment do
    require 'active_record/fixtures'
    fixtures_dir = File.join(RAILS_ROOT, 'db', 'fixtures')
    # Each fixture file is named after the table it populates.
    Dir.glob(File.join(fixtures_dir, '*.yml')).each do |file|
      Fixtures.create_fixtures(fixtures_dir, File.basename(file, '.*'))
    end
  end
end
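To see what the task hands to Fixtures.create_fixtures, the following sketch reproduces only the file-name handling, with a temporary directory standing in for db/fixtures. The helper name seed_fixture_names and the file names are hypothetical, and no Rails or database code is involved. The sort is added here only to make the output predictable; the task above does not sort.

```ruby
require 'tmpdir'
require 'fileutils'

# Returns the fixture names (file names without their extension) for
# every .yml file in the given directory. In the task above, each of
# these names is passed to Fixtures.create_fixtures.
def seed_fixture_names(dir)
  Dir.glob(File.join(dir, '*.yml')).sort.map do |file|
    File.basename(file, '.*')
  end
end

names = Dir.mktmpdir do |dir|
  # Hypothetical seed files; README.txt is skipped because it is not .yml.
  FileUtils.touch(File.join(dir, 'categories.yml'))
  FileUtils.touch(File.join(dir, 'sections.yml'))
  FileUtils.touch(File.join(dir, 'README.txt'))
  seed_fixture_names(dir)
end

puts names.inspect  # prints ["categories", "sections"]
```

By Rails convention a fixture file is named after its table, so in this example the task would load the categories and sections tables.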

So, I might have something like db/fixtures/categories.yml. When I’m bootstrapping the project on a new machine (say, when deploying), I’d just do the following:

$ rake db:create:all
$ rake db:schema:load
$ rake db:seed

How are other folks out there dealing with seed data?

Comments

  1. Jamal, November 17, 2007 @ 02:51 PM

    Thanks, this is just what I needed :)

  2. Chris, November 27, 2007 @ 03:17 PM

    Ditto…that’s slick.

  3. Ryan, November 27, 2007 @ 03:19 PM

    Well, before reading this post, I was doing the migration thing. Only, I wouldn’t include the data in the 003_create_categories migration, but add a new one 004_fill_default_categories (for the exact reason you pointed out – I couldn’t remember which migration had the data).

    However, I think I’ll be doing it this way from now on… it’s clever and much cleaner. Thanks!

  4. Russ, November 27, 2007 @ 03:25 PM

    This is quite close to what we’ve been doing with one of our applications. The only real difference between your suggestion and our implementation is naming. We put ours into db/required_bootstrap and named the task db:insert_required_bootstrap. It’s been a lot easier to manage this way (several months now) than having core required data intermingled with test fixtures.

  5. Erik, November 27, 2007 @ 03:29 PM

    I have a “Scenario” that will create admin accounts and other defaults by going through and issuing a bunch of User.create() [etc] commands. I can load them up on demand via a rake task.

    I find it much more flexible than YAML fixtures, especially when it comes to handling associations and when “seeding” databases that might not be empty.

  6. Chris, November 27, 2007 @ 04:28 PM

    Thanks for this … very slick indeed.

    Also, congrats on the new job!

  7. Justin, November 28, 2007 @ 12:03 AM

    I do basically the same thing but use CSV instead of YAML. Unfortunately, Syck dumps Unicode as unreadable base64 strings. While the data is still usable (i.e. you can load it back from YAML without trouble), it kind of defeats the purpose of using a human-readable format: no manual edits to the data in your favorite editor, no looking at meaningful diffs in source control, etc.

  8. Koen Van der Auwera, November 28, 2007 @ 07:07 AM

    That’s pretty much how we are handling things. Must be good then ;)

    As Chris already said, congratulations on the new job!

  9. Josh Knowles, November 28, 2007 @ 01:20 PM

    I’ve done something similar in the past, though I used regular Ruby files as opposed to fixtures so that I could get validations, etc.

    I’ve thrown the code up as a plugin on Google Code if anyone wants to take a look: http://code.google.com/p/db-populate/

  10. Tonkatsufan, November 28, 2007 @ 03:54 PM

    When dealing with huge datasets (around 250 MB of ASCII data), I had no luck with anything Ruby-based and had to resort to LOAD DATA INFILE in MySQL (or the equivalent dump-loading syntax for your database). Even FasterCSV wasn’t fast enough to do the job. We’re talking about maybe 10-20 seconds for the LOAD DATA INFILE method compared to 20-30 minutes with FasterCSV and the CPU at 100%.

  11. Trevor Turk, November 28, 2007 @ 10:56 PM

    It’s fugly, but I like to seed my applications from within the application itself, so an additional rake task isn’t necessary. See the get_settings action here: http://eldorado.googlecode.com/svn/trunk/app/controllers/application.rb

  12. Anand, November 29, 2007 @ 12:05 AM

    Thanks for the tip. Congrats on the job.

  13. matt, November 29, 2007 @ 06:34 PM

    thank you very much for posting this. extremely helpful.

  14. Robin Ward, December 03, 2007 @ 01:01 PM

    I used to use the YAML fixtures method you mention (I think Pete Forde sent me an earlier version of this code you’re using). It worked great early on in development, and even surprisingly well across different databases. On my laptop I was running SQLite and on my workstation I was using Postgres and the seed data migrated perfectly!

    However, as Tonkatsufan mentioned above, I quickly ran into performance constraints. If the seed data of your site gets large, the fixtures simply won’t cut it. I actually found that I couldn’t import the fixtures using Fixtures.create_fixtures as it would throw errors that Ruby was eating up my Linux box’s stack!

    I came up with the same solution Tonkatsufan did, and now I use data dumps from the database itself (although I prefer mysqldump and just piping the file into the mysql command line). It’s much faster, and the SQL text files it generates can be checked in nicely to Subversion and integrate well with Capistrano.

  15. Mik, December 04, 2007 @ 03:00 PM

    I have been looking for this for a long time. Thanks very much.

  16. Justin, December 06, 2007 @ 06:04 PM

    I’ve been using the migration way of doing this for a while, but after coming to the same conclusion about schema.rb, I’ve been looking for a new way. This is perfect. Thanks!

  17. Falk, October 31, 2008 @ 04:17 PM

    I use migrations like Ryan describes. Your approach looks nice, but what happens when you run “rake db:seed” more than once? Do we get duplicates then?

  18. quotes, December 01, 2008 @ 11:28 PM

    I also would rather use migrations.

  19. pol, December 15, 2008 @ 05:32 PM

    We use a modified version of ar_fixtures to do this. The big problems with it, though, are that 1) it obeys validation code (bad if you want to load data into the database out of order) and 2) it doesn’t handle habtm join tables.

    So, I’m going to try your method. It looks cleaner, and presumably the Fixtures.create_fixtures method will create habtm join table rows?

  20. Christopher Meiklejohn, January 12, 2009 @ 12:45 AM

    Very slick!

  21. Chris Loftus, July 10, 2009 @ 05:41 AM

    Oops, screwed up the output in previous comment. See

    http://users.aber.ac.uk/cwl/ruby/db_seed.rake

    for the updated version

    Chris

  22. John McLeod, August 12, 2009 @ 02:49 PM

    Excellent! I’ve been with Rails for about 3 weeks and needed something like this. I tweaked it for .csv files. After a bit of file cleaning, the task worked. Thank you very much.

    John