One of my big projects over the last year has been a Parrot Babysteps tutorial. One of the more interesting tasks in that tutorial was reading a CSV file in Parrot. I used the HYG Star Catalog as a sample CSV file that was large enough to present some interesting data. This was fun in Parrot, but obviously I thought quite a bit about how I would tackle the problem in a higher level language such as Ruby. Today seems like a good day to find out.
I am emphasizing the Moderately in this Moderately Interesting Ruby Exercise. After exploring the csv library for Ruby, we’ll use Sequel to build a database that can be quickly queried. Even though I have an unhealthy love for making projects larger and more complex than they need to be, I want to keep this short and sweet.
What I’m Using
My primary machine for these projects is the happy home iMac. It is running OS X 10.6 plus MacPorts. My default Ruby is 1.9.2, installed via rvm.
I may revisit this exercise with other Ruby installations on other platforms to double-check that things work, but your results should be similar to mine as long as you are using Ruby 1.9.2.
Exploration
We will start by poking at the Ruby standard csv library a little bit, just to see how we use it. I already have a copy of the HYG Star Catalog from my previous efforts, but for this exercise I’ll pretend I do not. We’ll just download it using our favorite downloading technique. Mine is GNU wget.
If we open hygxyz.csv in our favorite Editor, we will see that the file is large and bewildering.
There are many fields. Some of them are strings, others are numbers. Quite a few are empty.
Parsing the CSV
Let’s start with the simplest and dumbest CSV parsing code we can manage.
How does that look?
Okay, wow. That is a lot of stuff going by. I don’t know about you, but I’m going to hit Control-C and make an adjustment to the code.
There. Now we will only look at the first three entries. That should be a little easier to digest. I also shuffled the filename into its own variable. That’s just how I like to do things. I tell myself that it will be easier to read and edit later.
The default behavior for csv is reasonable. It split up the fields correctly, and set the empty fields to nil. Next we need to deal with the fact that the first row is supposed to be the header, providing names for fields in the corresponding columns.
One small change has a big impact.
Now csv is generating something that looks vaguely like a hash. Nice. However, every field is handled as a String when some of them are obviously numbers. The converters option should fix that.
Setting the converters option to :numeric tells CSV to convert anything that looks like a number to a Number. That is useful for comparing values, because Ruby won’t automatically convert a String into a Number. You must tell it to convert. Anyways - I’m babbling. It is really amazing how hard it is to pad the content of these little essays out when you are talking about Ruby code. That’s probably why there are so many silly cartoons and insane gibberish accompanying the best Ruby tutorials.
Let’s do something with those Numbers. How about counting the number of stars within ten light years of Earth?
How many are there?
That’s a lot of neighbors. It took a while to count them, though. That probably has something to do with the 20 Megabyte CSV file. We are not ready to speed things up, though. Let’s try one more task: looking for a specific star.
It better find our own Sun. It’s the first entry, after all.
Sorry, I got sleepy. Is it done? I should have put a break in that code after printing out the star details. Of course, that is just cheating around the fact that parsing a large CSV file is slow. Perhaps it is time to try a database.
Creating a Database
I would imagine that stuffing these values into a database should make simple questions like “show me the star named ‘Sol’” or “count the stars within 10 light years” pretty straightforward. We can use a lightweight database such as SQLite. There may be nearly 120,000 stars in the catalog, but that is trivial for SQLite. I have heard anecdotal reports of it being used for tables with millions of rows.
First, I want to install sqlite3.
I’m not doing this on my Ubuntu Linux machine, but if I was I’d install both the sqlite3 shell and the development libraries.
The Sequel Library
I have already chosen Sequel as my preferred Ruby database library, so I need to install that. Oh, and I should also install sqlite3-ruby. Sequel provides a nice layer of abstraction, but it does not contain the code which actually speaks to the database.
We can use the create_table database method described on the Sequel migrations page to build the table, rather than relying on my rather lightweight knowledge of SQLite schema definition. The dump of star data from our earlier CSV parsing code provides the hints we need to build a usable schema.
This script will set up the database and fill it with values from the CSV file. Each row is converted to a Hash, which makes the database insert method happy. There is also a little check and cleanup near the beginning. This is just in case there is a typo that messes up your code later on. It doesn’t hurt to be cautious.
Go take a break. Make some coffee, catch up with your family, or play one more turn of Civilization. This is going to take a while. Me, I went for some more coffee.
Searching the Database
We will look at the Sequel querying API in a moment, but first let us make sure that the database returns plausible results to direct queries.
Hopefully you noticed a big improvement in the speed of your searches by switching to a database. I sure did.
What if we tried the same queries with Ruby and Sequel? Let’s open an irb prompt and test it out.
Yes indeed. That was much faster. Let’s close with something a little bit fancier: showing a table of information about all the stars in the catalog that are G Spectrum and have a proper name, ordered by their distance from Earth.
I encourage you to explore the Sequel querying API more on your own, but I need to wrap this up.
Conclusion
All right. You’ve got 119,617 stars with various characteristics, all sitting there waiting for you to think of something interesting to do with them. I just wanted to see how much easier it would be to parse a CSV in a high level language. Turns out, it’s pretty easy. Explore the Ruby standard library and the many Rubygems that are available out in the big world. You’ll probably have fun, and you’ll almost definitely learn something.