One of my big projects over the last year has been a Parrot Babysteps tutorial. One of the more interesting tasks in that tutorial was reading a CSV file in Parrot. I used the HYG Star Catalog as a sample CSV file that was large enough to present some interesting data. This was fun in Parrot, but obviously I thought quite a bit about how I would tackle the problem in a higher level language such as Ruby. Today seems like a good day to find out.
I am emphasizing the Moderately in this Moderately Interesting Ruby Exercise. After exploring the csv
library for Ruby, we’ll use Sequel to build a database that can be quickly queried. Even though I have an unhealthy love for making projects larger and more complex than they need to be, I want to keep this short and sweet.
What I’m Using
My primary machine for these projects is the happy home iMac. It is running OS X 10.6 plus MacPorts. My default Ruby is 1.9.2, installed via rvm.
I may revisit this exercise with other Ruby installations on other platforms to double-check that things work, but your results should be similar to mine as long as you are using Ruby 1.9.2.
Exploration
We will start by poking at the Ruby standard csv
library a little bit, just to see how we use it. I already have a copy of the HYG Star Catalog from my previous efforts, but for this exercise I’ll pretend I do not. We’ll just download it using our favorite downloading technique. Mine is GNU wget.
$ wget http://www.astronexus.com/files/downloads/hygxyz.csv.gz
$ tar xfvz hygxyz.csv.gz
If we open hygxyz.csv
in our favorite Editor, we will see that the file is large and bewildering.
StarID,HIP,HD,HR,Gliese,BayerFlamsteed,ProperName,RA,Dec,Distance,PMRA,PMDec,RV,Mag,AbsMag,Spectrum,ColorIndex,X,Y,Z,VX,VY,VZ
0,,,,,,Sol,0,0,0.000004848,0,0,0,-26.73,4.85,G2V,0.656,0,0,0,0,0,0
1,1,224700,,,,,6.079e-05,01.08901332,282.485875706215,-5.20,-1.88,,9.10,1.84501631012894,F5,0.482,282.43485,0.00449,5.36884,4.9e-08,-7.12e-06,-2.574e-06
... and so on for 119,618 lines
There are many fields. Some of them are strings, others are numbers. Quite a few are empty.
Parsing the CSV
Let’s start with the simplest and dumbest CSV parsing code we can manage.
require 'csv'
CSV.open('hygxyz.csv').each { |row| p row }
How does that look?
$ ruby stellar
["StarID", "HIP", "HD", "HR", "Gliese", "BayerFlamsteed", "ProperName", "RA", "Dec", "Distance", "PMRA",
"PMDec", "RV", "Mag", "AbsMag", "Spectrum", "ColorIndex", "X", "Y", "Z", "VX", "VY", "VZ"]
["0", nil, nil, nil, nil, nil, "Sol", "0", "0", "0.000004848", "0", "0", "0", "-26.73", "4.85", "G2V",
"0.656", "0", "0", "0", "0", "0", "0"]
["1", "1", "224700", nil, nil, nil, nil, "6.079e-05", "01.08901332", "282.485875706215", "-5.20", "-1.88",
nil, "9.10", "1.84501631012894", "F5", "0.482", "282.43485", "0.00449", "5.36884", "4.9e-08", "-7.12e-06",
"-2.574e-06"]
["2", "2", "224690", nil, nil, nil, nil, "0.00025315", "-19.49883745", "45.662100456621", "181.21",
"-0.93", nil, "9.27", "5.97222057420059", "K3V", "0.999", "43.04329", "0.00285", "-15.24144", "-7.1e-08",
"4.0112e-05", "-1.94e-07"]
["3", "3", "224699", nil, nil, nil, nil, "0.00033386", "38.85928608", "355.871886120996", "5.24", "-2.91",
nil, "6.61", "-1.1464684004746", "B9", "-0.019", "277.11358", "0.02422", "223.27753", "3.148e-06",
"9.04e-06", "-3.909e-06"]
...
Okay, wow. That is a lot of stuff going by. I don’t know about you, but I’m going to hit Control-C and make an adjustment to the code.
require 'csv'
filename = 'hygxyz.csv'
CSV.open(filename).first(3).each { |row| p row }
There. Now we will only look at the first three entries. That should be a little easier to digest. I also shuffled the filename into its own variable. That’s just how I like to do things. I tell myself that it will be easier to read and edit later.
$ ruby stellar
["StarID", "HIP", "HD", "HR", "Gliese", "BayerFlamsteed", "ProperName", "RA", "Dec", "Distance", "PMRA",
"PMDec", "RV", "Mag", "AbsMag", "Spectrum", "ColorIndex", "X", "Y", "Z", "VX", "VY", "VZ"]
["0", nil, nil, nil, nil, nil, "Sol", "0", "0", "0.000004848", "0", "0", "0", "-26.73", "4.85", "G2V",
"0.656", "0", "0", "0", "0", "0", "0"]
["1", "1", "224700", nil, nil, nil, nil, "6.079e-05", "01.08901332", "282.485875706215", "-5.20", "-1.88",
nil, "9.10", "1.84501631012894", "F5", "0.482", "282.43485", "0.00449", "5.36884", "4.9e-08", "-7.12e-06",
"-2.574e-06"]
The default behavior for csv
is reasonable. It split up the fields correctly, and set the empty fields to nil
. Next we need to deal with the fact that the first row is supposed to be the header, providing names for fields in the corresponding columns.
require 'csv'
filename = 'hygxyz.csv'
CSV.open(filename, headers: true).first(3).each { |row| p row }
One small change has a big impact.
$ ruby stellar.rb
#<CSV::Row "StarID":"0" "HIP":nil "HD":nil "HR":nil "Gliese":nil "BayerFlamsteed":nil "ProperName":"Sol"
"RA":"0" "Dec":"0" "Distance":"0.000004848" "PMRA":"0" "PMDec":"0" "RV":"0" "Mag":"-26.73" "AbsMag":"4.85"
"Spectrum":"G2V" "ColorIndex":"0.656" "X":"0" "Y":"0" "Z":"0" "VX":"0" "VY":"0" "VZ":"0">
#<CSV::Row "StarID":"1" "HIP":"1" "HD":"224700" "HR":nil "Gliese":nil "BayerFlamsteed":nil "ProperName":nil
"RA":"6.079e-05" "Dec":"01.08901332" "Distance":"282.485875706215" "PMRA":"-5.20" "PMDec":"-1.88" "RV":nil
"Mag":"9.10" "AbsMag":"1.84501631012894" "Spectrum":"F5" "ColorIndex":"0.482" "X":"282.43485" "Y":"0.00449"
"Z":"5.36884" "VX":"4.9e-08" "VY":"-7.12e-06" "VZ":"-2.574e-06">
#<CSV::Row "StarID":"2" "HIP":"2" "HD":"224690" "HR":nil "Gliese":nil "BayerFlamsteed":nil "ProperName":nil
"RA":"0.00025315" "Dec":"-19.49883745" "Distance":"45.662100456621" "PMRA":"181.21" "PMDec":"-0.93" "RV":nil
"Mag":"9.27" "AbsMag":"5.97222057420059" "Spectrum":"K3V" "ColorIndex":"0.999" "X":"43.04329" "Y":"0.00285"
"Z":"-15.24144" "VX":"-7.1e-08" "VY":"4.0112e-05" "VZ":"-1.94e-07">
Now csv
is generating something that looks vaguely like a hash. Nice. However, every field is handled as a String when some of them are obviously numbers. The converters
option should fix that.
CSV.open(filename, headers: true, converters: :numeric).first(5).each do |row|
p row
end
Setting the converters
option to :numeric
tells CSV to convert anything that looks like a number to a Number. That is useful for comparing values, because Ruby won’t automatically convert a String into a Number. You must tell it to convert. Anyways - I’m babbling. It is really amazing how hard it is to pad the content of these little essays out when you are talking about Ruby code. That’s probably why there are so many silly cartoons and insane gibberish accompanying the best Ruby tutorials.
$ ruby stellar.rb
#<CSV::Row "StarID":0 "HIP":nil "HD":nil "HR":nil "Gliese":nil "BayerFlamsteed":nil "ProperName":"Sol" "RA":0
"Dec":0 "Distance":4.848e-06 "PMRA":0 "PMDec":0 "RV":0 "Mag":-26.73 "AbsMag":4.85 "Spectrum":"G2V"
"ColorIndex":0.656 "X":0 "Y":0 "Z":0 "VX":0 "VY":0 "VZ":0>
#<CSV::Row "StarID":1 "HIP":1 "HD":224700 "HR":nil "Gliese":nil "BayerFlamsteed":nil "ProperName":nil
"RA":6.079e-05 "Dec":1.08901332 "Distance":282.485875706215 "PMRA":-5.2 "PMDec":-1.88 "RV":nil "Mag":9.1
"AbsMag":1.84501631012894 "Spectrum":"F5" "ColorIndex":0.482 "X":282.43485 "Y":0.00449 "Z":5.36884
"VX":4.9e-08 "VY":-7.12e-06 "VZ":-2.574e-06>
#<CSV::Row "StarID":2 "HIP":2 "HD":224690 "HR":nil "Gliese":nil "BayerFlamsteed":nil "ProperName":nil
"RA":0.00025315 "Dec":-19.49883745 "Distance":45.662100456621 "PMRA":181.21 "PMDec":-0.93 "RV":nil "Mag":9.27
"AbsMag":5.97222057420059 "Spectrum":"K3V" "ColorIndex":0.999 "X":43.04329 "Y":0.00285 "Z":-15.24144
"VX":-7.1e-08 "VY":4.0112e-05 "VZ":-1.94e-07>
Let’s do something with those Numbers. How about counting the number of stars within ten light years of Earth?
# count-neighbors.rb
require 'csv'
filename = 'hygxyz.csv'
neighbor_count = 0
CSV.open(filename, headers: true, converters: :numeric).each do |row|
if row['Distance'] < 10
neighbor_count += 1
end
end
puts "There are #{neighbor_count} stars within 10 light years of Earth."
How many are there?
$ ruby count-neighbors.rb
There are 320 stars within 10 light years of Earth.
That’s a lot of neighbors. It took a while to count them, though. That probably has something to do with the 20 Megabyte CSV file. We are not ready to speed things up, though. Let’s try one more task: looking for a specific star.
# find-sol.rb
require 'csv'
filename = 'hygxyz.csv'
CSV.open(filename, headers: true, converters: :numeric).each do |row|
if row['ProperName'] == "Sol"
p row
end
end
It better find our own Sun. It’s the first entry, after all.
$ ruby find-sol.rb
#<CSV::Row "StarID":0 "HIP":nil "HD":nil "HR":nil "Gliese":nil "BayerFlamsteed":nil "ProperName":"Sol" "RA":0
"Dec":0 "Distance":4.848e-06 "PMRA":0 "PMDec":0 "RV":0 "Mag":-26.73 "AbsMag":4.85 "Spectrum":"G2V"
"ColorIndex":0.656 "X":0 "Y":0 "Z":0 "VX":0 "VY":0 "VZ":0>
Sorry, I got sleepy. Is it done? I should have put a break
in that code after printing out the star details. Of course, that is just cheating around the fact that parsing a large CSV file is slow. Perhaps it is time to try a database.
Creating a Database
I would imagine that stuffing these values into a database should make simple questions like “show me the star named ‘Sol’” or “count the stars within 10 light years” pretty straightforward. We can use a lightweight database such as SQLite. There may be nearly 120,000 stars in the catalog, but that is trivial for SQLite. I have heard anecdotal reports of it being used for tables with millions of rows.
First, I want to install sqlite3
.
$ port install sqlite3
I’m not doing this on my Ubuntu Linux machine, but if I was I’d install both the sqlite3
shell and the development libraries.
$ sudo apt-get install sqlite3 libsqlite3-dev
The Sequel Library
I have already chosen Sequel as my preferred Ruby database library, so I need to install that. Oh, and I should also install sqlite3-ruby. Sequel provides a nice layer of abstraction, but it does not contain the code which actually speaks to the database.
$ gem install sequel
$ gem install sqlite3-ruby
We can use the create_table
database method described on the Sequel migrations page to build the table, rather than relying on my rather lightweight knowledge of SQLite schema definition. The dump of star data from our earlier CSV parsing code provides the hints we need to build a usable schema.
require 'csv'
require 'sequel'
filename = 'hygxyz.csv'
DB = Sequel.sqlite('hyg.db')
if DB.table_exists? :stars
DB.drop_table :stars
end
DB.create_table :stars do
primary_key :id
Integer :StarID
Integer :HIP
Integer :HD
Integer :HR
Integer :Gliese
Integer :BayerFlamsteed
String :ProperName
Float :RA
Float :Dec
Float :Distance
Float :PMRA
Float :PMDec
Float :RV
Float :Mag
Float :AbsMag
String :Spectrum
Float :ColorIndex
Float :X
Float :Y
Float :Z
Float :VX
Float :VY
Float :VZ
end
CSV.open(filename, headers: true, converters: :numeric).each do |row|
print "."
DB[:stars].insert(row.to_hash)
end
puts
This script will set up the database and fill it with values from the CSV file. Each row is converted to a Hash, which makes the database insert
method happy. There is also a little check and cleanup near the beginning. This is just in case there is a typo that messes up your code later on. It doesn’t hurt to be cautious.
$ ruby stellar.rb
.......
Go take a break. Make some coffee, catch up with your family, or play one more turn of Civilization. This is going to take a while. Me, I went for some more coffee.
Searching the Database
We will look at the Sequel querying API in a moment, but first let us make sure that the database returns plausible results to direct queries.
$ sqlite3 hyg.db
sqlite> select count(*) from stars where distance < 10;
320
sqlite> select * from stars where propername = 'Sol';
1|0||||||Sol|0.0|0.0|4.848e-06|0.0|0.0|0.0|-26.73|4.85|G2V|0.656|0.0|0.0|0.0|0.0|0.0|0.0
sqlite> .quit
Hopefully you noticed a big improvement in the speed of your searches by switching to a database. I sure did.
What if we tried the same queries with Ruby and Sequel? Let’s open an irb
prompt and test it out.
$ irb
ruby-1.9.2-p0 > require 'sequel'
=> true
ruby-1.9.2-p0 > DB = Sequel.sqlite('hyg.db')
=> #<Sequel::SQLite::Database: "sqlite:/hyg.db">
ruby-1.9.2-p0 > DB[:stars].first(ProperName: 'Sol')
=> {:id=>1, :StarID=>0, :HIP=>nil, :HD=>nil, :HR=>nil, :Gliese=>nil, :BayerFlamsteed=>nil,
:ProperName=>"Sol", :RA=>0.0, :Dec=>0.0, :Distance=>4.848e-06, :PMRA=>0.0, :PMDec=>0.0, :RV=>0.0,
:Mag=>-26.73, :AbsMag=>4.85, :Spectrum=>"G2V", :ColorIndex=>0.656, :X=>0.0, :Y=>0.0, :Z=>0.0, :VX=>0.0,
:VY=>0.0, :VZ=>0.0}
ruby-1.9.2-p0 > DB[:stars].filter { distance < 10 }.count
=> 320
Yes indeed. That was much faster. Let’s close with something a little bit fancier: showing a table of information about all the stars in the catalog that are G Spectrum and have a proper name, ordered by their distance from Earth.
ruby-1.9.2-p0 > DB[:stars].filter(:Spectrum.like('G%')).filter('ProperName not null').order(:Distance).each { |row|
ruby-1.9.2-p0 > printf("%20s\t%4.2f\t%s\n", row[:ProperName], row[:Distance], row[:Spectrum])
ruby-1.9.2-p0 ?> }
Sol 0.00 G2V
Rigel Kentaurus A 1.35 G2V
82 G. Eri 6.06 G8V
Groombridge 1830 9.16 G8Vp
Vindemiatrix 31.35 G8IIIvar
Nihal 48.80 G5II
=> #<Sequel::SQLite::Dataset: "SELECT * FROM `stars` WHERE ((Spectrum like 'G%') AND (ProperName not null))">
I encourage you to explore the Sequel querying API more on your own, but I need to wrap this up.
Conclusion
All right. You’ve got 119,617 stars with various characteristics, all sitting there waiting for you to think of something interesting to do with them. I just wanted to see how much easier it would be to parse a CSV in a high level language. Turns out, it’s pretty easy. Explore the Ruby standard library and the many Rubygems that are available out in the big world. You’ll probably have fun, and you’ll almost definitely learn something.
Backlinks
Added to vault 2024-01-15. Updated on 2024-07-10