This one’s a bit more bloggy than the earlier steps, but that’s just the mood I was in when writing it. You can ignore the commentary and focus on the code if that’s your preference.
We have inched our way forward in our understanding of Parrot and PIR. I think that it’s time to take a big step, though. We’re going to add file handling to our toolkit. Reading and writing files are easy tasks in Parrot - so easy that I could probably discuss both in a couple of paragraphs and be more or less done. But I’m hungry for something meatier. I want to work with a lot of data and get curious trivia from that data. Hashes are good, too. Let’s look at Parrot hashes at some point today as well.
First, Get the Data
It took me some time to decide exactly what sort of data I wanted to look at. I was thinking of nutritional data, but I’m not ready for all of the cross-referencing I’d have to do in order to produce information that would be meaningful to me.
Then it hit me. I love astronomy. Wait a moment. That’s not completely true. I like astronomy. It teaches us a lot about our place in the universe, and exactly how freaking small we really are. What I love is random trivia about space: the name of the closest star to our solar system, how many of our neighboring stars are sort of like ours, stuff like that. I want to write a program that will help me get those juicy tidbits.
The next challenge was finding a data source that would be useful for me. There are plenty of star catalogs available. The problem is that I like astronomy - I don’t love it. Much of modern astronomy is incomprehensible to me unless it has a pretty picture of a penny next to a football field illustrating interplanetary distances or some other thing I can pretend to understand. Oh, and remember that I barely know Parrot. I need something simple and easy to parse, but big enough to have interesting data.
After nearly 15 minutes of dedicated research - once you subtract the hours spent admiring the Astronomy Picture of the Day archives - I came across David Nash’s Astronomy Nexus. This is a great resource for amateur astronomers, space trivia buffs, and people who enjoy geeky pictures like the view of Earth from Gliese 581. It also has a nice, easily parsed file listing almost 120,000 stars. At roughly 20 Megabytes uncompressed, that’s big enough to be interesting.
The file is compressed in gzip format.
Uncompressing it on Linux or OS X is easy:
$ gunzip hygxyz.csv.gz
You’re going to have to install an archive utility on Windows, though. I suggest 7-Zip.
Put the resulting CSV (Comma-Separated Values) file in your project directory after uncompressing. Now we have a file full of comma-separated values which look something like this:
StarID,HIP,HD,HR,Gliese,BayerFlamsteed,ProperName,RA,Dec,Distance,PMRA,PMDec,RV,Mag,AbsMag,Spectrum,ColorIndex,X,Y,Z,VX,VY,VZ
0,,,,,,Sol,0,0,0.000004848,0,0,0,-26.73,4.85,G2V,0.656,0,0,0,0,0,0
1,1,224700,,,,,6.079e-05,01.08901332,282.485875706215,-5.20,-1.88,,9.10,1.84501631012894,F5,0.482,282.43485,0.00449,5.36884,4.9e-08,-7.12e-06,-2.574e-06
My goodness, there are a lot of commas and numbers in there. The structure is sensible, though. We have a header line that tells us what each field represents, followed by many lines of data.
Let’s start small, by counting the number of stars listed in the HYG database.
To count stars, we can read each line of the file and count the number of lines read. Remember not to count the header line!
We use Parrot’s I/O library to handle opening and reading files.
open will actually open the file for us.
data_file = open filename, 'r'
The open opcode accepts two arguments: the name of the file and a mode indicator. We are reading the file, so we specify mode 'r'.
What will we use to read a line from the file? How about the readline opcode?
current_line = readline data_file
A file that reached EOF (End Of File) and has nothing left to read looks false to Parrot. That means we can use the filehandle to test if we should keep reading.
unless data_file goto SHOW_STAR_COUNT
Finally, it is polite to close a file when we’re done using it.
close data_file
Is it necessary to close the file, though? That’s a reasonable question. Many modern languages close files automatically when their handle goes out of scope - for example, when the program ends. The Parrot Book I/O chapter does not make it clear what Parrot’s approach is, though. I’m going to keep closing those finished files until somebody tells me otherwise.
I’ll probably continue closing finished files even after somebody tells me otherwise, truthfully. I am one of those people who likes explicit code and ties his shoelaces with a tidy little double-knot. I can’t help it - it’s in my nature.
That’s all the important information about reading files. Oh sure, there are details we’ll need to look at eventually, such as what happens when the file doesn’t exist or you don’t have permission. But for reading a file that we know exists and that we can read, open, readline, and close are the main bits.
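A minimal sketch of how open, readline, and close could fit together to count the stars. It is not the original example listing. It assumes hygxyz.csv sits in the current directory, and it takes the cautious route of testing the length of what readline returned instead of testing the filehandle, since an empty result is an unambiguous sign that nothing was left to read.

```pir
.sub 'main' :main
    .local string filename, current_line
    .local pmc data_file
    .local int star_count, line_length

    # Assumption: the uncompressed catalog is in the current directory.
    filename = 'hygxyz.csv'
    star_count = 0

    data_file = open filename, 'r'

    # Read the header line and throw it away so it is not counted.
    current_line = readline data_file

  NEXT_LINE:
    current_line = readline data_file
    # An empty result means there was nothing left to read.
    line_length = length current_line
    if line_length == 0 goto SHOW_STAR_COUNT
    star_count += 1
    goto NEXT_LINE

  SHOW_STAR_COUNT:
    close data_file
    print 'There are '
    print star_count
    say ' stars in the HYG catalog.'
.end
```

A blank line in the middle of the file would still hold its newline character, so it would be counted rather than mistaken for the end of the file.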
How many stars are in HYG?
$ parrot example-06-01.pir
There are 119618 stars in the HYG catalog.
That is a big number. Nowhere near the billions of stars in our universe, but I think we can stay busy for quite some time with nearly one hundred twenty thousand stars.
Intermission: File Mode Indicators
Now is as good a time as any to summarize the indicator codes that open accepts.
- 'r': read from the file
- 'w': write to the file
- 'a': append to the file
Indicators can be combined. 'rw' indicates that you plan to read and write to a file. 'a' should not be used alone - specify that you will be write-appending to the file with 'wa'. Order doesn’t matter, either. 'rw' and 'wr' are both valid ways to say you plan to read and write a file.
We will just be reading files today, but you might as well remember it now. It will come up eventually.
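For the curious, here is a rough sketch of the write and write-append modes in action. The file name star-notes.txt is made up for illustration, and the sketch assumes the two-argument form of print that takes a filehandle followed by a string.

```pir
.sub 'main' :main
    .local pmc notes_file
    .local string current_line

    # 'w' opens the file for writing (hypothetical file name).
    notes_file = open 'star-notes.txt', 'w'
    print notes_file, "Sol: G2V\n"
    close notes_file

    # 'wa' opens the same file for write-appending.
    notes_file = open 'star-notes.txt', 'wa'
    print notes_file, "Rigel Kentaurus A: G2V\n"
    close notes_file

    # 'r' reads back what was written, one line at a time.
    notes_file = open 'star-notes.txt', 'r'
    current_line = readline notes_file
    print current_line
    current_line = readline notes_file
    print current_line
    close notes_file
.end
```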
All right. I’m manually counting commas in HYG. It looks like "ProperName" is the seventh field. It also looks like there are quite a few stars in the catalog that have no proper name. How many?
There is one new opcode in this code: the split String opcode. It accepts a string delimiter and a target string, and returns the list of strings that result from splitting the target string with the delimiter.
star_data = split ',', current_line
star_data is a normal array, so we can access the ProperName field by the index we came to in hand-counting the fields. Array indexes start at zero, so the seventh field sits at index 6.
star_name = star_data[6]
It is very clumsy to rely on hand-counting fields, so we will come back to that in a moment. First, let’s look at what this application tells us.
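As a small self-contained sketch of split and array indexing, the snippet below splits Sol's record (copied from the sample lines earlier in this post) and pulls out the seventh field. The string literal stands in for a line read from the file.

```pir
.sub 'main' :main
    .local string current_line, star_name
    .local pmc star_data

    # Sol's record from the HYG sample, used here in place of readline.
    current_line = '0,,,,,,Sol,0,0,0.000004848,0,0,0,-26.73,4.85,G2V,0.656,0,0,0,0,0,0'

    # Break the line into fields at each comma.
    star_data = split ',', current_line

    # ProperName is the seventh field, so it lives at index 6.
    star_name = star_data[6]
    say star_name
.end
```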
$ parrot example-06-02.pir
There are 119618 stars in the HYG catalog.
87 of them have proper names.
119531 of them do not have proper names.
Only 87 of them have names? Huh. I thought there would be more than that. It’s possible that the number is wrong because I was relying on hand-counting the fields. Let’s tell Parrot to figure out the fields for us.
Quite a few changes have been made. One of the first was to define constants for some important values which I know will never change.
.const string DELIMITER = ','
.const string NAME_FIELD = 'ProperName'
DELIMITER uses more characters than ','. I prefer referring to things by name when practical. This gives two benefits in my mind.
- I know the purpose of the value. The semantics of it appeals to me: "split with DELIMITER, which is ','" rather than "split with ',', which is the delimiter".
- I only have to change one spot. If someday David Nash wants to switch to tab-delimited files, I will not have to find and replace ',' throughout my code.
As far as NAME_FIELD goes, that’s just because I prefer referring to things by name. It doesn’t really serve any other purpose.
Choose your own style, but make sure others can read it.
The next task is to find which field holds the star name. We’ll split the header line and step through each field until we either find the field we’re looking for or hit the end.
    current_line = readline data_file
    field_names = split DELIMITER, current_line
    # Assigning an array to an integer gives the number of elements.
    field_count = field_names

FIND_NAME_INDEX:
    # Give up once every field has been checked.
    if name_index >= field_count goto NAME_INDEX_ERROR
    current_field = field_names[name_index]
    if current_field == NAME_FIELD goto NEXT_STAR
    name_index += 1
    goto FIND_NAME_INDEX

NAME_INDEX_ERROR:
    say 'Went through available fields without finding name index!'
    goto END
Why did I finally throw some error-checking into this? I won’t say, but believe me when I tell you to always look for typos in your code. And if your loop doesn’t check if it’s time to quit, that loop might never quit.
Now that Parrot knows which field holds the names, we can use it in our name counting.
star_name = star_data[name_index]
Let’s run the new code.
$ parrot example-06-03.pir
There are 119618 stars in the HYG catalog.
87 of them have proper names.
119531 of them do not have proper names.
I get the same result. The hand-counting of fields I did earlier worked. That’s a relief, but I’m much happier now that Parrot is counting for me.
Understanding the Data by Looking at Sol
I want to get a lot more information from this data, but in order to do that I’ll need a nice way to understand the information about each star in the set. We’re going to go about that by focusing on Sol, our own sun.
Sol is the first star listed after the header line, so we don’t have to do anything clever to find it.
In order to display the field names and values together, we step through the header and star data arrays at the same time.
DISPLAY_NEXT_FIELD:
    if current_field_index >= field_count goto END
    current_field_name = field_names[current_field_index]
    current_field_value = star_data[current_field_index]
    print current_field_name
    print ': '
    say current_field_value
    current_field_index += 1
    goto DISPLAY_NEXT_FIELD
What does the HYG data for Sol look like?
$ parrot example-06-04.pir
StarID: 0
HIP: 
HD: 
HR: 
Gliese: 
BayerFlamsteed: 
ProperName: Sol
RA: 0
Dec: 0
Distance: 0.000004848
PMRA: 0
PMDec: 0
RV: 0
Mag: -26.73
AbsMag: 4.85
Spectrum: G2V
ColorIndex: 0.656
X: 0
Y: 0
Z: 0
VX: 0
VY: 0
VZ
: 0

$
That’s a fair amount of trivia, which makes me happy. Granted, I only understand what five of those fields actually mean - although I can guess at a few more. The data isn’t what’s jumping out at me, though. This is:
VZ
: 0

$
readline reads the full line from the file, including the special newline characters that mark the end of the line.
That newline becomes part of the string, which means it also gets printed out when we display the header and final field for our data.
I knew I’d have to deal with this eventually.
Perl has the builtin chomp function which is perfect for exactly this situation. Parrot doesn’t have chomp as a builtin, but it is available via the standard "String/Utils" library. There’s no need to download anything extra, because "String/Utils" ships with Parrot.
Since "String/Utils" is a library, we need to load it. Parrot compiles its library PIR files into Parrot Compiled Byte Code (PBC). PBC has been processed enough that the Parrot interpreter can load and execute its code a little faster. The load_bytecode core opcode tells Parrot that we are going to load a bytecode file and we need its capabilities to be added to the system.
The chomp functionality is still just beyond our reach, though. We need to make room for it in our own program by reserving a PMC.
.local pmc chomp
Now we can reach over into the "String/Utils" namespace and grab chomp for our own use.
chomp = get_global ['String';'Utils'], 'chomp'
get_global is a variable opcode that allows us to get a PMC from the global namespace.
Used like this, it allows us to grab a PMC from a specific available namespace.
What makes namespaces great is the fact that they can have any number of variable names without cluttering the globally available list of names.
On the other hand, you need to take an extra step to make that name available for your own use.
That is fairly consistent with other languages that I’ve used, although maybe a little lower level than I care for.
Oh well. This is a low-level language, after all.
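Putting the loading steps together, a minimal sketch might look like the following. The file name 'String/Utils.pbc' is my assumption for the compiled form of the "String/Utils" library, and the string literal stands in for a header field that still carries its newline.

```pir
.sub 'main' :main
    .local pmc chomp
    .local string current_line

    # Assumption: the compiled library is available as String/Utils.pbc.
    load_bytecode 'String/Utils.pbc'

    # Fetch the chomp subroutine from the String;Utils namespace.
    chomp = get_global ['String';'Utils'], 'chomp'

    # A stand-in for a field that still ends in a newline.
    current_line = "VZ\n"
    current_line = chomp(current_line)

    # With the newline gone, the name and value share one line.
    print current_line
    say ': 0'
.end
```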
Now that we’ve got that loading business out of the way, we can actually use chomp. chomp is a subroutine, and not an opcode. You’ll need to use parentheses when you use it.
current_line = chomp(current_line)
chomp returns a copy of current_line with that annoying newline removed. We want to reuse that copy immediately, so we just assign the result right back to current_line.
Remember to use it again when reading the data line for Sol.
current_line = chomp(current_line)
How does the data look now?
$ parrot example-06-05.pir
StarID: 0
HIP: 
HD: 
HR: 
Gliese: 
BayerFlamsteed: 
ProperName: Sol
RA: 0
Dec: 0
Distance: 0.000004848
PMRA: 0
PMDec: 0
RV: 0
Mag: -26.73
AbsMag: 4.85
Spectrum: G2V
ColorIndex: 0.656
X: 0
Y: 0
Z: 0
VX: 0
VY: 0
VZ: 0
Now it would be nice to ask for specific data for our star in a meaningful way. For example, I want to just see the name and spectrum information. We could dig through the fields the way we have been. I would prefer it if we could just ask for them by name.
One way to do that is with a Hash. This is a collection structure similar to an array. The difference is that you get data from the hash using string keys instead of looking things up by index. Python programmers know it as a "dictionary".
We didn’t have to go through so many contortions to add a hash, thank goodness.
Hashes are built-in, so we just have to allocate a PMC and call new.
.local pmc star
# ...
star = new 'Hash'
Instead of reading and printing the fields, we assign them to the hash.
ASSIGN_NEXT_FIELD:
    if current_field_index >= field_count goto DISPLAY_STAR_DETAILS
    current_field_name = field_names[current_field_index]
    current_field_value = star_data[current_field_index]
    star[current_field_name] = current_field_value
    current_field_index += 1
    goto ASSIGN_NEXT_FIELD
On the display side of things, I did get a little lazy and use register variables. There’s nothing wrong with that, but it’s not consistent with my normal style. We can fix that in the next round.
DISPLAY_STAR_DETAILS:
    $S0 = star['ProperName']
    $S1 = star['Spectrum']
    $S2 = star['Distance']
    print "<Name: "
    print $S0
    print ", Spectrum: "
    print $S1
    print ", Distance: "
    print $S2
    say ">"
Hash indexes look a lot like array indexes. The keys can get complicated, but let’s stick with simple strings.
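A small sketch of simple string keys at work, using values from Sol's record. The exists check is an addition of mine that the examples in this post do not use. It reports whether a key has been stored at all, which is handy when a field might be missing.

```pir
.sub 'main' :main
    .local pmc star
    .local string spectrum
    .local int has_gliese

    star = new 'Hash'

    # Store two of Sol's fields under their header names.
    star['ProperName'] = 'Sol'
    star['Spectrum'] = 'G2V'

    # Look a value up by key, just like an array index.
    spectrum = star['Spectrum']
    say spectrum

    # No 'Gliese' key was stored, so exists reports 0.
    has_gliese = exists star['Gliese']
    print 'Gliese stored? '
    say has_gliese
.end
```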
What does our output look like now?
$ parrot example-06-06.pir
<Name: Sol, Spectrum: G2V, Distance: 0.000004848>
I’m tempted to print out all the data this way, but there are well over a hundred thousand stars. Printing takes a while. Reading takes a lot of whiles. How about just printing the information for stars with a matching spectrum?
Stars Like Ours
Now that we have a Hash to describe characteristics of our own Sun, we can build Hashes for other stars and look for the ones that are similar to ours. We’ll use the spectrum as our guideline, and look for an exact match rather than just a vague similarity. We’re also going to filter out the ones that don’t have a name. Many of the stars in this set don’t have proper names.
We look at each star as we go, checking to see if its spectrum exactly matches Sol’s. I know that we’re missing a couple of entries designated as "G1/G2V", but I am not going to worry about it today.
EXAMINE_STAR:
    star_spectrum = star['Spectrum']
    if star_spectrum == sol_spectrum goto REMEMBER_MATCH
    goto LOAD_NEXT_STAR
We’re remembering stars with the same spectrum, but will only be displaying those with proper names. We’ll just count the others.
REMEMBER_MATCH:
    matching_count += 1
    star_name = star['ProperName']
    if star_name goto DISPLAY_STAR_DETAILS
    unnamed_match_count += 1
    goto LOAD_NEXT_STAR
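To watch the matching and counting logic run without the full catalog, here is a cut-down sketch. The rows are illustrative "ProperName,Spectrum" pairs that I made up for the purpose, not real HYG lines, and the ';' separator between rows is just a convenience for packing them into one string.

```pir
.sub 'main' :main
    .local pmc rows, fields
    .local string sol_spectrum, star_name, star_spectrum, current_row
    .local int row_count, row_index, matching_count, unnamed_match_count

    sol_spectrum = 'G2V'

    # Illustrative "ProperName,Spectrum" rows, not real HYG lines.
    rows = split ';', 'Rigel Kentaurus A,G2V;,F5;,G2V'
    row_count = rows
    row_index = 0
    matching_count = 0
    unnamed_match_count = 0

  LOAD_NEXT_STAR:
    if row_index >= row_count goto SHOW_TOTALS
    current_row = rows[row_index]
    row_index += 1
    fields = split ',', current_row
    star_name = fields[0]
    star_spectrum = fields[1]
    if star_spectrum == sol_spectrum goto REMEMBER_MATCH
    goto LOAD_NEXT_STAR

  REMEMBER_MATCH:
    matching_count += 1
    # An empty name is false, so unnamed stars fall through to the count.
    if star_name goto DISPLAY_STAR_NAME
    unnamed_match_count += 1
    goto LOAD_NEXT_STAR

  DISPLAY_STAR_NAME:
    say star_name
    goto LOAD_NEXT_STAR

  SHOW_TOTALS:
    print matching_count
    say ' stars matched the spectrum'
    print unnamed_match_count
    say ' have no proper name'
.end
```

The shape mirrors the labelled regions above: one label loads a star, one remembers a match, and one displays the named ones.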
You may have noticed that I reassign some variables with the same value they probably already had. This may not be efficient, but it’s for my own sanity. I want to be certain about the values held in those variables. I am also pretending these little labelled regions are like distinct blocks of code. It’s a lie, but a useful one.
DISPLAY_STAR_DETAILS:
    star_name = star['ProperName']
    star_spectrum = star['Spectrum']
    star_distance = star['Distance']
    print "<Name: "
    print star_name
    print ", Spectrum: "
    print star_spectrum
    print ", Distance: "
    print star_distance
    say ">"
    goto LOAD_NEXT_STAR
On the other hand, this program does take a couple of seconds to run on my machine now.
$ parrot example-06-07.pir
<Name: Sol, Spectrum: G2V, Distance: 0.000004848>
<Name: Rigel Kentaurus A, Spectrum: G2V, Distance: 1.34749097181049>
568 stars exactly matched Sol's spectrum G2V
567 have no proper name
Those are disappointing results. It looks like we have many neighbors that look like our Sun, but only one with a name. I would love to use one of the alternate references if available, such as the Gliese or Bayer-Flamsteed designations. I don’t think that’s practical with how we’re writing our Parrot application today.
Wow. There has been a lot of new stuff today. Not only did we learn how to read files and use Hashes, we also saw how to load bytecode libraries. We counted, searched through, and displayed data from a 20 Megabyte text file with nearly 120,000 entries. We also learned that Rigel Kentaurus A is the only named neighbor in the database that is the same spectral type as our Sun.
I think we’re reaching the limits of what I want to do with goto as my primary tool for guiding program flow. PIR code is getting harder to write and edit.
The next step really should be creating subroutines to abstract some of the more complicated or tedious processes.