This one’s a bit more bloggy than the earlier steps, but that’s just the mood I was in when writing it. You can ignore the commentary and focus on the code if that’s your preference.
Introduction
We have inched our way forward in our understanding of Parrot and PIR. I think that it’s time to take a big step, though. We’re going to add file handling to our toolkit. Reading and writing files are easy tasks in Parrot - so easy that I could probably discuss both in a couple of paragraphs and be more or less done. But I’m hungry for something meatier. I want to work with a lot of data and get curious trivia from that data. Hashes are good, too. Let’s look at Parrot hashes at some point today as well.
First, Get the Data
It took me some time to decide exactly what sort of data I wanted to look at. I was thinking of nutritional data, but I’m not ready for all of the cross-referencing I’d have to do in order to produce information that would be meaningful to me.
Then it hit me. I love astronomy. Wait a moment. That’s not completely true. I like astronomy. It teaches us a lot about our place in the universe, and exactly how freaking small we really are. What I love is random trivia about space: the name of the closest star to our solar system, how many of our neighboring stars are sort of like ours, stuff like that. I want to write a program that will help me get those juicy tidbits.
The next challenge was finding a data source that would be useful for me. There are plenty of star catalogs available. The problem is that I like astronomy - I don’t love it. Much of modern astronomy is incomprehensible to me unless it has a pretty picture of a penny next to a football field illustrating interplanetary distances or some other thing I can pretend to understand. Oh, and remember that I barely know Parrot. I need something simple and easy to parse, but big enough to have interesting data.
After nearly 15 minutes of dedicated research - once you subtract the hours spent admiring the Astronomy Picture of the Day archives - I came across David Nash’s Astronomy Nexus. This is a great resource for amateur astronomers, space trivia buffs, and people who enjoy geeky pictures like the view of Earth from Gliese 581. It also has a nice, easily parsed file listing almost 120,000 stars. At roughly 20 Megabytes uncompressed, that’s big enough to be interesting.
Enough jabbering. Let’s start downloading. The latest version of the catalog is available from the HYG Database page. I grabbed version 2.0, which is currently the most recent.
The file is compressed in `gz` format. Uncompressing it on Linux or OS X is easy with the standard `gunzip` utility.
You’re going to have to install an archive utility on Windows, though. I suggest 7-Zip.
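If you would rather not install anything, a scripting language can do the uncompressing on any platform. Here is a rough sketch in Python (not Parrot), offered only as a cross-platform alternative. The function name `decompress_gzip` and both file names in the example call are placeholders of mine, not the real download name.

```python
import gzip
import shutil


def decompress_gzip(src_path, dest_path):
    """Write the decompressed contents of a .gz file to dest_path."""
    # gzip.open yields the decompressed byte stream; copyfileobj copies it
    # in chunks, so the whole catalog never has to sit in memory at once.
    with gzip.open(src_path, "rb") as compressed, open(dest_path, "wb") as plain:
        shutil.copyfileobj(compressed, plain)


# Example call (both file names are placeholders, not the real download name):
# decompress_gzip("catalog.csv.gz", "catalog.csv")
```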
Put the resulting CSV (Comma-Separated Values) file in your project directory after uncompressing. Now we have a file full of comma-separated values which look something like this:
My goodness, there are a lot of commas and numbers in there. The structure is sensible, though. We have a header line that tells us what each field represents, followed by many lines of data.
Let’s start small, by counting the number of stars listed in the HYG database.
Counting Stars
To count stars, we can read each line of the file and count the number of lines read. Remember not to count the header line!
We use Parrot’s I/O library to handle opening and reading files. `open` will actually open the file for us.
The `open` opcode accepts two arguments: the name of the file and a mode indicator. We are reading the file, so we specify mode `r`.
What will we use to read a line from the file? How about `readline`?
A filehandle that has reached EOF (End Of File) and has nothing left to read looks false to Parrot. That means we can test the filehandle itself to decide whether we should keep reading.
Finally, it is polite to close a file when we’re done using it.
Is it necessary to close the file, though? That’s a reasonable question. Many modern languages close files automatically when their handle goes out of scope — for example, when the program ends. The Parrot Book I/O chapter does not make it clear what Parrot’s approach is, though. I’m going to keep closing those finished files until somebody tells me otherwise.
I’ll probably continue closing finished files even after somebody tells me otherwise, truthfully. I am one of those people who likes explicit code and ties his shoelaces with a tidy little double-knot. I can’t help it - it’s in my nature.
That’s all the important information about reading files. Oh sure, there are details we’ll need to look at eventually, such as what happens when the file doesn’t exist or you don’t have permission. But for reading a file that we know exists and that we can read, `open`, `readline`, and `close` are the main bits.
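For readers who want to cross-check the counting logic outside Parrot, here is roughly the same open, read-a-line, close pattern in Python. This is a comparison sketch, not the PIR from this post. The function name `count_stars` is my own, and the three sample lines are an invented, shortened stand-in for the catalog rather than real HYG rows. One difference worth noting: Python signals EOF by returning an empty string from `readline`, where Parrot lets you test the filehandle itself.

```python
import io


def count_stars(handle):
    """Count data lines in an open catalog, skipping the header line."""
    handle.readline()          # read and discard the header
    count = 0
    while True:
        line = handle.readline()
        if line == "":         # Python signals EOF with an empty string
            break
        count += 1
    return count


# Made-up three-line stand-in for the catalog (not real HYG rows).
sample = io.StringIO("StarID,ProperName,Spectrum\n0,Sol,G2V\n1,,F5\n")
print(count_stars(sample))     # prints 2
sample.close()                 # close explicitly, as the text recommends
```

With a real file you would pass the result of `open(path, "r")` instead of the `StringIO` object.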
How many stars are in HYG?
That is a big number. Nowhere near the billions of stars in our universe, but I think we can stay busy for quite some time with nearly one hundred twenty thousand stars.
Intermission: File Mode Indicators
Now is as good a time as any to summarize the indicator codes that `open` accepts.
Indicator | Mode |
---|---|
`r` | read |
`w` | write |
`a` | append |
`p` | pipe |
Indicators can be combined. For example, `rw` indicates that you plan to read and write to a file. In fact, `a` should not be used alone - specify that you will be write-appending to the file with `wa`.
Order doesn’t matter, either. `rw` and `wr` are both valid ways to say you plan to read and write a file.
We will just be reading files today, but you might as well remember it now. It will come up eventually.
Counting Names
All right. I’m manually counting commas in HYG. It looks like “ProperName” is the seventh field. It also looks like there are quite a few stars in the catalog that have no proper name. How many?
There is one new opcode in this code: the `split` string opcode. It accepts a string delimiter and a target string, and returns the list of strings that result from splitting the target string with the delimiter.
`star_data` is a normal array, so we can access the ProperName field by the index we came to in hand-counting the fields.
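The split-then-index idea looks much the same in other languages. Here is a hedged Python sketch of it, for comparison only. `count_named`, `NAME_INDEX`, and the sample rows (with placeholder field names `f0` through `f7`) are all inventions for illustration. The only detail taken from the text is that the name sits in the seventh field, which is index 6 when counting from zero.

```python
import io

NAME_INDEX = 6  # seventh field, counting from zero (hand-counted, as in the text)


def count_named(handle, delimiter=","):
    """Count rows whose seventh delimited field is not empty."""
    handle.readline()                        # skip the header line
    named = 0
    for line in handle:
        star_data = line.split(delimiter)    # list of field strings
        if star_data[NAME_INDEX] != "":
            named += 1
    return named


# Invented rows with placeholder field names; only one has a name.
sample = io.StringIO(
    "f0,f1,f2,f3,f4,f5,ProperName,f7\n"
    "0,,,,,,Sol,x\n"
    "1,,,,,,,y\n"
)
print(count_named(sample))                   # prints 1
```

A plain split assumes that no field contains a quoted comma. If a file did, Python’s `csv` module would be the safer tool.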
It is very clumsy to rely on hand-counting fields, so we will come back to that in a moment. First, let’s look at what this application tells us.
Only 87 of them have names? Huh. I thought there would be more than that. It’s possible that the number is wrong because I was relying on hand-counting the fields. Let’s tell Parrot to figure out the fields for us.
Quite a few changes have been made. One of the first was to define constants for some important values which I know will never change.
Yes, `DELIMITER` uses more characters than `','`. I prefer referring to things by name when practical. This gives two benefits in my mind.
- I know the purpose of the value. The semantics of it appeals to me: “split with DELIMITER, which is `','`” rather than “split with `','` which is the delimiter”.
- I only have to change one spot. If someday David Nash wants to switch to tab delimited files, I will not have to find and replace `','` throughout my code.
As far as `NAME_FIELD`, that’s just because I prefer referring to things by name. It doesn’t really serve any other purpose. Choose your own style, but make sure others can read it.
The next task is to find which field holds the star name. We’ll split the header line and step through each field until we either find the field we’re looking for or hit the end.
Why did I finally throw some error-checking into this? I won’t say, but believe me when I tell you to always look for typos in your code. And if your loop doesn’t check if it’s time to quit, that loop might never quit.
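The header search, including the "quit when you hit the end" check, can be sketched in Python like this. Again, this is a comparison sketch and not the PIR from the post. `DELIMITER` and `NAME_FIELD` echo the constant names mentioned above, while `find_field` and the shortened header are hypothetical.

```python
DELIMITER = ","
NAME_FIELD = "ProperName"


def find_field(header_line, field_name, delimiter=DELIMITER):
    """Return the zero-based position of field_name in the header line."""
    fields = header_line.rstrip("\n").split(delimiter)
    index = 0
    while index < len(fields):               # stop at the end: no endless loop
        if fields[index] == field_name:
            return index
        index += 1
    # Falling out of the loop means the name was never found (a typo, say).
    raise ValueError(f"no field named {field_name!r} in header")


# Shortened, invented header used only for illustration.
print(find_field("StarID,ProperName,Spectrum\n", NAME_FIELD))   # prints 1
```

Raising an error when the field is missing is one design choice among several; returning a sentinel such as -1 would also work, as long as the caller checks it.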
Now that Parrot knows which field holds the names, we can use it in our name counting.
Let’s run the new code.
I get the same result. The hand-counting of fields I did earlier worked. That’s a relief, but I’m much happier now that Parrot is counting for me.
Understanding the Data by Looking at Sol
I want to get a lot more information from this data, but in order to do that I’ll need a nice way to understand the information about each star in the set. We’re going to go about that by focusing on Sol, our own sun.
Sol is the first star listed after the header line, so we don’t have to do anything clever to find it.
In order to display the field names and values together, we step through the header and star data arrays at the same time.
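Stepping through two arrays in lockstep is simple enough to sketch in a few lines of Python, for comparison. The function name `describe` and the shortened sample header and row are made up for illustration. The loop stops at the shorter of the two lists so a ragged line cannot run past the end.

```python
def describe(header_line, data_line, delimiter=","):
    """Pair each field name with the value at the same position."""
    names = header_line.split(delimiter)
    values = data_line.split(delimiter)
    lines = []
    for i in range(min(len(names), len(values))):
        lines.append(f"{names[i]}: {values[i]}")
    return lines


# Shortened, invented header and row; note there are no trailing newlines here.
for line in describe("StarID,ProperName,Spectrum", "0,Sol,G2V"):
    print(line)
```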
What does the HYG data for Sol look like?
That’s a fair amount of trivia, which makes me happy. Granted, I only understand what five of those fields actually mean - although I can guess at a few more. The data isn’t what’s jumping out at me, though. This is:
`readline` reads the full line from the file, including the special newline characters that mark the end of the line. That newline becomes part of the string, which means it also gets printed out when we display the header and final field for our data. I knew I’d have to deal with this eventually.
Perl has the builtin `chomp` function which is perfect for exactly this situation. Parrot doesn’t have `chomp` as a builtin, but it is available via the standard String/Utils library. There’s no need to download anything extra, because “String/Utils” ships with Parrot.
Since “String/Utils” is a library, we need to load it.
Parrot compiles its library PIR files into Parrot Compiled Byte Code (PBC). PBC has been processed enough that the Parrot interpreter can load and execute its code a little faster. The `load_bytecode` core opcode tells Parrot that we are going to load a bytecode file and we need its capabilities to be added to the system.
The actual `chomp` functionality is still just beyond our reach, though. We need to make room for it in our own program by reserving a PMC.
Now we can reach over into the “String/Utils” namespace and grab `chomp` for our own use.
`get_global` is a variable opcode that allows us to get a PMC from the global namespace. Used like this, it allows us to grab a PMC from a specific available namespace. What makes namespaces great is the fact that they can have any number of variable names without cluttering the globally available list of names. On the other hand, you need to take an extra step to make that name available for your own use. That is fairly consistent with other languages that I’ve used, although maybe a little lower level than I care for. Oh well. This is a low-level language, after all.
Now that we’ve got that loading business out of the way, we can actually use `chomp`. `chomp` is a subroutine, and not an opcode. You’ll need to use parentheses when you use it.
`chomp` returns a copy of `current_line` with that annoying newline removed. We want to reuse that copy immediately, so we just assign the result right back to `current_line`.
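To make the "returns a copy, assign it right back" pattern concrete, here is a small Python stand-in. It is not a port of the Parrot library routine, and I am assuming (not asserting) that stripping a single trailing line ending is all we need. The function is named `chomp` only to mirror the text.

```python
def chomp(line):
    """Return a copy of line without one trailing line ending, if present."""
    if line.endswith("\r\n"):      # Windows-style ending
        return line[:-2]
    if line.endswith("\n"):        # Unix-style ending
        return line[:-1]
    return line                    # nothing to remove


current_line = "0,Sol,G2V\n"       # invented sample line
current_line = chomp(current_line) # reuse the cleaned copy immediately
print(repr(current_line))          # prints '0,Sol,G2V'
```

Strings are not modified in place here; the original value is simply replaced by the cleaned copy, which matches the way the text describes using the result.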
Remember to use it again when reading the data line for Sol.
How does the data look now?
That’s better.
Now it would be nice to ask for specific data for our star in a meaningful way. For example, I want to just see the name and spectrum information. We could dig through the fields the way we have been, but I think it would be better if we could just ask for them by name.
One way to do that is with a Hash. This is a collection structure similar to an array. The difference is that you get data from the hash using string keys instead of looking things up by index. Python programmers know it as a “dictionary”.
We didn’t have to go through so many contortions to add a hash, thank goodness. Hashes are built-in, so we just have to allocate a PMC and call `new`.
Instead of reading and printing the fields, we assign them to the hash.
On the display side of things, I did get a little lazy and use register variables. There’s nothing wrong with that, but it’s not consistent with my normal style. We can fix that in the next round.
Hash indexes look a lot like array indexes. The keys can get complicated, but let’s stick with simple strings.
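Since Python’s dictionary was mentioned as the equivalent structure, here is a short Python sketch of building one from a header line and a data line, then asking for fields by name. It is for comparison only. `build_star` is a hypothetical helper, and the header and row are a shortened, invented stand-in for the real catalog.

```python
def build_star(header_line, data_line, delimiter=","):
    """Map each field name in the header to the matching value in the row."""
    names = header_line.rstrip("\n").split(delimiter)
    values = data_line.rstrip("\n").split(delimiter)
    star = {}
    for name, value in zip(names, values):
        star[name] = value         # string key, just like a hash index
    return star


# Shortened, invented header and row used only for illustration.
sol = build_star("StarID,ProperName,Spectrum\n", "0,Sol,G2V\n")
print(sol["ProperName"], sol["Spectrum"])    # prints: Sol G2V
```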
What does our output look like now?
I’m tempted to print out all the data this way, but there are well over a hundred thousand stars. Printing takes a while. Reading takes a lot of whiles. How about just printing the information for stars with a matching spectrum?
Stars Like Ours
Now that we have a Hash to describe characteristics of our own Sun, we can build Hashes for other stars and look for the ones that are similar to ours. We’ll use the spectrum as our guideline, and look for an exact match rather than just a vague similarity. We’re also going to filter out the ones that don’t have a name, because we know that many of the stars in this set don’t have proper names.
We look at each star as we go, checking to see if its spectrum exactly matches Sol’s. I know that we’re missing a couple of entries designated as “G1/G2V”, but I am not going to worry about it today.
We’re remembering stars with the same spectrum, but will only be displaying those with proper names. We’ll just count the others.
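The filter described here (exact spectrum match, show the named ones, just count the rest) can be sketched in Python as follows. This is a comparison sketch, not the PIR from the post. `stars_like` is a made-up name, and the sample rows are invented apart from the two facts the text gives: the “G1/G2V” entries do not count as an exact match, and Rigel Kentaurus A shares the Sun’s spectral type. I am assuming the Sun’s spectrum string is “G2V”.

```python
import io


def stars_like(handle, spectrum, delimiter=","):
    """Return (named_matches, unnamed_count) for stars with this exact spectrum."""
    header = handle.readline().rstrip("\n").split(delimiter)
    named = []
    unnamed = 0
    for line in handle:
        star = dict(zip(header, line.rstrip("\n").split(delimiter)))
        if star.get("Spectrum") != spectrum:   # exact match only
            continue
        if star.get("ProperName"):
            named.append(star["ProperName"])   # remember the named ones
        else:
            unnamed += 1                       # just count the rest
    return named, unnamed


# Invented rows; the StarID values are placeholders, not catalog numbers.
sample = io.StringIO(
    "StarID,ProperName,Spectrum\n"
    "1,Rigel Kentaurus A,G2V\n"
    "2,,G2V\n"
    "3,,G1/G2V\n"
    "4,Example Star,K0\n"
)
print(stars_like(sample, "G2V"))   # prints (['Rigel Kentaurus A'], 1)
```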
You may have noticed that I reassign some variables with the same value they probably already had. This may not be efficient, but it’s for my own sanity. I want to be certain about the values held in those variables. I am also pretending these little labelled regions are like distinct blocks of code. It’s a lie, but a useful one.
On the other hand, this program does take a couple of seconds to run on my machine now.
Those are disappointing results. It looks like we have many neighbors that look like our Sun, but only one with a name. I would love to use one of the alternate references if available, such as the Gliese or Bayer-Flamsteed designations. I don’t think that’s practical with how we’re writing our Parrot application today.
Conclusion
Wow. There has been a lot of new stuff today. Not only did we learn how to read files and use Hashes, we also saw how to load bytecode libraries. We counted, searched through, and displayed data from a 20 Megabyte text file with nearly 120,000 entries. We also learned that Rigel Kentaurus A is the only named neighbor in the database that is the same spectral type as our Sun.
I think we’re reaching the limits of what I want to do with `goto` as my primary tool for guiding program flow. PIR code is getting harder to write and edit. The next step really should be creating subroutines to abstract some of the more complicated or tedious processes.
Added to vault 2024-01-15. Updated on 2024-01-26