My blogging output by year – bet you didn’t know there were blogs in 1 CE
The plan
Use Simon Willison’s Datasette tools — and maybe the rest of Datasette too — to replace an existing data script and perhaps gain new insight into site contents.
Also: puttering is fun.
Background
I incorporated q into my site workflow a while back because of fREW Schmidt. See, he wrote an interesting post about his Hugo / Unix / Vim blogging workflow. I immediately copied q, a Perl script he wrote for querying site metadata via an in-memory SQLite database.
q is ridiculously fast and convenient if you know the query you want to ask.
Is your query too awkward to comfortably type on a command line? Put it in a shell script, like fREW’s tag-count example.
That gets you a list of all the tags in your entries, ordered by number of appearances.
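The same tag-count query is easy to sketch in Python with an in-memory SQLite database. The table and column names here are my own illustration, not fREW’s actual schema:

```python
import sqlite3

# Hypothetical schema: one row per (entry, tag) pair, as a script like q
# might load from front matter. Sample data stands in for a real site.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE tags (entry_path TEXT, tag TEXT);
    INSERT INTO tags VALUES
        ('post/a.md', 'python'),
        ('post/b.md', 'python'),
        ('post/b.md', 'sqlite');
""")

# Every tag, ordered by number of appearances.
rows = db.execute("""
    SELECT tag, COUNT(*) AS appearances
    FROM tags
    GROUP BY tag
    ORDER BY appearances DESC
""").fetchall()

for tag, count in rows:
    print(f"{count:4} {tag}")
```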
I converted it to Python eventually, of course. The Records library produced nice-looking output while letting me stick to SQL syntax. Noticeably slower than q, but acceptable. A bit later, I added a table for aliases. Then another table for announcements — my tweets and toots about new content.
Aside from answering idle questions like “how many drawings have I posted?” my little database streamlines the process of publishing site updates and automatically announcing new posts.
What’s the problem?
I’m thinking about adding more tables to improve IndieWeb integration. But to be honest, my creation has grown unwieldy. And slow.
Compared to the original Perl script?
I’m tempted to gut my Python workflow. Start fresh from q and Perl again. But no. That would be a lot of work — duplicated work, at that.
I’ll stick to Python for now. But surely we can do better than the duct tape script I’ve been using.
Simon Willison’s sqlite-utils may be just the thing.
What’s sqlite-utils?
sqlite-utils provides both a Python library and a command line tool for building and querying SQLite databases. It can guess a schema from structures as you insert them — though you can be explicit if you prefer. It’s particularly useful for bulk inserting or exporting data dumps in JSON or CSV.
Makes sense. It’s part of the Datasette ecosystem. Willison’s Datasette project simplifies exploring and publishing data for folks who understand data but not necessarily databases.
And although it’s slower than q for SQL queries, it’s much closer than what I came up with!
Creating site.db with sqlite-utils
Okay! Let’s do this. I want to load and prepare details from my content entries before I let sqlite-utils turn them into a database.
Getting metadata from Hugo and front matter
We could recursively walk through the content/ folder, collecting front matter details about everything we find. Most of the time, that’s more than I want. Hugo site organization includes supplemental material like page bundles and _index pages. That’s distracting when I only want to focus on my posts and notes. If you just want core content — posts, pages, stuff like that — ask Hugo.
I mentioned the hugo list commands before. hugo list all prints summaries of all your main content entries as comma-separated text. Python’s standard subprocess and csv modules are all I need to turn that into something useful.
csv.DictReader gives me a collection of manageable dictionaries.
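Here’s a sketch of that step. `hugo list all` is a real Hugo command; the helper name and the guard around it are my own framing:

```python
import csv
import io
import shutil
import subprocess

def hugo_entries():
    """Parse `hugo list all` output into a list of dicts (sketch)."""
    result = subprocess.run(
        ["hugo", "list", "all"],
        capture_output=True, text=True, check=True,
    )
    # The first CSV row is a header (path, title, date, ...), so
    # DictReader yields one dictionary per content entry.
    return list(csv.DictReader(io.StringIO(result.stdout)))

# Only useful when run inside an actual Hugo site checkout.
if shutil.which("hugo"):
    for entry in hugo_entries()[:3]:
        print(entry["path"], entry["title"])
```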
That’s useful for a high-level overview of each post, but I want more. I want tags, categories, aliases, etc. I keep those details as YAML text in content front matter.
PyYAML moves quickly when powered by libYAML, so I make sure I have both installed on my Manjaro machine.
I can’t think of a pretty way to get at an entry’s front matter, so let’s just get it out of the way.
NOTE
It’s not worth rewriting a whole blog post just yet, but in the years since writing this I use python-frontmatter to handle frontmatter.
- read the file
- grab the front matter YAML
- hand back a dictionary based on that front matter text
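The steps above can be sketched like this, assuming Hugo’s YAML convention of `---` delimiters around the front matter (the function name is mine, not necessarily the post’s):

```python
import yaml

def get_front_matter(entry_path):
    """Return a dict built from the YAML front matter of a Hugo content file.

    Assumes the file starts with front matter between '---' delimiter
    lines, which is Hugo's YAML front matter convention.
    """
    with open(entry_path, encoding="utf-8") as f:
        text = f.read()

    # Splitting on the first two '---' markers leaves
    # ['', front_matter_yaml, body].
    _, yaml_text, _ = text.split("---", 2)
    return yaml.safe_load(yaml_text)
```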
Why a function? It’s only a couple lines of code.
Sure, but it’s an ugly couple of lines. Plus it hides the YAML-specific details away in case I switch my front matter to TOML, JSON, or some other format.
Okay, now to mash each entry and the relevant front matter bits together. I prefer Arrow’s interface to datetime.
- A namedtuple helps when you want some structure but not a full class.
- Always make sure times are in UTC when saving to the database!
- Hugo determines sections from a file’s location under content/, so let’s do the same.
- entry_path will help connect tags to entries when creating the database.
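A minimal sketch of that combining step. The post reaches for Arrow here; to stay dependency-free this version uses the standard library’s datetime instead, and the field names are my assumptions about the massaged entries:

```python
from collections import namedtuple
from datetime import datetime, timezone

# A namedtuple: structure without the ceremony of a full class.
Entry = namedtuple("Entry", ["path", "title", "date", "section", "tags"])

def make_entry(row, front_matter):
    """Combine a `hugo list all` row with its front matter (sketch)."""
    # Always make sure times are in UTC when saving to the database!
    date = datetime.fromisoformat(row["date"]).astimezone(timezone.utc)
    # Hugo derives sections from location under content/:
    # content/post/a.md lives in section "post".
    section = row["path"].split("/")[1]
    return Entry(
        path=row["path"],
        title=row["title"],
        date=date.isoformat(),
        section=section,
        tags=front_matter.get("tags", []),
    )
```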
I got everything tidy and pretty. We’re ready for the database.
Loading the data
The sqlite-utils Python API ends up taking hardly any code at all, thanks partly to the work spent massaging the entries. You treat the database like a dictionary, with each table name as a key. Tables get created when you insert something.
I think we called that “autovivification” in Perl. It’s nice.
- Rebuild the database from scratch every time this runs. Alternately, leave this out and use upsert_all to do an insert-or-update.
- Set path as the primary key, since every filename should be unique.
- A tuple pk specifies a compound primary key, where each pair must be unique.
- sqlite-utils hides the details of foreign keys in SQLite. Just use a list of FIELD_NAME / TABLE_NAME pairs.
All that work makes build_db a convenient starting point for build-db.py.
I could specify an in-memory database, but no. For the moment I’ll settle on creating a database file. That way I can more easily play with the sqlite-utils command line interface.
I’m content. But I’m also curious. What does my site metadata look like in Datasette?
Exploring site.db
with Datasette
I don’t have any great insights here. I just wanted to look at the pretty tables. Played with metadata and checked out the results.
The basic interface is pleasant enough, especially when you apply some CSS. Of course, Niche Museums shows you can go a long way past “some CSS” with Datasette.
Oh, and more on the pleasant interface: it lets you edit the SQL for your current view.
There are detail pages for each row. Again, the Niche Museums site shows that the detail view can be heavily tweaked.
Plugins add all sorts of functionality. datasette-vega draws graphs of query results.
Year 0001? Hang on, let me check that in my shiny new Datasette server.
Year 0001 content
| path | title | publishDate | draft | section | category |
|---|---|---|---|---|---|
| content/about/index.adoc | About | 0001-01-01T00:00:00+00:00 | 0 | | |
| content/contact.adoc | Contact | 0001-01-01T00:00:00+00:00 | 0 | | |
| content/follow.adoc | Follow | 0001-01-01T00:00:00+00:00 | 0 | | |
| content/error.adoc | uh oh | 0001-01-01T00:00:00+00:00 | 0 | | |
| content/now/index.adoc | Now | 0001-01-01T00:00:00+00:00 | 0 | | |
| …and a bunch of drafts | [redacted] | 0001-01-01T00:00:00+00:00 | 1 | draft | [redacted] |
Oh that’s right. Non-dated pages and drafts.
No, I’m too embarrassed to share the full list. I get lots of ideas, okay? Then I’m afraid to throw any of them away. Sometimes they mutate. A couple months ago, this post was going to be about playing with hugo list in Python.
Yep. Datasette is pretty nifty. I’m not missing q at all. And I’m really not missing my Python query script.
What’s next?
That was fun! There’s still a lot ahead of me.
- Figuring out the quickest way to refresh the database
- Adding those IndieWeb tables, so I can review Webmention updates
- New Pyinvoke tasks to build, query, and explore the database
- Fixing those tasks that still call out to my old and busted query Python script
- Making my Datasette site dashboard pretty
Explore
Check out some other datasettes available online!
- New York City Cultural Institutions
- Every mention of 2016 primary candidates in hip-hop songs
- the location of every tree in San Francisco — you need to page through to see all 189,144 locations
Got a comment? A question? More of a comment than a question?
Talk to me about this page on: mastodon
Added to vault 2024-01-15. Updated on 2024-02-02