My blogging output by year – bet you didn’t know there were blogs in 1 CE
The plan
Use Simon Willison’s Datasette tools — and maybe the rest of Datasette too — to replace an existing data script and perhaps gain new insight into site contents.
Also: puttering is fun.
Background
I incorporated q into my site workflow a while back because of fREW Schmidt. See, he wrote an interesting post about his Hugo / Unix / Vim blogging workflow. I immediately copied q, a Perl script he wrote for querying site metadata via an in-memory SQLite database.
q is ridiculously fast and convenient if you know the query you want to ask.
Is your query too awkward to comfortably type on a command line? Put it in a shell script, like fREW’s tag-count example.
That gets you a list of all the tags in your entries, ordered by number of appearances.
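The same tag-count query is easy to sketch in Python with an in-memory SQLite database. The table and column names here are my own illustration, not fREW’s actual schema:

```python
import sqlite3

# Hypothetical schema: one row per (entry, tag) pair, as a script like q
# might load from front matter. Sample data stands in for a real site.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE tags (entry_path TEXT, tag TEXT);
    INSERT INTO tags VALUES
        ('post/a.md', 'python'),
        ('post/b.md', 'python'),
        ('post/b.md', 'sqlite');
""")

# Every tag, ordered by number of appearances.
rows = db.execute("""
    SELECT tag, COUNT(*) AS appearances
    FROM tags
    GROUP BY tag
    ORDER BY appearances DESC
""").fetchall()

for tag, count in rows:
    print(f"{count:4} {tag}")
```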
I converted it to Python eventually, of course. The Records library produced nice-looking output while letting me stick to SQL syntax. Noticeably slower than q, but acceptable. A bit later, I added a table for aliases. Then another table for announcements — my tweets and toots about new content.
Aside from answering idle questions like “how many drawings have I posted?” my little database streamlines the process of publishing site updates and automatically announcing new posts.
What’s the problem?
I’m thinking about adding more tables to improve IndieWeb integration. But to be honest, my creation has grown unwieldy. And slow.
Compared to the original Perl script?
I’m tempted to gut my Python workflow. Start fresh from q and Perl again. But no. That would be a lot of work — duplicated work, at that.
I’ll stick to Python for now. But surely we can do better than the duct tape script I’ve been using.
Simon Willison’s sqlite-utils may be just the thing.
What’s sqlite-utils?
sqlite-utils provides both a Python library and a command line tool for building and querying SQLite databases. It can guess a schema from structures as you insert them — though you can be explicit if you prefer. It’s particularly useful for bulk inserting or exporting data dumps in JSON or CSV.
Makes sense. It’s part of the Datasette ecosystem. Willison’s Datasette project simplifies exploring and publishing data for folks who understand data but not necessarily databases.
And although it’s slower than q for SQL queries, it’s much closer than what I came up with!
Creating site.db with sqlite-utils
Okay! Let’s do this. I want to load and prepare details from my content entries before I let sqlite-utils turn them into a database.
Getting metadata from Hugo and front matter
We could recursively walk through the content/ folder, collecting front matter details about everything we find. Most of the time, that’s more than I want. Hugo site organization includes supplemental material like page bundles and _index pages. That’s distracting when I only want to focus on my posts and notes. If you just want core content — posts, pages, stuff like that — ask Hugo.
I mentioned the hugo list commands before. hugo list all prints summaries of all your main content entries as comma-separated text. Python’s standard subprocess and csv modules are all I need to turn that into something useful.
csv.DictReader gives me a collection of manageable dictionaries.
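Here’s a sketch of that step. `hugo list all` is a real Hugo command; the helper name and the guard around it are my own framing:

```python
import csv
import io
import shutil
import subprocess

def hugo_entries():
    """Parse `hugo list all` output into a list of dicts (sketch)."""
    result = subprocess.run(
        ["hugo", "list", "all"],
        capture_output=True, text=True, check=True,
    )
    # The first CSV row is a header (path, title, date, ...), so
    # DictReader yields one dictionary per content entry.
    return list(csv.DictReader(io.StringIO(result.stdout)))

# Only useful when run inside an actual Hugo site checkout.
if shutil.which("hugo"):
    for entry in hugo_entries()[:3]:
        print(entry["path"], entry["title"])
```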
That’s useful for a high-level overview of each post, but I want more. I want tags, categories, aliases, etc. I keep those details as YAML text in content front matter.
PyYAML moves quickly when powered by libYAML, so I make sure I have both installed on my Manjaro machine.
I can’t think of a pretty way to get at an entry’s front matter, so let’s just get it out of the way.
NOTE
It’s not worth rewriting a whole blog post just yet, but in the years since writing this I use python-frontmatter to handle frontmatter.
- read the file
- grab the front matter YAML
- hand back a dictionary based on that front matter text
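The steps above can be sketched like this, assuming Hugo’s YAML convention of `---` delimiters around the front matter (the function name is mine, not necessarily the post’s):

```python
import yaml

def get_front_matter(entry_path):
    """Return a dict built from the YAML front matter of a Hugo content file.

    Assumes the file starts with front matter between '---' delimiter
    lines, which is Hugo's YAML front matter convention.
    """
    with open(entry_path, encoding="utf-8") as f:
        text = f.read()

    # Splitting on the first two '---' markers leaves
    # ['', front_matter_yaml, body].
    _, yaml_text, _ = text.split("---", 2)
    return yaml.safe_load(yaml_text)
```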
Why a function? It’s only a couple lines of code.
Sure, but it’s an ugly couple of lines. Plus it hides the YAML-specific details away in case I switch my front matter to TOML, JSON, or some other format.
Okay, now to mash each entry and the relevant front matter bits together. I prefer Arrow’s interface to datetime.
- A namedtuple helps when you want some structure but not a full class.
- Always make sure times are in UTC when saving to the database!
- Hugo determines sections from a file’s location under content/, so let’s do the same.
- entry_path will help connect tags to entries when creating the database.
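A minimal sketch of that combining step. The post reaches for Arrow here; to stay dependency-free this version uses the standard library’s datetime instead, and the field names are my assumptions about the massaged entries:

```python
from collections import namedtuple
from datetime import datetime, timezone

# A namedtuple: structure without the ceremony of a full class.
Entry = namedtuple("Entry", ["path", "title", "date", "section", "tags"])

def make_entry(row, front_matter):
    """Combine a `hugo list all` row with its front matter (sketch)."""
    # Always make sure times are in UTC when saving to the database!
    date = datetime.fromisoformat(row["date"]).astimezone(timezone.utc)
    # Hugo derives sections from location under content/:
    # content/post/a.md lives in section "post".
    section = row["path"].split("/")[1]
    return Entry(
        path=row["path"],
        title=row["title"],
        date=date.isoformat(),
        section=section,
        tags=front_matter.get("tags", []),
    )
```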
I got everything tidy and pretty. We’re ready for the database.
Loading the data
The sqlite-utils Python API ends up taking hardly any code at all, thanks partly to the work spent massaging the entries. You treat the database like a dictionary, with each table name as a key. Tables get created when you insert something.
I think we called that “autovivification” in Perl. It’s nice.
- Rebuild the database from scratch every time this runs. Alternately, leave this out and use upsert_all to do an insert-or-update.
- Set path as the primary key, since every filename should be unique.
- A tuple pk specifies a compound primary key, where each pair must be unique.
- sqlite-utils hides the details of foreign keys in SQLite. Just use a list of FIELD_NAME / TABLE_NAME pairs.
All that work makes build_db a convenient starting point for build-db.py.
I could specify an in-memory database, but no. For the moment I’ll settle on creating a database file. That way I can more easily play with the sqlite-utils command line interface.
I’m content. But I’m also curious. What does my site metadata look like in Datasette?
Exploring site.db
with Datasette
I don’t have any great insights here. I just wanted to look at the pretty tables. Played with metadata and checked out the results.
The basic interface is pleasant enough, especially when you apply some CSS. Of course, Niche Museums shows you can go a long way past “some CSS” with Datasette.
Oh, and more on the pleasant interface: it lets you edit the SQL for your current view.
There are detail pages for each row. Again, the Niche Museums site shows that the detail view can be heavily tweaked.
Plugins add all sorts of functionality. datasette-vega draws graphs of query results.
Year 0001? Hang on, let me check that in my shiny new Datasette server.
Year 0001 content
| path | title | publishDate | draft | section | category |
|---|---|---|---|---|---|
| content/about/index.adoc | About | 0001-01-01T00:00:00+00:00 | 0 | | |
| content/contact.adoc | Contact | 0001-01-01T00:00:00+00:00 | 0 | | |
| content/follow.adoc | Follow | 0001-01-01T00:00:00+00:00 | 0 | | |
| content/error.adoc | uh oh | 0001-01-01T00:00:00+00:00 | 0 | | |
| content/now/index.adoc | Now | 0001-01-01T00:00:00+00:00 | 0 | | |
| …and a bunch of drafts | [redacted] | 0001-01-01T00:00:00+00:00 | 1 | draft | [redacted] |
Oh that’s right. Non-dated pages and drafts.
No, I’m too embarrassed to share the full list. I get lots of ideas, okay? Then I’m afraid to throw any of them away. Sometimes they mutate. A couple months ago, this post was going to be about playing with hugo list in Python.
Yep. Datasette is pretty nifty. I’m not missing q at all. And I’m really not missing my Python query script.
What’s next?
That was fun! There’s still a lot ahead of me.
- Figuring out the quickest way to refresh the database
- Adding those IndieWeb tables, so I can review Webmention updates
- New Pyinvoke tasks to build, query, and explore the database
- Fixing those tasks that still call out to my old and busted query Python script
- Making my Datasette site dashboard pretty
Explore
Check out some other datasettes available online!
- New York City Cultural Institutions
- Every mention of 2016 primary candidates in hip-hop songs
- the location of every tree in San Francisco — you need to page through to see all 189,144 locations
Got a comment? A question? More of a comment than a question?
Talk to me about this page on: mastodon
Added to vault 2024-01-15. Updated on 2024-02-02