I wrote a Perl script using utility features in Mojolicious to check all of the links in my site.
Nothing lasts forever. Sites get reorganized, move, or disappear. As my own site has gotten older — some of these pages are over fifteen years old — links from old posts stop working. Link rot is a fact of life on the Internet. I want to minimize it here.
Instead of manually checking each of the 245 posts on this site, I chose to write some code that identifies the dead-end links. Then I could manually adjust the bad links. Yay! That’s hand-crafted automation there.
use Mojo!
Mojolicious is a Perl framework for making Web applications. It also happens to provide excellent support for a wide range of Web-related programming.
I mentioned Mojolicious here before. I use it as a part of my daily dev toolkit, even though I still haven’t made a real Web app with it.
The code
I could just dump the script here and go on with my day, but I feel like typing a lot for some reason. Let’s go through the major chunks of the code.
The setup
Whenever possible, I specify the latest version of Perl (currently 5.24). It enables some features and deprecates others. If nothing else, it reminds me when I last worked on the code. Recent Perl versions automatically enable `strict`, but it’s useful for me to also turn on `warnings`.
The `experimental` CPAN module saves some boilerplate when using Perl features that have not fully stabilized — such as function signatures.
Mojolicious provides a remarkable amount of functionality for such a small installation. This is just what I’m explicitly using.
- `Mojo::DOM`: HTML/XML DOM parser that supports CSS selectors
- `Mojo::File`: for handling file paths and easy reading/writing of files
- `Mojo::JSON`: `decode_json` lets me turn the Hugo `config.json` file into a Perl structure
- `Mojo::URL`: understands the components of Uniform Resource Locators
- `Mojo::UserAgent`: makes HTTP and WebSocket requests (similar to LWP::UserAgent, or Requests for Python people)
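Here’s a sketch of the preamble all that adds up to — the same ingredients, not the script verbatim:

```perl
#!/usr/bin/env perl
use v5.24;                       # implies strict, enables recent features
use warnings;
use experimental qw(signatures); # function signatures without the warnings

use Mojo::DOM;
use Mojo::File qw(path);
use Mojo::JSON qw(decode_json);
use Mojo::URL;
use Mojo::UserAgent;
```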
From the top
This is the important bit: load the config, create a user agent, and check links in one or all of the generated HTML files. I checked the generated HTML files in `public` because I didn’t feel like messing with `hugo server` or a Mojolicious mini-app. Scraping a local server could be an option later.
Using Mojolicious for everything was so much fun that I rewrote `config.yaml` as `config.json` to allow using `Mojo::JSON` here. Hugo’s built-in support for different configuration formats made that a painless shift. Then Mojo lets me `slurp` the contents of the config file into a single string, which `decode_json` turns into a hash reference.
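In code, that’s about one line. This is a sketch; the `config.json` location and the `baseURL` key are assumptions about a standard Hugo layout:

```perl
# Slurp the whole config file into one string, then decode it.
my $config = decode_json( path('config.json')->slurp );
say $config->{baseURL};   # a standard Hugo key, shown here as a sanity check
```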
`list_tree` gives a recursive directory listing of everything under `$root` as a Mojo::Collection. Collections provide a tidy toolkit of list-handling functionality without requiring me to go back and forth between arrays and array references.
I could find and iterate over all the HTML and XML files in vanilla Perl 5, but I like this better.
After a few runs, I added the ability to specify a single file in `@ARGV`. That way I can figure things out when that one link in that one file causes trouble.
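Put together, the top of the script looks roughly like this. `check_links` is my illustrative name for the per-file routine, sketched in the next section:

```perl
# Top-level flow: a user agent, then one file or the whole generated tree.
my $ua   = Mojo::UserAgent->new->max_redirects(5);
my $root = path('public');

my @files = @ARGV
    ? map { path($_) } @ARGV                            # just the troublesome file
    : $root->list_tree->grep(qr/\.(html|xml)$/)->each;  # everything Hugo generated

check_links($ua, $_) for @files;
```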
Checking links in a file
Once again I `slurp` a file into a string. This time it gets handed off to `Mojo::DOM` so it can `find` any elements with `src` or `href` attributes, and then create a `Mojo::URL` from the appropriate `attr`. `Mojo::URL` does the tedious work of parsing URLs and making components like `scheme` available.
Leaning on the `//` defined-or logical shortcut lets me take advantage of the three boolean states of Perl: truthy, falsey, and “I dunno.” Each URL-testing subroutine can return `undef` to indicate that it doesn’t know what to do with the URL, and let the next subroutine in line handle it. If nobody knows what to do with it, then that’s a bad link and gets remembered as a falsey value.
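Here’s a sketch of how those pieces fit together, with `check_internal` and `check_external` as illustrative names rather than the original’s:

```perl
# Sketch of the per-file check; subroutine names are illustrative.
sub check_links ($ua, $file) {
    my $dom = Mojo::DOM->new( path($file)->slurp );

    $dom->find('[href], [src]')->each( sub ($e, $n) {
        my $url = Mojo::URL->new( $e->attr('href') // $e->attr('src') );

        # Each checker answers 1 (fine), 0 (bad), or undef ("I dunno"),
        # and defined-or hands the URL down the line until someone claims it.
        my $ok = check_internal($file, $url)
              // check_external($ua, $url)
              // 0;

        say "BROKEN in $file: $url" unless $ok;
    });
}
```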
`each` hands two items to the subroutine it invokes: an item in the collection and that item’s position in the collection (starting from 1). No, I don’t use `$n`, but I wanted you to see that it’s available. You can also access the item as `$_` as I did earlier. You can even do your subroutine arguments the old-fashioned way with `@_`.
Is it an internal link?
I would check for `../` abuse if this was a general-purpose script, but it’s mostly links I added by hand and checked manually at some point in the last fifteen years. So, assuming past me was not acting maliciously or foolishly, we rule out the more likely situations:
- The URL `host` points to something besides my site, which means it can’t be a local file.
- The link has a `fragment` pointing to a named anchor and nothing else. I only have that on one page right now, and I don’t feel like complicating this script for a single page.
- The `path` isn’t set, which at this point means an empty link. That can’t be good.
- If the link is to a local file, we check whether it exists.
`Mojo::Path` manipulation delights me. Sure, this could be a regular expression substitution with fewer characters of code, but someone else seeing `merge` after a check for a trailing slash would probably understand that I’m adjusting for the common practice of `/thing/` being a link to `/thing/index.html`. They might understand even if they weren’t Perl developers!
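A sketch of the internal check, where the edge-case behavior follows my reading of the list above:

```perl
# Sketch of the internal-link check described above.
sub check_internal ($file, $url) {
    return undef if $url->host;    # somebody else's site, so not a local file

    my $path = $url->path;
    return 1 if $url->fragment && !$path->to_string;  # lone named anchor: trust it
    return 0 unless $path->to_string;                 # empty link, can't be good

    # /thing/ is really a link to /thing/index.html
    $path->merge('index.html') if $path->trailing_slash;

    # Absolute paths resolve against the build root, relative against the file.
    my $target = $path->leading_slash
        ? path('public')->child( @{ $path->parts } )
        : path($file)->sibling( @{ $path->parts } );

    return -e $target ? 1 : 0;
}
```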
Is it a working external link?
After some quick checks to ensure I’m not looking at a blog demo link and that I handle protocol-relative URLs correctly, I wrap a simple `head` request in an `eval` block.
I use HTTP `HEAD` because I only care about whether the link is valid. I don’t want the full content at the link. `eval` lets me catch timeouts and requests being sent to Web sites which no longer exist. Assuming no errors, this eventually returns whether the `result` of the HTTP transaction succeeded with `is_success`.
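A sketch of the external check. The original also skips its own demo links first; treating a missing scheme as `https` is my assumption for the protocol-relative case:

```perl
# Sketch of the external-link check.
sub check_external ($ua, $url) {
    return undef unless $url->host;             # nothing remote here to check

    # Protocol-relative links like //example.com/thing get a scheme bolted on.
    $url->scheme('https') unless $url->scheme;

    # result() dies on timeouts and connection errors, so eval catches
    # the sites that no longer exist at all.
    return eval { $ua->head($url)->result->is_success } ? 1 : 0;
}
```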
Summarize it
Today I only looked for bad links, but it can be useful to know the status of all links in my site. I used it a few times during development. May as well leave that bit of logic in.
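That status tracking can be as small as a tally hash, something like this sketch:

```perl
# Sketch: tally every outcome, not just the failures.
my %status;                                  # declared near the top of the script

# inside the per-link callback:
# $status{ $ok ? 'ok' : 'broken' }++;

# after the last file:
say "$_: $status{$_}" for sort keys %status;
```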
What’s That Do?
A couple hundred lines like this, basically.
Goodness those are embarrassing.
Okay I’m gonna go fix this.
Some links just won’t work with this code. I may revisit this later, but I got what I need. All links should at least work in a browser for now.
An added bonus that I didn’t expect: this code also ran on Windows 10 with no changes needed.
More Ideas
Improvements that I thought of while putting this together, which I may eventually try out.
- Be a good bot citizen by paying attention to `robots.txt`. I tried that in an early version of the script, but hardly any of the sites provided one. I’ll ponder and try not to run the script too often for now.
- Wrap things up in a `Mojo::Base` class for organization.
- Run an instance and scrape that live - see if it makes a difference!
- Use non-blocking requests, since Mojo::UserAgent supports them.
- Cache results to disk, since working links tend to stay that way for at least a few days.
- Find out why some URLs didn’t work. Was it a `robots.txt` thing? A weird redirect? They worked in the browser, after all.
Honestly the script does what I need it to, and I might never implement these other ideas.