I got an idea in my head a while ago to reduce image sizes for the site. Some of my drawings and photos are a little big. On a slower connection, a visitor could spend a while waiting. And if their bandwidth is metered? Oh I’d hate to think one of my sketches was what put their account over the cap, or got their account throttled to Edge speeds for the rest of the month.
I know I can make it better.
The problem with my idea
Well, I don’t really know. I suspect a handful of files are big, but how many files? How big? Big for who? And what about after I do the work? What counts as less than big? How will I know what work to do, and how will I know whether that work was effective once I’m done?
VM Brasseur gives excellent advice on many topics. One tip sticks in my head: I need numbers for my accomplishments. Heck, right now I need numbers to see if this accomplishment is necessary.
What numbers should I care about?
Of course the problem with data is that there is so much of it. What should I care about, if the goal is making a visit easier for visitors on limited connections?
File size is the more obvious and easily measured factor. This site (and most others) consists of files, right? Text files, image files, the occasional video file. All else being equal, a file that takes up more storage will also take more time to download.
“All else being equal” gets a little tricky though.
Latency is the other factor: how long does it take for the user to see something useful when interacting with your site (loading a page, clicking a link, doing things with web apps)? It’s affected by – well, everything really. Network speed, server resources, sunspots.
If latency is high enough, one big file may reach a visitor quicker than a dozen small requests. If they spend too long waiting for too many pieces, they’ll go elsewhere in a heartbeat.
This Twitter thread by Andrew Certain provides an interesting look at how a large organization like Amazon takes latency seriously. It’s far deeper than I plan to measure, but it might help build more context.
Unfortunately latency can be hard to predict for one person with a blog. I do not yet know what tools work best for evaluating the effect of latency on site performance.
Browser throttling tools are helpful on a page-by-page basis, and probably very helpful for evaluating a single page application. They don’t translate easily to checking an entire site. I suppose I could use Comcast, a command line tool for “simulating shitty network conditions,” and maybe HTTPie to crawl the site under those conditions.
We’ll ignore latency for now. Besides, I’ve already managed many major elements of latency. Hugo creates a static site. Every page already exists by the time you visit. No extra time needed for database lookups or constructing views. I use AWS S3 to host, and Cloudfront as a CDN. This is probably the fastest and most reliable approach possible with my resources.
I do have an issue with the CDN not promptly updating some files when I upload the site, but I’m working on that.
Measuring file sizes
```
$ perl -MFile::Find::Rule -E 'say for find(file => size => "> 6M" => in => "public");'
public/2015/08/01/zentangle-doodle/cover.png
public/2017/04/22/kalaidoscope-symmetry/cover.jpg
public/2017/11/07/something-colorful/cover.jpg
public/2019/04/14/psychedelic-playing-card/cover.png
public/2018/09/30/cougar-mountain/fantastic-erratic.jpg
public/2018/09/30/cougar-mountain/old-stump.jpg
public/2018/09/30/cougar-mountain/mossy.jpg
public/2018/09/30/cougar-mountain/tall-stump.jpg
public/2018/09/30/cougar-mountain/cover.jpg
```
Or maybe find the median between my biggest and smallest files, flagging everything bigger than the median. I promised Python in the tags, so let’s move away from Perl.
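The script is short. Here’s a rough sketch of the shape of it – the FileWeight name shows up in the output below, but treat the internals as an approximation rather than the exact code:

```python
# median.py (sketch): flag every file bigger than the halfway point
# between the smallest and largest file under public/.
from pathlib import Path
from typing import NamedTuple


class FileWeight(NamedTuple):
    path: str
    size: int


def site_files(root="public"):
    """Collect a FileWeight for every regular file under root."""
    return [
        FileWeight(path=str(p), size=p.stat().st_size)
        for p in Path(root).rglob("*")
        if p.is_file()
    ]


if __name__ == "__main__":
    files = site_files()
    sizes = [f.size for f in files]
    # "Median" here is the value halfway between the smallest and biggest file.
    halfway = (min(sizes) + max(sizes)) / 2
    for weight in files:
        if weight.size > halfway:
            print(weight)
```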
Running this gives me the same files as my one-liner. Good choice for an arbitrary number, right?
```
$ python median.py
FileWeight(path='public/2015/08/01/zentangle-doodle/cover.png', size=8964751)
FileWeight(path='public/2017/04/22/kalaidoscope-symmetry/cover.jpg', size=7729604)
FileWeight(path='public/2017/11/07/something-colorful/cover.jpg', size=9594815)
FileWeight(path='public/2019/04/14/psychedelic-playing-card/cover.png', size=13088396)
FileWeight(path='public/2018/09/30/cougar-mountain/fantastic-erratic.jpg', size=7114429)
FileWeight(path='public/2018/09/30/cougar-mountain/old-stump.jpg', size=7672471)
FileWeight(path='public/2018/09/30/cougar-mountain/mossy.jpg', size=6639527)
FileWeight(path='public/2018/09/30/cougar-mountain/tall-stump.jpg', size=7052340)
FileWeight(path='public/2018/09/30/cougar-mountain/cover.jpg', size=8412560)
```
I learned that this technique of grabbing everything on one side of the median is called a “median split.” I also learned that however convenient it might be, a median split doesn’t mean anything. It’s the value halfway between two numbers. Is it a big download size? Maybe. What if I have a bunch of 5.9MB files? Those would be kind of big too, right? If I keep optimizing the biggest half and the median steadily moves down, how will I know when I’m done? What’s a small download?
Okay. I’m okay. I need to breathe for a minute. Once you start asking questions, it can be hard to stop.
So I need to know what the numbers mean, and what a good threshold is. Come to think of it, there might be a few thresholds.
Estimating download time
I care about how long it takes to download a file, assuming latency is as good as it’s going to get. The file size is one part of the download question. The visitor’s connection is the other part. I usually have a nice high speed connection, but not always.
Often I’m on LTE with one bar. Sometimes I’m on 3G. Very occasionally I find a dark corner that only gets me an Edge connection.
Sometimes I have no connection at all, but site optimization can’t help with that.
The Firefox throttling tool documentation includes a chart specifying what its selections represent. I know from site analytics that a third of my visitors use mobile devices. I don’t know what their connection speed is, but I find myself on 3G often enough that I think “Regular 3G” is an acceptable choice.
That 750 Kbps number represents 750,000 bits per second. There are eight bits in a byte. Divide 750,000 by eight and that’s only 93,750 bytes per second. The site’s median size of roughly six megabytes suddenly feels a lot bigger.
Let’s teach the FileWeight class to estimate downloads. I’ll clarify its printed details while I’m at it.
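Sketched out, the new pieces look something like this – a plain dataclass built around Firefox’s 750 Kbps “Regular 3G” figure. The printed format matches the output below, but the internals are an approximation:

```python
# download-time.py (sketch): FileWeight learns to estimate download time
# on a Regular 3G connection and to describe itself more readably.
from dataclasses import dataclass

THREE_G_BPS = 750_000               # Firefox "Regular 3G": 750 Kbps down
BYTES_PER_SECOND = THREE_G_BPS / 8  # eight bits per byte -> 93,750 bytes/second


@dataclass
class FileWeight:
    path: str
    size: int

    @property
    def seconds_3g(self):
        """Best-case seconds to download this file at Regular 3G speed."""
        return self.size / BYTES_PER_SECOND

    @property
    def human_size(self):
        """Describe the size in friendlier units than raw bytes."""
        size, unit = float(self.size), "bytes"
        for bigger in ("KB", "MB", "GB"):
            if size < 1024:
                break
            size, unit = size / 1024, bigger
        return f"{int(size)} {unit}" if unit == "bytes" else f"{size:.2f} {unit}"

    def __str__(self):
        return f"<{self.path}> ({self.human_size}) 3g={self.seconds_3g:.3f}s"
```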
The script is still focusing on the median, but the extra information should give us a little context.
```
$ python download-time.py
<public/2015/08/01/zentangle-doodle/cover.png> (8.55 MB) 3g=95.624s
<public/2017/04/22/kalaidoscope-symmetry/cover.jpg> (7.37 MB) 3g=82.449s
<public/2017/11/07/something-colorful/cover.jpg> (9.15 MB) 3g=102.345s
<public/2019/04/14/psychedelic-playing-card/cover.png> (12.48 MB) 3g=139.610s
<public/2018/09/30/cougar-mountain/fantastic-erratic.jpg> (6.78 MB) 3g=75.887s
<public/2018/09/30/cougar-mountain/old-stump.jpg> (7.32 MB) 3g=81.840s
<public/2018/09/30/cougar-mountain/mossy.jpg> (6.33 MB) 3g=70.822s
<public/2018/09/30/cougar-mountain/tall-stump.jpg> (6.73 MB) 3g=75.225s
<public/2018/09/30/cougar-mountain/cover.jpg> (8.02 MB) 3g=89.734s
```
Oh that’s not good. The biggest file would take almost two and a half minutes to download, while the smallest above the median would still take over a minute. That’s on top of whatever else is on the page.
My threshold should be far less than the median. How much less?
Picking my thresholds
Jakob Nielsen summarized how different response times feel to a user when interacting with an application – and yes, loading a post from your blog in a browser is interacting with an application, affected by the browser and your site (and the network, and so on).
- less than 0.1 seconds is fast enough that it feels like they’re doing it themselves
- less than 1 second is slow enough that it feels like they’re telling the computer to do something
- less than 10 seconds is so slow that you’re starting to lose their attention
Beyond ten seconds, you’re wrestling with the limits of a normal human brain that already has plenty of stuff to think about.
I can and do make excuses –
- “They probably came here on purpose, so they’ll wait!”
- “This is so cool that they won’t mind waiting!”
- “So many factors are beyond my control that there’s no point worrying about it.”
- “Everybody else’s site is even worse!”
– but no. The first two are lies from my ego, the last two are terrible arguments from my apathy.
I know my thresholds. Let’s teach FileWeight about them so it can report the news.
I also added a little emoji quick reference so I can tell at a glance the expected user reaction to the file’s estimated download time.
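Roughly, the additions look like this. It builds on the earlier FileWeight sketch – the cutoffs are Nielsen’s, and the particular emoji are my own shorthand:

```python
# Nielsen's response time limits, paired with an emoji for the quick reference.
ATTENTION_SPANS = (
    (0.1, "😁"),   # feels instantaneous
    (1.0, "😊"),   # a noticeable pause, but the flow of thought survives
    (10.0, "😐"),  # attention starts to wander
)


@dataclass
class FileWeight:
    path: str
    size: int

    # ... seconds_3g and human_size stay the same as before ...

    @property
    def reaction(self):
        """Emoji shorthand for how a visitor probably feels about the wait."""
        for limit, emoji in ATTENTION_SPANS:
            if self.seconds_3g <= limit:
                return emoji
        return "🙁"  # past ten seconds, I'm testing their patience

    def __str__(self):
        return f"<{self.path}> ({self.human_size}) 3g={self.seconds_3g:.3f}s{self.reaction}"
```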
```
$ python download-time.py
<public/2018/12/31/hopepunk-for-2019/cover_hu302a359ad2f64a42481affbc4fbbb8c4_4191368_1000x0_resize_q75_linear.jpg> (156.96 KB) 3g=1.714s😐
<public/2018/10/27/winter-hat-and-gloves/cover_hu42513aeed6d773f768448596f8f497f6_2320770_1000x0_resize_q75_linear.jpg> (182.43 KB) 3g=1.993s😐
<public/tags/pagetemplate/index.xml> (128.25 KB) 3g=1.401s😐
<public/2018/05/26/crafts-are-now-posts/index.html> (17.27 KB) 3g=0.189s😊
<public/2001/01/17/python/index.html> (6.44 KB) 3g=0.070s😁
<public/post/2013/fickle/index.html> (328 bytes) 3g=0.003s😁
<public/2018/08/11/satellite/satellite-lines-black.jpg> (476.27 KB) 3g=5.202s😐
<public/coolnamehere/2007/04/19_01-handling-a-single-round.html> (469 bytes) 3g=0.005s😁
<public/2018/08/19/island-center-forest/mossy-trees.jpg> (3.56 MB) 3g=39.789s🙁
<public/2008/10/01/natalies-hat/cover.jpg> (84.94 KB) 3g=0.928s😊
```
Plenty of build process artifacts in there. The long image names come from using Hugo image processing functions for thumbnails and inline images. I also have many tiny redirect files, letting Hugo’s aliasing behavior make up for the site’s inconsistent organization over time.
A FileWeight object can now describe the details I care about for a single file, including where it fits in the attention span thresholds. How many of my files are too big?
Putting it all together
All the files
I spent too much time on that download table. I could have spent even more, sizing each column to its longest field, but anyway that wasn’t the point. Let’s look at my download estimates.
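I haven’t reproduced report-weight.py in full, but the heart of it is a SiteWeight class that counts files by reaction emoji and translates the time thresholds back into sizes. A sketch, reusing the pieces from the earlier snippets:

```python
# report-weight.py (sketch): summarize every FileWeight into one table.
# Assumes site_files(), FileWeight.reaction, THREE_G_BPS, and BYTES_PER_SECOND
# from the earlier sketches.
from collections import Counter

BUCKETS = (
    ("🙁", "> 10s"),
    ("😐", "1s - 10s"),
    ("😊", "0.1s - 1s"),
    ("😁", "≤ 0.1s"),
)


class SiteWeight:
    def __init__(self, files, title="All files in public"):
        self.files = files
        self.title = title

    def report(self):
        total_mb = sum(f.size for f in self.files) / 1024 / 1024
        counts = Counter(f.reaction for f in self.files)
        kb_per_second = BYTES_PER_SECOND / 1024
        # Express each time threshold as the size downloadable in that time.
        sizes = {
            "🙁": f"> {10 * kb_per_second:.2f} KB",
            "😐": f"{1 * kb_per_second:.2f} KB - {10 * kb_per_second:.2f} KB",
            "😊": f"{0.1 * kb_per_second:.2f} KB - {1 * kb_per_second:.2f} KB",
            "😁": f"≤ {0.1 * kb_per_second:.2f} KB",
        }
        print(self.title)
        print(f"{len(self.files):,} files ({total_mb:.2f} MB)")
        print(f"Download guesses for {THREE_G_BPS / 1000:.2f} Kbps")
        for emoji, label in BUCKETS:
            print(f"{emoji} {label:<10} {sizes[emoji]:<22} {counts[emoji]:>6}")


if __name__ == "__main__":
    SiteWeight(site_files()).report()
```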
```
$ python report-weight.py
All files in public
1,886 files (418.71 MB)
Download guesses for 750.00 Kbps
🙁 > 10s       > 915.53 KB              111
😐 1s - 10s    91.55 KB - 915.53 KB     269
😊 0.1s - 1s   9.16 KB - 91.55 KB       611
😁 ≤ 0.1s      ≤ 9.16 KB                895
```
Way too many files take more than ten seconds to load. I know better than to be pleased about the large number of files that load instantly. As I mentioned, quite a few of them are redirects. On the latency side of things those are worse because the visitor then has to load the real post.
I also said I’m not worrying about latency today.
The median list was helpful in showing me that my biggest offenders are image files, so what about adding a report on those?
Just the media files
The easiest way would be to base it on file extension. But that ends up looking a bit untidy, because extensions have accumulated over the years. JPEG files are the worst offender, stored as .jpg and a few other variations.
I’ll use the standard mimetypes library instead. FileWeight can use that to guess what kind of file it’s looking at, and SiteWeight will make another download table for media files. It still uses file extensions, but with a smarter list than what I could build.
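The check itself is tiny. Here it is sketched as a standalone helper rather than a method on FileWeight, with the set of “media” types being a judgment call:

```python
# A sketch of the media check. mimetypes guesses from the file extension,
# which is the smarter extension list mentioned above.
import mimetypes


def is_media(path):
    """Guess from the filename whether a file is an image, audio, or video."""
    mime_type, _encoding = mimetypes.guess_type(path)
    if mime_type is None:
        return False
    return mime_type.split("/")[0] in ("image", "audio", "video")


# Hand the media subset to SiteWeight for its own table.
media = [f for f in site_files() if is_media(f.path)]
SiteWeight(media, title="Media files in public").report()
```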
The script makes two download reports now, with only a little more work!
```
$ python weight-with-media.py
---
All files in public
1,886 files (418.71 MB)
Download guesses for 750.00 Kbps
🙁 > 10s       > 915.53 KB              111
😐 1s - 10s    91.55 KB - 915.53 KB     269
😊 0.1s - 1s   9.16 KB - 91.55 KB       611
😁 ≤ 0.1s      ≤ 9.16 KB                895
---
Media files in public
802 files (392.19 MB)
Download guesses for 750.00 Kbps
🙁 > 10s       > 915.53 KB              107
😐 1s - 10s    91.55 KB - 915.53 KB     236
😊 0.1s - 1s   9.16 KB - 91.55 KB       339
😁 ≤ 0.1s      ≤ 9.16 KB                120
```
Yeah, that’s what I thought. The majority of those small files are text, and the vast majority of the large files are image or video. Yes, I noticed that a few of my excessively large files are text. Probably archive pages of one sort or another. I’ll gather the information on those later.
But it looks like I have an answer to my question.
Whether it’s worth my time to try optimizing image file sizes.
Oh right right. The answer?
The answer is “yes.”
Nearly half of my media files would be noticeably slow to download on a 3G connection. Over a hundred are large enough to stretch the patience of any visitor not blessed with a constant high speed pipe. That’s not very nice on my part.
Now I know I can make it better. Even better: with this script, I can ask the question again whenever I want!
Optimizing images is another post, though.
Could I improve my weighing script?
Of course! Here are some ideas I came up with while writing this, including a couple I tried but removed to maintain focus.
- additional thresholds beyond “excessive” so I can determine how many files contribute to painfully long download times
- verbose mode to list file details on request
- options to estimate for different download rates
- more detail on media files, perhaps to see if compression has been applied and how much
- report based on page weight rather than individual file weight, to get a more realistic idea of visitor experience
- format the list in JSON to simplify handing off to other reporting tools
- include median and mean file sizes for more number crunching goodness
- list the ten largest files, so I know where to focus my optimization efforts
Should I optimize my weighing script?
Good question! Let’s look at the numbers.
```
$ time make weight
python weight-with-media.py
...

real    0m0.131s
user    0m0.083s
sys     0m0.044s
```
Seriously though. I assembled this in a few hours. Half of it’s too clever and the other half’s too stupid. But it gets the answers I need in a timely fashion. That’s plenty good enough.