I posted Counting Words in Blog Posts on Thursday, 2 October, 2014

Using Ruby to track my verbosity

Post ruby

Counting Words in Blog Posts

I want to write at least 250 words per day. This is not a 30-day challenge. It is just something I want to do. I write more than 250 words daily when you count social network posts and chat text. Wouldn’t it be nice if some of those words were organized around a single idea?

I need some way to count those words, of course. The obvious solution is wc.

$ wc counting-words.markdown
     106     464    3108 counting-words.markdown

The documentation tells me that the first column is the number of lines, the second column is the number of words, and the third column is the number of bytes, which equals the number of characters for plain ASCII text. I can train my brain to remember this, but instead I use the -w flag to get only the word count.

$ wc -w counting-words.markdown
     464 counting-words.markdown
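
To make the three columns concrete, here is a rough Ruby sketch of what wc reports by default. The method name wc_counts and the sample text are made up for illustration, and splitting on whitespace only approximates how wc finds words. The third column is a byte count, so it matches the character count only when every character is a single byte.

```ruby
# Hypothetical helper: approximates the three default columns of wc.
def wc_counts(text)
  {
    lines: text.count("\n"), # wc counts newline characters
    words: text.split.size,  # runs of non-whitespace
    bytes: text.bytesize     # bytes, not characters
  }
end

# Made-up sample; \u00e9 is an accented e, which takes two bytes in UTF-8.
sample = "Counting words\nin a caf\u00e9\n"

p wc_counts(sample)  # 2 lines, 5 words, 26 bytes
puts sample.length   # 25 characters
```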

That is better, but it is not an accurate word count. I am currently using Jekyll for blogging, and every blog post file includes a section of front matter and a section of Markdown content. My goal is 250 words of prose, not 250 total words. I do not want to count the front matter.

I could use assorted shell tools to accomplish this, but I would rather make a Ruby one-liner.

First I get the basic information I was already getting from wc.

$ ruby -e 'puts ARGF.read.split.count' counting-words.markdown
464

How do I separate the head from the body of the post? I could do some fiddly bits using ARGF.readlines with a separator argument, but I will keep going with what I have.

$ ruby -e 'puts ARGF.read.split(/^---$/).inspect' counting-words.markdown
["", "\nlayout: post\ntitle: Counting Words in Blog Posts\ndescription: Using Ruby to track my verbosity\ncategory: Programming\ndate: 2014-10-02\ntags: ruby\n", "\nI want to write at least 250 words per day. ..."]

How many words are in the body?

$ ruby -e 'puts ARGF.read.split(/^---$/)[-1].split.count' counting-words.markdown
317
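
The split is easier to see on a tiny made-up post. This sketch assumes the front matter is delimited by lines containing only ---; the sample text and variable names are mine. Passing a limit of 3 to split is an optional tweak, not part of the one-liner: it keeps a later --- line in the body from cutting the post short.

```ruby
# Made-up sample post; the front matter keys are illustrative only.
post = <<~POST
  ---
  layout: post
  title: Example
  ---

  Three prose words.

  ---

  Two more.
POST

# Same idea as the one-liner: everything after the last --- line.
after_last_rule = post.split(/^---$/)[-1]
puts after_last_rule.split.count # 2

# With a limit, split stops after the front matter and leaves the body whole.
_leading, front_matter, body = post.split(/^---$/, 3)
puts front_matter.split.count # 4
puts body.split.count         # 6 (the --- rule itself counts as one token)
```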

I did say that I wanted my word count to be prose. I should exclude code blocks. That calls for a multi-line regular expression, stripping out the fenced code blocks in my post.

$ ruby -e 'puts ARGF.read.split(/^---$/)[-1].gsub(/^~~~ .+?^~~~ $/m, "").split.count' counting-words.markdown
357
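
Here is a small sketch of the fence stripping on a made-up body. It assumes blocks open with a line starting with ~~~ and close with a line that is exactly ~~~; the constant name and sample text are mine. In Ruby the /m flag lets . match newlines, while ^ and $ already match at every line boundary. The lazy .*? stops at the first closing fence, so separate blocks are not merged into one.

```ruby
# Assumed fence style: "~~~" or "~~~ lang" opens, a bare "~~~" line closes.
FENCED_BLOCK = /^~~~.*?^~~~$/m

# Made-up body with two fenced blocks and prose between them.
body = <<~BODY
  Two words.

  ~~~ ruby
  puts "code one"
  ~~~

  Kept prose here.

  ~~~
  more code
  ~~~
BODY

prose = body.gsub(FENCED_BLOCK, "")

puts body.split.count  # 15, fences and code included
puts prose.split.count # 5, prose only
```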

I do not want to count link definitions either.

$ ruby -e 'puts ARGF.read.split(/^---$/)[-1].gsub(/^~~~ .+?^~~~ |\[.+?\]:.+?$/m, "").split.count' counting-words.markdown
341
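
One thing worth checking, as far as I can tell: with the /m flag an unanchored \[.+?\]: can begin at an inline link and run across lines until it reaches the first definition, taking the prose in between with it. The sketch below compares that loose pattern with one confined to a single line. Both constant names and the sample text are hypothetical.

```ruby
# Loose: in the style of the one-liner, where . may cross newlines under /m.
LOOSE_LINK_DEF = /\[.+?\]:.+?$/m
# Stricter assumption: a definition is a whole line of the form "[id]: target".
LINE_LINK_DEF = /^\[[^\]]+\]:.*$/

# Made-up body: one inline reference link and its definition.
body = <<~BODY
  Read the [Example][ex] docs.

  [ex]: http://example.com/
BODY

puts body.gsub(LOOSE_LINK_DEF, "").split.count # 2, the inline link and "docs." are lost
puts body.gsub(LINE_LINK_DEF, "").split.count  # 4, only the definition line is removed
```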

This is good enough. Now I turn it into a bash alias.

# words in post / work in progress
alias wip='ruby -e '"'"'puts ARGF.read.split(/^---$/)[-1].gsub(/^(~~~ .+?^~~~ |\[.+?\]:.+?)$/m, "").split.count'"'"

Oh jeez, those quotes hurt my brain. It was the first solution I came across to handle shell quoting, though. I may come up with something prettier, perhaps a full script or an existing tool. This will do for now.
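
Here is one shape a full script could take. It is only a sketch: it walks the post line by line instead of chaining regular expressions, which is a different technique from the alias. The method name, the constants, and the fence and link-definition patterns are all my assumptions.

```ruby
#!/usr/bin/env ruby
# Hypothetical script version of the word counter. It scans lines rather than
# using the regex chain from the alias.

FENCE    = /\A~~~/           # assumed: fenced blocks open and close with ~~~
LINK_DEF = /\A\[[^\]]+\]:\s/ # assumed: definitions look like "[id]: target"

# Counts whitespace-separated words outside front matter, fenced code,
# and link definitions.
def prose_word_count(text)
  in_front_matter = false
  in_fence = false
  count = 0

  text.each_line.with_index do |line, index|
    stripped = line.chomp
    if index.zero? && stripped == "---"
      in_front_matter = true              # front matter must start on line one
    elsif in_front_matter
      in_front_matter = false if stripped == "---"
    elsif stripped =~ FENCE
      in_fence = !in_fence                # toggle on opening and closing fences
    elsif !in_fence && stripped !~ LINK_DEF
      count += line.split.size
    end
  end

  count
end

# Made-up sample post for a quick check.
sample = <<~POST
  ---
  title: Example
  tags: ruby
  ---

  Some prose with a [link][ref].

  ~~~ ruby
  puts "not counted"
  ~~~

  [ref]: http://example.com/
POST

puts prose_word_count(sample) # 5
```

Replacing the sample with puts prose_word_count(ARGF.read) would let the script take a filename the same way the one-liner does, with no shell quoting to escape.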

$ wip counting-words.markdown
341
