Among the many moving pieces involved in re-launching Creatavist as Atavist, there was the issue of what to do with all of the old “handles” occupied by now-defunct names. When The Atavist relaunched as The Atavist Magazine, Crystal and Katia had the really clever idea to turn the old @theatavist Twitter handle into a bot! I was delighted to help make their Atavist Magazine bot become real.
By far the easiest kind of bot to make is what’s called a Markov bot. These bots, often living on Twitter, take as fodder a corpus of raw text from... anywhere. This corpus is processed thru something called a Markov chain, which is “a stochastic process with the Markov property,” per Wikipedia.
Perhaps more helpfully, an example of a Markov chain:
Another example is the dietary habits of a creature who eats only grapes, cheese, or lettuce, and whose dietary habits conform to the following rules:
It eats exactly once a day.
If it ate cheese today, tomorrow it will eat lettuce or grapes with equal probability.
If it ate grapes today, tomorrow it will eat grapes with probability 1/10, cheese with probability 4/10 and lettuce with probability 5/10.
If it ate lettuce today, tomorrow it will eat grapes with probability 4/10 or cheese with probability 6/10. It will not eat lettuce again tomorrow.
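If it helps to see those rules running, here is a minimal Python sketch of the creature's diet. The names `TRANSITIONS`, `next_meal` and `simulate` are made up for illustration; only the probabilities come from the rules above.

```python
import random

# Transition probabilities copied from the rules above:
# today's meal -> {tomorrow's meal: probability}
TRANSITIONS = {
    "cheese":  {"lettuce": 0.5, "grapes": 0.5},
    "grapes":  {"grapes": 0.1, "cheese": 0.4, "lettuce": 0.5},
    "lettuce": {"grapes": 0.4, "cheese": 0.6},
}

def next_meal(today, rng):
    """Pick tomorrow's meal using only today's meal (the Markov property)."""
    options = TRANSITIONS[today]
    return rng.choices(list(options), weights=list(options.values()), k=1)[0]

def simulate(start, days, seed=None):
    """Return a list of `days` meals, beginning with `start`."""
    rng = random.Random(seed)
    meals = [start]
    for _ in range(days - 1):
        meals.append(next_meal(meals[-1], rng))
    return meals

# One week of meals; the exact sequence varies from run to run.
print(simulate("grapes", 7))
```

The thing to notice is that `next_meal` never looks further back than today. That forgetfulness is the whole trick.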
By running a large body of text thru a Markov chain, you can create new text which feels very much of a piece with whatever the old text was like. In the case of The Atavist Magazine, the hope and hypothesis was that we’d end up with a sort of bizarro Atavist, populated with some of the same writing styles and characters from real Atavist Magazine stories, but decidedly more uncanny.
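To make that idea concrete, here is a toy word-level version in Python. It is only a sketch under simple assumptions: each word is a state, and the next word is drawn from the words that followed it somewhere in the corpus. Real Markov bot libraries do more work than this, and the function names `build_model` and `generate` are my own invention.

```python
import random
from collections import defaultdict

def build_model(sentences):
    """Map each word to the list of words that followed it in the corpus."""
    model = defaultdict(list)
    for sentence in sentences:
        words = sentence.split()
        for current, following in zip(words, words[1:]):
            model[current].append(following)
    return model

def generate(model, start, max_words=20, seed=None):
    """Walk the chain from `start` until a dead end or `max_words`."""
    rng = random.Random(seed)
    words = [start]
    while len(words) < max_words and words[-1] in model:
        # Repeated followers appear repeatedly in the list, so common
        # pairings are proportionally more likely to be chosen.
        words.append(rng.choice(model[words[-1]]))
    return " ".join(words)

corpus = ["the cat sat on the mat", "the dog sat on the rug"]
model = build_model(corpus)
print(generate(model, "the", max_words=8))
```

With a two-sentence corpus the output is predictably dull. With megabytes of text, the same walk starts to sound eerily like its source.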
To assemble this Markov bot, I asked Crystal for a copy of the text from all Atavist Magazine stories. Though the medium-of-choice for our stories is HTML, in this case I requested plain-text versions, which are easier to ingest and use than HTML (which would need to be filtered). It’s actually kind of frightening to see the sum total of an enormous amount of editorial work represented as a mere folder of Word files!
Once collected, there were a few post-processing steps required before this text was ready for the bot. From prior experience with Markov bots, I knew I wanted to use the “twitter_ebooks” Ruby library to make the bot. This library, written to more easily create bots in the style of the “horse_ebooks” bot, can take a person’s tweets, transform them into a text “model” suitable for a Markov chain, and do some of the grunt work around posting tweets on a randomized schedule.
Though the twitter_ebooks library is often used to create a bot using your own tweets as a corpus, it’s also possible to use it with any old bit of text. Unfortunately Microsoft Word files aren’t really “text” in the truest sense! They’re text wrapped in all kinds of formatting and proprietary crap. So without even looking into it I knew I needed to convert all of these stories from Word files to plain text.
Additionally, the twitter_ebooks library expects plain text input to be delimited by line returns. This means that in an ideal world I’d have one giant text file with all the stories lumped together, but where each line in the text file consists of a single sentence from one of the stories. Quite literally a tall order!
Converting from Word to plain text
As a proud non-owner of Microsoft Word, and a developer who truly believes that laziness is an important creative constraint, I was interested in accomplishing both of these text processing tasks on the command line! If you’re not comfortable with the command line, it may seem like a non-sequitur to think of the command line as an environment where laziness is possible or encouraged. But what’s nice about using the command line is that it’s very forgiving, because it makes it abundantly clear where and why things go wrong. In command line programs, there’s often a setting called “verbose,” which means nothing more than “tell me everything you know.” I find this setting to be very helpful when learning!
Leveraging decades of professional experience, I immediately “googled it”:
It turns out that on the Mac, there is a command line program called “textutil” for converting various text formats to other text formats. From inside a folder full of “.doc” files, you can convert them to “.txt” with this simple one-liner:
Which loosely means “convert to txt if the file matches the pattern ‘doc.’”
If you also have “.docx” files, you may need to run this as well:
Combining plain text files
Now that we have a bunch of plain text files, we’ll concatenate them using a classic program called cat:
So this file now has all of the Atavist Magazine stories, jumbled together. I was amazed to see that this is a whopping 3.4MB of raw text!
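If you’d rather not lean on cat, the same step can be sketched in a few lines of Python. This is an alternative I’m offering for illustration, not what I actually ran. The function name `concatenate` is hypothetical, and I’m assuming the converted files are UTF-8.

```python
import tempfile
from pathlib import Path

def concatenate(folder, output_name="concatenated.txt"):
    """Join every .txt file in `folder` into one file, in filename order."""
    folder = Path(folder)
    output = folder / output_name
    parts = [
        p.read_text(encoding="utf-8")
        for p in sorted(folder.glob("*.txt"))
        if p.name != output_name  # don't swallow a previous run's output
    ]
    # Unlike cat, this puts a newline between files so the last sentence
    # of one story doesn't run into the first sentence of the next.
    output.write_text("\n".join(parts), encoding="utf-8")
    return output

# Demo on throwaway files so nothing real is touched.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "a.txt").write_text("First story.", encoding="utf-8")
    Path(tmp, "b.txt").write_text("Second story.", encoding="utf-8")
    combined = concatenate(tmp)
    print(combined.read_text(encoding="utf-8"))
    print(combined.stat().st_size, "bytes")
```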
The only thing which remained to be done with this text file is separating each sentence into a separate line. This is important, because our bot will only produce “good” sentences if the sentences it consumes truly resemble sentences!
You might think that splitting out text into sentences is a simple task of finding periods and creating line returns after them. But it turns out that this is not nearly sufficient. From the “Sentence boundary disambiguation” Wikipedia article:
sentence boundary identification is challenging because punctuation marks are often ambiguous. For example, a period may denote an abbreviation, decimal point, an ellipsis, or an email address - not the end of a sentence
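To see the problem concretely, here is a small Python sketch of the naive approach: break after every period, full stop. This is emphatically not the approach I ended up using. The function name `naive_split` and the sample sentence are made up for the demonstration.

```python
def naive_split(text):
    """Start a new 'sentence' after every single period."""
    sentences, current = [], ""
    for ch in text:
        current += ch
        if ch == ".":
            sentences.append(current.strip())
            current = ""
    if current.strip():
        sentences.append(current.strip())
    return sentences

sample = "Dr. Smith paid $3.50 for coffee. Then he left."
for line in naive_split(sample):
    print(line)
# Prints four lines:
#   Dr.
#   Smith paid $3.
#   50 for coffee.
#   Then he left.
```

Two real sentences went in and four fragments came out, courtesy of one abbreviation and one decimal point. Feed fragments like these to a Markov bot and it will happily learn to produce more of them.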
Again, relying on years and years of experience in the field of sentence boundary disambiguation:
It turns out there’s a suite of programs called “Apache OpenNLP,” NLP here standing for Natural Language Processing (I think!). I have only the vaguest understanding of when to use some of the tools for parts-of-speech tagging, tokenization, etc. But all we needed for this bot was OpenNLP’s sentence detection tool.
To install OpenNLP, I used the Mac package manager called homebrew:
Note that OpenNLP requires Java—here are some instructions on how to install that if you’re not sure what a Java runtime is, or haven’t installed one.
Before using this sentence detector, we’ll need to download a file called “en-sent.bin.” Download it here and place it in the same folder as your concatenated file. This file helps the sentence detector find sentences, having been trained on a massive corpus of English-language data!
OK, now we should have all that’s required to turn our big text file of Atavist Magazine stories into a list of Atavist Magazine sentences. From the folder with our “concatenated.txt”:
Feeding the bot
Now we finally have text which we can feed to our bot! Let’s return to the “twitter_ebooks” library. It’s installed as a Ruby gem, and from there run as a command line utility. To create a fresh bot:
This will create a folder with a bunch of bot boilerplate. From inside this folder, we’ll take the sentences from before and “consume” them into a Markov model:
Here’s what twitter_ebooks tells us about this process:
Now the text is all ready for the bot! I’ll skip the part where we set up the bot for Twitter and design. This process loosely involves registering for Twitter developer OAuth keys, generating an “access token” for your bot, and plugging those details into the “bots.rb” file located in your bot’s folder.
Now that the text was prepared for the bot, and the twitter_ebooks library properly configured, all that remained was to choose how frequently the bot would post and to get the thing actually running.
From experience with past bots, I chose to have The Atavist MagBot post every 5 hours or so: not so frequently as to totally jam the feed, but not so infrequently that you forget it exists. We’re running the bot locally, on a Mac Mini which we use for prototyping. The running bot looks something like this:
So far, its tweets have been really stellar! An “Atavist style” seeps thru them, in a way that might be otherwise hard to see reading any one Atavist Magazine story alone. Having created the bot, it’s hard for me to not like every tweet it produces. In general, it’s easy for me to anthropomorphize bots like this, forgiving them for their mistakes (grammatical or otherwise) and finding every precious thing they say to be a miracle of computing. But it’s humbling to realize that every bot is a product of its design, and in this case the design owes a deep debt of gratitude to the enormous list of lovingly-edited Atavist Magazine stories—there are 47 of them at this point!
Another bot thought: I’ve noticed that a bot will continue to be interesting and fun for a much longer period of time than a human-managed parody or joke account on Twitter. I’m not totally sure why this is? Perhaps a bot is almost as hit-or-miss as a normal human’s account, so it’s far less likely that a bot will run into a wall where its tweets aren’t good anymore? For a far more interesting and fascinating perspective on “what makes _ebooks bots good” and “why horse_ebooks being revealed to be a human was disappointing,” consider reading this post from botmaker Leonard Richardson.
OK! You can follow the Atavist Magbot on Twitter here:
And also check out its parallel universe brethren, where the real stories are made:
...and to hear more disjointed thoughts on bots, I’m on Twitter: @saranrapjs or send me an email: email@example.com