I was originally introduced to Markov chains by Josh Millard’s Garkov, a project that applies this mathematical model to Garfield comics. I thought it was an amusingly random idea, but didn’t fully appreciate the concept until Jeff Atwood wrote a post detailing how Markov chains work.
To quickly summarize, the algorithm chooses its next word using a probabilistic function based on two inputs: the current word, and a large sample of coherent text. The chance of a word being chosen is based entirely on how frequently it follows the current word in the sample text. For instance, if the sample was “To be, or not to be, that is the Question”, the word “be” would have a 50/50 chance of being followed by “or” or “that.” As the sample size increases, the text produced becomes increasingly sound. [Update: The size of the sample does not necessarily increase the coherence of the text; the number of words used in the lookup key is the biggest determining factor. Thanks for the correction, Josh.]
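If you’re curious how that lookup plays out in code, here’s a rough sketch in Python. To be clear, this is just my own illustration of the idea, not the actual code behind the Markov Text Synthesizer, and it naively leaves punctuation attached to the words:

```python
import random
from collections import defaultdict

def build_chain(sample_text):
    """Map each word to the list of words that follow it in the sample."""
    words = sample_text.split()
    chain = defaultdict(list)
    for current_word, next_word in zip(words, words[1:]):
        chain[current_word].append(next_word)
    return chain

def generate(chain, start_word, length=20):
    """Walk the chain: each next word is drawn with probability proportional
    to how often it followed the current word in the sample."""
    output = [start_word]
    for _ in range(length):
        followers = chain.get(output[-1])
        if not followers:
            break
        output.append(random.choice(followers))
    return " ".join(output)

sample = "To be, or not to be, that is the Question"
chain = build_chain(sample)
print(generate(chain, "be,"))  # "be," is followed by "or" half the time and "that" the other half
```

Since the follower lists keep duplicates, more frequent successors get picked proportionally more often, which is all the probability the model needs.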
Reading about this got me very curious. What would a Markov chain generated using my own writing look like? Would it overuse “however” and frequently misapply the word “parse” like I do? I decided some experimentation using the Markov Text Synthesizer was in order.
The first test sample was my Rockin’ The Boat series on rock ‘n’ roll and race relations in 1950s America. At a scant ~4000 words it was a poor sample, which made it easy to recognize parts of the original in the generated text:
Rock ‘n’ roll party. Freed considered completely alien. Mintz had his personal success, but rather the Speeds had been a little pressed for school facilities. They were subject to unsuccessful alternative names such as much for the trend was classified under the Brown v. Board of, in ten in, most popular opinion in the same records was no memory of the major label was a lesson. He was a token job. The major record store owner in postwar economy. During the following lyrics which were localized in America. A successful rhythm and I feel would sing to popular.
For my next attempt, I used every post listed under the Video Games category. This was over ten times larger than my original sample, and the text was much more randomized as a result (it even snuck in a “however”, hooray!):
There are games have two things spectacularly well. One where it will work at first. They mix means of Internet hype around the U.S. Addresses Only. Heartbroken, I was, however, describe what is one of the censor, the creative Wiimote according to the end. Plot holes and Escapist feature. Embedded below is minutes of military shooters had her to convince you feel that the game will work done with blood all available up the narrative. He created the Medic class impressions. The use of his weekly feature film was already breaking my favourite newspaper articles. Read the end. The very effective ways. My favorite aspect of a short stories, particularly dystopian centuries. At first game Rock Band, but the best like a timeless void movie revolved largely passed under them all.
Next, I decided to use the same sample but increase the “word grouping” to two. This means that the function will now determine how frequently a word follows the two previous words (in order). This made the text much more coherent, but brought back the problem of being similar to the original sample:
There have been used by other games in the villa’s lobby. Recharging your beam sword’s energy with a Little Sister crying for “Mr. Bubbles”, but it costs a whopping. December is a classic for years to come. Valve could have easily rested on their laurels and sold Portal on its PS release Tools of Destruction, with some great part series detailing each other, it seems that the FPS genre has been getting a lot of the murder investigation. The evidence found and is by far the best thing the Bioshock movie revolved largely around and go buckwild with ADAM. I would rescue them all. It was also annoying at times. To move Zack around you have to mimic it in real life.
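For what it’s worth, a “word grouping” of two just means the lookup key becomes a pair of consecutive words instead of a single word. Again, a rough sketch of my own rather than the synthesizer’s actual implementation:

```python
import random
from collections import defaultdict

def build_chain(sample_text, grouping=2):
    """Key on tuples of `grouping` consecutive words; the value is the list
    of words observed to follow that exact sequence in the sample."""
    words = sample_text.split()
    chain = defaultdict(list)
    for i in range(len(words) - grouping):
        chain[tuple(words[i:i + grouping])].append(words[i + grouping])
    return chain

def generate(chain, length=50):
    """Start from a random key, then slide the window forward one word at a time."""
    key = random.choice(list(chain))
    output = list(key)
    for _ in range(length):
        followers = chain.get(key)
        if not followers:
            break
        next_word = random.choice(followers)
        output.append(next_word)
        key = key[1:] + (next_word,)
    return " ".join(output)
```

With two words of context, any pair that appears only once in the sample has exactly one possible follower, which is why longer stretches of the original text start reappearing verbatim.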
Finally, I decided to test the program with a very very large sample: Cory Doctorow’s free e-book Little Brother. I don’t think this is the sort of remixing Doctorow had in mind when he released his book under Creative Commons, but here goes:
I’m a stone Disney park freak see my fear. Marcus, it can fill you with our magic coffee bean supply. If you’ve got it for a lot longer than me. One only spoke Arabic he began, when one of the country, the world to know it’s there. The truck drew closer. I kept hitting it with both hands. She was dressed in a drawer with the texture of very ugly stuff on the back of a polished marble plinth. I decided I wanted to do stand down. Will you promise me that? They promised with all the walls and other little devices, my wallet intact. Forbidden Planet rarely do I go to jail, Marcus, and not that many cool puzzles that you wanted to be a part of that Hummer, my head for my book by net or phone and memory sticks are private. Thats because of cryptography message scrambling. The math behind crypto is good and solid, and you wonder, Was it always like that? I hadn’t thought about trying to intimidate me the picture she’d snapped of us, if you wanted to.
I’m amazed: a very large sample with a word grouping of two or greater can produce some terrific results. It makes me wonder why spammers always end up with such nonsensical titles. Use a larger sample, you jerks!
In my various attempts, I also ended up with some rather spectacular phrases. Here are some of my favourites; I swear I did not make any of these up:
The whimsical charms of each of disappointed children
Holding the Wiimote in the holiday rush? Either way, you’re probably online looking for awesome news like this
Your character’s level is on a couch. They’re comfortable for both sides.
a combination of raw stem cells, dubbed thing, and the USA
I believe I was blindsided by exploiting my childhood memories
…and the best one by far:
This essay will be successfully ignored altogether.
Check out Ben Abraham’s post for some more great ones (he beat me to the punch by a few hours). Finally, I’ll leave you with a challenge: can you guess what sample I used to produce the text below?
Peyote solidities of halls, backyard green tree cemetery dawns, wine drunkenness over the rooftops, storefront boroughs of teahead joyride neon blinking traffic light, sun and moon and tree vibrations in the roaring winter dusks of Brooklyn, ashcan rantings and kind king light of mind, who chained themselves to subways for the endless ride from Battery to holy Bronx on benzedrine until the noise of wheels and children brought them down shuddering mouth-wracked and battered bleak of brain all drained of brilliance in the drear light of Zoo,
***********ANSWER***********
Trick question: it’s actually a word-for-word excerpt from Allen Ginsberg’s Howl. Only the formatting has been edited. I think the beat poets would have dug Markov chains, don’t you?
June 12th, 2008 at 12:32 am
I actually prefer the ones with higher word groupings. Sure, you get more of the original, but they tend to make more of an eerie kind of sense. I love your short phrases, though – the shorter ones can sometimes be the funniest.
Next project: create a SOURCE that makes for awesome Markov chains!
June 12th, 2008 at 12:42 am
The higher word grouping does make the text much more coherent. The most effective solution seems to be a gigantic sample, which really obfuscates all the original sentences.
That being said, I think I’ve fried my brain a little by reading all these nonsense paragraphs. If I develop word salad I’m blaming it all on Andrey Markov!
June 12th, 2008 at 1:01 pm
Ha! Nice work, and welcome to one of my favorite obsessions.
A quick nitpick:
As the sample size increases, the text produced becomes increasingly sound.
It’s not totally clear which of two things you’re referring to with “sample size” here, and whether the above is true depends on that.
If by sample size you mean e.g. the number of words in the lookup string that you use to choose the next word, you’re spot on. So for “sample size” = key length or “order”, an order of one (a single word: [be] – {or, that}) produces less coherent — less sound — text than an order-2 ([to be] – {or, that}) model, and so on up.
But if by sample size you meant the size of the sample text you feed into the model — “corpus” is a common term for this, from the more general notion of corpus-as-collection-of-texts — then it’s actually more the opposite case that’s true: the bigger your corpus, the more chance for entertaining rail-jumping transitions but a corresponding decrease in coherence on average, to the point where a very large corpus might mean you want to jump to a higher-order model (perhaps from 2-order to 3-order) just to keep the output a little more coherent — a little more grammatically sound, in other words.
But, as I said, nitpick. You’ve got the general thrust of it right, and it’s not even clear to me that this detail is wrong so much as just not clear from that paragraph.
June 12th, 2008 at 7:00 pm
Thanks for the correction; I was under the mistaken impression that a larger corpus would statistically result in more coherent text. Your argument makes sense, though: a larger sample will return more out-of-context words and may ultimately lead to strange, disjointed text. I’ll amend my post accordingly.
Also, if you’re ever looking to do another comic project, might I suggest Kate Beaton’s work? I think her writing style would create some very interesting results!
June 13th, 2008 at 1:20 am
Ah, yeah, I found out about Kate’s stuff recently via a post on Metafilter I think. Very fun. Could be a bit more of a challenge than Garfield just for the idiosyncrasies of the panels and the lettering, but that’s half the fun.
The Garkov code is largely Garfield-independent; in a couple months I’ll probably want to go back and do some polishing to make “largely” into “completely” by off-loading a few vestigial garfieldisms in the Perl itself to config files, and at that point I’d like to do a few other comics as proof-of-concepts. The folks who have significant fan-contributed transcripts via Oh No Robot are actually my likeliest next target — saving myself several hours of manual typing is a good thing.
So probably something like Achewood will be next on the block, but if I can get all the kinks worked out of the generalized Markov-Comic setup, I could even hand off the basic package itself to other enterprising fans.
June 13th, 2008 at 1:39 am
A general-purpose package sounds like a terrific idea; harness the power of crowdsourcing and Markov comics could be the next LOLcats ;) I look forward to seeing how this develops, Josh.