Babbler
Here's a useless cute little program I wrote yesterday, called "Babbler", which creates new text based on the patterns of characters in a piece of text used to train it. It is a Markov generator: it uses the last N produced characters (N being the order) to determine which character comes next.
During training, Babbler analyzes all sequences of 1, 2, ... N characters and records the number of times each shows up in the original text.
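The training step can be sketched roughly like this (a minimal illustration in Ruby; the method and variable names are my own, not necessarily those used in babbler.rb):

```ruby
# Count how often every sequence of 1..order characters
# occurs in the source text.
def train(text, order)
  counts = Hash.new(0)
  (1..order).each do |n|
    (0..text.length - n).each do |i|
      counts[text[i, n]] += 1
    end
  end
  counts
end

counts = train("abab", 2)
counts["a"]  # => 2  ("a" appears twice)
counts["ab"] # => 2  ("ab" appears twice)
```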
To generate a character, it picks one at random from the set of all characters that may follow the sequence of the last N-1 output characters. The random choice honors the frequency distribution of the characters in the source text. If the sequence of the last N-1 characters hasn't been seen in the source text, the program falls back to an N-2 sequence, and so on.
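A hypothetical sketch of that generation step, using the frequency table from training (again, my own names, not the ones in babbler.rb):

```ruby
# Pick the next character at random, weighted by how often each
# candidate followed the current context in the training text;
# back off to a shorter context when this one was never seen.
def next_char(counts, context)
  loop do
    # Candidates: recorded sequences extending the context by one char.
    candidates = counts.select do |seq, _|
      seq.length == context.length + 1 && seq.start_with?(context)
    end
    unless candidates.empty?
      # Weighted random choice honoring the frequency distribution.
      total = candidates.values.sum
      pick = rand(total)
      candidates.each do |seq, count|
        return seq[-1] if (pick -= count) < 0
      end
    end
    return nil if context.empty? # nothing trained at all
    context = context[1..]       # fall back to a shorter context
  end
end

counts = { "a" => 2, "b" => 2, "ab" => 2, "ba" => 1 }
next_char(counts, "a") # only "ab" follows "a", so this yields "b"
```

With an empty context this degenerates to picking characters by raw frequency, which is exactly the order-1 behavior shown below.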
Let's see an example. Using the book "The Prince" by Niccolò Machiavelli as training data, here's a fragment generated by an order-1 Babbler:
"gstkln tyeannioi rfa iaeafalrtutficbah tat eri de oesowaitpt neeff p odhi nn .ett borege r susnhsardlreneftgbnwauihn"
This is garbage. Essentially, an order-1 Babbler generates random characters based on their frequency in the source text. It gets better with higher order Babblers:
Order 2: "pom so owins ase is ibend antusirty hothlest r t nnd cthoath we wnt ony as y sizistaul, cos."
Order 3: "sse mays ing per bringet beforgeopers to st yound thus en of wily pir i same of cousembing of the vin to came they caussuccul is much aggrands"
Order 4: "runnibaldo, whers, count, and in had the for of meason thirds heles, markable in ought a luccion of france cound seat to blus, on of his rose they againtate trodus, he prince them will for have thing thould by have he abilitated the prince fro"
Here you can see that words start taking a definite English-like form -- they can be pronounced in English -- even if they are meaningless. Compare this to an order-4 Babbler trained on Don Quijote, by Cervantes (a well-known Spanish writer):
"no la tale pensigida, porque vaya ena gana, sea desto, y, aun llencio que durmino a mayor de su mores quelevas ciel tien del me dar locura razos mucho malano que alma: no hacer punto de me de sea; pensaminea quijo me hartida"
Lastly, here's an order-5 Babbler for The Prince:
"oned again they always or maste; and not. in 1510. thing except down they have saw the duke, and was defend when case helples. fifteen very were admitted by those offerent him outral; and intered himself anothing his notwith him near again"
The input to Babbler is a tokenizer object, which must implement the each() method. This allows babbling either characters (what I've shown you so far) or words. Here's an example of an order-4 Babbler that uses entire words as tokens:
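A tokenizer satisfying that interface is just any object whose each() yields one token at a time. Here's what a word tokenizer might look like (the class name is my own; babbler.rb may name things differently):

```ruby
# Splits text on whitespace and yields one word per call,
# matching the each() interface described above.
class WordTokenizer
  def initialize(text)
    @text = text
  end

  def each
    @text.scan(/\S+/) { |word| yield word }
  end
end

tokens = []
WordTokenizer.new("the prince was wise").each { |w| tokens << w }
# tokens => ["the", "prince", "was", "wise"]
```

A character tokenizer would be identical except that each() yields one character at a time.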
"should now take part; but nothing being concluded, Oliverotto da Fermo was sent to propose that if the duke wished to undertake an expedition against Tuscany they were ready; if he did not wish that the fortresses, which he did not recognize any need for doing so, he begged Castruccio to pardon the other members of his family by reason of the wrongs recently inflicted upon them."
If you're interested, here's the source: babbler.rb. It's written in Ruby, but may contain many Javaisms since I'm still getting used to the language (I started learning Ruby this weekend, and this is my version of a "Hello World" -- on steroids).