Ever wondered what Einstein would say if he were alive today? Well, you may not have to wonder anymore. It turns out that we can generate new statements by Einstein (or anyone) by using approximations of Hidden Markov Models (HMMs) based on a corpus of previous statements. In another open source software release, called HMM Text Synthesizer, I have implemented a bigram approximation of an HMM to synthesize new sentences from a corpus of existing statements. I won’t go into explaining HMMs or bigrams/n-grams (consult the links above for starters), but I will quickly go over how to use HMM Text Synthesizer.
HMM Text Synthesizer is available for download as a binary distribution, and its source is also available for download. It is licensed under the Apache 2.0 license. To download the binary distribution, click here. To download the source distribution, click here. After you download the binary distribution, unzip it. In the unzipped directory, you will find a file named run.bat; execute it to start the program. Note that you will need a Java Runtime Environment (JRE) v1.4.
When HMM Text Synthesizer starts, you will see something like the following screen shot. The top text area is for input and the bottom text area is for output. The input text area holds the corpus of sentences used to train the bigram model. You can type text into it, or click the Open File button to load a text file (only text files are supported). Once you have text in the input text area, click the Generate button to train the bigram model and generate new statements. There is also a Normalize Text checkbox; if it is checked, the input text is converted to lower case before training begins.
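To make the train-and-generate step more concrete, here is a minimal sketch of how a bigram generator of this kind could work. It is not the actual HMM Text Synthesizer source: the class name BigramSketch, its methods, and the sentence start/end markers are all made up for illustration, and it uses newer Java syntax than a v1.4 JRE supports. The normalize flag plays the role of the Normalize Text checkbox.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Random;

/** Hypothetical bigram text generator; an illustration, not the tool's code. */
class BigramSketch {
    // Assumed markers for sentence boundaries (not taken from the tool).
    private static final String START = "<s>";
    private static final String END = "</s>";

    // For each word, every word seen right after it. Duplicates are kept,
    // so a uniform pick from the list follows the observed bigram frequencies.
    private final Map<String, List<String>> followers = new HashMap<>();
    private final boolean normalize;
    private final Random random;

    BigramSketch(boolean normalize, long seed) {
        this.normalize = normalize;
        this.random = new Random(seed);
    }

    /** Records the word-to-next-word transitions of one sentence. */
    void train(String sentence) {
        String text = normalize ? sentence.toLowerCase(Locale.ROOT) : sentence;
        text = text.trim();
        if (text.isEmpty()) {
            return; // nothing to learn from a blank line
        }
        String previous = START;
        for (String word : text.split("\\s+")) {
            followers.computeIfAbsent(previous, k -> new ArrayList<>()).add(word);
            previous = word;
        }
        followers.computeIfAbsent(previous, k -> new ArrayList<>()).add(END);
    }

    /** Walks the learned transitions until END is drawn or maxWords is reached. */
    String generate(int maxWords) {
        StringBuilder out = new StringBuilder();
        String current = START;
        for (int i = 0; i < maxWords; i++) {
            List<String> options = followers.get(current);
            if (options == null) {
                break; // untrained model
            }
            String next = options.get(random.nextInt(options.size()));
            if (next.equals(END)) {
                break;
            }
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(next);
            current = next;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        BigramSketch model = new BigramSketch(true, 42L);
        model.train("Imagination is more important than knowledge");
        model.train("Knowledge is limited");
        System.out.println(model.generate(20));
    }
}
```

Because each word depends only on the one before it, the output can splice the start of one training sentence onto the end of another, which is where the odd but grammatical-sounding statements come from.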
I took some quotes by Einstein from this website and used them as input to HMM Text Synthesizer. Here are some interesting statements generated by the bigram model from that corpus of Einstein quotes (with and without the text normalized).
- Imagination is a very long cat.
- free ourselves from the distinction between past present concern.
- awe is keeping your difficulties in the world a very persistent illusion.
- the world is as far more certain as first love?
- can’t solve problems by fear of mathematics.
- our mathematical equation stands forever.
- Do not play dice.
How much input text you can load is limited by your computer’s memory, and how fast it is processed is limited by its processing power. By default, the run.bat script sets the minimum and maximum heap sizes to 256M and 512M. I did try to load a large book from Project Gutenberg; however, I got an OutOfMemoryError. As such, an important improvement to HMM Text Synthesizer will be to use a more efficient means of storing the bigram model. To get better accuracy, a trigram model, which conditions each word on the two preceding words rather than one, could be used.
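As one possible direction for the storage improvement, the sketch below keeps a single count per distinct bigram instead of one entry per occurrence, and replaces each word with an integer ID so that every string is stored only once. This is an assumption about what "more efficient" could look like, not how HMM Text Synthesizer stores its model; the class CompactBigramCounts and its methods are hypothetical, and whether the savings would be enough to load a whole book would have to be measured.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical count-based bigram store; an illustration, not the tool's code. */
class CompactBigramCounts {
    // Each distinct word gets a small integer ID, assigned in order of first use.
    private final Map<String, Integer> wordToId = new HashMap<>();

    // One entry per distinct bigram: the key packs both word IDs into a long.
    private final Map<Long, Integer> counts = new HashMap<>();

    private int idOf(String word) {
        Integer id = wordToId.get(word);
        if (id == null) {
            id = wordToId.size();
            wordToId.put(word, id);
        }
        return id;
    }

    // High 32 bits hold the previous word's ID, low 32 bits the next word's ID.
    private static long key(int previousId, int nextId) {
        return ((long) previousId << 32) | (nextId & 0xFFFFFFFFL);
    }

    /** Records one occurrence of the bigram (previous, next). */
    void add(String previous, String next) {
        counts.merge(key(idOf(previous), idOf(next)), 1, Integer::sum);
    }

    /** Returns how often next was seen right after previous, or 0 if never. */
    int count(String previous, String next) {
        Integer p = wordToId.get(previous);
        Integer n = wordToId.get(next);
        if (p == null || n == null) {
            return 0;
        }
        return counts.getOrDefault(key(p, n), 0);
    }

    int distinctBigrams() {
        return counts.size();
    }

    int vocabularySize() {
        return wordToId.size();
    }
}
```

With this layout a bigram that occurs a thousand times costs the same as one that occurs once. Generating text would then mean drawing the next word in proportion to these counts, which the sketch leaves out.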
At any rate, enjoy this software and play with it for fun.