I recently released MRTMT v0.1. In that article, I stated that there was still a lot of work to be done. Since then, I have simplified the process of building a vector space model (VSM) in a newer version, MRTMT v0.2. The full source code may be downloaded here. Again, this API is released under the Apache 2.0 license.
In MRTMT v0.1, I showed how a user may sequentially run Map/Reduce (M/R) Jobs to build a VSM. In MRTMT v0.2, however, I have built a driver program that runs all of the M/R Jobs. Assuming you have a directory of sequence files, where each row represents a text file, you may simply issue the following shell command.
hadoop jar mrtmt-0.2.jar net.mrtmt.job.AltMainJob -Dmapred.input.dir=/path/to/sequencefiles -minl 5 -maxl 30 -minf 10 -maxf 200
One thing you will notice is that there is no longer any need to copy dependent jars to $HADOOP/lib and use -libjars to reference them. When you use ant to build the source code, the dependent jar files are now placed inside the /lib directory of mrtmt-0.2.jar.
Another improvement is a reduction in the number of steps. In MRTMT v0.1, we had to go through 5 M/R Jobs. In MRTMT v0.2, with some helper data structures, we only go through 4 M/R Jobs. The four M/R Jobs are as follows.
- Job 1: Extract the terms to use to build the VSM.
- Job 2: Extract the term frequencies (TFs) for each document.
- Job 3: Extract the inverse document frequencies (IDFs) across all documents.
- Job 4: Build the VSM.
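To make the roles of Jobs 2 to 4 concrete, here is a minimal in-memory sketch in plain Java. It is not the MRTMT implementation and does not use Hadoop. The class name VsmSketch, its method names, and the weighting tf * ln(N / df) are assumptions made for illustration (the toolkit's actual weighting may differ), and Job 1's term selection is skipped, so every term is kept.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not part of the MRTMT API.
final class VsmSketch {

    // Role of Job 2: count how often each term occurs in one document.
    static Map<String, Integer> termFrequencies(List<String> tokens) {
        Map<String, Integer> tf = new HashMap<>();
        for (String term : tokens) {
            tf.merge(term, 1, Integer::sum);
        }
        return tf;
    }

    // Role of Job 3: idf(t) = ln(N / df(t)), where df(t) is the number of
    // documents containing t. This formula is an assumed, common variant.
    static Map<String, Double> inverseDocumentFrequencies(List<List<String>> docs) {
        Map<String, Integer> df = new HashMap<>();
        for (List<String> doc : docs) {
            // A set ensures each document counts a term at most once.
            for (String term : new HashSet<>(doc)) {
                df.merge(term, 1, Integer::sum);
            }
        }
        Map<String, Double> idf = new HashMap<>();
        int n = docs.size();
        for (Map.Entry<String, Integer> e : df.entrySet()) {
            idf.put(e.getKey(), Math.log((double) n / e.getValue()));
        }
        return idf;
    }

    // Role of Job 4: one sparse tf-idf vector (term -> weight) per document.
    static List<Map<String, Double>> buildVsm(List<List<String>> docs) {
        Map<String, Double> idf = inverseDocumentFrequencies(docs);
        List<Map<String, Double>> vsm = new ArrayList<>();
        for (List<String> doc : docs) {
            Map<String, Double> vector = new HashMap<>();
            for (Map.Entry<String, Integer> e : termFrequencies(doc).entrySet()) {
                vector.put(e.getKey(), e.getValue() * idf.get(e.getKey()));
            }
            vsm.add(vector);
        }
        return vsm;
    }

    public static void main(String[] args) {
        List<List<String>> docs = List.of(
                List.of("hadoop", "map", "reduce", "map"),
                List.of("hadoop", "vector", "space"));
        System.out.println(buildVsm(docs));
    }
}
```

In this toy corpus, "hadoop" appears in both documents and so gets a weight of zero, while "map" appears twice in one document only and gets the largest weight. An in-memory version like this is exactly what stops scaling on a large corpus, which is why the toolkit spreads these steps across M/R Jobs.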
I tested MRTMT v0.2 on a set of 423,222 documents from a usenet group. The time taken by each M/R Job is as follows.
- Job 1: 3 minutes 34 seconds
- Job 2: 1 minute 35 seconds
- Job 3: 1 minute 22 seconds
- Job 4: 3 minutes 3 seconds
So, the total time is 9 minutes 34 seconds. Not bad, in my opinion. I had a Hadoop cluster of 6 data/task nodes. The VSM is built upon 176,625 words. I tried other tools that do not use the M/R framework to achieve similar results, and I invariably got an OutOfMemoryError. So, M/R is definitely a way to scale.
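The total quoted above can be checked by summing the four per-job timings. The small sketch below does that with java.time.Duration; the class name JobTimingSketch is hypothetical and only the four timings come from the test run.

```java
import java.time.Duration;
import java.util.List;

// Hypothetical helper for checking the arithmetic; not part of MRTMT.
final class JobTimingSketch {

    // Adds up the wall-clock time of each job.
    static Duration total(List<Duration> jobs) {
        Duration sum = Duration.ZERO;
        for (Duration d : jobs) {
            sum = sum.plus(d);
        }
        return sum;
    }

    public static void main(String[] args) {
        List<Duration> jobs = List.of(
                Duration.ofMinutes(3).plusSeconds(34),  // Job 1
                Duration.ofMinutes(1).plusSeconds(35),  // Job 2
                Duration.ofMinutes(1).plusSeconds(22),  // Job 3
                Duration.ofMinutes(3).plusSeconds(3));  // Job 4
        Duration t = total(jobs);
        // prints "9 minutes 34 seconds"
        System.out.println(t.toMinutes() + " minutes " + (t.getSeconds() % 60) + " seconds");
    }
}
```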
Also, you will notice that the util classes in the newer API have been revised. First, net.mrtmt.util.FromSequenceFileUtil now recurses into a base directory and writes each binary sequence file out to a corresponding text file. You can pass in either a sequence file or a base directory as input. If there is a file called part-r-00001, then this util class will output part-r-00001.txt. Second, net.mrtmt.util.ToSequenceFileUtil now splits the output into 64 MB files. You may still pass in a single directory or a directory of sub-directories to ToSequenceFileUtil (it will recurse, find all text files, and append them to a sequence file). I have not tested this thoroughly yet, but it seems M/R Jobs run much more quickly if you split your input files according to your DFS block size (mine is 64 MB).
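The idea of recursing for text files and rolling over to a new output file near the block size can be sketched in plain Java as follows. This is not the code of ToSequenceFileUtil and it writes no Hadoop sequence files. The class name ChunkPlanSketch, the ".txt" suffix test, and the use of input file sizes as a stand-in for serialized output size are all assumptions made for illustration.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical illustration only; not part of the MRTMT API.
final class ChunkPlanSketch {

    // 64 MB, matching the DFS block size mentioned in the text.
    static final long BLOCK_SIZE = 64L * 1024 * 1024;

    // Recursively collects regular files under baseDir whose names end in
    // ".txt" (an assumed way of recognizing text files), in sorted order.
    static List<Path> findTextFiles(Path baseDir) throws IOException {
        try (Stream<Path> paths = Files.walk(baseDir)) {
            return paths.filter(Files::isRegularFile)
                    .filter(p -> p.getFileName().toString().endsWith(".txt"))
                    .sorted()
                    .collect(Collectors.toList());
        }
    }

    // Assigns each file, in order, to an output chunk so that a chunk's total
    // stays within maxBytes. A single file larger than maxBytes still gets a
    // chunk of its own rather than being split.
    static int[] assignChunks(long[] sizes, long maxBytes) {
        int[] chunkOf = new int[sizes.length];
        int chunk = 0;
        long used = 0;
        for (int i = 0; i < sizes.length; i++) {
            if (used > 0 && used + sizes[i] > maxBytes) {
                chunk++;   // roll over to the next output file
                used = 0;
            }
            chunkOf[i] = chunk;
            used += sizes[i];
        }
        return chunkOf;
    }

    public static void main(String[] args) throws IOException {
        Path baseDir = Paths.get(args.length > 0 ? args[0] : ".");
        List<Path> files = findTextFiles(baseDir);
        long[] sizes = new long[files.size()];
        for (int i = 0; i < sizes.length; i++) {
            sizes[i] = Files.size(files.get(i));
        }
        int[] chunkOf = assignChunks(sizes, BLOCK_SIZE);
        for (int i = 0; i < sizes.length; i++) {
            System.out.println("chunk " + chunkOf[i] + ": " + files.get(i));
        }
    }
}
```

Run with a base directory as the only argument, this prints which output chunk each text file would land in. Keeping each chunk at or below the block size is the property the paragraph above suggests may help M/R Jobs run faster.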
At any rate, cheers! I hope you may find this toolkit useful. More development coming in the future.