I am going to continue on the Map/Reduce Text Mining Toolkit (MRTMT) API in this blog post. I have worked on it a little bit more, and now I will be releasing v0.3. The added improvements include allowing the user to specify the local and global weights used to build the vector space model (VSM).
If you look at this page on Latent Semantic Indexing (LSI), you will see that the value of each term in each document (in the VSM) is the product of a local weight and a global weight. The local weight usually considers a term only with respect to the other terms in the same document. The global weight, on the other hand, may take into consideration the distribution of the term across all documents. For MRTMT v0.3, I have implemented all the local and global weights described at the LSI link above. The local weights are as follows.
- term frequency (tf)
- binary
- log
- augmented normalized term frequency (augnorm)
The global weights are as follows.
- inverse document frequency (idf)
- binary
- normal
- GfIdf
- entropy
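To make the local/global decomposition concrete, here is a minimal sketch of the tf-idf combination in plain Java. The method names are hypothetical and are not part of the MRTMT API.

```java
// Illustrative sketch only: these helpers are hypothetical, not MRTMT classes.
public class TfIdfExample {

    // Local weight (tf): the raw frequency of the term within one document.
    static double tf(int termCountInDoc) {
        return termCountInDoc;
    }

    // Global weight (idf): log of (total documents / documents containing the term).
    static double idf(int totalDocs, int docsWithTerm) {
        return Math.log((double) totalDocs / docsWithTerm);
    }

    // Each cell of the VSM is the product of the local and global weights.
    static double weight(int termCountInDoc, int totalDocs, int docsWithTerm) {
        return tf(termCountInDoc) * idf(totalDocs, docsWithTerm);
    }

    public static void main(String[] args) {
        // A term appearing 3 times in a document, present in 10 of 1000 documents.
        System.out.println(weight(3, 1000, 10));
    }
}
```

Swapping in a different local or global weight (e.g. binary or entropy) only changes the two factors; the cell value is always their product.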
To run the new driver and produce a VSM, use the following command.
hadoop jar mrtmt-0.3.jar net.mrtmt.job.GenericMainJob -Dmapred.input.dir=/path/to/sequencefiles -minl 5 -maxl 30 -minf 10 -maxf 200 -lweight tf -gweight idf -tempDir /results
I will explain the parameters below. All the parameters are optional; each has either a default value or a default behavior.
- -minl is the minimum length a word must have. Default value is 2.
- -maxl is the maximum length a word must have. Default value is 30.
- -minf is the minimum frequency a word must have. If no value is passed in, then a word with any frequency will be considered.
- -maxf is the maximum frequency a word must have. If no value is passed in, then a word with any frequency will be considered.
- -lweight is the local weight. Options are tf, binary, log and augnorm. Default value is term frequency (tf).
- -gweight is the global weight. Options are idf, binary, normal, gfidf and entropy. Default value is inverse document frequency (idf).
- -tempDir is the temporary directory under which all results will be stored.
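The length and frequency filters above can be summarized in a small sketch. The class below is hypothetical, not MRTMT's actual implementation; a null bound stands in for an omitted -minf or -maxf.

```java
// Hypothetical sketch of the term-filtering rules described above.
public class TermFilter {
    final int minLength;   // -minl
    final int maxLength;   // -maxl
    final Integer minFreq; // -minf; null means no lower bound
    final Integer maxFreq; // -maxf; null means no upper bound

    TermFilter(int minLength, int maxLength, Integer minFreq, Integer maxFreq) {
        this.minLength = minLength;
        this.maxLength = maxLength;
        this.minFreq = minFreq;
        this.maxFreq = maxFreq;
    }

    // A word survives only if both its length and its frequency are in range.
    boolean accept(String word, int frequency) {
        if (word.length() < minLength || word.length() > maxLength) return false;
        if (minFreq != null && frequency < minFreq) return false;
        if (maxFreq != null && frequency > maxFreq) return false;
        return true;
    }
}
```

With the example command's settings (-minl 5 -maxl 30 -minf 10 -maxf 200), a word like "hadoop" occurring 50 times passes, while the same word occurring only 5 times is filtered out.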
In MRTMT v0.1 and v0.2, I counted all the records in the original input directory every time I needed this number. In MRTMT v0.3, the counting is done once as its own job; the result is stored in HDFS and retrieved from there whenever it is needed. Instead of four steps: 1) extract the terms, 2) compute the local weights, 3) compute the global weights, and 4) compute the VSM, I now have five steps, the new one being this counting of all records. This step is an optimization to avoid recounting the total number of documents every time it is needed (if I had 5 reducers that required this number, in the previous releases, the counting would have been done 5 times!). The M/R jobs are as follows.
- Job 1: count the number of documents
- Job 2: extract the terms
- Job 3: compute the local weights
- Job 4: compute the global weights
- Job 5: compute the VSM
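The count-once optimization behind Job 1 can be sketched outside Hadoop as simple memoization (hypothetical names; the real job persists the count to HDFS rather than in memory):

```java
// Sketch of the count-once idea; not MRTMT's actual code.
public class DocumentCountCache {
    private Long cachedCount = null;
    private int scans = 0;

    // Stands in for the expensive full scan over all input records.
    private long countAllDocuments() {
        scans++;
        return 423222L; // placeholder for scanning the input directory
    }

    // Downstream consumers call this; the scan happens at most once.
    long getDocumentCount() {
        if (cachedCount == null) {
            cachedCount = countAllDocuments();
        }
        return cachedCount;
    }

    int scans() { return scans; }
}
```

However many reducers ask for the document total afterwards, the expensive scan runs only once, which is exactly what splitting the count into its own job buys in v0.3.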
I ran this latest, more flexible VSM builder on 423,222 documents from a Usenet newsgroup (using tf-idf). The running times were as follows.
- Job 1: 0 min, 40 sec
- Job 2: 2 min, 13 sec
- Job 3: 2 min, 9 sec
- Job 4: 1 min, 18 sec
- Job 5: 1 min, 18 sec
On a Hadoop cluster with 6 data/task nodes, the total time was 7 minutes and 38 seconds (compared to v0.2, which took a total of 9 minutes and 34 seconds). Not bad at all. Of course, these times should be taken with caution, as there may be external factors (failed tasks, network issues, corrupt files, etc.).
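As a quick sanity check, the per-job times do sum to the reported total:

```java
public class RunningTime {
    public static void main(String[] args) {
        // Per-job times in seconds: 0:40, 2:13, 2:09, 1:18, 1:18.
        int[] seconds = {40, 2 * 60 + 13, 2 * 60 + 9, 60 + 18, 60 + 18};
        int total = 0;
        for (int s : seconds) total += s;
        System.out.println(total / 60 + " min, " + total % 60 + " sec"); // prints "7 min, 38 sec"
    }
}
```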
As always, download the new API by clicking here.
Cheers and happy programming! See you again! Goodbye!