In this blog I will discuss a simple toolkit you may use to create a vector space model (VSM) (Salton 75). The toolkit is called, Map/Reduce Text Mining Toolkit (MRTMT), however, for now, its accomplishments does not entirely cover the scope of text mining and just merely creating a VSM from text documents.
The purpose of MRTMT is to create a VSM from a very large corpus of documents using Hadoop’s M/R programming paradigm. For a smaller corpus of document, please visit Word Vector Tool.
The input required by MRTMT is a set of sequence files where the keys are the document names and the values are the document text. There is a utility class, net.mrtmt.util.ToSequenceFileUtil, that you may use to convert your text files into a sequence file. Let’s say you have a root directory called, baseDir, and it has text files as well as sub-directories with text files (see below). You may use ToSequenceFileUtil to recurse that directory, baseDir, to produce a single sequence file. The program will recurse into all sub-directories (as deeply as possible) and append all text files into one single text file called, data.seq.
The usage in a DOS command prompt is as follows. It only requires 1 input parameter, and that is the location of the base directory with text files.
java -cp mrtmt-0.1.jar;hadoop-0.20.2-core.jar;commons-logging-1.0.4.jar net.mrtmt.util.ToSequenceFileUtil c:/baseDir
After you have finished create the data.seq file, you may copy it to your hdfs cluster. Assuming you have Cygwin + Hadoop installed, you may type in something similar to the following.
$HADOOP_HOME/hadoop dfs -copyFromLocal /path/to/data.seq hdfs://server:port/path/
Now you can run the map/reduce (M/R) Jobs to produce a VSM on the data. A more detailed explanation of the steps required to build a VSM is outlined below, however, I explain the general steps required.
- The first thing to do is to produce a list of words that will define the VSM.
- For each document, a term frequency (TF) vector is created.
- For each word, determine its inverse-document frequency (IDF).
- For each document, compute the TF-IDF vector (which will be a part of the VSM).
Before you may actually use MRTMT to produce a VSM for your corpus of text documents, a little setup is required. First, you must copy lucene-core-3.3.0.jar and commons-cli-2.0-mahout.jar to your $HADOOP/lib directory. This puts the jars in the classpath.
Select Which Words to Use
The words to use may be anything you define, but ideally, these words should be normalized, stemmed, and stop words should be removed. For MRTMT, we use Lucene’s StandardAnalyzer to process the words. Furthermore, we may define the length (how long or short) of a word to be valid for inclusion. We may also define its frequency.
Assuming you have copied data.seq to /data-seq-text (/data-seq-text/data.seq) onto HDFS and you are logged onto a Hadoop node, you may type in the following command line to generate the list of valid words that will define the VSM. As you can see, we will run the job, AltWordCountJob. We have also specified that a valid word must be of minimum length 5 (-minl) and maximum length 30 (-maxl). The minimum frequency is set at 10 (-minf) and maximum frequency is set at 200 (-maxf). The input directory is set to /data-seq-text (-Dmapred.input.dir) and the output directory is set to /results-text (-Dmapred.output.dir).
Note that we also use the -libjars flag and pass in two jar files. This flag signals to Hadoop to copy these jars to the other task nodes.
hadoop jar mrtmt-0.1.jar net.mrtmt.job.AltWordCountJob -libjars commons-cli-2.0-mahout.jar,lucene-core-3.3.0.jar -Dmapred.input.dir=/data-seq-text -Dmapred.output.dir=/results-text -minl 5 -maxl 30 -minf 10 -maxf 200
Create the Term Frequency (TF) Vectors
To create the TF vectors for each document, run the following command. In this step, as before, we define the input and output directories. Furthermore, we also specify the directory where the valid terms are located (-tdir). The TF vectors will only be built on the set of valid words we created in the step before.
Note that TF vectors are created per word per document. The TF is a local measure of the importance of the word in the document.
hadoop jar mrtmt-0.1.jar net.mrtmt.job.TfJob -libjars commons-cli-2.0-mahout.jar,lucene-core-3.3.0.jar -Dmapred.input.dir=/data-seq-text -Dmapred.output.dir=/results-tf -tdir /results-text
Create the Inverse-Document Frequency (IDF) Vectors
In this step, we want to create the IDF for each word. Unlike TF, we do not compute IDF per word per document—just per word. The IDF is the global measure of the importance/weight of the word across all documents.
First, we create a list of documents and the terms they have. This step is an optimization step that may make the whole process seem un-natural or un-easy. We could have gone straight to computing the IDF, but preliminary experimental observation has seen slow performance. In this step, we only specify the input and output directories.
hadoop jar mrtmt-0.1.jar net.mrtmt.job.DocumentJob -libjars commons-cli-2.0-mahout.jar,lucene-core-3.3.0.jar -Dmapred.input.dir=/data-seq-text -Dmapred.output.dir=/results-docs
After we have created a list of documents and the terms in the document, we are then ready to compute the IDF per word. In this step, in addition to the input and output directories, we also specify the directory (-ddir) where previous documents and their terms are stored.
hadoop jar mrtmt-0.1.jar net.mrtmt.job.AltIdfJob -libjars commons-cli-2.0-mahout.jar,lucene-core-3.3.0.jar -Dmapred.input.dir=/results-docs -Dmapred.output.dir=/results-idf -tdir /results-text -ddir /data-seq-text
Create the VSM
Now we are finally ready to create the VSM. We simply take the TF vectors (as the input directory) and multiply it by the IDF vector (-idfDir). The output will be stored in /results-vsm as a sequence file.
hadoop jar mrtmt-0.1.jar net.mrtmt.job.VsmJob -libjars commons-cli-2.0-mahout.jar,lucene-core-3.3.0.jar -Dmapred.input.dir=/results-tf -Dmapred.output.dir=/results-vsm -idfDir /results-idf
To spit out the VSM in binary format (sequence file) to a readable format, there is another util, net.mrtmt.util.FromSequenceFileUtil, that you may use. Download your VSM file(s) from HDFS locally, and run FromSequenceFileUtil to output the data into readable format.
Summary and Conclusion
MRTMT is a simple toolkit you may use to generate a VSM. There are several alternative Jobs you may use, and some may be more appropriate than others. Examples of using MRTMT are in the README.txt file. There are definitely ways to improve MRTMT. For example, in creating the TF vectors, we may indicate term positions (not just term and frequency) thereby avoiding the step of having to produce Documents (before we compute IDF). Also, in creating TF and IDF vectors, we may use distributed caching to avoid loading the terms from HDFS for every mapper and/or reducer. At any rate, MRTMT should be a fun toolkit ready for you to study, optimize, and use. Hopefully I have more time and may make these optimizations.
Cheers and have fun with MRTMT. Sib ntsib dua nawb mog! Hasta luego!
G. Salton , A. Wong , C. S. Yang, A vector space model for automatic indexing, Communications of the ACM, v.18 n.11, p.613-620, Nov. 1975