
Secondary sorting aka sorting values in Hadoop’s Map/Reduce programming paradigm

Introduction

Sometimes, we would like to sort the values coming into the Reducer of a Hadoop Map/Reduce (M/R) Job. Hadoop sorts the keys, not the values, so you have to sort the values indirectly by using a combination of implementations. They are as follows.

  1. Use a composite key.
  2. Extend org.apache.hadoop.mapreduce.Partitioner.
  3. Extend org.apache.hadoop.io.WritableComparator.

Other tutorials that explain this approach to sorting the values going into a Reducer are available at the links below. In this blog, I summarize what I have learned from those links and also provide a self-contained example. The main difference between this blog and the links below is that I show how to do this using the new M/R API (i.e. org.apache.hadoop.mapreduce.*).
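To make the three pieces concrete, here is a bare-bones sketch (not the full working example from this post) of what a composite key, a natural-key Partitioner, and a natural-key grouping comparator might look like with the new API. The class and field names are mine and purely illustrative.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;
import org.apache.hadoop.mapreduce.Partitioner;

// 1. The composite key: the natural (grouping) key plus the value to sort by.
public class CompositeKey implements WritableComparable<CompositeKey> {
  private Text naturalKey = new Text();
  private Text secondaryValue = new Text();

  public void set(String natural, String secondary) {
    naturalKey.set(natural);
    secondaryValue.set(secondary);
  }

  public Text getNaturalKey() { return naturalKey; }

  @Override
  public void write(DataOutput out) throws IOException {
    naturalKey.write(out);
    secondaryValue.write(out);
  }

  @Override
  public void readFields(DataInput in) throws IOException {
    naturalKey.readFields(in);
    secondaryValue.readFields(in);
  }

  // Sort by the natural key first, then by the secondary value.
  @Override
  public int compareTo(CompositeKey other) {
    int cmp = naturalKey.compareTo(other.naturalKey);
    return (cmp != 0) ? cmp : secondaryValue.compareTo(other.secondaryValue);
  }
}

// 2. Partition on the natural key only, so all records sharing a natural key
// go to the same Reducer.
class NaturalKeyPartitioner extends Partitioner<CompositeKey, Text> {
  @Override
  public int getPartition(CompositeKey key, Text value, int numPartitions) {
    return (key.getNaturalKey().hashCode() & Integer.MAX_VALUE) % numPartitions;
  }
}

// 3. Group on the natural key only, so one reduce() call sees all of the
// (already sorted) values for that natural key.
class NaturalKeyGroupingComparator extends WritableComparator {
  protected NaturalKeyGroupingComparator() {
    super(CompositeKey.class, true);
  }

  @Override
  public int compare(WritableComparable a, WritableComparable b) {
    CompositeKey k1 = (CompositeKey) a;
    CompositeKey k2 = (CompositeKey) b;
    return k1.getNaturalKey().compareTo(k2.getNaturalKey());
  }
}

These classes would then be registered on the Job, e.g. with job.setPartitionerClass(...) and job.setGroupingComparatorClass(...).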

Continue reading

The “in-mapper combining” design pattern for Map/Reduce programming in Java

Introduction

I am reading a book by Lin and Dyer (2010). This book is very informative about designing efficient algorithms under the Map/Reduce (M/R) programming paradigm. Of particular interest is the “in-mapper combining” design pattern that I came across while reading this book. As if engineers and data miners did not have to change their way of thinking enough while adapting to the M/R programming paradigm, our change in thinking and development must also be sensitive to the particular M/R framework as well. The in-mapper combining design pattern is meant to address some issues with M/R programming, and in particular, M/R programming on the Hadoop platform. In this blog I will discuss the in-mapper combining design pattern and show some examples. This design pattern seems to me an excellent technical screening problem—if you are so (un)fortunate. 🙂 Hereafter, I will refer to the in-mapper combining design pattern with the acronym IMCDP.
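To give away a little of the flavor up front, here is a minimal word-count mapper sketch of the pattern using the new API (org.apache.hadoop.mapreduce.*). Instead of emitting one (word, 1) pair per token, it accumulates partial counts in a map held across map() calls and emits the aggregated counts once in cleanup(). This is my own illustrative sketch, not code from the book.

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// In-mapper combining: buffer partial counts in a map that lives across
// map() calls, then emit the aggregated counts once in cleanup().
public class InMapperCombiningWordCountMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {

  private Map<String, Integer> counts;

  @Override
  protected void setup(Context context) {
    counts = new HashMap<String, Integer>();
  }

  @Override
  protected void map(LongWritable key, Text value, Context context) {
    for (String token : value.toString().split("\\s+")) {
      if (token.isEmpty()) continue;
      Integer count = counts.get(token);
      counts.put(token, (count == null) ? 1 : count + 1);
    }
  }

  @Override
  protected void cleanup(Context context) throws IOException, InterruptedException {
    for (Map.Entry<String, Integer> e : counts.entrySet()) {
      context.write(new Text(e.getKey()), new IntWritable(e.getValue()));
    }
  }
}

The trade-off, of course, is memory: the in-mapper map must fit in the heap, so for very large key spaces you may have to flush it periodically.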
Continue reading

Map/Reduce Text Mining Toolkit (MRTMT) version 0.3 Released

I am going to continue on the Map/Reduce Text Mining Toolkit (MRTMT) API in this blog post. I have worked on it a little bit more, and now I will be releasing v0.3. The added improvements include allowing the user to specify the local and global weights used to build the vector space model (VSM).
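As a quick reminder of what local and global weights are, the classic example is TF-IDF, where the weight of a term in a document is a local weight (how often the term occurs in that document) multiplied by a global weight (how rare the term is across the whole corpus). The snippet below is purely illustrative; the method names are mine and are not MRTMT's API.

// Illustrative only: TF-IDF expressed as a local weight times a global weight.
public class WeightExample {

  // Local weight: how often the term occurs in this one document.
  static double localWeight(int termFrequencyInDoc) {
    return termFrequencyInDoc;
  }

  // Global weight: inverse document frequency over the whole corpus.
  static double globalWeight(int numDocs, int numDocsContainingTerm) {
    return Math.log((double) numDocs / numDocsContainingTerm);
  }

  public static void main(String[] args) {
    // e.g. a term appearing 3 times in a document, in 10 of 1,000 documents overall
    double weight = localWeight(3) * globalWeight(1000, 10);
    System.out.println(weight); // ~13.8
  }
}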
Continue reading

Map/Reduce Text Mining Toolkit (MRTMT) version 0.2 Released

I recently released MRTMT v0.1. In that article, I stated there was still a lot of work to be done. Since then, I have simplified the process of building a vector space model (VSM) in a newer version, MRTMT v0.2. The full source code may be downloaded here. Again, this API is released under the Apache 2.0 license.

Continue reading

A Simple Toolkit to Create a Vector Space Model using Map/Reduce

In this blog I will discuss a simple toolkit you may use to create a vector space model (VSM) (Salton 75). The toolkit is called the Map/Reduce Text Mining Toolkit (MRTMT); however, for now, it does not cover the entire scope of text mining and merely creates a VSM from text documents.

The purpose of MRTMT is to create a VSM from a very large corpus of documents using Hadoop’s M/R programming paradigm. For a smaller corpus of documents, please visit the Word Vector Tool.

Continue reading

Computing Pearson Correlation using Hadoop’s Map/Reduce (M/R) Paradigm

Last time I analyzed Mahout’s collaborative filtering algorithm. In this blog, I will be writing about computing the canonical Pearson correlation between two variables for a set of data using Hadoop’s M/R paradigm. If you have already written your own M/R tasks and Jobs, this tutorial is not for you. If you are just starting out, this article might help you get started.
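As a refresher before the M/R part, the Pearson correlation can be computed from a handful of running sums, which is exactly what makes it a natural fit for M/R: Mappers can emit partial sums and a Reducer can add them up before applying the final formula. Here is a plain-Java sketch of that final computation (not the M/R code itself):

// Pearson correlation from running sums: n, sum(x), sum(y), sum(x^2), sum(y^2), sum(xy).
// In an M/R setting, Mappers can emit these partial sums and the Reducer combines
// them before applying this formula.
public class Pearson {

  static double correlation(double[] x, double[] y) {
    int n = x.length;
    double sumX = 0, sumY = 0, sumX2 = 0, sumY2 = 0, sumXY = 0;
    for (int i = 0; i < n; i++) {
      sumX += x[i];
      sumY += y[i];
      sumX2 += x[i] * x[i];
      sumY2 += y[i] * y[i];
      sumXY += x[i] * y[i];
    }
    double numerator = n * sumXY - sumX * sumY;
    double denominator =
        Math.sqrt(n * sumX2 - sumX * sumX) * Math.sqrt(n * sumY2 - sumY * sumY);
    return numerator / denominator;
  }

  public static void main(String[] args) {
    double[] x = { 1, 2, 3, 4, 5 };
    double[] y = { 2, 4, 6, 8, 10 }; // perfectly correlated with x
    System.out.println(correlation(x, y)); // prints 1.0 (within floating-point error)
  }
}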
Continue reading

Learning the Mahout v0.4 Collaborative Filtering API

I have been studying the Mahout v0.4 API. Mahout is a machine learning API built in Java that may be used on top of Hadoop. In particular, I have been digging into the clustering and collaborative filtering code of Mahout. In this blog, I will not say much, since most of what I would be saying was already put into Word and PowerPoint files. I will merely give you the link to download the document and slides.

The document and slide deck are licensed under the Creative Commons Attribution 3.0 Unported License.

I think anybody interested in machine learning and/or Hadoop’s Map/Reduce (M/R) programming paradigm may find the files helpful. The document is a bit more detailed than the slides, and it also uses a working toy example to let the reader observe the inputs and output of the Map/Reduce classes/tasks/jobs.

I did not make too many subjective statements in the document and slides. But in this blog post, let me say that I find the API fascinating. First, if you just want to know how to implement your own M/R Jobs, study this API, since it uses a lot of data structures and “tricks” in intermediary steps to achieve the machine learning goals. Second, some algorithms in this API (i.e. the clustering algorithms) can run in standalone mode or on a Hadoop cluster (you do not necessarily need a Hadoop cluster or Cygwin + Hadoop to use this API). Third, they have a very good (progressive and responsive) team working on this API, and you can study the code for real-world best practices. For example, in the recommender algorithm (collaborative filtering), it is interesting to note the required input. The required input is just a plain old text file where each line holds the user’s ID, item’s ID, and preference weight (the line is comma-delimited). This input data format consequently means each user’s association with an item is its own record. Usually, the way I have worked with data is that a record is equivalent to a user. You may find that “billions” of records in Mahout do not equate with your own familiarity or understanding of a record (e.g. as in a database).
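For example, an input file for the recommender might look like the commented lines at the top of the snippet below, and it can be loaded with FileDataModel. The rest is a minimal sketch of the non-distributed Taste recommender API as I understand it in v0.4; the file name and neighborhood size are arbitrary, and this is not code from the document or slides.

// data.csv -- one (user, item, preference) association per line:
//   1,101,5.0
//   1,102,3.0
//   2,101,2.0
//   2,103,4.5
import java.io.File;
import java.util.List;

import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class SimpleRecommenderExample {
  public static void main(String[] args) throws Exception {
    DataModel model = new FileDataModel(new File("data.csv"));
    UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
    UserNeighborhood neighborhood = new NearestNUserNeighborhood(2, similarity, model);
    Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);

    // Recommend up to 1 item for user 1.
    List<RecommendedItem> recommendations = recommender.recommend(1, 1);
    for (RecommendedItem item : recommendations) {
      System.out.println(item.getItemID() + " : " + item.getValue());
    }
  }
}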

At any rate, I hope the document and slide files help you to understand Mahout’s API a little bit better. Enjoy. Zai jian.

How to display an End User License Agreement (EULA) in Windows Phone 7

Introduction

In this blog, I will demonstrate a way of displaying an End User License Agreement (EULA) in a Windows Phone 7 (WP7) application (app). Why is a blog like this one necessary? To be honest, showing an EULA is not as easy as it seems. Here are some problems that I have encountered.

1. You usually want the EULA to be the first thing a user sees after they launch your app. One may think, “That’s easy, put it in a page.” The problem is that you do not really want to put an EULA into a page. Placing an EULA into its own page means that that page will be placed on the page backstack. This can lead to a whole host of problems. For example, imagine you have a page, MainPage.xaml, and in that page you detect whether the user has accepted the EULA; if not, you navigate to EulaPage.xaml. Once the user is on the EulaPage, if they accept the EULA, you can navigate back to MainPage. But what happens if the user declines the EULA? You can navigate back to MainPage too, but what you really want is to quit the application. That means you have to pass some parameter back to MainPage from EulaPage, which is very possible (pass a querystring value or a global value in App). But then in the MainPage you have to detect (most likely by overriding Page.OnNavigatedTo) whether any parameters were passed in and handle them appropriately. This can quickly make a mess of your coding logic, as you may have to handle multiple unrelated concerns when MainPage is navigated to. Another good discussion of why not to use a separate XAML page to display an EULA is available at Exiting a Windows Phone Application.

2. An EULA is very long. In WP7, you cannot display very long text in a TextBlock control. Even if you place a TextBlock inside a ScrollViewer, you will not be able to show very long text because no UIElement can be larger than 2048 pixels (in width or height), unless, of course, your EULA is very short. The advised method, or at least one method, of displaying very long text is to break it up and store the pieces inside multiple controls (e.g. TextBlocks).
Continue reading

How to use and not use iText and jFreeChart

Introduction

In this blog, I will talk about iText, a Java API for creating/manipulating PDFs, and jFreeChart, a Java API for creating charts/graphs. I will make some suggestions, based on my experience with iText and jFreeChart, on how to use AND not use these APIs for generating PDFs with charts/graphs. I really wanted to write about this topic because I found a lot of examples showing how to use iText and jFreeChart that resulted in unacceptable quality for professional use or display. On the other hand, I did see one example that showed how to generate high quality charts/graphs using iText and jFreeChart; however, I considered it incomplete as I was left wondering about “what-ifs.”

Before we proceed, you will need iText v5.0.1 and jFreeChart v1.0.13 to get the following examples to compile and execute. You may download iText at http://itextpdf.com/ and jFreeChart at http://www.jfree.org/.
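To preview the main point: the low-quality examples typically render the chart to a bitmap and embed the image in the PDF, while the high-quality approach draws the chart as vector graphics onto a PdfTemplate through a Graphics2D. Below is a minimal sketch of the latter approach (using the iText 5.x package names); sizing and fonts are kept to bare defaults here.

import java.awt.Graphics2D;
import java.awt.geom.Rectangle2D;
import java.io.FileOutputStream;

import org.jfree.chart.ChartFactory;
import org.jfree.chart.JFreeChart;
import org.jfree.data.general.DefaultPieDataset;

import com.itextpdf.text.Document;
import com.itextpdf.text.pdf.DefaultFontMapper;
import com.itextpdf.text.pdf.PdfContentByte;
import com.itextpdf.text.pdf.PdfTemplate;
import com.itextpdf.text.pdf.PdfWriter;

// Draw the chart as vector graphics onto a PdfTemplate instead of embedding
// a rasterized image; this is what keeps the output crisp at any zoom level.
public class ChartToPdf {
  public static void main(String[] args) throws Exception {
    DefaultPieDataset dataset = new DefaultPieDataset();
    dataset.setValue("A", 40);
    dataset.setValue("B", 60);
    JFreeChart chart = ChartFactory.createPieChart("Example", dataset, true, true, false);

    Document document = new Document();
    PdfWriter writer = PdfWriter.getInstance(document, new FileOutputStream("chart.pdf"));
    document.open();

    float width = 500, height = 400;
    PdfContentByte canvas = writer.getDirectContent();
    PdfTemplate template = canvas.createTemplate(width, height);
    Graphics2D g2 = template.createGraphics(width, height, new DefaultFontMapper());
    chart.draw(g2, new Rectangle2D.Double(0, 0, width, height));
    g2.dispose();

    canvas.addTemplate(template, 50, 350); // position on the page
    document.close();
  }
}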
Continue reading

How to create an in-memory PDF report and send as an email attachment using iText and Java

In this blog entry, I will show with a few lines of code how to create an in-memory PDF report and send it as an email attachment. Why is this exercise or illustration important? It is important to me because I’m involved in a lot of report generation projects where the reporting logic and display center around PDFs, and due to server constraints (i.e. space), it is not desirable to store these PDF reports as files; the goal is to generate PDF reports on the fly and send them off to the client. Furthermore, I found a lot of search results from the major search engines that discussed the same issue, but no clear example was given in Java. Two such discussions are below.

On the other hand, a full example in C# can be found here: http://stackoverflow.com/questions/1196059/itextsharp-sending-in-memory-pdf-in-an-email-attachment.
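One way to do this in Java is to write the PDF into a ByteArrayOutputStream instead of a FileOutputStream, wrap the bytes in a ByteArrayDataSource, and attach that to a JavaMail message. Here is a minimal sketch, assuming iText 5.x and JavaMail; the addresses and SMTP host are placeholders, and error handling is omitted.

import java.io.ByteArrayOutputStream;
import java.util.Properties;

import javax.activation.DataHandler;
import javax.mail.Message;
import javax.mail.Session;
import javax.mail.Transport;
import javax.mail.internet.InternetAddress;
import javax.mail.internet.MimeBodyPart;
import javax.mail.internet.MimeMessage;
import javax.mail.internet.MimeMultipart;
import javax.mail.util.ByteArrayDataSource;

import com.itextpdf.text.Document;
import com.itextpdf.text.Paragraph;
import com.itextpdf.text.pdf.PdfWriter;

public class InMemoryPdfMailer {
  public static void main(String[] args) throws Exception {
    // 1. Build the PDF entirely in memory.
    ByteArrayOutputStream baos = new ByteArrayOutputStream();
    Document document = new Document();
    PdfWriter.getInstance(document, baos);
    document.open();
    document.add(new Paragraph("Hello, this report was never written to disk."));
    document.close();

    // 2. Attach the in-memory bytes to an email.
    Properties props = new Properties();
    props.put("mail.smtp.host", "localhost"); // placeholder mail server
    Session session = Session.getInstance(props);

    MimeMessage message = new MimeMessage(session);
    message.setFrom(new InternetAddress("sender@example.com"));
    message.setRecipient(Message.RecipientType.TO, new InternetAddress("recipient@example.com"));
    message.setSubject("Your report");

    MimeBodyPart text = new MimeBodyPart();
    text.setText("Please find the report attached.");

    MimeBodyPart attachment = new MimeBodyPart();
    attachment.setDataHandler(new DataHandler(
        new ByteArrayDataSource(baos.toByteArray(), "application/pdf")));
    attachment.setFileName("report.pdf");

    MimeMultipart multipart = new MimeMultipart();
    multipart.addBodyPart(text);
    multipart.addBodyPart(attachment);
    message.setContent(multipart);

    Transport.send(message);
  }
}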
Continue reading