Map-Reduce

Slides
The slides of the presentation (October 13, 2014) about MapReduce are uploaded here. The slides of the project presentation (December 17, 2014) about Hadoop and PyMR are here.

Python implementation of MapReduce (PyMR)
The first goal of this project was to implement an easy-to-use and user-friendly Python version of Map-Reduce, in order to easily design and prototype Map-Reduce algorithms.

How to use PyMR?
This section will give you general instructions in order to use the PyMR library. The complete documentation of PyMR is available here.

Step 0: Software requirements
If you want to use the PyMR library, you must have:
 * Python 2.7.x (not Python 3.0!)
 * the PyMR library, available here.
Don't forget to import the library at the beginning of your script.

Step 1: Parsing data
The data given to the MapReduce algorithm must be files with one input per line. For instance, if you have a file containing the text "Hello World!" and the mapper needs a single word as input, then you have to transform your file into:
hello
world
You can also use the fileHelper class, available in the PyMR library, which helps you parse your input files easily. The full documentation is available here. Don't forget that you can also contribute to improving the PyMR library: if you write a new file-parsing function and think it could be helpful to others in the future, you are encouraged to submit your method :-).

Step 2: Create the Mapper and the Reducer
Since the MapReduce algorithm requires a user-defined mapper and reducer, you need to implement a mapper class and a reducer class yourself.

The Mapper
The mapper needs to have a method called "map", which takes two arguments: self and a MapContext named theContext. You can read the documentation of the MapContext class here. Here is an empty mapper class:
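The original snippet is not reproduced here; for illustration, a minimal skeleton could look as follows (the way theContext exposes input and output, shown in the comments, is an assumption, not PyMR's documented API):

```python
class MyMapper(object):
    """Hypothetical empty mapper skeleton; PyMR's real MapContext API may differ."""

    def map(self, theContext):
        # Read the current input from theContext and emit key/value
        # pairs, e.g. theContext.write(key, value).
        pass
```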

The Reducer
The reducer needs to have a method called "reduce", which takes two arguments: self and a ReduceContext named theContext. You can read the documentation of the ReduceContext class here. Here is an empty reducer class:
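As with the mapper, the original snippet is not reproduced here; a minimal skeleton could look as follows (the context method sketched in the comments is an assumption, not PyMR's documented API):

```python
class MyReducer(object):
    """Hypothetical empty reducer skeleton; PyMR's real ReduceContext API may differ."""

    def reduce(self, theContext):
        # Read the current key and its list of values from theContext,
        # then emit the reduced pair, e.g. theContext.write(key, result).
        pass
```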

Step 3: Create the MapReducer and launch the algorithm
Once the mapper and reducer are created, you just need to instantiate them, create an instance of the MapReduce class and call the execute routine. You can see a generic example below. The execute routine outputs a dictionary with all key/value pairs generated by your reducer. For more information, you can read the documentation of the MapReduce class here.
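PyMR itself is not reproduced here, so the sketch below uses a toy single-threaded engine with the same overall shape: map every input, group by key, reduce every group, and return a dictionary. The class and method names (the context object, execute) are assumptions modelled on the description above, not PyMR's actual API.

```python
from collections import defaultdict

class ToyContext(object):
    """Stand-in for PyMR's Map/ReduceContext: holds the current input
    and collects emitted key/value pairs."""
    def __init__(self, value):
        self.value = value
        self.emitted = []
    def write(self, key, value):
        self.emitted.append((key, value))

class ToyMapReduce(object):
    """Minimal single-threaded engine mirroring the execute flow."""
    def __init__(self, mapper, reducer):
        self.mapper = mapper
        self.reducer = reducer
    def execute(self, inputs):
        # Map phase: run the mapper on every input, collect pairs.
        groups = defaultdict(list)
        for item in inputs:
            ctx = ToyContext(item)
            self.mapper.map(ctx)
            for key, value in ctx.emitted:
                groups[key].append(value)
        # Reduce phase: run the reducer once per key.
        result = {}
        for key, values in groups.items():
            ctx = ToyContext((key, values))
            self.reducer.reduce(ctx)
            for out_key, out_value in ctx.emitted:
                result[out_key] = out_value
        return result

class WordMapper(object):
    def map(self, theContext):
        theContext.write(theContext.value, 1)

class CountReducer(object):
    def reduce(self, theContext):
        key, values = theContext.value
        theContext.write(key, sum(values))

engine = ToyMapReduce(WordMapper(), CountReducer())
counts = engine.execute(["hello", "world", "hello"])
# counts == {'hello': 2, 'world': 1}
```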

Warning: since you write your own mapper and reducer, you might be tempted to do unusual things inside the map and reduce methods, like creating and accessing instance variables. Note that if you intend to run your program with the parallel version of PyMR, the algorithm will call the map and reduce functions multiple times simultaneously, so don't write to anything outside the map and reduce functions. You can safely read shared variables, but don't write to them!

Counting words with PyMR
You can find below a complete example of counting words with PyMR. You can also run the file demo_CountingWords.py.

Input File
Create a file named dataFile in C:\mapReduceCountingWords with some content, for example:
Hello! It is a Hello world!

Output
The output file, named coutingWordsResults.txt, will be created in the folder C:\mapReduceCountingWords\. The content of this file must be:
a : 1
world : 1
is : 1
hello : 2
it : 1
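Independently of PyMR, the counts above can be checked with a few lines of plain Python, assuming the parsing step lower-cases the words and strips punctuation:

```python
import re
from collections import Counter

text = "Hello! It is a Hello world!"
# Lower-case and keep only alphabetic runs (assumed parsing behaviour).
words = re.findall(r"[a-z]+", text.lower())
counts = Counter(words)
# counts == {'a': 1, 'world': 1, 'is': 1, 'hello': 2, 'it': 1}
```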

Matrix-vector multiplication with PyMR
You can find below a complete example of a matrix-vector multiplication with PyMR. The goal of this example is to compute a matrix-vector product with MapReduce.

The general Map-Reduce algorithm for computing a matrix-vector multiplication is the following. We want to compute $$ v_i = \sum_{j = 1}^n A_{ij} b_j $$ The Map-Reduce algorithm is then:
 * Map : the input is an entry of the matrix A stored as $$(i,j,A_{ij})$$. The output is the following key-value pair : $$(i, A_{ij} b_j)$$
 * Reduce : the input is a key-values pair $$(i, [A_{i1}b_1,\dots,A_{in}b_n])$$ and the output is $$(i, \sum_{j=1}^n A_{ij}b_j) $$
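The two steps above can be sketched in a few lines of plain Python, using the small 2×2 example described in this section (1-indexed, sparse entries):

```python
from collections import defaultdict

# Sparse entries (i, j, A_ij) and the vector b for [1 2 ; 1 0] * [1 ; 1].
entries = [(1, 1, 1.0), (1, 2, 2.0), (2, 1, 1.0)]
b = {1: 1.0, 2: 1.0}

# Map: each matrix entry (i, j, A_ij) emits the pair (i, A_ij * b_j).
pairs = [(i, a * b[j]) for (i, j, a) in entries]

# Shuffle + Reduce: group by i and sum, giving v_i = sum_j A_ij * b_j.
v = defaultdict(float)
for i, value in pairs:
    v[i] += value
# v == {1: 3.0, 2: 1.0}, i.e. the vector [3 ; 1]
```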

The specific operation performed in this example is (using Matlab notation):
[1 2 ; 1 0] * [1 ; 1]
The result is indeed:
[3 ; 1]

Inputs files
Here is the content of the file "A_matrix". Create this file with the name "A_matrix" in C:\mapReduceMatrixMultiplication with the following content. Note that zero values are omitted (this is a sparse notation).
1 1 1.0
1 2 2.0
2 1 1.0
In addition, create the file "b_vector" in C:\mapReduceMatrixMultiplication with the following content.
1
1

Mapper
The following mapper loads the b vector from file into memory at initialization.

Output
The output is the file "MatrixVectorResults.txt". As expected, the file contains the following lines:
3.0
1.0

Performance
We can now start to wonder about the performance of our MapReduce implementation. The matrix-vector multiplication is a simple algorithm, well suited for performance analysis. Here, we created tridiagonal sparse matrices of increasing sizes. Execution times are presented in the following table. As expected, the algorithm is linear at the beginning, but then performance deteriorates, probably due to an overflow of the hard drive cache.

Simulation of picture similarities
In this section we will see a general model to compute similarities between pairs of pictures over a whole set of pictures. Since the goal of this project is not signal processing, we will just use a very simple model of a function which computes the similarity between two pictures.
 * Mapper: The goal of the mapper is to generate all possible pairs of pictures (without the symmetric duplicates). To do this, we create $$n$$ keys $$i = 0,1,...,n-1$$ (where $$n$$ is the number of pictures). For each key $$i$$ we emit all values $$j$$ in $$[0,n-1]$$ such that $$i\leq j$$. In our case, the input of the mapper is simply the first picture $$i$$: the key is this picture, and the value is the list of all pictures $$j$$ with $$i \leq j$$.
 * Reducer: The goal of the reducer is to analyse all key-value pairs with a function which gives a similarity measure between two pictures $$(i,j)$$. The reducer keeps the key, which is the picture $$i$$, iterates over all elements of the list, each of which is a picture $$j$$, and computes the similarity between $$i$$ and $$j$$. These values are stored in a list which is returned at the end of the execution.

Input file
The input file will contain only references to the pictures, or the pictures themselves. To keep the code simple, we will just write a number (we can suppose that with this number we can retrieve the right picture). For example, we can generate this content:
0
1
... (continue)
18
19

Mapper
The goal of the mapper is to create all pairs of possible pictures. We assume that the similarity between two pictures is symmetric, so we will only generate pairs $$(i,j)$$ where $$i \leq j$$. Also, to make the mapper simpler, we suppose that we know in advance the size of the set of pictures that need to be analysed.
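The pair-generating step can be sketched as a small function (the names here are illustrative, not PyMR's API):

```python
n = 20  # size of the picture set, assumed known in advance

def map_pictures(i):
    """For picture i, emit the key i with the list of all pictures j
    such that i <= j < n (one pair per unordered couple)."""
    return (i, list(range(i, n)))
```

For instance, map_pictures(0) pairs picture 0 with every picture in the set, while map_pictures(19) only yields (19, [19]).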

Reducer
The reducer will use the function which computes similarity.

Execution file
The execution file is really simple since the inputs are already formatted.

Output
The output is supposed to be:
11 : [11]
10 : [10]
13 : [13]
... (etc)
9 : [9]
8 : [8]
Note that the order doesn't matter. You just need to have all numbers between 0 and 19 included.

Multi-thread performance
We can now assess the advantage of having more than a single thread when the computation of the map or the reduce operation is expensive. For this test, we have defined a new function computeSimilarity. In this function, we simulate the fact that computing the similarity for a pair of pictures can take a non-negligible amount of time. Also, to avoid really long total computation times, each chunk contains only one integer. The following table then shows the evolution of the computation time as a function of the number of threads.

We can clearly see an almost linear speedup at the beginning, followed by a stagnation. This is due to the fact that we have a limited number of cores available, and also that this simple algorithm has pretty bad load balancing: the list of values for the first picture is [0, 1, 2, ..., 19] while for the last one it is only [19]!

Hadoop and MapReduce
Hadoop (official webpage) is one of the most famous implementations of a DFS (Distributed File System) and MapReduce. In this section, we provide a simple tutorial on how to install Hadoop and how to run some basic algorithms, like counting words or matrix-vector multiplication.

All the code used in this section can be downloaded here.

Prerequisites

To install and run simple Hadoop jobs, you will need
 * A Linux or Mac OS X system (Windows doesn't seem to be fully supported yet)
 * Java

Installing Hadoop
Hadoop can run in three modes:
 * 1) Standalone mode: this is the simplest mode. It only uses one core of your computer, and doesn't take advantage of the DFS. It is useful for debugging and prototyping algorithms.
 * 2) Pseudo-distributed mode: this is the intermediate level. It uses the DFS and multiple cores, but still runs on one single computer.
 * 3) Fully-distributed mode: this is the most complex setup: it uses multiple computers (a cluster).
We will show you how to install and use the first mode.

You can find an official tutorial for the first two modes here.

Download and install
First download hadoop from one of the official mirrors.

Uncompress the archive and copy it into any folder of your computer. In this tutorial, we'll assume you copy it into /Users/yourname/Hadoop/hadoop-2.5.2 (replace 2.5.2 with the version you downloaded).

For example, the path to the README.txt file is /Users/yourname/Hadoop/hadoop-2.5.2/README.txt

Update environment variables
Now, we need to modify some environment variables. Open the file /Users/yourname/Hadoop/hadoop-2.5.2/etc/hadoop/hadoop-env.sh and add or modify the following lines:
export JAVA_HOME=/library/Java/JavaVirtualMachines/jdk1.8.0_20.jdk/Contents/Home/
export HADOOP_PREFIX=/Users/yourname/Hadoop/hadoop-2.5.2
The JAVA_HOME line should be adapted to your system: it should point to the root of your Java installation (the directory containing the LICENSE and COPYRIGHT files). The HADOOP_PREFIX line assumes your installation directory is /Users/yourname/Hadoop/hadoop-2.5.2.

Hadoop is now installed. Open a terminal, navigate to /Users/yourname/Hadoop/hadoop-2.5.2/ and type:
bin/hadoop
It should display Hadoop's help.

Running MapReduce : the WordCount example
Hadoop is now installed, and you should be able to run basic MapReduce jobs using only one Java process.

Let's see how to run a basic WordCount example (the code of this example comes from both Hadoop official tutorials and here).

First, create the Java class: download the following Java class and save it as /Users/yourname/Hadoop/hadoop-2.5.2/WordCount.java

In this folder, create a folder named /Users/yourname/Hadoop/hadoop-2.5.2/input_wordcount and fill it with the text files you want to analyse, for example the Twitter data. You can have multiple files.

Then, open a terminal, navigate to the Hadoop folder, and enter the following commands.

First, update the environment variables in the terminal:
export JAVA_HOME=/library/Java/JavaVirtualMachines/jdk1.8.0_20.jdk/Contents/Home/
export PATH=$JAVA_HOME/bin:$PATH
export HADOOP_CLASSPATH=$JAVA_HOME/lib/tools.jar
Then, compile the Java class:
bin/hadoop com.sun.tools.javac.Main WordCount.java
create the JAR file:
jar cf WordCount.jar WordCount*.class
and finally run Hadoop:
bin/hadoop jar WordCount.jar WordCount input_wordcount output_wordcount
At the end, the output_wordcount folder will contain two files. The first one, named _SUCCESS, simply means everything went fine. The second one, named part-00000, contains the output of MapReduce, namely key/value pairs. In this case, the keys are the words and the values are their numbers of occurrences.

WordCount in more details
Let's now analyse the WordCount example in more details.

The WordCount.java file consists of 3 classes: the mapper, the reducer and the main class. The main class is mainly there to define the classes of the key/value pairs, of the inputs and of the outputs. The other two classes are more interesting.

Mapper
The inputs of the map function consist of 4 arguments. The first one is the key, followed by the value, then by the output collector and by a reporter. Except for the last one, this is exactly the regular Map-Reduce standard: the mapper takes one key/value pair as input, and outputs zero, one or more key/value pairs.

Basically, in this case, the value is a line from a text file, while the key is irrelevant. First, we break the line into words using the StringTokenizer class. Then, while the line contains words, we set the value to the current word and emit the pair (word, 1) to the output. This is the classical key-value pair of the WordCount example.

Reducer
The code of the reducer is even simpler. Basically, we iterate over the elements of values (with the hasNext and next methods), and we simply increment sum by each value. At the end, we emit the pair (key, sum), where key is the same key as in the input (the word) and sum is the sum of the elements in the values list.

A few results with Twitter data
Now that we know how to use Map-Reduce and Hadoop, we're ready to tackle some big problems.

The first one - maybe the easiest - would be to count for each word the number of occurrences in the tweets. To do that, the first step is to extract the tweets content. Then, we can simply run the WordCount algorithm on it. Finally, we can sort the words by the number of occurrences to extract the most frequent words.

Extension of WordCount
To avoid some of the strange results in the previous table (?????? or :) for example), we can add some criteria on the words we select, directly inside the mapper. The mapper can also take care of some "conversions" (like putting all words in lower case, etc.). The code of the map function then becomes slightly more complicated, where deAccent is a function removing accents.
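The Java code is not reproduced here; the cleaning logic can be sketched in Python, with one possible deAccent based on Unicode decomposition (the helper names are assumptions, not the original code):

```python
import unicodedata

def de_accent(s):
    """One possible deAccent: drop combining marks after Unicode decomposition."""
    return "".join(c for c in unicodedata.normalize("NFD", s)
                   if not unicodedata.combining(c))

def map_clean(line):
    """Emit (word, 1) only for purely alphabetic words, lower-cased and de-accented."""
    pairs = []
    for token in line.split():
        word = de_accent(token.lower())
        if word.isalpha():
            pairs.append((word, 1))
    return pairs
```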

In this case, the results are more regular:

Now, let's imagine we only want to count words with more than 10 letters, beginning with an "a". How can we do that using Map-Reduce? Of course, we could use the previous algorithm and then extract the data we care about. But let's be smarter: we can modify the map function directly. The code will then be the following.
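A sketch of the modified filter, in Python for brevity (the Java mapper would apply the same test before emitting a pair):

```python
def keep(word):
    """Filter stated above: more than 10 letters, beginning with an 'a'."""
    return len(word) > 10 and word.startswith("a")

words = ["architecture", "apple", "considerable", "anniversaire"]
selected = [w for w in words if keep(w)]
# selected == ['architecture', 'anniversaire']
```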

And the results are the following. Finally, if we only want to extract useful words (defined as words with more than 8 characters and beginning with an actual letter), we need the following mapper. In this case, the results are as follows. This seems interesting: there is a very high number of occurrences of the names of the Flemish provinces in Belgium. This might of course be due to the fact that there are more Dutch-speaking people than French-speaking people. Still, they appear so many times that this result might seem counter-intuitive at first sight.

The Matrix-Vector Multiplication using Hadoop
Another example of computation that can be done with Map-Reduce is the Matrix-Vector multiplication.

In this case, we want to compute $$ v_i = \sum_{j = 1}^n A_{ij} b_j $$

The Map-Reduce algorithm is then the following; the Java code for this MapReduce is given below.
 * Map: the input is an entry of the matrix A stored as $$(i,j,A_{ij})$$. The output is the following key-value pair : $$(i, A_{ij} b_j)$$
 * Reduce: the input is a key-values pair $$(i, [A_{i1}b_1,\dots,A_{in}b_n])$$ and the output is $$(i, \sum_{j=1}^n A_{ij}b_j) $$

This code can then be compiled and run by typing:
bin/hadoop com.sun.tools.javac.Main SparseMatrixVector.java
jar cf SparseMatrixVector.jar SparseMatrixVector*.class
bin/hadoop jar SparseMatrixVector.jar SparseMatrixVector b2_vector 200000 A/ output_sparseMatrixVector/
where b2_vector is the path to the $$b$$ vector, 200000 is the size of the vector, A/ is the path to the folder containing the matrix (which can be divided into small files) and output_sparseMatrixVector/ is the folder where the results should be written.

Performance
We can now try to assess the performance of MapReduce on this matrix-vector multiplication. In this example, we used sparse tridiagonal matrices.

We see that the complexity is asymptotically linear, but at the beginning the algorithm isn't efficient at all. This shows that Hadoop is useful for tackling large problems, but shouldn't be used for solving small problem instances.