MapReduce - Working & Functionality

MapReduce is a programming model for processing large amounts of data in a distributed manner, across many machines in a cluster.

MapReduce has 2 phases:

1) Map

2) Reduce

𝐌𝐚𝐩 :

Map is a piece of code that runs in parallel on each block of data.

The mapper's output is intermediate output, not the final result.

𝐑𝐞𝐝𝐮𝐜𝐞𝐫 :

The output of the mapper is given as input to the reducer machine.

The reducer processes this input and produces the final results, which are stored in HDFS.

Map and Reduce both work only on key-value pairs.

Both the input and the output of map and reduce are key-value pairs.
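To make the key-value idea concrete, here is a minimal word-count sketch in plain Python. It is not Hadoop code: the function names `map_words` and `reduce_counts` are made up for illustration, and the (offset, line) input pair only mimics how a text line is commonly handed to a mapper.

```python
from typing import Iterable, Iterator, Tuple

def map_words(offset: int, line: str) -> Iterator[Tuple[str, int]]:
    """Map: input pair is (offset, line); output pairs are (word, 1)."""
    for word in line.lower().split():
        yield (word, 1)

def reduce_counts(word: str, counts: Iterable[int]) -> Tuple[str, int]:
    """Reduce: input is (word, all counts for that word); output is (word, total)."""
    return (word, sum(counts))

# Hypothetical sample input, chosen only for illustration.
print(list(map_words(0, "big data big cluster")))
print(reduce_counts("big", [1, 1]))
```

Both functions take a key and a value (or values) in and give key-value pairs out, which is the only contract the two phases share.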

𝐒𝐡𝐮𝐟𝐟𝐥𝐢𝐧𝐠 :

Shuffling is the movement of <K, V> pairs from the mapper machines to the reducer machines, so that each reducer gets the data it needs to work on.

The output of shuffling acts as the input to the reducer.

𝐒𝐨𝐫𝐭𝐢𝐧𝐠 :

Sorting is an operation done on the reducer machine to bring the same keys together in a group.

Shuffle and Sort are internal processes that are taken care of by the MR framework, not by the developer.
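The framework does this for us, but a small simulation in plain Python can show what shuffle and sort achieve. This is only a sketch: `shuffle_and_sort` is a hypothetical helper, and the two mapper outputs are made-up sample data.

```python
from itertools import groupby
from operator import itemgetter

def shuffle_and_sort(mapper_outputs):
    """Collect pairs from every mapper, sort by key, and group the values per key."""
    # Shuffle: gather the pairs produced by all mappers in one place.
    all_pairs = [pair for output in mapper_outputs for pair in output]
    # Sort: order by key so that identical keys sit next to each other.
    all_pairs.sort(key=itemgetter(0))
    # Group: pair each key with the list of all its values.
    return [(key, [value for _, value in group])
            for key, group in groupby(all_pairs, key=itemgetter(0))]

mapper_1 = [("big", 1), ("data", 1)]
mapper_2 = [("data", 1), ("big", 1), ("world", 1)]
grouped = shuffle_and_sort([mapper_1, mapper_2])
print(grouped)
```

Each (key, list of values) entry in `grouped` is the kind of input one reduce call works on.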

𝐖𝐡𝐞𝐧 𝐭𝐨 𝐮𝐬𝐞 𝐏𝐚𝐫𝐭𝐢𝐭𝐢𝐨𝐧𝐞𝐫 𝐢𝐧 𝐌𝐚𝐩𝐫𝐞𝐝𝐮𝐜𝐞 ?

--> By default the number of Reducers is 1.

--> Partitioner comes into play when we run the job with more than one reducer.

--> Partitioner decides which <K,V> pair goes to which reducer. All pairs with the same key go to the same reducer, and a good partitioner distributes the data evenly among all the reducers.

--> By default there is a system-defined hash function (HashPartitioner in Hadoop) which decides which <K,V> pairs go to which reducer.

--> We can also have our own custom partition logic.

𝐄𝐱 : All the words with length < 4 should go to reducer 1 and words with length >= 4 should go to reducer 2.
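Both kinds of partitioner can be sketched in plain Python. This is an illustration, not Hadoop's API: the function names are hypothetical, `zlib.crc32` stands in for the key's hash code, reducers are numbered from 0, and sending words of length exactly 4 to the second reducer is an assumption made here.

```python
import zlib

def hash_partition(key: str, num_reducers: int) -> int:
    """Default-style partitioner: hash of the key modulo the number of reducers."""
    # crc32 is used because Python's built-in hash() for strings changes between runs.
    return zlib.crc32(key.encode("utf-8")) % num_reducers

def length_partition(key: str) -> int:
    """Custom partitioner: words shorter than 4 letters go to reducer 0, the rest to reducer 1."""
    return 0 if len(key) < 4 else 1

words = ["big", "data", "map", "reduce", "hdfs"]
buckets = {0: [], 1: []}
for word in words:
    buckets[length_partition(word)].append(word)
print(buckets)
```

Whatever the logic, a partitioner must return the same reducer number every time it sees the same key, otherwise the values for one key would be split across reducers.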

𝐖𝐡𝐞𝐧 𝐭𝐨 𝐮𝐬𝐞 𝐜𝐨𝐦𝐛𝐢𝐧𝐞𝐫 ?

--> Combiner is used to do local aggregation on mapper machines. It reduces the amount of data that travels over the network to the reducer.

--> This also reduces the burden on the reducer.

--> We can often reuse our reducer code as the combiner code.

--> Combiners run on the mapper machines.

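--> A combiner can be used when the operation is commutative and associative, so that aggregating partial results gives the same final answer. It is not safe for something like a plain average, because the average of averages is not the overall average.

𝐄𝐱 : Min, max, sum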
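Here is a small sketch of local aggregation with sum, in plain Python rather than Hadoop code. The names `combine` and `reduce_sum` and the sample mapper outputs are made up for illustration.

```python
from collections import defaultdict

def combine(mapper_output):
    """Combiner: sum the values per key locally, on the mapper machine."""
    local_totals = defaultdict(int)
    for key, value in mapper_output:
        local_totals[key] += value
    return sorted(local_totals.items())

def reduce_sum(pairs):
    """Reducer: sum the values per key across everything it receives."""
    return dict(combine(pairs))  # same summing logic, reused as the reducer

mapper_1 = [("big", 1), ("data", 1), ("big", 1), ("big", 1)]
mapper_2 = [("data", 1), ("data", 1)]

without_combiner = reduce_sum(mapper_1 + mapper_2)                 # 6 pairs sent to the reducer
with_combiner = reduce_sum(combine(mapper_1) + combine(mapper_2))  # 3 pairs sent to the reducer

print(combine(mapper_1))
print(with_combiner == without_combiner)
```

The final result is the same either way; only the number of pairs that travel over the network changes.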

𝐅𝐥𝐨𝐰 𝐨𝐟 𝐌𝐑 :

Mapper -> Combiner -> Partitioner -> Shuffle & Sort -> Reducer

Kishan’s big data world