How many mappers/reducers should be set when configuring Hadoop cluster?

When configuring a Hadoop Cluster whats the scientific method to set the number of mappers/reducers for the cluster?

2012-04-05 15:09
by techlad

There is no formula. It depends on how many cores and how much memory do you have. The number of mapper + number of reducer should not exceed the number of cores in general. Keep in mind that the machine is also running Task Tracker and Data Node daemons. One of the general suggestion is more mappers than reducers. If I were you, I would run one of my typical jobs with reasonable amount of data to try it out.

2012-04-05 15:42
by root1982

Thanks root.. Taking 48 GB RAM available on each machine and having a 8 core machine. Lets say we reserve 1 GB RAM for each mapred task, would the optimal value be 48 GB RAM - 1 GB for DataNode - 1 GB for TaskTracker = 46 GB available RAM. In this case should we have 8 as the mappers for 1 machine or should we increase it to say 46, considering the case that all reducers start after mappers complete - techlad 2012-04-05 15:52

Most of CPUs come with hyper threading and it is enabled by default. so, if you have 16 cpu threads, you can probably increase the numbers more. I would focus on the number of CPU. For memory, even you are not using all of it, system can always find some good use of it, like caching. 1G for the daemons is the default. I would monitor the system and consider a higher number. Most of the times mappers are running in parallel with reducers - root1982 2012-04-05 17:17

Running out of characters... So, if I were you, I would start with 10 mapper and 4 reducer. How many disks do you have? mappers are going to read in parallel. Do you have multiple disk devices - root1982 2012-04-05 17:19

Thanks root. As of now I have only 1 disk and than a RAID for redundancy. What I notice right now is that even though I allocate 1GB RAM for each mapper, when I run a top command I see it occupying around 1.6GB Virtual Memory and around 0.5 GB Real Memory. Is this behaviour normal - techlad 2012-04-06 04:19

Redundancy on each node is not suggested in general. HDFS has it own redundancy. Say you run 10 mappers will be reading the disk at the same time. For a normal 7200rpm disk, 2-3 mappers is a good number. For you system, with 48G mem and 16 cpu thread, I/O will likely to be the problem. I suggest you to get multiple disk for each node and set them up as JBOD. Regarding the memory issue, I wouldn't worry about it too much. Normally, if you specified 1G, the virtual memory could be more than 1G - root1982 2012-04-08 02:54

Quoting from "Hadoop The Definite Guide, 3rd edition", page 306

Because MapReduce jobs are normally I/O-bound, it makes sense to have more tasks than processors to get better utilization.

The amount of oversubscription depends on the CPU utilization of jobs you run, but a good rule of thumb is to have a factor of between one and two more tasks (counting both map and reduce tasks) than processors.

A processor in the quote above is equivalent to one logical core.

But this is just in theory, and most likely each use case is different than another, some tests need to be performed. But this number can be a good start to test with.

2014-09-01 05:49
by Jifeng Zhang

Probably, you should also look at reducer lazy loading, which allows reducers to start later when required, so basically, number of maps slots can be increased. Don't have much idea on this though but, seems useful.

2012-04-06 20:40
by SSaikia_JtheRocker

That will be very application and hardware dependent. If data are aggregated very good on the mapper side, less data is traveling over the network. In this case, if the reducer starts too early, it will just be waiting for data to process. If you have fast network, it will be the same situation. On the other hand, delay the reducer will delay the job. The goal is not to run more mapper, but to get the job finish faster - root1982 2012-04-08 03:00

Root: Gave you a comment up!. I'm not sure but just to clarify myself, say we have a 8 core machine without HT. Lets say, here we run 5 parallel map tasks and 2 parallel reduce tasks. So, here we reserved 2 slots for reduce tasks. Is it not the case that if we lazily load the reducer, those 2 slots can be used by the map tasks instead, which increases the number of parallel map tasks to 7 - SSaikia_JtheRocker 2012-04-08 05:37

JtheRocker: If we set mappers as 5 and reducers as 2, we can't use the slots of Reducers. Max 5 mappers can run at any moment - techlad 2012-04-08 08:54

techlad: Got it. So will lazy loading work for you? I'm not sure though - SSaikia_JtheRocker 2012-04-08 19:56

Taken from Hadoop Gyan-My blog:

No. of mappers is decided in accordance with the data locality principle as described earlier. Data Locality principle : Hadoop tries its best to run map tasks on nodes where the data is present locally to optimize on the network and inter-node communication latency. As the input data is split into pieces and fed to different map tasks, it is desirable to have all the data fed to that map task available on a single node.Since HDFS only guarantees data having size equal to its block size (64M) to be present on one node, it is advised/advocated to have the split size equal to the HDFS block size so that the map task can take advantage of this data localization. Therefore, 64M of data per mapper. If we see some mappers running for a very small period of time, try to bring down the number of mappers and make them run longer for a minute or so.

No. of reducers should be slightly less than the number of reduce slots in the cluster (the concept of slots comes in with a pre-configuration in the job/task tracker properties while configuring the cluster) so that all the reducers finish in one wave and make full utilisation of the cluster resources.

2012-11-12 14:35
by Abhishek Jain