Here is the use case:
Input URLs are read by the mappers and emitted after some filtering. A custom partitioner then partitions them by hostname.
I have a global limit on the number of output URLs for the whole map-reduce job, and I distribute it evenly across all reducers. For example, if the global limit is 1000 and there are 5 reducers, each reducer emits at most 1000/5 = 200 URLs.
The problem: if the input contains URLs from only 2 hosts (due to user input), with 100000 URLs per host, then the 2 reducers that receive these URLs (same host, same partition) each emit only 200 URLs. The remaining reducers get no data at all because of the partitioning and emit 0 records.
So even though there were 100000 URLs per host and a global limit of 1000, the output contains only 400 URLs (200 per host).
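For illustration, a simplified sketch of such a hostname partitioner, assuming the map output key is the URL (the class name and the Text key/value types are assumptions, not the actual code):

    import java.net.URI;

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Partitioner;

    // Hypothetical hostname partitioner: all URLs from the same host land in the same partition.
    public class HostnamePartitioner extends Partitioner<Text, Text> {
        @Override
        public int getPartition(Text key, Text value, int numPartitions) {
            String host;
            try {
                host = new URI(key.toString()).getHost();
            } catch (Exception e) {
                host = null;
            }
            if (host == null) {
                host = key.toString();  // fall back to the raw key for malformed URLs
            }
            return (host.hashCode() & Integer.MAX_VALUE) % numPartitions;
        }
    }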
If you don't have to partition by hostname, you can solve the problem with a random partitioner.
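A minimal sketch of such a random partitioner (class name and Text key/value types are assumptions):

    import java.util.concurrent.ThreadLocalRandom;

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Partitioner;

    // Hypothetical random partitioner: ignores the key and spreads records evenly,
    // so each reducer receives roughly the same amount of data.
    public class RandomPartitioner extends Partitioner<Text, Text> {
        @Override
        public int getPartition(Text key, Text value, int numPartitions) {
            return ThreadLocalRandom.current().nextInt(numPartitions);
        }
    }

You would register it with job.setPartitionerClass(RandomPartitioner.class). Since the data is spread evenly, the per-reducer caps add up to the global limit no matter how many hosts appear in the input, although URLs of one host are of course no longer grouped in a single reducer.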
If you have to partition by hostname, I don't think there is an easy answer. No reducer knows in advance how many records it will receive, so each reducer has to accumulate 100000 records, or however many it gets. You would need to override the cleanup method in your reducer: the reducers would have to talk to each other (through counters, maybe) in cleanup, decide how many records each should contribute, and only write out records there.
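A rough sketch of the buffering part of that idea (class name, types, and the configuration key are assumptions; the cross-reducer coordination is only indicated in comments, because a task cannot reliably read other tasks' counter values while the job is still running):

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    // Hypothetical reducer: buffer everything in reduce(), decide what to emit in cleanup().
    public class LimitingReducer extends Reducer<Text, Text, Text, Text> {

        private final List<Text> buffered = new ArrayList<Text>();

        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context) {
            for (Text url : values) {
                buffered.add(new Text(url));  // copy, because Hadoop reuses the Text instance
                // Counters are aggregated by the framework, but the global totals are only
                // reliably visible after the job finishes, so this alone cannot coordinate reducers.
                context.getCounter("limits", "buffered").increment(1);
            }
        }

        @Override
        protected void cleanup(Context context) throws IOException, InterruptedException {
            // This is where the reducers would have to agree on how many records each one
            // contributes toward the global limit. As a placeholder, emit up to a fixed
            // per-reducer cap read from the configuration ("urls.per.reducer.limit" is an assumed key).
            long cap = context.getConfiguration().getLong("urls.per.reducer.limit", 200);
            long emitted = 0;
            for (Text url : buffered) {
                if (emitted >= cap) {
                    break;
                }
                context.write(url, new Text(""));
                emitted++;
            }
        }
    }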
What do you think?
Hadoop has built-in support for global counters. You can define your own counters and increment or read them from your mapper or reducer code.
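A minimal example (enum and class names are illustrative): increment a counter from the reducer and read the aggregated total in the driver after the job completes. Keep in mind that a running task only sees its own increments; the aggregated value is only reliably available once the job has finished.

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Reducer;

    public class CounterExample {

        // Illustrative counter; Hadoop aggregates it across all tasks.
        public enum UrlCounters { EMITTED }

        public static class CountingReducer extends Reducer<Text, Text, Text, Text> {
            @Override
            protected void reduce(Text key, Iterable<Text> values, Context context)
                    throws java.io.IOException, InterruptedException {
                for (Text url : values) {
                    context.write(key, url);
                    context.getCounter(UrlCounters.EMITTED).increment(1);
                }
            }
        }

        // In the driver, after job.waitForCompletion(true) returns:
        static long totalEmitted(Job job) throws Exception {
            return job.getCounters().findCounter(UrlCounters.EMITTED).getValue();
        }
    }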