Here is the use case:
Input URLs are read by the mappers and emitted after some filtering. A custom partitioner then partitions them by hostname.
I have a global limit on the number of output URLs for the whole map-reduce job, and I distribute it evenly across all reducers. For example, if the global limit is 1000 and there are 5 reducers, each reducer emits at most 1000/5 = 200 URLs.
The problem: if the input contains URLs from only 2 hosts (due to user input), with 100000 URLs per host, then the 2 reducers that receive these URLs (same host, same partition) each emit only 200 URLs. The remaining reducers get no data at all because of the partitioning and emit 0 records.
So even though there were 100000 URLs per host and a global limit of 1000, the output contains only 400 URLs (200 per host).
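For illustration, a simplified sketch of such a hostname partitioner, assuming the map output key is the URL (the class name and the Text key/value types are assumptions, not the actual code):

    import java.net.URI;

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Partitioner;

    // Hypothetical hostname partitioner: all URLs from the same host land in the same partition.
    public class HostnamePartitioner extends Partitioner<Text, Text> {
        @Override
        public int getPartition(Text key, Text value, int numPartitions) {
            String host;
            try {
                host = new URI(key.toString()).getHost();
            } catch (Exception e) {
                host = null;
            }
            if (host == null) {
                host = key.toString();  // fall back to the raw key for malformed URLs
            }
            return (host.hashCode() & Integer.MAX_VALUE) % numPartitions;
        }
    }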
If you don't have to partition by hostname, you can solve the problem with a random partitioner.
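A minimal sketch of such a random partitioner (class name and Text key/value types are assumptions):

    import java.util.concurrent.ThreadLocalRandom;

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Partitioner;

    // Hypothetical random partitioner: ignores the key and spreads records evenly,
    // so each reducer receives roughly the same amount of data.
    public class RandomPartitioner extends Partitioner<Text, Text> {
        @Override
        public int getPartition(Text key, Text value, int numPartitions) {
            return ThreadLocalRandom.current().nextInt(numPartitions);
        }
    }

You would register it with job.setPartitionerClass(RandomPartitioner.class). Since the data is spread evenly, the per-reducer caps add up to the global limit no matter how many hosts appear in the input, although URLs of one host are of course no longer grouped in a single reducer.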
If you have to partition by hostname, I don't think there is an easy answer. No reducer knows in advance how many records it will receive, so each reducer has to accumulate 100000 records, or however many it gets. You would need to override the cleanup method in your reducer: the reducers would have to talk to each other (through counters, maybe) in cleanup, decide how many records each should contribute, and only write out records there.
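A rough sketch of the buffering part of that idea (class name, types, and the configuration key are assumptions; the cross-reducer coordination is only indicated in comments, because a task cannot reliably read other tasks' counter values while the job is still running):

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    // Hypothetical reducer: buffer everything in reduce(), decide what to emit in cleanup().
    public class LimitingReducer extends Reducer<Text, Text, Text, Text> {

        private final List<Text> buffered = new ArrayList<Text>();

        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context) {
            for (Text url : values) {
                buffered.add(new Text(url));  // copy, because Hadoop reuses the Text instance
                // Counters are aggregated by the framework, but the global totals are only
                // reliably visible after the job finishes, so this alone cannot coordinate reducers.
                context.getCounter("limits", "buffered").increment(1);
            }
        }

        @Override
        protected void cleanup(Context context) throws IOException, InterruptedException {
            // This is where the reducers would have to agree on how many records each one
            // contributes toward the global limit. As a placeholder, emit up to a fixed
            // per-reducer cap read from the configuration ("urls.per.reducer.limit" is an assumed key).
            long cap = context.getConfiguration().getLong("urls.per.reducer.limit", 200);
            long emitted = 0;
            for (Text url : buffered) {
                if (emitted >= cap) {
                    break;
                }
                context.write(url, new Text(""));
                emitted++;
            }
        }
    }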
What do you think?
Hadoop has built-in support for global counters. You can define your own counters and increment or read them from your mapper or reducer code.
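A minimal example (enum and class names are illustrative): increment a counter from the reducer and read the aggregated total in the driver after the job completes. Keep in mind that a running task only sees its own increments; the aggregated value is only reliably available once the job has finished.

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Reducer;

    public class CounterExample {

        // Illustrative counter; Hadoop aggregates it across all tasks.
        public enum UrlCounters { EMITTED }

        public static class CountingReducer extends Reducer<Text, Text, Text, Text> {
            @Override
            protected void reduce(Text key, Iterable<Text> values, Context context)
                    throws java.io.IOException, InterruptedException {
                for (Text url : values) {
                    context.write(key, url);
                    context.getCounter(UrlCounters.EMITTED).increment(1);
                }
            }
        }

        // In the driver, after job.waitForCompletion(true) returns:
        static long totalEmitted(Job job) throws Exception {
            return job.getCounters().findCounter(UrlCounters.EMITTED).getValue();
        }
    }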