WordCount with custom word delimiters in Pig?


I'm new to Pig, and I'm trying to write a word count program.

One way of getting words from text is to use the TOKENIZE function:

WORDS = foreach INPUT generate flatten(TOKENIZE(text)) AS word;

But I only want to split on whitespace, whereas TOKENIZE splits on things like commas, too. How would I do this? I tried using STRSPLIT(text, ' '), but STRSPLIT seems to return a tuple whereas TOKENIZE returns a bag, so I'm not sure how to use STRSPLIT for this.

2012-04-04 07:42
by grautur


You can't directly transform a tuple into a bag (or vice versa). I suggest doing this:

  1. Load your data
  2. Use STRSPLIT to split your value into a tuple
  3. Convert your tuples into a bag with a UDF
  4. Flatten your bag
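
Step 3 could be sketched as a Jython UDF along these lines. This is only a sketch under assumptions: the function name tuple_to_bag and the output schema are made up for illustration, and it assumes Pig's Jython engine supplies the outputSchema decorator and maps a Pig tuple to a Python tuple and a bag to a list of tuples. The fallback decorator exists only so the file can also be run outside Pig.

```python
# Hypothetical UDF file, e.g. tuple_to_bag.py (name chosen for illustration).

try:
    outputSchema  # assumed to be provided by Pig's Jython script engine
except NameError:
    # Stand-in for running outside Pig: records the schema on the function.
    def outputSchema(schema):
        def decorator(func):
            func.outputSchema = schema
            return func
        return decorator


@outputSchema("words:bag{t:tuple(word:chararray)}")
def tuple_to_bag(fields):
    """Turn a tuple such as ('a', 'b') into a bag [('a',), ('b',)]."""
    if fields is None:
        return []
    # Skip empty strings, which can appear when the input has repeated spaces.
    return [(field,) for field in fields if field]
```

In a script this would presumably be registered with something like REGISTER 'tuple_to_bag.py' USING jython AS myfuncs; and then called as flatten(myfuncs.tuple_to_bag(STRSPLIT(text, ' '))).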
2012-04-04 14:15
by Kevin


It depends on what your input data looks like, but the following could work for you:

  1. Use MyRegExLoader (in PiggyBank) with a regex to load your data.
  2. Use STREAM with Perl, sed, or your favorite scripting language to munge your input data into a format that TOKENIZE will then handle the way you want.
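
As a rough illustration of the STREAM option, here is a small Python script that splits each input line on whitespace only and prints one word per line. Note that it goes a step further than munging the data for TOKENIZE: it does the splitting itself, so TOKENIZE is no longer needed. The script name split_words.py is hypothetical, and the sketch assumes each streamed record is a single text field (with several fields, the tabs Pig puts between them would also act as separators).

```python
import sys


def split_on_whitespace(line):
    """Split on runs of whitespace only; punctuation stays attached to words."""
    return line.split()


def main(stdin=sys.stdin, stdout=sys.stdout):
    # Emit one word per output line, so each line becomes one record.
    for line in stdin:
        for word in split_on_whitespace(line):
            stdout.write(word + "\n")


if __name__ == "__main__":
    main()
```

On the Pig side this might look like DEFINE splitter `python split_words.py` SHIP('split_words.py'); followed by WORDS = STREAM INPUT THROUGH splitter AS (word:chararray);, assuming Python is available on the cluster nodes.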

Also, it's possible to convert tuples to a bag with ToBag (also in PiggyBank).

2012-04-05 07:33
by msponer