I'm new to Pig, and I'm trying to write a word count program.
One way of getting words from text is to use the TOKENIZE function:
WORDS = foreach INPUT generate flatten(TOKENIZE(text)) AS word;
But I only want to split on whitespace, whereas TOKENIZE splits on things like commas, too. How would I do this? I tried using STRSPLIT(text, ' '), but STRSPLIT seems to return a tuple whereas TOKENIZE returns a bag, so I'm not sure how to use STRSPLIT for this.
We can't directly transform a tuple into a bag (or vice versa). I suggest you do this:
It depends on what your input data looks like, but the following could work for you:
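One possible sketch, assuming a Pig release whose TOKENIZE accepts an optional second argument naming the delimiter characters (I believe this is the case from around 0.11 onward, but check the docs for your version). The path 'input.txt' and the alias names are placeholders, not anything from the question:

```pig
-- Hypothetical input path; TextLoader reads each line as a single chararray.
lines = LOAD 'input.txt' USING TextLoader() AS (text:chararray);

-- The second argument replaces TOKENIZE's default delimiter set,
-- so only spaces separate tokens and commas stay attached to the words.
words = FOREACH lines GENERATE FLATTEN(TOKENIZE(text, ' ')) AS word;

-- The usual word-count tail: group identical words and count each group.
grouped = GROUP words BY word;
counts = FOREACH grouped GENERATE group AS word, COUNT(words) AS n;

DUMP counts;
```

Because TOKENIZE still returns a bag here, FLATTEN yields one word per row, just as in the original statement from the question. If the text also contains tabs, adding them to the delimiter string should work the same way.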
Also, it's possible to convert tuples to a bag with ToBag (also in PiggyBank).
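If the goal is specifically to split on a regular expression and still get a bag, newer Pig releases also ship a built-in STRSPLITTOBAG (to my knowledge from 0.14 onward; treat the version as an assumption). A rough sketch, again with a placeholder path and alias names:

```pig
-- Hypothetical input path; TextLoader reads each line as a single chararray.
lines = LOAD 'input.txt' USING TextLoader() AS (text:chararray);

-- STRSPLITTOBAG takes (string, regex, limit) and returns a bag rather than a tuple;
-- a limit of 0 places no cap on the number of pieces.
words = FOREACH lines GENERATE FLATTEN(STRSPLITTOBAG(text, '\\s+', 0)) AS word;

-- A line that starts with whitespace can produce an empty first token, so drop those.
nonempty = FILTER words BY word != '';

DUMP nonempty;
```

This keeps the regex flexibility of STRSPLIT while giving FLATTEN a bag to expand into one word per row.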