I'm new to Pig, and I'm trying to write a word count program.
One way of getting words from text is to use the TOKENIZE function:
WORDS = foreach INPUT generate flatten(TOKENIZE(text)) AS word;
But I only want to split on whitespace, whereas TOKENIZE splits on things like commas, too. How would I do this? I tried using STRSPLIT(text, ' '), but STRSPLIT seems to return a tuple whereas TOKENIZE returns a bag, so I'm not sure how to use STRSPLIT for this.
A tuple can't be converted directly into a bag (or vice versa). I suggest you do this:
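For instance, if your Pig release ships the built-in STRSPLITTOBAG (I believe it was added around 0.14, so treat that as an assumption and check your version), the split can produce a bag directly and no tuple-to-bag conversion is needed. The file name and aliases below are placeholders, not anything from your script:

```pig
-- Sketch only: assumes STRSPLITTOBAG is available in your Pig version.
-- 'input.txt' and the alias names are hypothetical.
lines = LOAD 'input.txt' USING TextLoader() AS (text:chararray);

-- Split on runs of whitespace; a limit of 0 means "no limit on the number of pieces".
-- The result is a bag, so FLATTEN yields one word per record.
words = FOREACH lines GENERATE FLATTEN(STRSPLITTOBAG(text, '\\s+', 0)) AS word;

-- Standard word count on top of the flattened words.
grouped = GROUP words BY word;
counts  = FOREACH grouped GENERATE group AS word, COUNT(words) AS n;
```

Note that a line starting with whitespace will likely produce an empty first token, so you may want to FILTER those out before grouping.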
It depends on what your input data looks like, but the following could work for you:
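As one sketch of that idea, and assuming your version's TOKENIZE accepts an optional second argument listing the delimiter characters (newer releases document this, but verify on yours), you could restrict the split to spaces only. Again, the file name and aliases are made up:

```pig
-- Sketch only: assumes TOKENIZE(expression, 'delimiters') is supported.
-- 'input.txt' and the alias names are hypothetical.
lines = LOAD 'input.txt' USING TextLoader() AS (text:chararray);

-- With an explicit delimiter of ' ', commas and quotes stay attached to the words.
words = FOREACH lines GENERATE FLATTEN(TOKENIZE(text, ' ')) AS word;

grouped = GROUP words BY word;
counts  = FOREACH grouped GENERATE group AS word, COUNT(words) AS n;
DUMP counts;
```

This splits on the space character only, so if your text also contains tabs you would need to add them to the delimiter string or normalize the input first.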
Also, it's possible to convert tuples to a bag with ToBag (also in PiggyBank).
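Recent Pig versions also have this as the built-in TOBAG. It takes a fixed list of expressions, so it only helps when you know how many fields there are up front. A minimal sketch, assuming a hypothetical input with exactly three space-separated columns per line:

```pig
-- Sketch only: the three-column layout and all names here are assumptions.
rows = LOAD 'input.txt' USING PigStorage(' ')
       AS (w1:chararray, w2:chararray, w3:chararray);

-- TOBAG wraps each field in its own tuple and puts them in one bag;
-- FLATTEN then turns that bag into one record per word.
words = FOREACH rows GENERATE FLATTEN(TOBAG(w1, w2, w3)) AS word;
```

For free-form text with a varying number of words per line this is not a good fit, which is why a splitting function that returns a bag is usually the simpler route.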