R's text mining package: adding a new function to getTransformations

I am attempting to add a new stemmer that works by table lookup. If h is the hash that contains the stemming operation, it is encoded as follows: keys are the words before stemming and values are the words after stemming.

Ideally, I would like to pass in a custom hash so that I can do the following:

myCorpus = tm_map(myCorpus, replaceWords, h)

The replaceWords function is applied to each document in myCorpus and uses the hash to stem the contents of the document.

Here is the sample code for my replaceWords function:

hash_replace <- function(x,h) {
if (length(h[[x]])>0) {
    return(h[[x]])
} else {
    return(x)
}
}

replaceWords <- function(x,h) {
y = tolower(unlist(strsplit(x," ")))
y=y[which(as.logical(nchar(y)))]
z = unlist(lapply(y,hash_replace,h))
return(paste(unlist(z),collapse=' '))
}

Although this works, the transformed corpus no longer contains content of type "TextDocument" or "PlainTextDocument", but of type "character".

I tried using

return(as.PlainTextDocument(paste(unlist(z),collapse=' ')))

but that gives me an error when I try to run it.

In previous versions of R's tm package, I saw a replaceWords function that allowed synonym-based and WordNet-based substitution. I no longer see it in the current version of the tm package (in particular, it is not listed when I call getTransformations()).

Does anybody out there have ideas on how I can make this happen?

Any help is greatly appreciated.

Cheers, Shivani

2012-04-05 16:25
by Shivani Rao

You just need to use the PlainTextDocument function instead of as.PlainTextDocument. R automatically returns the value of the last expression evaluated in a function, so it works if you make the last line:

PlainTextDocument(paste(unlist(z),collapse=' '))
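
For illustration, here is a minimal self-contained sketch of the lookup step in base R. It assumes the hash is an ordinary R environment with made-up entries, and the helper name replace_tokens is hypothetical (it is not part of tm). The final wrap in PlainTextDocument only runs if tm is installed.

```r
# Hypothetical lookup table: an environment used as a hash
# (keys = words before stemming, values = words after stemming).
h <- new.env(hash = TRUE)
assign("running", "run", envir = h)
assign("cats", "cat", envir = h)

# Return the stemmed form if the word is in the table, else the word unchanged.
hash_replace <- function(x, h) {
  if (exists(x, envir = h, inherits = FALSE)) {
    get(x, envir = h, inherits = FALSE)
  } else {
    x
  }
}

# Character-level helper: lower-case, split on spaces, drop empty tokens,
# look up each token, and glue the result back into one string.
replace_tokens <- function(x, h) {
  y <- tolower(unlist(strsplit(x, " ")))
  y <- y[nzchar(y)]
  paste(vapply(y, hash_replace, character(1), h = h), collapse = " ")
}

stemmed <- replace_tokens("Running  cats sleep", h)

# Wrap the result so the corpus keeps document objects rather than bare
# character vectors (skipped when tm is not installed).
if (requireNamespace("tm", quietly = TRUE)) {
  doc <- tm::PlainTextDocument(stemmed)
}
```

Keeping the string manipulation in a plain character function and wrapping only at the end makes the lookup logic easy to check on its own before it is handed to tm_map.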

2012-04-06 17:48
by Fojtasek