I am attempting to add a new stemmer that works by table lookup. If h is the hash that contains the stemming mapping, it is encoded as follows: keys are the words before stemming and values are the words after stemming.
Ideally, I would like to add a custom hash that allows me to do the following:
myCorpus = tm_map(myCorpus, replaceWords, h)
The replaceWords function is applied to each document in myCorpus and uses the hash to stem the contents of the document.
Here is the sample code for my replaceWords function:
hash_replace <- function(x,h) {
if (length(h[[x]])>0) {
return(h[[x]])
} else {
return(x)
}
}
replaceWords <- function(x,h) {
y = tolower(unlist(strsplit(x," ")))
y=y[which(as.logical(nchar(y)))]
z = unlist(lapply(y,hash_replace,h))
return(paste(unlist(z),collapse=' '))
}
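For comparison, the same table lookup can be sketched without a per-word function call. This is only an illustration in base R: it assumes the mapping is held in a named character vector rather than a hash object, and the names stem_table and replace_words_vec, as well as the sample entries, are made up for the example.

```r
# Hypothetical lookup table: names are unstemmed words, values are stems.
stem_table <- c(running = "run", cars = "car", better = "good")

replace_words_vec <- function(x, table) {
  # Split on runs of whitespace and lowercase the tokens.
  tokens <- tolower(unlist(strsplit(x, "[[:space:]]+")))
  # Drop empty tokens (e.g. from leading whitespace).
  tokens <- tokens[nzchar(tokens)]
  # Look up every token at once; tokens without an entry come back as NA.
  stems <- unname(table[tokens])
  # Keep the original token wherever the table has no entry.
  missing <- is.na(stems)
  stems[missing] <- tokens[missing]
  paste(stems, collapse = " ")
}

result <- replace_words_vec("Running  cars are better", stem_table)
```

Indexing a named vector by a character vector does the whole lookup in one step, which tends to matter once documents get long. Like the original, this version returns a plain character string.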
Although this works, the transformed corpus no longer contains content of type "TextDocument" or "PlainTextDocument", but of type "character".
I tried using
return(as.PlainTextDocument(paste(unlist(z),collapse=' ')))
but that gives me an error when I try to run it.
In previous versions of R's tm package, I did see a replaceWords function that allowed for synonym and WordNet-based substitution. But I no longer see it in the current version of the tm package (in particular, it is not listed when I call getTransformations()).
Does anybody out there have ideas on how I can make this happen?
Any help is greatly appreciated.
Thanks, Shivani Rao
You just need to use the PlainTextDocument function instead of as.PlainTextDocument. R automatically returns the value of the last expression evaluated in a function, so it works if you just make the last line
PlainTextDocument(paste(unlist(z),collapse=' '))
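Putting that fix together with the functions from the question gives roughly the following. Treat it as a sketch under a few assumptions: the tm package is installed, the hash h is stood in for by a plain named list (which also returns NULL from h[[x]] for a missing key), and the two sample documents are invented. The as.character() call is an addition that is not in the question; it is there because in recent tm versions a PlainTextDocument is no longer a bare character vector.

```r
library(tm)  # assumed to be installed

# Hypothetical lookup table standing in for the hash h.
h <- list(running = "run", cars = "car")

hash_replace <- function(x, h) {
  # Return the stem if the table has one, otherwise the word itself.
  if (length(h[[x]]) > 0) h[[x]] else x
}

replaceWords <- function(x, h) {
  # as.character() extracts the text when x is a PlainTextDocument.
  y <- tolower(unlist(strsplit(as.character(x), " ")))
  y <- y[nzchar(y)]
  z <- unlist(lapply(y, hash_replace, h))
  # Last expression is the return value: a document, not a bare string.
  PlainTextDocument(paste(z, collapse = " "))
}

# Made-up sample corpus for illustration.
myCorpus <- VCorpus(VectorSource(c("Running cars", "cars running fast")))
myCorpus <- tm_map(myCorpus, replaceWords, h)
```

Note that building a fresh PlainTextDocument does not carry over the metadata of the original document. In more recent tm releases, another option is to have replaceWords return the plain string and call tm_map(myCorpus, content_transformer(replaceWords), h), which updates only the content of each document.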