I have a large dataset in R (1M+ rows by 6 columns) that I want to use to train a random forest (using the randomForest package) for regression purposes. Unfortunately, I get an "Error in matrix(0, n, n) : too many elements specified" error when trying to do the whole thing at once, and "cannot allocate enough memory" kinds of errors when running it on a subset of the data, down to 10,000 or so observations.
Seeing that there is no chance I can add more RAM on my machine and random forests are very suitable for the type of process I am trying to model, I'd really like to make this work.
Any suggestions or workaround ideas are much appreciated.
You're likely asking randomForest to create the proximity matrix for the data, which, if you think about it, will be insanely big: 1 million x 1 million. A matrix this size would be required no matter how small you set sampsize. Indeed, simply Googling the error message seems to confirm this, as the package author states that the only place in the entire source code where n,n) is found is in calculating the proximity matrix.
But it's hard to help more, given that you've provided no details about the actual code you're using.
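To put a number on "insanely big", here is a rough sketch of the arithmetic, followed by a call that leaves the proximity matrix out. The data (x, y) are simulated stand-ins, and the ntree, sampsize and nodesize values are illustrative assumptions rather than anything taken from the question. The forest is only fitted if the randomForest package is installed.

```r
# Size of a dense n x n proximity matrix for the dataset in the question.
n <- 1e6
n_elements <- n * n                          # 1e12 cells
# The cell count is far beyond the largest value an R integer can hold.
exceeds_int_max <- n_elements > .Machine$integer.max
size_tb <- n_elements * 8 / 1e12             # 8 bytes per double, in terabytes
cat("cells:", n_elements, " size (TB):", size_tb, "\n")

# Hypothetical regression fit without a proximity matrix.
# x and y are simulated; replace them with the real predictors and response.
if (requireNamespace("randomForest", quietly = TRUE)) {
  set.seed(1)
  x <- data.frame(matrix(rnorm(5000 * 5), ncol = 5))
  y <- rowSums(x) + rnorm(5000)
  fit <- randomForest::randomForest(
    x = x, y = y,
    ntree = 50,          # fewer trees than the default, to save memory
    sampsize = 1000,     # rows drawn for each tree
    nodesize = 20,       # larger terminal nodes give smaller trees
    proximity = FALSE    # do not build the n x n proximity matrix
  )
  print(fit)
}
```

If a call along these lines still runs out of memory, the proximity matrix is probably not the culprit, and the actual code would be needed to say more.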
keep.forest=F results in a 14MB result, while proximity=FALSE made no difference in or out: the result was 232 MB - Wayne 2014-11-12 22:25
keep.forest = FALSE will certainly drastically reduce the size of the resulting object - joran 2014-11-12 22:35
randomForest: it was using randomForest via caret. At some point, it wanted to allocate 21 GB -- assuming that the OP was running randomForest directly, not an issue - Wayne 2014-11-13 19:17
I'd recommend the bigrf package in R, since it's designed for the type of issue you've encountered (i.e., lack of enough RAM). Unfortunately, at this time, bigrf has been removed from CRAN, but it's still available in the archives (see answer: Can't install bigrf package).
Another approach might involve combining RFs based on different training data, but the results might be considered nonsensical (see answer: Combining random forests built with different training sets in R for details). The modification mentioned in the latter post did work for me, but the combined RFs I ran were sometimes better, and sometimes worse relative to using just a single RF (YMMV).
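As a minimal sketch of the "several RFs on different training data" idea: the version below does not merge the forest objects (so it is not the modification from the linked post). It simply trains one small forest per chunk and averages their predictions, which is a simpler, swapped-in technique. The data are simulated, and make_chunks, the chunk count and ntree are hypothetical choices for illustration. The forests are only fitted if the randomForest package is installed.

```r
# Split row positions 1..n into k roughly equal, randomly assigned chunks.
make_chunks <- function(n, k) {
  split(sample.int(n), rep_len(seq_len(k), n))
}

set.seed(42)
n <- 6000
dat <- data.frame(matrix(rnorm(n * 5), ncol = 5))   # simulated predictors X1..X5
dat$y <- rowSums(dat[, 1:5]) + rnorm(n)             # simulated response

train_id <- 1:5000
test <- dat[-train_id, ]
chunks <- make_chunks(length(train_id), k = 5)

if (requireNamespace("randomForest", quietly = TRUE)) {
  # One small forest per chunk, so only one chunk is modelled at a time.
  forests <- lapply(chunks, function(idx) {
    randomForest::randomForest(y ~ ., data = dat[train_id[idx], ], ntree = 50)
  })
  # Every forest has the same number of trees, so a plain mean of the
  # per-forest predictions weights all trees equally.
  preds <- sapply(forests, predict, newdata = test)
  avg_pred <- rowMeans(preds)
  rmse <- sqrt(mean((avg_pred - test$y)^2))
  print(rmse)
}
```

Comparing this RMSE against a single forest trained on one chunk is a quick way to check whether the combination helps on your data; as noted above, it does not always.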
bigrf package is supposed to handle regressions. http://finzi.psych.upenn.edu/library/bigrf/html/bigrf-package.htm - neanderslob 2016-11-04 23:38
proximity = FALSE as joran suggested and tell us if it works - smci 2012-10-29 07:03