Random forest on a big dataset


Score: 5

I have a large dataset in R (1M+ rows by 6 columns) that I want to use to train a random forest (using the randomForest package) for regression purposes. Unfortunately, I get an "Error in matrix(0, n, n) : too many elements specified" error when trying to do the whole thing at once, and "cannot allocate enough memory" kinds of errors when running it on a subset of the data -- down to 10,000 or so observations.

Seeing that there is no chance I can add more RAM to my machine, and that random forests are very well suited to the type of process I am trying to model, I'd really like to make this work.

Any suggestions or workaround ideas are much appreciated.

2012-04-05 23:05
by ktdrv
Run with proximity = FALSE as joran suggested and tell us if it works - smci 2012-10-29 07:03
One relatively simple way around your problem would be to subset your input matrix. All that data probably won't give you a better model than one with a subset of size 10K x 10K - Tim Biegeleisen 2015-01-15 10:31
Did you have a look at library(h2o)? That runs OK for very large problems, see http://www.r-bloggers.com/benchmarking-random-forest-implementations - Tom Wenseleers 2015-08-20 18:50
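Following the subsetting and h2o suggestions in the comments above, a minimal sketch of both routes; the data frame dat and the response column y are placeholder names, not taken from the original post:

    library(randomForest)

    ## Route 1: fit on a random subset of rows rather than all 1M+ observations
    set.seed(1)
    idx <- sample(nrow(dat), 1e5)                     # e.g. 100,000 rows
    rf_sub <- randomForest(y ~ ., data = dat[idx, ], ntree = 200)

    ## Route 2: the h2o backend keeps the data in its own JVM memory and
    ## scales to much larger problems than in-process R
    library(h2o)
    h2o.init()
    hdat <- as.h2o(dat)
    rf_h2o <- h2o.randomForest(y = "y",
                               x = setdiff(names(dat), "y"),
                               training_frame = hdat,
                               ntrees = 200)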


Score: 11

You're likely asking randomForest to create the proximity matrix for the data, which, if you think about it, will be insanely big: 1 million x 1 million. A matrix of this size would be required no matter how small you set sampsize. Indeed, simply Googling the error message seems to confirm this, as the package author states that the only place in the entire source code where "n, n)" is found is in calculating the proximity matrix.

But it's hard to help more, given that you've provided no details about the actual code you're using.
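Since the original call was not posted, here is a minimal sketch of a call that avoids allocating the proximity matrix; dat and y are placeholder names:

    library(randomForest)

    ## proximity defaults to FALSE, so simply do not request it
    ## (shown explicitly here for clarity)
    rf <- randomForest(y ~ ., data = dat,
                       ntree = 500,
                       proximity = FALSE)   # no n-by-n proximity matrix is built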

2012-04-06 03:44
by joran
I kind of arrived at the same conclusion, but I don't quite understand why it's needed, or whether there is some way of training the RF without it - ktdrv 2012-04-06 04:10
I'm not sure what you mean. Setting proximity = FALSE will prevent the proximities from being calculated - joran 2012-04-06 04:15
I just did a test, and it's actually the forest itself that's huge. In my particular test case, keep.forest=FALSE gives a 14 MB result, while proximity=FALSE made no difference either way: the result was 232 MB - Wayne 2014-11-12 22:25
@Wayne The size of the forest object itself is a separate issue (and not what the OP asked about). The question asked about a specific error that was the result of the inability to allocate enough memory for a single matrix, and the only possible source of that specific error was the proximity matrix. But yes, setting keep.forest = FALSE will certainly drastically reduce the size of the resulting object - joran 2014-11-12 22:35
Now I remember when I had a problem similar to the OP's with randomForest: it was using randomForest via caret. At some point, it wanted to allocate 21 GB -- assuming that the OP was running randomForest directly, not an issue - Wayne 2014-11-13 19:17
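A quick way to check what is taking up the space, along the lines of the comments above (object and data names are illustrative):

    library(randomForest)

    rf_default <- randomForest(y ~ ., data = dat)
    rf_slim    <- randomForest(y ~ ., data = dat, keep.forest = FALSE)

    print(object.size(rf_default), units = "MB")
    print(object.size(rf_slim),    units = "MB")

Note that with keep.forest = FALSE the fitted trees are discarded, so the resulting object cannot be used to predict on new data; it only retains the quantities (such as the out-of-bag estimates) computed during training.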


Score: 1

I'd recommend the bigrf package in R, since it's designed for the type of issue you've encountered (i.e., lack of enough RAM). Unfortunately, at this time, bigrf has been removed from CRAN, but it's still available in the archives (see answer: Can't install bigrf package).

Another approach might involve combining RFs built on different subsets of the training data, but the results might be considered nonsensical (see answer: Combining random forests built with different training sets in R for details). The modification mentioned in the latter post did work for me, but the combined RFs I ran were sometimes better and sometimes worse than using just a single RF (YMMV).
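The combining approach can be sketched with randomForest's combine() function, growing separate forests on chunks of the data and merging them afterwards (dat, y, and the four-way split are placeholders):

    library(randomForest)

    ## split the row indices into four roughly equal chunks
    chunks <- split(seq_len(nrow(dat)), rep(1:4, length.out = nrow(dat)))

    ## grow a small forest on each chunk, then merge into one ensemble
    forests <- lapply(chunks, function(i)
      randomForest(y ~ ., data = dat[i, ], ntree = 125))
    rf_combined <- do.call(randomForest::combine, forests)

Note that the out-of-bag error statistics are not recomputed for the merged forest, so it should be validated on held-out data rather than judged by its reported error components.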

2014-12-15 16:49
by Prophet60091
I might be misunderstanding, but I believe the bigrf package isn't able to handle regression. http://finzi.psych.upenn.edu/library/bigrf/html/bigrf-package.htm - neanderslob 2016-11-04 23:38
you're right: I misread their package abstract. Editing my original answer. Thx - Prophet60091 2016-12-04 16:34