Random Forests data bucketing question

Hi all, I have a machine learning algorithm question, specifically about the Random Forests (RF) algorithm. This may not be the right forum to ask it, but I've already tried the kaggle ML forum and haven't gotten an answer (yet). If you can point me to a more appropriate forum, that'd be much appreciated, too.

So, my (perhaps limited) understanding of RF is that the original data set, S, is randomly split into a training subset St(k) and a classification subset Sc(k), with a different split for each tree k (k = 1, 2, ..., M, where M is the number of trees).

St(k) is about 2/3 of S; and Sc(k), used to compute the so-called out-of-bag (OOB) error, is the rest, about 1/3 of S. Bottom line, though, St(k) ∪ Sc(k) = S. Hence, the entire set S is used for each tree in the forest, just split differently.
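As a sanity check on those proportions, here's a minimal NumPy sketch (the names `in_bag`/`oob` are mine, not from any library) of how one tree's sample is drawn in Breiman's formulation: each tree draws |S| points *with replacement*, so about 63.2% of S ends up in-bag and the remaining ~1/3 is out-of-bag, which is where the 2/3 / 1/3 figures come from.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000                     # |S|
S = np.arange(n)               # indices standing in for the data set

# One tree's bootstrap draw: n indices sampled WITH replacement.
in_bag = rng.choice(n, size=n, replace=True)

# Out-of-bag points: never drawn for this tree; used for the OOB error.
oob = np.setdiff1d(S, in_bag)

frac_unique = len(np.unique(in_bag)) / n   # ~ 1 - 1/e ≈ 0.632
frac_oob = len(oob) / n                    # ~ 1/e ≈ 0.368
```

Note that in-bag and OOB together still cover all of S for every tree, which is the point the paragraph above makes.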

My question is the following: instead of passing the entire set S (just split differently) to each tree, can I pre-partition S into M smaller buckets S(k), k = 1, 2, ..., M, and "give" each tree a different bucket S(k)? Each bucket would then be further split into (St(k), Sc(k)), but this time with St(k) ∪ Sc(k) = S(k) instead of the whole S, where |S(k)| ≈ |S|/M << |S| (|·| denotes the cardinality of a set).

Would the underlying RF theory still hold, from a stochastic standpoint? Thank you in advance.
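For what it's worth, the bucketing scheme described above is easy to sketch; this is only an illustration of the proposed splitting under my reading of the question, not a claim about its theory. Related ideas do appear in the ensemble literature under names like "pasting" and subsampling (training each member on a small subset drawn without replacement), though whether the OOB-error guarantees carry over is exactly the theoretical question being asked.

```python
import numpy as np

rng = np.random.default_rng(0)
n, M = 9_000, 10               # |S| and number of trees

# Pre-partition S into M disjoint buckets S(k), each of size ~ n/M.
perm = rng.permutation(n)
buckets = np.array_split(perm, M)

# Each tree k then splits only its own bucket into train / OOB parts,
# so St(k) ∪ Sc(k) = S(k) rather than the whole S.
splits = []
for Sk in buckets:
    cut = int(round(len(Sk) * 2 / 3))
    St, Sc = Sk[:cut], Sk[cut:]
    splits.append((St, Sc))
```

Unlike standard bagging, no data point is shared between trees here, so each tree sees only |S|/M points in total.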
 
I think you're unlikely to get many answers here (at least I haven't encountered many ML folks around); I'd try asking in one of the following communities:
- Cross Validated: http://stats.stackexchange.com/
- Data Science Stack Exchange: http://datascience.stackexchange.com/
- MetaOptimize Q+A: http://metaoptimize.com/qa/
- /r/MachineLearning: http://www.reddit.com/r/MachineLearning

See also:
- /r/MachineLearning FAQ: http://www.reddit.com/r/MachineLearning/wiki/
- links in the "Forum, Q&A" section: http://www.machinelearningsalon.org/resources.html
 
Thank you. I've actually gotten an answer on kaggle in the meantime (yes, it is possible). Good suggestions, though.
 