This is a follow-up of #946. We came to the conclusion that the algorithm behind find_split should be redesigned in order to further reduce the training time of decision trees.
Below is a summary of the strategies we discussed.
N = total number of training samples
N_i = number of training samples at node i
d = n_features.
Let's assume for now that max_features = d.
Building X_argsort is O(N log(N) * d)
On master, at each node, find_split is O(N * d). To build a fully developed tree, find_split has to be called O(N) times, which results in a cumulative complexity for find_split of O(N^2 * d).
In total, the complexity of building a single tree is then O(N log(N) * d + N^2 * d)
If T trees are built, the complexity is O(N log(N) * d + T * N^2 * d) since X_argsort is shared for all trees.
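To make the O(N * d) per-node cost concrete, here is a rough Python sketch of the master behaviour. The names (`find_split_master`, the Gini `impurity` helper) are illustrative only, not the actual implementation: the point is that every call scans the full pre-sorted order of all N samples for each feature and filters by sample_mask, regardless of the node size N_i.

```python
import numpy as np

def impurity(y):
    """Gini impurity of a label array (stand-in criterion)."""
    if len(y) == 0:
        return 0.0
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def find_split_master(X, y, X_argsorted, sample_mask):
    """Walk the full pre-sorted order of all N samples per feature,
    skipping samples outside the node via sample_mask -- hence
    O(N * d) per node even when the node holds only N_i samples."""
    n_features = X.shape[1]
    best = (None, None, 0.0)  # (feature, threshold, impurity decrease)
    for f in range(n_features):
        # O(N) pass: the filter touches every sample, masked or not.
        order = [i for i in X_argsorted[:, f] if sample_mask[i]]
        parent = impurity(y[order])
        for k in range(1, len(order)):
            lo, hi = order[k - 1], order[k]
            if X[lo, f] == X[hi, f]:
                continue
            gain = parent - (k * impurity(y[order[:k]])
                             + (len(order) - k) * impurity(y[order[k:]])) / len(order)
            if gain > best[2]:
                best = (f, 0.5 * (X[lo, f] + X[hi, f]), gain)
    return best
```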
[Strategy A] Assume that we remove sample_mask and that, at each node, we instead reorder X_argsorted in a divide-and-conquer fashion.
In that case, at each node, find_split is O(N_i * d). To build a fully developed tree, find_split has to be called O(N) times, which results, if we further assume that the tree is balanced, in a cumulative complexity for find_split of O(N log(N) * d).
In total, the complexity of building a single tree is then O(N log(N) * d + N log(N) * d).
If T trees are built, the complexity is O(N log(N) * d + T * N log(N) * d), but this requires extra memory of O(N * d * n_jobs).
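A minimal sketch of the Strategy A reordering step, under two assumptions that are mine, not the issue's: a hypothetical `partition_argsorted` helper, and each node owning a contiguous `[start, end)` slice of X_argsorted. After a split is chosen, a stable pass moves left-child samples to the front of the slice in every feature column, which keeps each child's samples sorted per feature and costs O(N_i * d), with no sample_mask needed.

```python
import numpy as np

def partition_argsorted(X_argsorted, start, end, goes_left):
    """Stably reorder each feature's column of X_argsorted inside the
    node's [start, end) range so left-child samples come first.
    Stability preserves the sorted order within each child."""
    # The left-child size is the same for every feature column.
    n_left = int(sum(1 for i in X_argsorted[start:end, 0] if goes_left[i]))
    for f in range(X_argsorted.shape[1]):
        seg = list(X_argsorted[start:end, f])
        X_argsorted[start:end, f] = ([i for i in seg if goes_left[i]]
                                     + [i for i in seg if not goes_left[i]])
    return start + n_left  # boundary between left and right child ranges
```

The children then recurse on `[start, mid)` and `[mid, end)` with no further bookkeeping, which is the divide-and-conquer part.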
[Strategy B] Assume that we remove X_argsort and sample_mask and that, at each node, we sort the node samples along the considered features.
In that case, at each node, find_split is O(N_i log(N_i) * d). I don't know the exact cumulative complexity in that case, but my intuition is that it should be something like O(N log(N)^2 * d). In any case, far less than O(N^2 * d).
In total, the complexity of building a single tree should be around O(N log(N)^2 * d).
If T trees are built, then complexity should be O(T * N log(N)^2 * d).
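A sketch of the Strategy B splitter, again with illustrative names and a Gini criterion as a stand-in: the per-node `np.argsort` over only the node's N_i samples replaces both the shared precomputed X_argsort and the sample_mask.

```python
import numpy as np

def impurity(y):
    """Gini impurity of a label array (stand-in criterion)."""
    if len(y) == 0:
        return 0.0
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def find_split_local_sort(X, y, node_indices):
    """Sort the node's own N_i samples along each feature on the fly:
    O(N_i log(N_i) * d) per node, no global structures to maintain."""
    best = (None, None, 0.0)  # (feature, threshold, impurity decrease)
    n = len(node_indices)
    parent = impurity(y[node_indices])
    for f in range(X.shape[1]):
        # Per-node sort replaces the shared precomputed X_argsort.
        order = node_indices[np.argsort(X[node_indices, f], kind="mergesort")]
        for k in range(1, n):
            lo, hi = order[k - 1], order[k]
            if X[lo, f] == X[hi, f]:
                continue
            gain = parent - (k * impurity(y[order[:k]])
                             + (n - k) * impurity(y[order[k:]])) / n
            if gain > best[2]:
                best = (f, 0.5 * (X[lo, f] + X[hi, f]), gain)
    return best
```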
Overall, I think Strategy A is the best of all, but the extra memory required is a significant disadvantage.
Regarding Strategy B, theory says that asymptotically it is better than master, even for building ensembles of trees. However, I agree that we should account for the constant factors behind this analysis. I remain convinced that we should at least try and see! (I can work on that.)
What I don't understand yet is how to do the sorting of X_argsort in strategy A.
This cannot be done in-place, right? So for each feature, you have to copy this part of X_argsort and write it into the matrix again. For each depth you have to rebuild X_argsort once, which has cost N * d, right?
Well, not so bad...
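Assuming a balanced tree, that per-depth rebuild cost can be tallied in a few lines to confirm the O(N log(N) * d) total claimed for Strategy A (the numbers N = 1024, d = 10 are arbitrary):

```python
import math

# At depth k a balanced tree has 2**k nodes of N / 2**k samples each.
# Reordering a node's slice of X_argsort costs (node size) * d, so every
# depth level costs exactly N * d, and a tree of depth log2(N) costs
# N * d * log2(N) in total.
N, d = 1024, 10
depth = int(math.log2(N))
cost_per_depth = [sum((N // 2**k) * d for _ in range(2**k)) for k in range(depth)]
assert all(cost == N * d for cost in cost_per_depth)
total = sum(cost_per_depth)  # = N * d * log2(N)
```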