
Tree: better algorithm for find_split #964

Closed · glouppe opened this issue Jul 18, 2012 · 3 comments

glouppe commented Jul 18, 2012

This is a follow-up to #946. We came to the conclusion that the algorithm behind find_split should be redesigned in order to further reduce the training time of decision trees.

Below is a summary of the strategies we discussed.


N = total number of training samples
N_i = number of training samples at node i
d = n_features.

Let's assume for now that max_features = d.

  1. Building X_argsort is O(N log(N) * d)

  2. On master, at each node, find_split is O(N * d). To build a fully developed tree, find_split has to be called O(N) times, which results in a cumulative complexity for find_split of O(N^2 * d).

In total, the complexity of building a single tree is then O(N log(N) * d + N^2 * d) = O(N^2 * d).

If T trees are built, the complexity is O(N log(N) * d + T * N^2 * d) since X_argsort is shared for all trees.
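To make the O(N * d) per-node cost concrete, below is a minimal NumPy sketch of this kind of search (the function name and the regression variance-reduction criterion are assumptions for illustration, not the actual Cython code): the full presorted column is traversed at every node, and out-of-node samples are skipped via sample_mask.

```python
import numpy as np

def find_split_master(X, y, X_argsorted, sample_mask):
    # Sketch of the master-style search: each feature column is traversed
    # over all N presorted indices, skipping out-of-node samples via
    # sample_mask, so one call costs O(N * d) even for a tiny node.
    best_feature, best_threshold, best_score = None, None, -np.inf
    for j in range(X.shape[1]):
        col = X_argsorted[:, j]
        idx = col[sample_mask[col]]  # O(N) pass over the full column
        v, t = X[idx, j], y[idx].astype(float)
        n, total, left_sum = len(t), t.sum(), 0.0
        for k in range(1, n):
            left_sum += t[k - 1]
            if v[k] == v[k - 1]:
                continue  # no threshold between equal values
            # Maximizing sum^2/count on both sides minimizes the weighted
            # MSE, since the node's total sum of squares is constant.
            score = left_sum**2 / k + (total - left_sum)**2 / (n - k)
            if score > best_score:
                best_feature = j
                best_threshold = 0.5 * (v[k] + v[k - 1])
                best_score = score
    return best_feature, best_threshold, best_score
```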

  3. [Strategy A] Assume that we remove sample_mask and that, at each node, we instead reorder X_argsorted in a divide-and-conquer fashion.

In that case, at each node, find_split is O(N_i * d). To build a fully developed tree, find_split has to be called O(N) times, which results, if we further assume that the tree is balanced, in a cumulative complexity for find_split of O(N log(N) * d).

In total, the complexity of building a single tree is then O(N log(N) * d + N log(N) * d) = O(N log(N) * d).

If T trees are built, the complexity is O(N log(N) * d + T * N log(N) * d), but this requires extra memory of O(N * d * n_jobs).
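For concreteness, here is a minimal sketch of the reordering step Strategy A relies on (partition_argsorted is a hypothetical helper, not existing code): after a split, each presorted column of the node's slice is stably partitioned in place, so both children end up with contiguous, still-sorted slices of X_argsorted.

```python
import numpy as np

def partition_argsorted(X_argsorted, start, end, in_left):
    # Stably partition each presorted column of the node's slice
    # [start:end) in place: afterwards, [start:split) lists the left
    # child's sample indices and [split:end) the right child's, both
    # still in sorted feature order. A single O(N_i) buffer is reused
    # across the d columns, so the node costs O(N_i * d).
    node = X_argsorted[start:end, 0]
    split = start + int(in_left[node].sum())
    buf = np.empty(end - start, dtype=X_argsorted.dtype)
    for j in range(X_argsorted.shape[1]):
        col = X_argsorted[start:end, j]
        left, right = col[in_left[col]], col[~in_left[col]]
        buf[:len(left)] = left   # boolean indexing is stable, so the
        buf[len(left):] = right  # sorted order survives the partition
        X_argsorted[start:end, j] = buf
    return split  # row where the right child's slice begins
```

One would call this with in_left = X[:, best_feature] <= best_threshold once find_split has returned, then recurse on [start:split) and [split:end); each child thereafter touches only its own N_i rows, which is where the O(N_i * d) per-node cost above comes from.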

  4. [Strategy B] Assume that we remove X_argsort and sample_mask and that, at each node, we sort the node samples along the considered features.

In that case, at each node, find_split is O(N_i log(N_i) * d). I don't know the exact cumulative complexity in that case, but my intuition is that it should be something like O(N log(N)^2 * d). Anyway, far less than O(N^2 * d).

In total, the complexity of building a single tree should be around O(N log(N)^2 * d).

If T trees are built, then the complexity should be O(T * N log(N)^2 * d).
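A matching sketch of Strategy B under the same assumptions (hypothetical name, same regression criterion as the master sketch above): no X_argsorted and no sample_mask, just a fresh argsort of the node's own samples per feature.

```python
import numpy as np

def find_split_fresh_sort(X, y, node_indices):
    # Sketch of Strategy B: each call re-sorts the node's own N_i samples
    # per feature, hence O(N_i log(N_i) * d) instead of O(N * d).
    best_feature, best_threshold, best_score = None, None, -np.inf
    for j in range(X.shape[1]):
        # fresh O(N_i log N_i) argsort restricted to the node's samples
        order = node_indices[np.argsort(X[node_indices, j], kind="mergesort")]
        v, t = X[order, j], y[order].astype(float)
        n, total, left_sum = len(t), t.sum(), 0.0
        for k in range(1, n):  # same O(N_i) threshold scan as before
            left_sum += t[k - 1]
            if v[k] == v[k - 1]:
                continue
            score = left_sum**2 / k + (total - left_sum)**2 / (n - k)
            if score > best_score:
                best_feature = j
                best_threshold = 0.5 * (v[k] + v[k - 1])
                best_score = score
    return best_feature, best_threshold, best_score
```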


Overall, I think Strategy A is the best of all, but the extra memory required is a significant disadvantage.

Regarding Strategy B, theory says that asymptotically it is better than master, even for building ensembles of trees. However, I agree that we should account for the constant factors behind this analysis. Still, I remain convinced that we should at least try and see! (I can work on that.)

amueller commented Nov 3, 2012

What I don't understand yet is how to do the sorting of X_argsort in Strategy A.
This cannot be done in place, right? So for each feature, you have to copy this part of X_argsort and write it back into the matrix. For each depth, you have to rebuild X_argsort once, which has cost N * d, right?
Well, not so bad...

glouppe commented Nov 3, 2012

@amueller Yes, sorting is in-place. It assumes that for each tree, we work on a dedicated copy of X_argsort.
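For illustration, a minimal sketch of that scheme (hypothetical names; the tree builder is passed in as a callable):

```python
def fit_forest_strategy_a(X, y, X_argsorted, n_trees, build_tree):
    # Each tree receives a private copy of X_argsorted that it may permute
    # in place (e.g., via a partition step like partition_argsorted above),
    # so trees never interfere with one another; n_jobs concurrent copies
    # are what costs the O(N * d * n_jobs) extra memory discussed above.
    return [build_tree(X, y, X_argsorted.copy()) for _ in range(n_trees)]
```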

glouppe commented Dec 6, 2012

Discussion continues in #1435. Closing this one.
