Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[C++] Support dictionaries directly in case_when kernel #29220

Closed
asfimport opened this issue Aug 5, 2021 · 2 comments
Closed

[C++] Support dictionaries directly in case_when kernel #29220

asfimport opened this issue Aug 5, 2021 · 2 comments

Comments

@asfimport
Copy link

asfimport commented Aug 5, 2021

case_when (and other similar kernels) currently dictionary-decode inputs, then operate on the decoded values. This is both inefficient and unexpected. We should instead operate directly on dictionary indices.

Of course, this introduces more edge cases. If the dictionaries of inputs do not match, we have the following choices:

  1. Raise an error.

  2. Unify the dictionaries.

  3. Use one of the dictionaries, and raise an error if an index of another dictionary cannot be mapped to an index of the chosen dictionary.

  4. Use one of the dictionaries, and emit null if an index of another dictionary cannot be mapped to an index of the chosen dictionary. (This is what base dplyr if_else does with factors.)

    All of these options are reasonable, so we should introduce an options struct. We can implement ARROW-10: Fix mismatch of javadoc names and method parameters #3 and ARROW-15: Fix a naming typo for memory.AllocationManager.AllocationOutcome #4 at first (to cover R); ARROW-9: Replace straggler references to Drill #2 isn't strictly necessary, as the user can unify the dictionaries manually first, but it may be more efficient to do it this way. Similarly, ARROW-5: Update drill-fmpp-maven-plugin to 1.5.0 #1 isn't strictly necessary.

#3 and #4 are justifiable (beyond just "it's what R does") since users may filter down disjoint dictionaries into a set of common values and then expect to combine the remaining values with a kernel like case_when.

As described on GitHub.

Reporter: David Li / @lidavidm
Assignee: David Li / @lidavidm

Related issues:

PRs and other links:

Note: This issue was originally created as ARROW-13573. Please see the migration documentation for further details.

@asfimport
Copy link
Author

David Li / @lidavidm:
Also, we can use the 'fast' approach for dictionaries (being able to write into slices, using preallocated outputs as implemented for numeric inputs, as opposed to the builder-based approach in ARROW-13222 for variable-width types) though we'll want to support nested dictionaries too (lists of dictionaries and such).

@asfimport
Copy link
Author

Antoine Pitrou / @pitrou:
Issue resolved by pull request 11022
#11022

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants