-
Notifications
You must be signed in to change notification settings - Fork 2.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Spec: add multi-arg transform support #8579
Conversation
My vision for the complete support of multi-arg transform would be:
I would commit to finish the API, Core, Spark(the poc PR: #8259 include these supports, but needs some refinements) and Flink support, Python if necessary. |
Gently ping @rdblue @RussellSpitzer @aokolnychyi @szehon-ho |
Hi, @advancedxy , thanks for the work. Sorry for the delay, I am just returning from paternity leave. Will love to see this get in to get work on zorder and geo-transforms. I left some comments |
WOW, big congrats on the arrival of your newborn. I will resume this work support once I finished my internal project, which I'm leveraging bucketing and sorting to support efficient upsert. It will depends on multi-arg bucket transform at some point. |
Thank you so much!
Sure, understood. Another possibility, if it will take awhile, is that I can also help with this pr to move it forward and we can be the co-authors |
Of course, thanks for offering. As listed in #8579 (comment), there are multiple parts about this feature. For this pr of spec change, I can address your comments in this week and hopefully you could help get more eyes from others on this spec change. I think we can be co-authors about this spec change and the whole feature. For other parts, It might take me a little more time to refactor/refine/decouple/impl. But it would be great if we can work together to move thing forward. |
4868bcf
to
097f1ab
Compare
format/spec.md
Outdated
@@ -314,7 +314,7 @@ Partition field IDs must be reused if an existing partition spec contains an equ | |||
| Transform name | Description | Source types | Result type | | |||
|-------------------|--------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------|-------------| | |||
| **`identity`** | Source value, unmodified | Any | Source type | | |||
| **`bucket[N]`** | Hash of value, mod `N` (see below) | `int`, `long`, `decimal`, `date`, `time`, `timestamp`, `timestamptz`, `timestamp_ns`, `timestamptz_ns`, `string`, `uuid`, `fixed`, `binary` | `int` | | |||
| **`bucket[N]`** | Hash of value(s), mod `N` (see below) | `int`, `long`, `decimal`, `date`, `time`, `timestamp`, `timestamptz`, `timestamp_ns`, `timestamptz_ns`, `string`, `uuid`, `fixed`, `binary` | `int` | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm wondering, weather we should simply add a new bucketV2
partition transform to distinguish the single-arg one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am ok with adding bucketv2 here. Let see what @aokolnychyi @rdblue think
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ref: I think it was the decision as per this discussion? https://docs.google.com/document/d/1aDoZqRgvDOOUVAGhvKZbp5vFstjsAMY4EFCyjlxpaaw/edit?disco=AAAA3dkHA5A
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd be inclined to keep just bucket
as long as we would return UnknownTransform
in old readers/writers. If keeping it as bucket
leads to exceptions, then we have to consider another name.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, thanks for the very quick update, I left some more comments
Co-authored-by: Szehon Ho <szehon.apache@gmail.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, left some more comments. I think its getting close
I also pinged @aokolnychyi about it, he said he will take a look this week.
format/spec.md
Outdated
@@ -314,7 +314,7 @@ Partition field IDs must be reused if an existing partition spec contains an equ | |||
| Transform name | Description | Source types | Result type | | |||
|-------------------|--------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------|-------------| | |||
| **`identity`** | Source value, unmodified | Any | Source type | | |||
| **`bucket[N]`** | Hash of value, mod `N` (see below) | `int`, `long`, `decimal`, `date`, `time`, `timestamp`, `timestamptz`, `timestamp_ns`, `timestamptz_ns`, `string`, `uuid`, `fixed`, `binary` | `int` | | |||
| **`bucket[N]`** | Hash of value(s), mod `N` (see below) | `int`, `long`, `decimal`, `date`, `time`, `timestamp`, `timestamptz`, `timestamp_ns`, `timestamptz_ns`, `string`, `uuid`, `fixed`, `binary` | `int` | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am ok with adding bucketv2 here. Let see what @aokolnychyi @rdblue think
format/spec.md
Outdated
|
||
| Field | JSON representation | Example | | ||
|---------------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| | ||
| **`Sort Field(multi-arg transform)`** | `JSON object: {`<br /> `"transform": <transform JSON>,`<br /> `"source-id": -1,`<br /> `"source-ids": <list of ids>,`<br /> `"direction": <direction string>,`<br /> `"null-order": <null-order string>`<br />`}` | `{`<br /> ` "transform": "bucketV2[4]",`<br /> ` "source-id": -1,`<br /> ` "source-id": [1,2],`<br /> ` "direction": "desc",`<br /> ` "null-order": "nulls-last"`<br />`}` | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we add Notes section and also add the note from partition field here (about when to emit and omit source-id
and source-ids
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for your suggestion. I added the notes, and after reviewing this part, I think the table of Sort Fields
could be more consistent with Partition Fields
and therefore changes that part. WDYT?
format/spec.md
Outdated
|Field|JSON representation|Example| | ||
|--- |--- |--- | | ||
|**`Sort Field`**|`JSON object: {`<br /> `"transform": <transform JSON>,`<br /> `"source-id": <source id int>,`<br /> `"direction": <direction string>,`<br /> `"null-order": <null-order string>`<br />`}`|`{`<br /> ` "transform": "bucket[4]",`<br /> ` "source-id": 3,`<br /> ` "direction": "desc",`<br /> ` "null-order": "nulls-last"`<br />`}`| | ||
| Field | JSON representation | Example | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here.
format/spec.md
Outdated
|
||
Notes: | ||
1. For sort fields with a transform with a single argument, the id of the source field is set on `source-id`, and `source-ids` is omitted. | ||
2. For sort fields with a transform of multiple arguments, the ids of the source fields are set on `source-ids`. To preserve backward compatibility, `source-id` is set to -1. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is writing an explicit source-id
required to avoid exceptions compared to not writing that at all?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unfortunately, I believe an explicit source-id
is required to avoid exception in old versions.
See
int sourceId = JsonUtil.getInt(SOURCE_ID, element); |
and
int sourceId = JsonUtil.getInt(SOURCE_ID, element); |
This seems in a pretty good shape. I guess the open question is about cc @RussellSpitzer @danielcweeks @nastra @Fokko @rdblue @jackye1995 @amogh-jahagirdar |
I'll also take a look at this tomorrow morning as well, thanks @advancedxy ! |
@rdblue recently pointed me to the Bloom filter spec in Parquet. I think it contains a few interesting ideas that may be applicable to us. First of all, we should evaluate other hash functions apart from Murmur3. Parquet, for instance, uses xxHash that is supposed to be much faster. Second, Parquet avoids the modulo operator for performance reasons. Given all this information, I suggest we make this PR about multi-arg transforms in general (like how they are stored, how they are serialized, what happens during schema evolution, compatibility etc) and submit another one with @advancedxy @szehon-ho, how does this sound? |
Both sounds great improvement to me. Apart from faster hash, I'd like to add another possible option to explore: user defined hash function for bucket transform while we are working
I'm ok to merge multi-arg transform first. However I'm not sure how to provide examples for single-arg transform v.s. multi-arg transform as there will be no |**`Partition Field`** [2]|`JSON object: {`<br /> `"source-id": <id int>,`<br /> `"field-id": <field id int>,`<br /> `"name": <name string>,`<br /> `"transform": <transform JSON>`<br />`}`|`{`<br /> `"source-id": 1,`<br /> `"field-id": 1000,`<br /> `"name": "id_bucket",`<br /> `"transform": "bucket[16]"`<br />`}`|
|**`Partition Field with multi-arg transform`** [3]|`JSON object: {`<br /> `"source-id": -1,`<br /> `"source-ids": <list of ids>,`<br /> `"field-id": <field id int>,`<br /> `"name": <name string>,`<br /> `"transform": <transform JSON>`<br />`}`|`{`<br /> `"source-id": -1,`<br /> `"source-ids": [1,2],`<br /> `"field-id": 1000,`<br /> `"name": "id_type_bucket",`<br /> `"transform": "bucketV2[16]"`<br />`}`| @szehon-ho @aokolnychyi do you have any suggestions? |
Hi @advancedxy , I'm ok to leave that for the next pr. How about we just keep the notes for PartitionField and SortOrder like?
And just omit the example until we the bucketv2 pr? |
@szehon-ho @aokolnychyi the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me!
Merged, thanks @advancedxy ! Feel free to work on bucketv2 spec, and we can make any other follow ups as well |
This change seems reasonable to me. @advancedxy, could you also post to the dev list that this was merged to get any input from folks who did not review before we release 1.5? I feel that would be important as it is a spec change. |
of course, nice suggestion. |
Notes: | ||
|
||
1. For partition fields with a transform with a single argument, the ID of the source field is set on `source-id`, and `source-ids` is omitted. | ||
2. For partition fields with a transform of multiple arguments, the IDs of the source fields are set on `source-ids`. To preserve backward compatibility, `source-id` is set to -1. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Small nitpick. It seems that it would be better to choose a field ID from the existing range for reserved field IDs (e.g. MAX_INT-200) then to use -1, which as far as I can tell is still potentially a valid field according to the spec (I might have missed it but field IDs simply seem to be defined as integers).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd prefer using the first column from the source ID list instead of a fake ID. That way older readers at least see that the transform is associated with one of the correct columns.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think using a valid source ID here would lead to incorrect results for old clients if a predicate is specified on the column. IIUC invalid ID here makes sure reads should always be correct or fail which seems like better semantics if the aim is forwards compatibility
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems that it would be better to choose a field ID from the existing range for reserved field IDs (e.g. MAX_INT-200) then to use -1,
Per my understanding, multi-arg transforms will mostly get a new transform name rather than the existing ones. Older readers will treat this multi-arg transform as an UnknownTransform
, the persisted source-id
is just to make old code happy, see this reply as well: #8579 (comment). So the value
of source-id
is just a place holder and doesn't make too much sense. It could be a field ID from the reserved range or a negative one since the current reference implementation wouldn't produce a negative field id.
I simply choose -1
as it seems more nature and doesn't need to put a somehow weird reserved field in the MetadataColumns.java
, but I think we make always make follow-up pr if there's valid concerns/solutions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If it is a different transform (I wasn't clear on the final status there) I think it makes it less important so at this point it is bike shedding but I think having a clear signal that this field is meaningless might be useful. I think for V3 it might be worthwhile to consider dropping the backwards compatibility.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yea I also think older readers will not be able to make any use of the new mulit-arg transforms. So they would only be able to read new tables (though without any partition pushdown), and would fail to write. So I agree , it is moot what to even put for source-id here, though I think choosing a reserved one is a good idea. Is it just so the java reference implementation can properly de-serialize as Unknown and have a better exception message?
Is the idea in v1/v2 to write source-id column as -1/reserved, and in v3, we will write source-ids for everything and drop source-id column?
I guess this is a more general discussion and can wait the new spec pr clarifying v1/2 vs v3 behaviors.
As discussed in #8258 and its google doc: https://docs.google.com/document/d/1aDoZqRgvDOOUVAGhvKZbp5vFstjsAMY4EFCyjlxpaaw/edit?usp=sharing.
This PR adds multi-arg transform into the spec.
Co-authored-by: Szehon Ho szehon.apache@gmail.com