-
Notifications
You must be signed in to change notification settings - Fork 245
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
DRIVERS-3097 Update test cases for BSON Binary Vector #1750
Conversation
"canonical_bson": "1C00000005766563746F72000A0000000927030000FE420000E04000" | ||
}, | ||
{ | ||
"description": "Insufficient vector data FLOAT32", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like this test introduces a use case where drivers should validate for the absence of the array, as when a null
reference is passed.
If that's the case, we should extend this test to cover INT8 and PackedBit, as well as other relevant fields which could be absent.I also think, explicitly defining "vector": null
or "dtype": null
could make the test case more explicit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The "canonical_bson" in "Insufficient vector data FLOAT32" is a corrupted BSON with insufficient bytes for 4-byte float32, so it lacks a corresponding float32 vector. It makes sense to use "vector": null
to make the case more explicit and consistent with other cases.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the clarification! At first glance, I thought the test was validating the absence of the entire vector array.
There is a section outlining what drivers should do in both valid and invalid cases. In valid cases, canonical_bson
must be present since drivers are expected to encode and decode it to verify correctness.
Now that we have canonical_bson
in invalid cases, it's unclear whether we should rely only on numerical values or also consider canonical_bson
. The spec states:
To prove correctness in an invalid case (valid: false), one MUST:
Raise an exception when attempting to encode a document from the numeric values, dtype, and padding.
Should we add another line for clarity? For example:
When the vector field is absent, drivers MUST decode canonical_bson, as this contains corrupted vector data that can't be represented via the numerical vector field.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the test improvements! Is there a drivers ticket for this change?
@@ -29,7 +29,7 @@ Each JSON file contains three top-level keys. | |||
|
|||
- `description`: string describing the test. | |||
- `valid`: boolean indicating if the vector, dtype, and padding should be considered a valid input. | |||
- `vector`: list of numbers | |||
- `vector`: (required if valid is true) list of numbers |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's a section on validation in the specs that we should update too:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When the vector
field is absent in an invalid case, is canonical_bson
now a required field?
@@ -50,7 +50,8 @@ MUST assert that the input float array is the same after encoding and decoding. | |||
|
|||
#### To prove correct in an invalid case (`valid:false`), one MUST | |||
|
|||
- raise an exception when attempting to encode a document from the numeric values, dtype, and padding. | |||
- when the vector field exists, raise an exception for encoding a document from the numeric values, dtype, and padding. | |||
- when the canonical_bson field exists, raise an exception for decoding it, as the field contains corrupted data that can't be decoded into a vector. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It may be a breaking change for drivers to fail to read Vectors that have already been inserted without this requirement.
How about drivers "fix" their serializers to stop making float32 vectors that have been created from the canonical_bson
bytes?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It should be fine to just "read" the Vector (as a BSON Binary). However, the specs need to clarify how to "decode/unserialize" to a sequence of float32 when the bytes for the vector elements are corrupted to store the sequence as the description as well as expose the sequence to the driver users.
To prevent incompatibility issues, the exception doesn't have to be raised when reading the Binary of Vector from the server. It can be raised during the actual decoding/conversion (e.g. in the getter method of the float32 sequence).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It can be raised during the actual decoding/conversion (e.g. in the getter method of the float32 sequence).
Perfect, and yeah that's how it works in node, our toFloat32Array()
helper throws and serialize
/stringify
throw
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will clarify the specs with an update.
|
||
## Changelog | ||
|
||
- 2025-02-04: Update validation for decoding into a FLOAT32 vector. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add the date of the originally accepted spec:
- 2024-11-01: BSON Binary Subtype 9 accepted DRIVERS-2926 (#1708)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The link for drivers doesn't render in the code. Maybe just leave it as [DRIVERS-2926]
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would you also please add the JIRA ticket and PR to the latest changelog entry?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BTW, do we have the convention of appending the ticket numbers? IMO, they are trackable from the git change log.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The git change log contains all commits, not just those related to bson binary vectors. If we add just the PR, our convention is to include the jira ticket though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
|
||
## Changelog | ||
|
||
- 2025-02-04: Update validation for decoding into a FLOAT32 vector. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The link for drivers doesn't render in the code. Maybe just leave it as [DRIVERS-2926]
?
|
||
## Changelog | ||
|
||
- 2025-02-04: Update validation for decoding into a FLOAT32 vector. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would you also please add the JIRA ticket and PR to the latest changelog entry?
"dtype_hex": "0x10", | ||
"dtype_alias": "PACKED_BIT", | ||
"padding": 1 | ||
}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What was wrong with this test? An empty array but a non-zero padding?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a duplicate of L5-L12
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice spot! Thank you!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
DRIVERS-3097
Please complete the following before merging:
clusters, and serverless).