DRIVERS-3097 Update test cases for BSON Binary Vector #1750

Merged
merged 6 commits into from
Feb 7, 2025

qingyang-hu
Contributor

@qingyang-hu qingyang-hu commented Feb 3, 2025

DRIVERS-3097

  • Remove duplicate cases in "packed_bit.json"
  • Add "canonical_bson" for appropriate invalid cases
  • Add a case in "float32.json" containing corrupted "canonical_bson" with insufficient bytes for 4-byte float32

Please complete the following before merging:

  • Update changelog.
  • Test changes in at least one language driver.
  • Test these changes against all server versions and topologies (including standalone, replica set, sharded
    clusters, and serverless).

@qingyang-hu qingyang-hu requested a review from a team as a code owner February 3, 2025 23:10
@qingyang-hu qingyang-hu requested review from dariakp and caseyclements and removed request for a team February 3, 2025 23:10
"canonical_bson": "1C00000005766563746F72000A0000000927030000FE420000E04000"
},
{
"description": "Insufficient vector data FLOAT32",
Member

@vbabanin vbabanin Feb 4, 2025

It looks like this test introduces a use case where drivers should validate for the absence of the array, as when a null reference is passed.

If that's the case, we should extend this test to cover INT8 and PackedBit, as well as other relevant fields that could be absent. I also think explicitly defining "vector": null or "dtype": null could make the test case more explicit.

Contributor Author

The "canonical_bson" in "Insufficient vector data FLOAT32" is a corrupted BSON with insufficient bytes for 4-byte float32, so it lacks a corresponding float32 vector. It makes sense to use "vector": null to make the case more explicit and consistent with other cases.
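To make "insufficient bytes" concrete, here is a minimal Python sketch that splits the canonical_bson string shown in the diff context above into its dtype byte, padding byte, and payload, then tries to read the payload as float32 values. The function names are hypothetical, and the checks (padding must be 0 for FLOAT32, payload length must be a multiple of 4) are my reading of the validation being discussed, not driver code.

```python
import struct

FLOAT32 = 0x27  # dtype byte that appears in the canonical_bson above


def split_vector_document(hex_doc: str):
    """Return (dtype, padding, payload) from a one-field BSON document
    whose only value is a binary of subtype 9 (hypothetical helper)."""
    raw = bytes.fromhex(hex_doc)
    (doc_len,) = struct.unpack_from("<i", raw, 0)
    if doc_len != len(raw) or raw[4] != 0x05:
        raise ValueError("not a single-binary-field BSON document")
    name_end = raw.index(b"\x00", 5)  # terminator of the field-name cstring
    (bin_len,) = struct.unpack_from("<i", raw, name_end + 1)
    subtype = raw[name_end + 5]
    data = raw[name_end + 6 : name_end + 6 + bin_len]
    if subtype != 0x09 or len(data) != bin_len or bin_len < 2:
        raise ValueError("not a BSON Binary Vector")
    return data[0], data[1], data[2:]


def float32_values(dtype: int, padding: int, payload: bytes):
    """Decode the payload as little-endian float32, rejecting corrupted input."""
    if dtype != FLOAT32:
        raise ValueError("dtype is not FLOAT32")
    if padding != 0:
        raise ValueError("FLOAT32 vectors must have padding 0")
    if len(payload) % 4 != 0:
        raise ValueError("payload length is not a multiple of 4 bytes")
    return list(struct.unpack(f"<{len(payload) // 4}f", payload))


dtype, padding, payload = split_vector_document(
    "1C00000005766563746F72000A0000000927030000FE420000E04000"
)
# dtype is 0x27, padding is 3, and the payload holds 8 bytes (two float32 values),
# so float32_values(dtype, padding, payload) raises because of the padding.
```

A payload truncated by one byte (for example `payload[:7]` with padding 0) would fail the length check instead, which is the kind of corruption the new float32.json case is meant to exercise.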

Member

@vbabanin vbabanin Feb 4, 2025

Thank you for the clarification! At first glance, I thought the test was validating the absence of the entire vector array.

There is a section outlining what drivers should do in both valid and invalid cases. In valid cases, canonical_bson must be present since drivers are expected to encode and decode it to verify correctness.

Now that we have canonical_bson in invalid cases, it's unclear whether we should rely only on numerical values or also consider canonical_bson. The spec states:

To prove correctness in an invalid case (valid: false), one MUST:
Raise an exception when attempting to encode a document from the numeric values, dtype, and padding.

Should we add another line for clarity? For example:

When the vector field is absent, drivers MUST decode canonical_bson, as this contains corrupted vector data that can't be represented via the numerical vector field.

Contributor

@nbbeeken nbbeeken left a comment

Thanks for the test improvements! Is there a drivers ticket for this change?

@@ -29,7 +29,7 @@ Each JSON file contains three top-level keys.

 - `description`: string describing the test.
 - `valid`: boolean indicating if the vector, dtype, and padding should be considered a valid input.
-- `vector`: list of numbers
+- `vector`: (required if valid is true) list of numbers
Member

When the vector field is absent in an invalid case, is canonical_bson now a required field?

@qingyang-hu qingyang-hu changed the title Update test cases for BSON Binary Vector DRIVERS-3097 Update test cases for BSON Binary Vector Feb 4, 2025
@qingyang-hu
Contributor Author

@@ -50,7 +50,8 @@ MUST assert that the input float array is the same after encoding and decoding.

 #### To prove correct in an invalid case (`valid:false`), one MUST

-- raise an exception when attempting to encode a document from the numeric values, dtype, and padding.
+- when the vector field exists, raise an exception for encoding a document from the numeric values, dtype, and padding.
+- when the canonical_bson field exists, raise an exception for decoding it, as the field contains corrupted data that can't be decoded into a vector.
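
A test runner could apply the two revised rules roughly as follows. This is only a sketch: `check_invalid_case` is a hypothetical name, and `encode` and `decode` stand in for whatever hooks a given driver exposes; only the JSON field names come from the test files.

```python
def check_invalid_case(case, encode, decode):
    """Apply the invalid-case rules to one test case and return which checks ran.

    `encode` and `decode` are hypothetical driver hooks, not a real API.
    """
    if case["valid"]:
        raise ValueError("expected a case with valid: false")
    ran = []

    # Rule 1: when `vector` exists (and is not null), encoding must raise.
    if case.get("vector") is not None:
        try:
            encode(case["vector"], case["dtype_hex"], case.get("padding", 0))
        except Exception:
            ran.append("encode")
        else:
            raise AssertionError(f"{case['description']}: encode did not raise")

    # Rule 2: when `canonical_bson` exists, decoding it must raise.
    if case.get("canonical_bson") is not None:
        try:
            decode(bytes.fromhex(case["canonical_bson"]))
        except Exception:
            ran.append("decode")
        else:
            raise AssertionError(f"{case['description']}: decode did not raise")

    return ran
```

Treating `"vector": null` the same as a missing key keeps the runner compatible with the more explicit form suggested earlier in the review.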
Contributor

It may be a breaking change for drivers to fail to read Vectors that have already been inserted without this requirement.

How about drivers "fix" their serializers to stop making float32 vectors that have been created from the canonical_bson bytes?

Contributor Author

It should be fine to just "read" the Vector (as a BSON Binary). However, the specs need to clarify what should happen when a driver "decodes/unserializes" the Binary into a sequence of float32, whether to store that sequence or to expose it to driver users, and the bytes for the vector elements are corrupted.

To prevent incompatibility issues, the exception doesn't have to be raised when reading the Binary of Vector from the server. It can be raised during the actual decoding/conversion (e.g. in the getter method of the float32 sequence).
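
As a rough illustration of deferring the error to conversion time, assuming a hypothetical wrapper class (not any driver's real type) that keeps the raw subtype 9 bytes untouched:

```python
import struct


class Float32VectorBinary:
    """Hypothetical wrapper around the raw bytes of a subtype 9 binary."""

    FLOAT32 = 0x27

    def __init__(self, data: bytes):
        # No validation here: reading the Binary from the server never fails.
        self.data = data

    def to_float32_list(self):
        """Validate and convert only when the caller asks for float32 values."""
        if len(self.data) < 2:
            raise ValueError("missing dtype or padding byte")
        dtype, padding, payload = self.data[0], self.data[1], self.data[2:]
        if dtype != self.FLOAT32:
            raise ValueError("dtype is not FLOAT32")
        if padding != 0:
            raise ValueError("FLOAT32 vectors must have padding 0")
        if len(payload) % 4 != 0:
            raise ValueError("insufficient bytes for a 4-byte float32")
        return list(struct.unpack(f"<{len(payload) // 4}f", payload))
```

With this shape, documents that already contain corrupted vectors stay readable as Binary values, and only code that asks for the float32 sequence sees the exception.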

Contributor

It can be raised during the actual decoding/conversion (e.g. in the getter method of the float32 sequence).

Perfect, and yeah, that's how it works in Node: our toFloat32Array() helper throws, and serialize/stringify throw as well.

Contributor Author

Will clarify the specs with an update.


## Changelog

- 2025-02-04: Update validation for decoding into a FLOAT32 vector.
Contributor

Please add the date of the originally accepted spec:

Contributor

The link for drivers doesn't render in the code. Maybe just leave it as [DRIVERS-2926]?

Contributor

Would you also please add the JIRA ticket and PR to the latest changelog entry?

Contributor Author

BTW, do we have the convention of appending the ticket numbers? IMO, they are trackable from the git change log.

Contributor

The git change log contains all commits, not just those related to bson binary vectors. If we add just the PR, our convention is to include the jira ticket though.

@qingyang-hu qingyang-hu requested a review from nbbeeken February 5, 2025 21:20
Contributor

@nbbeeken nbbeeken left a comment

LGTM



"dtype_hex": "0x10",
"dtype_alias": "PACKED_BIT",
"padding": 1
},
Contributor

What was wrong with this test? An empty array but a non-zero padding?

Contributor Author

Just a duplicate of L5-L12

Contributor

Nice spot! Thank you!
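
Duplicates like this one can be caught mechanically. The sketch below assumes the cases have already been loaded into a list of dicts and that two cases count as duplicates when every field other than `description` matches; both the function name and that rule are assumptions for illustration.

```python
import json


def find_duplicate_cases(tests):
    """Return (first_index, duplicate_index) pairs for repeated test cases.

    Cases are compared on every field except `description` (an assumption).
    """
    seen = {}
    duplicates = []
    for index, case in enumerate(tests):
        fields = {k: v for k, v in case.items() if k != "description"}
        key = json.dumps(fields, sort_keys=True)  # order-independent fingerprint
        if key in seen:
            duplicates.append((seen[key], index))
        else:
            seen[key] = index
    return duplicates
```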

Contributor

@caseyclements caseyclements left a comment

LGTM
