Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: new text embedding for sparse vector #466

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

cutecutecat
Copy link
Member

@cutecutecat cutecutecat commented Apr 15, 2024

Part of #459

  • Changed svector text representation from "[0, 1, 0, 3, 4]" to pgvector-style "{2:1, 4:3, 5:4}/5"
  • Upgraded toolchain for new feature proc_macro_byte_character from upstream

Reminder

The index is from 1 instead of 0 at pgvector

Old New
[0, 1, 2, 0, 0] {2:1, 3:2}/5
[0, 1, 2, 3, 4, 5, 6, 7] {2:1, 3:2, 4:3, 5:4, 6:5, 7:6, 8:7}/8
[5, 6, 7] {1:5, 2:6, 3:7}/3
[0, 0, 0, 0] {}/4

@cutecutecat
Copy link
Member Author

cutecutecat commented Apr 15, 2024

The failed CI is due to an upstream uncompatiablity:

  • proc-macro2 v1.0.80 introduced proc_macro_byte_character feature, which is recently stablized but not released until 2024-04-12

@cutecutecat cutecutecat force-pushed the svec-new-text-rep branch 3 times, most recently from 2089054 to 0e0c58b Compare April 16, 2024 13:52
@cutecutecat cutecutecat marked this pull request as ready for review April 17, 2024 01:38
@usamoi
Copy link
Collaborator

usamoi commented Apr 18, 2024

The reason why CI fails is that sqllogictest-bin released on crates.io but not on GitHub a few days before (so cargo-binstall fallbacks to build natively). We do not have to update the toolchain.

src/datatype/text_svecf32.rs Outdated Show resolved Hide resolved
if *x != F32::zero() {
match need_splitter {
true => {
buffer.push_str(format!("{}:{}", i + 1, x).as_str());
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel not good about indexing from 1. It's not consistent with subscripting.

@cutecutecat cutecutecat force-pushed the svec-new-text-rep branch 4 times, most recently from 69e527e to 2823e00 Compare April 19, 2024 01:31
@VoVAllen
Copy link
Member

Let's hold this PR for now, due to the conflict between 1-based array and 0-based array

@usamoi usamoi reopened this May 28, 2024
Signed-off-by: cutecutecat <junyuchen@tensorchord.ai>
Signed-off-by: usamoi <usamoi@outlook.com>
@VoVAllen
Copy link
Member

This is used to support bm25 extension. It can produce string instead of depending on pgvecto.rs/pgvector. cc @cutecutecat

return Err(ParseVectorError::BadParentheses { character: '{' });
};
let mut token: ArrayVec<u8, 48> = ArrayVec::new();
let mut capacity = reserve;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It reserves too large since vector is sparse.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants