feat: new text embedding for sparse vector #466

cutecutecat · 2024-04-15T02:13:46Z

Part of #459

Changed svector text representation from "[0, 1, 0, 3, 4]" to pgvector-style "{2:1, 4:3, 5:4}/5"
Upgraded the toolchain for the new upstream feature proc_macro_byte_character

Reminder

In pgvector, indexes start from 1 instead of 0.

| Old | New |
| --- | --- |
| `[0, 1, 2, 0, 0]` | `{2:1, 3:2}/5` |
| `[0, 1, 2, 3, 4, 5, 6, 7]` | `{2:1, 3:2, 4:3, 5:4, 6:5, 7:6, 8:7}/8` |
| `[5, 6, 7]` | `{1:5, 2:6, 3:7}/3` |
| `[0, 0, 0, 0]` | `{}/4` |
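
For illustration, here is a minimal sketch of the conversion shown in the table. It is not the PR's actual code: `svector_to_text` and its `base` parameter are hypothetical names, and plain `f32` stands in for the project's `F32` type. Passing `base = 1` reproduces the pgvector-style numbering above.

```rust
/// Hypothetical helper (not taken from the PR): render a dense slice in the
/// sparse `{index:value, ...}/dims` text form.
/// `base` is the index assigned to the first element, e.g. 1 for the
/// pgvector-style numbering used in the table.
fn svector_to_text(dense: &[f32], base: usize) -> String {
    let mut buffer = String::from("{");
    let mut need_splitter = false;
    for (i, x) in dense.iter().enumerate() {
        // Only non-zero elements are written out.
        if *x != 0.0 {
            if need_splitter {
                buffer.push_str(", ");
            }
            buffer.push_str(&format!("{}:{}", i + base, x));
            need_splitter = true;
        }
    }
    // The trailing `/dims` keeps the full dimension, including zeros.
    buffer.push_str(&format!("}}/{}", dense.len()));
    buffer
}
```

With this sketch, `svector_to_text(&[0.0, 1.0, 2.0, 0.0, 0.0], 1)` yields `{2:1, 3:2}/5`, and an all-zero slice of length 4 yields `{}/4`.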

cutecutecat · 2024-04-15T02:56:49Z

The CI failure is due to an upstream incompatibility:

proc-macro2 v1.0.80 introduced the proc_macro_byte_character feature, which was only recently stabilized and was not released until 2024-04-12.

usamoi · 2024-04-18T06:55:35Z

CI fails because sqllogictest-bin was released on crates.io but not on GitHub a few days ago, so cargo-binstall falls back to building it from source. We do not have to update the toolchain.

src/datatype/text_svecf32.rs

usamoi · 2024-04-18T06:59:27Z

src/datatype/text_svecf32.rs

+        if *x != F32::zero() {
+            match need_splitter {
+                true => {
+                    buffer.push_str(format!("{}:{}", i + 1, x).as_str());


I'm not comfortable with indexing from 1. It's inconsistent with subscripting.

VoVAllen · 2024-04-19T03:36:03Z

Let's hold this PR for now because of the conflict between 1-based and 0-based array indexing.

VoVAllen · 2024-05-29T08:54:07Z

This is used to support the bm25 extension: it can produce a string instead of depending on pgvecto.rs/pgvector. cc @cutecutecat

usamoi · 2024-05-30T07:56:26Z

src/utils/parse.rs

+        return Err(ParseVectorError::BadParentheses { character: '{' });
+    };
+    let mut token: ArrayVec<u8, 48> = ArrayVec::new();
+    let mut capacity = reserve;


This reserves too much capacity, since the vector is sparse.
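
To make the capacity concern concrete, here is a rough sketch of a parser whose storage grows with the number of entries actually present rather than with the declared dimension. It is not the code in `src/utils/parse.rs`: `parse_svector` and its error strings are hypothetical, plain `f32` and `usize` stand in for the project's types, and indexes are returned exactly as written, with no base conversion.

```rust
/// Hypothetical parser (not the PR's implementation) for text such as
/// `{1:1, 2:2}/5`. Returns (indexes, values, dims).
fn parse_svector(text: &str) -> Result<(Vec<usize>, Vec<f32>, usize), String> {
    // Split off the trailing `/dims` part first.
    let (body, dims) = text
        .trim()
        .rsplit_once('/')
        .ok_or_else(|| "missing '/dims' suffix".to_string())?;
    let dims: usize = dims.trim().parse().map_err(|e| format!("bad dims: {e}"))?;
    let inner = body
        .trim()
        .strip_prefix('{')
        .and_then(|s| s.strip_suffix('}'))
        .ok_or_else(|| "expected '{...}'".to_string())?;
    // Start empty and let the vectors grow with the entries that are
    // actually present, instead of reserving `dims` slots up front.
    let mut indexes = Vec::new();
    let mut values = Vec::new();
    // Empty pieces (e.g. from `{}`) are skipped in this sketch.
    for entry in inner.split(',').map(str::trim).filter(|s| !s.is_empty()) {
        let (i, v) = entry
            .split_once(':')
            .ok_or_else(|| format!("expected 'index:value', got '{entry}'"))?;
        indexes.push(i.trim().parse::<usize>().map_err(|e| format!("bad index: {e}"))?);
        values.push(v.trim().parse::<f32>().map_err(|e| format!("bad value: {e}"))?);
    }
    Ok((indexes, values, dims))
}
```

In this sketch, a vector declared with a very large dimension but only a handful of non-zero entries allocates space for those entries only.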

cutecutecat force-pushed the svec-new-text-rep branch 3 times, most recently from 2089054 to 0e0c58b Compare April 16, 2024 13:52

cutecutecat marked this pull request as ready for review April 17, 2024 01:38

cutecutecat requested review from VoVAllen and usamoi April 17, 2024 01:38

usamoi reviewed Apr 18, 2024

VoVAllen mentioned this pull request Apr 18, 2024

install patched pgrx failed #468

Closed

cutecutecat force-pushed the svec-new-text-rep branch 4 times, most recently from 69e527e to 2823e00 Compare April 19, 2024 01:31

cutecutecat closed this Apr 26, 2024

usamoi reopened this May 28, 2024

feat: new text embedding for sparse vector

0c57ad4

Signed-off-by: cutecutecat <junyuchen@tensorchord.ai>

usamoi force-pushed the svec-new-text-rep branch from 2823e00 to 0c57ad4 Compare May 28, 2024 12:26

fix: use 0-based index

2d6c196

Signed-off-by: usamoi <usamoi@outlook.com>

usamoi force-pushed the svec-new-text-rep branch from f86433b to 2d6c196 Compare May 28, 2024 12:38

VoVAllen approved these changes May 29, 2024

usamoi reviewed May 30, 2024
