[JS] Dictionary encoded values repeating between record batches #41683

Open
vivek1729 opened this issue May 16, 2024 · 1 comment

Comments

@vivek1729

Describe the bug, including details regarding any error messages, version, and platform.

We are using the AsyncRecordBatchStreamReader to read several record batches from an HTTP response stream. The record batches are dictionary encoded, and we noticed that after the first record batch has been read, later batches return the same values as the first one.

For a minimal repro, the synchronous RecordBatchStreamReader shows the same problem with regular (non-delta) dictionary encoding. I've attached a sample text file that contains the base64-encoded Arrow serialized results for two record batches. Both record batches are dictionary encoded (not delta); the first contains 200 records and the second contains just one. Attachment: ArrowDebugging-WrongRecordBatch.txt

I'm simply trying to retrieve the value of the first column in each of these two batches; the two values are expected to be different.
Here's a snippet to reproduce this behavior:

import { readFileSync } from 'fs';
import * as arrow from 'apache-arrow';

function readFileAsStream(fileName: string) {
    // Read the contents of the text file as a base64 string
    const base64String = readFileSync(fileName, 'utf-8');
    // Decode the base64 string
    const binaryString = atob(base64String);

    // Convert the binary string to a Uint8Array
    const bytes = new Uint8Array(binaryString.length);
    for (let i = 0; i < binaryString.length; i++) {
        bytes[i] = binaryString.charCodeAt(i);
    }

    const reader = arrow.RecordBatchStreamReader.from(bytes);

    // Read record batches until the reader reports that it is done
    let batch = reader.next();
    while (!batch.done) {
        // Print the first dictionary entry of the first column
        console.log(batch.value.data.children[0].dictionary?.get(0));
        batch = reader.next();
    }
}

Observed result:

'2013042345'
'2013042345'

Expected result (notice the second value is different from the first one):

'2013042345'
'2012020145'

Since the record batches are not delta dictionary encoded, I'd expect the dictionary associated with the first record batch to be replaced by a separate dictionary when the second batch is read. I was looking at related issues and wonder whether this might be related to #23572.
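
To make that expectation concrete, here is a minimal sketch of the dictionary bookkeeping I would expect from a stream reader. It does not use any Arrow APIs: `DictionaryMessage`, `applyDictionary`, and the `state` map are hypothetical names that only model the replace-versus-delta semantics as I understand them from the IPC format. It is not a description of how the JS reader is actually implemented.

```typescript
// Hypothetical model of a dictionary message in an IPC stream (not an Arrow type).
interface DictionaryMessage {
    id: number;        // dictionary id referenced by the encoded column
    isDelta: boolean;  // true: append to the existing values; false: replace them
    values: string[];
}

// Apply one dictionary message to the reader-side state (dictionary id -> values).
function applyDictionary(state: Map<number, string[]>, msg: DictionaryMessage): void {
    const existing = state.get(msg.id);
    if (msg.isDelta && existing) {
        // Delta dictionary: extend the values already known for this id.
        state.set(msg.id, [...existing, ...msg.values]);
    } else {
        // Non-delta dictionary: discard whatever was stored for this id.
        state.set(msg.id, [...msg.values]);
    }
}

const state = new Map<number, string[]>();

// Dictionary that precedes the first record batch.
applyDictionary(state, { id: 0, isDelta: false, values: ['2013042345'] });
const firstBatchValue = state.get(0)![0];

// Non-delta dictionary with the same id that precedes the second record batch.
applyDictionary(state, { id: 0, isDelta: false, values: ['2012020145'] });
const secondBatchValue = state.get(0)![0];

console.log(firstBatchValue, secondBatchValue);
```

With replacement semantics, index 0 resolves to a different value for each batch, which matches the expected result above. The observed result looks as if the first dictionary were still in use when the second batch is decoded.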

Additionally, I'd like to understand the recommended way to read multiple dictionary-encoded record batches from an HTTP stream. I assume we can keep calling reader.next() to read record batches as they arrive on the stream, but I'd like to confirm that understanding.

Component(s)

JavaScript

@mirdaki

mirdaki commented May 16, 2024

To add to this, we think the issue stems from the reader's internal dictionaries state not being reset after finishing a record batch.

@felipecrv felipecrv changed the title Dictionary encoded values repeating between record batches [JS] Dictionary encoded values repeating between record batches May 17, 2024