Describe the bug, including details regarding any error messages, version, and platform.
We are trying to use the AsyncRecordBatchStreamReader to read several record batches from an HTTP response stream. The record batches are dictionary encoded, and we noticed that the values in these records start repeating after the first record batch has been read.
For a minimal repro, we can see that the RecordBatchStreamReader repeats values for record batches with regular dictionary encoding as well. I've attached a sample txt file that contains the Arrow-serialized results for 2 record batches. Both record batches are dictionary encoded (not delta); the first contains 200 records and the second contains just one.
Attachment: ArrowDebugging-WrongRecordBatch.txt
I'm simply trying to retrieve the value for the first column in each of these 2 batches; the two values are expected to be different.
Here's a snippet to repro this behavior:

```typescript
import { readFileSync } from 'fs';
import * as arrow from 'apache-arrow';

function readFileAsStream(fileName: string) {
  // Read the contents of the text file as a string
  const base64String = readFileSync(fileName, 'utf-8');

  // Decode the base64 string
  const binaryString = atob(base64String);

  // Convert the binary string to a Uint8Array
  const bytes = new Uint8Array(binaryString.length);
  for (let i = 0; i < binaryString.length; i++) {
    bytes[i] = binaryString.charCodeAt(i);
  }

  const reader = arrow.RecordBatchStreamReader.from(bytes);

  // Read the record batches
  let batch;
  while (batch = reader.next()) {
    if (!batch || batch?.done) {
      break;
    }
    if (batch.value) {
      // Log the first dictionary entry of the first column
      console.log(batch.value.data.children[0].dictionary?.get(0));
    }
  }
}
```
Observed result:
'2013042345'
'2013042345'
Expected result (notice the second value is different from the first one):
'2013042345'
'2012020145'
Since the record batches are not delta dictionary encoded, I'd expect the dictionary associated with the first record batch to be replaced with a separate dictionary when the second batch is read. I was looking at related issues and I wonder if this might be related to #23572.
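To make the expected behavior concrete, here is a minimal toy model of the dictionary handling I have in mind: a non-delta dictionary message replaces the stored dictionary for its id, while a delta message appends to it. This is only a sketch. `DictionaryMessage` and `DictionaryState` are hypothetical names made up for illustration and are not part of the apache-arrow API, and the only real data used are the two index-0 values from the results above.

```typescript
// Toy model of per-id dictionary state in a stream reader.
// All names are hypothetical; this is NOT the apache-arrow implementation.
interface DictionaryMessage {
  id: number;        // dictionary id shared by the encoded column
  isDelta: boolean;  // true: append to the existing dictionary; false: replace it
  values: string[];  // dictionary entries carried by this message
}

class DictionaryState {
  private readonly byId = new Map<number, string[]>();

  apply(msg: DictionaryMessage): void {
    const existing = this.byId.get(msg.id);
    if (msg.isDelta && existing !== undefined) {
      // Delta dictionary: new entries are appended to the ones already seen.
      this.byId.set(msg.id, [...existing, ...msg.values]);
    } else {
      // Regular dictionary: the previous entries for this id are discarded.
      this.byId.set(msg.id, [...msg.values]);
    }
  }

  lookup(id: number, index: number): string | undefined {
    return this.byId.get(id)?.[index];
  }
}

const state = new DictionaryState();

// Dictionary sent ahead of the first record batch.
state.apply({ id: 0, isDelta: false, values: ["2013042345"] });
const first = state.lookup(0, 0);

// Non-delta dictionary sent ahead of the second record batch.
state.apply({ id: 0, isDelta: false, values: ["2012020145"] });
const second = state.lookup(0, 0);

console.log(first, second); // prints "2013042345 2012020145"
```

Under this model, index 0 resolves to a different value after the second dictionary arrives, which matches the expected result above rather than the observed one.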
Additionally, I'd like to understand the recommended way to read multiple dictionary encoded record batches from an HTTP stream. I imagine that we can use the reader.next() iterator pattern to keep reading record batches from the stream, but I'd like to confirm my understanding.
Component(s)
JavaScript
To add to this, we think the issue stems from the reader's internal dictionaries state not being reset after finishing a record batch.
felipecrv changed the title from "Dictionary encoded values repeating between record batches" to "[JS] Dictionary encoded values repeating between record batches" on May 17, 2024