[c++] arrow::int32 throws exc_bad_access #35616

Closed
hqx871 opened this issue May 16, 2023 · 5 comments

hqx871 commented May 16, 2023

Hi team! I am using 0.15.1 and found a problem when reading a Parquet file that contains an array column. The ASan output:

parquet-low-level-example(49396,0x7ff848622680) malloc: nano zone abandoned due to inability to preallocate reserved vm space.
row num:1000000
=================================================================
==49396==ERROR: AddressSanitizer: global-buffer-overflow on address 0x0001087d73f8 at pc 0x0001076ecb8d bp 0x7ff7b8b4d6b0 sp 0x7ff7b8b4d6a8
WRITE of size 8 at 0x0001087d73f8 thread T0
    #0 0x1076ecb8c in int arrow::util::RleDecoder::GetBatchWithDictSpaced<long long>(long long const*, long long*, int, int, unsigned char const*, long long) rle_encoding.h:488
    #1 0x1076e62c8 in parquet::DictDecoderImpl<parquet::PhysicalType<(parquet::Type::type)2> >::DecodeSpaced(long long*, int, int, unsigned char const*, long long) encoding.cc:1079
    #2 0x1075d9e6b in parquet::internal::TypedRecordReader<parquet::PhysicalType<(parquet::Type::type)2> >::ReadValuesSpaced(long long, long long) column_reader.cc:1052
    #3 0x1075dc1a9 in parquet::internal::TypedRecordReader<parquet::PhysicalType<(parquet::Type::type)2> >::ReadRecordData(long long) column_reader.cc:1096
    #4 0x1075d6a4c in parquet::internal::TypedRecordReader<parquet::PhysicalType<(parquet::Type::type)2> >::ReadRecords(long long) column_reader.cc:822
    #5 0x1073d1583 in parquet::arrow::LeafReader::NextBatch(long long, std::__1::shared_ptr<arrow::ChunkedArray>*) reader.cc:414
    #6 0x1073d55bd in parquet::arrow::NestedListReader::NextBatch(long long, std::__1::shared_ptr<arrow::ChunkedArray>*) reader.cc:469
    #7 0x1073f5a82 in parquet::arrow::RowGroupRecordBatchReader::ReadNext(std::__1::shared_ptr<arrow::RecordBatch>*) reader.cc:320
    #8 0x1073b409a in printParquetFile(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&) reader-writer.cc:97
    #9 0x1073b5209 in main reader-writer.cc:111
    #10 0x7ff8049b230f  (<unknown module>)

0x0001087d73f8 is located 40 bytes to the left of global variable 'guard variable for arrow::SparseTensor::dim_name(int) const::kEmpty' defined in 'arrow-apache-arrow-0.15.1/cpp/src/arrow/sparse_tensor.cc' (0x1087d7420) of size 8
0x0001087d73f8 is located 0 bytes to the right of global variable 'kEmpty' defined in 'arrow-apache-arrow-0.15.1/cpp/src/arrow/sparse_tensor.cc:415:28' (0x1087d73e0) of size 24
SUMMARY: AddressSanitizer: global-buffer-overflow rle_encoding.h:488 in int arrow::util::RleDecoder::GetBatchWithDictSpaced<long long>(long long const*, long long*, int, int, unsigned char const*, long long)
Shadow bytes around the buggy address:
  0x1000210fae20: 00 00 00 00 00 f9 f9 f9 f9 f9 f9 f9 00 f9 f9 f9
  0x1000210fae30: 00 00 00 00 00 00 00 f9 f9 f9 f9 f9 00 f9 f9 f9
  0x1000210fae40: 00 00 f9 f9 00 f9 f9 f9 01 f9 f9 f9 01 f9 f9 f9
  0x1000210fae50: 01 f9 f9 f9 01 f9 f9 f9 00 00 f9 f9 00 f9 f9 f9
  0x1000210fae60: 01 f9 f9 f9 00 00 f9 f9 00 f9 f9 f9 00 00 00 00
=>0x1000210fae70: 00 00 00 f9 f9 f9 f9 f9 00 00 00 00 00 00 00[f9]
  0x1000210fae80: f9 f9 f9 f9 00 f9 f9 f9 00 00 00 f9 f9 f9 f9 f9
  0x1000210fae90: 00 f9 f9 f9 00 00 f9 f9 00 f9 f9 f9 00 00 f9 f9
  0x1000210faea0: 00 f9 f9 f9 00 00 f9 f9 00 f9 f9 f9 00 00 f9 f9
  0x1000210faeb0: 00 f9 f9 f9 00 00 f9 f9 00 f9 f9 f9 00 00 f9 f9
  0x1000210faec0: 00 f9 f9 f9 00 00 f9 f9 00 f9 f9 f9 00 00 f9 f9
Shadow byte legend (one shadow byte represents 8 application bytes):
  Addressable:           00
  Partially addressable: 01 02 03 04 05 06 07 
  Heap left redzone:       fa
  Freed heap region:       fd
  Stack left redzone:      f1
  Stack mid redzone:       f2
  Stack right redzone:     f3
  Stack after return:      f5
  Stack use after scope:   f8
  Global redzone:          f9
  Global init order:       f6
  Poisoned by user:        f7
  Container overflow:      fc
  Array cookie:            ac
  Intra object redzone:    bb
  ASan internal:           fe
  Left alloca redzone:     ca
  Right alloca redzone:    cb
==49396==ABORTING

Process finished with exit code 134 (interrupted by signal 6: SIGABRT)

Component(s)

C++

hqx871 commented May 16, 2023

I know this version is quite old. Has anyone solved this?

hqx871 commented May 16, 2023

The code is very simple.

// Headers needed to make this snippet self-contained:
#include <arrow/api.h>
#include <parquet/arrow/reader.h>
#include <parquet/file_reader.h>

#include <iostream>
#include <memory>
#include <stdexcept>
#include <string>
#include <vector>

void printParquetFile(const std::string &path) {
  arrow::Status st;
  // Open Parquet file reader
  std::unique_ptr<parquet::arrow::FileReader> arrow_reader;
  auto file_reader = parquet::ParquetFileReader::OpenFile(path, true);
  st = parquet::arrow::FileReader::Make(
      arrow::default_memory_pool(),
      std::move(file_reader), &arrow_reader);
  if (!st.ok()) {
    throw std::runtime_error(st.ToString());
  }

  auto meta = arrow_reader->parquet_reader()->metadata();
  std::cout << path << " row num:" << meta->num_rows() << std::endl;

  //auto totalGroupNum = meta->num_row_groups();
  //std::map<std::string, int32_t> columnMap;
  auto schema = meta->schema();
  std::vector<int> readColumnIds;
  for (int i = 0; i < meta->num_columns(); ++i) {
    auto column = schema->Column(i);
    std::cout << "col:" << std::to_string(i)
              << ", path:" << column->path()->ToDotString()
              << ", name:" << column->path()->ToDotVector()[0]
              << ", max definition level:" << column->max_definition_level()
              << std::endl;
    readColumnIds.push_back(i);
  }

  for (int group = 0; group < meta->num_row_groups(); ++group) {
    auto rowGroup = meta->RowGroup(group);
    auto groupRowNum = rowGroup->num_rows();
    std::shared_ptr<arrow::RecordBatchReader> batchReader;
    // note: only column index 18 is requested here, not the readColumnIds
    // collected above
    st = arrow_reader->GetRecordBatchReader({group}, {18},
                                            &batchReader);
    if (!st.ok()) {
      // Handle error creating the record batch reader...
      throw std::runtime_error(st.ToString());
    }
    int groupReadLines = 0;
    while (groupReadLines < groupRowNum) {
      std::shared_ptr<arrow::RecordBatch> rowBatch;
      //st = batchReader->ReadNext(&rowBatch);
      try{
        st = batchReader->ReadNext(&rowBatch);
      } catch (const std::exception& ex) {
        throw std::runtime_error(ex.what());
      }
      if (!st.ok()) {
        // Handle error reading the next batch...
        throw std::runtime_error(st.ToString());
      }
      groupReadLines += rowBatch->num_rows();
    }
  }
}
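
For comparison, the same file can also be read through the whole-file convenience API instead of per-row-group record batches. This is only a sketch (it reads every column via parquet::arrow::FileReader::ReadTable, which is also available in the 0.15 line), but it exercises the same column readers:

// Sketch: read the whole file into a single arrow::Table instead of
// iterating record batches per row group.
void readWholeFile(const std::string &path) {
  std::unique_ptr<parquet::arrow::FileReader> arrow_reader;
  auto file_reader = parquet::ParquetFileReader::OpenFile(path, /*memory_map=*/true);
  arrow::Status st = parquet::arrow::FileReader::Make(
      arrow::default_memory_pool(), std::move(file_reader), &arrow_reader);
  if (!st.ok()) {
    throw std::runtime_error(st.ToString());
  }
  std::shared_ptr<arrow::Table> table;
  st = arrow_reader->ReadTable(&table);
  if (!st.ok()) {
    throw std::runtime_error(st.ToString());
  }
  std::cout << path << " rows: " << table->num_rows()
            << ", columns: " << table->num_columns() << std::endl;
}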

mapleFU (Member) commented May 16, 2023

Would you mind trying it on the latest Arrow release?

hqx871 commented May 16, 2023

Thanks for your reply. I have tested the latest version and it works fine, but I would like to fix it in the old version if I can find the root cause.
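
(A quick sanity check when comparing versions, assuming the arrow::GetBuildInfo() helper from arrow/config.h that newer releases provide, is to print the version the binary is actually linked against; a minimal sketch:)

#include <arrow/config.h>
#include <iostream>

int main() {
  // Print the Arrow version this binary is linked against, to rule out
  // accidentally still picking up the old 0.15.1 libraries.
  const arrow::BuildInfo& info = arrow::GetBuildInfo();
  std::cout << "linked Arrow version: " << info.version_string << std::endl;
  return 0;
}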

mapleFU (Member) commented May 16, 2023

I guess you could look through the releases after 0.15 and find out whether there are any relevant bug fixes...

kou closed this as not planned (won't fix, can't repro, duplicate, stale) on May 16, 2023