Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[C++] Possible data race when reading metadata of a parquet file #40068

Closed
mpimenov opened this issue Feb 13, 2024 · 12 comments · Fixed by #40111
Closed

[C++] Possible data race when reading metadata of a parquet file #40068

mpimenov opened this issue Feb 13, 2024 · 12 comments · Fixed by #40111

Comments

@mpimenov
Copy link

Describe the bug, including details regarding any error messages, version, and platform.

The first and the last line of this block of code access the same metadata variable but only one of them does so holding a lock.
I assume this means the other one should too.
There are some other places in this file that access metadata in tricky ways (e.g. it is not clear from a first glance at a method whether nullptr is allowed or not). They could also race.

if (parquet_fragment->metadata() != nullptr) {
ARROW_ASSIGN_OR_RAISE(row_groups, parquet_fragment->FilterRowGroups(options->filter));
pre_filtered = true;
if (row_groups.empty()) return MakeEmptyGenerator<std::shared_ptr<RecordBatch>>();
}
// Open the reader and pay the real IO cost.
auto make_generator =
[this, options, parquet_fragment, pre_filtered,
row_groups](const std::shared_ptr<parquet::arrow::FileReader>& reader) mutable
-> Result<RecordBatchGenerator> {
// Ensure that parquet_fragment has FileMetaData
RETURN_NOT_OK(parquet_fragment->EnsureCompleteMetadata(reader.get()));

Component(s)

C++, Parquet

@pitrou
Copy link
Member

pitrou commented Feb 14, 2024

The first and the last line of this block of code access the same metadata variable but only one of them does so holding a lock.

  1. Which metadata variable?
  2. Which lock are you talking about?

@mpimenov
Copy link
Author

  1. std::shared_ptr<parquet::FileMetaData> metadata_;
  2. auto lock = physical_schema_mutex_.Lock();

@pitrou
Copy link
Member

pitrou commented Feb 14, 2024

Well, I'm not sure why it would be a problem in the snippet above. metadata() in line 607 returns before EnsureCompleteMetadata() in line 618 executes.

@mpimenov
Copy link
Author

In this call, sure. But by the same reasoning the locking is not needed at all.
I assume (and I've observed it in TSAN reports) that this method can be called from multiple threads, with the intention of caching metadata by the first thread that arrives to read it.

@pitrou
Copy link
Member

pitrou commented Feb 14, 2024

Hmm, then please post a corresponding TSAN report, this would help understand the issue.

@mpimenov
Copy link
Author

Arrow is built from commit 3c655df
tsan-40068.txt

@pitrou
Copy link
Member

pitrou commented Feb 14, 2024

Thanks! So, trying to sum it up:

  • main thread has an Acero source node whose StartProducing calls FragmentsToBatches calls ParquetFileFormat::ScanBatchesAsync
    #2 arrow::dataset::ParquetFileFormat::ScanBatchesAsync(std::shared_ptr<arrow::dataset::ScanOptions> const&, std::shared_ptr<arrow::dataset::FileFragment> const&) const $ARROW_ROOT/cpp/src/arrow/dataset/file_parquet.cc:607:36 (arrow_reproduce+0x126e20b)
    #3 arrow::dataset::FileFragment::ScanBatchesAsync(std::shared_ptr<arrow::dataset::ScanOptions> const&) $ARROW_ROOT/cpp/src/arrow/dataset/file_base.cc:188:19 (arrow_reproduce+0x1156715)
    #4 arrow::dataset::(anonymous namespace)::FragmentToBatches(arrow::Enumerated<std::shared_ptr<arrow::dataset::Fragment>> const&, std::shared_ptr<arrow::dataset::ScanOptions> const&) $ARROW_ROOT/cpp/src/arrow/dataset/scanner.cc:305:3 (arrow_reproduce+0x11c4fa1)
    #5 arrow::dataset::(anonymous namespace)::FragmentsToBatches(std::function<arrow::Future<std::shared_ptr<arrow::dataset::Fragment>> ()>, std::shared_ptr<arrow::dataset::ScanOptions> const&)::$_0::operator()(arrow::Enumerated<std::shared_ptr<arrow::dataset::Fragment>> const&) const $ARROW_ROOT/cpp/src/arrow/dataset/scanner.cc:333:36 (arrow_reproduce+0x11c4fa1)
  • Arrow worker thread executes a Future callback from ParquetFileReader::OpenAsync's own callback, that in turn calls ParquetFileFormat::ScanBatchesAsync
    #4 arrow::dataset::ParquetFileFragment::SetMetadata(std::shared_ptr<parquet::FileMetaData>, std::shared_ptr<parquet::arrow::SchemaManifest>) $ARROW_ROOT/cpp/src/arrow/dataset/file_parquet.cc:819:13 (arrow_reproduce+0x12764c2)
    #5 arrow::dataset::ParquetFileFragment::EnsureCompleteMetadata(parquet::arrow::FileReader*) $ARROW_ROOT/cpp/src/arrow/dataset/file_parquet.cc:811:10 (arrow_reproduce+0x1275b25)
    #6 arrow::dataset::ParquetFileFormat::ScanBatchesAsync(std::shared_ptr<arrow::dataset::ScanOptions> const&, std::shared_ptr<arrow::dataset::FileFragment> const&) const::$_0::operator()(std::shared_ptr<parquet::arrow::FileReader> const&) $ARROW_ROOT/cpp/src/arrow/dataset/file_parquet.cc:618:5 (arrow_reproduce+0x1283f68)

What seems weird is that two ScanBatchesAsync calls would happen on the same fragment.

@pitrou
Copy link
Member

pitrou commented Feb 14, 2024

ScanBatchesAsync is only called from FragmentToBatches, which in turn can only be called from a "scan" node.

One possible explanation would be if Dataset::GetFragments returns the same fragment instance twice. How did you create your dataset?

cc @westonpace

@pitrou
Copy link
Member

pitrou commented Feb 14, 2024

In any case, it would probably be good for ParquetFileFragment::metadata() to take the lock before reading its result value.

@mpimenov
Copy link
Author

Dataset creation itself looks quite vanilla to me

  std::vector<std::string> paths = FillPaths();
  arrow::dataset::FileSystemFactoryOptions options;
  options.partitioning = arrow::dataset::HivePartitioning::MakeFactory();
  options.partition_base_dir = root_path;
  options.exclude_invalid_files = true;
  return arrow::dataset::FileSystemDatasetFactory::Make(
             std::make_shared<arrow::fs::LocalFileSystem>(),
             paths,
             std::make_shared<arrow::dataset::ParquetFileFormat>(),
             options
  )
      .ValueOrDie()
      ->Finish()
      .ValueOrDie();

However, when reading it I do this (for filtering columns and such)

  arrow::dataset::internal::Initialize();
  std::vector<std::shared_ptr<arrow::Field>> fields;
  for (const auto& i : dataset->schema()->fields()) {
    if (options.fields_.empty() || contains(options.fields_, i->name())) {
      fields.push_back(i);
    }
  }
  dataset = dataset->ReplaceSchema(std::make_shared<arrow::Schema>(fields, dataset->schema()->metadata())).ValueOrDie();

I don't know if this is idiomatic enough

@pitrou
Copy link
Member

pitrou commented Feb 14, 2024

That looks reasonable to me.

@westonpace
Copy link
Member

I assume (and I've observed it in TSAN reports) that this method can be called from multiple threads, with the intention of caching metadata by the first thread that arrives to read it.

Correct, this is a sort of lazy cache. Parquet datasets will cache parquet metadata the first time it is loaded. So to trigger this you would need to:

  • Create a dataset
  • Simultaneously scan the dataset multiple times

So it is possible that we have one thread testing if a shared_ptr is null at the same time another thread is assigning to it. Assigning to a shared_ptr is maybe one of those things that is probably atomic but not necessarily documented to be atomic.

Either way, suppressing TSAN reports is a valid enough reason to fix this. The ParquetFileFragment::metadata method should grab the lock (which means it is no longer a const method).

@pitrou pitrou added this to the 15.0.1 milestone Feb 26, 2024
pitrou pushed a commit that referenced this issue Feb 26, 2024
… file (#40111)

### Rationale for this change

The `ParquetFileFragment` will cache the parquet metadata when loading it.  The `metadata()` method accesses this metadata (a shared_ptr) but does not grab the lock used to set that shared_ptr.  It's possible then that we are reading a shared_ptr at the same time some other thread is setting the shared_ptr which is technically (I think) undefined behavior.

### What changes are included in this PR?

Guard access to the metadata by grabbing the mutex first

### Are these changes tested?

Existing tests should regress this change

### Are there any user-facing changes?

No
* Closes: #40068

Authored-by: Weston Pace <weston.pace@gmail.com>
Signed-off-by: Antoine Pitrou <antoine@python.org>
@pitrou pitrou modified the milestones: 15.0.1, 16.0.0 Feb 26, 2024
zanmato1984 pushed a commit to zanmato1984/arrow that referenced this issue Feb 28, 2024
…arquet file (apache#40111)

### Rationale for this change

The `ParquetFileFragment` will cache the parquet metadata when loading it.  The `metadata()` method accesses this metadata (a shared_ptr) but does not grab the lock used to set that shared_ptr.  It's possible then that we are reading a shared_ptr at the same time some other thread is setting the shared_ptr which is technically (I think) undefined behavior.

### What changes are included in this PR?

Guard access to the metadata by grabbing the mutex first

### Are these changes tested?

Existing tests should regress this change

### Are there any user-facing changes?

No
* Closes: apache#40068

Authored-by: Weston Pace <weston.pace@gmail.com>
Signed-off-by: Antoine Pitrou <antoine@python.org>
thisisnic pushed a commit to thisisnic/arrow that referenced this issue Mar 8, 2024
…arquet file (apache#40111)

### Rationale for this change

The `ParquetFileFragment` will cache the parquet metadata when loading it.  The `metadata()` method accesses this metadata (a shared_ptr) but does not grab the lock used to set that shared_ptr.  It's possible then that we are reading a shared_ptr at the same time some other thread is setting the shared_ptr which is technically (I think) undefined behavior.

### What changes are included in this PR?

Guard access to the metadata by grabbing the mutex first

### Are these changes tested?

Existing tests should regress this change

### Are there any user-facing changes?

No
* Closes: apache#40068

Authored-by: Weston Pace <weston.pace@gmail.com>
Signed-off-by: Antoine Pitrou <antoine@python.org>
@raulcd raulcd modified the milestones: 15.0.1, 15.0.2 Mar 13, 2024
raulcd pushed a commit that referenced this issue Mar 13, 2024
… file (#40111)

### Rationale for this change

The `ParquetFileFragment` will cache the parquet metadata when loading it.  The `metadata()` method accesses this metadata (a shared_ptr) but does not grab the lock used to set that shared_ptr.  It's possible then that we are reading a shared_ptr at the same time some other thread is setting the shared_ptr which is technically (I think) undefined behavior.

### What changes are included in this PR?

Guard access to the metadata by grabbing the mutex first

### Are these changes tested?

Existing tests should regress this change

### Are there any user-facing changes?

No
* Closes: #40068

Authored-by: Weston Pace <weston.pace@gmail.com>
Signed-off-by: Antoine Pitrou <antoine@python.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants