Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-41460: [C++] Use ASAN to poison temp vector stack memory #41695

Merged
merged 8 commits into from
May 18, 2024

Conversation

zanmato1984
Copy link
Collaborator

@zanmato1984 zanmato1984 commented May 16, 2024

Rationale for this change

See #41460. And reduce the overhead of current manual poisoning (filling the entire stack space with 0xFFs) that happens even in release mode.

What changes are included in this PR?

Use ASAN API to replace the current manual poisoning of the temp vector stack memory.

Are these changes tested?

Wanted to add cases to assert that ASAN poison/unpoison is functioning. However I found it too tricky to catch an ASAN error because ASAN directly uses signals that are hard to interoperate in C++/gtest. So I just manually checked poisoning is working in my local, e.g. by intentionally not unpoisoning the allocated buffer and seeing ASAN unhappy.

Just leveraging existing cases that use temp stack such as acero tests, which should cover this change well.

Are there any user-facing changes?

None.

Copy link

⚠️ GitHub issue #41460 has been automatically assigned in GitHub to PR creator.

@zanmato1984 zanmato1984 marked this pull request as ready for review May 16, 2024 16:39
@zanmato1984
Copy link
Collaborator Author

Hi @felipecrv , this is a followup of our discussion in #41335 (comment). Think you might want to take a look. Thanks.

@@ -20,16 +20,29 @@
#include "arrow/compute/util.h"
#include "arrow/memory_pool.h"

#ifdef ADDRESS_SANITIZER
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some platforms don't have ASAN, so wrap the ASAN related code with a flag that is only defined with ASAN.

@github-actions github-actions bot added awaiting committer review Awaiting committer review and removed awaiting review Awaiting review labels May 16, 2024
Comment on lines 41 to 48
TempVectorStack() = default;
~TempVectorStack();

TempVectorStack(const TempVectorStack&) = delete;
TempVectorStack& operator=(const TempVectorStack&) = delete;

TempVectorStack(TempVectorStack&&) = default;
TempVectorStack& operator=(TempVectorStack&&) = default;
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Boilerplate code to cope with the explicit destructor, otherwise the compiler will complain that implicit copy constructor is deleted due to a member of unique_ptr.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are two handy macros for you here ->

#ifndef ARROW_DISALLOW_COPY_AND_ASSIGN
#define ARROW_DISALLOW_COPY_AND_ASSIGN(TypeName) \
TypeName(const TypeName&) = delete; \
void operator=(const TypeName&) = delete
#endif
#ifndef ARROW_DEFAULT_MOVE_AND_ASSIGN
#define ARROW_DEFAULT_MOVE_AND_ASSIGN(TypeName) \
TypeName(TypeName&&) = default; \
TypeName& operator=(TypeName&&) = default
#endif

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wow, thanks for tips! Will update.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed.

Comment on lines +32 to +34
if (buffer_) {
ASAN_UNPOISON_MEMORY_REGION(buffer_->mutable_data(), buffer_size_);
}
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Must unpoison the buffer before releasing.

Comment on lines 41 to 48
TempVectorStack() = default;
~TempVectorStack();

TempVectorStack(const TempVectorStack&) = delete;
TempVectorStack& operator=(const TempVectorStack&) = delete;

TempVectorStack(TempVectorStack&&) = default;
TempVectorStack& operator=(TempVectorStack&&) = default;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are two handy macros for you here ->

#ifndef ARROW_DISALLOW_COPY_AND_ASSIGN
#define ARROW_DISALLOW_COPY_AND_ASSIGN(TypeName) \
TypeName(const TypeName&) = delete; \
void operator=(const TypeName&) = delete
#endif
#ifndef ARROW_DEFAULT_MOVE_AND_ASSIGN
#define ARROW_DEFAULT_MOVE_AND_ASSIGN(TypeName) \
TypeName(TypeName&&) = default; \
TypeName& operator=(TypeName&&) = default
#endif

static int64_t PaddedAllocationSize(int64_t num_bytes);

void alloc(uint32_t num_bytes, uint8_t** data, int* id);
void release(int id, uint32_t num_bytes);
static constexpr uint64_t kGuard1 = 0x3141592653589793ULL;
static constexpr uint64_t kGuard2 = 0x0577215664901532ULL;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you should keep these, write them after the poisoning. Poisoning can only help us with invalid reads, but not with undesirable writes.

They were added to help detecting bugs like #11335

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @felipecrv , I'll add them back.

But I'm also putting down some notes here about what I've been thinking, perhaps just as a memo to myself.

I'm removing these guards because ASAN can prevent both invalid reads and writes to poisoned memory. I don't think I can replicate the exact issue that #11335 was trying to fix, but I simply reverted #11335 and ran hash join ut, seeing ASAN successfully caught write after poison.

However there could be more tricky cases like below:

  1. func1 allocates a temp vector v1 and reads/writes to v1;
  2. func1 calls func2 (note: w/o releasing v1 - this is the essential of a "stack" and the tricky part);
  3. func2 allocates another temp vector v2 and reads/writes to v2;
  4. func2 releases v2 and returns to func1;
  5. func1 reads/writes to v1 again;
  6. func1 releases v1 and returns to its caller.

A) Suppose writes to v1 at offset >= v1.size() happen in 1, then ASAN can catch it because that piece of memory is already poisoned.
B) Suppose writes to v1 at offset < 0 happen in 1, then whether ASAN can catch it depends on the validity of the memory address lower than v1, but the guards have a better chance.
C) Suppose writes to v2 at offset >= v2.size() happen in 3, same as A.
D) Suppose writes to v2 at offset < 0 happen in 3, then ASAN has zero chance to catch it as long as the offset is within v1.size(), but the guards have a better chance.
E) Is it possible that func1 writes to v1 at offset >= v1.size() while v2 is still alive? So it is corrupting v2's content without ASAN noticing because v2's memory is unpoisoned throughout v2's lifespan. The answer is NO, the control flow will be in func2 as long as v2 is alive so func1 won't be able to execute any code.

To summarize, ASAN poisoning is not able to 100% cover what the guards can do. The guards have a better chance to catch invalid writes at negative offsets (B and D). But other than that, ASAN is equally good as the guards.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed.

@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting committer review Awaiting committer review labels May 17, 2024
@github-actions github-actions bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels May 17, 2024
@github-actions github-actions bot added awaiting merge Awaiting merge and removed awaiting change review Awaiting change review labels May 18, 2024
@felipecrv felipecrv merged commit dcdf4e6 into apache:main May 18, 2024
39 of 40 checks passed
@felipecrv felipecrv removed the awaiting merge Awaiting merge label May 18, 2024
@felipecrv
Copy link
Contributor

Thank you for doing this! ASAN is so cool.

Copy link

After merging your PR, Conbench analyzed the 7 benchmarking runs that have been run so far on merge-commit dcdf4e6.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 9 possible false positives for unstable benchmarks that are known to sometimes produce them.

vibhatha pushed a commit to vibhatha/arrow that referenced this pull request May 25, 2024
…ache#41695)

### Rationale for this change

See apache#41460. And reduce the overhead of current manual poisoning (filling the entire stack space with `0xFF`s) that happens even in release mode.

### What changes are included in this PR?

Use ASAN API to replace the current manual poisoning of the temp vector stack memory.

### Are these changes tested?

Wanted to add cases to assert that ASAN poison/unpoison is functioning. However I found it too tricky to catch an ASAN error because ASAN directly uses signals that are hard to interoperate in C++/gtest. So I just manually checked poisoning is working in my local, e.g. by intentionally not unpoisoning the allocated buffer and seeing ASAN unhappy.

Just leveraging existing cases that use temp stack such as acero tests, which should cover this change well.

### Are there any user-facing changes?

None.

* GitHub Issue: apache#41460

Authored-by: Ruoxi Sun <zanmato1984@gmail.com>
Signed-off-by: Felipe Oliveira Carvalho <felipekde@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants