Describe the issue
Nightly builds are failing due to memcheck errors in specific gtests. The error appears to be a compute-sanitizer tool issue, which has been filed as nvbug 4553815.
This issue documents the problem while possible workarounds are explored until the bug is fixed.
The 2 errors appear as follows:
[ RUN ] NumericValueIteratorTest/1.non_null_iterator
========= Invalid __shared__ read of size 16 bytes
========= at 0x9670 in void cub::CUB_200200_700_750_800_860_900_NS::DeviceReduceSingleTileKernel<cub::CUB_200200_700_750_800_860_900_NS::DeviceReducePolicy<short, unsigned int, thrust::minimum<void>>::Policy600, short *, short *, unsigned int, thrust::minimum<void>, short, short>(T2, T3, T4, T5, T6)
========= by thread (0,0,0) in block (0,0,0)
========= Address 0x8 is misaligned
========= Saved host backtrace up to driver entry point at kernel launch time
========= Host Frame: [0x331d50]
========= in /usr/lib/x86_64-linux-gnu/libcuda.so.1
========= Host Frame: [0x14fb4]
========= in /usr/local/cuda/targets/x86_64-linux/lib/libcudart.so.12
========= Host Frame:cudaLaunchKernel [0x70aae]
========= in /usr/local/cuda/targets/x86_64-linux/lib/libcudart.so.12
========= Host Frame:cudaError cub::CUB_200200_700_750_800_860_900_NS::DeviceReduce::Reduce<short*, short*, thrust::minimum<void>, short, int>(void*, unsigned long&, short*, short*, int, thrust::minimum<void>, short, CUstream_st*) [clone .isra.0] [0x2fa199]
========= in /opt/conda/envs/test/bin/gtests/libcudf/ITERATOR_TEST
[ RUN ] MinMaxReductionTest/0.MinMaxTypes
========= Invalid __shared__ read of size 16 bytes
========= at 0x4310 in void cub::CUB_200200_700_750_800_860_900_NS::DeviceReduceSingleTileKernel<cub::CUB_200200_700_750_800_860_900_NS::DeviceReducePolicy<short, unsigned int, cudf::detail::cast_functor_fn<short, cudf::DeviceMin>>::Policy600, thrust::transform_iterator<thrust::identity<short>, thrust::transform_iterator<cudf::detail::value_accessor<short>, thrust::counting_iterator<int, thrust::use_default, thrust::use_default, thrust::use_default>, thrust::use_default, thrust::use_default>, thrust::use_default, thrust::use_default>, short *, unsigned int, cudf::detail::cast_functor_fn<short, cudf::DeviceMin>, short, short>(T2, T3, T4, T5, T6)
========= by thread (0,0,0) in block (0,0,0)
========= Address 0x8 is misaligned
========= Saved host backtrace up to driver entry point at kernel launch time
========= Host Frame: [0x331d50]
========= in /usr/lib/x86_64-linux-gnu/libcuda.so.1
========= Host Frame: [0x14fb4]
========= in /usr/local/cuda/targets/x86_64-linux/lib/libcudart.so.12
========= Host Frame:cudaLaunchKernel [0x70aae]
========= in /usr/local/cuda/targets/x86_64-linux/lib/libcudart.so.12
========= Host Frame:cudaError cub::CUB_200200_700_750_800_860_900_NS::DeviceReduce::Reduce<thrust::transform_iterator<thrust::identity<short>, thrust::transform_iterator<cudf::detail::value_accessor<short>, thrust::counting_iterator<int, thrust::use_default, thrust::use_default, thrust::use_default>, thrust::use_default, thrust::use_default>, thrust::use_default, thrust::use_default>, short*, cudf::detail::cast_functor_fn<short, cudf::DeviceMin>, short, int>(void*, unsigned long&, thrust::transform_iterator<thrust::identity<short>, thrust::transform_iterator<cudf::detail::value_accessor<short>, thrust::counting_iterator<int, thrust::use_default, thrust::use_default, thrust::use_default>, thrust::use_default, thrust::use_default>, thrust::use_default, thrust::use_default>, short*, int, cudf::detail::cast_functor_fn<short, cudf::DeviceMin>, short, CUstream_st*) [clone .isra.0] [0x18950ae]
========= in /opt/conda/envs/test/bin/gtests/libcudf/../../../lib/libcudf.so
========= Host Frame:cudf::reduction::simple::detail::simple_reduction<short, short, cudf::reduction::detail::op::min>(cudf::column_view const&, std::optional<std::reference_wrapper<cudf::scalar const> >, rmm::cuda_stream_view, rmm::mr::device_memory_resource*)::{lambda()#2}::operator()() const [0x18984c3]
========= in /opt/conda/envs/test/bin/gtests/libcudf/../../../lib/libcudf.so
========= Host Frame:std::unique_ptr<cudf::scalar, std::default_delete<cudf::scalar> > cudf::reduction::simple::detail::simple_reduction<short, short, cudf::reduction::detail::op::min>(cudf::column_view const&, std::optional<std::reference_wrapper<cudf::scalar const> >, rmm::cuda_stream_view, rmm::mr::device_memory_resource*) [0x1898a70]
========= in /opt/conda/envs/test/bin/gtests/libcudf/../../../lib/libcudf.so
========= Host Frame:cudf::reduction::detail::min(cudf::column_view const&, cudf::data_type, std::optional<std::reference_wrapper<cudf::scalar const> >, rmm::cuda_stream_view, rmm::mr::device_memory_resource*) [0x187ea46]
========= in /opt/conda/envs/test/bin/gtests/libcudf/../../../lib/libcudf.so
========= Host Frame:decltype(auto) cudf::detail::aggregation_dispatcher<cudf::reduction::detail::reduce_dispatch_functor, cudf::reduce_aggregation const&>(cudf::aggregation::Kind, cudf::reduction::detail::reduce_dispatch_functor&&, cudf::reduce_aggregation const&) [0x193431e]
========= in /opt/conda/envs/test/bin/gtests/libcudf/../../../lib/libcudf.so
========= Host Frame:cudf::reduction::detail::reduce(cudf::column_view const&, cudf::reduce_aggregation const&, cudf::data_type, std::optional<std::reference_wrapper<cudf::scalar const> >, rmm::cuda_stream_view, rmm::mr::device_memory_resource*) [0x1934d71]
========= in /opt/conda/envs/test/bin/gtests/libcudf/../../../lib/libcudf.so
========= Host Frame:cudf::reduce(cudf::column_view const&, cudf::reduce_aggregation const&, cudf::data_type, rmm::mr::device_memory_resource*) [0x193583f]
========= in /opt/conda/envs/test/bin/gtests/libcudf/../../../lib/libcudf.so
========= Host Frame:std::pair<short, bool> ReductionTest<short>::reduction_test<short>(cudf::column_view const&, cudf::reduce_aggregation const&, std::optional<cudf::data_type>) [clone .constprop.0] [0x28ec47]
========= in /opt/conda/envs/test/bin/gtests/libcudf/./REDUCTIONS_TEST
If these were real errors, they should also appear when running without compute-sanitizer.
The nvbug report includes a small reproducer that shows the error without any libcudf-specific code.
Provides a workaround for the compute-sanitizer issue described in #15258 causing memcheck failures in nightly builds.
An environment variable, `LIBCUDF_MEMCHECK_ENABLED`, is introduced so that test code can bypass the specific tests that trigger the compute-sanitizer error. The env var is set only during memcheck runs since the failure does not occur in normal testing.
The failure only occurs for some `int16` or `uint16` reduction tests so managing these few tests is reasonable.
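A minimal sketch of how test code might check the variable is shown here. The helper name `memcheck_enabled` and the rule that an empty value or `0` counts as disabled are assumptions for illustration, not taken from the libcudf change itself; only the variable name `LIBCUDF_MEMCHECK_ENABLED` comes from the description above.

```cpp
#include <cstdlib>
#include <string>

// Hypothetical helper (not the libcudf implementation): returns true when
// LIBCUDF_MEMCHECK_ENABLED is set to a non-empty value other than "0".
// Treating "" and "0" as disabled is an assumption made for this sketch.
inline bool memcheck_enabled()
{
  char const* value = std::getenv("LIBCUDF_MEMCHECK_ENABLED");
  if (value == nullptr) { return false; }
  std::string const v{value};
  return !v.empty() && v != "0";
}

// Intended use inside a GoogleTest body, shown as a comment so that this
// block compiles without GoogleTest (the test name is hypothetical):
//
//   TEST_F(SomeReductionTest, Int16Min)
//   {
//     if (memcheck_enabled()) { GTEST_SKIP() << "skipped under compute-sanitizer"; }
//     // remainder of the test
//   }
```

Keeping the check in one helper means the affected `int16` and `uint16` tests each need only a single guard line, which is easy to remove once the compute-sanitizer fix is available.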
Other possible workarounds include:
1. Reverting the compute-sanitizer to 11.8.
   Using the latest version is more desirable since the fix will likely not be backported.
2. Adding an exclude filter for the CUB Reduce kernel.
   This disables checking for almost all reduction kernels.
Authors:
- David Wendt (https://github.com/davidwendt)
- Bradley Dice (https://github.com/bdice)
Approvers:
- Bradley Dice (https://github.com/bdice)
- Vyas Ramasubramani (https://github.com/vyasr)
- Jake Awe (https://github.com/AyodeAwe)
URL: #15259
Steps/Code to reproduce
Note the failure only occurs on int16 (short) integer types when doing a min-reduction through CUB.
Additional context
The error occurs as follows on various compute-sanitizer versions: in general, it fails only with 12.0 and above.