Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-33723: [C++] re2::RE2::RE2() result must be checked #33806

Merged
merged 4 commits into from
Jan 22, 2023

Conversation

js8544
Copy link
Collaborator

@js8544 js8544 commented Jan 20, 2023

Rationale for this change

RE2 construction needs to be checked.

What changes are included in this PR?

Check all RE2 object status.

Are these changes tested?

Tested with existing tests.

Are there any user-facing changes?

No.

@github-actions
Copy link

@js8544
Copy link
Collaborator Author

js8544 commented Jan 20, 2023

cc @kou

Copy link
Member

@kou kou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you also add tests for error pattern cases?

@@ -1505,6 +1508,13 @@ struct MatchLike {
static const RE2 kLikePatternIsStartsWith(R"(([^%_]*[^\\%_])?%+)", kRE2Options);
// A LIKE pattern matching this regex can be translated into a suffix search.
static const RE2 kLikePatternIsEndsWith(R"(%+([^%_]*))", kRE2Options);
static bool global_checked = false;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we use std::once_flag?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought about using std::call_once but the only solution I can come up with uses exceptions:

    static std::once_flag global_checked;
    try {
      std::call_once(global_checked, [] {
        auto check_global = [] {
          RETURN_NOT_OK(RegexStatus(kLikePatternIsSubstringMatch));
          RETURN_NOT_OK(RegexStatus(kLikePatternIsStartsWith));
          RETURN_NOT_OK(RegexStatus(kLikePatternIsEndsWith));
        };
        auto status = check_global();
        if (!status.ok()) { // throw an exception so that once_flag is not set
          throw std::invalid_argument(status.ToString());
        }
      });
    } catch (const std::invalid_argument& e) {
      return Status::Invalid(e.what());
    }

I couldn't find a more elegant solution. WDYT?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also feel like a static boolean is enough for our case. As it avoids false positives, i.e., erroneous regexes are always checked.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given that these are constants it seems pretty unlikely that they would fail. You could probably just DCHECK_OK or abort if they fail. That being said, I agree this is probably a fine solution as well. You could also do something like...

template <typename StringType>
struct LikePatterns {
  RE2::Options kRE2Options = ...
  RE2 kLikePatternIsSubstringMatch(...);
  RE2 kLikePatternIsStartsWith(...);
  RE2 kLikePatternIsEndsWith(...);
  static Result<LikePatterns> Make() {
    RETURN_NOT_OK(...)
  }
};

template <typename StringType>
struct MatchLike {
  static Status Exec(KernelContext* ctx, const ExecSpan& batch, ExecResult* out) {
    static const Result<LikePatterns> kLikePatterns = LikePatterns::Make();
    RETURN_NOT_OK(kLikePatterns);
    ...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I see. std::call_once isn't suitable here.

Comment on lines +1513 to +1515
RETURN_NOT_OK(RegexStatus(kLikePatternIsSubstringMatch));
RETURN_NOT_OK(RegexStatus(kLikePatternIsStartsWith));
RETURN_NOT_OK(RegexStatus(kLikePatternIsEndsWith));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we have an error in these patterns, does this code report an error on the second call? (It seems that this code reports an error only on the first call.)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it does though. global_checked is set to true only if all three patterns are ok.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. Thanks.

@js8544
Copy link
Collaborator Author

js8544 commented Jan 21, 2023

Could you also add tests for error pattern cases?

Added a TestBaseBinaryKernels: FindSubstringRegexWrongPattern test

Copy link
Member

@kou kou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

@@ -1505,6 +1508,13 @@ struct MatchLike {
static const RE2 kLikePatternIsStartsWith(R"(([^%_]*[^\\%_])?%+)", kRE2Options);
// A LIKE pattern matching this regex can be translated into a suffix search.
static const RE2 kLikePatternIsEndsWith(R"(%+([^%_]*))", kRE2Options);
static bool global_checked = false;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I see. std::call_once isn't suitable here.

Comment on lines +1513 to +1515
RETURN_NOT_OK(RegexStatus(kLikePatternIsSubstringMatch));
RETURN_NOT_OK(RegexStatus(kLikePatternIsStartsWith));
RETURN_NOT_OK(RegexStatus(kLikePatternIsEndsWith));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. Thanks.

Copy link
Member

@westonpace westonpace left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One minor include nit would be nice to fix.

@@ -19,6 +19,7 @@
#include <string>
#include <utility>
#include <vector>
#include "arrow/compute/exec.h"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you keep the includes grouped together?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

cpp/src/gandiva/regex_functions_holder.cc Show resolved Hide resolved
@kou kou merged commit cb0f017 into apache:master Jan 22, 2023
@ursabot
Copy link

ursabot commented Jan 22, 2023

Benchmark runs are scheduled for baseline = 63967ac and contender = cb0f017. cb0f017 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Failed] ec2-t3-xlarge-us-east-2
[Failed] test-mac-arm
[Failed] ursa-i9-9960x
[Failed] ursa-thinkcentre-m75q
Buildkite builds:
[Failed] cb0f0178 ec2-t3-xlarge-us-east-2
[Failed] cb0f0178 test-mac-arm
[Failed] cb0f0178 ursa-i9-9960x
[Failed] cb0f0178 ursa-thinkcentre-m75q
[Failed] 63967ac0 ec2-t3-xlarge-us-east-2
[Failed] 63967ac0 test-mac-arm
[Failed] 63967ac0 ursa-i9-9960x
[Failed] 63967ac0 ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

sjperkins pushed a commit to sjperkins/arrow that referenced this pull request Feb 10, 2023
…33806)

### Rationale for this change

RE2 construction needs to be checked.

### What changes are included in this PR?

Check all RE2 object status.

### Are these changes tested?

Tested with existing tests.

### Are there any user-facing changes?

No.
* Closes: apache#33723

Authored-by: Jin Shang <shangjin1997@gmail.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[C++] re2::RE2::RE2() result must be checked
4 participants