Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

refactor!: Move shared sql behavior to helper gems #529

Conversation

kaylareopelle
Copy link
Contributor

@kaylareopelle kaylareopelle commented Jun 22, 2023

Mysql2, trilogy, and pg instrumentation contained duplicated constants and methods. This refactor creates two new gems, opentelemetry-helpers-mysql and opentelemetry-helpers-sql-obfuscation, to hold the shared code.

In addition, it also adds the ability to obfuscate SQLite, Cassandra, and Oracle databases, though there isn't any instrumentation available to automatically incorporate obfuscation for those database types.

It also improves MySQL statement extraction and SQL obfuscation behavior to better support comments prepended to SQL statements.

Resolves #32

Outdated content below this line


Questions for maintainers:

  • Would you prefer the obfuscation and MySql refactors to be in their own commits/PRs? (I have a branch on standby with separate commits, so it's easy to pivot) => doesn't matter, going with one commit
  • What do you think about this pattern? These names? Is there a different design you'd like me to try? => new gems, named like opentelemetry-helpers-<type> in a directory at a peer level with instrumentation
  • The public methods for the refactored instrumentation remained the same. Only constants and private methods were moved into helper modules. The instrumentation calling the modules continues to exercise the code. Would you like me to write tests for the new modules regardless? => This changed when we moved from modules to gems. The gems have full test coverage for their methods.

Remaining Work:

  • Write the readmes & rdocs
    • mysql
    • obfuscation
  • Update CI actions to include the new gems
  • Write tests
    • mysql
    • obfuscation
  • Resolve bundler error

@ericmustin
Copy link
Contributor

Would you prefer the obfuscation and MySql refactors to be in their own commits/PRs? (I have a branch on standby with separate commits, so it's easy to pivot)

I don't really care whatever is easier is fine, so happy to leave as is.

What do you think about this pattern? These names? Is there a different design you'd like me to try?

Setting aside the question of where this stuff lives, (i think Base is fine practically speaking but if ppl have opinions i probably wont die on this hill), few questions i have.

  1. it looks like PG obfuscation hasn't been abstracted yet, is that just due to this being WIP(ie you're planning to do that or that you're waiting for a resolution on your first Q), or is there a reason it's exempted?

  2. we have a bit of loose coupling of configuration options and these helper methods. For example, we refer to config[:obfuscation_limit] within the helper methods and treat it as an integer, and within the specific instrumentation's we ensure to validate these config options (see example:

    option :obfuscation_limit, default: 2000, validate: :integer
    ), but because one portion of this logic lives in the instrumentation and the other in the base helper methods, it's possible for these to drift, or in the case of a new config option being added or a new db instrumentation being added which leverages these helpers, it's possible for the config option to be missed.

    I don't know if there's an elegant way to handle this really, or even whether we should try to abstract out these config options into a universal helper place (it's entirely possible that there would be valid reasons we may want to enforce different defaults for different db libs, for example), and frankly it feels like something we can solve by just being thoughtful maintainers and ensuring we keep the config options in sync when accepting prs/changes/etc, but just bringing it up, non-blocking imo but open to hearing what others think or if we can add some defensiveness here to help future us cc @arielvalentin @ahayworth .

  3. if we add some defensiveness here wrt my comment above, some tests might be nice for that, but beside that I don't really think it matters, practically speaking it would be hard to merge a change without exercising the full -contrib's instrumentation test suite so the failure's will still get picked up eventually even if we only make a change to base, so not too worried.

@ericmustin
Copy link
Contributor

oh also this looks great so far! awesome stuff, ty for taking this on!

@kaylareopelle
Copy link
Contributor Author

it looks like PG obfuscation hasn't been abstracted yet, is that just due to this being WIP(ie you're planning to do that or that you're waiting for a resolution on your first Q), or is there a reason it's exempted?

Thanks for catching this! This was just an oversight on my part. I was so focused on the identical code in mysql2/trilogy that I didn't realize the code for PG, though structured differently, behaves the same.

@kaylareopelle
Copy link
Contributor Author

As far as configs go, I think I'd prefer to keep them defined in the instrumentation library rather than a helper, even though we'll need to make sure they're there for the methods to work.

We could try to make a new class for the SQL instrumentation to inherit from and raise an error if the config, or a method that accesses a memoized config, isn't defined. If we go that approach, we might want to change the current refactor a bit.

@kaylareopelle
Copy link
Contributor Author

@ericmustin - This is ready for another look! I'll clean up the commits after feedback. I didn't make any changes to the configs, but I would be happy to. Maybe we can chat during the SIG tomorrow?

Copy link
Contributor

@robertlaurin robertlaurin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 to extracting this code out to a common place, we've already ran into one instance where parity between trilogy and mysql2 fell out of sync because a change occurred in one place but not the other.

I'm not sure base is the correct location for it, but I'm also not convinced it is the wrong place either.

My general concern is that whenever base bumps it forces all other deps to bump with it. It is toily for releasing, but also for bumping a single instrumentation package as a consumer of these gems.

A slightly obnoxious solution might be gems for each domain? Mysql helpers, HTTP helpers, something_else helpers.

@fbogsany
Copy link
Contributor

fbogsany commented Jun 27, 2023

My general concern is that whenever base bumps it forces all other deps to bump with it. It is toily for releasing, but also for bumping a single instrumentation package as a consumer of these gems.

A slightly obnoxious solution might be gems for each domain? Mysql helpers, HTTP helpers, something_else helpers.

Agreed. -base is the wrong place for these helpers. At a minimum, I'd like a -helpers gem (or -common, which is what we ended up with in the "core" otel-ruby). Factoring them out further is probably may be too much to ask, but maybe not ... -sql_helpers for now, maybe -http_helpers later.

multi_line_comments: %r{\/\*(?:[^\/]|\/[^*])*?(?:\*\/|\/\*.*)}
}.freeze

DIALECT_COMPONENTS = {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the more i look at this code, the more i continue to think this represents what really ought to be a Specification effort to standardize OpenTelemetry's treatment of "sql obfuscation" as a feature. Common dialect component keys ought to be part of otel semconv, and the implementation of how to accomplish that obfuscation could vary under the hood (regex might not be the most performant approach in some languages, etc), but it would be possible to do a matrix-like comparison of language-sdk's + collector's "obfuscation abilities" on sql(or other things, mongodb redis etc), and the varying concerns wrt PII related compliance things that different end users of different language sdks should need to worry about, and possibly, later on in their telemetry pipeline, apply some additional redaction or more fine grained ingestion or indexing approach to their data.

not blocking or anything, just was thinking about this and wanted to socialize it.

}.freeze

DIALECT_COMPONENTS = {
fallback: COMPONENTS_REGEX_MAP.keys,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the idea that this would be something a future instrumentation could refer to as a sort've "catchall" or "default" adapter symbol without needing to update the sql_obfuscation helper gem? I'm fine with that as an idea, but I would maybe want to call it all or default, and then perhaps make self.obfuscate_sql use that as it's default value unless specified. wdyt? Also fine with just dropping this if that's too convoluted.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that's the idea. I love the name change suggestion (fallback is the NR name) and using this as the default value.

Do you think I should add a default value for the config hash? Then someone could just pass SQL to the API, though their mileage may vary.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Default adapter issue fixed in: 8096551

Lmk about the config hash. I'm leaning toward no default there so an error is raised if the configs are missing.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should look to avoid raising an error outside of initialization, but i think it would be good to raise an error at init time if possible

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of raising an error, I decided to do a bit of refactoring so that the whole config hash wasn't required: fd2fa4b

This drops the need for :db_statement and provides a default value for :obfuscation_limit to make sure the .obfuscate_sql method can run with only a SQL statement.

Copy link
Contributor

@ericmustin ericmustin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the delay here @kaylareopelle . I had a few nits but largely speaking this looks really close.

The outstanding issues, unless i'm missing something:

  • bundler / version constraint TODOs, I don't totally grok the error you're seeing but afaik we shouldn't need to drop the opentelemetry-instrumentation-base version constraints on mysql2/trilogy/pg. can try to dig in more here if it ends up being the only thing blocking.
  • had a few suggestions to make sure we dont have a performance regression on the obfuscation hot path. It might be good to get another set of eyes here too.

@ericmustin ericmustin self-requested a review July 7, 2023 15:13
Copy link
Contributor

@ericmustin ericmustin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think everything has been updated here and my previous comments are addressed, with the exception of where to put these gems and the bundler errors. will add a point to the agenda to ensure both these are surfaced and discussed at tuesday SIG, thank u for quick turnaround @kaylareopelle apologies again on delay

assert_equal(db_operation_and_name, operation)
end
end
end
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When both database_name and operation are nil?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you! Added here: 3f0bcb4

'/*application:web,controller:blob,action:show,correlation_id:01EZVMR923313VV44ZJDJ7PMEZ,' \
'endpoint_id:Projects::BlobController#show*/ SELECT "routes".* FROM "routes" WHERE "routes"' \
'."source_id" = 75 AND "routes"."source_type" = \'Namespace\' LIMIT 1'
end
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I played around with this a bit. While the example above works as expected, the result is incorrect if one of the query names is in the Marginalia comment. E.g.:

irb(main):001:1*       QUERY_NAMES = [
irb(main):002:1*         'set names',
irb(main):003:1*         'select',
irb(main):004:1*         'insert',
irb(main):005:1*         'update',
irb(main):006:1*         'delete',
irb(main):007:1*         'begin',
irb(main):008:1*         'commit',
irb(main):009:1*         'rollback',
irb(main):010:1*         'savepoint',
irb(main):011:1*         'release savepoint',
irb(main):012:1*         'explain',
irb(main):013:1*         'drop database',
irb(main):014:1*         'drop table',
irb(main):015:1*         'create database',
irb(main):016:1*         'create table'
irb(main):017:0>       ].freeze
=> 
["set names",
...
irb(main):018:0>       QUERY_NAME_REGEX = Regexp.new("\\b(#{QUERY_NAMES.join('|')})\\b", Regexp::IGNORECASE)
=> /\b(set names|select|insert|update|delete|begin|commit|rollback|savepoint|release savepoint|explain|drop database|drop table|create da...
irb(main):019:0* sql = '/*application:web,controller:blob,action:show,correlation_id:01EZVMR923313VV44ZJDJ7PMEZ,' \
irb(main):020:0* 'endpoint_id:Projects::BlobController#show*/ SELECT "routes".* FROM "routes" WHERE "routes"' \
irb(main):021:0>           '."source_id" = 75 AND "routes"."source_type" = \'Namespace\' LIMIT 1'
=> "/*application:web,controller:blob,action:show,correlation_id:01EZVMR923313VV44ZJDJ7PMEZ,endpoint_id:Projects::BlobController#show*/ S...
irb(main):022:0> QUERY_NAME_REGEX.match(sql) { |match| match[1].downcase } 
=> "select"
irb(main):023:0* sql = '/*application:web,controller:insert,action:show,correlation_id:01EZVMR923313VV44ZJDJ7PMEZ,' \
irb(main):024:0* 'endpoint_id:Projects::BlobController#show*/ SELECT "routes".* FROM "routes" WHERE "routes"' \
irb(main):025:0>           '."source_id" = 75 AND "routes"."source_type" = \'Namespace\' LIMIT 1'
=> "/*application:web,controller:insert,action:show,correlation_id:01EZVMR923313VV44ZJDJ7PMEZ,endpoint_id:Projects::BlobController#show*/...
irb(main):026:0> QUERY_NAME_REGEX.match(sql) { |match| match[1].downcase } 
=> "insert"
irb(main):027:0> 

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great catch. I've updated the tests to include this case. It's fixed in: 82584ba

'create table'
].freeze

QUERY_NAME_REGEX = Regexp.new("\\b(#{QUERY_NAMES.join('|')})\\b", Regexp::IGNORECASE)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suspect we need to match and ignore (multi-line) comments at the start of the query. I.e. these components:

        comments: /(?:#|--).*?(?=\r|\n|$)/i,
        multi_line_comments: %r{\/\*(?:[^\/]|\/[^*])*?(?:\*\/|\/\*.*)}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. Updated in: 82584ba

Copy link
Contributor

@fbogsany fbogsany left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This LGTM. Thanks for this @kaylareopelle. Sorry for the lengthy review process - the PR and scope was quite large!

@arielvalentin
Copy link
Collaborator

@kaylareopelle I am going to do some exploratory testing before merging!

Thanks everyone for the thorough reviews.

@kaylareopelle kaylareopelle requested a review from a team as a code owner January 5, 2024 05:44
@kaylareopelle kaylareopelle changed the title refactor: Move shared sql behavior to helper gems refactor!: Move shared sql behavior to helper gems Feb 6, 2024
Mysql2, trilogy, and pg instrumentation contained duplicated
constants and methods. This refactor creates two new
gems, opentelemetry-helpers-mysql and
opentelemetry-helpers-sql-obfuscation, to hold the shared
code.

It also:
* Improves SQL statement query finding for users of
  Marginalia/Active Record Query Logs

* UTF-8 encodes MySQL statements before extraction
* Adds fixtures from New Relic’s SQL obfuscation tests
* Adjusts regex queries to support multiple lines
* Adds obfuscation support for SQLite, Apache Cassandra, and Oracle
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
help wanted Extra attention is needed
Projects
None yet
Development

Successfully merging this pull request may close these issues.

SQL Instrumentation Improvements
7 participants