[Python Otel] Manage call tracer life cycle use call arena. (v1.65.x backport) #37478

Merged

XuanWang-Amos (Contributor)

Backport of #37460 to v1.65.x.

We're seeing a segfault in the Python CSM tests:

2024-08-03T09:49:45.720555997Z *** SIGSEGV received at time=1722678585 on cpu 0 ***
2024-08-03T09:49:45.721761998Z PC: @     0x7847ffd5c1c9  (unknown)  (unknown)
2024-08-03T09:49:45.722070502Z     @     0x7847fa309d8c         64  absl::lts_20240116::WriteFailureInfo()
2024-08-03T09:49:45.722175904Z     @     0x7847fa309a15        272  absl::lts_20240116::AbslFailureSignalHandler()
2024-08-03T09:49:45.722187675Z     @     0x7847ffc3d050       1592  (unknown)
2024-08-03T09:49:45.723432238Z     @     0x7847e97f9390  (unknown)  (unknown)
2024-08-03T09:49:45.723487349Z     @ ... and at least 1 more frames
2024-08-03T09:49:45.829702781Z [INFO  tini (1)] Spawned child process '/xds_interop_client' with pid '7'
2024-08-03T09:49:45.829766869Z [DEBUG tini (1)] Received SIGCHLD
2024-08-03T09:49:45.829778749Z [DEBUG tini (1)] Reaped child with pid: '7'
2024-08-03T09:49:45.829787070Z [INFO  tini (1)] Main child exited with signal (with signal 'Segmentation fault')

The issue

After investigation, we found that the call tracer was deleted before RecordEnd was called.

Why this fix

  • To fix this, we decided to use the call arena to manage the life cycle of CallTracer.
  • Since CallTracer is created in another shared object library (grpcio_observability), which doesn't have a dependency on grpc core, we can't use grpc_core::Arena directly when creating the call tracer.
  • As a workaround, we created a wrapper class, ClientCallTracerWrapper, to wrap the CallTracer, and added another core API, grpc_call_tracer_set_and_manage, so that the life cycle of CallTracer can be managed through the wrapper class.
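The ownership idea behind these points can be sketched with a toy model. This is only an illustration under stated assumptions, not the code in this PR: `ToyArena`, `ToyCallTracer`, `ToyTracerWrapper`, `SetAndManage`, and `RunCall` are all hypothetical names, and the real `grpc_core::Arena` and `grpc_call_tracer_set_and_manage` signatures are not reproduced here. The sketch shows a tracer allocated outside the arena being handed to an arena-owned wrapper, so the tracer is destroyed only when the call's arena is torn down, after `RecordEnd` has run.

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a tracer created in a library that cannot
// see the arena type. It logs its own lifetime events for inspection.
class ToyCallTracer {
 public:
  explicit ToyCallTracer(std::vector<std::string>* log) : log_(log) {}
  ~ToyCallTracer() { log_->push_back("tracer destroyed"); }
  void RecordEnd() { log_->push_back("RecordEnd"); }

 private:
  std::vector<std::string>* log_;
};

// Toy arena (assumption: NOT grpc_core::Arena). Objects created through
// ManagedNew are deleted, in reverse order, when the arena is destroyed.
class ToyArena {
 public:
  ~ToyArena() {
    for (auto it = cleanups_.rbegin(); it != cleanups_.rend(); ++it) (*it)();
  }

  template <typename T, typename... Args>
  T* ManagedNew(Args&&... args) {
    T* p = new T(std::forward<Args>(args)...);
    cleanups_.push_back([p] { delete p; });
    return p;
  }

 private:
  std::vector<std::function<void()>> cleanups_;
};

// Hypothetical wrapper: takes ownership of an externally allocated tracer
// so that the tracer's lifetime follows the wrapper's (and so the arena's).
class ToyTracerWrapper {
 public:
  explicit ToyTracerWrapper(ToyCallTracer* tracer) : tracer_(tracer) {}
  ToyCallTracer* tracer() const { return tracer_.get(); }

 private:
  std::unique_ptr<ToyCallTracer> tracer_;
};

// Hypothetical analogue of a "set and manage" entry point: the caller
// passes a raw tracer pointer and gives up ownership to the arena.
ToyCallTracer* SetAndManage(ToyArena* arena, ToyCallTracer* tracer) {
  return arena->ManagedNew<ToyTracerWrapper>(tracer)->tracer();
}

// Simulates one call and returns the order of lifetime events.
std::vector<std::string> RunCall() {
  std::vector<std::string> log;
  {
    ToyArena arena;  // lives as long as the call
    ToyCallTracer* tracer = SetAndManage(&arena, new ToyCallTracer(&log));
    tracer->RecordEnd();  // tracer is still alive here
  }  // arena teardown deletes the wrapper, which deletes the tracer
  return log;
}
```

Calling `RunCall()` in this toy model yields the events in the order `RecordEnd`, then `tracer destroyed`, which is the ordering the fix is meant to guarantee.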

Closes grpc#37460

COPYBARA_INTEGRATE_REVIEW=grpc#37460 from XuanWang-Amos:fix_otel_segfault 33c0b98
PiperOrigin-RevId: 662966853
XuanWang-Amos added the "release notes: no" and "release notes: yes" labels and removed the "release notes: no" label on Aug 14, 2024.
XuanWang-Amos merged commit dcbbf06 into grpc:v1.65.x on Aug 14, 2024.
58 of 63 checks passed.
XuanWang-Amos deleted the backport-1.65-fix_otel_segfault branch on August 14, 2024, 19:04.