bpf: add gRPC/HTTP2 context propagation via sk_msg HPACK injection #1832

Open
mmat11 wants to merge 3 commits into open-telemetry:main from coralogix:matt/cp-grpc
Conversation

@mmat11
Contributor

@mmat11 mmat11 commented Apr 14, 2026

Summary

This PR implements HTTP/2 (gRPC) context propagation.
Resolves #1095.

Validation

Contributor

Copilot AI left a comment

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Adds end-to-end gRPC/HTTP2 context propagation support by injecting/parsing traceparent in HTTP/2 (HPACK) frames and introducing an integration test relay chain to validate cross-language propagation and multiplexed stream isolation.

Changes:

  • Implement HTTP/2 (gRPC) traceparent injection in sk_msg plus HPACK parsing/adoption in kprobe HTTP/2/gRPC paths keyed by {ports, stream_id}.
  • Add a multi-hop (Go→Python→Go→Node.js→Java→Go) docker-compose-based integration test suite covering chain propagation and multiplexed concurrency.
  • Document the gRPC/HTTP2 propagation architecture and mark gRPC context propagation as supported in feature docs.
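For readers unfamiliar with HPACK, here is a minimal user-space sketch of how a `traceparent` header could be encoded for injection into a HEADERS frame. It assumes the "literal header field without indexing, new name" form from RFC 7541 section 6.2.2 with no Huffman coding; the PR may well use a different representation. The names `encode_traceparent`, `TP_NAME_LEN`, `TP_VALUE_LEN`, and `TP_HPACK_LEN` are hypothetical and do not come from the PR.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* W3C traceparent value: "00-" + 32 hex + "-" + 16 hex + "-" + 2 hex = 55 bytes. */
#define TP_NAME "traceparent"
#define TP_NAME_LEN 11
#define TP_VALUE_LEN 55
/* 1 type byte + 1 name-length byte + name + 1 value-length byte + value. */
#define TP_HPACK_LEN (1 + 1 + TP_NAME_LEN + 1 + TP_VALUE_LEN)

static_assert(sizeof(TP_NAME) - 1 == TP_NAME_LEN, "name length mismatch");
static_assert(TP_VALUE_LEN < 127, "value length must fit the 7-bit prefix");

/*
 * Hypothetical helper: encode traceparent as an HPACK literal header field
 * without indexing, new name, no Huffman coding (assumed representation).
 * Returns the number of bytes written, or 0 if the input or buffer is unusable.
 */
static size_t encode_traceparent(uint8_t *out, size_t cap, const char *value) {
    if (out == NULL || value == NULL || cap < TP_HPACK_LEN ||
        strlen(value) != TP_VALUE_LEN) {
        return 0;
    }
    size_t n = 0;
    out[n++] = 0x00;        /* 0000 0000: literal, not indexed, new name */
    out[n++] = TP_NAME_LEN; /* H bit clear, length in the 7-bit prefix */
    memcpy(out + n, TP_NAME, TP_NAME_LEN);
    n += TP_NAME_LEN;
    out[n++] = TP_VALUE_LEN; /* H bit clear, length in the 7-bit prefix */
    memcpy(out + n, value, TP_VALUE_LEN);
    n += TP_VALUE_LEN;
    return n;
}
```

Under these assumptions the encoded field is a fixed 69 bytes, so an injector would also need to grow the 24-bit length in the HEADERS frame header by the same amount.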

Reviewed changes

Copilot reviewed 29 out of 31 changed files in this pull request and generated 5 comments.

Show a summary per file
| File | Description |
| --- | --- |
| internal/test/integration/grpc_relay_test.go | New integration tests validating relay-chain propagation and multiplexed stream isolation via Jaeger queries |
| internal/test/integration/docker-compose-grpc-relay.yml | Adds a 7-hop cross-language relay chain plus OBI + Jaeger wiring for integration testing |
| internal/test/integration/configs/obi-config-grpc-relay.yml | Adds OBI discovery + OTLP exporter config for the relay test environment |
| internal/test/integration/components/jaeger/jaeger.go | Adds helper to filter spans by operation, service, and span.kind |
| internal/test/integration/components/grpc_relay/relay.proto | Defines the relay gRPC service used by the multi-language test components |
| internal/test/integration/components/grpc_relay/python/server.py | Python relay hop with health endpoint and downstream gRPC call |
| internal/test/integration/components/grpc_relay/python/requirements.txt | Pinned Python dependencies for Python relay container build |
| internal/test/integration/components/grpc_relay/python/Dockerfile | Builds the Python relay container and generates gRPC stubs |
| internal/test/integration/components/grpc_relay/nodejs/server.js | Node.js relay hop with persistent client connection to exercise HTTP/2 multiplexing |
| internal/test/integration/components/grpc_relay/nodejs/package.json | Node.js relay dependencies for gRPC/proto loading |
| internal/test/integration/components/grpc_relay/nodejs/Dockerfile | Builds the Node.js relay container |
| internal/test/integration/components/grpc_relay/java/src/main/proto/relay.proto | Java proto definition for the relay service |
| internal/test/integration/components/grpc_relay/java/src/main/java/relay/RelayServer.java | Java relay hop with shared Netty event loop and health endpoint |
| internal/test/integration/components/grpc_relay/java/pom.xml | Maven build for Java relay with protobuf/grpc plugins |
| internal/test/integration/components/grpc_relay/java/Dockerfile | Multi-stage build for Java relay container |
| internal/test/integration/components/grpc_relay/go/main.go | Go relay hop(s) and multiplex endpoint to generate concurrent streams on one connection |
| internal/test/integration/components/grpc_relay/go/go.sum | Go dependency locks for the Go relay component |
| internal/test/integration/components/grpc_relay/go/go.mod | Go module definition for the relay component |
| internal/test/integration/components/grpc_relay/go/Dockerfile | Builds the Go relay container |
| devdocs/grpc-context-propagation.md | New design doc for HPACK injection + TCP options propagation |
| devdocs/features.md | Updates feature matrix to indicate gRPC context propagation support |
| devdocs/README.md | Adds link to the new gRPC context propagation doc |
| bpf/tpinjector/tpinjector.c | Implements sk_msg HTTP/2 detection, HPACK injection, and trace adoption logic |
| bpf/tpinjector/maps/sk_h2_conn_flag.h | Adds SK_STORAGE marker to tag sockets as HTTP/2 |
| bpf/gotracer/maps/grpc.h | Adds conn_ptr → connection_info map to bind stream_id to correct TCP ports |
| bpf/gotracer/go_grpc.c | Writes per-stream outgoing trace context from Go uprobe to outgoing_trace_map |
| bpf/generictracer/protocol_http2.h | Adds HPACK traceparent parsing and adopts injected per-stream context |
| bpf/common/trace_lifecycle.h | Ensures egress_key_t.stream_id is initialized in non-H2 paths |
| bpf/common/h2_defs.h | Centralizes HTTP/2/HPACK constants and traceparent layout offsets |
| bpf/common/egress_key.h | Extends egress_key to include HTTP/2 stream_id for multiplex isolation |
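To illustrate why a stream_id in the key gives multiplex isolation, and why it has to be initialized on non-H2 paths, here is a small user-space sketch. The `stream_key` layout, `make_key`, and `key_equal` are hypothetical; the PR's actual `egress_key_t` may have different fields and sizes. The assumption is that the key is compared bytewise, as BPF hash-map keys are, so any uninitialized byte would make two logically identical keys differ.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical key layout; the PR's egress_key_t may differ. */
struct stream_key {
    uint16_t src_port;
    uint16_t dst_port;
    uint32_t stream_id; /* HTTP/2 stream id; 0 on non-HTTP/2 paths */
};

static_assert(sizeof(struct stream_key) == 8, "no padding expected");

/* Build a key with every byte defined, so bytewise comparison is reliable. */
static struct stream_key make_key(uint16_t src, uint16_t dst, uint32_t stream_id) {
    struct stream_key k;
    memset(&k, 0, sizeof(k));
    k.src_port = src;
    k.dst_port = dst;
    k.stream_id = stream_id;
    return k;
}

/* Bytewise equality, mirroring how a hash map would compare keys. */
static bool key_equal(const struct stream_key *a, const struct stream_key *b) {
    return memcmp(a, b, sizeof(*a)) == 0;
}
```

With this layout, two concurrent streams on the same connection (same ports, stream ids 1 and 3) map to different entries, while a non-H2 path that always passes 0 keeps producing the same key for a given port pair.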
Files not reviewed (1)
  • internal/test/integration/components/grpc_relay/nodejs/package-lock.json: Language not supported
Comments suppressed due to low confidence (1)

internal/test/integration/grpc_relay_test.go:1

  • If compose.Up() or any later require.* in this test fails, compose.Close() may not run, potentially leaking containers/resources in CI. Register cleanup immediately after creating/starting the suite (e.g., via t.Cleanup(func(){ _ = compose.Close() }), and optionally also call it after Up() succeeds) so teardown happens even on early failures.

Comment thread internal/test/integration/components/grpc_relay/go/go.mod
Comment thread bpf/tpinjector/tpinjector.c
Comment thread bpf/tpinjector/tpinjector.c
Comment thread bpf/tpinjector/tpinjector.c
Comment thread internal/test/integration/docker-compose-grpc-relay.yml Outdated
@codecov

codecov bot commented Apr 14, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 69.39%. Comparing base (3583541) to head (13754d3).

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1832      +/-   ##
==========================================
+ Coverage   69.36%   69.39%   +0.03%     
==========================================
  Files         276      276              
  Lines       32692    32692              
==========================================
+ Hits        22677    22688      +11     
+ Misses       8807     8804       -3     
+ Partials     1208     1200       -8     
Flag Coverage Δ
integration-test 56.81% <ø> (+0.71%) ⬆️
integration-test-arm 28.98% <ø> (-0.59%) ⬇️
integration-test-vm-x86_64-5.15.152 30.33% <ø> (-0.08%) ⬇️
integration-test-vm-x86_64-6.10.6 29.65% <ø> (-0.15%) ⬇️
k8s-integration-test 43.47% <ø> (-0.18%) ⬇️
oats-test 38.40% <ø> (+0.24%) ⬆️
unittests 58.37% <ø> (+0.01%) ⬆️

@github-actions
Contributor

CI Supervisor: Pull request integration tests (attempt 1)

Job Conclusion Duration Verdict
shard-6 (11 tests) failure 40m flaky
Action: Re-running failed jobs (attempt 2 of 2)

@github-actions
Contributor

CI Supervisor: Pull request integration tests (attempt 1)

Job Conclusion Duration Verdict
shard-6 (11 tests) failure 38m flaky
Action: Re-running failed jobs (attempt 2 of 2)

@github-actions
Contributor

CI Supervisor: Pull request integration tests (attempt 2)

Job Conclusion Duration Verdict
shard-6 (11 tests) failure 40m flaky
Action: NOT re-running. Reason: Maximum re-run attempts reached (attempt 2 of 2)

@github-actions
Contributor

CI Supervisor: Pull request checks (attempt 1)

Job Conclusion Duration Verdict
Generate and checks failure 1m flaky
Action: Re-running failed jobs (attempt 2 of 2)

@github-actions
Contributor

CI Supervisor: Pull request checks (attempt 2)

Job Conclusion Duration Verdict
Generate and checks failure 0m flaky
Action: NOT re-running. Reason: Maximum re-run attempts reached (attempt 2 of 2)

@github-actions
Contributor

CI Supervisor: Pull request integration tests ARM (attempt 1)

Job Conclusion Duration Verdict
shard-1 (1 tests) failure 1m flaky
shard-0 (1 tests) failure 1m flaky
shard-3 (1 tests) failure 2m flaky
shard-2 (1 tests) failure 1m flaky
Action: Re-running failed jobs (attempt 2 of 2)

@github-actions
Contributor

CI Supervisor: Test Docker build (attempt 1)

Job Conclusion Duration Verdict
build (Dockerfile, linux/amd64, ubuntu-latest) failure 2m flaky
Action: Re-running failed jobs (attempt 2 of 2)

@github-actions
Contributor

CI Supervisor: Pull request integration tests on VM (attempt 1)

Job Conclusion Duration Verdict
kernel 5.15.152 x86_64 shard-0 (1 tests) failure 2m flaky
kernel 5.15.152 x86_64 shard-2 (1 tests) failure 2m flaky
kernel 5.15.152 x86_64 shard-1 (1 tests) failure 2m flaky
kernel 6.10.6 x86_64 shard-0 (1 tests) failure 2m flaky
kernel 5.15.152 x86_64 shard-3 (1 tests) failure 1m flaky
kernel 6.10.6 x86_64 shard-1 (1 tests) failure 1m flaky
kernel 6.10.6 x86_64 shard-2 (1 tests) failure 2m flaky
kernel 6.10.6 x86_64 shard-3 (1 tests) failure 2m flaky
Action: Re-running failed jobs (attempt 2 of 2)

@github-actions
Contributor

CI Supervisor: Pull request K8s integration tests (attempt 1)

Job Conclusion Duration Verdict
disable_informers failure 1m flaky
netolly failure 1m flaky
daemonset_python failure 2m flaky
netolly_multizone failure 1m flaky
netolly_promexport failure 2m flaky
netolly_tc_promexport failure 1m flaky
otel failure 2m flaky
informer_cache failure 2m flaky
daemonset failure 1m flaky
daemonset_multi_node failure 1m flaky
netolly_dropexternal failure 1m flaky
netolly_multizone_prom failure 1m flaky
owners failure 1m flaky
prom failure 1m flaky
restrict_local_node failure 2m flaky
Action: Re-running failed jobs (attempt 2 of 2)

@github-actions
Contributor

CI Supervisor: Pull request integration tests (attempt 1)

Job Conclusion Duration Verdict
shard-2 (11 tests) failure 1m flaky
shard-1 (10 tests) failure 2m flaky
shard-0 (9 tests) failure 1m flaky
shard-3 (11 tests) failure 2m flaky
shard-5 (11 tests) failure 2m flaky
shard-7 (11 tests) failure 2m flaky
shard-6 (11 tests) failure 1m flaky
shard-8 (11 tests) failure 1m flaky
shard-9 (11 tests) failure 2m flaky
shard-4 (11 tests) failure 1m flaky
Action: Re-running failed jobs (attempt 2 of 2)

Contributor

@grcevski grcevski left a comment

This looks great! LGTM! I think we don't need to add the suggestion I made in this PR; we can follow up with the additional check. I'm sure the CI will give you hell with the verifier :).


bpf_tail_call_static(msg, &extender_jump_table, k_tail_find_existing_tp);
// HTTP/2 detection: known H2 socket or "PRI " preface.
if ((inject_flags & k_inject_http_headers) && msg->size >= k_h2_frame_header_len) {
Contributor

I think here we need to check whether it's the PRI preface or whether the connection is recorded in ongoing_http2_connections. PRI is only seen on the first exchange between the two sides, but gRPC clients tend to hold on to the same connection. When that happens, we'll misclassify the request as TCP and ship it to user space, where OBI will detect that it's HTTP2 and record the connection as HTTP2 in ongoing_http2_connections. This is how protocol_handler.h checks this. We must also check whether it was HTTP2 over SSL and do nothing in that case; SSL and HTTP2 can easily be mixed up.

http2_conn_info_data_t *h2g = bpf_map_lookup_elem(&ongoing_http2_connections, &args->pid_conn);
if (h2g && (http2_flag_ssl(h2g->flags) == args->ssl)) {
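As a rough user-space illustration of the check discussed above, the sketch below treats a buffer as HTTP/2 if the connection is already known to be HTTP/2 or if the buffer starts with the client preface, and it also shows how the 9-byte frame header yields the stream id. The names `starts_with_preface`, `parse_frame_header`, and `is_h2_candidate` are hypothetical, the `known_h2` flag merely stands in for a lookup such as `ongoing_http2_connections`, and the SSL flag comparison is left out for brevity.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define H2_FRAME_HEADER_LEN 9
#define H2_FRAME_TYPE_HEADERS 0x1

/* Client connection preface from RFC 9113, section 3.4 (24 bytes). */
static const char h2_preface[] = "PRI * HTTP/2.0\r\n\r\nSM\r\n\r\n";
static_assert(sizeof(h2_preface) - 1 == 24, "preface must be 24 bytes");

struct h2_frame_header {
    uint32_t length;    /* 24-bit payload length */
    uint8_t type;
    uint8_t flags;
    uint32_t stream_id; /* 31 bits; the reserved top bit is masked off */
};

static bool starts_with_preface(const uint8_t *buf, size_t len) {
    return len >= sizeof(h2_preface) - 1 &&
           memcmp(buf, h2_preface, sizeof(h2_preface) - 1) == 0;
}

/* Decode the fixed 9-byte HTTP/2 frame header (all fields big-endian). */
static bool parse_frame_header(const uint8_t *buf, size_t len,
                               struct h2_frame_header *out) {
    if (len < H2_FRAME_HEADER_LEN) {
        return false;
    }
    out->length = ((uint32_t)buf[0] << 16) | ((uint32_t)buf[1] << 8) | buf[2];
    out->type = buf[3];
    out->flags = buf[4];
    out->stream_id = (((uint32_t)buf[5] << 24) | ((uint32_t)buf[6] << 16) |
                      ((uint32_t)buf[7] << 8) | buf[8]) & 0x7fffffffu;
    return true;
}

/* known_h2 stands in for a map lookup; it is an assumption of this sketch. */
static bool is_h2_candidate(const uint8_t *buf, size_t len, bool known_h2) {
    return known_h2 || starts_with_preface(buf, len);
}
```

On a long-lived gRPC connection the preface has already been sent, so only the `known_h2` branch can classify later HEADERS frames, which is the case the review comment is concerned with.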

@github-actions
Contributor

CI Supervisor: PR OATS test (attempt 1)

Job Conclusion Duration Verdict
ai failure 1m flaky
http failure 1m flaky
mongo failure 1m flaky
kafka failure 1m flaky
redis failure 3m flaky
sql failure 2m flaky
memcached failure 1m flaky
Action: Re-running failed jobs (attempt 2 of 2)

@github-actions
Contributor

CI Supervisor: Pull request integration tests ARM (attempt 2)

Job Conclusion Duration Verdict
shard-3 (1 tests) failure 1m flaky
shard-0 (1 tests) failure 2m flaky
shard-2 (1 tests) failure 1m flaky
shard-1 (1 tests) failure 1m flaky
Action: NOT re-running. Reason: Maximum re-run attempts reached (attempt 2 of 2)

@github-actions
Contributor

CI Supervisor: Test Docker build (attempt 2)

Job Conclusion Duration Verdict
build (Dockerfile, linux/amd64, ubuntu-latest) failure 2m flaky
Action: NOT re-running. Reason: Maximum re-run attempts reached (attempt 2 of 2)

@github-actions
Contributor

CI Supervisor: Pull request integration tests (attempt 2)

Job Conclusion Duration Verdict
shard-3 (11 tests) failure 1m flaky
shard-1 (10 tests) failure 1m flaky
shard-2 (11 tests) failure 2m flaky
shard-8 (11 tests) failure 1m flaky
shard-4 (11 tests) failure 2m flaky
shard-7 (11 tests) failure 2m flaky
shard-0 (9 tests) failure 1m flaky
shard-6 (11 tests) failure 1m flaky
shard-9 (11 tests) failure 2m flaky
shard-5 (11 tests) failure 1m flaky
Action: NOT re-running. Reason: Maximum re-run attempts reached (attempt 2 of 2)

@github-actions
Contributor

CI Supervisor: Pull request integration tests on VM (attempt 2)

Job Conclusion Duration Verdict
kernel 5.15.152 x86_64 shard-1 (1 tests) failure 1m flaky
kernel 5.15.152 x86_64 shard-0 (1 tests) failure 1m flaky
kernel 6.10.6 x86_64 shard-2 (1 tests) failure 2m flaky
kernel 6.10.6 x86_64 shard-1 (1 tests) failure 3m flaky
kernel 6.10.6 x86_64 shard-3 (1 tests) failure 2m flaky
kernel 5.15.152 x86_64 shard-3 (1 tests) failure 1m flaky
kernel 5.15.152 x86_64 shard-2 (1 tests) failure 1m flaky
kernel 6.10.6 x86_64 shard-0 (1 tests) failure 2m flaky
Action: NOT re-running. Reason: Maximum re-run attempts reached (attempt 2 of 2)

@github-actions
Contributor

CI Supervisor: Pull request K8s integration tests (attempt 2)

Job Conclusion Duration Verdict
disable_informers failure 2m flaky
daemonset_python failure 1m flaky
daemonset_multi_node failure 1m flaky
daemonset failure 1m flaky
netolly_dropexternal failure 2m flaky
netolly_promexport failure 1m flaky
netolly_multizone failure 1m flaky
netolly_multizone_prom failure 1m flaky
netolly failure 3m flaky
informer_cache failure 1m flaky
prom failure 1m flaky
owners failure 2m flaky
restrict_local_node failure 1m flaky
netolly_tc_promexport failure 2m flaky
otel failure 2m flaky
Action: NOT re-running. Reason: Maximum re-run attempts reached (attempt 2 of 2)

@github-actions
Contributor

CI Supervisor: PR OATS test (attempt 2)

Job Conclusion Duration Verdict
http failure 2m flaky
memcached failure 1m flaky
ai failure 1m flaky
sql failure 2m flaky
kafka failure 2m flaky
mongo failure 1m flaky
redis failure 1m flaky
Action: NOT re-running. Reason: Maximum re-run attempts reached (attempt 2 of 2)

@github-actions
Contributor

CI Supervisor: PR OATS test (attempt 1)

Job Conclusion Duration Verdict
kafka failure 18m flaky
Action: Re-running failed jobs (attempt 2 of 2)

@github-actions
Contributor

CI Supervisor: Pull request integration tests (attempt 1)

Job Conclusion Duration Verdict
shard-7 (11 tests) failure 39m flaky
Action: Re-running failed jobs (attempt 2 of 2)

@github-actions
Contributor

CI Supervisor: Pull request integration tests (attempt 2)

Job Conclusion Duration Verdict
shard-7 (11 tests) failure 40m flaky
Action: NOT re-running. Reason: Maximum re-run attempts reached (attempt 2 of 2)

Development

Successfully merging this pull request may close these issues.

gRPC context Propagation

3 participants