
Adding tracing_opentelemetry::layer to tracing_subscriber::registry breaks span exporting to Jaeger #159

Open
landonxjames opened this issue Jul 30, 2024 · 22 comments · May be fixed by #175

Comments

@landonxjames

Bug Report

Version

$ cargo tree | grep tracing
├── opentelemetry-appender-tracing v0.5.0
│   ├── tracing v0.1.40
│   │   ├── tracing-attributes v0.1.27 (proc-macro)
│   │   └── tracing-core v0.1.32
│   ├── tracing-core v0.1.32 (*)
│   └── tracing-subscriber v0.3.18
│       ├── tracing v0.1.40 (*)
│       └── tracing-core v0.1.32 (*)
│   │       │   │   └── tracing v0.1.40 (*)
│   │       │   └── tracing v0.1.40 (*)
│   │       │   │   └── tracing v0.1.40 (*)
│   │       └── tracing v0.1.40 (*)
├── tracing v0.1.40 (*)
├── tracing-core v0.1.32 (*)
├── tracing-opentelemetry v0.25.0
│   ├── tracing v0.1.40 (*)
│   ├── tracing-core v0.1.32 (*)
│   ├── tracing-log v0.2.0
│   │   └── tracing-core v0.1.32 (*)
│   └── tracing-subscriber v0.3.18 (*)
└── tracing-subscriber v0.3.18 (*)

Platform

$ uname -a
Darwin bcd0746352ef 23.5.0 Darwin Kernel Version 23.5.0: Wed May  1 20:12:58 PDT 2024; root:xnu-10063.121.3~5/RELEASE_ARM64_T6000 arm64

Description

When I add a tracing_opentelemetry::layer() to a tracing_subscriber::registry(), my existing exports to Jaeger break.

To replicate first run the following to start an instance of Jaeger:

docker run -d --name jaeger -e COLLECTOR_OTLP_ENABLED=true -p 16686:16686 -p 4317:4317 -p 4318:4318 jaegertracing/all-in-one:latest

Then run the following code with the below Cargo.toml:

[package]
name = "otel-test"
version = "0.1.0"
edition = "2021"

[dependencies]
once_cell = "1.13"
opentelemetry = { version = "0.24.0", features = ["metrics", "logs"] }
opentelemetry_sdk = { version = "0.24.1", features = ["rt-tokio", "logs"] }
opentelemetry-otlp = { version = "0.17.0", features = ["tonic", "metrics", "logs"] }
opentelemetry-semantic-conventions = "0.16.0"
tokio = { version = "1.0", features = ["full"] }
opentelemetry-appender-tracing = { version = "0.5.0", default-features = false}
tracing = { version = "0.1.40", default-features = false, features = ["std"]}
tracing-core = { version = "0.1.32", default-features = false}
tracing-subscriber = { version= "0.3.18", default-features = false, features = ["env-filter","registry", "std"] }
tracing-opentelemetry = "0.25.0"
use opentelemetry::global::shutdown_tracer_provider;
use opentelemetry::{
    global,
    trace::{TraceContextExt, TraceError, Tracer, TracerProvider},
    KeyValue,
};
use opentelemetry_otlp::WithExportConfig;
use opentelemetry_sdk::{runtime, trace as sdktrace, Resource};
use opentelemetry_semantic_conventions::resource::SERVICE_NAME;
use tracing_subscriber::prelude::*;

use std::error::Error;

fn init_tracer_provider() -> Result<opentelemetry_sdk::trace::TracerProvider, TraceError> {
    opentelemetry_otlp::new_pipeline()
        .tracing()
        .with_exporter(
            opentelemetry_otlp::new_exporter()
                .tonic()
                .with_endpoint("http://localhost:4317"),
        )
        .with_trace_config(
            sdktrace::Config::default()
                .with_resource(Resource::new(vec![KeyValue::new(SERVICE_NAME, "BLAHBLAH")])),
        )
        .install_batch(runtime::Tokio)
}

#[tracing::instrument()]
fn do_the_thing(num: u8) {
    tracing::info!("I AM DOING THE THING {num}");
}

#[tokio::main]
async fn main() -> Result<(), Box<dyn Error + Send + Sync + 'static>> {
    let tracer_provider = init_tracer_provider().expect("Failed to initialize tracer provider.");
    let foo_tracer = tracer_provider.tracer("FOO_TEST");
    global::set_tracer_provider(tracer_provider.clone());

    let tracer = global::tracer("tracing-jaeger");

    // Uncommenting the lines below will break trace exporting to Jaeger
    // let telemetry_layer = tracing_opentelemetry::layer().with_tracer(foo_tracer);
    // tracing_subscriber::registry().with(telemetry_layer).init();

    tracer.in_span("main-operation", |cx| {
        let span = cx.span();
        span.set_attribute(KeyValue::new("my-attribute", "my-value"));
        span.add_event(
            "Main span event".to_string(),
            vec![KeyValue::new("foo", "1")],
        );
        tracer.in_span("child-operation...", |cx| {
            let span = cx.span();
            span.add_event("Sub span event", vec![KeyValue::new("bar", "1")]);
        });
    });

    do_the_thing(1);

    shutdown_tracer_provider();
    Ok(())
}

Running the above as is will export the traces created by the OTel crates to Jaeger. Uncommenting the below lines will cause that exporting to fail:

    // let telemetry_layer = tracing_opentelemetry::layer().with_tracer(foo_tracer);
    // tracing_subscriber::registry().with(telemetry_layer).init();

Instead I would have expected these lines to export the spans created by the tracing crate to Jaeger alongside the spans created by the OTel crates. I initially reported this to the OTel crates, but it seems like the problem is likely coming from tracing_opentelemetry.

@mladedav
Contributor

Try commenting out the global::set_tracer_provider call. tracing-opentelemetry also calls it, which might interfere. Although it should register the same tracer, so maybe it won't change anything.

@SDAChess

SDAChess commented Aug 7, 2024

Having the same issue. Commenting out the global::set_tracer_provider did not solve it.

@graebm

graebm commented Sep 20, 2024

I've lost days to this. I'm a newb to Rust and OpenTelemetry and assumed it was my fault when I tweaked the opentelemetry-otlp/examples/basic-otlp example to use tracing_opentelemetry for spans and the example stopped sending any data to jaeger-all-in-one or otel-desktop-viewer.

@graebm

graebm commented Sep 23, 2024

I did some experiments; it's just the latest version that's borked. If I use slightly older versions (and the corresponding versions of their dependencies), the opentelemetry-otlp example works fine.

I experimented with the 3 most recent releases: v0.26.0, v0.25.0, v0.24.0
I cloned this repo, checked out the corresponding version, and (while running jaeger-all-in-one) ran: cargo run --example opentelemetry-otlp 10 times.

for v0.26.0: jaeger displayed no traces
for v0.25.0: jaeger displayed traces for 8/10 runs
for v0.24.0: jaeger displayed traces for 10/10 runs

So I'll be using v0.24.0 and its dependencies for now. For anyone curious, these are:

tracing-subscriber = "0.3.18"
tracing = "0.1.40"
opentelemetry = { version = "0.23", features = ["trace", "metrics"] }
opentelemetry_sdk = { version = "0.23", default-features = false, features = ["trace", "rt-tokio"] }
opentelemetry-stdout = { version = "0.4.0", features = ["trace", "metrics"] }
opentelemetry-otlp = { version = "0.16", features = ["metrics"] }
opentelemetry-semantic-conventions = "0.15.0"
tracing-core = "0.1.28"
tracing-opentelemetry = "0.24.0"

I did a Wireshark capture of traffic to port 4317, and in the failed run there's only 1 HTTP/2 connection, from the MetricsService:
POST /opentelemetry.proto.collector.metrics.v1.MetricsService/Export

while in successful runs there are 2 HTTP/2 connections, one from the MetricsService and another from the TraceService:
POST /opentelemetry.proto.collector.trace.v1.TraceService/Export

Not sure if that helps. If I weren't such a Rust newb, I'd dig further into the actual code.

@djc
Collaborator

djc commented Sep 24, 2024

This is the tracing-opentelemetry commit that upgraded to opentelemetry 0.25:

18fffd6

None of the other changes in tracing-opentelemetry 0.26 seem to be potential regressions for this.

So I'll suggest that your issue isn't caused by any problem in tracing-opentelemetry, but is instead in your own code or, more likely, in the interaction between your code and the opentelemetry crates.

@graebm

graebm commented Sep 24, 2024

In my last post, I was just checking out this repo and running its opentelemetry-otlp.rs example

In the latest 0.26.0 release (which mostly uses the 0.26 releases of opentelemetry-* crates), no tracing data is being exported. But in the 0.24.0 release (which mostly uses the 0.24 releases of opentelemetry-* crates) it all works fine.

If very little changed in this crate, my best guess is that the break is due to some change in the opentelemetry-* crates

@djc
Collaborator

djc commented Sep 25, 2024

If very little changed in this crate, my best guess is that the break is due to some change in the opentelemetry-* crates

Yes, that is what I was suggesting. I'm going to close this issue based on that conclusion; feel free to reopen if you have reason to disagree.

@djc closed this as not planned Sep 25, 2024
@mladedav
Contributor

I believe that if we think this library is broken, even if it's broken by one of its dependencies, we should still track it. People can find this issue, and it is still an open problem.

It also seems that the examples in this repository don't work; that sounds like an issue that should be solved, as the examples are probably the most common thing people use when trying to set this up.

@djc reopened this Sep 25, 2024
@mzabaluev
Contributor

mzabaluev commented Oct 9, 2024

I've found that setting the registry subscriber as the global default (the .init() call in the original example) could be the problem.
This example hangs on exiting the "test_span":

use std::error::Error;

use opentelemetry::trace::TracerProvider as _;
use opentelemetry::{global as otel_global, KeyValue};
use opentelemetry_otlp::WithExportConfig as _;
use opentelemetry_sdk::{runtime, trace::Config as TraceConfig, Resource};
use opentelemetry_semantic_conventions::resource::SERVICE_NAME;
use tracing::{info, info_span};
use tracing_subscriber::prelude::*;

#[tokio::main]
async fn main() -> Result<(), Box<dyn Error + Send + Sync + 'static>> {
    let exporter = opentelemetry_otlp::new_exporter()
        .tonic()
        .with_endpoint("http://localhost:4317");
    let tracer_provider = opentelemetry_otlp::new_pipeline()
        .tracing()
        .with_exporter(exporter)
        .with_trace_config(
            TraceConfig::default()
                .with_resource(Resource::new([KeyValue::new(SERVICE_NAME, "playground")])),
        )
        .install_simple()
        /* .install_batch(runtime::Tokio) */
        ?;

    let tracer = tracer_provider.tracer("movement_tracing");
    let telemetry_layer = tracing_opentelemetry::layer().with_tracer(tracer);
    let subscriber = tracing_subscriber::registry().with(telemetry_layer);
    tracing::subscriber::set_global_default(subscriber)?;

    // tracing::subscriber::with_default(subscriber, ||
    {
        let _span = info_span!("test_span").entered();

        info!("test event");
    }
    // );

    otel_global::shutdown_tracer_provider();
    Ok(())
}

If the .install_simple() call is replaced with install_batch, the example does not hang, but it does not export any traces, just like in the original report.

Commenting out the line with tracing::subscriber::set_global_default and uncommenting the tracing::subscriber::with_default call makes the example work. This is not ideal, as the latter only applies to the current thread.
Oddly, the top-level doc example only shows use of with_default. I will try to pare down the opentelemetry-otlp example to see if it has the same problem.
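
For reference, a minimal sketch of the thread-scoped alternative, assuming the same telemetry_layer as in the example above; with_default only makes the subscriber the default for code running inside the closure on the current thread:

// Thread-scoped sketch (assumes `telemetry_layer` from the example above).
// The subscriber is only the default inside the closure, so spans emitted by
// the exporter's own worker threads are not routed back into this layer.
let subscriber = tracing_subscriber::registry().with(telemetry_layer);
tracing::subscriber::with_default(subscriber, || {
    let _span = tracing::info_span!("test_span").entered();
    tracing::info!("test event");
});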

@mzabaluev
Contributor

mzabaluev commented Oct 9, 2024

With the "simple" span processor, the deadlock is traced to
blocking while a lock is held:

        let result = self
            .exporter
            .lock()
            .map_err(|_| TraceError::Other("SimpleSpanProcessor mutex poison".into()))
            .and_then(|mut exporter| futures_executor::block_on(exporter.export(vec![span])));

If the tonic exporter enters and leaves a tracing span anywhere in the implementation of the export async call (in this case, in hyper-util implementing a synchronous DNS resolver), the globally set tracing subscriber handles that span in a tokio worker thread, blocking on trying to lock the same mutex.

Regardless of the choice of the processor, the fundamental problem is that we have a recursion due to the globally set tracing subscriber. Perhaps tracing-opentelemetry should temporarily replace the global tracing subscriber while exporting to prevent the recursion.
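
As an illustration only (not something the library does today), the suppression idea could look roughly like the hypothetical helper below, which installs a no-op subscriber on the current thread for the duration of an export:

use tracing::subscriber::{set_default, NoSubscriber};

// Hypothetical sketch: run an export closure with tracing suppressed on the
// current thread, so spans emitted by the exporter's dependencies cannot
// re-enter the OpenTelemetry layer and block on the same mutex.
fn export_without_reentrancy<T>(export: impl FnOnce() -> T) -> T {
    let _guard = set_default(NoSubscriber::default());
    export()
}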

@djc
Collaborator

djc commented Oct 9, 2024

Perhaps tracing-opentelemetry should temporarily replace the global tracing subscriber while exporting to prevent the recursion.

On every export? It feels like the performance cost of that would be prohibitive.

@mzabaluev
Contributor

On every export? It feels like the performance cost of that would be prohibitive.

Yes, also there will be issues with other threads closing spans concurrently.
But as it stands, setting a subscriber with an OpenTelemetryLayer as the global default is fragile. Maybe the examples showing this should always set up a filter to only export selected spans (I need to check if the deadlock can in fact be broken up this way). Also, what if the opentelemetry-rust implementation is changed so that the internally entered spans occur nested into the telemetry-exported span on the same thread?

@mzabaluev
Contributor

Filtering works with the simple processor:

    // Requires `use tracing::Level;` and `use tracing_subscriber::filter;` in addition to the earlier imports.
    let telemetry_layer = tracing_opentelemetry::layer()
        .with_tracer(tracer)
        .with_filter(filter::Targets::new().with_target("playground_telemetry", Level::INFO));
    let subscriber = tracing_subscriber::registry().with(telemetry_layer);
    tracing::subscriber::set_global_default(subscriber)?;

    {
        let _span = info_span!(target: "playground_telemetry", "test_span").entered();

        info!(target: "playground_telemetry", "test event");
    }

The batch processor, however, is broken by set_global_default.

@djc
Collaborator

djc commented Oct 11, 2024

I think it would be good to open an issue against the opentelemetry-rust project describing this issue from first principles -- AIUI, it's still mostly an opentelemetry-rust issue rather than a tracing-opentelemetry issue?

@mzabaluev
Contributor

AIUI, it's still mostly an opentelemetry-rust issue rather than a tracing-opentelemetry issue?

I think it's not, from the opentelemetry-rust POV: any component down their exporter stack might be using tracing, and they can't be expected to prevent their dependencies from doing so. But if the global tracing subscriber calls into the opentelemetry pipeline again, we always have the potential for lock reentrancy, infinite recursion, or a trace queue growing until it exhausts a resource limit.

@mzabaluev
Contributor

The batch processor, however, is broken by set_global_default.

The issue with the batch processor is different. It looks like it needs the subscriber to be dropped to finish exporting properly.

This example emits a span 1000 times to cause some export data to be buffered.

    let tracer = tracer_provider.tracer("tracing-otel-subscriber");
    let telemetry_layer = tracing_opentelemetry::layer()
        .with_tracer(tracer)
        .with_filter(filter::Targets::new().with_target("example_telemetry", Level::INFO));
    let fmt_layer = tracing_subscriber::fmt::layer()
        .with_thread_names(true)
        .without_time();
    let subscriber = tracing_subscriber::registry()
        .with(telemetry_layer)
        .with(fmt_layer);
    tracing::subscriber::set_global_default(subscriber)?;

    for _ in 0..1000 {
        let _span = info_span!(target: "example_telemetry", "test_span").entered();

        info!(target: "example_telemetry", "test event");
    }

This is the output I get from the fmt layer after the 1000 test events at the end of main:

DEBUG tokio-runtime-worker tower::buffer::worker: buffer closing; waking pending tasks
TRACE tokio-runtime-worker h2::proto::streams::streams: Streams::recv_eof
TRACE tokio-runtime-worker h2::proto::streams::state: recv_eof; state=Open { local: Streaming, remote: AwaitingHeaders }
TRACE tokio-runtime-worker h2::proto::streams::counts: transition_after; stream=StreamId(1); state=State { inner: Closed(Error(Io(BrokenPipe, Some("stream closed because of a broken pipe")))) }; is_closed=true; pending_send_empty=true; buffered_send_data=0; num_recv=0; num_send=1
TRACE tokio-runtime-worker h2::proto::streams::counts: dec_num_streams; stream=StreamId(1)
TRACE tokio-runtime-worker h2::proto::streams::counts: transition_after; stream=StreamId(1); state=State { inner: Closed(Error(Io(BrokenPipe, Some("stream closed because of a broken pipe")))) }; is_closed=true; pending_send_empty=true; buffered_send_data=0; num_recv=0; num_send=0
TRACE tokio-runtime-worker h2::proto::streams::streams: drop_stream_ref; stream=Stream { id: StreamId(1), state: State { inner: Closed(Error(Io(BrokenPipe, Some("stream closed because of a broken pipe")))) }, is_counted: false, ref_count: 2, next_pending_send: None, is_pending_send: false, send_flow: FlowControl { window_size: Window(65535), available: Window(0) }, requested_send_capacity: 0, buffered_send_data: 0, send_task: None, pending_send: Deque { indices: None }, next_pending_send_capacity: None, is_pending_send_capacity: false, send_capacity_inc: true, next_open: None, is_pending_open: false, is_pending_push: false, next_pending_accept: None, is_pending_accept: false, recv_flow: FlowControl { window_size: Window(2097152), available: Window(2097152) }, in_flight_recv_data: 0, next_window_update: None, is_pending_window_update: false, reset_at: None, next_reset_expire: None, pending_recv: Deque { indices: None }, is_recv: true, recv_task: None, push_task: None, pending_push_promises: Queue { indices: None, _p: PhantomData<h2::proto::streams::stream::NextAccept> }, content_length: Omitted }
TRACE tokio-runtime-worker h2::proto::streams::counts: transition_after; stream=StreamId(1); state=State { inner: Closed(Error(Io(BrokenPipe, Some("stream closed because of a broken pipe")))) }; is_closed=true; pending_send_empty=true; buffered_send_data=0; num_recv=0; num_send=0
TRACE tokio-runtime-worker h2::proto::streams::streams: drop_stream_ref; stream=Stream { id: StreamId(1), state: State { inner: Closed(Error(Io(BrokenPipe, Some("stream closed because of a broken pipe")))) }, is_counted: false, ref_count: 1, next_pending_send: None, is_pending_send: false, send_flow: FlowControl { window_size: Window(65535), available: Window(0) }, requested_send_capacity: 0, buffered_send_data: 0, send_task: None, pending_send: Deque { indices: None }, next_pending_send_capacity: None, is_pending_send_capacity: false, send_capacity_inc: true, next_open: None, is_pending_open: false, is_pending_push: false, next_pending_accept: None, is_pending_accept: false, recv_flow: FlowControl { window_size: Window(2097152), available: Window(2097152) }, in_flight_recv_data: 0, next_window_update: None, is_pending_window_update: false, reset_at: None, next_reset_expire: None, pending_recv: Deque { indices: None }, is_recv: true, recv_task: None, push_task: None, pending_push_promises: Queue { indices: None, _p: PhantomData<h2::proto::streams::stream::NextAccept> }, content_length: Omitted }
TRACE tokio-runtime-worker h2::proto::streams::counts: transition_after; stream=StreamId(1); state=State { inner: Closed(Error(Io(BrokenPipe, Some("stream closed because of a broken pipe")))) }; is_closed=true; pending_send_empty=true; buffered_send_data=0; num_recv=0; num_send=0
DEBUG tokio-runtime-worker tonic::transport::channel::service::connection: connection task error: hyper::Error(Io, Custom { kind: BrokenPipe, error: "connection closed because of a broken pipe" })

@h-a-n-a

h-a-n-a commented Oct 14, 2024

I'm new to open-telemetry and have been stuck on this issue for 2 days.

I was trying to run the demo in examples/opentelemetry-otlp.rs, but Jaeger didn't come up with anything.
At first, I thought this was related to open-telemetry/opentelemetry-rust#1625 and open-telemetry/opentelemetry-rust#1961. Based on that, I tried to use global::set_tracer_provider, but it turned out not to help. Then I followed examples/basic.rs and used subscriber::set_default, and it worked.

Then I downgraded tracing-opentelemetry and a few opentelemetry-* libraries to the versions provided above, and it worked. I haven't dug into the corresponding crates. It would be great if someone could check if there's anything wrong between these versions.

I also tried downgrading the opentelemetry-* libraries and keeping tracing-opentelemetry as-is, but the types did not match, as tracing-opentelemetry apparently requires higher versions of opentelemetry-*. So I was not able to tell which one is the root cause.

I will update the reproducible demo later.

Reproducible demo: https://github.com/h-a-n-a/issue_opentelemetry_jaeger

@mzabaluev
Contributor

I have found a partial workaround in calling force_flush on the TracerProvider returned by install_batch. Dropping the provider instance before global shutdown does not help.

The question remains as to why the batch worker does not flush automatically.
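
A rough sketch of that workaround, assuming tracer_provider is the value returned by install_batch(runtime::Tokio) and an SDK version from this era, where force_flush returns one result per registered span processor:

// Partial workaround sketch: flush buffered spans before shutting down.
// (Assumption: in the SDK versions discussed here, force_flush returns a
// Vec with one result per span processor.)
for result in tracer_provider.force_flush() {
    if let Err(err) = result {
        eprintln!("failed to flush spans: {err:?}");
    }
}
opentelemetry::global::shutdown_tracer_provider();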

@mzabaluev
Contributor

mzabaluev commented Oct 15, 2024

A helpful (but not always, see below) thing to do is to call shutdown on the TracerProvider. Dropping does not help because the inner object is still alive: the Arc in TracerProvider is cloned into the tracer that is owned by the OpenTelemetryLayer.

The batch exporter does flush on its own upon reaching some conditions as long as the runtime is running, so there is no need to do anything periodically.

@h-a-n-a

h-a-n-a commented Oct 16, 2024

The right thing to do is to call shutdown on the TracerProvider. Dropping does not help because the inner object is still alive, because the Arc in TracerProvider is cloned into the tracer that is owned by OpenTelemetryLayer.

The batch exporter does flush on its own upon reaching some conditions as long as the runtime is running, so there is no need to do anything periodically.

@mzabaluev
EDIT: I fixed this issue by sending the TracerProvider::shutdown operation to another blocking thread:

tokio::task::spawn_blocking(move || {
    let _ = provider.shutdown();
})
.await
.unwrap();

ORIGINAL:
I ran a demo and it got stuck. Without the TracerProvider::shutdown call it shut down successfully but didn't produce any Jaeger traces. Do you know if anything is missing?

use opentelemetry::{
    global::{self, shutdown_tracer_provider},
    trace::TracerProvider,
    KeyValue,
};
use opentelemetry_sdk::{runtime, Resource};
use tracing::subscriber::{set_default, with_default};
use tracing_opentelemetry::OpenTelemetryLayer;
use tracing_subscriber::{layer::SubscriberExt, util::SubscriberInitExt};

#[tracing::instrument]
fn foo() {
    tracing::info!("info");
    tracing::error!("error");
    bar();
}

#[tracing::instrument]
fn bar() {
    tracing::info!("info");
    tracing::error!("error");
}

#[tokio::test]
async fn subscriber_init() {
    let provider = opentelemetry_otlp::new_pipeline()
        .tracing()
        .with_exporter(opentelemetry_otlp::new_exporter().tonic())
        .with_trace_config(opentelemetry_sdk::trace::Config::default().with_resource(
            Resource::new(vec![KeyValue::new("service.name", "issue-03")]),
        ))
        .install_batch(runtime::Tokio)
        .unwrap();
    global::set_tracer_provider(provider.clone());

    let trace = provider.tracer("issue-03");

    tracing_subscriber::registry()
        .with(tracing_subscriber::fmt::layer())
        .with(OpenTelemetryLayer::new(trace))
        .init();

    foo();

    let _ = provider.shutdown();

    shutdown_tracer_provider();
}

My dependencies:

[dependencies]
opentelemetry = "0.26.0"
opentelemetry-otlp = "0.26.0"
opentelemetry_sdk = { version = "0.26.0", features = ["rt-tokio"] }
tokio = { version = "1.32.0", features = ["macros", "rt-multi-thread"] }
tracing = "0.1"
tracing-opentelemetry = "0.27.0"
tracing-subscriber = { version = "0.3" }

@mzabaluev
Contributor

mzabaluev commented Oct 16, 2024

The shutdown of the span processor calls futures_executor::block_on, which is probably not a good thing to do in async code running on the Tokio runtime.

@mzabaluev
Contributor

The shutdown of the span processor calls futures_executor::block_on, which is probably not a good thing to do in async code running on the Tokio runtime.

open-telemetry/opentelemetry-rust#2047 (comment)
