[RFC] Renovate Streaming Support #5910

yurishkuro · 2024-08-31T16:21:18Z

Background

One of the challenges of distributed tracing is that spans can arrive from all kinds of places in the architecture at different times. If your only job is to store them (which is primarily what the Jaeger collector does), this is not a big problem, since the storage backends take care of partitioning and indexing the spans by trace ID. But the most interesting applications of traces require looking at a whole trace in one place, so that decisions can be based on the overall call graph rather than on individual spans.
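To illustrate what "looking at a whole trace in one place" involves, here is a minimal sketch in plain Go that regroups interleaved spans by trace ID. The `Span` type and `GroupByTrace` function are hypothetical and much simpler than Jaeger's actual domain model; a real pipeline would also have to decide when a trace is complete, which this sketch ignores.

```go
package main

import (
	"fmt"
	"sort"
)

// Span is a simplified, hypothetical span record (not Jaeger's real model).
type Span struct {
	TraceID  string
	SpanID   string
	ParentID string // empty for a root span
	Service  string
}

// GroupByTrace collects spans into per-trace slices, regardless of the
// order in which they arrived. Arrival order is preserved within a trace.
func GroupByTrace(spans []Span) map[string][]Span {
	traces := make(map[string][]Span)
	for _, s := range spans {
		traces[s.TraceID] = append(traces[s.TraceID], s)
	}
	return traces
}

func main() {
	// Spans from two traces arrive interleaved, as they might when
	// different services report at different times.
	arrived := []Span{
		{TraceID: "t1", SpanID: "a", Service: "frontend"},
		{TraceID: "t2", SpanID: "x", Service: "frontend"},
		{TraceID: "t1", SpanID: "b", ParentID: "a", Service: "cart"},
		{TraceID: "t2", SpanID: "y", ParentID: "x", Service: "auth"},
		{TraceID: "t1", SpanID: "c", ParentID: "b", Service: "db"},
	}

	traces := GroupByTrace(arrived)

	// Sort the trace IDs so the printed order is deterministic.
	ids := make([]string, 0, len(traces))
	for id := range traces {
		ids = append(ids, id)
	}
	sort.Strings(ids)
	for _, id := range ids {
		fmt.Printf("trace %s: %d spans\n", id, len(traces[id]))
	}
}
```

In a streaming setting this map is exactly the kind of in-flight state that has to survive restarts, which is why the choice of framework matters.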

Data streaming is well suited to that. Historically, Jaeger supported a couple of Java-based data pipelines (for the basic dependency graph and for the transitive dependency graph), which were implemented independently on top of the Spark and Flink frameworks. There were problems with that approach:

The business logic had to be written in Java, meaning we could not reuse the domain model capabilities we had in the primary Go code.
We had to duplicate some of the logic, e.g. the all-in-one supported constructing a dependency graph on the fly, which was implemented completely independently from the Java Spark job.
The https://github.com/jaegertracing/spark-dependencies and https://github.com/jaegertracing/jaeger-analytics-flink repos have seen very few changes, and the latter doesn't even have a production-grade way of running it.
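To make the duplicated logic concrete, the core of a basic dependency graph computation can be sketched in a few lines of Go. This is an illustration only, not the code used by the all-in-one binary or by the Spark job: the `Span`, `Edge`, and `DependencyCounts` names are hypothetical, and the sketch assumes it is handed one fully assembled trace.

```go
package main

import "fmt"

// Span is a simplified, hypothetical span record (not Jaeger's real model).
type Span struct {
	SpanID   string
	ParentID string // empty for a root span
	Service  string
}

// Edge is a caller -> callee pair of service names.
type Edge struct {
	Parent string
	Child  string
}

// DependencyCounts counts cross-service calls within a single trace by
// looking up the service of each span's parent.
func DependencyCounts(trace []Span) map[Edge]int {
	serviceBySpan := make(map[string]string, len(trace))
	for _, s := range trace {
		serviceBySpan[s.SpanID] = s.Service
	}

	counts := make(map[Edge]int)
	for _, s := range trace {
		parentService, ok := serviceBySpan[s.ParentID]
		if !ok || parentService == s.Service {
			// Root span, parent not in this trace, or a call that
			// stays inside one service: no cross-service edge.
			continue
		}
		counts[Edge{Parent: parentService, Child: s.Service}]++
	}
	return counts
}

func main() {
	trace := []Span{
		{SpanID: "a", Service: "frontend"},
		{SpanID: "b", ParentID: "a", Service: "cart"},
		{SpanID: "c", ParentID: "b", Service: "db"},
	}
	// Map iteration order is not deterministic in Go.
	for edge, n := range DependencyCounts(trace) {
		fmt.Printf("%s -> %s: %d call(s)\n", edge.Parent, edge.Child, n)
	}
}
```

Having this logic written once in Go, next to the domain model, is what would let the in-memory and streaming paths share it.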

Proposal

We should bring streaming capabilities into the main Jaeger repo using Go code. This will address many of the problems mentioned above. The main challenge with data streaming is that it is a stateful activity, which requires checkpointing capabilities to avoid data loss and inconsistent results when Jaeger instances are restarted. This is where the well-known streaming frameworks like Spark and Flink come in: they provide the needed orchestration and statefulness. In the past we could not use them with Go, but today there are projects like Apache Beam that provide a unified programming model via well-supported SDKs (including one for Go), which allows implementing the pipeline logic in Go and executing it on a number of runtimes.
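The checkpointing requirement can be illustrated with a small sketch. It deliberately uses only the Go standard library and does not show Apache Beam's API; the `State` type, its methods, and the `"parent->child"` record format are assumptions made up for this example. The idea is that the aggregated counts and the input offset are saved together, so a restarted instance resumes where it left off without losing or double-counting records.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// State is hypothetical aggregation state: call counts keyed by
// "parent->child", plus the number of input records already folded in.
type State struct {
	Offset int            `json:"offset"`
	Counts map[string]int `json:"counts"`
}

// NewState returns an empty state ready for processing.
func NewState() *State {
	return &State{Counts: make(map[string]int)}
}

// Process folds records[s.Offset:upTo] into the counts and advances the offset.
func (s *State) Process(records []string, upTo int) {
	for s.Offset < upTo && s.Offset < len(records) {
		s.Counts[records[s.Offset]]++
		s.Offset++
	}
}

// Checkpoint serializes counts and offset together as one snapshot.
func (s *State) Checkpoint() ([]byte, error) {
	return json.Marshal(s)
}

// Restore rebuilds the state from a snapshot taken by Checkpoint.
func Restore(snapshot []byte) (*State, error) {
	s := NewState()
	if err := json.Unmarshal(snapshot, s); err != nil {
		return nil, err
	}
	if s.Counts == nil { // a snapshot with "counts": null
		s.Counts = make(map[string]int)
	}
	return s, nil
}

func main() {
	records := []string{"frontend->cart", "cart->db", "frontend->cart", "frontend->auth"}

	// Process the first two records, then take a checkpoint.
	before := NewState()
	before.Process(records, 2)
	snapshot, err := before.Checkpoint()
	if err != nil {
		panic(err)
	}

	// Simulate a restart: the in-memory state is gone, only the snapshot remains.
	after, err := Restore(snapshot)
	if err != nil {
		panic(err)
	}
	after.Process(records, len(records))

	fmt.Println(after.Offset, after.Counts["frontend->cart"]) // prints "4 2"
}
```

Runtimes such as Flink or Spark, driven through a Beam pipeline, are expected to handle this bookkeeping on behalf of the pipeline code, so Jaeger itself would not need to implement it as above.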

yurishkuro mentioned this issue Aug 31, 2024

Implement in-memory Service Dependency Graph using Apache Beam #5911

Open
