Earlier this week, we released Conduit v0.4.0. This release has some significant improvements to the Prometheus-based telemetry system, and introduces some incredibly cool new tools for debugging microservices.
Within 60 seconds from installation, Conduit now gives you preconfigured Grafana dashboards for every Kubernetes Deployment. These dashboards not only cover top-line metrics such as success rate, request volume, and latency distributions per service, they break these metrics down per dependency. This means you can easily answer questions like “where is the traffic for this service coming from?” and “what’s the success rate of Foo when it’s calling Bar?”, without having to change anything in your application.
Dude, where’s my traffic?
Some of the most critical questions to answer in a microservices app are around the runtime dependencies between services. It’s one thing to have success rate per service; it’s quite another to understand where traffic to a service is coming from, and which dependencies of a service are exhibiting failures. Typically, these questions are also the hardest to answer as well. But we knew we had an opportunity to make it easy with Conduit.
To accomplish this, in 0.4.0, we moved Conduit’s telemetry system to a completely pull-based (rather than push-based) model for metrics collection. As part of this, we also modified Conduit’s Rust proxy to expose granular metrics describing the source, destination, and health of all requests. A pull-based approach reduces complexity in the proxy, and fits better into the model of the world that Prometheus expects.
As a result, Conduit’s telemetry system is now incredibly flexible. You can dive into request rate, success rate, and latency metrics between any two Kubernetes deployments, pods, or namespaces. You can write arbitrary Prometheus queries on the result. And, of course, we wire everything through to the CLI as well. “Top” for microservices, anyone?
How it works in practice
Let’s walk through a brief example. (For a full set of installation instructions, see the official Conduit Getting Started Guide.)
First, install the Conduit CLI:
curl https://run.conduit.io/install | sh
Next, install Conduit onto your Kubernetes cluster:
conduit install | kubectl apply -f -
Finally, install the “emojivoto” demo app, and add it to the Conduit mesh:
curl https://raw.githubusercontent.com/runconduit/conduit-examples/master/emojivoto/emojivoto.yml | conduit inject - | kubectl apply -f -
The demo app includes a
vote-bot service that is constantly running traffic through the system. This AI-based bot is voting on its favorite emojis and is designed to slowly become more intelligent, and more cunning, over time. For safety reasons, we recommend you don’t let it run for more than a few days. 😉
Let’s see how we can use Conduit to understand how traffic is flowing through the demo app.
Start by seeing how the web service (technically, web Deployment) is doing:
$ conduit stat -n emojivoto deployment web NAME MESHED SUCCESS RPS LATENCY_P50 LATENCY_P95 LATENCY_P99 web 1/1 90.00% 2.0rps 2ms 4ms 9ms
The success rate of requests is only 90%. There’s a problem here. But is it possible this is an upstream failure? We can find out looking at the success rate of the services our `web` deployment talks to.
$ conduit stat deploy --all-namespaces --from web --from-namespace emojivoto NAMESPACE NAME MESHED SUCCESS RPS LATENCY_P50 LATENCY_P95 LATENCY_P99 emojivoto emoji 1/1 100.00% 2.0rps 1ms 2ms 2ms emojivoto voting 1/1 72.88% 1.0rps 1ms 1ms 1ms
Here you see that `web` talks to both the `emoji` and the `voting` services. The success rate of the calls to `emoji` is 100%, but to `voting` its only 72.88%. Note that this command is displaying the success rate only from `web` to `emoji`, and only from `web` to `voting`. The aggregate success rate of the `emoji` and `voting` services might be different.
With just a bit of digging, we’ve determined that the culprit is probably the `voting` service. Who else talks to the `voting` service? To find out, we can run the following command:
$ conduit stat deploy --to voting --to-namespace emojivoto --all-namespaces NAMESPACE NAME MESHED SUCCESS RPS LATENCY_P50 LATENCY_P95 LATENCY_P99 emojivoto web 1/1 83.33% 1.0rps 1ms 2ms 2ms
The `voting` service is only called from the `web` service. So, by tracing dependencies from web, we now have a plausible target for our first investigation: the `voting` service is returning a 83% success rate when `web` is calling it. From here, we might look into the logs, traces, or other forms of deeper investigation into this service.
That’s just a sample of some of the things you can do with Conduit. If you want to dive deeper, try looking at the success rate across all namespaces; success rate for a single namespace, broken down by deployments across all namespaces that call into that namespace; or success rate of Conduit components themselves. The possibilities are endless (kinda)!
We’ve also recorded a brief demo so you can see this in action.
In terms of metrics and telemetry, we’ll be extending these semantics to other Kubernetes objects, such as Pods and ReplicaSets in upcoming releases. We’ll also be making `conduit tap` work on these same objects, since `tap` and `stat` work beautifully together. We might also just have another fun command or two waiting in the wings, ready to show off the power of Conduit’s new telemetry pipeline. Stay tuned!
Special thanks to Frederic Branczyk for invaluable Prometheus help.