<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
    <channel>
        <title>Stephen Fairchild</title>
        <link>https://fairchild.sh/</link>
        <description>Technical writing about software engineering, systems, and tools.</description>
        <language>en-us</language>
        <atom:link href="https://fairchild.sh/rss.xml" rel="self" type="application/rss+xml" />
        <item>
            <title>Tracing in Distributed Systems</title>
            <link>https://fairchild.sh/tracing-in-distributed-systems.html</link>
            <guid>https://fairchild.sh/tracing-in-distributed-systems.html</guid>
            <pubDate>Thu, 30 Apr 2026 00:00:00 +0000</pubDate>
            <description>## Understanding Distributed Tracing

When your application spans dozens of microservices, debugging becomes a nightmare. A single user request might touch ten different services, and you&apos;re stuck grepping through logs trying to piece together what happened. This is where distributed tracing saves you.

### What is Distributed Tracing?

Distributed tracing tracks a request as it flows through your system. Each service adds metadata about its work, creating a complete picture of the request&apos;s journey. You get visibility into:

- Which services were called and in what order
- How long each operation took
- Where errors occurred
- What happened in parallel vs sequentially

### The Building Blocks

**Traces** represent a single request flowing through your system. A trace is made up of spans.

**Spans** represent a unit of work. Each service creates spans for the operations it performs. Spans have:
- A name (like &quot;database query&quot; or &quot;http request&quot;)
- Start and end timestamps
- Tags (key-value metadata)
- Logs (timestamped events)
- Parent span ID (to build the hierarchy)

**Trace Context** is how you connect spans across services. When Service A calls Service B, it passes along trace metadata (trace ID, span ID) in HTTP headers or message metadata. This is called context propagation.
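
Over HTTP, context propagation is commonly done with the W3C Trace Context `traceparent` header, which packs the trace ID and parent span ID into a single string. A stdlib-only sketch (the helper names are made up for the example):

```python
import re

def make_traceparent(trace_id: str, span_id: str, sampled: bool = True) -> str:
    """Build a W3C traceparent header: version-traceid-spanid-flags."""
    flags = "01" if sampled else "00"
    return f"00-{trace_id}-{span_id}-{flags}"

def parse_traceparent(header: str):
    """Extract (trace_id, span_id, sampled) from a traceparent header."""
    m = re.fullmatch(r"00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})", header)
    if m is None:
        return None  # malformed header: start a fresh trace instead
    trace_id, span_id, flags = m.groups()
    return trace_id, span_id, flags == "01"

# Service A attaches the header; Service B parses it and continues the trace
header = make_traceparent("4bf92f3577b34da6a3ce929d0e0e4736", "00f067aa0ba902b7")
trace_id, parent_span_id, sampled = parse_traceparent(header)
```

Service B then creates its own spans under the received trace ID, with the received span ID as their parent - that is the whole mechanism that stitches a trace together across process boundaries.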

### Why You Need It

Traditional logging falls apart in distributed systems. Logs tell you what happened in one service but connecting the dots across services is manual and painful.

Distributed tracing gives you:

**Root Cause Analysis** - See exactly which service is slow or failing. If checkout is slow, you can see whether it&apos;s the payment service, the inventory service, or the database causing the bottleneck.

**Performance Optimization** - Identify slow operations even if they don&apos;t fail. Maybe your services are healthy but you&apos;re making 20 sequential database calls when you could batch them.

**Understanding System Behavior** - See how requests actually flow through your system. Your architecture diagrams lie but traces don&apos;t.

### The OpenTelemetry Standard

OpenTelemetry (OTel) is the industry standard for instrumentation. It provides:
- APIs for creating spans in your code
- SDKs for popular languages
- Automatic instrumentation for common frameworks
- Exporters to send data to backends

The beauty of OTel is vendor neutrality. Instrument your code once and swap backends without changing your application code.

### Implementation Strategy

**Start with HTTP boundaries.** Instrument your HTTP servers and clients first. This gives you service-to-service visibility immediately.

**Add database tracing next.** Database queries are often the bottleneck. Most ORMs have OTel integrations.

**Instrument critical paths.** Don&apos;t trace everything. Focus on user-facing workflows and known problem areas. Too much data becomes noise.

**Propagate context everywhere.** This is the hardest part. Every time you cross a service boundary (HTTP, message queue, background job) you need to propagate trace context. Miss one and your traces break.
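
For example, when handing work to a background job through a queue, the trace context has to travel inside the message itself, because there are no HTTP headers to carry it. A stdlib sketch with made-up field names:

```python
import queue

jobs: queue.Queue = queue.Queue()

def enqueue_job(payload: dict, trace_id: str, span_id: str) -> None:
    # Embed trace context in the message so the worker can continue the trace
    jobs.put({
        "payload": payload,
        "trace": {"trace_id": trace_id, "parent_span_id": span_id},
    })

def run_worker() -> dict:
    msg = jobs.get()
    ctx = msg["trace"]
    # A real worker would start a new span using ctx["trace_id"] as its trace
    # and ctx["parent_span_id"] as its parent; here we just return the context.
    return ctx

enqueue_job({"order_id": 42}, trace_id="abc123", span_id="def456")
ctx = run_worker()
```

Forget this one step and the worker&apos;s spans start a brand-new trace, disconnected from the request that enqueued the job.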

### Common Pitfalls

**Over-instrumentation.** Creating spans for every function call generates massive overhead. Span creation is cheap but not free. Be selective.

**Sampling confusion.** At scale you can&apos;t keep every trace. Sample intelligently - keep all errors and slow requests, sample normal requests. Head-based sampling (decide at the start) is simple but can miss issues. Tail-based sampling (decide after the trace completes) catches more problems but is harder to implement.
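
As a sketch of the difference: a head-based sampler can hash the trace ID to a number and compare it to the sample rate, so every service in the trace makes the same decision; a tail-based sampler waits until the outcome is known and can then always keep errors and slow requests. Illustrative only, not a production sampler:

```python
import hashlib

def head_sample(trace_id: str, rate: float = 0.1) -> bool:
    """Head-based: decide at trace start, deterministically from the trace ID,
    so every service in the trace makes the same keep/drop decision."""
    digest = hashlib.sha256(trace_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return rate > bucket

def tail_keep(had_error: bool, duration_ms: float, rate: float,
              trace_id: str, slow_threshold_ms: float = 1000.0) -> bool:
    """Tail-based: decide after the trace completes, so errors and slow
    requests can always be kept while normal traffic is down-sampled."""
    if had_error or duration_ms >= slow_threshold_ms:
        return True
    return head_sample(trace_id, rate)
```

The trade-off the code makes visible: `head_sample` knows nothing about errors or latency because they have not happened yet, while `tail_keep` needs the whole trace buffered somewhere until the decision can be made - which is exactly why tail-based sampling is harder to operate.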

**Missing context propagation.** If you forget to propagate trace context in even one place your traces become disconnected fragments. This is especially tricky with async code, background jobs, and message queues.

**High cardinality tags.** Don&apos;t use user IDs or other unique identifiers as tag values. Backends index tags, and high cardinality kills performance. Record user IDs in span logs instead, and keep them out of span names for the same reason.


### The Future

Tracing is merging with metrics and logs. The &quot;three pillars of observability&quot; are converging. OTel now supports metrics and logs alongside traces. Soon you&apos;ll query your system asking &quot;show me slow checkout traces&quot; and drill into metrics and logs without leaving your tracing UI.</description>
        </item>
        <item>
            <title>Why Rust is better for the planet</title>
            <link>https://fairchild.sh/rust-is-a-positive-impact-on-the-planet.html</link>
            <guid>https://fairchild.sh/rust-is-a-positive-impact-on-the-planet.html</guid>
            <pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate>
            <description>## Why Rust is Good for the planet

As software engineers, we don&apos;t often think about the environmental impact of our technology choices. But with data centers consuming roughly 1-2% of global electricity, and that share rising, the efficiency of our code matters more than ever.

### Energy Efficiency Through Performance

Rust compiles to native machine code with zero-cost abstractions. This means Rust programs run fast and use minimal CPU cycles compared to interpreted or JIT-compiled languages.

Studies have shown that compiled languages like Rust and C use significantly less energy than interpreted languages like Python or Ruby. For compute-intensive workloads, the difference can be 50x or more in energy consumption.

When you&apos;re running services at scale, this efficiency compounds.

### Memory Safety Reduces Waste

Memory leaks and crashes waste energy. A service that crashes and restarts throws away all the work it was doing, and memory leaks cause services to bloat and consume unnecessary resources. Rust&apos;s ownership model rules out whole classes of these bugs at compile time.

### Longer Server Lifecycles

Because Rust code is so performant, you can get more life out of your hardware. Extending hardware lifecycles reduces e-waste and the environmental cost of manufacturing new equipment.

### Less Infrastructure Means Less Energy

When your code is efficient, you need fewer servers. Fewer servers means:

- Less energy for compute
- Less energy for cooling
- Smaller physical footprint in data centers
- Lower manufacturing impact

Discord famously rewrote their Read States service from Go to Rust and went from thousands of servers to hundreds, while improving latency and reducing memory usage by 10x.

### The Developer Experience Angle

Rust&apos;s strict compiler catches bugs early. That means less time debugging in production, fewer wasted CI/CD runs, and fewer emergency deploys - all wasteful cycles that compile-time guarantees help you avoid.

### Small Binaries, Smaller Carbon Footprint

Rust binaries are small and self-contained, with no runtime to ship.

Smaller artifacts mean:
- Less bandwidth for deployments
- Faster cold starts in serverless environments
- Lower storage costs</description>
        </item>
    </channel>
</rss>
