0/ Now that organizations are building or buying observability, they are realizing that it can get really damned expensive. And not just “expensive,” but “expensive and out of control.”
This is a thread about *observability value:* both the benefits and the costs.
                    
                                    
                    This is a thread about *observability value:* both the benefits and the costs.
                        
                        
                        1/ You hear so much about observability because it *can* be awesome. :) Benefits roll up into at least one of the following:
- Reducing latencies or error rates (foreach service)
- Reducing MTTR (also foreach service)
- Improving velocity or communication (foreach team)
                    
                                    
                    - Reducing latencies or error rates (foreach service)
- Reducing MTTR (also foreach service)
- Improving velocity or communication (foreach team)
                        
                        
                        2/ But most observability vendors charge based on something that has literally no value on its own: *the telemetry.*
This is rough for customers, especially since these vendors provide no mechanism to scale or *control* the telemetry volume (why would they? it’s $$$!).
                    
                                    
                    This is rough for customers, especially since these vendors provide no mechanism to scale or *control* the telemetry volume (why would they? it’s $$$!).
                        
                        
                        3/ Let’s dig into this a bit more. Really, there are two flavors of telemetry – statistics (i.e., “metrics”) and events (i.e., “traces and logs”), and they should be considered separately.
                        
                        
                        
                        
                                                
                    
                    
                                    
                    
                        
                        
                        4/ For metrics telemetry, the cost driver is cardinality, especially around “custom metrics.” Per-metric cardinality is combinatorial and grows well into the millions, and customers pay accordingly.
It really doesn’t need to be that way!
                    
                                    
                    It really doesn’t need to be that way!
                        
                        
                        5/ Most of that “cardinality budget” is spent on long-tail metrics that *literally never appear in query results*. Customers should have a simple slider to trade off cardinality, spend, and query result quality – schematically, like this:
                        
                        
                        
                        
                                                
                        
                                                
                    
                    
                                    
                    
                        
                        
                        6/ Want perfect fidelity for a business-critical metric? Drag the slider all the way to the right. Want to degrade gracefully for a customer_id tag with millions of values? Drag the slider to the left, pay 99% less, and still have metric data for your largest customers.
                        
                        
                        
                        
                                                
                    
                    
                                    
                    
                        
                        
                        7/ Event data is interesting, too.
(An aside: 99% of “logging telemetry” is really just “tracing telemetry.” Any log *about a transaction* (i.e., “almost all of them”) should be attached to the trace context so it can benefit from trace analysis.)
                    
                                    
                    (An aside: 99% of “logging telemetry” is really just “tracing telemetry.” Any log *about a transaction* (i.e., “almost all of them”) should be attached to the trace context so it can benefit from trace analysis.)
                        
                        
                        8/ In any case, there is a fundamental tradeoff between the *number of transactions recorded* (txns/sec – i.e., those that survive sampling at a given analytical stage) and the *level of detail* (bytes/txn) in those transactions.
                        
                        
                        
                        
                                                
                    
                    
                                    
                    
                        
                        
                        9/ When we multiply “txns/sec” by “bytes/txn,” we end up with a “bytes/sec.” In order to visualize the trade-offs here, we chart various telemetry throughput targets as (hyperbolic) lines against the following axes:
                        
                        
                        
                        
                                                
                        
                                                
                    
                    
                                    
                    
                        
                        
                        10/ For tracing (or logging) data, we must trade off between *sampling* (either in the clients, in collection infra, or before hitting long-term durable storage) and *detail*. There is no “right answer” here, and it’s just something that should be considered carefully.
                        
                        
                        
                        
                                                
                    
                    
                                    
                    
                        
                        
                        11/ But no matter where we draw that line, the reality is that there is *A LOT* of tracing/logging/event data; especially when microservices are involved, as the data volume is a function of transaction rate *multiplied by* microservice count!
                        
                        
                        
                        
                                                
                    
                    
                                    
                    
                        
                        
                        12/ The elephant in the room around tracing data is where we *collect, store, and analyze* the data. The moment you send things over the WAN, you are paying a 100x cost penalty; but if you just keep things in a “dumb” collector, you can’t query over it dynamically.
                        
                        
                        
                        
                                                
                    
                    
                                    
                    
                        
                        
                        13/ Architecturally, this means that computation *must* be distributed – close to the data – and that data *must* stay close to the services themselves; or the network cost is simply too disruptive to overall observability ROI.
                        
                        
                        
                        
                                                
                    
                    
                                    
                    
                        
                        
                        14/ So, in closing…
Observability can be _incredibly_ valuable!
*Do* build an ROI case around its primary benefactors: your services, your teams, and your brand.
But `Telemetry != Observability`: *don’t* pay vendors extra for a telemetry firehose you can& #39;t even control.
                    
                                    
                    Observability can be _incredibly_ valuable!
*Do* build an ROI case around its primary benefactors: your services, your teams, and your brand.
But `Telemetry != Observability`: *don’t* pay vendors extra for a telemetry firehose you can& #39;t even control.
                        
                        
                        PS: Given all of the above,  @lightstephq& #39;s approach to metrics cleanly separates telemetry from observability, and will provide explicit control over cardinality costs.
You can sign up for early access here: https://lightstep.com/metrics/ ">https://lightstep.com/metrics/&...
                    
                
                You can sign up for early access here: https://lightstep.com/metrics/ ">https://lightstep.com/metrics/&...
 
                         Read on Twitter
Read on Twitter 
                             
                             
                             
                                     
                                    