Alert-Driven Monitoring

(simpleobservability.com)

50 points | by khazit 3 hours ago

9 comments

  • stingraycharles 2 hours ago
    Good metrics and alerting systems are designed from the top down, not bottom up.

    Lots of metrics are typically available, but almost all of them are noise.

    Start with the business: what is important to the business? What kinds of failures are existential threats?

    Then work your way down and design your metrics and alerts, instead of just throwing stuff at the wall.

    I’ve had to push back so many times with teams whose manager at one point said “we need better monitoring / alerting” and who interpreted that to mean more metrics / alerts.

    This is rarely the case.

    I personally am really fond of using just a few alerts. The important thing is to know that something went wrong, not necessarily where / why / how it went wrong.

    And yes, inertia is real, and false / low-value alerts need to be killed immediately, without remorse. They are an SRE’s cancer.

    • dandellion 2 hours ago
      I agree that alerts should be limited to the vital ones. But for monitoring and metrics, more is generally better. I joined a company where, when something broke, the only way to figure out what was wrong was to ssh in and hop through several services; it was a massive waste of time for something that would have been trivial to narrow down if basic OTel had been set up.
    • b112 2 hours ago
      If you receive too many emails, alerts, warnings, and so on, you are only training yourself and the team to ignore them.

      As you say, few is better. And a well chosen few.

    • alansaber 2 hours ago
      Very few alerts, implemented around core business logic, incorporating as many edge cases as possible. This is the way.
  • prpl 44 minutes ago
    > The real core of infrastructure monitoring isn’t dashboards. It’s the alerts.

    “it’s not X it’s Y”

    At this point, when I see this pattern in writing I assume most if not all of it is AI-generated - same with em-dashes.

    This is not to discount the idea that alerts are more important than dashboards (I work directly in observability) - just to say that I personally stop reading anything with these patterns because, generally speaking, the rest of the content is just not original or interesting.

  • kylemaxwell 5 minutes ago
    I like the ideas, but either it’s entirely LLM-written or the writer has internalized the “LLM voice”. At this point, that is more distracting than helpful.
  • ecoffey 22 minutes ago
    I certainly agree in spirit that the alerts are important and should be actionable. But I wouldn't start by just "looking at the service" and then trying to define the first set of alerts.

    Instead I would move up a level and start with an SLO for the various "business level" metrics you might care about. Things like "request latency", "successful requests", etc.

    Then use the longer-lookahead "error budget" burndowns to see where your error budget is being spent, and from there decide 1) if the SLO needs adjusting, and/or 2) if an alert is appropriate (a rough burn-rate sketch follows below).

    To cleanly answer those questions and iterate, you'll need metrics, dashboards, traces, and logs. So then you're not just making dashboards because "it's best practice"; you're creating them specifically to help you measure whether you're meeting your stated service objectives.

    https://sre.google/sre-book/service-level-objectives/
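
    To make the burn-rate idea concrete, here is a rough sketch in plain Python. The SLO target, thresholds, window sizes, and example ratios are all made up for illustration and not tied to any particular metrics backend:

        # Sketch: multiwindow burn-rate check for an availability SLO.
        # Assumes you can query "fraction of failed requests" over a time
        # window from whatever metrics store you already have.

        SLO_TARGET = 0.999                 # 99.9% of requests succeed
        ERROR_BUDGET = 1 - SLO_TARGET      # 0.1% of requests may fail over the SLO period

        def burn_rate(error_ratio: float) -> float:
            """How fast the budget is burning relative to an exactly-on-SLO pace."""
            return error_ratio / ERROR_BUDGET

        def should_page(short_window_ratio: float, long_window_ratio: float) -> bool:
            # Page only if both a short and a long window are burning fast,
            # so brief blips don't wake anyone up. Thresholds are illustrative.
            return burn_rate(short_window_ratio) > 14.4 and burn_rate(long_window_ratio) > 14.4

        # Example: 2% of requests failed over the last 5 minutes, 1.5% over the last hour.
        print(should_page(short_window_ratio=0.02, long_window_ratio=0.015))  # True

    The 14.4x figure is the usual "2% of a 30-day budget consumed in one hour" threshold from the SRE workbook; tune it and the windows to whatever paging latency you can live with.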

  • manoDev 1 hour ago
    • esafak 1 hour ago
      Now we use purely statistical measures, which requires a probabilistic model. The name of the game is calibration.
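
      A minimal sketch of what "calibrated" can mean in practice, using stdlib Python and a Gaussian model purely as an example (real latency data is usually heavier-tailed, and the baseline numbers here are made up):

          import statistics

          # Sketch: derive the alert threshold from a probabilistic model of history,
          # so the expected false-alert rate is explicit instead of a hard-coded "> 500ms".
          history_ms = [120, 135, 128, 140, 131, 125, 138, 129, 133, 127]  # made-up baseline latencies
          mu = statistics.mean(history_ms)
          sigma = statistics.stdev(history_ms)

          # Calibration: pick z so that, if the model is right, a healthy system
          # only trips the alert about 0.1% of the time (one-sided z ~ 3.09).
          Z = 3.09
          threshold_ms = mu + Z * sigma

          def should_alert(observed_ms: float) -> bool:
              return observed_ms > threshold_ms

          print(round(threshold_ms, 1), should_alert(200))  # ~149.4 True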
  • Yokohiii 1 hour ago
    In my opinion the best method to reduce alerts is to work hard to get rid of the underlying problems or turn them into non-problems. If you do a good job, most remaining errors are third-party driven, which can indeed be hard to solve, depending on company politics. But at that point you can always tell your boss how it could be solved, and that you won't go on pager duty for stuff that is out of your control.
  • prism56 1 hour ago
    I work writing analytics and monitoring for industrial equipment. We have hundreds of sensors sending back realtime data.

    There was a period of time when people were writing alerts for the sake of it (i.e. we have this sensor, so when should we alert on it?).

    Nowadays we're strictly failure-mode driven; this has meant lots of sensors aren't used in the analytics. They are, however, available for the experts to plot for a more holistic view if required.

  • analogpixel 2 hours ago
    > Alerts should be actionable. If no action can or should be taken, then the alert is not needed.

    Also, the best alerts come from looking at actual failures you had and not trying to make up "good alerts" from thin air. After you have an outage, figure out what alerts would have caught it, and implement those.

    • esafak 1 hour ago
      I know something is going to happen if disk space runs out; I don't need to experience it first.
      • stackskipton 1 hour ago
        Sure, but for every alert, there is an exception.

        Elasticsearch, for example, can be configured using ILM policies to fill up the disk and then start deleting old records. I don't need to be woken up for disk filling up on those nodes (a rough sketch of such a policy is below).

        Even worse is CPU/RAM alerts.
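
        For reference, an age-based ILM retention policy is roughly this shape. This is a sketch from memory via the REST API in Python; the cluster URL, policy name, and thresholds are placeholders, and the exact field names and disk-aware behavior should be checked against the Elasticsearch docs for your version:

            import requests

            # Sketch: ILM policy that rolls indices over and deletes old ones,
            # so retention (rather than a pager) handles disk growth on these nodes.
            policy = {
                "policy": {
                    "phases": {
                        "hot": {
                            "actions": {
                                "rollover": {"max_primary_shard_size": "50gb", "max_age": "7d"}
                            }
                        },
                        "delete": {
                            "min_age": "30d",  # drop indices ~30 days after rollover
                            "actions": {"delete": {}},
                        },
                    }
                }
            }

            resp = requests.put(
                "http://localhost:9200/_ilm/policy/logs-retention",  # placeholder URL / policy name
                json=policy,
                timeout=10,
            )
            resp.raise_for_status()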

        • esafak 1 hour ago
          Alerts are for when things don't go as expected. You set up log rotation, but an agent quietly breaks it or ES introduces a bug in it.
  • jbmsf 1 hour ago
    I have some thoughts here.

    I work for a startup; we have what I think is a fairly typical setup: metrics ingested from a variety of sources, fed into industry-standard metrics/dashboard solutions, triggering escalations to humans. It's fine and I'm happy we have it, but...

    The highest value source of alerting right now is one of our growth marketers who pays close attention to our CRM and product analytics tool and notices when key product funnels are underperforming.

    Our next highest value signals are a handful of ad hoc alerting channels, mostly in Slack, either directly from a partner telling us that something suspicious happened on their side (think: fraud) or from in-product instrumentation sent to a channel for non-engineering visibility. Members of our business/product/operations team pay attention in these places and make decisions based on their business context.

    After that, our support team is increasingly able to filter customer issues and differentiate between bugs, missing features, etc.

    I know someone is going to argue that these are all a sign that we haven't instrumented the right things. Fair, but that also misses the point. The decision makers in these flows don't (and won't) live in traditional alerting systems, and they wouldn't have been able to help us understand breakages without these other, ad hoc processes.

    My theory is that it's relatively easy to offer a technical product that moves alerts around or that manages escalation paths. It's quite hard to design a product that surfaces detail to a non-technical expert and that makes it easy to build systematic rules.