The number of moving parts in the system meant that many of the lower-level metrics we were gathering generated a lot of noise. We didn’t have the benefit of scaling gradually, or of letting the system run for a few months to understand what good looked like for metrics such as our CPU rate or even the latencies of some of the individual components.
This is effectively training the monitoring (or the monitors) to understand system behaviour: a feedback loop of sorts. Would it need to be continually updated as a microservice's functionality changes? At a guess, yes, but when and how often? Prior to each deploy? Could a delta help?
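One way to picture the delta idea is the rough sketch below. It is my own illustration, not anything taken from the quoted system. It learns a baseline (mean and standard deviation) from a window of samples, flags values far from that baseline, and compares the baselines from before and after a deploy to decide whether the monitor needs retraining. The names `Baseline`, `learn_baseline`, `is_anomalous`, `baseline_delta` and `needs_retraining` are all hypothetical, and the thresholds (3 standard deviations, a 20% shift) are arbitrary assumptions.

```python
from dataclasses import dataclass
from statistics import mean, stdev


@dataclass
class Baseline:
    """Summary of what 'good' looked like for one metric (hypothetical type)."""
    mean: float
    stdev: float


def learn_baseline(samples: list[float]) -> Baseline:
    # statistics.stdev needs at least two samples.
    return Baseline(mean=mean(samples), stdev=stdev(samples))


def is_anomalous(value: float, baseline: Baseline, tolerance: float = 3.0) -> bool:
    # Flag values more than `tolerance` standard deviations from the mean.
    # The default of 3.0 is an assumption, not a recommendation.
    return abs(value - baseline.mean) > tolerance * baseline.stdev


def baseline_delta(before: Baseline, after: Baseline) -> float:
    # Relative shift in the mean between two baselines.
    # Assumes before.mean is non-zero (true for latencies, CPU rates, etc.).
    return (after.mean - before.mean) / before.mean


def needs_retraining(before: Baseline, after: Baseline, max_shift: float = 0.2) -> bool:
    # If the mean moved by more than max_shift (20% here, an assumption),
    # treat the old baseline as stale and relearn it.
    return abs(baseline_delta(before, after)) > max_shift


# Made-up latency samples (ms) from before and after a deploy.
before = learn_baseline([100, 102, 98, 101, 99])
after = learn_baseline([130, 128, 132, 131, 129])

print(is_anomalous(120, before))        # far outside the old baseline
print(needs_retraining(before, after))  # the mean shifted by 30%
```

In this framing the delta answers the "when and how often" question: rather than retraining on a schedule, the baseline would be relearned only when a deploy (or a canary run before it) moves the metric by more than the agreed shift.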