Solid OpenAI post-mortem on @kubernetes.io API overload, caused by "per-node telemetry ingestion" https://status.openai.com/incidents/ctrsv3lwd797 🤔
Oddly similar to a common DaemonSet @prometheus.io / @opentelemetry.io gotcha for metric scrapes that we talked about in the past:
https://youtu.be/yk2aaAyxgKw?t=768 🙈
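For reference, a minimal sketch of the node-scoped discovery fix (an OTel Collector agent config; the job name is illustrative, and K8S_NODE_NAME is assumed to be injected via the Downward API). The same `selectors` knob works in plain Prometheus scrape configs:

```yaml
receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: node-pods
          kubernetes_sd_configs:
            - role: pod
              # Without this selector, every per-node agent opens a
              # cluster-wide pod watch: N nodes => N full watches,
              # multiplying kube-apiserver load by the node count.
              selectors:
                - role: pod
                  field: spec.nodeName=${env:K8S_NODE_NAME}
exporters:
  debug: {}
service:
  pipelines:
    metrics:
      receivers: [prometheus]
      exporters: [debug]
```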
Comments
The data plane inside Kubernetes relies on the kube-apiserver because of DNS-based service discovery…
However, the prevention part looks promising.
In cases like these I always see how safe our cluster-managing infrastructure actually is, and how monitoring makes it easy to find the offenders:
https://github.com/GoogleCloudPlatform/prometheus-engine
Another solution is of course to have a cluster-level service discovery that pushes the targets to the node agents (as sketched below), but a pure (optimized) DaemonSet usually scales well enough!
Somewhat relevant: https://opentelemetry.io/docs/kubernetes/operator/target-allocator/
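The target allocator is basically that push model: one central component does the apiserver watches and hands each DaemonSet agent only its own node's targets. A rough sketch, assuming the OTel Operator's v1beta1 CRD (names and values illustrative):

```yaml
apiVersion: opentelemetry.io/v1beta1
kind: OpenTelemetryCollector
metadata:
  name: node-agent
spec:
  mode: daemonset
  targetAllocator:
    enabled: true
    allocationStrategy: per-node   # one watcher, N consumers
    prometheusCR:
      enabled: true                # discover ServiceMonitors/PodMonitors centrally
  config:
    receivers:
      prometheus:
        config:
          scrape_configs: []       # filled in by the target allocator
    exporters:
      debug: {}
    service:
      pipelines:
        metrics:
          receivers: [prometheus]
          exporters: [debug]
```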