The idea is to come up with situations you expect to happen in production eventually.


Then intentionally cause those situations in production (during the work day, with warning) to ensure that you can handle them. The exercises we ran on our cluster often revealed issues such as gaps in monitoring or configuration errors. We were very happy to discover those issues early on in a controlled fashion rather than by surprise six months later.

Here are a few of the game day exercises we ran:

- Terminate one Kubernetes API server
- Terminate all the Kubernetes API servers and bring them back up (to our surprise, this worked very well)
- Terminate an etcd node
- Cut off worker nodes in our Kubernetes cluster from the API servers (so that they can't communicate). This resulted in all pods on those nodes being moved to other nodes.
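The outcome of the last exercise can be pictured with a toy model. This is only a sketch in plain Python, not how Kubernetes implements it: in a real cluster, pods are not literally moved. Controllers create replacement pods once a node is considered unreachable. The function name `reschedule`, the node names, and the "least-loaded node" placement rule are all assumptions made for illustration.

```python
def reschedule(assignments, unreachable):
    """Toy model: return a new node -> pods mapping with every pod from an
    unreachable node placed on the least-loaded reachable node.

    `assignments` maps node name -> list of pod names (hypothetical data).
    """
    reachable = sorted(n for n in assignments if n not in unreachable)
    if not reachable:
        raise RuntimeError("no reachable nodes left to host pods")

    # Copy the pod lists of reachable nodes so the input is not mutated.
    result = {n: list(assignments[n]) for n in reachable}

    for node in sorted(unreachable):
        for pod in assignments.get(node, []):
            # Pick the reachable node currently running the fewest pods.
            target = min(reachable, key=lambda n: len(result[n]))
            result[target].append(pod)
    return result


before = {
    "node-a": ["web-1", "web-2"],
    "node-b": ["cron-1"],
    "node-c": ["web-3"],
}
after = reschedule(before, unreachable={"node-a"})
print(after)
```

In a game day, the interesting part is checking that what you observe in the cluster matches the outcome you predicted beforehand, and that monitoring reported the disruption.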

We were really pleased to see how well Kubernetes responded to a lot of the disruptions we threw at it. Kubernetes is designed to be resilient to errors: it has one etcd cluster storing all the state, an API server which is simply a REST interface to that database, and a collection of stateless controllers that coordinate all cluster management. If any of the Kubernetes core components (the API server, controller manager, or scheduler) are interrupted or restarted, once they come up they read the relevant state from etcd and continue operating seamlessly. This was one of the things we hoped would be true, and it has actually worked very well in practice.

Here are some kinds of issues that we found during these tests:

- "Weird, I didn't get paged for that, that really should have paged. Let's fix our monitoring there."
- "When we destroyed our API server instances and brought them back up, they required human intervention. We’d better fix that."
- "Sometimes when we do an etcd failover, the API server starts timing out requests until we restart it."

After running these tests, we developed remediations for the issues we found: we improved monitoring, fixed configuration issues we'd discovered, and filed bugs with Kubernetes.

Making cron jobs easy to use

Let's briefly explore how we made our Kubernetes-based system easy to use. Our original goal was to design a system for running cron jobs that our team was confident operating and maintaining. Once we had established our confidence in Kubernetes, we needed to make it easy for our fellow engineers to configure and add new cron jobs. We developed a simple YAML configuration format so that our users didn't need to understand anything about Kubernetes’ internals to use the system.
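To make the idea of a simplified configuration format concrete, here is a minimal sketch of how a short job description could be expanded into a full Kubernetes CronJob manifest. The actual format is not shown in the text, so the field names in `simple_job` (`name`, `schedule`, `image`, `command`, `concurrency_policy`), the helper `to_cronjob_manifest`, the image URL, and the default choices are all hypothetical. The job description is written as a Python dict standing in for parsed YAML so that the example needs only the standard library.

```python
import json


def to_cronjob_manifest(job):
    """Expand a simplified job description (hypothetical format) into a
    Kubernetes batch/v1 CronJob manifest."""
    container = {
        "name": job["name"],
        "image": job["image"],
        "command": list(job["command"]),
    }
    return {
        "apiVersion": "batch/v1",
        "kind": "CronJob",
        "metadata": {"name": job["name"]},
        "spec": {
            "schedule": job["schedule"],
            # Assumed default: do not start a run while the previous one is active.
            "concurrencyPolicy": job.get("concurrency_policy", "Forbid"),
            "jobTemplate": {
                "spec": {
                    "template": {
                        "spec": {
                            "containers": [container],
                            "restartPolicy": "OnFailure",
                        }
                    }
                }
            },
        },
    }


# What a user might write, shown here as the dict a YAML parser would return.
simple_job = {
    "name": "nightly-report",
    "schedule": "0 3 * * *",
    "image": "registry.example.com/reports:latest",
    "command": ["python", "generate_report.py"],
}

manifest = to_cronjob_manifest(simple_job)
print(json.dumps(manifest, indent=2))
```

The point of a layer like this is that users only state what to run and when, while the nesting and defaults that Kubernetes requires live in one place that the operating team maintains.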

Posted: Sun Mar 03, 2024 13:19