Python Infrastructure Outage
The incident has been resolved.

## Incident Report: Cluster-wide service outage

**Duration**: ~36 minutes (20:04–20:40 UTC)

**Impact**: Some PSF-hosted services were unavailable, including [python.org](http://python.org), [us.pycon.org](http://us.pycon.org), PyPI stats, [bugs.python.org](http://bugs.python.org), and related services. Our other cluster, which manages [PyPI.org](http://PyPI.org) and other PyPI-related services, was unaffected.

**Root Cause**: During local development of Kubernetes workloads, the active context was incorrectly switched to one of our production clusters. The scale-down commands then ran against the production cluster instead of the local environment, iterating through all deployments and setting each to zero replicas, which caused cascading failures.

**Recovery**: Services were restored with the help of Ee Durbin by bringing up infrastructure in dependency order. Original replica counts were recovered from Kubernetes event history.

**Action items**:

* Separate kubeconfig files for production vs local, rather than relying on context switching
* Research adding admission control or policies to prevent bulk scale-to-zero operations
* Document the infrastructure dependency chain and recovery runbook for future incidents

Jacob Coffee, PSF Infrastructure Team
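
As an illustration of the kind of safeguard the action items point toward, here is a minimal client-side sketch in Python. It is not the PSF's tooling: the allowlist `LOCAL_CONTEXTS`, the context names in it, and the helpers `check_scale` and `UnsafeContextError` are all hypothetical names chosen for this example. The only external command it relies on is `kubectl config current-context`, and a real deployment would more likely enforce this server-side through admission control.

```python
import subprocess

# Hypothetical allowlist: context names treated as safe for destructive commands.
LOCAL_CONTEXTS = frozenset({"minikube", "kind-local", "docker-desktop"})


class UnsafeContextError(RuntimeError):
    """Raised when a scale-to-zero targets a context outside the allowlist."""


def current_context() -> str:
    """Return the active kubectl context name (requires kubectl on PATH)."""
    result = subprocess.run(
        ["kubectl", "config", "current-context"],
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout.strip()


def check_scale(context: str, replicas: int, local_contexts=LOCAL_CONTEXTS) -> None:
    """Refuse a scale-to-zero unless the context is on the local allowlist."""
    if replicas < 0:
        raise ValueError("replicas must be non-negative")
    if replicas == 0 and context not in local_contexts:
        raise UnsafeContextError(
            f"refusing to scale to zero in context {context!r}"
        )
```

A local scale-down script could call `check_scale(current_context(), 0)` before issuing any scale commands, so that an accidental switch to a production context stops the script instead of reaching the cluster.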