Increased latency and error rates
We have seen a full recovery of services.
Scale-Out Delivery Platform for software delivery and testing at scale.
Source
auto
Category
Development
Adapter
STATUSPAGE IO
Verified
Pending review
Current state
Operational
Checked 12m ago
38
Components
0
Active incidents
0
Maintenance
4.35%
90d uptime
Increased latency and error rates
Jun 18, 8:05 AM
Normalized official status-page data for incidents, maintenance, components, and history.
4.35%
Known uptime
23 known history days
38
Components tracked
0 outage, 0 degraded
51
Incidents indexed
0 active right now
39
Maintenance windows
0 active or scheduled
Components with the most recent status-page events.
GitHub Webhooks
Operational
SCM Providers
Operational
AWS ec2-us-east-1
Operational
AWS elasticache-us-east-1
Operational
AWS elb-us-east-1
Operational
Component changes, incidents, and maintenance windows grouped by day.
1
operational days
1
degraded days
21
outage days
0
maintenance days
67
unknown days
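The 4.35% uptime figure appears to follow from these day counts: one operational day out of 23 known days. The sketch below reproduces that arithmetic. The formula is an assumption inferred from the numbers on this page, not a documented Uptimus calculation, and the function name is hypothetical.

```python
# Day counts copied from the 90-day history summary above.
day_counts = {
    "operational": 1,
    "degraded": 1,
    "outage": 21,
    "maintenance": 0,
    "unknown": 67,
}


def known_uptime_percent(counts: dict) -> float:
    """Share of known (non-"unknown") days that were fully operational.

    Assumption: this ratio is how the page arrives at its figure. It is
    inferred from the published numbers, not from a published formula.
    """
    known = sum(n for state, n in counts.items() if state != "unknown")
    if known == 0:
        raise ValueError("no known history days")
    return 100.0 * counts["operational"] / known


total_days = sum(day_counts.values())
known_days = total_days - day_counts["unknown"]
uptime = round(known_uptime_percent(day_counts), 2)
print(total_days, known_days, uptime)
```

Under this reading, the low percentage mostly reflects how few days have recorded history: 67 of the 90 days are unknown and are excluded from the denominator.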
Latest outages and degradations detected from the official status page.
We have seen a full recovery of services.
The mitigation applied before the last update had the intended effect, and we have seen recovery in REST API latency.
Between 00:05 and 00:34 UTC, a subset of customers experienced increased latency and timeout errors on the Agent API. This affected job assignment. At peak impact, we saw an error rate of 1.3% of requests and job acceptance latency of up to 53s.
We received reports that email deliveries were not working, affecting signup and invite emails as well as build notification emails. This issue has now been resolved.
## Service Impact

Customers experienced delayed Buildkite notification delivery. The customer impact varied depending on how those notifications are used. For some customers, delayed notifications also delayed downstream CI, merge, or deployment workflows.

## Incident Summary

On 28 May, Buildkite experienced elevated notification delivery latency after part of our notification-processing infrastructure became underprovisioned. This happened because the Prometheus service used by our EKS autoscaling path ran out of storage, which meant some EKS-based workers could not autoscale correctly while queues were growing.

We mitigated the incident by moving affected workloads back to our previous ECS-based infrastructure and manually increasing worker capacity. Recovery took longer than expected because the rollback path did not fully handle this scenario.

### Impact window 1

At 20:01 UTC, notification-processing workers became underprovisioned and notification delivery latency increased. We detected the issue through internal queue latency monitoring and began shifting affected workloads from EKS back to ECS.

This rollback took longer than expected because the ECS services we were rolling back to were not ready to immediately take the full load. Engineers had to manually adjust scaling configuration and worker counts while the incident was active. Notification latency recovered for most customers by 21:00 UTC.

### Impact window 2

A second, shorter impact window occurred between 22:12 UTC and 22:40 UTC for a subset of customers. After the first recovery, some workloads were still running on EKS and had started autoscaling again after Prometheus recovered. We incorrectly believed those workloads were no longer serving traffic.

When we reconciled our infrastructure configuration, those EKS workloads were scaled down before their ECS equivalents had been fully scaled up. This caused another period of underprovisioning for some notification-processing workers. We resolved it by completing the rollback and scaling the remaining affected ECS services.

### Customer Impact

The impact was not identical for every customer. For customers who use Buildkite notifications as an input to other CI or deployment systems, notification latency can delay those downstream workflows. Some customers also experienced secondary or longer-running effects based on the specific notification types, retry behaviour, or integrations involved. We are following up directly with affected customers where their impact differed from the general incident.

## Changes we're making

We have made the following immediate changes:

* Increased Prometheus storage capacity and reconciled that change in infrastructure-as-code.
* Added monitoring to alert before Prometheus storage exhaustion can affect autoscaling.
* Moved affected notification-processing workloads back to known-good ECS capacity.
* Fixed GitHub notification retry behaviour for a class of errors that could cause repeated retries and extend notification delays.

We are also making the following reliability improvements:

* Hardening the EKS-to-ECS rollback process so it verifies destination capacity, autoscaling configuration, and traffic movement before and during rollback.
* Reviewing other EKS control-plane dependencies, including KEDA and Karpenter, to ensure their CPU, memory, and storage allocations are appropriate for production load.
* Reassessing the order and pace of future EKS migrations so customer-critical workloads move more gradually and with clearer settling periods.
* Improving customer-level monitoring for notification delivery latency, so we can detect customer-impacting regressions earlier.
* Reviewing which notification types are on the scheduling or CI hot path for customers, and whether they need tighter latency expectations, separate queueing, or more specific alerting than general notification work.

## Areas we are improving: incident communication

During this incident, our public status page did not reflect customer-visible impact as quickly or clearly as it should have. In particular, notification delivery latency can affect customers differently depending on how notifications are used in their CI and deployment workflows.

We are improving how we communicate during notification latency incidents by:

* Updating the status page earlier when notification latency is likely to affect customer workflows
* Making status page updates clearer about the customer-visible impact, not just the affected internal service
* Improving internal escalation paths for customers who report critical CI impact before the incident is fully understood
* Using customer-level notification latency monitoring to help identify affected customers sooner
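The second impact window in the postmortem above came from scaling down EKS workloads before their ECS equivalents were ready. One way to picture the rollback hardening it describes (verifying destination capacity, autoscaling configuration, and traffic movement) is a pre-flight check like the sketch below. Everything here is hypothetical and for illustration only: the `Workload` type, its fields, and the worker counts are assumptions, not Buildkite's tooling or data.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    """Hypothetical snapshot of one service on one platform."""
    platform: str
    running_workers: int
    autoscaling_enabled: bool
    serving_traffic: bool


def scale_down_blockers(source: Workload, destination: Workload,
                        required_workers: int) -> list:
    """Return reasons the source must not be scaled down yet.

    An empty list means every check passed.
    """
    blockers = []
    if destination.running_workers < required_workers:
        blockers.append(
            f"{destination.platform} has {destination.running_workers} "
            f"workers, needs {required_workers}"
        )
    if not destination.autoscaling_enabled:
        blockers.append(f"{destination.platform} autoscaling is disabled")
    if source.serving_traffic:
        blockers.append(f"{source.platform} is still serving traffic")
    return blockers


# Illustrative numbers only: the source is still taking traffic and the
# destination is neither scaled up nor autoscaling.
eks = Workload("EKS", running_workers=40, autoscaling_enabled=True,
               serving_traffic=True)
ecs = Workload("ECS", running_workers=10, autoscaling_enabled=False,
               serving_traffic=False)
blockers = scale_down_blockers(eks, ecs, required_workers=40)

# Once the destination is ready and traffic has moved, nothing blocks.
ecs_ready = Workload("ECS", 40, True, True)
eks_drained = Workload("EKS", 40, True, False)
no_blockers = scale_down_blockers(eks_drained, ecs_ready, required_workers=40)
```

Returning a list of reasons rather than a single boolean means an operator can see every failed precondition at once while an incident is in progress.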
## Service Impact

A subset of customers experienced elevated latency in notification delivery.

## Incident Summary

While migrating a subset of our background processing services to Amazon EKS, we encountered an issue with delivery of internal metrics. The discovered issue did not impact performance or availability, but would have impaired our ability to detect such problems if they occurred. Out of an abundance of caution we decided to revert the migration, and moved those services back to the original infrastructure on AWS Fargate.

When migrating to EKS, we scale down and disable automatic scaling on Fargate. This allows us to quickly migrate back by scaling up Fargate. When we moved the workloads back to Fargate to restore internal metrics, we missed the step to re-enable autoscaling. As a result, the affected services did not have sufficient capacity and could not keep up with incoming work. We re-enabled autoscaling promptly once the problem was discovered, and provisioned extra capacity for customers where a backlog of work had accumulated.

Between 09:17 and 10:17 UTC, a small subset of our customers were impacted. Individual customers experienced a limited outage of notification services, which lasted between 35 and 58 minutes within this window, if there was any impact at all. The migration is performed in small batches, so not all customers experienced this incident.

## Changes we're making

* We are simplifying the runbook used to roll back migrations in the event of incidents.
* We are adding more verification steps to the migration process.
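The missed step in this incident was one half of a pair: migrating away both scales Fargate down and disables autoscaling, so rolling back has to undo both. A minimal sketch of that idea is to keep the two halves in one function and verify the result before the rollback counts as done. The `FargateService` class and function names are hypothetical, in-memory stand-ins, not the runbook or API Buildkite uses.

```python
from dataclasses import dataclass


@dataclass
class FargateService:
    """Hypothetical in-memory stand-in for one Fargate-hosted service."""
    name: str
    desired_count: int
    autoscaling_enabled: bool


def migrate_away(service: FargateService) -> None:
    # Mirrors the migration step described above: scale down and
    # disable automatic scaling on Fargate.
    service.desired_count = 0
    service.autoscaling_enabled = False


def verify_rollback(service: FargateService, baseline_count: int) -> None:
    """Raise if the service is not back in its pre-migration shape."""
    problems = []
    if service.desired_count < baseline_count:
        problems.append("capacity below baseline")
    if not service.autoscaling_enabled:
        problems.append("autoscaling not re-enabled")
    if problems:
        raise RuntimeError(f"{service.name}: " + ", ".join(problems))


def roll_back(service: FargateService, baseline_count: int) -> None:
    # Undo both halves of migrate_away together, then verify.
    service.desired_count = baseline_count
    service.autoscaling_enabled = True
    verify_rollback(service, baseline_count)


svc = FargateService("notifications", desired_count=12,
                     autoscaling_enabled=True)
migrate_away(svc)
roll_back(svc, baseline_count=12)

# Reproduce the incident's shape: capacity restored, autoscaling forgotten.
partial = FargateService("notifications", desired_count=12,
                         autoscaling_enabled=False)
try:
    verify_rollback(partial, baseline_count=12)
    caught = None
except RuntimeError as exc:
    caught = str(exc)
```

In the second case `verify_rollback` raises, which is the kind of verification step that would surface a forgotten autoscaling setting immediately rather than after a backlog builds.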
## Service Impact

A subset of our customers experienced elevated latency in our notification delivery, build dispatch and metrics services.

## Incident Summary

We are in the process of migrating our underlying compute platform from AWS Fargate to AWS EKS for our production workloads. We are migrating our services in small batches so we can verify stability as we go.

Between 15:42 and 17:33 our EKS Prometheus server began to need more memory than was available on the host where it was running. This was caused by autoscaling operations that increased the number of pods tracked by Prometheus, which in turn increased the Prometheus server's memory requirement. The host killed the Prometheus server process, which was restarted shortly after by the Kubernetes control plane. In the interim, the metrics used for application autoscaling were unavailable.

The unavailable metrics meant that the affected services were not being triggered to scale up, resulting in the observed delays. Prometheus exceeded the host's available memory again soon after restarting, which caused the cycle to repeat.

The on-call team followed prepared documentation to shift load on the affected services back to Fargate. The majority of customers saw complete recovery from 16:49. A handful of customers had developed such a large backlog during the period of higher latency that they had to be manually scaled up further. All customers saw full recovery by 17:33.

## Changes we're making

We have already made the following changes to our rollout of EKS for production workloads:

* Upsized the underlying system nodes.
* Set higher requests and limits for the Prometheus server so it can handle more product load.
* Reviewed and set any missing requests and limits for all new EKS resources, ensuring that EKS has all the required information to prevent accidental resource contention.
* Added more observability and monitors for EKS pod and node health to help us identify root causes quickly during future incidents.

We have since migrated all these services back to EKS and observed successful scaling well beyond the limits we encountered during this incident.
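The failure mode in this summary is that missing metrics looked the same as "no need to scale up". A common defensive pattern, shown here only as an illustration and not as anything the postmortem says Buildkite adopted, is to treat an unavailable metric as its own case and fall back to a conservative replica floor. The function, its parameters, and the numbers are hypothetical.

```python
import math
from typing import Optional


def desired_replicas(current: int,
                     queue_depth: Optional[int],
                     jobs_per_worker: int,
                     fallback_replicas: int,
                     max_replicas: int) -> int:
    """Pick a replica count from queue depth.

    queue_depth is None when the metrics backend is unavailable.
    """
    if jobs_per_worker <= 0:
        raise ValueError("jobs_per_worker must be positive")
    if queue_depth is None:
        # Metric unavailable: never scale down blind, and hold at
        # least the configured fallback floor.
        target = max(current, fallback_replicas)
    else:
        target = max(1, math.ceil(queue_depth / jobs_per_worker))
    return min(target, max_replicas)


blind = desired_replicas(4, None, 10, fallback_replicas=8, max_replicas=50)
normal = desired_replicas(4, 95, 10, fallback_replicas=8, max_replicas=50)
capped = desired_replicas(4, 10_000, 10, fallback_replicas=8, max_replicas=50)
```

A fallback floor only limits the damage while metrics are missing. It does not replace fixing the metrics path itself, which is what the changes listed above address.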
Processing of the backlog is complete.
Additional capacity was added to our Redis caches. This triggered a failover between 15:10 and 15:14 UTC, which caused a spike of errors on the REST and GraphQL APIs. Customers would also have seen some errors in the Buildkite UI during this period. We have been monitoring the situation since then and things have returned to baseline.
The fix was successful and the backlog has now been cleared.
Scheduled and completed maintenance windows are separated from incidents.
The scheduled maintenance has been completed.
Maintenance completed 02:20 UTC. We have confirmed builds to be in their correct states.
The scheduled maintenance is now complete. Resuming normal service.
The scheduled maintenance has been completed.
This maintenance has completed with minimal impact and service has been restored to normal.
Maintenance has been completed.
The scheduled maintenance has been completed.
The scheduled maintenance has been completed.
Maintenance has been successfully completed. Please contact support@buildkite.com if you experience any issues. Thank you for your patience during this maintenance period.
The scheduled maintenance has been completed successfully and Buildkite is back to normal.
Uptimus tracks the official Buildkite status page, normalizes upstream events, and separates incidents from scheduled maintenance.
Official source
https://www.buildkitestatus.com
Adapter
STATUSPAGE IO
Alert streams
Incidents, component changes, and maintenance windows.
Public SEO page
Indexable status history for users searching outage information.
Regional reports can be layered on top of official provider status when user signals are available.
Showing 1 to 25 of 38 tracked components.
| Component | Status | Type | Last changed |
|---|---|---|---|
| REST API | Operational | Group | Not recorded |
| Notifications | Operational | Group | Not recorded |
| Hosted Agents | Operational | Group | Not recorded |
| Package Registries | Operational | Group | Not recorded |
| Test Engine | Operational | Group | Not recorded |
| SCM Providers - Third party SCM providers which may affect your builds | Operational | Group | 6/15/2026 |
| Third Party Services - Third party services we depend upon | Operational | Group | Not recorded |
| AWS ec2-us-east-1 | Operational | Component | Not recorded |
| GitHub | Operational | Component | Not recorded |
| GitHub Commit Status Notifications | Operational | Component | Not recorded |
| Hosted Agents - Buildkite's hosted compute in the Pipelines product https://buildkite.com/docs/pipelines/hosted-agents/overview | Operational | Component | Not recorded |
| REST API | Operational | Component | Not recorded |
| Web - Web interface for Test Analytics | Operational | Component | Not recorded |
| Web | Operational | Component | Not recorded |
| Web | Operational | Component | Not recorded |
| AWS elasticache-us-east-1 | Operational | Component | Not recorded |
| Agent API | Operational | Component | Not recorded |
| Email Notifications | Operational | Component | Not recorded |
| GitHub API Requests | Operational | Component | Not recorded |
| Ingestion - Ingestion queue processing for Test Analytics | Operational | Component | Not recorded |
| MacOS | Operational | Component | Not recorded |
| Package Managers - API Endpoints for clients like docker, npm, gem etc | Operational | Component | Not recorded |
| Remote MCP Server | Operational | Component | Not recorded |
| AWS elb-us-east-1 | Operational | Component | Not recorded |
| GitHub Webhooks | Operational | Component | 6/15/2026 |
Follow outages, degraded components, and maintenance updates in your Uptimus workspace with email, push, and webhook alerts.
Official provider components
Incident and maintenance separation
Workspace alerts and webhooks
Buildkite is currently marked as Operational in Uptimus based on the latest official status page check.
Supported status page providers are checked continuously by our scraper scheduler. The public page is cached briefly for SEO and performance.
No. Uptimus stores incidents and maintenance windows separately when the upstream provider exposes enough detail.
Yes. Create an Uptimus workspace, follow this provider, and choose email, push, or webhook notifications.