Is Buildkite down right now?

Uptimus currently reports Buildkite as operational.

Can I get Buildkite outage alerts?

Yes. Uptimus can alert you when official status page incidents, maintenance windows, or component changes are detected.

Buildkite Status. Check Current Outages and Incidents


Recent incidents

Official incidents

Latest outages and degradations detected from the official status page.

Increased latency and error rates

Resolved

We have seen a full recovery of services.

Jun 18, 8:05 AM | Resolved Jun 18, 9:43 AM | Major | REST API

Increased latency on REST and GraphQL APIs

Resolved

The mitigation applied before the last update had the intended effect, and we have seen recovery in REST API latency.

Jun 11, 11:18 PM | Resolved Jun 12, 12:16 AM | Major | Remote MCP Server, REST API

Increased latency and error rates for Agent API

Resolved

Between 00:05 and 00:34 UTC, a subset of customers experienced increased latency and timeout errors on the Agent API, which affected job assignment. At peak impact, the error rate reached 1.3% of requests and job acceptance latency reached up to 53s.

Jun 11, 12:32 AM | Resolved Jun 11, 12:51 AM | Minor | Agent API

Email deliveries are delayed

Resolved

We received reports that email deliveries were not working, affecting signup and invite emails as well as build notification emails. This issue has now been resolved.

May 30, 12:30 AM | Resolved May 30, 12:30 AM | None

Delayed notifications

Postmortem

## Service Impact

Customers experienced delayed Buildkite notification delivery. The customer impact varied depending on how those notifications are used. For some customers, delayed notifications also delayed downstream CI, merge, or deployment workflows.

## Incident Summary

On 28 May, Buildkite experienced elevated notification delivery latency after part of our notification-processing infrastructure became underprovisioned. This happened because the Prometheus service used by our EKS autoscaling path ran out of storage, which meant some EKS-based workers could not autoscale correctly while queues were growing.

We mitigated the incident by moving affected workloads back to our previous ECS-based infrastructure and manually increasing worker capacity. Recovery took longer than expected because the rollback path did not fully handle this scenario.

### Impact window 1

At 20:01 UTC, notification-processing workers became underprovisioned and notification delivery latency increased. We detected the issue through internal queue latency monitoring and began shifting affected workloads from EKS back to ECS.

This rollback took longer than expected because the ECS services we were rolling back to were not ready to immediately take the full load. Engineers had to manually adjust scaling configuration and worker counts while the incident was active. Notification latency recovered for most customers by 21:00 UTC.

### Impact window 2

A second, shorter impact window occurred between 22:12 UTC and 22:40 UTC for a subset of customers. After the first recovery, some workloads were still running on EKS and had started autoscaling again after Prometheus recovered. We incorrectly believed those workloads were no longer serving traffic.

When we reconciled our infrastructure configuration, those EKS workloads were scaled down before their ECS equivalents had been fully scaled up. This caused another period of underprovisioning for some notification-processing workers. We resolved it by completing the rollback and scaling the remaining affected ECS services.

### Customer Impact

The impact was not identical for every customer. For customers who use Buildkite notifications as an input to other CI or deployment systems, notification latency can delay those downstream workflows. Some customers also experienced secondary or longer-running effects based on the specific notification types, retry behaviour, or integrations involved. We are following up directly with affected customers where their impact differed from the general incident.

## Changes we're making

We have made the following immediate changes:

* Increased Prometheus storage capacity and reconciled that change in infrastructure-as-code.
* Added monitoring to alert before Prometheus storage exhaustion can affect autoscaling.
* Moved affected notification-processing workloads back to known-good ECS capacity.
* Fixed GitHub notification retry behaviour for a class of errors that could cause repeated retries and extend notification delays.

We are also making the following reliability improvements:

* Hardening the EKS-to-ECS rollback process so it verifies destination capacity, autoscaling configuration, and traffic movement before and during rollback.
* Reviewing other EKS control-plane dependencies, including KEDA and Karpenter, to ensure their CPU, memory, and storage allocations are appropriate for production load.
* Reassessing the order and pace of future EKS migrations so customer-critical workloads move more gradually and with clearer settling periods.
* Improving customer-level monitoring for notification delivery latency, so we can detect customer-impacting regressions earlier.
* Reviewing which notification types are on the scheduling or CI hot path for customers, and whether they need tighter latency expectations, separate queueing, or more specific alerting than general notification work.

## Areas we are improving: incident communication

During this incident, our public status page did not reflect customer-visible impact as quickly or clearly as it should have. In particular, notification delivery latency can affect customers differently depending on how notifications are used in their CI and deployment workflows.

We are improving how we communicate during notification latency incidents by:

* Updating the status page earlier when notification latency is likely to affect customer workflows
* Making status page updates clearer about the customer-visible impact, not just the affected internal service
* Improving internal escalation paths for customers who report critical CI impact before the incident is fully understood
* Using customer-level notification latency monitoring to help identify affected customers sooner
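For readers who want to picture the "alert before storage exhaustion" change listed in this postmortem, here is a minimal Python sketch. It is an illustration only, not Buildkite's monitoring: the `DiskSample` shape, the function names, and the 24-hour warning threshold are all assumptions. It fits a straight line to recent usage readings and projects how long until the volume is full.

```python
from dataclasses import dataclass


@dataclass
class DiskSample:
    """One usage reading for a volume (hypothetical shape)."""
    timestamp_s: float  # seconds since an arbitrary epoch
    used_bytes: float


def hours_until_full(samples, capacity_bytes):
    """Project hours until the volume fills, using a least-squares line.

    Returns None when there is too little data or usage is not growing.
    """
    if len(samples) < 2:
        return None
    n = len(samples)
    mean_t = sum(s.timestamp_s for s in samples) / n
    mean_u = sum(s.used_bytes for s in samples) / n
    var_t = sum((s.timestamp_s - mean_t) ** 2 for s in samples)
    if var_t == 0:
        return None
    cov = sum((s.timestamp_s - mean_t) * (s.used_bytes - mean_u) for s in samples)
    slope = cov / var_t  # bytes per second
    if slope <= 0:
        return None
    latest = max(samples, key=lambda s: s.timestamp_s)
    remaining = max(capacity_bytes - latest.used_bytes, 0.0)
    return remaining / slope / 3600.0


def should_alert(samples, capacity_bytes, warn_hours=24.0):
    """Alert when projected exhaustion falls within the warning window."""
    eta = hours_until_full(samples, capacity_bytes)
    return eta is not None and eta <= warn_hours
```

Projecting time-to-full rather than alerting on a fixed percentage gives earlier warning when growth accelerates, which is the situation described above where queue growth and autoscaling activity coincide.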

May 28, 8:20 PM | Resolved May 28, 9:18 PM | Major | Slack Notifications, Email Notifications

Increased latency and error rates

Postmortem

## Service Impact

A subset of customers experienced elevated latency in notification delivery.

## Incident Summary

While migrating a subset of our background processing services to Amazon EKS, we encountered an issue with delivery of internal metrics. The discovered issue did not impact performance or availability, but would have impaired our ability to detect such problems if they occurred. Out of an abundance of caution we decided to revert the migration, and moved those services back to the original infrastructure on AWS Fargate.

When migrating to EKS, we scale down and disable automatic scaling on Fargate. This allows us to quickly migrate back by scaling up Fargate. When we moved the workloads back to Fargate to restore internal metrics, we missed the step to re-enable autoscaling. As a result, the affected services did not have sufficient capacity and could not keep up with incoming work.

We re-enabled autoscaling promptly once the problem was discovered, and provisioned extra capacity for customers where a backlog of work had accumulated.

Between 09:17 and 10:17 UTC, a small subset of our customers were impacted. Individual customers experienced a limited outage of notification services, which lasted between 35 and 58 minutes within this window, if there was any impact at all. The migration is performed in small batches, so not all customers experienced this incident.

## Changes we're making

* We are simplifying the runbook used to roll back migrations in the event of incidents.
* We are adding more verification steps to the migration process.
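The missed "re-enable autoscaling" step is the kind of thing an automated rollback check can catch. The Python sketch below is a hypothetical illustration, not Buildkite's runbook or tooling: `ServiceState`, `rollback_blockers`, and the worker counts are invented for the example. It lists reasons the rollback destination is not yet ready to take load.

```python
from dataclasses import dataclass


@dataclass
class ServiceState:
    """Observed state of one service on the rollback destination (assumed shape)."""
    name: str
    running_workers: int
    autoscaling_enabled: bool


def rollback_blockers(destination, required_workers):
    """Return human-readable reasons the destination cannot take full load yet.

    `required_workers` maps a service name to the minimum worker count needed
    before traffic is moved. An empty result means the checks passed.
    """
    problems = []
    for svc in destination:
        needed = required_workers.get(svc.name, 0)
        if not svc.autoscaling_enabled:
            problems.append(f"{svc.name}: autoscaling is disabled")
        if svc.running_workers < needed:
            problems.append(
                f"{svc.name}: {svc.running_workers} workers running, {needed} required"
            )
    return problems


if __name__ == "__main__":
    state = [
        ServiceState("notifications", running_workers=2, autoscaling_enabled=False),
        ServiceState("dispatch", running_workers=10, autoscaling_enabled=True),
    ]
    for line in rollback_blockers(state, {"notifications": 8, "dispatch": 10}):
        print(line)
```

Running a check like this both before and after traffic is moved turns a manual runbook step into a gate that fails loudly.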

May 26, 9:56 AM | Resolved May 26, 10:38 AM | None | Slack Notifications, Email Notifications

Delayed notifications

Postmortem

## Service Impact

A subset of our customers experienced elevated latency in our notification delivery, build dispatch and metrics services.

## Incident Summary

We are in the process of migrating our underlying compute platform from AWS Fargate to AWS EKS for our production workloads. We are migrating our services in small batches so we can verify stability as we go.

Between 15:42 and 17:33 our EKS Prometheus server began to need more memory than was available on the host where it was running. This was caused by autoscaling operations that increased the number of pods tracked by Prometheus, which in turn increased the Prometheus server's memory requirement. The host killed the Prometheus server process, which was restarted shortly after by the Kubernetes control plane. In the interim, the metrics used for application autoscaling were unavailable.

The unavailable metrics meant that the affected services were not being triggered to scale up, resulting in the observed delays. Prometheus exceeded the host's available memory again soon after restarting, which caused the cycle to repeat.

The on-call team followed prepared documentation to shift load on the affected services back to Fargate. The majority of customers saw complete recovery from 16:49. A handful of customers had developed such a large backlog during the period of higher latency that they had to be manually scaled up further. All customers saw full recovery by 17:33.

## Changes we're making

We have already made the following changes to our rollout of EKS for production workloads:

* Upsized the underlying system nodes.
* Set higher requests and limits for the Prometheus server so it can handle more product load.
* Reviewed and set any missing requests and limits for all new EKS resources, ensuring that EKS has all the required information to prevent accidental resource contention.
* Added more observability and monitors for EKS pod and node health to help us identify root causes quickly during future incidents.

We have since migrated all these services back to EKS and observed successful scaling well beyond the limits we encountered during this incident.
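The failure mode in this postmortem is that missing metrics were effectively read as "no need to scale". A generic way to reason about that is a scaling decision that treats absent or stale metrics as a reason to hold or raise capacity. The Python sketch below illustrates that idea under stated assumptions only: `QueueMetric`, `desired_workers`, and every threshold are hypothetical, and nothing here describes how Buildkite or any particular autoscaler is configured.

```python
import math
from dataclasses import dataclass
from typing import Optional


@dataclass
class QueueMetric:
    """Latest queue reading from the metrics backend (assumed shape)."""
    depth: int     # jobs waiting in the queue
    age_s: float   # seconds since this reading was collected


def desired_workers(
    current: int,
    metric: Optional[QueueMetric],
    jobs_per_worker: int = 100,
    max_metric_age_s: float = 120.0,
    fallback_floor: int = 10,
    max_workers: int = 200,
) -> int:
    """Pick a worker count, failing safe when metrics are missing or stale."""
    if metric is None or metric.age_s > max_metric_age_s:
        # No trustworthy data: never scale down, and hold at least the floor.
        return min(max(current, fallback_floor), max_workers)
    target = math.ceil(metric.depth / jobs_per_worker)
    return min(max(target, 1), max_workers)
```

With these defaults, `desired_workers(4, None)` holds capacity at the floor of 10 instead of staying at 4, while a fresh reading of 2,500 queued jobs asks for 25 workers.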

May 20, 4:40 PM | Resolved May 20, 5:39 PM | Major | Slack Notifications, Email Notifications

Delayed Test Engine ingestion processing

Resolved

Processing of the backlog is complete.

May 15, 6:51 AM | Resolved May 15, 7:35 AM | Minor | Ingestion

Error rates increasing

Resolved

Additional capacity was added to our Redis caches. This triggered a failover between 15:10 and 15:14 UTC, which caused a spike of errors on the REST and GraphQL APIs. Customers would also have seen some errors in the Buildkite UI during this period. We have been monitoring the situation since then and error rates have returned to baseline.

May 13, 3:14 PM | Resolved May 13, 3:34 PM | Minor | Web, Remote MCP Server

Delayed Test Engine ingestion processing

Resolved

The fix was successful and the backlog has now been cleared.

May 12, 12:59 PM | Resolved May 12, 4:09 PM | Minor | Ingestion

Recent maintenance

Maintenance windows

Scheduled and completed maintenance windows are separated from incidents.

Database maintenance for subset of customers

Completed

The scheduled maintenance has been completed.

Jan 11, 6:00 AM | Resolved Jan 11, 10:00 AM | Maintenance | Agent API, Job Queue

Database maintenance

Completed

Maintenance completed at 02:20 UTC. We have confirmed builds to be in their correct states.

Nov 16, 2:00 AM | Resolved Nov 16, 3:14 AM | Maintenance | Web, Linux (ARM64)

Maintenance to in-memory caches

Completed

The scheduled maintenance is now complete. Resuming normal service.

Mar 30, 11:00 PM | Resolved Mar 31, 1:14 AM | Maintenance | Web, Slack Notifications

Scheduled upgrade

Completed

The scheduled maintenance has been completed.

May 27, 6:00 AM | Resolved May 27, 6:45 AM | Maintenance | Agent API, Web

Scheduled upgrade

Completed

This maintenance has completed with minimal impact and service has been restored to normal.

May 20, 5:00 AM | Resolved May 20, 5:25 AM | Maintenance | Agent API, Web

Database maintenance

Completed

Maintenance has been completed.

Feb 4, 3:00 AM | Resolved Feb 4, 3:24 AM | Maintenance | Slack Notifications, SCM Integrations

Database maintenance

Completed

The scheduled maintenance has been completed.

Jul 30, 1:00 AM | Resolved Jul 30, 1:55 AM | Maintenance | Web, Slack Notifications

Database maintenance

Completed

The scheduled maintenance has been completed.

Apr 30, 1:00 AM | Resolved Apr 30, 2:00 AM | Maintenance | Web, Agent API

Database maintenance

Completed

Maintenance has been successfully completed. Please contact support@buildkite.com if you experience any issues. Thank you for your patience during this maintenance period.

Mar 19, 1:00 AM | Resolved Mar 19, 3:03 AM | Maintenance | Web, Slack Notifications

Database maintenance

Completed

The scheduled maintenance has been completed successfully and Buildkite is back to normal.

May 29, 12:00 AM | Resolved May 29, 1:00 AM | Maintenance | Slack Notifications, Agent API

Buildkite components

Tracked components

Showing 1 to 25 of 38 tracked components.

Component	Status	Type	Last changed
REST API	Operational	Group	Not recorded
Notifications	Operational	Group	Not recorded
Hosted Agents	Operational	Group	Not recorded
Package Registries	Operational	Group	Not recorded
Test Engine	Operational	Group	Not recorded
SCM Providers Third party SCM providers which may affect your builds	Operational	Group	6/15/2026
Third Party Services Third party services we depend upon	Operational	Group	Not recorded
AWS ec2-us-east-1	Operational	Component	Not recorded
GitHub	Operational	Component	Not recorded
GitHub Commit Status Notifications	Operational	Component	Not recorded
Hosted Agents Buildkite's hosted compute in the Pipelines product https://buildkite.com/docs/pipelines/hosted-agents/overview	Operational	Component	Not recorded
REST API	Operational	Component	Not recorded
Web Web interface for Test Analytics	Operational	Component	Not recorded
Web	Operational	Component	Not recorded
Web	Operational	Component	Not recorded
AWS elasticache-us-east-1	Operational	Component	Not recorded
Agent API	Operational	Component	Not recorded
Email Notifications	Operational	Component	Not recorded
GitHub API Requests	Operational	Component	Not recorded
Ingestion Ingestion queue processing for Test Analytics	Operational	Component	Not recorded
MacOS	Operational	Component	Not recorded
Package Managers - API Endpoints for clients like docker, npm, gem etc	Operational	Component	Not recorded
Remote MCP Server	Operational	Component	Not recorded
AWS elb-us-east-1	Operational	Component	Not recorded
GitHub Webhooks	Operational	Component	6/15/2026


