Uptimus

Public status intelligence for services, apps, games, and official status page providers.


© 2026 Uptimus. All rights reserved.

Operational

Buildkite Status

Scale-Out Delivery Platform for software delivery and testing at scale.

Source: auto
Category: Development
Adapter: STATUSPAGE IO
Verified: Pending review


Current state: Operational (checked 22m ago)
Components: 38
Active incidents: 0
Maintenance: 0
90d uptime: 4.35%

Increased latency and error rates

Jun 18, 8:05 AM

Service health overview

Operational snapshot

Normalized official status-page data for incidents, maintenance, components, and history.

Known uptime: 4.35% (23 known history days)
Components tracked: 38 (0 outage, 0 degraded)
Incidents indexed: 51 (0 active right now)
Maintenance windows: 39 (0 active or scheduled)

Top affected components

Components with the most recent status-page events.

| Component | Status | Recent events |
| --- | --- | --- |
| GitHub Webhooks | Operational | 3 |
| SCM Providers | Operational | 3 |
| AWS ec2-us-east-1 | Operational | 1 |
| AWS elasticache-us-east-1 | Operational | 1 |
| AWS elb-us-east-1 | Operational | 1 |

90-day status history

Daily rollup

Component changes, incidents, and maintenance windows grouped by day.

Operational days: 1
Degraded days: 1
Outage days: 21
Maintenance days: 0
Unknown days: 67
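The page does not state how the 4.35% uptime figure is derived. One reading that matches the numbers shown is operational days divided by known (non-unknown) days, which the following Python sketch works through. The function name and the dictionary layout are assumptions made for illustration, not Uptimus's actual implementation.

```python
def known_uptime_percent(day_counts: dict[str, int]) -> float:
    """Share of known days that were fully operational, as a percentage.

    Assumption: "known" means every day whose status is not "unknown".
    """
    known_days = sum(n for status, n in day_counts.items() if status != "unknown")
    if known_days == 0:
        raise ValueError("no known history days")
    return 100 * day_counts.get("operational", 0) / known_days


# Daily rollup counts as listed on this page (they sum to 90 days).
rollup = {"operational": 1, "degraded": 1, "outage": 21, "maintenance": 0, "unknown": 67}

print(f"{known_uptime_percent(rollup):.2f}%")  # 4.35%
```

Under this reading the 23 known history days are 1 operational + 1 degraded + 21 outage + 0 maintenance, and 1 / 23 rounds to 4.35%. The low figure therefore reflects how days are classified in the rollup rather than measured availability.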

Recent incidents

Official incidents

Latest outages and degradations detected from the official status page.

Increased latency and error rates
Resolved

We have seen a full recovery of services.

Started Jun 18, 8:05 AM; resolved Jun 18, 9:43 AM; impact: Major; components: REST API

Increased latency on REST and GraphQL APIs
Resolved

The mitigation applied before the last update had the intended effect, and we have seen recovery in REST API latency.

Started Jun 11, 11:18 PM; resolved Jun 12, 12:16 AM; impact: Major; components: Remote MCP Server, REST API

Increased latency and error rates for Agent API
Resolved

Between 00:05 and 00:34 UTC, a subset of customers experienced increased latency and timeout errors on the Agent API. This impacted job assignment. At peak impact, we saw an error rate of 1.3% of requests and job acceptance latency of up to 53s.

Started Jun 11, 12:32 AM; resolved Jun 11, 12:51 AM; impact: Minor; components: Agent API

Email deliveries are delayed
Resolved

We have received reports that email deliveries have not been working, affecting signup and invite emails as well as build notification emails. This issue has now been resolved.

Started May 30, 12:30 AM; resolved May 30, 12:30 AM; impact: None

Delayed notifications
Postmortem

## Service Impact

Customers experienced delayed Buildkite notification delivery. The customer impact varied depending on how those notifications are used. For some customers, delayed notifications also delayed downstream CI, merge, or deployment workflows.

## Incident Summary

On 28 May, Buildkite experienced elevated notification delivery latency after part of our notification-processing infrastructure became underprovisioned. This happened because the Prometheus service used by our EKS autoscaling path ran out of storage, which meant some EKS-based workers could not autoscale correctly while queues were growing.

We mitigated the incident by moving affected workloads back to our previous ECS-based infrastructure and manually increasing worker capacity. Recovery took longer than expected because the rollback path did not fully handle this scenario.

### Impact window 1

At 20:01 UTC, notification-processing workers became underprovisioned and notification delivery latency increased. We detected the issue through internal queue latency monitoring and began shifting affected workloads from EKS back to ECS.

This rollback took longer than expected because the ECS services we were rolling back to were not ready to immediately take the full load. Engineers had to manually adjust scaling configuration and worker counts while the incident was active. Notification latency recovered for most customers by 21:00 UTC.

### Impact window 2

A second, shorter impact window occurred between 22:12 UTC and 22:40 UTC for a subset of customers.

After the first recovery, some workloads were still running on EKS and had started autoscaling again after Prometheus recovered. We incorrectly believed those workloads were no longer serving traffic. When we reconciled our infrastructure configuration, those EKS workloads were scaled down before their ECS equivalents had been fully scaled up. This caused another period of underprovisioning for some notification-processing workers.

We resolved it by completing the rollback and scaling the remaining affected ECS services.

### Customer Impact

The impact was not identical for every customer. For customers who use Buildkite notifications as an input to other CI or deployment systems, notification latency can delay those downstream workflows. Some customers also experienced secondary or longer-running effects based on the specific notification types, retry behaviour, or integrations involved. We are following up directly with affected customers where their impact differed from the general incident.

## Changes we're making

We have made the following immediate changes:

* Increased Prometheus storage capacity and reconciled that change in infrastructure-as-code.
* Added monitoring to alert before Prometheus storage exhaustion can affect autoscaling.
* Moved affected notification-processing workloads back to known-good ECS capacity.
* Fixed GitHub notification retry behaviour for a class of errors that could cause repeated retries and extend notification delays.

We are also making the following reliability improvements:

* Hardening the EKS-to-ECS rollback process so it verifies destination capacity, autoscaling configuration, and traffic movement before and during rollback.
* Reviewing other EKS control-plane dependencies, including KEDA and Karpenter, to ensure their CPU, memory, and storage allocations are appropriate for production load.
* Reassessing the order and pace of future EKS migrations so customer-critical workloads move more gradually and with clearer settling periods.
* Improving customer-level monitoring for notification delivery latency, so we can detect customer-impacting regressions earlier.
* Reviewing which notification types are on the scheduling or CI hot path for customers, and whether they need tighter latency expectations, separate queueing, or more specific alerting than general notification work.

## Areas we are improving: incident communication

During this incident, our public status page did not reflect customer-visible impact as quickly or clearly as it should have. In particular, notification delivery latency can affect customers differently depending on how notifications are used in their CI and deployment workflows.

We are improving how we communicate during notification latency incidents by:

* Updating the status page earlier when notification latency is likely to affect customer workflows
* Making status page updates clearer about the customer-visible impact, not just the affected internal service
* Improving internal escalation paths for customers who report critical CI impact before the incident is fully understood
* Using customer-level notification latency monitoring to help identify affected customers sooner

Started May 28, 8:20 PM; resolved May 28, 9:18 PM; impact: Major; components: Slack Notifications, Email Notifications

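The rollback hardening described in the postmortem above (verifying destination capacity, autoscaling configuration, and traffic movement before scaling the source down) can be pictured as a pre-flight check. The Python sketch below is only an illustration under stated assumptions: `ServiceState`, its fields, and the service names are hypothetical and do not describe Buildkite's tooling.

```python
from dataclasses import dataclass


@dataclass
class ServiceState:
    """Hypothetical snapshot of one deployment of a worker service."""
    name: str
    running_workers: int
    autoscaling_enabled: bool
    traffic_share: float  # fraction of traffic currently served, 0.0 to 1.0


def rollback_blockers(source: ServiceState, destination: ServiceState,
                      required_workers: int) -> list[str]:
    """Return reasons the source must not be scaled down yet (empty means safe)."""
    blockers = []
    if destination.running_workers < required_workers:
        blockers.append(
            f"{destination.name} has {destination.running_workers} "
            f"of {required_workers} required workers"
        )
    if not destination.autoscaling_enabled:
        blockers.append(f"autoscaling is disabled on {destination.name}")
    if source.traffic_share > 0:
        blockers.append(f"{source.name} still serves {source.traffic_share:.0%} of traffic")
    return blockers


# Illustrative values only: the destination is underscaled and the source is still live.
eks = ServiceState("eks-notifications", running_workers=12, autoscaling_enabled=True, traffic_share=0.4)
ecs = ServiceState("ecs-notifications", running_workers=3, autoscaling_enabled=False, traffic_share=0.6)

for reason in rollback_blockers(source=eks, destination=ecs, required_workers=10):
    print("blocked:", reason)
```

Returning a list of reasons rather than a single boolean lets an operator see every unmet condition at once, which matters when a step is being run manually during an active incident.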
Increased latency and error rates
Postmortem

## Service Impact

A subset of customers experienced elevated latency in notification delivery.

## Incident Summary

While migrating a subset of our background processing services to Amazon EKS, we encountered an issue with delivery of internal metrics. The discovered issue did not impact performance or availability, but would have impaired our ability to detect such problems if they occurred. Out of an abundance of caution we decided to revert the migration, and moved those services back to the original infrastructure on AWS Fargate.

When migrating to EKS, we scale down and disable automatic scaling on Fargate. This allows us to quickly migrate back by scaling up Fargate. When we moved the workloads back to Fargate to restore internal metrics, we missed the step to re-enable autoscaling. As a result, the affected services did not have sufficient capacity and could not keep up with incoming work.

We re-enabled autoscaling promptly once the problem was discovered, and provisioned extra capacity for customers where a backlog of work had accumulated.

Between 09:17 and 10:17 UTC, a small subset of our customers were impacted. Individual customers experienced a limited outage of notification services, which lasted between 35 and 58 minutes within this window, if there was any impact at all. The migration is performed in small batches, so not all customers experienced this incident.

## Changes we're making

* We are simplifying the runbook used to roll back migrations in the event of incidents.
* We are adding more verification steps to the migration process.

Started May 26, 9:56 AM; resolved May 26, 10:38 AM; impact: None; components: Slack Notifications, Email Notifications

Delayed notifications
Postmortem

## Service Impact

A subset of our customers experienced elevated latency in our notification delivery, build dispatch and metrics services.

## Incident Summary

We are in the process of migrating our underlying compute platform from AWS Fargate to AWS EKS for our production workloads. We are migrating our services in small batches so we can verify stability as we go.

Between 15:42 and 17:33 our EKS Prometheus server began to need more memory than was available on the host where it was running. This was caused by autoscaling operations that increased the number of pods tracked by Prometheus, which in turn increased the Prometheus server's memory requirement. The host killed the Prometheus server process, which was restarted shortly after by the Kubernetes control plane. In the interim, the metrics used for application autoscaling were unavailable.

The unavailable metrics meant that the affected services were not being triggered to scale up, resulting in the observed delays. Prometheus exceeded the host's available memory again soon after restarting, which caused the cycle to repeat.

The on-call team followed prepared documentation to shift load on the affected services back to Fargate. The majority of customers saw complete recovery from 16:49. A handful of customers had developed such a large backlog during the period of higher latency that they had to be manually scaled up further. All customers saw full recovery by 17:33.

## Changes we're making

We have already made the following changes to our rollout of EKS for production workloads:

* Upsized the underlying system nodes.
* Set higher requests and limits for the Prometheus server so it can handle more product load.
* Reviewed and set any missing requests and limits for all new EKS resources, ensuring that EKS has all the required information to prevent accidental resource contention.
* Added more observability and monitors for EKS pod and node health to help us identify root causes quickly during future incidents.

We have since migrated all these services back to EKS and observed successful scaling well beyond the limits we encountered during this incident.

Started May 20, 4:40 PM; resolved May 20, 5:39 PM; impact: Major; components: Slack Notifications, Email Notifications

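The postmortem above mentions reviewing and setting missing requests and limits on EKS resources. As a rough sketch of what such a review could automate, the Python function below walks a pod spec expressed as a plain dictionary and reports containers that lack a CPU or memory request or limit. The `containers[].resources.requests` and `resources.limits` layout follows the Kubernetes pod spec; the function name, container names, and values are hypothetical, and whether CPU limits should be mandatory is a policy choice this sketch simply assumes.

```python
def missing_resource_settings(pod_spec: dict) -> list[str]:
    """List "container: kind.resource" entries that are not set in a pod spec."""
    problems = []
    for container in pod_spec.get("containers", []):
        resources = container.get("resources", {})
        for kind in ("requests", "limits"):
            for resource in ("cpu", "memory"):
                if resource not in resources.get(kind, {}):
                    problems.append(f"{container['name']}: {kind}.{resource}")
    return problems


# Hypothetical spec: one container is partly configured, the other not at all.
spec = {
    "containers": [
        {
            "name": "prometheus-server",
            "resources": {
                "requests": {"cpu": "500m", "memory": "2Gi"},
                "limits": {"memory": "4Gi"},
            },
        },
        {"name": "sidecar"},
    ]
}

for problem in missing_resource_settings(spec):
    print("missing", problem)
```

A check like this could run in CI against rendered manifests so that an unset memory limit is caught before a workload reaches production.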
Delayed Test Engine ingestion processing
Resolved

Processing of the backlog is complete.

Started May 15, 6:51 AM; resolved May 15, 7:35 AM; impact: Minor; components: Ingestion

Error rates increasing
Resolved

Additional capacity was added to our Redis caches. This triggered a failover between 15:10 and 15:14 UTC, and there was a spike of errors on the REST and GraphQL APIs. Customers would have seen some errors in the Buildkite UI during this period as well. We have been monitoring the situation since then and things have returned to baseline.

Started May 13, 3:14 PM; resolved May 13, 3:34 PM; impact: Minor; components: Web, Remote MCP Server

Delayed Test Engine ingestion processing
Resolved

The fix was successful and the backlog has now been cleared.

Started May 12, 12:59 PM; resolved May 12, 4:09 PM; impact: Minor; components: Ingestion

Recent maintenance

Maintenance windows

Scheduled and completed maintenance windows are separated from incidents.

Database maintenance for subset of customers
Completed

The scheduled maintenance has been completed.

Started Jan 11, 6:00 AM; resolved Jan 11, 10:00 AM; impact: Maintenance; components: Agent API, Job Queue

Database maintenance
Completed

Maintenance completed 02:20 UTC. We have confirmed builds to be in their correct states.

Started Nov 16, 2:00 AM; resolved Nov 16, 3:14 AM; impact: Maintenance; components: Web, Linux (ARM64)

Maintenance to in-memory caches
Completed

The scheduled maintenance is now complete. Resuming normal service.

Started Mar 30, 11:00 PM; resolved Mar 31, 1:14 AM; impact: Maintenance; components: Web, Slack Notifications

Scheduled upgrade
Completed

The scheduled maintenance has been completed.

Started May 27, 6:00 AM; resolved May 27, 6:45 AM; impact: Maintenance; components: Agent API, Web

Scheduled upgrade
Completed

This maintenance has completed with minimal impact and service has been restored to normal.

Started May 20, 5:00 AM; resolved May 20, 5:25 AM; impact: Maintenance; components: Agent API, Web

Database maintenance
Completed

Maintenance has been completed.

Started Feb 4, 3:00 AM; resolved Feb 4, 3:24 AM; impact: Maintenance; components: Slack Notifications, SCM Integrations

Database maintenance
Completed

The scheduled maintenance has been completed.

Started Jul 30, 1:00 AM; resolved Jul 30, 1:55 AM; impact: Maintenance; components: Web, Slack Notifications

Database maintenance
Completed

The scheduled maintenance has been completed.

Started Apr 30, 1:00 AM; resolved Apr 30, 2:00 AM; impact: Maintenance; components: Web, Agent API

Database maintenance
Completed

Maintenance has been successfully completed. Please contact support@buildkite.com if you experience any issues. Thank you for your patience during this maintenance period.

Started Mar 19, 1:00 AM; resolved Mar 19, 3:03 AM; impact: Maintenance; components: Web, Slack Notifications

Database maintenance
Completed

The scheduled maintenance has been completed successfully and Buildkite is back to normal.

Started May 29, 12:00 AM; resolved May 29, 1:00 AM; impact: Maintenance; components: Slack Notifications, Agent API

About the Buildkite status page integration

Uptimus tracks the official Buildkite status page, normalizes upstream events, and separates incidents from scheduled maintenance.

Official source: https://www.buildkitestatus.com
Adapter: STATUSPAGE IO
Alert streams: Incidents, component changes, and maintenance windows.
Public SEO page: Indexable status history for users searching outage information.
Outage map preview: Regional reports can be layered on top of official provider status when user signals are available.

Buildkite components

Tracked components

Showing 1 to 25 of 38 tracked components.

| Component | Status | Type | Last changed |
| --- | --- | --- | --- |
| REST API | Operational | Group | Not recorded |
| Notifications | Operational | Group | Not recorded |
| Hosted Agents | Operational | Group | Not recorded |
| Package Registries | Operational | Group | Not recorded |
| Test Engine | Operational | Group | Not recorded |
| SCM Providers (Third party SCM providers which may affect your builds) | Operational | Group | 6/15/2026 |
| Third Party Services (Third party services we depend upon) | Operational | Group | Not recorded |
| AWS ec2-us-east-1 | Operational | Component | Not recorded |
| GitHub | Operational | Component | Not recorded |
| GitHub Commit Status Notifications | Operational | Component | Not recorded |
| Hosted Agents (Buildkite's hosted compute in the Pipelines product https://buildkite.com/docs/pipelines/hosted-agents/overview) | Operational | Component | Not recorded |
| REST API | Operational | Component | Not recorded |
| Web (Web interface for Test Analytics) | Operational | Component | Not recorded |
| Web | Operational | Component | Not recorded |
| Web | Operational | Component | Not recorded |
| AWS elasticache-us-east-1 | Operational | Component | Not recorded |
| Agent API | Operational | Component | Not recorded |
| Email Notifications | Operational | Component | Not recorded |
| GitHub API Requests | Operational | Component | Not recorded |
| Ingestion (Ingestion queue processing for Test Analytics) | Operational | Component | Not recorded |
| MacOS | Operational | Component | Not recorded |
| Package Managers - API (Endpoints for clients like docker, npm, gem etc) | Operational | Component | Not recorded |
| Remote MCP Server | Operational | Component | Not recorded |
| AWS elb-us-east-1 | Operational | Component | Not recorded |
| GitHub Webhooks | Operational | Component | 6/15/2026 |

Frequently asked questions

Is Buildkite down right now?

Buildkite is currently marked as Operational in Uptimus based on the latest official status page check.

How often does Uptimus check Buildkite?

Supported status page providers are checked continuously by our scraper scheduler. The public page is cached briefly for SEO and performance.

Are maintenance windows counted as incidents?

No. Uptimus stores incidents and maintenance windows separately when the upstream provider exposes enough detail.

Can I get alerts for Buildkite?

Yes. Create an Uptimus workspace, follow this provider, and choose email, push, or webhook notifications.