Degraded Console Service - Devices Page slowness
This incident has been resolved.
Unified platform for identity, access, and device management.
Source: auto
Category: Cloud
Adapter: STATUSPAGE IO
Verified: Pending review
Current state: Operational (checked 33m ago)
Components: 123
Active incidents: 0
Maintenance: 0
90d uptime: 14.29%
Degraded Console Service - Devices Page slowness
May 8, 6:41 AM
Normalized official status-page data for incidents, maintenance, components, and history.
Known uptime: 14.29% (7 known history days)
Components tracked: 123 (0 outage, 0 degraded)
Incidents indexed: 50 (0 active right now)
Maintenance windows: 51 (0 active or scheduled)
Components with the most recent status-page events.
Admin Console: Operational
Admin Console - EU Region: Operational
Agent: Operational
Agent - EU Region: Operational
Multi-Tenant Portal (MTP): Operational
Component changes, incidents, and maintenance windows grouped by day.
Operational days: 1
Degraded days: 0
Outage days: 4
Maintenance days: 2
Unknown days: 83
Latest outages and degradations detected from the official status page.
**Date**: Apr 7, 2026

**Date of Incident:** Mar 30, 2026

**Description**: RCA for Directory Association Processing Delays

**Summary:** Starting March 30th at approximately 15:40 MDT, JumpCloud customers experienced significant delays in directory-related updates. This included latency in password changes, user-to-group associations, and outbound provisioning reflecting in downstream systems. The root cause was identified as a specific code deployment in our Devices service that inadvertently flooded a background processing queue with unpartitioned messages, causing a bottleneck that prevented updates from processing in real-time. The issue was fully resolved by 00:25 MDT on March 31, 2026.

**What Happened:** The incident was caused by a change in how the JumpCloud agent retrieves software application configurations.

1. **Traffic Spike:** The new code shifted the "source of truth" for these configurations to a new database. If a device polled the system and did not find its record in the new database, the code automatically enqueued a "track collect" request to sync the data.
2. **Unexpected Volume:** We anticipated a "lazy backfill" (where records are created over time), but underestimated the number of devices that had no existing software bindings. This resulted in an immediate, massive spike of nearly 280,000 messages.
3. **The Bottleneck (Partitioning):** Crucially, these specific messages were enqueued without a "Partition ID." In our high-scale FIFO (First-In-First-Out) queue architecture, messages without a partition ID are processed one-by-one rather than in parallel. This effectively "serialized" the queue, preventing us from scaling up workers to process the backlog faster and causing the observed latency.

**Resolution and Recovery**: Once the offending code was rolled back, the "tap" was turned off, and no further unpartitioned messages were added to the queue. Because the bottleneck was caused by the lack of partitioning, simply scaling horizontally could not speed up the processing of the existing backlog. The team monitored the queue throughput and determined that the safest and fastest path to recovery was allowing the worker to process the existing messages sequentially rather than risking further disruption by attempting to manually manipulate the production queue.

**Corrective Actions**: To ensure this type of bottleneck does not occur again, we have committed to the following:

* Improving pre-production testing to better simulate the scale and conditions that can occur in production queue processing
* Reviewing other areas of the platform where similar patterns could produce unexpected request spikes
* Enhancing monitoring and alerting thresholds to enable faster detection and response when queue backlogs begin to form
* Strengthening our deployment validation process to more thoroughly account for background data migrations before releasing dependent code changes
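The partitioning bottleneck described in this RCA can be illustrated with a small simulation. This is a simplified sketch, not JumpCloud's queue implementation: it assumes that a FIFO queue allows at most one in-flight message per partition and that messages without a partition ID all share a single implicit partition. The function name `drain_rounds` and the message counts are hypothetical.

```python
from collections import Counter
from typing import Hashable, Iterable, Optional


def drain_rounds(partition_ids: Iterable[Optional[Hashable]], workers: int) -> int:
    """Count processing rounds needed to drain a partitioned FIFO backlog.

    Model assumption: in each round, every worker may take one message,
    but no two workers may take messages from the same partition.
    """
    # Messages with no partition ID (None) all land in one shared group.
    backlog = Counter(partition_ids)
    rounds = 0
    while backlog:
        # Serve the partitions with the deepest backlog first.
        for partition, _ in backlog.most_common(workers):
            backlog[partition] -= 1
            if backlog[partition] == 0:
                del backlog[partition]
        rounds += 1
    return rounds


if __name__ == "__main__":
    unpartitioned = [None] * 1000
    partitioned = [i % 50 for i in range(1000)]
    # Adding workers does not help when every message shares one partition.
    print(drain_rounds(unpartitioned, workers=1), drain_rounds(unpartitioned, workers=50))
    # With 50 partitions, 50 workers drain the same backlog far sooner.
    print(drain_rounds(partitioned, workers=50))
```

Under this model, the unpartitioned backlog takes the same number of rounds regardless of worker count, which matches the RCA's observation that horizontal scaling could not speed up recovery.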
**Date**: Mar 17, 2026

**Date of Incident:** Mar 12, 2026

**Description**: RCA for Agent Backend (HAProxy) System Degradation

**Summary:** On March 12, 2026, from 10:05 AM to 2:45 PM MDT, JumpCloud experienced a significant service degradation affecting Agent-related activities. During this window, agent updates, including syncing users, passwords, policies and other agent data, as well as new agent installations were unavailable. This was caused by a "thundering herd" event triggered by a backend traffic-shaping change. We have since identified the root causes and implemented infrastructure changes to prevent a recurrence.

**What Happened?** At 10:00 AM MDT, our engineering team enabled a feature flag (a "circuit breaker") designed to protect our System Insights API from high load by returning `503 Service Unavailable` responses for certain non-critical requests. While the flag performed its intended function, it had an unforeseen secondary effect on the JumpCloud Agent’s connection logic. Because the agents could not reuse existing connections for these specific failed requests, hundreds of thousands of agents in our main production environment attempted to establish new mTLS (mutual TLS) connections simultaneously. This created a "Thundering Herd" event that saturated our HAProxy ingress layer, exhausting CPU resources and causing a cascade of connection failures.

**Root Cause:** The prolonged nature of this incident was the result of three distinct, overlapping bottlenecks that our team had to isolate and resolve one by one:

1. **CPU-Intensive SSL Handshaking:** Establishing an mTLS connection is a CPU-intensive process. The sheer volume of simultaneous connection attempts pushed our HAProxy pods to their resource limits. This caused the pods to become unresponsive, leading to "Out of Memory" (OOM) kills and failed health probes.
2. **Health Check Death Spiral:** Our internal health checks initially relied on a Layer 7 SSL validation. Because the CPU was 100% occupied with agent reconnections, the pods couldn't respond to their own health checks in time. This caused the system to erroneously mark healthy pods as "down", removing them from the rotation and further overwhelming the remaining pods.
3. **Load Balancer Handshake Saturation:** As we attempted to scale our infrastructure, the Application Load Balancer (ALB) encountered a throughput bottleneck specifically related to the rate of new connection establishments. The surge of agents attempting to negotiate new SSL handshakes at the same time exceeded the ALB's burst capacity, temporarily preventing even healthy backend pods from receiving and processing traffic.

**Why It Took Time to Resolve:** While reverting the flag was the correct first step, the agents were already in an aggressive retry loop that continued even after the 503 errors stopped. We had to experiment with several configurations (adjusting health check intervals and timeout windows) to find a balance that allowed pods to stay "alive" long enough to process the backlog. Stability was achieved only once we implemented Concurrency Control. By lowering the maximum allowed concurrent connections per pod, we stopped the CPU from over-committing to handshakes, allowing the system to reliably process a controlled flow of traffic until the global queue cleared.

**Corrective Actions / Risk Mitigation:**

**1. Edge Infrastructure Hardening** We have standardized on a new high-availability configuration for our HAProxy ingress layer.

* **Concurrency Governance**: We have implemented a strict maxconn limit per pod. This acts as a "pressure valve," ensuring that the CPU remains available to process existing requests rather than becoming saturated by new connection attempts.
* **Dynamic Capacity Management via Autoscaling**: We are implementing Horizontal Pod Autoscaling (HPA) for our HAProxy ingress layer, calibrated to trigger based on both CPU utilization and active connection counts. This ensures we can absorb sudden traffic fluctuations and also maintain a controlled flow of requests to our backend services.

**2. Agent Connectivity Optimization** We are updating the JumpCloud Agent’s communication layer to be more "network-aware" during degraded states:

* **Enhanced Connection Pooling**: We are reconfiguring the agent's HTTP transport logic to maximize the reuse of existing idle connections. This significantly reduces the "Connection Tax" on our backend during high-traffic events.
* **Streamlined Resource Handling**: We are implementing stricter protocols for draining and closing HTTP response bodies, ensuring that pooled connections are returned to the rotation immediately and reliably.

**3. Adaptive Retry Logic (Jitter)** To further break up "synchronized" traffic spikes:

* **Introduction of Jitter**: While our agents currently use exponential backoff for poll requests, we are adding randomized "jitter" to our retry intervals. This spreads reconnection attempts across a wider window, preventing large blocks of agents from hitting the service at the exact same millisecond.
* **Standardizing Resilient Retry Logic:** We are transitioning the Agent’s default HTTP client to a unified **exponential backoff** model for all request types.
* **Controlled Rollout**: This update will be managed via a staged rollout to monitor for any unforeseen side effects on fleet-wide connectivity patterns.
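The jittered retry logic described in this RCA can be sketched as follows. This is a minimal illustration of one common variant (capped exponential backoff with uniformly random "full" jitter), not the JumpCloud Agent's actual implementation; the function name `backoff_with_jitter` and the `base_s` and `cap_s` defaults are assumptions chosen for the example.

```python
import random
from typing import Optional


def backoff_with_jitter(
    attempt: int,
    base_s: float = 1.0,
    cap_s: float = 300.0,
    rng: Optional[random.Random] = None,
) -> float:
    """Return a retry delay in seconds for a zero-based attempt number."""
    rng = rng or random.Random()
    # Exponential growth, capped so long outages do not produce unbounded waits.
    ceiling = min(cap_s, base_s * (2 ** attempt))
    # Full jitter: draw uniformly from [0, ceiling] so that clients which
    # failed at the same moment do not all retry at the same moment.
    return rng.uniform(0.0, ceiling)


if __name__ == "__main__":
    rng = random.Random(42)
    for attempt in range(6):
        print(attempt, round(backoff_with_jitter(attempt, rng=rng), 2))
```

Without the random draw, every client that failed at the same time would compute the same delay and reconnect in lockstep; the uniform draw spreads those reconnections across the whole backoff window.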
**Date**: Nov 21, 2025

**Date of Incident:** Nov 19, 2025

**Description**: RCA for Admin Portal Login Errors

**Summary:** On November 19, 2025, starting at approximately 04:30 UTC, between 1% and 5% of requests experienced intermittent failures to successfully authenticate to the Admin Console, lasting until roughly 06:30 UTC. Users attempting to authenticate received an “unexpected” error message during this window, but subsequent retries may have been successful.

**Root Cause:** This issue was triggered during a standard infrastructure update and traffic shift intended to move services to a new, updated cluster. The core issue was a combination of an infrastructure configuration mismatch and gaps in our detection and validation processes.

1. Configuration Drift: The new infrastructure cluster (Green Cluster), intended to host the service, was missing a single but essential configuration value used by the control plane’s service mesh. This value had been recently applied to the existing cluster (old cluster) but was inadvertently excluded when the new cluster's baseline configuration was created and branched. When production traffic began routing to the new cluster, the missing configuration caused some access components to fail, leading to the login errors.
2. Detection Gaps: The application logged the configuration failure as a _Warning_ message, rather than a critical error. This meant our automated monitoring system did not trigger an immediate alert or rollback when the issue first occurred.

The team quickly isolated the issue to the new Green Cluster, and an emergency process was initiated to immediately revert all production traffic back to the stable old cluster.

**Corrective Actions / Risk Mitigation:**

1. Automated Configuration Diff Check - Implementing an automated process to continuously compare and ensure 100% configuration parity between old and new production clusters during all transition phases.
2. Clear Rule Enforcement - Reinforcing and automating the process to ensure all configuration changes are applied consistently across all active and future clusters.
3. Multi-Layer Error Monitoring - Implementing error rate monitoring at every layer of the network and application stack to ensure no failure goes undetected.
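The idea behind an automated configuration diff check can be sketched in a few lines. This is an illustrative example only, assuming each cluster's configuration can be flattened into a key/value mapping; the function name `config_diff` and the keys shown are hypothetical and are not taken from the incident.

```python
from typing import Any, Dict, List, Mapping


def config_diff(old: Mapping[str, Any], new: Mapping[str, Any]) -> Dict[str, List[str]]:
    """Report keys that are missing, extra, or changed in `new` relative to `old`."""
    old_keys, new_keys = set(old), set(new)
    return {
        # Present on the old cluster but absent from the new one.
        "missing": sorted(old_keys - new_keys),
        # Present only on the new cluster.
        "extra": sorted(new_keys - old_keys),
        # Present on both, but with different values.
        "changed": sorted(k for k in old_keys & new_keys if old[k] != new[k]),
    }


if __name__ == "__main__":
    # Hypothetical configuration values for illustration.
    old_cluster = {"mesh.timeout_s": 30, "mesh.trust_domain": "example", "replicas": 3}
    new_cluster = {"mesh.timeout_s": 30, "replicas": 4}
    report = config_diff(old_cluster, new_cluster)
    print(report)
    # A traffic shift could be blocked whenever any key is missing.
    print("safe to shift traffic:", not report["missing"])
```

A check of this kind would flag a single missing value before traffic moves, rather than relying on application logs to surface it afterward.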
**Date**: Nov 13, 2025

**Date of Incident:** Nov 6, 2025

**Description**: RCA for SSO/OIDC Service Degradation

**Summary:** On November 6, 2025, starting at approximately 12:00 UTC, customers experienced failures to launch any application relying on JumpCloud's OIDC-based Single Sign-On (SSO), lasting for roughly one hour.

**Root Cause:** The outage was caused by a combination of two errors during a scheduled compliance procedure:

1. Faulty Password Generation: Our automated system for rotating database passwords created a new credential that contained unsafe special characters.
2. Missing Special Character Logic: The entrypoint script for our core SSO service was missing logic to handle these special characters before using the password to construct a database connection string.

When the SSO service attempted to restart and use the newly rotated password, the presence of the unsafe characters caused the connection string to be misinterpreted as invalid, leading to a parsing failure and service degradation.

This issue stemmed from a latent configuration bug that was masked by prior rotation processes. Previously, database passwords were rotated manually using an older system (`random_password` IaC resource) which was explicitly configured to only generate alphanumeric characters. These characters are inherently safe in a URL context, so the underlying bug in the SSO service's connection logic was never exposed. When the credential management was successfully migrated to the new, more robust rotation process, the new function began generating highly complex passwords, including special characters, for the first time. This immediately triggered the latent parsing flaw in the SSO service’s entrypoint script.

**Corrective Actions / Risk Mitigation:**

1. Hardening code logic - All services that construct database connection strings will be audited and updated to explicitly encode the password component, eliminating character misinterpretation.
2. Enhanced rotation alerting - New monitoring and alerting dashboards are in place to track the health and success schedule of all automated credential rotation jobs, providing an immediate alert if a rotation creates an invalid credential.
3. Update password generation logic - The automated credential rotation function has been updated to explicitly generate passwords that are safe, avoiding complex, reserved characters.
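The failure mode and the "encode the password component" fix can be demonstrated with Python's standard `urllib.parse` module. This is a sketch under assumptions: the RCA does not name the database or URL scheme, so the `db://` scheme, the host `db.internal`, the helper name `build_dsn`, and the sample password are all hypothetical.

```python
from urllib.parse import quote, unquote, urlsplit


def build_dsn(user: str, password: str, host: str, port: int, database: str) -> str:
    """Build a URL-style connection string with percent-encoded credentials."""
    # safe="" forces reserved characters such as '@', '/', ':' and '#'
    # to be percent-encoded so they cannot be mistaken for URL delimiters.
    return f"db://{quote(user, safe='')}:{quote(password, safe='')}@{host}:{port}/{database}"


if __name__ == "__main__":
    password = "p@ss/w:rd#1"  # hypothetical rotated password with reserved characters

    # Naive interpolation: the parser splits on the characters inside the password.
    naive = urlsplit(f"db://svc:{password}@db.internal:5432/app")
    print("naive host:", naive.hostname)

    # Encoded credentials: the host is parsed as intended and the password round-trips.
    encoded = urlsplit(build_dsn("svc", password, "db.internal", 5432, "app"))
    print("encoded host:", encoded.hostname)
    print("password recovered:", unquote(encoded.password) == password)
```

An alphanumeric-only password never exercises this path, which is consistent with the RCA's explanation of why the bug stayed hidden until the rotation process changed.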
**Date**: Nov 7, 2025

**Date of Incident:** Nov 4, 2025

**Description**: RCA for Auth Database Degradation

**Summary:** On November 4, 2025, a number of customers experienced intermittent failures, timeouts and increased latency when attempting to authenticate to multiple JumpCloud Services, including consoles, LDAP, RADIUS and SAML, or use Multi-Factor Authentication.

**Root Cause:** The incident was triggered by an issue in the deployment process involving a database schema change and a subsequent application code release. During this deployment, a planned database change unintentionally removed several database indexes required by the existing application code. The sequence of failure was as follows:

1. Deployment Order Error: The database schema change (which removed necessary indexes) was applied to the production database before the new application code (which did not require those indexes) was deployed.
2. Performance Collapse: The existing, high-volume authentication code (used for functions like TOTP and push authentication) was forced to run against the now-inefficient database structure. Queries that normally took milliseconds suddenly took several seconds.
3. Connection Exhaustion: These slow queries held database connections open for extended periods, quickly overwhelming the database server's available connection pool.
4. Full Outage: With no available connections, the main authentication API could not communicate with the database, leading to 100% CPU utilization on the database server and triggering the intermittent timeouts and failures experienced by our customers.

**Why Testing Did Not Catch This:** The issue was not identified during testing in our Development or Staging environments due to insufficient Load Simulation. The resource consumption issues and connection exhaustion only manifest under the extreme pressure of peak production traffic volume. The simulated load profiles in our lower environments were not sufficient to expose this specific failure mode.

**Corrective Actions / Risk Mitigation:**

1. Mandatory schema change review - All database schema changes must now undergo an additional level of review to explicitly assess index dependencies and impact.
2. New deployment phasing - We are implementing new tools and checks to enforce that application code dependent on a schema change is deployed before a database change is executed.
3. Enhance alerting - We are implementing new monitors and alerts specifically for the Auth-API's database connection pool health and CPU utilization.
4. Enhanced load testing - We are revisiting the load profiles used in our staging environments looking for opportunities to more accurately simulate peak production traffic.
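The step from slow queries to connection exhaustion follows from Little's law: the average number of busy connections is roughly the query arrival rate multiplied by the time each query holds a connection. The sketch below works through that arithmetic; the query rate, latencies, and pool size are hypothetical values chosen for illustration and do not come from the incident.

```python
def busy_connections(queries_per_second: int, latency_ms: int) -> float:
    """Estimate average concurrently held connections (Little's law: L = rate * time)."""
    return queries_per_second * latency_ms / 1000


if __name__ == "__main__":
    pool_size = 500   # hypothetical connection pool limit
    rate = 2000       # hypothetical authentication queries per second

    indexed = busy_connections(rate, latency_ms=5)       # millisecond-scale queries
    unindexed = busy_connections(rate, latency_ms=3000)  # multi-second queries

    print(f"with indexes:    {indexed:.0f} connections, pool exhausted: {indexed > pool_size}")
    print(f"without indexes: {unindexed:.0f} connections, pool exhausted: {unindexed > pool_size}")
```

The same arithmetic suggests why lower environments missed the problem: at a small test query rate, even multi-second queries may stay under the pool limit.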
Scheduled and completed maintenance windows are separated from incidents.
The scheduled maintenance has been completed.
Uptimus tracks the official JumpCloud status page, normalizes upstream events, and separates incidents from scheduled maintenance.
Official source: http://status.jumpcloud.com
Adapter: STATUSPAGE IO
Alert streams: Incidents, component changes, and maintenance windows.
Public SEO page: Indexable status history for users searching outage information.
Regional reports can be layered on top of official provider status when user signals are available.
Showing 1 to 25 of 123 tracked components.
| Component | Status | Type | Last changed |
|---|---|---|---|
| Active Directory | Operational | Group | Not recorded |
| Admin Console | Operational | Group | 6/21/2026 |
| Agent | Operational | Group | 6/21/2026 |
| AI Gateway | Operational | Group | Not recorded |
| Android EMM | Operational | Group | Not recorded |
| Apple MDM | Operational | Group | Not recorded |
| Cloudflare | Operational | Group | Not recorded |
| Commands | Operational | Group | Not recorded |
| Custom API Import | Operational | Group | Not recorded |
| Directory Insights | Operational | Group | Not recorded |
| Federation | Operational | Group | Not recorded |
| G Suite Integration | Operational | Group | Not recorded |
| General Access API | Operational | Group | Not recorded |
| Groups (user/devices) | Operational | Group | Not recorded |
| JumpCloud GO | Operational | Group | Not recorded |
| LDAP | Operational | Group | Not recorded |
| Mobile Admin App: iOS and Android | Operational | Group | Not recorded |
| Multi-Tenant Portal (MTP) | Operational | Group | 6/21/2026 |
| Office 365 Integration | Operational | Group | Not recorded |
| Privileged Access Management (PAM) | Operational | Group | Not recorded |
| Password Manager | Operational | Group | Not recorded |
| Password Policies | Operational | Group | Not recorded |
| Payment and Billing | Operational | Group | Not recorded |
| Policy Management | Operational | Group | Not recorded |
| RADIUS | Operational | Group | Not recorded |
Follow outages, degraded components, and maintenance updates in your Uptimus workspace with email, push, and webhook alerts.
Official provider components
Incident and maintenance separation
Workspace alerts and webhooks
JumpCloud is currently marked as Operational in Uptimus based on the latest official status page check.
Supported status page providers are checked continuously by our scraper scheduler. The public page is cached briefly for SEO and performance.
No. Uptimus stores incidents and maintenance windows separately when the upstream provider exposes enough detail.
Yes. Create an Uptimus workspace, follow this provider, and choose email, push, or webhook notifications.