`from_url` uploads degradation
Although Cloudflare's issues persist and the files they distribute take longer to download via `from_url` than others, our systems remain stable. We are closing the incident but will continue to monitor the situation.
Home
/
Status
/
Service
File uploading, processing, and delivery for web and mobile apps.
Source
auto
Category
Cloud
Adapter
STATUSPAGE IO
Verified
Pending review
Current state
Operational
Checked 6m ago
9
Components
0
Active incidents
0
Maintenance
100%
90d uptime
`from_url` uploads degradation
Dec 19, 3:23 PM
Normalized official status-page data for incidents, maintenance, components, and history.
100%
Known uptime
1 known history days
9
Components tracked
0 outage, 0 degraded
45
Incidents indexed
0 active right now
18
Maintenance windows
0 active or scheduled
Components with the most recent status-page events.
CDN
Operational
Dashboard
Operational
Document processing
Operational
Processing engines
Operational
REST API
Operational
Component changes, incidents, and maintenance windows grouped by day.
operational
degraded
outage
maintenance
unknown
1
operational days
0
degraded days
0
outage days
0
maintenance days
89
unknown days
Latest outages and degradations detected from the official status page.
Although Cloudflare's issues persist and the files they distribute take longer to download via `from_url` than others, our systems remain stable. We are closing the incident but will continue to monitor the situation.
# **Analysis of the December 15, 2025, Upload API Disruption** ## 1. Summary On December 15, 2025, the Uploadcare platform experienced a significant service disruption affecting the Upload API and REST API subsystems. The incident, spanning approximately ten hours, resulted in elevated error rates for file uploads, increased latency across API endpoints, and interruptions of Dashboard operations. ## 2. Root cause The root cause was a resource contention event on Amazon Elastic File System \(EFS\), aggravated by an observability side-effect. A monitoring agent, executing a metadata-heavy recursive file scan, monopolized the I/O capacity of the shared NFS mount points. This I/O saturation created a cascading failure due to a legacy architectural pattern where storage operations occur within active database transactions. Background Processing Workers, taking long time to read or write to the stalled filesystem, held open database transactions for a long time. This saturated the global connection pool \(PgBouncer\), causing a denial of service for Upload API HTTP Workers and other services attempting to run database queries, despite the underlying PostgreSQL engine remaining healthy. ## 3. Architectural context and system design The Uploadcare platform is designed for scale, but this incident exposed specific coupling risks between our storage, database, and monitoring layers. ### 3.1 Ingestion pipeline & worker roles The core of the affected system is the Upload API. It utilizes two distinct worker types: 1. Upload API HTTP workers. These synchronous workers handle incoming HTTP requests \(Direct uploads, Multipart, and `from_url` requests\). They ingest data and schedule tasks. 2. Background processing workers. These asynchronous workers pick up tasks from a queue to perform complex operations \(virus scanning, image validation, property extraction, etc\). Both worker types utilize a shared temporary staging area backed by Amazon EFS to maintain a POSIX-compliant shared state across the distributed fleet. ### 3.2 Observability stack To monitor the staging area's health, a Telegraf sidecar agent was configured to track file backlog depth. This was implemented via a standard Linux `find` command sequence. On a network filesystem \(NFS\), a recursive `find` forces a traversal of the directory tree over the network. As file counts grow, the cost of this operation increases linearly, generating a storm of `READDIRPLUS` and `GETATTR` RPC calls. ## 4. Timeline _All times are listed in Coordinated Universal Time \(UTC\)._ * **17:17:** System begins receiving a massive surge of `from_url` upload requests. This workload is write-heavy. * **17:17 – 17:23:** Upload API HTTP workers successfully ingest files and write them to EFS. However, Background processing workers \(responsible for processing and deleting files\) begin to slow down due to the increasing I/O load. The rate of file creation exceeds the rate of processing/deletion. The file count on the volume jumps from ~40 \(healthy\) to ~7,000. * **17:25:** The "Noisy Neighbor" effect begins. The Telegraf `find` command, attempting to scan 7,000\+ files, fails to complete within its 30-second timeout. We lose visibility of metrics related to the file staging area. * **17:30:** Multiple instances of the `find` command stack up. The resulting metadata storm heavily affects the NFS client's ability to process requests. Background processing workers become very ineffective. * **19:06:** PostgreSQL begins terminating client connections due to `idle-in-transaction` timeouts. This confirms workers are stuck holding connections. * **20:28 – 20:51:** Engineering attempts to horizontally scale the PgBouncers to absorb the held connections. This inadvertently pushes the upper layer PgBouncers toward their OS file descriptor limits. * **21:07:** Upper layer PgBouncer instances hit the Linux file descriptor limit \(`ulimit`\). New connections to this layer begin failing. * **21:28:** File descriptor limits are increased. Background processing workers begin to recover slowly as the pool stabilizes. * **23:30:** Background processing workers successfully process the backlog. The background task queue becomes empty. * **23:41:** Despite the background queue being empty, Upload API HTTP workers continue to exhibit "overload" behavior. * **01:35 \(Dec 16\):** Engineering reconfigures Upload API HTTP workers to connect directly to the upper PgBouncer layer. * **02:48:** EFS metrics reappear. * **02:58:** Engineers manually removed approximately 4,000 "orphaned" files that remained on the disk \(files where the background job failed during the crash and could not self-heal\). File manipulations on EFS become normal. * **02:59:** Database connection pressure vanished. Latency returned to nominal levels. Incident resolved. ## 5. Response Evaluation ### What went well * Proactive alerting. Our monitoring systems successfully triggered alerts for elevated wait times before customer reports of degradation began to surface, giving the team an early head start on triage. * Core database resilience. Despite the severe saturation of the connection poolers \(PgBouncer\), the underlying PostgreSQL database engine demonstrated remarkable stability. Resource utilization \(CPU/Memory\) on the database nodes remained nominal, preventing a total database meltdown. * URL API isolation. The URL API \(file processing and delivery\) remained fully operational throughout the incident. Its architectural isolation from the ingestion pipeline ensured that end-user access to previously uploaded files was never affected. ### What went wrong * Observability became the bottleneck. The monitoring agent \(`telegraf`\), designed to provide visibility, became the primary cause of the outage. The use of an active, heavy command like `find` on a network filesystem was a design flaw that turned a monitoring check into a Denial of Service attack on the NFS client. * Misleading initial signals. The initial flood of `idle-in-transaction` connections led the response team to focus on the database layer. It delayed the investigation into the storage layer. ## 5. Action Items * Eliminate "Noisy Neighbor" observability. Replace active, resource-intensive monitoring checks \(specifically the recursive `find` command\) with passive metrics. Transition to using AWS CloudWatch metrics or other side-channel indicators that do not consume client-side I/O credits or burden the EFS mount. * Enforce connection pool isolation. Reconfigure PgBouncer to implement strict resource bulkheading. Dedicate separate connection pools for the Upload API HTTP workers, REST API HTTP workers, and background workers. This ensures that a resource stall in the background worker fleet cannot starve the customer-facing APIs of database connections. * Decouple storage I/O from database state. Refactor the application logic to execute blocking file storage operations \(reading/writing to EFS\) _outside_ of atomic database transactions. This ensures that storage latency results only in slower worker throughput, rather than holding database connections open \(`idle-in-transaction`\) and exhausting the global pool. * Upgrade incident response tooling. Equip production containers and nodes with essential diagnostic capabilities and establish secure access protocols. This ensures engineers can diagnose kernel-level and network-level bottlenecks during active incidents. ## 6. Conclusion The December 15 disruption was a complex failure mode involving storage physics, monitoring side-effects, and database concurrency. By identifying the contention on the EFS mount \(caused by the monitoring agent\) as the root cause, rather than a hard AWS limit, we have a clear path to resolution. The removal of the invasive monitoring script and the decoupling of transactions will robustly insulate the Uploadcare platform from similar incidents in the future. We apologize for the disruption and appreciate the patience of our customers as we hardened our systems.
On October 20, 2025, between 07:50 UTC and 09:27 UTC, a portion of our URL API service experienced a major disruption. The service responsible for real-time processing and transformation of uncached files was unavailable, resulting in failed requests for those specific operations. The delivery of already-cached files and all other Uploadcare services, including file uploading and management, were unaffected. The direct root cause of this incident was a major service outage in the Amazon Web Services \(AWS\) `us-east-1` \(N. Virginia\) region, specifically impacting AWS DynamoDB. During the incident, we also identified a critical flaw in our incident communication process: we were unable to access our Atlassian-hosted status page to provide timely updates to our customers. It was affected by AWS services disruption, too. Our key follow-up actions will focus on improving the architectural resilience of our services to reduce single points of failure and establishing a more robust, independent system for incident communications. ## Timeline of events All times are in UTC on October 20, 2025. * **06:49:** Our internal monitoring systems begin to alert on a significant increase in error rates and latency for the URL API service, specifically for requests involving uncached files. Our engineering team begins an immediate investigation. * **07:03**: Our engineering team localizes issue with DynamoDB. * **07:37:** The team attempts to update our public status page \(`status.uploadcare.com`\) to inform customers but is unable to log in to the third-party provider \(Atlassian Statuspage\). * **07:46:** We get to conclusion that the issue is related to DNS of DynamoDB. * **07:51:** AWS posts its first notification acknowledging increased error rates and latencies for multiple services in the `us-east-1` region. Our team correlates our internal alerts with this broader AWS issue. * **08:26:** AWS confirms that the issue is centered on significant error rates for DynamoDB in `us-east-1`, confirming our initial diagnosis of the root cause affecting our service. * **09:01:** AWS reports they have identified a potential root cause related to DNS resolution for the DynamoDB API endpoint and are working on mitigations. * **09:27:** Our internal monitoring shows that error rates for the URL API have returned to normal levels. The service is fully recovered and operational. We declare the incident resolved internally. ## What went well * Our internal monitoring systems detected the service failure immediately, allowing for a rapid start to our investigation. * Our engineering team was able to quickly correlate the internal issue with the external AWS service outage, preventing wasted time on internal diagnostics. ## What went wrong * **Architectural single point of failure:** Our URL API for uncached files had a hard dependency on a single service \(DynamoDB\) within a single AWS region \(`us-east-1`\). The failure of this service led directly to the failure of ours, with no mechanism for graceful degradation or failover. * **Failure of incident communication:** Our designated channel for customer communication, the status page hosted at `status.uploadcare.com`, was inaccessible to our team during the outage. This failure prevented us from providing timely and transparent updates to our users, which is a critical part of our incident response protocol. ## Action items * **Establish a redundant status communication channel:** We will set up a secondary, fully independent channel for incident communication. This will ensure that if our primary status page provider is ever unavailable, we can still communicate effectively with our customers. * **Architect for resilience and decentralization:** Our engineering team will conduct a full architectural review of the URL API service. The primary goal is to re-architect the service to remove DynamoDB in `us-east-1` as a single point of failure. This may involve implementing multi-region failover capabilities or introducing a more resilient data caching layer. * **Audit critical services for single points of failure:** We will expand our review to other critical services across the Uploadcare platform to identify and mitigate other potential single points of failure related to external, regional dependencies. We sincerely apologize for the disruption this incident caused our customers and for our failure to communicate the issue in a timely manner. We are committed to learning from this event and implementing these changes to build a more resilient and reliable platform.
## Incident Summary On 26 November 2024, an issue with our URL API was identified, causing a partial outage of the service from 14:31 to 15:00 UTC. ## Timeline * **14:20 UTC** – A CDN configuration change was made to improve the Auto Format feature in the URL API. * **14:31 UTC** – Service degradation began as increased traffic overwhelmed our origin servers due to cache key changes. * **14:37 UTC** – Alert notifications were received by our team. * **14:58 UTC** – The CDN configuration was rolled back. * **15:00 UTC** – The incident was fully resolved, and service was restored. ## Root Cause The issue was caused by a CDN configuration change that altered cache key behavior, significantly increasing traffic to our origin servers. This overwhelmed our autoscaling groups, which hit their scaling limits, resulting in a service outage. ## Impact URL API degradation lasted for approximately 29 minutes. ## Challenges During Resolution This type of issue is difficult to catch in a staging environment because the traffic profile does not reflect production-level demand. Although the alert system worked as intended, delays in responding to the alert contributed to the resolution time. ## Resolution The configuration change was identified as the root cause and rolled back. The service stabilized immediately after the rollback, and full functionality was restored by 15:00 UTC. ## Action Items ### Short-term * Review and improve escalation processes for alert notifications to reduce resolution time. * Adjust staging environment testing to better simulate production traffic patterns for similar changes. ## Long-term * Assess autoscaling limits to ensure sufficient capacity for handling unexpected traffic spikes. We sincerely apologize for the disruption and are committed to learning from this incident to improve the reliability of our services. Thank you for your understanding and continued support.
## Incident Summary On November 25, 2024, from 16:01 to 19:44 UTC, the “Uploading from 3rd Party Services” feature of our platform was unavailable. This feature allows users to upload files from social networks and storage services like Dropbox, Facebook, Google Drive, and Instagram. ## Timeline * **15:50 UTC** – A server configuration update was initiated to streamline deployments in our Kubernetes environment. * **16:01 UTC** – The outage began as requests to the “Uploading from 3rd Party Services” feature failed to be processed. * **19:00 UTC** – Investigation revealed that the application networking was incorrectly configured. * **19:15 UTC** – A fix was implemented to re-configure the application networking. * **19:44 UTC** – The incident was fully resolved, and service was restored. ## Root Cause The outage was caused by a synchronization issue between repositories during a server configuration update. This misalignment between repositories, compounded by a lack of explicit configuration, led to the service outage. ## Impact For approximately 3 hours and 43 minutes, users were unable to upload files from third-party services. ## Challenges During Resolution This type of issue is difficult to catch in a staging environment because the traffic profile does not reflect production-level demand. Although the alert system worked as intended, delays in responding to the alert contributed to the resolution time. ## Resolution The application configuration was updated and service functionality was fully restored at 19:44 UTC. ## Action Items ### Short-term * Improve feature status monitoring and alerting to detect outages faster. ## Long-term * Improve synchronization processes between repositories to avoid dependency misalignments. We sincerely apologize for the disruption this caused to our users. We take this incident seriously and are committed to implementing the above action items to prevent similar occurrences in the future. Thank you for your patience and continued trust in our platform. If you have any questions or need further details, please don’t hesitate to reach out.
## Incident Summary On 25 September 2024, an issue with webhook delivery was identified, affecting clients between 10:07 and 12:19 UTC. The delay impacted webhook notifications, with no data loss but a significant delay in processing and delivery. ## Timeline * 10:07 UTC – A system configuration change was made, which inadvertently disrupted webhook processing. * 10:07 UTC – Webhook delivery issues began. * 12:05 UTC – The problem was identified and resolved, with backlogged webhooks being processed. * 12:12 UTC – The first webhook was successfully delivered after the fix. * 12:19 UTC – All queued events were processed, with delivery confirmed for all affected users. ## Root Cause The issue was caused by a configuration change that resulted in the webhook delivery system not processing events correctly. Despite initial signs of system health, the disruption went undetected due to gaps in the system’s monitoring tools. ## Impact * Webhook delivery was delayed for approximately 2 hours. * Customers experienced delays in receiving event notifications. * No data was lost, but delivery delays were significant due to a backlog in event processing. ## Challenges During Resolution * Monitoring systems indicated that components of the webhook system were healthy, which delayed identification of the underlying problem. ## Resolution * Webhook processing was restarted, and we verified that all queued events were delivered without any data loss. * The incident was fully resolved by 12:19 UTC, with all webhooks processed and delivered. ## Action Items ### Short-term * Improve the system’s monitoring and alerting to better detect issues with webhook processing. ## Long-term * Explore options to improve the resilience of our webhook delivery system, including scaling the infrastructure to better handle failures.
## REST and Upload API service degradation \(incident #wvjpwt1qtpkn\) **Date**: 2024-07-08 **Authors**: Arsenii Baranov **Status**: Complete **Summary**: From 09:01:53 to 09:33:12 UTC we've experienced higher latencies and in the end a complete outage of Upload and REST API. **Root Causes**: PostgreSQL performance degradation due to human-factor. **Trigger**: Misconfiguration. **Resolution**: Changed our DBA-related processes, fixed monitoring related issue. **Detection**: Our Customer Success team detected the issue and escalated to the Engineering team. **Action Items**: | Action Item | Type | Status | | --- | --- | --- | | Fix monitoring misconfiguration | mitigate | **DONE** | | Improve DBA maintenance approach | prevent | **DONE** | ## Lessons Learned **What went well** * Due to distributed nature of Uploadcare, this incident has no effect on most of our services. This degradation didn’t affect storage, processing and serving files that were already stored by Uploadcare CDN. * Our incident mitigation strategy was right and worked immediately. **What went wrong** * This incident was detected in non-automatic way due to alert misconfiguration. * We failed to process API request during the incident. ## Timeline 2024-07-08 _\(all times UTC\)_ * 08:00 Database maintenance started * 09:01 **SERVICE DEGRADATION BEGINS** * 09:23 Our customer success team escalates issue to Infrastructure team * 09:29 Issue localised and fixed * 09:33 **SERVICE DEGRADATION ENDS**
# -/json/ and -/main\_colors/ operations returning 400 status **Date**: 2023-11-21 **Authors**: Alyosha Gusev **Status**: Complete, action items in progress **Summary**: From 2023-11-21 17:40 to 2023-11-22 18:40 -/json/ and -/main\_colors/ operations started to return 400 status **Root Causes**: HTTP rewrites misconfiguration **Trigger**: Deploy of the new functionality that enables our customers to save more money on traffic, and end users to experience lower load times \(improvements to automatic image optimisation\) **Resolution**: Bugfix deploy **Detection**: Our automatic tests detected the issue **Action Items**: | Action Item | Type | Status | | --- | --- | --- | | Fix rewrites | mitigate | **DONE** | | Improve visibility of failing tests notification | prevent | **DONE** | ## Lessons Learned **What went well** * Problem was detected by automatic tests * Bugfix was deployed immediately **What went wrong** * We didn’t found out that tests were failing immediately ## Timeline All times UTC * 2023-11-21 17:40: **SERVICE DEGRADATION BEGINS**. Deploy of the new functionality * 2023-11-22 16:18: Team notices failing tests * 2023-11-22 16:19: Issue localised, Incident response team is formed * 2023-11-22 16:19: Team starts bugfix implementation * 2023-11-22 17:40: Fix deployed to production * 2023-11-22 18:40: Last cached error response expired * 2023-11-22 18:40: **SERVICE DEGRADATION ENDS**
This incident has been resolved.
## Upload API and Video processing services degradation \(incident #5r4zj8shr69c\) **Date**: 2023-10-02 **Authors**: Alyosha Gusev, Denis Bondar **Status**: Complete **Summary**: From 14:15 to 16:45 UTC we’ve experienced higher latencies of Upload API and with video processing due to very high interest in these services. **Root Causes**: Cascading failure due to combination of exceptionally high amount of requests to Upload API. **Trigger**: Latent bug triggered by sudden traffic spike. **Resolution**: Changed our throttling politics, increased resources for processing. **Detection**: Our Customer Success team detected the issue and escalated to the Engineering team. **Action Items**: | Action Item | Type | Status | | --- | --- | --- | | Test corresponding alerts for correctness | mitigate | **DONE** | | Improve our upload processing system to remove bottleneck that we found | prevent | **DONE** | | Fix service access issue for team members that form potential response teams | mitigate | **DONE** | ## Lessons Learned **What went well** * Due to distributed nature of Uploadcare, this incident has no effect on most of our services. This degradation didn’t affect storage, processing and serving files that were already stored by Uploadcare CDN. * Our incident mitigation strategy was right and worked immediately. **What went wrong** * This incident was detected in non-automatic way due to alert misconfiguration. * Due to hardening security standards in our organisation, not all of incident responders had access to Statuspage to update our customers in timely manner. ## Timeline 2023-10-02 _\(all times UTC\)_ * 14:15 Our upload processing queue start filling * 14:20 **SERVICE DEGRADATION BEGINS** * 15:23 Our customer success team escalates issue to Infrastructure team * 15:31 Issue localised * 15:41 Incident response team is formed * 15:51:13 Adjusted our throttling policies * 15:51:38 Increased number of processing instances * 16:40 **SERVICE DEGRADATION ENDS** Processing queues clear
Scheduled and completed maintenance windows are separated from incidents.
The scheduled maintenance has been completed.
The scheduled maintenance has been completed.
The scheduled maintenance has been completed.
The scheduled maintenance has been completed.
The scheduled maintenance has been completed.
The scheduled maintenance has been completed.
The scheduled maintenance has been completed.
The scheduled maintenance has been completed.
The scheduled maintenance has been completed.
The scheduled maintenance has been completed.
Uptimus tracks the official Uploadcare status page, normalizes upstream events, and separates incidents from scheduled maintenance.
Official source
https://status.uploadcare.com
Adapter
STATUSPAGE IO
Alert streams
Incidents, component changes, and maintenance windows.
Public SEO page
Indexable status history for users searching outage information.
Regional reports can be layered on top of official provider status when user signals are available.
Showing 1 to 9 of 9 tracked components.
| Component | Status | Type | Last changed |
|---|---|---|---|
Processing engines | Operational | Group | Not recorded |
Document processing | Operational | Component | Not recorded |
Upload API | Operational | Component | Not recorded |
Video processing | Operational | Component | Not recorded |
Webhooks The service sends real-time updates about uploads. | Operational | Component | Not recorded |
REST API | Operational | Component | Not recorded |
CDN | Operational | Component | Not recorded |
Dashboard | Operational | Component | Not recorded |
Website | Operational | Component | Not recorded |
Follow outages, degraded components, and maintenance updates in your Uptimus workspace with email, push, and webhook alerts.
Official provider components
Incident and maintenance separation
Workspace alerts and webhooks
Related status pages based on category, adapter type, and operational history.
Uploadcare is currently marked as Operational in Uptimus based on the latest official status page check.
Supported status page providers are checked continuously by our scraper scheduler. The public page is cached briefly for SEO and performance.
No. Uptimus stores incidents and maintenance windows separately when the upstream provider exposes enough detail.
Yes. Create an Uptimus workspace, follow this provider, and choose email, push, or webhook notifications.