Backfilling historical data for March 8, 2023 incident
Incident Report for Datadog US3
Resolved
We have finished backfilling data across all products: all data that was received during the incident and successfully buffered but not yet processed is now fully accessible on the platform. Due to the nature of this outage, you may see some residual gaps in the data we received within the first few hours after the start of the incident.

We truly appreciate your patience and understanding during this incident.
Posted Mar 10, 2023 - 00:02 EST
Update
We have completed backfill of data for the following products:

* Database Monitoring
* Serverless Monitoring

We are now in the process of validating and verifying data across all customers in those products.

For other products, we are actively working on backfilling data and will provide updates every 2 - 3 hours until the backfill effort is complete and the incident is fully resolved.
Posted Mar 09, 2023 - 21:18 EST
Update
We have also completed backfilling data for the following products:

* RUM

We are now in the process of validating and verifying data across all customers in those products.

For other products, we are actively working on backfilling data and will provide updates every 2 - 3 hours until the backfill effort is complete and the incident is fully resolved.
Posted Mar 09, 2023 - 18:03 EST
Update
We have completed backfill of data for the following products:
* APM traces and services
* Logs
* Network Performance Monitoring
* Network Device Monitoring
* Profiling
* CI Visibility
We are now in the process of validating and verifying data across all customers in those products.
For other products, we are actively working on backfilling data and will provide updates every 2 - 3 hours until the backfill effort is complete and the incident is fully resolved.
Posted Mar 09, 2023 - 15:19 EST
Update
All Datadog services are now available and able to receive, query, and report on live data. Monitors continue to be evaluated correctly since live data has been restored. Some customers may still observe gaps in historical data for parts of the last 24 hours.

We are now working on backfilling data and will provide updates every 2 - 3 hours until the backfill effort is complete and the incident is fully resolved.
Posted Mar 09, 2023 - 12:13 EST
Update
All Datadog services are now available and able to receive, query, and report on live data. Monitors continue to be evaluated correctly since live data has been restored. Some customers may still observe gaps in historical data for parts of the last 24 hours.

We are now working on backfilling data and will provide updates every 2 - 3 hours until the backfill effort is complete and the incident is fully resolved.
Posted Mar 09, 2023 - 07:10 EST
Update
Monitors continue to be evaluated correctly since live data has been restored.

Unless noted otherwise, all Datadog services are now available and able to receive and query live data. Some customers may still observe gaps in historical data for certain products for parts of the last 24 hours. We are now working on backfilling data and will provide updates every 2 - 3 hours until the backfill effort is complete and the incident is fully resolved.
Posted Mar 09, 2023 - 05:12 EST
Update
APM Traces and Error Tracking are operational. We will continue to monitor progress towards recovering the remaining services.

Unless noted otherwise, all Datadog services are now available and able to receive, query, and report on live data. Some customers may still observe gaps in historical data for certain products for parts of the last 24 hours. We are now working on backfilling data and will provide updates every 2 - 3 hours until the backfill effort is complete and the incident is fully resolved.
Posted Mar 09, 2023 - 04:18 EST
Update
Unless noted otherwise, all Datadog services are now available and able to receive, query, and report on live data. Some customers may still observe gaps in historical data for certain products for parts of the last 24 hours. We are now working on backfilling data and will provide updates every 2 - 3 hours until the backfill effort is complete and the incident is fully resolved.
Posted Mar 09, 2023 - 04:00 EST
Monitoring
Unless noted otherwise, all Datadog services are now available and able to receive, query, and report on live data. Some customers may still observe gaps in historical data for certain products for parts of the last 24 hours. We are now working on backfilling data and will provide updates every 2 - 3 hours until the backfill effort is complete and the incident is fully resolved.
Posted Mar 09, 2023 - 03:58 EST
Update
APM Traces and Error Tracking are operational. We will continue to monitor progress towards recovering the remaining services.
Posted Mar 09, 2023 - 03:12 EST
Update
Security Monitoring is operational. SLOs are operational. Cloud Integrations are operational. Recent Profiling data is available for queries. We will continue to monitor progress towards recovering the remaining services.
Posted Mar 09, 2023 - 02:06 EST
Update
RUM is fully operational. We will continue to monitor progress towards recovering the remaining services.
Posted Mar 09, 2023 - 01:34 EST
Update
Log Management is operational; live data and alerting are back to normal. External Archives and Log Forwarding are still delayed. Metrics are fully operational. Serverless Monitoring is operational. We will continue to monitor progress towards recovering the remaining services.
Posted Mar 09, 2023 - 01:33 EST
Update
Network Device Monitoring is fully operational. Metrics generated from Logs are now available. We will continue to monitor progress towards recovering the remaining services.
Posted Mar 09, 2023 - 00:20 EST
Update
We're in the process of enabling metric alerts for some customers for time windows shorter than 1 hour.

Network Performance Monitoring is fully operational. Event Management is fully operational. Error Tracking is partially available. We will continue to monitor progress towards recovering the remaining services.
Posted Mar 08, 2023 - 23:29 EST
Update
The Synthetics product is fully operational. We're seeing partial recovery for Serverless Monitoring, as well as metrics from our cloud provider integrations. We will continue to monitor progress towards recovering the remaining services.
Posted Mar 08, 2023 - 22:15 EST
Update
Monitors for Logs and Service Checks are operational. Database Monitoring is operational. We will continue to monitor progress towards recovering the remaining services.
Posted Mar 08, 2023 - 21:13 EST
Update
Live data is now available for Logs, and CI Visibility is fully operational. We're seeing partial recovery for Watchdog. We will continue to monitor progress towards recovering the remaining services. Data ingestion and monitor notifications remain delayed across non-metric data types.
Posted Mar 08, 2023 - 19:56 EST
Update
We are continuing to work on a fix for this issue.
Posted Mar 08, 2023 - 18:50 EST
Update
Live Search over the last 15 minutes for APM Traces has recovered. We will continue to monitor progress towards recovering the remaining services. Data ingestion and monitor notifications remain delayed across non-metric data types.
Posted Mar 08, 2023 - 18:49 EST
Update
We're seeing partial recovery across several products including Security Monitoring, CI Visibility and Network Performance Monitoring. These products may have gaps in data and partial limitations based on data available to monitors. We will continue to monitor progress towards recovering the remaining services. Data ingestion and monitor notifications remain delayed across non-metric data types.
Posted Mar 08, 2023 - 18:22 EST
Update
We're seeing partial recovery across several products including SLOs and Logs. These products may have gaps in data and partial limitations based on data available to monitors. We will continue to monitor progress towards recovering the remaining services. Data ingestion and monitor notifications remain delayed across non-metric data types.
Posted Mar 08, 2023 - 17:50 EST
Update
Processes (and their respective monitors) and Metrics are operational in US3. There may be gaps in historical metric data. We continue to make progress towards recovering the remaining services. Data ingestion and monitor notifications remain delayed across non-metric data types.
Posted Mar 08, 2023 - 17:06 EST
Update
We are continuing to make progress towards recovering all services. Data ingestion and monitor notifications remain delayed across all data types.
Posted Mar 08, 2023 - 16:21 EST
Update
At 06:00 UTC on March 8th, 2023, the Datadog platform started experiencing widespread issues across multiple products and regions. The web application was unavailable or intermittently loading, and data ingestion and monitor evaluation were delayed.

We will share a more detailed analysis post-recovery, but at a very high level:
* A system update on a number of hosts controlling our compute clusters caused a subset of these hosts to lose network connectivity.
* As a result, a number of the corresponding clusters entered unhealthy states and caused failures in a number of the internal services, datastores, and applications hosted on these clusters.

Our current status is:
* We identified and mitigated the initial issue, and rebuilt our clusters.
* We have also recovered a number of our applications and services, including our web portals.
* We are now working on recovering and catching up the rest of our data systems for metrics, traces, and logs across the regions that are still affected (see region-specific status pages). The recovery work is currently constrained by the number and large scale of the systems involved.

What to expect next:
* We are focusing on bringing back live data for all customers and all products before catching up on any historical data we may have stored during the outage.
* We expect live data recovery in a matter of hours (not minutes, and not days).
* We will continue to issue regular updates as the situation unfolds.

We understand how critical Datadog is to your business. We sincerely apologize for the inconvenience and are working hard to resolve this issue.
Posted Mar 08, 2023 - 15:39 EST
Update
We are continuing to make progress towards recovering all services. Data ingestion and monitor notifications remain delayed across all data types.
Posted Mar 08, 2023 - 15:12 EST
Update
We are continuing to make progress towards recovering all services. Data ingestion and monitor notifications remain delayed across all data types.
Posted Mar 08, 2023 - 14:27 EST
Update
We are continuing to make progress towards recovering all services. Data ingestion and monitor notifications remain delayed across all data types.
Posted Mar 08, 2023 - 13:44 EST
Update
We are continuing to make progress towards recovering all services. Data ingestion and monitor notifications remain delayed across all data types.
Posted Mar 08, 2023 - 13:15 EST
Update
We continue to make progress towards recovering all services. Data ingestion and monitor notifications remain delayed across all data types.
Posted Mar 08, 2023 - 12:30 EST
Update
We continue to make progress towards recovering all services. Data ingestion and monitor notifications remain delayed across all data types.
Posted Mar 08, 2023 - 11:47 EST
Update
We continue to make progress towards recovering all services. Data ingestion and monitor notifications remain delayed across all data types.
Posted Mar 08, 2023 - 11:09 EST
Update
We are still working on the identified issue and are making continued progress towards recovering all services. Data ingestion and monitor notifications remain delayed across all data types.
Posted Mar 08, 2023 - 10:34 EST
Update
We are still working on the identified issue and are making continued progress towards recovering all services. Data ingestion and monitor notifications remain delayed across all data types.
Posted Mar 08, 2023 - 09:48 EST
Update
We are still working on the identified issue and are making continued progress towards recovery. Data ingestion and monitor notifications remain delayed across all data types.
Posted Mar 08, 2023 - 09:06 EST
Update
We are still working on the identified issue and are making continued progress towards recovery. Data ingestion and monitor notifications remain delayed across all data types.
Posted Mar 08, 2023 - 08:29 EST
Update
We are still working on the identified issue and are making continued progress towards recovery. Data ingestion and monitor notifications remain delayed across all data types.
Posted Mar 08, 2023 - 07:44 EST
Update
We have identified the issue and are making continued progress towards recovery. Data ingestion and monitor notifications remain delayed across all data types.
Posted Mar 08, 2023 - 07:08 EST
Identified
We are continuing to work on mitigating and investigating the issue causing delayed data ingestion across all data types. Monitor notifications are delayed, and you may observe delayed data throughout the app. Additionally, the web application continues to have elevated error rates.
Posted Mar 08, 2023 - 06:23 EST
Update
We are continuing to work on mitigating and investigating the issue causing delayed data ingestion across all data types. Monitor notifications are delayed, and you may observe delayed data throughout the app. Additionally, the web application continues to have elevated error rates.
Posted Mar 08, 2023 - 05:34 EST
Update
We are continuing to work on mitigating and investigating the issue causing delayed data ingestion across all data types. Monitor notifications are delayed, and you may observe delayed data throughout the app. Additionally, the web application continues to have elevated error rates.
Posted Mar 08, 2023 - 04:38 EST
Update
We are continuing to investigate this issue.
Posted Mar 08, 2023 - 03:51 EST
Update
We are still investigating issues causing delayed data ingestion across all data types. Monitor notifications may be delayed, and you may observe delayed data throughout the web app.
Posted Mar 08, 2023 - 03:38 EST
Update
We are still investigating issues causing delayed data ingestion across all data types. Monitor notifications may be delayed, and you may observe delayed data throughout the web app.
Posted Mar 08, 2023 - 03:04 EST
Update
We are investigating issues causing delayed data ingestion across all data types. As a result, monitor notifications may be delayed, and you may observe delayed data throughout the web app.
Posted Mar 08, 2023 - 02:21 EST
Investigating
We are investigating loading issues on our web application. As a result, some users might be getting errors when loading the web application.
Posted Mar 08, 2023 - 01:37 EST
This incident affected: APM, CI Visibility, Cloud Security Management, Incident Management, Log Management, Metrics and Infra Monitoring, Mobile Application, Monitors, NPM, Profiling, RUM, Serverless, Synthetics, and Web Application.