Investigating a Quality Failure
This walkthrough guides you through the complete process of investigating and resolving a data quality failure.
Scenario
You've been notified of a quality check failure on a critical data product. Your goal is to investigate the root cause, resolve the issue, and prevent recurrence.
Step 1: Review the Alert
Access the Alerts Center
Navigate to Quality Management → Alerts Center, find your alert using the filters if needed, and click to open the details.
Understand the Failure
Review the alert details to understand what happened. Identify the Check Name to know which validation failed, check the Timestamp to see when it occurred, read the Error Message for specific failure details, and confirm the Affected Product.
Note the timestamp—this helps correlate with upstream events like pipeline runs or deployments.
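If you have programmatic access to pipeline run metadata, correlating the alert timestamp with recent runs can narrow down the suspect window. A minimal sketch, assuming a hypothetical alert payload and run list; the field names will vary by platform.

```python
from datetime import datetime, timedelta

# Hypothetical alert payload; field names vary by platform.
alert = {
    "check_name": "freshness_orders",
    "timestamp": datetime.fromisoformat("2024-01-15T06:30:00"),
    "product": "orders_daily",
}

# Hypothetical recent pipeline runs: (pipeline, finished_at, status).
recent_runs = [
    ("orders_ingest", datetime.fromisoformat("2024-01-15T02:10:00"), "failed"),
    ("orders_transform", datetime.fromisoformat("2024-01-14T02:05:00"), "success"),
]

# Flag runs that finished within a window before the alert fired.
window = timedelta(hours=12)
for pipeline, finished_at, status in recent_runs:
    if alert["timestamp"] - window <= finished_at <= alert["timestamp"]:
        print(f"Candidate trigger: {pipeline} ({status}) at {finished_at}")
```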
Step 2: View Check Execution History
From the alert, click through to the Data Product and select the Quality tab. Find the failed check and expand its history to analyze trends.
What to Look For
Determine if this is a one-time failure or a recurring pattern. Look at when the check last passed and try to identify what changed in the environment between the last success and the current failure.
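If the platform lets you export or query check run history, a small script can classify the pattern for you. A minimal sketch, assuming a hypothetical run-record shape (run date plus pass/fail flag).

```python
# Hypothetical check run history, newest first: (run_date, passed).
history = [
    ("2024-01-15", False),
    ("2024-01-14", False),
    ("2024-01-13", True),
    ("2024-01-12", True),
]

consecutive_failures = 0
last_pass = None
for run_date, passed in history:
    if passed:
        last_pass = run_date
        break
    consecutive_failures += 1

pattern = "one-time failure" if consecutive_failures == 1 else "recurring failure"
print(f"{pattern}: {consecutive_failures} consecutive failure(s), last pass on {last_pass}")
```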
Step 3: Investigate Root Cause
Check Upstream Dependencies
Navigate to the Lineage tab and view upstream products. Check if any upstream checks have also failed, which would indicate a propagated issue.
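Conceptually this is a graph walk: start at the affected product and move upstream until you find the products whose own checks are failing. The sketch below assumes hypothetical lineage and check-status lookups; most platforms expose equivalents through a lineage API.

```python
from collections import deque

# Hypothetical lineage: product -> direct upstream products.
upstream = {
    "orders_daily": ["orders_staging"],
    "orders_staging": ["orders_raw", "customers_raw"],
}

# Hypothetical latest check status per product.
check_failed = {
    "orders_daily": True,
    "orders_staging": True,
    "orders_raw": True,
    "customers_raw": False,
}

def failing_upstream(product):
    """Breadth-first walk upstream; return every ancestor with a failing check."""
    queue, seen, failing = deque(upstream.get(product, [])), set(), []
    while queue:
        node = queue.popleft()
        if node in seen:
            continue
        seen.add(node)
        if check_failed.get(node):
            failing.append(node)
        queue.extend(upstream.get(node, []))
    return failing

print(failing_upstream("orders_daily"))  # ['orders_staging', 'orders_raw']
```

If the furthest failing ancestor belongs to another team, the fix likely sits with them; your issue then tracks the downstream impact.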
Review Recent Changes
Consider potential triggers such as schema modifications in source tables, changes to pipeline schedules, significant fluctuations in data volume, or known upstream data issues.
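Volume swings are usually the quickest trigger to confirm with a direct query. A minimal sketch, assuming a DB-API connection (SQLite shown) and a hypothetical loaded_at column on the source table; adapt the SQL to your warehouse.

```python
import sqlite3

conn = sqlite3.connect("warehouse.db")  # stand-in for your warehouse connection

# Compare yesterday's row count against the average daily count of the prior week.
total_8d, latest = conn.execute(
    """
    SELECT COUNT(*),
           SUM(CASE WHEN loaded_at >= DATE('now', '-1 day') THEN 1 ELSE 0 END)
    FROM orders
    WHERE loaded_at >= DATE('now', '-8 days')
    """
).fetchone()

latest = latest or 0
trailing_avg = (total_8d - latest) / 7.0
if trailing_avg and abs(latest - trailing_avg) / trailing_avg > 0.3:
    print(f"Volume anomaly: {latest} rows yesterday vs ~{trailing_avg:.0f}/day over the prior week")
```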
Examine the Data
For a deeper investigation, review the check's SQL query and configuration. If necessary, query the source data directly to compare the current state against expected values.
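For example, if the error message points at a null-rate threshold on a specific column, reproducing that measurement by hand shows how far the data has drifted. A minimal sketch; the table name, column, and threshold are illustrative.

```python
import sqlite3

conn = sqlite3.connect("warehouse.db")  # stand-in for your warehouse connection

# Reproduce the check's measurement: null rate of customer_email.
total, nulls = conn.execute(
    "SELECT COUNT(*), SUM(CASE WHEN customer_email IS NULL THEN 1 ELSE 0 END) FROM customers"
).fetchone()

null_rate = (nulls or 0) / total if total else 0.0
print(f"customer_email null rate: {null_rate:.1%} (illustrative check threshold: 5%)")
```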
Step 4: Create an Issue
If you've identified a problem that requires work, create an issue. From the alert, click Create Issue and fill in details: a clear Title, a Priority based on business impact, and a Description incorporating your findings.
Good Issue Titles
Avoid vague titles like "Check failing" or "Data issue". Instead, use specific descriptions like "Freshness check failing since pipeline delay on Jan 15" or "30% null rate in customer_email after schema migration".
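If your team files issues through an API rather than the UI, the same guidance applies to the payload. The endpoint and field names below are hypothetical; consult your platform's API reference for the real ones.

```python
import requests

issue = {
    "title": "Freshness check failing since pipeline delay on Jan 15",
    "priority": "high",  # based on business impact
    "description": (
        "orders_daily freshness check has failed since 06:30. Upstream "
        "orders_ingest run failed; backfill requested from data engineering."
    ),
    "product": "orders_daily",
}

# Hypothetical endpoint and auth; substitute your platform's issue-tracking API.
resp = requests.post(
    "https://quality.example.com/api/issues",
    json=issue,
    headers={"Authorization": "Bearer <token>"},
    timeout=30,
)
resp.raise_for_status()
```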
Step 5: Resolve the Issue
For Data Pipeline Issues
If the issue lies in the pipeline, contact the data engineering team to review logs, trigger a backfill if needed, and monitor for successful completion.
For Configuration Issues
If the check itself is the problem, adjust the threshold, shift the schedule to match when data actually becomes available, or revise the expected values if the source data has changed intentionally.
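Because check definitions are typically declarative, a configuration fix is usually a small, reviewable change. A minimal sketch of an adjusted freshness check, using hypothetical field names; your platform's schema will differ.

```python
# Hypothetical check definition, shown as a dict for illustration.
freshness_check = {
    "name": "freshness_orders",
    "type": "freshness",
    "max_age_hours": 24,
    "schedule": "0 6 * * *",  # currently runs at 06:00
}

# If the source now lands at 06:30 instead of 04:00, run the check after the
# data is actually available and loosen the threshold to match.
freshness_check["schedule"] = "0 7 * * *"
freshness_check["max_age_hours"] = 26
```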
For Schema Changes
When schema changes are the cause, update the product metadata, modify the quality checks to align with the new structure, and communicate the changes to downstream consumers.
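One way to find every check affected by a schema change is to compare the columns each check references against the table's current columns. A minimal sketch, assuming a hypothetical check-to-column mapping; the PRAGMA call is SQLite-specific, so substitute your warehouse's information schema.

```python
import sqlite3

conn = sqlite3.connect("warehouse.db")  # stand-in for your warehouse connection

# Hypothetical mapping of checks to the columns they rely on.
check_columns = {
    "null_rate_customer_email": ["customer_email"],
    "valid_country_code": ["country_code"],
}

# Columns that exist after the migration (PRAGMA table_info is SQLite-specific).
current_columns = {row[1] for row in conn.execute("PRAGMA table_info(customers)")}

for check, cols in check_columns.items():
    missing = [c for c in cols if c not in current_columns]
    if missing:
        print(f"{check} references missing column(s) {missing}; update the check definition")
```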
Step 6: Verify Resolution
Wait for the next scheduled check execution or trigger one manually. Verify that the check passes and monitor it for a few subsequent runs to ensure stability.
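If checks can be triggered and read programmatically, a short verification loop saves manual refreshing. The trigger_check and latest_run_passed helpers below are hypothetical placeholders for your platform's API.

```python
import time

CHECK_NAME = "freshness_orders"
REQUIRED_PASSES = 3

def trigger_check(check_name):
    """Hypothetical placeholder: ask the platform to run the check now."""
    print(f"Triggering {check_name}...")

def latest_run_passed(check_name):
    """Hypothetical placeholder: return True if the most recent run passed."""
    return True  # replace with a real status lookup

passes = 0
while passes < REQUIRED_PASSES:
    trigger_check(CHECK_NAME)
    time.sleep(5)  # in practice, wait for the run to actually finish
    if latest_run_passed(CHECK_NAME):
        passes += 1
    else:
        raise RuntimeError("Check failed again; reopen the investigation")

print(f"{CHECK_NAME} stable across {REQUIRED_PASSES} consecutive runs")
```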
Close the Loop
Once verified, resolve the alert in the system, update and close the associated issue, and document your findings for future reference.
Step 7: Prevent Recurrence
Consider adding checks that catch similar issues earlier, or set up alerting on upstream dependencies. If this is a common failure type, add it to your team runbooks, create monitoring for leading indicators, or evaluate automated remediation strategies.
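A leading indicator catches the upstream symptom before the downstream check fails. For instance, rather than waiting for the product's freshness check to trip, you might alert when the upstream source itself goes stale. A minimal sketch of that idea; the table and threshold are illustrative.

```python
import sqlite3
from datetime import datetime

conn = sqlite3.connect("warehouse.db")  # stand-in for your warehouse connection

# Leading indicator: alert when the upstream source goes stale, well before the
# downstream product's own freshness check would fail.
(last_loaded,) = conn.execute("SELECT MAX(loaded_at) FROM orders_raw").fetchone()

now = datetime.now()  # assumes loaded_at shares this timezone convention
lag_hours = (
    (now - datetime.fromisoformat(last_loaded)).total_seconds() / 3600
    if last_loaded
    else float("inf")
)
if lag_hours > 6:  # illustrative threshold, tighter than the downstream check
    print(f"orders_raw is {lag_hours:.1f}h stale; investigate before downstream checks fail")
```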
Key Takeaways
Start with context to understand the full picture before diving in, and use lineage to identify if problems originate upstream. Document thoroughly for future reference, and focus on preventing recurrence by addressing root causes rather than just fixing symptoms.