Investigating a Quality Failure

This walkthrough guides you through the complete process of investigating and resolving a data quality failure.

Scenario

You've been notified of a quality check failure on a critical data product. Your goal is to investigate the root cause, resolve the issue, and prevent recurrence.

Step 1: Review the Alert

Access the Alerts Center

Navigate to Quality Management → Alerts Center, find your alert using the filters if needed, and click to open the details.

Understand the Failure

Review the alert details to understand what happened. Identify the Check Name to know which validation failed, check the Timestamp to see when it occurred, read the Error Message for specific failure details, and confirm the Affected Product.

Tip: Note the timestamp; it helps you correlate the failure with upstream events such as pipeline runs or deployments.
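As a minimal sketch of that correlation, assuming you can export the alert timestamp and recent pipeline run times (all names and data here are hypothetical):

```python
from datetime import datetime, timedelta

def runs_near_failure(alert_time, pipeline_runs, window_hours=6):
    """Return pipeline runs that finished within `window_hours` before the alert."""
    window = timedelta(hours=window_hours)
    return [r for r in pipeline_runs
            if alert_time - window <= r["finished_at"] <= alert_time]

# Hypothetical data: the alert fired at 09:30; two recent pipeline runs.
alert_time = datetime(2024, 1, 15, 9, 30)
runs = [
    {"pipeline": "ingest_customers", "finished_at": datetime(2024, 1, 15, 8, 45)},
    {"pipeline": "daily_rollup",     "finished_at": datetime(2024, 1, 14, 22, 10)},
]
suspects = runs_near_failure(alert_time, runs)
```

Any run in the window is a candidate trigger worth checking first.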

Step 2: View Check Execution History

From the alert, click through to the Data Product and select the Quality tab. Find the failed check and expand its history to analyze trends.

What to Look For

Determine if this is a one-time failure or a recurring pattern. Look at when the check last passed and try to identify what changed in the environment between the last success and the current failure.
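The pattern analysis above can be sketched in a few lines, assuming you can export the check's execution history as (timestamp, passed) pairs (the helper names and data are hypothetical):

```python
def classify_failures(history):
    """history: list of (timestamp, passed) tuples, oldest first.
    Returns 'recurring' if the check has failed more than once,
    otherwise 'one-time'."""
    failures = [ts for ts, passed in history if not passed]
    return "recurring" if len(failures) > 1 else "one-time"

def last_pass(history):
    """Timestamp of the most recent successful run, or None."""
    passes = [ts for ts, passed in history if passed]
    return passes[-1] if passes else None

# Hypothetical history: two clean runs, then two failures.
history = [
    ("2024-01-12", True),
    ("2024-01-13", True),
    ("2024-01-14", False),
    ("2024-01-15", False),
]
pattern = classify_failures(history)
last_ok = last_pass(history)
```

The gap between `last_ok` and the first failure is the window in which to look for environmental changes.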

Step 3: Investigate Root Cause

Check Upstream Dependencies

Navigate to the Lineage tab and view upstream products. Check if any upstream checks have also failed, which would indicate a propagated issue.
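If the lineage is available as data rather than only in the UI, the upstream search amounts to a breadth-first walk; a rough sketch, with a hypothetical lineage map and status lookup:

```python
from collections import deque

def failed_upstream(product, upstream, check_status):
    """Walk upstream dependencies breadth-first and collect any
    products whose quality checks are currently failing.
    `upstream` maps product -> list of direct upstream products;
    `check_status` maps product -> True if all its checks pass."""
    seen, failing = set(), []
    queue = deque(upstream.get(product, []))
    while queue:
        p = queue.popleft()
        if p in seen:
            continue
        seen.add(p)
        if not check_status.get(p, True):
            failing.append(p)
        queue.extend(upstream.get(p, []))
    return failing

# Hypothetical lineage: orders <- customers <- raw_events
upstream = {"orders": ["customers"], "customers": ["raw_events"]}
status = {"orders": False, "customers": True, "raw_events": False}
propagated = failed_upstream("orders", upstream, status)
```

A non-empty result suggests the failure propagated from upstream rather than originating in your product.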

Review Recent Changes

Consider potential triggers such as schema modifications in source tables, changes to pipeline schedules, significant fluctuations in data volume, or known upstream data issues.

Examine the Data

For a deeper investigation, review the check's SQL query and configuration. If necessary, query the source data directly to compare the current state against expected values.
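As an illustration of comparing current state against expected values, here is a self-contained sketch using an in-memory SQLite table as a stand-in for the real source (the table, column, and threshold are hypothetical):

```python
import sqlite3

# Hypothetical in-memory stand-in for the source table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, email TEXT)")
conn.executemany("INSERT INTO customers VALUES (?, ?)",
                 [(1, "a@x.com"), (2, None), (3, "c@x.com"), (4, None)])

# Measure the current null rate and compare it to the check's threshold.
null_rate, = conn.execute(
    "SELECT AVG(CASE WHEN email IS NULL THEN 1.0 ELSE 0.0 END) FROM customers"
).fetchone()
threshold = 0.05  # hypothetical: the check allows at most 5% nulls
check_passes = null_rate <= threshold
```

Running the check's own SQL by hand like this confirms whether the data, rather than the check configuration, is at fault.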

Step 4: Create an Issue

If you've identified a problem that requires work, create an issue. From the alert, click Create Issue and fill in details: a clear Title, a Priority based on business impact, and a Description incorporating your findings.

Good Issue Titles

Avoid vague titles like "Check failing" or "Data issue". Instead, use specific descriptions like "Freshness check failing since pipeline delay on Jan 15" or "30% null rate in customer_email after schema migration".
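The distinction can even be enforced mechanically; a toy sketch that rejects known-vague titles and requires at least one concrete detail (the heuristic and word list are illustrative, not a real validation rule):

```python
VAGUE = {"check failing", "data issue", "broken", "error"}

def is_specific(title):
    """Reject known-vague titles; require some concrete detail
    such as a percentage, a date or number, or a column name."""
    t = title.strip().lower()
    if t in VAGUE:
        return False
    return "%" in title or "_" in title or any(c.isdigit() for c in title)

good = is_specific("30% null rate in customer_email after schema migration")
bad = is_specific("Check failing")
```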

Step 5: Resolve the Issue

For Data Pipeline Issues

If the issue lies in the pipeline, contact the data engineering team to review logs, trigger a backfill if needed, and monitor for successful completion.

For Configuration Issues

If the check itself is the problem, update the threshold, adjust the timing to match data availability, or update the expected values if the source data has changed intentionally.
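As a sketch of what such an adjustment might look like, assuming the check is defined by a small configuration record (the field names and values are hypothetical):

```python
# Hypothetical check configuration as it might appear in a config file.
check = {
    "name": "customer_email_null_rate",
    "threshold": 0.01,        # fails if more than 1% nulls
    "schedule": "0 6 * * *",  # runs at 06:00, before data lands
}

def retune(check, new_threshold=None, new_schedule=None):
    """Return an updated copy of the check config; the original is untouched."""
    updated = dict(check)
    if new_threshold is not None:
        updated["threshold"] = new_threshold
    if new_schedule is not None:
        updated["schedule"] = new_schedule
    return updated

# Loosen the threshold and move the run to 08:00, after the pipeline finishes.
fixed = retune(check, new_threshold=0.05, new_schedule="0 8 * * *")
```

Returning a copy rather than mutating in place keeps the before and after versions available for the issue's documentation.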

For Schema Changes

When schema changes are the cause, update the product metadata, modify the quality checks to align with the new structure, and communicate the changes to downstream consumers.

Step 6: Verify Resolution

Wait for the next scheduled check execution or trigger one manually. Verify that the check passes and monitor it for a few subsequent runs to ensure stability.
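The stability criterion can be stated precisely: the check is considered stable once its last N runs have all passed. A minimal sketch, with N and the run data as hypothetical choices:

```python
def is_stable(run_results, n=3):
    """True if the check has at least `n` runs and the last `n` all passed.
    run_results is a list of booleans, oldest first."""
    return len(run_results) >= n and all(run_results[-n:])

# The original failure, the fixed run, then two more clean runs.
too_early = is_stable([False, True], n=3)
stable = is_stable([False, True, True, True], n=3)
```

One passing run after the fix is necessary but not sufficient; waiting for a few consecutive passes avoids closing the loop on a flaky recovery.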

Close the Loop

Once verified, resolve the alert in the system, update and close the associated issue, and document your findings for future reference.

Step 7: Prevent Recurrence

Consider adding additional checks to catch similar issues earlier or setting up alerting for upstream dependencies. If this is a common issue type, add it to your team runbooks, create monitoring for leading indicators, or consider automated remediation strategies.
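One common leading indicator is data volume: a sharp drop in row counts often precedes a freshness or completeness failure. A rough sketch of such a monitor, with the tolerance and counts as hypothetical values:

```python
def volume_anomaly(history, today, tolerance=0.3):
    """Flag today's row count if it deviates from the recent average
    by more than `tolerance` (30% by default), a leading indicator
    that downstream checks may fail later."""
    avg = sum(history) / len(history)
    return abs(today - avg) / avg > tolerance

# Row counts over the last week, then a sharp drop today.
last_week = [10_200, 9_900, 10_050, 10_100, 9_950]
anomalous = volume_anomaly(last_week, today=6_000)
normal = volume_anomaly(last_week, today=10_000)
```

Alerting on the indicator rather than the eventual check failure buys time to intervene before downstream consumers are affected.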

Key Takeaways

Start with context to understand the full picture before diving in, and use lineage to identify if problems originate upstream. Document thoroughly for future reference, and focus on preventing recurrence by addressing root causes rather than just fixing symptoms.