Syncing dbt Metadata
If your organization uses dbt to manage data transformations, you can sync your dbt project's metadata — models, columns, lineage, and documentation — directly into Qarion's data catalog. This keeps your catalog in sync with your transformation layer without requiring manual registration of each model.
This tutorial walks through parsing dbt artifacts, creating and updating products from model metadata, syncing field and lineage metadata, and automating the process in CI/CD.
Prerequisites
Before you begin, you'll need a Qarion API key with Editor permissions in the target space, a dbt project with generated artifacts (specifically manifest.json and catalog.json), and the Python requests library installed.
Overview
The sync follows four steps: parse the dbt artifacts to extract model definitions, column metadata, and dependency relationships; create or update a data product in Qarion for each model; sync column-level field metadata; and sync the lineage relationships derived from dbt's dependency graph. A final section shows how to automate the entire pipeline so it runs after every dbt build.
Step 1: Parse dbt Artifacts
dbt generates two key artifacts during a build: manifest.json, which contains the project's dependency graph, model definitions, and compiled SQL, and catalog.json, which contains schema-level metadata like column names, types, and row counts. Together, these files provide everything Qarion needs to populate the catalog.
The following function parses the manifest and filters for model nodes (excluding tests, sources, and other dbt node types):
import json
def parse_dbt_manifest(manifest_path):
"""Extract model metadata from dbt manifest."""
with open(manifest_path) as f:
manifest = json.load(f)
models = {}
for node_id, node in manifest["nodes"].items():
if node["resource_type"] == "model":
models[node_id] = {
"name": node["name"],
"description": node.get("description", ""),
"schema": node["schema"],
"database": node["database"],
"columns": node.get("columns", {}),
"depends_on": node.get("depends_on", {}).get("nodes", []),
"tags": node.get("tags", []),
"meta": node.get("meta", {})
}
return models
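The manifest's columns block only contains what you have declared in your model YAML, so data_type may be missing for undocumented columns. catalog.json (produced by dbt docs generate) records the column types dbt observed in the warehouse. If you want to fall back to those, a minimal sketch of extracting them, assuming the standard catalog.json layout (nodes keyed by unique ID, each with a columns map), looks like this:
def parse_dbt_catalog(catalog_path):
    """Extract warehouse-observed column types from dbt's catalog.json."""
    with open(catalog_path) as f:
        catalog = json.load(f)
    column_types = {}
    for node_id, node in catalog.get("nodes", {}).items():
        # Each catalog node carries a columns map with the type reported by the warehouse
        column_types[node_id] = {
            col_name: col_info.get("type", "unknown")
            for col_name, col_info in node.get("columns", {}).items()
        }
    return column_types
The result is keyed by the same node IDs as the models dict from parse_dbt_manifest, so you can merge the types in before syncing fields in Step 3. Remember to run dbt docs generate before the sync if you want catalog.json to be present and fresh.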
Step 2: Create or Update Products
With the parsed model data, you can now register each model as a data product in Qarion. The function below constructs the product payload from dbt metadata and creates the product via the API. If a product with the same slug already exists, it updates the existing record instead:
import os
import requests

API_BASE = "https://api.qarion.com"
# Read the key from the environment (set QARION_API_KEY in CI) with a placeholder fallback
API_KEY = os.environ.get("QARION_API_KEY", "your-api-key")
SPACE_SLUG = "analytics"
def sync_product(model):
"""Create or update a product from dbt model metadata."""
product_data = {
"name": model["name"],
"product_type": "table",
"provider": "dbt",
"description": model["description"],
"hosting_location": f"{model['database']}.{model['schema']}.{model['name']}"
}
# Try to create; update if already exists
response = requests.post(
f"{API_BASE}/catalog/spaces/{SPACE_SLUG}/products",
headers={"Authorization": f"Bearer {API_KEY}"},
json=product_data
)
if response.status_code == 409:
# Product exists, update instead
existing = find_product_by_slug(model["name"])
response = requests.patch(
f"{API_BASE}/catalog/spaces/{SPACE_SLUG}/products/{existing['id']}",
headers={"Authorization": f"Bearer {API_KEY}"},
json=product_data
)
return response.json()
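The call to find_product_by_slug isn't defined in this tutorial. A minimal sketch, assuming the products list endpoint accepts a slug query parameter and that product slugs are derived from the model name (check the API reference for the exact behavior):
def find_product_by_slug(slug):
    """Look up an existing product by slug; returns None if nothing matches."""
    # Assumes the list endpoint supports a slug filter -- adjust to your actual API
    response = requests.get(
        f"{API_BASE}/catalog/spaces/{SPACE_SLUG}/products",
        headers={"Authorization": f"Bearer {API_KEY}"},
        params={"slug": slug}
    )
    response.raise_for_status()
    results = response.json()
    return results[0] if results else None
If your model names are slugified differently than they appear in dbt (uppercase letters, underscores), normalize the name before the lookup.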
The hosting_location field is constructed from the dbt model's database, schema, and name — giving catalog consumers a direct reference to where the data lives in the warehouse.
Step 3: Sync Fields
dbt models often include column-level documentation that describes each field's purpose, data type, and business meaning. Syncing this metadata into Qarion populates the schema tab in the catalog, making datasets discoverable at the column level:
def sync_fields(product_id, columns):
"""Sync column metadata from dbt to Qarion."""
for col_name, col_info in columns.items():
field_data = {
"name": col_name,
"description": col_info.get("description", ""),
"data_type": col_info.get("data_type", "unknown"),
"is_nullable": True,
"is_primary_key": col_info.get("meta", {}).get("is_primary_key", False)
}
requests.post(
f"{API_BASE}/catalog/spaces/{SPACE_SLUG}/products/{product_id}/fields",
headers={"Authorization": f"Bearer {API_KEY}"},
json=field_data
)
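The description, data_type, and meta values used above come from the column declarations in your dbt model's YAML file. A minimal example of what such a declaration might look like (the model and column names are illustrative):
version: 2
models:
  - name: orders
    description: One row per customer order
    columns:
      - name: order_id
        description: Primary key for the orders table
        data_type: bigint
        meta:
          is_primary_key: true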
Step 4: Sync Lineage
dbt's depends_on field provides an explicit dependency graph — every model knows which other models it reads from. By translating these dependencies into Qarion lineage relationships, you get a complete data flow visualization that's always in sync with your actual transformations:
def sync_lineage(models, product_map):
"""Create lineage relationships from dbt dependencies."""
for model_id, model in models.items():
product = product_map.get(model["name"])
if not product:
continue
upstream_ids = []
for dep in model["depends_on"]:
dep_name = dep.split(".")[-1]
dep_product = product_map.get(dep_name)
if dep_product:
upstream_ids.append(dep_product["id"])
if upstream_ids:
requests.put(
f"{API_BASE}/catalog/spaces/{SPACE_SLUG}/products/{product['id']}/lineage",
headers={"Authorization": f"Bearer {API_KEY}"},
json={"upstream_ids": upstream_ids}
)
The function iterates through each model's depends_on list, resolves the dependency names to Qarion product IDs, and creates the corresponding upstream lineage relationships. dbt dependency entries are unique IDs like model.my_project.stg_orders, so splitting on the dots and taking the last segment recovers the model name; entries that don't resolve to a synced product (such as sources) are skipped.
Full Sync Script
The following script ties all the steps together into a single executable that parses dbt artifacts and performs a complete sync:
def full_dbt_sync(manifest_path, catalog_path=None):
"""Complete dbt-to-Qarion sync pipeline."""
print("Parsing dbt manifest...")
models = parse_dbt_manifest(manifest_path)
print(f"Found {len(models)} models")
    # Step 2: Create or update products
product_map = {}
for model_id, model in models.items():
print(f"Syncing product: {model['name']}")
product = sync_product(model)
product_map[model["name"]] = product
        # Step 3: Sync fields if available
if model["columns"]:
sync_fields(product["id"], model["columns"])
    # Step 4: Sync lineage
print("Syncing lineage relationships...")
sync_lineage(models, product_map)
print(f"Sync complete: {len(product_map)} products synced")
# Run the sync; the manifest path can be overridden via DBT_MANIFEST_PATH (set in CI below)
full_dbt_sync(os.environ.get("DBT_MANIFEST_PATH", "target/manifest.json"))
CI/CD Automation
The real power of this sync comes from automating it as part of your CI/CD pipeline. By running the sync after every dbt build, your catalog stays continuously in sync with your transformation layer — new models are registered automatically, descriptions and schemas are refreshed, and lineage relationships reflect the latest dependency graph.
GitHub Actions
- name: Sync dbt to Qarion
run: |
pip install requests
python scripts/dbt_sync.py
env:
QARION_API_KEY: ${{ secrets.QARION_API_KEY }}
DBT_MANIFEST_PATH: target/manifest.json
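The step above assumes the dbt artifacts already exist in target/. In practice the sync step usually follows the dbt build in the same job; a sketch of the surrounding workflow, where the job name, runner, and adapter are illustrative:
jobs:
  dbt-build-and-sync:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Build dbt project
        # Warehouse credentials / profiles.yml setup omitted for brevity
        run: |
          pip install dbt-core dbt-snowflake
          dbt deps
          dbt build
      - name: Sync dbt to Qarion
        run: |
          pip install requests
          python scripts/dbt_sync.py
        env:
          QARION_API_KEY: ${{ secrets.QARION_API_KEY }}
          DBT_MANIFEST_PATH: target/manifest.json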
Key Considerations
The sync script should be idempotent — running it multiple times with the same input should produce the same result. For products this is handled by the create-or-update pattern in sync_product, which falls back to an update when the create returns a conflict. The field sync shown above simply POSTs each column; if the fields endpoint doesn't upsert by name, apply the same pattern there to avoid duplicates.
For large dbt projects with hundreds of models, consider batching API calls and adding retry logic (see the Rate Limits guide) to avoid hitting the API's per-minute limits during the sync.
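A minimal retry wrapper along those lines, assuming the API responds with HTTP 429 and a Retry-After header when the per-minute limit is exceeded (see the Rate Limits guide for the exact behavior):
import time

def request_with_retry(method, url, max_retries=5, **kwargs):
    """Issue a request, backing off and retrying when the API rate-limits us."""
    for attempt in range(max_retries):
        response = requests.request(method, url, **kwargs)
        if response.status_code != 429:
            return response
        # Honor Retry-After if the API provides it, otherwise back off exponentially
        wait = float(response.headers.get("Retry-After", 2 ** attempt))
        time.sleep(wait)
    return response
You can then swap the direct requests.post and requests.put calls in the sync functions for request_with_retry("post", ...) and request_with_retry("put", ...).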
Related
- Lineage Concepts — Understanding data lineage
- SDK Examples — Python SDK patterns
- Quality Automation — Automate quality checks