Introduction
In modern data engineering workflows, managing raw data and its transformations is crucial for providing clean, reliable, and structured data to analytics teams and downstream applications. Databricks Unity Catalog provides a unified data governance model that simplifies and secures data access across the entire organization. This guide will walk through a comprehensive process for ingesting raw data into a shared bronze layer and transforming it into multiple silver layers in a Unity Catalog setup, with data stored in catalog-managed tables instead of traditional file paths. We will use examples and data flow diagrams to illustrate the concepts.
Overview of the Layers
Raw Layer: Unstructured or semi-structured data as it arrives from source systems (e.g., logs, IoT feeds, application data).
Bronze Layer: A landing area where raw data is cleaned and minimally processed. This layer maintains data fidelity while addressing obvious data quality issues.
Silver Layer: Curated data for specific use cases. Data from the bronze layer is cleaned, structured, and aggregated into independent tables optimized for downstream use (e.g., analytics, reporting, machine learning).
Data Flow Diagram
+-------------------+      +-------------------+      +-------------------+
|     Raw Layer     | ---> |   Bronze Layer    | ---> |   Silver Layers   |
|  (External Data)  |      |  (Cleaned Data)   |      |  (Curated Data)   |
+-------------------+      +-------------------+      +-------------------+
                                     |
             +-----------------------+-----------------------+
             |                       |                       |
  +---------------------+ +---------------------+ +---------------------+
  | Bronze to Silver 1  | | Bronze to Silver 2  | | Bronze to Silver n  |
  | (Specific Use Case) | | (Specific Use Case) | | (Specific Use Case) |
  +---------------------+ +---------------------+ +---------------------+
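All three layers in the diagram live as tables under a single Unity Catalog namespace. A minimal one-time setup sketch, assuming the placeholder names catalog_name and schema_name used throughout the examples below (creating a catalog requires metastore-level privileges):
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("UnityCatalogSetup").getOrCreate()

# Create the catalog and schema once; all bronze and silver tables below live under them
spark.sql("CREATE CATALOG IF NOT EXISTS catalog_name")
spark.sql("CREATE SCHEMA IF NOT EXISTS catalog_name.schema_name")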
Step-by-Step Guide
1. Ingesting Data into the Raw Layer
The Raw Layer typically represents unprocessed data arriving from external sources such as JSON files, CSV files, or logs. This data is brought into Databricks through either batch or streaming jobs. You can use Databricks Auto Loader for efficient, incremental file ingestion from cloud storage systems such as Azure Data Lake Storage (ADLS); a streaming sketch follows the batch example below.
Here’s an example of a batch ingestion job using PySpark:
from pyspark.sql import SparkSession
# Create a Spark session
spark = SparkSession.builder.appName("RawToBronzeIngestion").getOrCreate()
# Define the raw data source path and the target Unity Catalog table
raw_data_path = "abfss://raw@storage_account.dfs.core.windows.net/dataset/"
bronze_table = "catalog_name.schema_name.bronze_table"
# Load the raw data
raw_df = spark.read.option("header", True).csv(raw_data_path)
# Write the raw data into the bronze layer
raw_df.write.format("delta").mode("overwrite").saveAsTable(bronze_table)
Key Points:
Source: Data is ingested from external systems or cloud storage.
Target: The raw data is ingested directly into a Unity Catalog table in the bronze layer.
Table Management: Unity Catalog manages the metadata for the tables, providing security and governance.
2. Bronze Layer: Initial Cleansing and Normalization
The Bronze Layer is a lightly processed version of the raw data. The data is stored in Unity Catalog tables, allowing for basic cleansing and structural changes (e.g., handling nulls, normalizing data types). This layer acts as a source of truth and an archive of the original data with minor cleaning.
Let’s clean the raw data and store it in the bronze table:
from pyspark.sql.functions import col, current_timestamp

# Clean and transform raw data
bronze_df = raw_df \
    .dropDuplicates() \
    .filter(col("important_column").isNotNull()) \
    .withColumn("processed_at", current_timestamp())

# Write the cleaned data into the bronze table
bronze_df.write.format("delta").mode("overwrite").saveAsTable(bronze_table)
Key Steps:
Data Cleansing: Remove duplicates, null values, and other obvious quality issues.
Column Enrichment: Add processing metadata such as timestamps for auditing.
Governance: Data in the bronze layer is managed and versioned with Delta Lake, enhancing traceability.
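Because the bronze table is a Delta table, its transaction history can be inspected directly, which helps when auditing the cleansing step. A small sketch using Delta's DESCRIBE HISTORY command against the bronze_table defined above:
# Inspect the Delta transaction log of the bronze table for auditing
history_df = spark.sql(f"DESCRIBE HISTORY {bronze_table}")
history_df.select("version", "timestamp", "operation").show(truncate=False)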
3. Transforming Bronze Data into Multiple Silver Layers
The Silver Layer represents clean, structured, and optimized datasets for specific downstream use cases. These could include separate tables for different business units or functional areas. The goal is to provide ready-to-use data for analytics, machine learning, or reporting.
Each silver table is designed with specific business logic in mind. Below, we walk through two different transformations from the bronze layer into silver layers.
Silver Layer 1: Customer Data Cleansing and Aggregation
For example, if you have customer data, you may want to clean and aggregate customer transactions for reporting purposes.
from pyspark.sql.functions import col, sum as spark_sum

# Load the bronze data
bronze_df = spark.table(bronze_table)

# Transform and aggregate purchase data for the silver layer
silver_customer_df = bronze_df \
    .filter(col("transaction_type") == "purchase") \
    .groupBy("customer_id") \
    .agg(spark_sum("transaction_amount").alias("total_purchase"))
# Write the curated data to the silver layer
silver_customer_table = "catalog_name.schema_name.silver_customers"
silver_customer_df.write.format("delta").mode("overwrite").saveAsTable(silver_customer_table)
Silver Layer 2: Product Data Transformation
You might have product data that needs to be structured for inventory tracking:
# Filter and transform product-related data
silver_product_df = bronze_df \
    .filter(col("category").isNotNull()) \
    .select("product_id", "category", "inventory_status")
# Write the product data to another silver table
silver_product_table = "catalog_name.schema_name.silver_products"
silver_product_df.write.format("delta").mode("overwrite").saveAsTable(silver_product_table)
Key Points for Silver Layer:
Business Logic: Each silver table represents a different subset of the bronze data, filtered and transformed to serve specific use cases.
Table Partitioning: Silver tables can be partitioned on frequently queried columns (e.g., date, customer region) for performance; a partitioned write is sketched after this list.
Storage Optimization: Data stored in the silver layer can be aggregated, cleaned, and formatted for efficient querying.
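As noted in the partitioning point above, a silver table can be written with a partition column to speed up common filters. A minimal sketch that rewrites the product table from the previous example, partitioned by its existing category column:
# Rewrite the product silver table partitioned by category for faster category filters
(
    silver_product_df.write.format("delta")
    .mode("overwrite")
    .partitionBy("category")
    .saveAsTable(silver_product_table)
)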
4. Managing Unity Catalog Tables
All data at each stage (raw, bronze, silver) is managed as Unity Catalog tables. This provides centralized control over access permissions, schema evolution, and lineage tracking. You can also enforce data masking or row-level security policies based on user roles.
To view and manage the tables:
-- List all Unity Catalog tables in a specific schema
SHOW TABLES IN catalog_name.schema_name;
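Access control is managed in the same place. For example, read access on the curated silver tables can be granted to an analyst group; a short sketch, assuming a hypothetical group named analysts:
# Grant read access on the curated silver tables to a (hypothetical) analyst group
spark.sql("GRANT SELECT ON TABLE catalog_name.schema_name.silver_customers TO `analysts`")
spark.sql("GRANT SELECT ON TABLE catalog_name.schema_name.silver_products TO `analysts`")

# Grants can also be applied at the schema level to cover future tables
spark.sql("GRANT SELECT ON SCHEMA catalog_name.schema_name TO `analysts`")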
5. Scheduling and Automation of Pipelines
Data pipelines for moving from raw to bronze and from bronze to silver can be automated using Databricks Workflows or Azure Data Factory. This ensures the ingestion and transformation processes are consistent and run on predefined schedules.
Here’s a sample setup of a Databricks job to automate this pipeline:
# Job configuration (pseudo code, loosely following the Databricks Jobs API 2.1 payload)
job = {
    "name": "BronzeToSilverPipeline",
    "schedule": {
        "quartz_cron_expression": "0 0 2 * * ?",  # Daily at 2 AM (Quartz syntax)
        "timezone_id": "UTC"
    },
    "tasks": [
        {
            "task_key": "ingest_raw_data",
            "notebook_task": {"notebook_path": "/Repos/IngestRawData"}
        },
        {
            "task_key": "transform_to_silver",
            "notebook_task": {"notebook_path": "/Repos/TransformToSilver"},
            "depends_on": [{"task_key": "ingest_raw_data"}]
        }
    ]
}
This ensures each pipeline step runs in sequence and that data is consistently updated from raw to bronze to silver layers.
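To register this configuration as an actual job, one option is to post it to the Jobs 2.1 REST endpoint. A hedged sketch with placeholder workspace URL and token, assuming serverless job compute is available (otherwise a cluster specification must be added to each task):
import requests

DATABRICKS_HOST = "https://<your-workspace>.azuredatabricks.net"  # placeholder workspace URL
DATABRICKS_TOKEN = "<personal-access-token>"                      # placeholder credential

# Create the job from the configuration defined above
response = requests.post(
    f"{DATABRICKS_HOST}/api/2.1/jobs/create",
    headers={"Authorization": f"Bearer {DATABRICKS_TOKEN}"},
    json=job,
)
response.raise_for_status()
print("Created job with ID:", response.json()["job_id"])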
Conclusion
Ingesting and transforming data within Databricks using Unity Catalog offers a scalable and governed approach for managing data pipelines. By following this shared bronze layer and independent silver layers architecture, organizations can ensure data reliability, performance, and security. Leveraging Delta Lake and Unity Catalog features like access control and auditing ensures compliance while providing the flexibility to transform and prepare data for diverse use cases.
This architecture can be expanded further with additional layers (e.g., gold layer for final analytics) and extended with machine learning models. By managing all data in catalog-based tables, you create a unified environment with centralized governance and control.
Some reference links to help you build data pipelines using Databricks, Delta Lake, PySpark, and Unity Catalog:
Unity Catalog Overview
Delta Lake Guide
Auto Loader for Data Ingestion
PySpark for Data Transformations
Databricks Jobs and Workflows
Azure Data Lake Integration