Real-World Data Management Using Databricks, DBT, and ADF
Let’s explore a setup to manage data using Databricks, DBT (Data Build Tool), Azure Data Lake Storage (ADLS), and Azure Data Factory (ADF) in a real-world scenario.
Scenario: E-Commerce Analytics Platform
Imagine we’re working for an e-commerce company, and we need to manage and analyze data related to customer interactions, transactions, and website clicks. Here’s how we’d handle the entire process using these modern tools.
1. Data Ingestion with Azure Data Factory (ADF)
First, the raw data coming from various sources like transaction logs, customer profiles, and clickstreams needs to be stored in a data lake. Azure Data Factory (ADF) is a great tool for this.
- ADF helps us pull data from multiple sources (APIs, databases, files) into Azure Data Lake Storage (ADLS), which acts as our data lake.
- We can set up pipelines in ADF that regularly fetch new data and store it in ADLS, keeping the data up to date.
For example, a pipeline can be set up to move daily transaction data into ADLS in formats like CSV or JSON.
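Authoring and scheduling the pipeline itself happens in the ADF UI (or its JSON definition), but runs can also be triggered programmatically. Below is a minimal, illustrative sketch using the azure-mgmt-datafactory Python SDK; the subscription, resource group, factory, and pipeline names are placeholders rather than values from this project:

# Trigger an existing ADF pipeline run from Python (names below are illustrative)
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient

credential = DefaultAzureCredential()
adf_client = DataFactoryManagementClient(credential, "<subscription-id>")

run = adf_client.pipelines.create_run(
    resource_group_name="rg-ecommerce",
    factory_name="adf-ecommerce",
    pipeline_name="copy_daily_transactions",
)
print(f"Started pipeline run: {run.run_id}")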
2. Storing Data in Azure Data Lake Storage (ADLS)
Now, we have raw data being continuously stored in ADLS. This data lake stores both structured (e.g., transaction records) and unstructured data (e.g., user interaction logs).
ADLS lets us scale storage as the amount of data grows, and it is cost-effective for large datasets. In the data lake, files can be stored in directories like:
/raw/transactions/
/raw/customer_profiles/
/raw/click_stream/
At this point, the data is still raw and needs to be cleaned and transformed for analysis.
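Before Databricks can read these paths, the ADLS container is typically mounted (or accessed directly via an abfss:// URI). The snippet below is a sketch of a service-principal mount, assuming a secret scope named adls-scope; the storage account, tenant, and application IDs are placeholders:

# Mount the ADLS container so it is reachable at /mnt/adls (illustrative values)
dbutils.fs.mount(
    source="abfss://datalake@<storage-account>.dfs.core.windows.net/",
    mount_point="/mnt/adls",
    extra_configs={
        "fs.azure.account.auth.type": "OAuth",
        "fs.azure.account.oauth.provider.type":
            "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
        "fs.azure.account.oauth2.client.id": "<application-id>",
        "fs.azure.account.oauth2.client.secret": dbutils.secrets.get("adls-scope", "sp-secret"),
        "fs.azure.account.oauth2.client.endpoint":
            "https://login.microsoftonline.com/<tenant-id>/oauth2/token",
    },
)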
3. Transforming Data Using Databricks and DBT
Here comes the power of Databricks and DBT.
- Databricks: Databricks provides a unified platform for big data processing. Using Apache Spark on Databricks, we can easily process the raw data stored in ADLS. It integrates well with ADLS, so we can read and write data efficiently.
- DBT: While Databricks is great for processing, DBT helps us automate and manage the transformation logic. DBT lets us write SQL queries that transform data into the structure we need for analytics, and it keeps everything version-controlled.
Let’s walk through an example of how we would clean and transform transaction data.
Example Workflow:
- Reading raw data: We use Databricks to read the raw data from ADLS:
# Read the raw transaction CSVs from ADLS (assuming the files have a header row)
raw_transactions_df = spark.read.option("header", True).csv("/mnt/adls/raw/transactions/")
- Writing a DBT model: In DBT, we create a model that cleans the data. A DBT model is essentially a SQL query that transforms raw data into a usable format:
-- models/cleaned_transactions.sql
SELECT
    transaction_id,
    customer_id,
    CAST(amount AS DOUBLE) AS amount,
    status
FROM raw.transactions
WHERE status = 'completed'
- Running DBT: We run the DBT model, and DBT handles the transformation in Databricks:
dbt run
This creates a new cleaned table, cleaned_transactions, in Delta Lake, the transactional storage layer used by Databricks.
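The dbt-databricks adapter materializes models as Delta tables by default; the materialization can also be stated explicitly at the top of the model. The config header below is an illustrative sketch, not something prescribed by the original project:

-- models/cleaned_transactions.sql (optional explicit config, assuming the dbt-databricks adapter)
{{ config(materialized='table', file_format='delta') }}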
4. Storing Transformed Data in Delta Lake
Delta Lake adds important features like ACID transactions and data versioning (time travel) to the data stored in ADLS. So, after DBT transforms the data, it is stored as a Delta table.
In this case, our cleaned transaction data might be saved in Delta Lake at:
/mnt/adls/delta/cleaned_transactions/
Delta Lake makes it easy for analysts to query this transformed data and trust its accuracy because of its strong transactional guarantees.
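To make that concrete, here is a short PySpark sketch (using the illustrative path above) that reads the Delta table back and inspects its version history, which is what underpins those guarantees:

# Read the cleaned Delta table and check its transaction history (illustrative paths)
from delta.tables import DeltaTable

cleaned = spark.read.format("delta").load("/mnt/adls/delta/cleaned_transactions/")
cleaned.selectExpr("count(*) AS completed_transactions", "sum(amount) AS total_amount").show()

# Every write is an ACID-committed version that can be audited or time-travelled to
DeltaTable.forPath(spark, "/mnt/adls/delta/cleaned_transactions/").history().show()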
5. Metadata Management with Hive Metastore
The transformed data is registered in the Hive Metastore, which is part of Databricks. This metastore acts like a catalog that tracks where the data is stored and what its structure looks like.
For example, the cleaned transaction data is now a table called cleaned_transactions, and the Hive Metastore tracks this table so anyone using Databricks can easily query it.
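Registration can happen automatically through DBT, or the Delta location can be registered by hand. A minimal sketch of the manual route, using the same illustrative path as above:

# Register the Delta files as a metastore table so it can be queried by name
spark.sql("""
    CREATE TABLE IF NOT EXISTS cleaned_transactions
    USING DELTA
    LOCATION '/mnt/adls/delta/cleaned_transactions/'
""")
spark.sql("SELECT COUNT(*) FROM cleaned_transactions").show()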
6. Data Governance with Unity Catalog
As the amount of data grows and more people start using it, we need to make sure that the data is secure and governed properly. Unity Catalog in Databricks helps with this.
- It ensures that the right people have the right access (for example, analysts can only view data, while engineers can edit it); a sketch of what such grants look like follows this list.
- It also tracks the lineage of data, so we know where the data comes from and how it’s been transformed.
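Here is what that access-control piece might look like as grants. This sketch assumes the table has been promoted into a Unity Catalog schema and that groups named data_analysts and data_engineers exist; the three-level table name is illustrative:

# Unity Catalog grants (catalog, schema, and group names are illustrative)
spark.sql("GRANT SELECT ON TABLE main.ecommerce.cleaned_transactions TO `data_analysts`")
spark.sql("GRANT SELECT, MODIFY ON TABLE main.ecommerce.cleaned_transactions TO `data_engineers`")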
Conclusion
By combining Databricks, DBT, ADF, and ADLS, we can build a powerful, scalable data management system for our e-commerce platform. Here’s a quick summary of what each tool does:
- Azure Data Factory (ADF): Ingests raw data into ADLS.
- Azure Data Lake Storage (ADLS): Stores raw and transformed data.
- Databricks: Processes and transforms large datasets using Spark.
- DBT: Manages SQL-based transformations in Databricks.
- Delta Lake: Stores transformed data with transactional guarantees.
- Hive Metastore: Tracks the metadata for all tables and datasets.
- Unity Catalog: Ensures governance and security.
End-to-End Project Using Databricks and Unity Catalog
Concepts Covered:
- Delta Lake as the storage format, since it is the format recommended for real-time projects
- Unity Catalog for access control
- Spark Structured Streaming
- Batch processing mode
- CI/CD using Azure DevOps
Drawbacks of ADLS
ADLS != Database
Relational databases are ACID-compliant, so they store quality data. That quality can’t be guaranteed in ADLS on its own.
Delta Lake adds these guarantees on top of the data lake, turning it into a lakehouse.
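A quick way to see the difference: Delta enforces the table’s schema on write, whereas plain files dropped into ADLS will accept anything. The sketch below is self-contained and uses a throwaway path, not project data:

# Delta rejects writes that don't match the table schema (illustrative throwaway path)
good = spark.createDataFrame([(1, 100.0)], "transaction_id INT, amount DOUBLE")
good.write.format("delta").mode("overwrite").save("/tmp/demo_delta")

bad = spark.createDataFrame([("x", "oops")], "transaction_id STRING, amount STRING")
# This append fails with a schema-mismatch error instead of silently corrupting the table
bad.write.format("delta").mode("append").save("/tmp/demo_delta")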
Importing the project (.dbc) file
Importing the .dbc file brings all of the notebooks into your workspace.