From Raw Data to Analytics-Ready: Getting Started with dbt on Databricks SQL
- Gauravsingh Pundir

- Feb 10
- 7 min read
Updated: Feb 25
This blog series explores analytics engineering with dbt on Databricks SQL, covering everything from foundational concepts to production-ready pipelines.
Part 1 – Introduction to dbt with Databricks SQL
Introduces dbt fundamentals and demonstrates real-world integration with Databricks SQL.
Part 2 – Analytics Engineering in Practice: Real-World Data Modeling with dbt
Examines proven data modeling approaches and best practices drawn from real-world implementations.
Part 3 – From Development to Production: Building Enterprise-Grade dbt Pipelines on Databricks SQL
Focuses on building robust, enterprise-grade data transformation pipelines using dbt and Databricks SQL.
Introduction
As organizations push toward data democratization, the ability to enable self-service development has become essential. dbt is designed to simplify and streamline development for analysts, engineers, and other users who need to write logic that transforms or prepares data for analytics.
If you use Databricks, and especially Databricks SQL as the data warehouse or query engine for analytics in your cloud data platform, then dbt (data build tool) and Databricks SQL form a powerful duo that enables data teams to deliver analytics faster and with confidence.
This blog series explores our journey implementing a production-grade dbt project on Databricks SQL, sharing insights from building a sophisticated data warehouse with many models across multiple business domains. From modeling best practices to deployment strategies, you will discover how this combination helps organizations move faster while maintaining data quality and governance, and how it streamlines and standardizes the transformation development process for analysts and engineers alike.
This blog will provide insights on:
What is dbt and its building blocks
How to set up dbt in your local IDE environment and connect it to Databricks
The software development process with dbt
Note: dbt Core works with all major cloud data warehouses and engines, including Databricks SQL, Snowflake, and Microsoft Fabric. If you use one of those other platforms, most of the best practices and lessons learned shared here will still apply.
Quick Intro to Databricks and Databricks SQL
Databricks is a unified analytics platform built on Apache Spark that provides a collaborative environment for data engineering, data science, and analytics.
Databricks SQL is the SQL analytics service within Databricks that provides:
Serverless SQL Warehouses: On-demand compute for SQL workloads
Unity Catalog: Unified governance for data and AI assets
Delta Lake Integration: ACID transactions and time travel capabilities
Collaborative Workspace: Shared notebooks, dashboards, and SQL queries
The platform's ability to handle both batch and streaming workloads, combined with its enterprise-grade security and governance features, makes it an ideal foundation for modern data architectures.
What is dbt and Its Building Blocks?
dbt (data build tool) is a transformation framework that enables data teams to build, test, and deploy analytics code using SQL and version control. It operates on the principle of analytics engineering – treating data transformations as software development.
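To make this concrete: a dbt model is simply a SELECT statement in a .sql file. dbt compiles it, wraps it in the DDL needed to create a view or table, and infers dependencies from the Jinja functions it contains. A minimal sketch (the model, source, and column names here are hypothetical):

```sql
-- models/staging/stg_customers.sql (hypothetical model)
-- dbt turns this SELECT into a view or table in the warehouse;
-- {{ source(...) }} and {{ ref(...) }} wire up lineage between objects.
SELECT
    id   AS customer_id,
    name AS customer_name
FROM {{ source('crm', 'customers') }}
```

Because every model is just a SELECT under version control, transformations can be reviewed, tested, and deployed like any other software artifact.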
Key Components of dbt
A dbt project usually consists of the following components:
Profiles
Connection profiles are defined in the profiles.yml file, which specifies the connection details of your data warehouse (Databricks SQL, Microsoft Fabric, Snowflake, Synapse, etc.). Note that profiles.yml holds connection configuration only; the project itself is configured in dbt_project.yml.
In a real company, there are typically multiple environments in the data platform, such as dev, uat, and prod. In that case, you should define multiple connection profiles in profiles.yml, one for each platform environment. dbt can connect to Databricks using OAuth or a service principal, with data access governed through Unity Catalog.
File location (Windows):
C:\Users\<your-user-name>\.dbt\profiles.yml
File location (macOS/Linux):
~/.dbt/profiles.yml
Example: Databricks profiles with real-world authentication options
Below is an example profiles.yml that demonstrates common Databricks authentication patterns used in companies.
1. OAuth authentication (recommended for local development)
OAuth is typically used by data engineers and analytics engineers when running dbt locally from their IDE. Authentication happens via the browser and uses the user’s Databricks identity.
usecase_name:
  target: dev
  outputs:
    dev:
      type: databricks
      host: https://adb-123456789012.3.azuredatabricks.net
      http_path: /sql/1.0/warehouses/abc123def456
      catalog: dev_catalog
      schema: analytics
      threads: 4
      auth_type: oauth

Key points:
No secrets are stored locally.
Uses the logged-in Databricks user.
Best suited for local development and experimentation.
Access is controlled via Unity Catalog permissions.
2. Service Principal authentication (common for CI/CD and production)
Service principals are commonly used for automated dbt runs in CI/CD pipelines, scheduled jobs, and production deployments.
usecase_name:
  target: prod
  outputs:
    prod:
      type: databricks
      host: https://adb-123456789012.3.azuredatabricks.net
      http_path: /sql/1.0/warehouses/abc123def456
      catalog: prod_catalog
      schema: analytics
      auth_type: oauth
      client_id: "{{ env_var('DBT_DATABRICKS_CLIENT_ID') }}"
      client_secret: "{{ env_var('DBT_DATABRICKS_CLIENT_SECRET') }}"
      tenant_id: "{{ env_var('DBT_DATABRICKS_TENANT_ID') }}"

Key points:
Uses an Azure AD service principal.
Secrets are injected via environment variables, not hardcoded.
Ideal for production-grade dbt runs.
Permissions are managed centrally in Unity Catalog.
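In a CI/CD pipeline, these environment variables are injected from the pipeline's secret store before dbt runs. A minimal sketch of such a step (the values are placeholders, and the dbt invocation is shown commented out):

```shell
# Hypothetical CI step: expose service-principal credentials as the
# environment variables referenced by env_var() in profiles.yml.
# In a real pipeline the values come from a secret store, never from code.
export DBT_DATABRICKS_CLIENT_ID="00000000-0000-0000-0000-000000000000"
export DBT_DATABRICKS_CLIENT_SECRET="placeholder-secret"
export DBT_DATABRICKS_TENANT_ID="11111111-1111-1111-1111-111111111111"

# dbt run --target prod   # dbt resolves the credentials when parsing profiles.yml
```

Keeping credentials out of the repository means the same profiles.yml can be committed safely and reused across environments.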
3. Unity Catalog access (important in enterprise setups)
When Unity Catalog is enabled, you must explicitly specify the catalog and schema. These control data access and governance.
usecase_name:
  target: uat
  outputs:
    uat:
      type: databricks
      host: https://adb-123456789012.3.azuredatabricks.net
      http_path: /sql/1.0/warehouses/abc123def456
      catalog: uat_catalog
      schema: analytics
      threads: 4
      auth_type: oauth

Key points:
catalog maps to the Unity Catalog catalog.
schema maps to the Unity Catalog schema.
Access is enforced using Unity Catalog grants, not dbt.
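Because access is enforced by Unity Catalog rather than by dbt, the corresponding grants live in Databricks. A sketch of what an administrator might run in Databricks SQL (the catalog, schema, and group names are hypothetical):

```sql
-- Hypothetical Unity Catalog grants; dbt only uses the access these provide.
GRANT USE CATALOG ON CATALOG dev_catalog TO `analytics_engineers`;
GRANT USE SCHEMA ON SCHEMA dev_catalog.analytics TO `analytics_engineers`;
GRANT SELECT ON SCHEMA dev_catalog.analytics TO `analytics_engineers`;
GRANT CREATE TABLE ON SCHEMA dev_catalog.analytics TO `analytics_engineers`;
```

Managing permissions at the catalog and schema level keeps governance centralized, while dbt simply builds into whatever locations its identity is allowed to write.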
Models
Models are the most important piece of a dbt project. It is a best practice to structure dbt projects using the following layered approach:
Staging
According to dbt, staging models should be the building blocks for all subsequent analytics work. These models are directly connected to sources.
For details:
Staging: Preparing our atomic building blocks (dbt Developer Hub)
Intermediate (optional)
This is an optional layer and contains models that serve a specific transformation purpose. According to dbt, intermediate models are meant for isolating complex operations, for re‑graining (grouping, pivoting, etc.), and structural simplification.
In addition, intermediate models are used to materialize data when your SQL logic is very complex. By materializing this logic into intermediate tables, you can:
Break down complex transformations into simpler steps
Reuse the same intermediate tables in multiple downstream models
For details, refer to:
Intermediate: Purpose-built transformation steps (dbt Developer Hub)
Introduce the intermediate layer depending on the complexity of transformations.
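As an illustration, an intermediate model often re-grains data once so that several marts can reuse the result. A minimal sketch (the model and column names are hypothetical):

```sql
-- models/intermediate/int_orders_regrained.sql (hypothetical)
-- Re-grain order lines to one row per order, materialized once
-- so multiple downstream marts can reuse it.
{{ config(materialized = 'table') }}

SELECT
    order_id,
    SUM(line_amount) AS order_amount,
    COUNT(*)         AS line_count
FROM {{ ref('stg_order_lines') }}
GROUP BY order_id
```

Downstream marts then select from {{ ref('int_orders_regrained') }} instead of each repeating the aggregation.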
Marts
These are business-defined models that are usually connected to BI tools like Power BI.
For in-depth documentation on marts, see Marts: Business-defined entities (dbt Developer Hub).
Macros
Macros are reusable pieces of SQL logic, written using the Jinja2 templating language, that behave much like functions.
Our project includes examples like:
set_table_owner(): Automatically sets table ownership
log_run_results(): Captures execution metadata
They are used to:
Reduce redundancy
Standardize logic
Simplify complex logic
Improve reusability
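To illustrate the pattern (the project's own macros are not reproduced here), a small hypothetical macro might look like this:

```sql
-- macros/limit_in_dev.sql (hypothetical macro, Jinja + SQL)
-- Appends a row limit only when running against the dev target,
-- keeping local runs fast without affecting production builds.
{% macro limit_in_dev(row_count=1000) %}
    {% if target.name == 'dev' %}
        LIMIT {{ row_count }}
    {% endif %}
{% endmacro %}
```

A model can then end with {{ limit_in_dev(500) }}; the clause compiles to nothing outside the dev target.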
For details, refer to the Jinja and macros documentation on the dbt Developer Hub.
Note Detailed macro best practices and more advanced patterns will be covered in a later blog post.
Tests
dbt provides generic, out-of-the-box data quality tests, for example:
Uniqueness and not-null constraints (unique, not_null)
Referential integrity (relationships)
Accepted values (accepted_values)
dbt can also check source data freshness via freshness configurations on sources.
You can also build custom business logic tests.
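Custom business-logic tests are plain SELECT statements stored under tests/; dbt considers the test failed if the query returns any rows. A hypothetical example:

```sql
-- tests/assert_no_negative_amounts.sql (hypothetical singular test)
-- The test fails if any order has a negative amount.
SELECT order_id, order_amount
FROM {{ ref('fct_orders') }}
WHERE order_amount < 0
```

Running dbt test executes this query alongside the generic tests defined in YAML.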
For further details, refer to the data tests documentation on the dbt Developer Hub.
We will discuss tests, seeds, and other components further in this series.
How to Set Up a dbt Core Project on Your Local IDE Environment
Prerequisites
Python Environment
pip install dbt-databricks
pip install dbt-core

Note: dbt-core is the open-source version of dbt. Installing dbt-databricks pulls in a compatible dbt-core as a dependency, so the second command is usually optional.
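Teams often pin these dependencies in a requirements.txt so every developer installs the same versions. An illustrative example (the version numbers below are assumptions; pin to the releases your team has validated):

```
# requirements.txt (illustrative pins)
dbt-core~=1.8
dbt-databricks~=1.8
```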
Databricks Connection
Service principal authentication
OAuth for development environments
Unity Catalog access
As described in the Profiles section above.
Project Structure
A typical dbt project structure:
dbt_project/
├── dbt_project.yml # Project configuration
├── profiles.yml # Connection profiles
├── packages.yml # External dependencies
├── models/ # SQL models
│ ├── staging/ # Raw data layer
│ ├── intermediate/ # Intermediate materialization of complex logic,
│ │ # reused by multiple downstream models
│ └── marts/ # Presentation layer
├── macros/ # Reusable functions
├── tests/ # Data quality tests
└── seeds/ # Reference data

Step-by-Step Development Process for dbt
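Before stepping through the workflow, note the dbt_project.yml at the project root, which ties the structure above together. A minimal sketch (the names are hypothetical, and the profile value must match the profile name in profiles.yml):

```yaml
# dbt_project.yml (minimal sketch)
name: usecase_name
version: "1.0.0"
profile: usecase_name      # must match the profile name in profiles.yml

model-paths: ["models"]

models:
  usecase_name:
    staging:
      +materialized: view  # staging defaults to lightweight views
    marts:
      +materialized: table # marts are persisted for BI consumption
```

Folder-level +materialized defaults like these can still be overridden per model with a config() block.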
Step 1: Clone the Repository
git clone your_repo
cd your_repo

Result: Project files are available locally.
Step 2: Open the Project in VS Code
code .

Result: The project opens in the Visual Studio Code workspace.
Step 3: Create and Activate a Virtual Environment
python -m venv venv
.\venv\Scripts\Activate.ps1
On macOS/Linux, activate with: source venv/bin/activate
If PowerShell blocks the activation script, allow locally created scripts first:
Set-ExecutionPolicy RemoteSigned -Scope CurrentUser

Result: You now have an isolated Python environment.
Step 4: Install Python Dependencies
pip install -r requirements.txt

Result: dbt and other required packages are installed.
Step 5: Create Local profiles.yml
Create or update your local profiles.yml file as described in the Profiles section above.
These steps (1–5) are required only the first time you set up the project on your machine.
Step 6: Create a Feature Branch
git checkout -b feature/your_new_branch

Work in a dedicated feature branch to isolate your changes and avoid conflicts.
Step 7: Create the Model File
Path: models/stage/dim/dim_company.sql
{{ config(
    materialized = 'view',
    alias = 'dim_company',
    file_format = 'delta'
) }}

SELECT ID, NAME
FROM {{ source('data_source', 'table_name') }}

This SQL logic creates a view selecting company master data from the defined table.
How to set up the source table is described in Step 9.
Step 8: Add Documentation & Tests
Path: models/stage/dim/dim_company.yml
version: 2
models:
  - name: dim_company
    description: "Company master data"
    columns:
      - name: ID
        tests: [not_null, unique]
      - name: NAME
        description: "Company name"

Adding documentation improves model transparency, and tests help maintain data quality.
Step 9: Define the Source Table
Path: models/stage/sources/source_stage.yml
version: 2
sources:
  - name: data_source # Logical name used for reference in dbt
    database: databricks_catalog_name
    schema: schema_name
    tables:
      - name: table_name

This source definition tells dbt where to find the raw (curated) table.
Step 10: Run and Test the Model
dbt run --select stage.dim.dim_company
dbt test --select stage.dim.dim_company

Run the model to build it in the target database and then test it to validate your logic.
Step 11: Generate and View Documentation
dbt docs generate
dbt docs serve

This generates a browsable documentation site for your project, showing model details, tests, and lineage.
Useful dbt commands:
dbt deps # Install packages
dbt seed # Load reference data
dbt run --select staging # Run staging models
dbt test # Execute data quality tests
dbt run -m dim_company # Run a specific model (use +dim_company to also run its upstream dependencies)

Step 12: Commit Your Work
git add models/stage/
git commit -m "add dim_company model"

Step 13: Push to Azure DevOps, GitHub, GitLab, or Another Git Remote
git push --set-upstream origin feature/your_branch_name

Conclusion
The combination of dbt and Databricks SQL has transformed our data analytics capabilities, enabling:
Faster Development: Modular, reusable transformations
Better Data Quality: Comprehensive testing and monitoring
Improved Collaboration: Version-controlled, well-documented code
Scalable Architecture: Handles complex business requirements efficiently
This powerful duo provides the foundation for modern data teams to deliver analytics faster while maintaining the accuracy and trust that organizations require in today's data-driven world.
This blog post series is based on real-world experience implementing large-scale production dbt projects with a high number of models on Databricks SQL across different business areas, showcasing practical insights and lessons learned from building scalable data analytics solutions.
Stay tuned for Part 2 – Analytics Engineering in Practice: Real-World Data Modeling with dbt.
#dbt #dbt-core #databricks #databricks-sql #analytics-engineering #data-democratization #data-transformation #data-warehouse #data-modeling #unity-catalog #delta-lake #best-practices #modern-data-stack
