
From Raw Data to Analytics-Ready: Getting Started with dbt on Databricks SQL

Updated: Feb 25

This blog series explores analytics engineering with dbt on Databricks SQL, covering everything from foundational concepts to production-ready pipelines.


  • Part 1 – Introduction to dbt with Databricks SQL

    Introduces dbt fundamentals and demonstrates real-world integration with Databricks SQL.


  • Part 2 – Analytics Engineering in Practice: Real-World Data Modeling with dbt

    Examines proven data modelling approaches and best practices drawn from real-world implementations.


  • Part 3 – From Development to Production: Building Enterprise-Grade dbt Pipelines on Databricks SQL

    Focuses on building robust, enterprise-grade data transformation pipelines using dbt and Databricks SQL.



Introduction


As organizations push toward data democratization, enabling self-service development has become essential. dbt is designed to simplify and streamline data solution development for analysts, engineers, and other users who need to write logic that transforms or prepares data for analytics.


If you use Databricks, and especially Databricks SQL, as the data warehouse or query engine for analytics in your cloud data platform, then dbt (data build tool) and Databricks SQL form a powerful duo that enables data teams to deliver analytics faster and with confidence.


This blog series explores our journey implementing a production-grade dbt project on Databricks SQL, sharing insights from building a sophisticated data warehouse with many models across multiple business domains. From modeling best practices to deployment strategies, you will discover how this powerful combination helps organizations move faster while maintaining data quality and governance, and how it streamlines and standardizes the data transformation development process for analysts, engineers, and other users alike.


This blog will provide insights on:

  • What dbt is and its building blocks

  • How to set up dbt in your local IDE environment and connect to Databricks

  • The software development process with dbt


Note: dbt Core works with all major cloud data warehouses and engines, such as Databricks SQL, Snowflake, and Microsoft Fabric. If you use one of those other platforms, most of the best practices and lessons learned shared here will also apply.


Quick Intro to Databricks and Databricks SQL


Databricks is a unified analytics platform built on Apache Spark that provides a collaborative environment for data engineering, data science, and analytics.


Databricks SQL is the SQL analytics service within Databricks that provides:


  • Serverless SQL Warehouses: On-demand compute for SQL workloads

  • Unity Catalog: Unified governance for data and AI assets

  • Delta Lake Integration: ACID transactions and time travel capabilities

  • Collaborative Workspace: Shared notebooks, dashboards, and SQL queries


The platform's ability to handle both batch and streaming workloads, combined with its enterprise-grade security and governance features, makes it an ideal foundation for modern data architectures.



What is dbt and Its Building Blocks?


dbt (data build tool) is a transformation framework that enables data teams to build, test, and deploy analytics code using SQL and version control. It operates on the principle of analytics engineering – treating data transformations as software development.


Key Components of dbt


A dbt project usually consists of the following components:


Profiles


Connection profiles are defined in the profiles.yml file, which specifies the connection details of your data warehouse (Databricks SQL, Microsoft Fabric, Snowflake, Synapse, etc.).


In a real company, the data platform typically has multiple environments, such as dev, uat, and prod. In that case, define multiple connection targets in profiles.yml, one for each environment. You can configure dbt to connect to Databricks using OAuth or a service principal, and with Unity Catalog access.


File location (Windows):

C:\Users\<your-user-name>\.dbt\profiles.yml

Example: Databricks profiles with real-world authentication options


Below are example profiles.yml configurations that demonstrate common Databricks authentication patterns used in companies.


1. OAuth authentication (common for local development)


OAuth is typically used by data engineers and analytics engineers when running dbt locally from their IDE. Authentication happens via the browser and uses the user's Databricks identity.

usecase_name:
  target: dev
  outputs:
    dev:
      type: databricks
      host: https://adb-123456789012.3.azuredatabricks.net
      http_path: /sql/1.0/warehouses/abc123def456
      catalog: dev_catalog
      schema: analytics
      threads: 4
      auth_type: oauth

Key points:

  • No secrets are stored locally.

  • Uses the logged-in Databricks user.

  • Best suited for local development and experimentation.

  • Access is controlled via Unity Catalog permissions.


2. Service Principal authentication (common for CI/CD and production)


Service principals are commonly used for automated dbt runs in CI/CD pipelines, scheduled jobs, and production deployments.

usecase_name:
  target: prod
  outputs:
    prod:
      type: databricks
      host: https://adb-123456789012.3.azuredatabricks.net
      http_path: /sql/1.0/warehouses/abc123def456
      catalog: dev_catalog
      schema: analytics
      auth_type: oauth
      client_id: "{{ env_var('DBT_DATABRICKS_CLIENT_ID') }}"
      client_secret: "{{ env_var('DBT_DATABRICKS_CLIENT_SECRET') }}"
      tenant_id: "{{ env_var('DBT_DATABRICKS_TENANT_ID') }}"

Key points:

  • Uses an Azure AD service principal.

  • Secrets are injected via environment variables, not hardcoded.

  • Ideal for production-grade dbt runs.

  • Permissions are managed centrally in Unity Catalog.


3. Unity Catalog access (important in enterprise setups)


When Unity Catalog is enabled, you must explicitly specify the catalog and schema. These control data access and governance.

usecase_name:
  target: uat
  outputs:
    uat:
      type: databricks
      host: https://adb-123456789012.3.azuredatabricks.net
      http_path: /sql/1.0/warehouses/abc123def456
      catalog: dev_catalog
      schema: analytics
      threads: 4
      auth_type: oauth

Key points:

  • catalog maps to the Unity Catalog catalog.

  • schema maps to the Unity Catalog schema.

  • Access is enforced using Unity Catalog grants, not dbt.


Models


Models are the most important piece in a dbt project. It is a best practice to structure dbt projects based on the following layered approach:


  • Staging 


    According to dbt, staging models should be the building blocks for all subsequent analytics work. These models are directly connected to sources.

    For details:

    Staging: Preparing our atomic building blocks | dbt Developer Hub


  • Intermediate (optional)


    This is an optional layer and contains models that serve a specific transformation purpose. According to dbt, intermediate models are meant for isolating complex operations, for re‑graining (grouping, pivoting, etc.), and structural simplification.

    In addition, intermediate models are used to materialize data when your SQL logic is very complex. By materializing this logic into intermediate tables, you can:

    • Break down complex transformations into simpler steps

    • Reuse the same intermediate tables in multiple downstream models

    For details, refer to:

    Intermediate: Purpose-built transformation steps | dbt Developer Hub


    Introduce the intermediate layer depending on the complexity of your transformations; a small illustrative example is shown after this list.


  • Marts 


    These are business-defined models that are usually connected to BI tools like Power BI.

    For an in-depth documentation on marts, see:

    Marts: Business-defined entities | dbt Developer Hub
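
To make the layering concrete, below is a minimal sketch of an intermediate model that combines staging models into a reusable aggregate. The orders model (stg_orders) and its columns are hypothetical and only serve to illustrate how ref() chains the layers together.

-- models/intermediate/int_company_order_counts.sql (illustrative sketch)
-- Joins the staging company model to a hypothetical staging orders model and
-- materializes the aggregate so that multiple downstream marts can reuse it.
{{ config(materialized = 'table') }}

SELECT
    c.ID        AS company_id,
    c.NAME      AS company_name,
    COUNT(o.ID) AS order_count
FROM {{ ref('dim_company') }} AS c
LEFT JOIN {{ ref('stg_orders') }} AS o
    ON o.COMPANY_ID = c.ID
GROUP BY c.ID, c.NAME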


Macros


Macros are reusable code / SQL functions written using the Jinja2 templating language.


Our project includes examples like the following (a simplified sketch appears after the lists below):

  • set_table_owner(): Automatically sets table ownership

  • log_run_results(): Captures execution metadata


They are used to:

  • Reduce redundancy

  • Standardize logic

  • Simplify complex logic

  • Improve reusability
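
As an illustration, here is a minimal sketch of what a macro such as set_table_owner() might look like. This is not the project's actual implementation; the ALTER TABLE statement and the owner principal are assumptions for a Databricks / Unity Catalog setup.

-- macros/set_table_owner.sql (illustrative sketch)
{% macro set_table_owner(owner) %}
    {% if execute %}
        {% set sql %}
            ALTER TABLE {{ this }} OWNER TO `{{ owner }}`
        {% endset %}
        {% do run_query(sql) %}
    {% endif %}
{% endmacro %}

A macro like this would typically be attached as a post-hook in the model configuration, for example post_hook="{{ set_table_owner('analytics_admins') }}", where the group name is again hypothetical.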


For details, refer to the dbt documentation on Jinja and macros.


Note: Detailed macro best practices and more advanced patterns will be covered in a later blog post.

Tests


dbt provides generic, out‑of‑the‑box data quality tests to check:

  • Uniqueness constraints

  • Referential integrity

  • Data freshness


You can also build custom business-logic tests, as sketched below.
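
For example, a singular test is simply a SQL file placed in the tests/ folder; dbt marks the test as failed if the query returns any rows. The rule below (IDs must be present and positive) is purely illustrative.

-- tests/assert_dim_company_valid_ids.sql (illustrative sketch)
-- Fails if any company record has a missing or non-positive ID.
SELECT ID
FROM {{ ref('dim_company') }}
WHERE ID IS NULL OR ID <= 0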


For further details, refer to the dbt documentation on tests.


We will discuss tests, seeds, and other components further in this series.



How to Set Up a dbt Core Project on Your Local IDE Environment


Prerequisites


  1. Python Environment

pip install dbt-databricks
pip install dbt-core
Note: dbt-core is the open-source version of dbt.

  2. Databricks Connection

    • Service principal authentication

    • OAuth for development environments

    • Unity Catalog access


    These are configured as described in the Profiles section above.


Project Structure


A typical dbt project structure:


dbt_project/
├── dbt_project.yml      # Project configuration
├── profiles.yml         # Connection profiles
├── packages.yml         # External dependencies
├── models/              # SQL models
│   ├── staging/         # Raw data layer
│   ├── intermediate/    # Intermediate materialization of complex logic,
│   │                    # reused by multiple downstream models
│   └── marts/           # Presentation layer
├── macros/              # Reusable functions
├── tests/               # Data quality tests
└── seeds/               # Reference data
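
The structure above is tied together by dbt_project.yml. Below is a minimal sketch; the project name, paths, and default materializations are illustrative and will differ in a real project.

# dbt_project.yml (illustrative minimal sketch)
name: usecase_name
version: "1.0.0"
config-version: 2
profile: usecase_name          # must match the profile name in profiles.yml

model-paths: ["models"]
macro-paths: ["macros"]
test-paths: ["tests"]
seed-paths: ["seeds"]

models:
  usecase_name:
    staging:
      +materialized: view      # staging models as lightweight views
    marts:
      +materialized: table     # marts materialized for BI consumption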

Step-by-Step Development Process with dbt


Step 1: Clone the Repository

git clone your_repo
cd your_repo

Result: Project files are available locally.



Step 2: Open the Project in VS Code

code .

Result: The project opens in the Visual Studio Code workspace.



Step 3: Create and Activate a Virtual Environment

python -m venv venv
.\venv\Scripts\Activate.ps1

If needed:

Set-ExecutionPolicy RemoteSigned -Scope CurrentUser

Result: You now have an isolated Python environment.



Step 4: Install Python Dependencies

pip install -r requirements.txt

Result: dbt and other required packages are installed.



Step 5: Create Local profiles.yml


Create or update your local profiles.yml file as described in the Profiles section above.

These steps (1–5) are required only the first time you set up the project on your machine.



Step 6: Create a Feature Branch

git checkout -b feature/your_new_branch

Work in a dedicated feature branch to isolate your changes and avoid conflicts.



Step 7: Create the Model File


Path: models/stage/dim/dim_company.sql

{{ config(
    materialized = 'view',
    alias = 'dim_company',
    file_format = 'delta'
) }}
SELECT ID, NAME
FROM {{ source('data_source', 'table_name') }}

This SQL logic creates a view selecting company master data from the defined table.


How to set up the source table is described in Step 9.



Step 8: Add Documentation & Tests


Path: models/stage/dim/dim_company.yml

version: 2
models:
  - name: dim_company
    description: "Company master data"
    columns:
      - name: ID
        tests: [not_null, unique]
      - name: NAME
        description: "Company name"

Adding documentation improves model transparency, and tests help maintain data quality.



Step 9: Define the Source Table


Path: models/stage/sources/source_stage.yml

version: 2
sources:
  - name: data_source        # Logical name used for reference in dbt
    database: databricks_catalog_name
    schema: schema_name
    tables:
      - name: table_name

This source definition tells dbt where to find the raw (curated) table.



Step 10: Run and Test the Model

dbt run --select stage.dim.dim_company
dbt test --select stage.dim.dim_company

Run the model to build it in the target database and then test it to validate your logic.



Step 11: Generate and View Documentation

dbt docs generate
dbt docs serve

This generates a browsable documentation site for your project, showing model details, tests, and lineage.


Useful dbt commands:

dbt deps                    # Install packages
dbt seed                    # Load reference data
dbt run --select staging    # Run staging models
dbt test                    # Execute data quality tests
dbt run --select +dim_company   # Run a specific model and its upstream dependencies


Step 12: Commit Your Work

git add models/stage/
git commit -m "add dim_company model"


Step 13: Push to Azure DevOps, GitHub, GitLab, or Another Git Repository

git push --set-upstream origin feature/your_branch_name

Conclusion


The combination of dbt and Databricks SQL has transformed our data analytics capabilities, enabling:


  • Faster Development: Modular, reusable transformations


  • Better Data Quality: Comprehensive testing and monitoring


  • Improved Collaboration: Version-controlled, well-documented code


  • Scalable Architecture: Handles complex business requirements efficiently


This powerful duo provides the foundation for modern data teams to deliver analytics faster while maintaining the accuracy and trust that organizations require in today's data-driven world.


This blog post series is based on real-world experience implementing a large-scale production dbt project with a high number of models on Databricks SQL across different business areas, showcasing practical insights and lessons learned from building scalable data analytics solutions.


Stay tuned for Part 2 – Analytics Engineering in Practice: Real-World Data Modeling with dbt.


#dbt #dbt-core #databricks #databricks-sql #analytics-engineering #data-democratization #data-transformation #data-warehouse #data-modeling #unity-catalog #delta-lake #best-practices #modern-data-stack
