
Analytics Engineering in Practice: Real-World Data Modeling with dbt

Updated: Feb 26

This is part 2 of our blog series exploring Analytics Engineering with dbt on Databricks SQL, covering everything from foundational concepts to production-ready pipelines. Refer to part 1: From Raw Data to Analytics-Ready: Getting Started with dbt on Databricks SQL



Introduction to dbt and Its Building Blocks


Comprehensive and well-structured documentation is already available on the official dbt website, particularly in the Best practice guides | dbt Developer Hub. This article is not intended to replace or replicate that material.


Instead, it aims to consolidate the most essential best practices to keep in mind when working with dbt as an analytics engineer. In addition, it highlights a set of supplementary recommendations drawn from practical experience across previous projects—insights that may not always be explicitly covered in the official documentation but can significantly improve project structure, maintainability, and scalability.


Key files in a dbt project


The list below summarizes the main files and folders in a dbt project, with their purpose and location:

  • dbt_project.yml (root of your dbt repo): The main configuration file for your dbt project. Defines project name, model paths, folder-level configs, default materializations, macros, seeds, and tests.

  • profiles.yml (user's local .dbt directory or environment): Stores connection settings (credentials, cluster/warehouse info, catalog/schema targets) for environments like dev and prod. It's usually located in ~/.dbt/profiles.yml or set via DBT_PROFILES_DIR.

  • packages.yml (project root): Lists external dbt packages (like dbt_utils, dbt_expectations, dbt_databricks_utils). dbt installs them with dbt deps.

  • models/ folder: Holds your SQL transformation models (staging, intermediate, marts). Each .sql file defines one model (table/view/incremental/ephemeral).

  • models/*.yml files (next to SQL models): Contain metadata, tests, and documentation for models in that folder (e.g., column descriptions, constraints).

  • macros/ folder: Holds custom Jinja macros that encapsulate reusable SQL or logic. Often used for date logic, incremental filters, etc.

  • seeds/ folder: Contains CSV files that dbt can load directly as database tables (via dbt seed). Great for static reference data (e.g., country codes).

  • snapshots/ folder: Defines slowly changing dimension (SCD) snapshots to track changes in source data over time.

  • tests/ folder (optional): Holds custom test definitions (SQL + macros). Standard tests live in the YAML files next to models.

  • analysis/ folder (optional): For ad-hoc analyses and exploratory SQL not part of the main DAG. These are not materialized unless explicitly run.

  • docs/ folder (optional): For manual documentation and Markdown files that integrate into dbt docs.

  • target/ folder (generated automatically): Auto-generated build artifacts: compiled SQL, run logs, and manifest files (manifest.json). Created by dbt during runs.

  • logs/ folder (generated automatically): Stores dbt's runtime logs (debug info, performance metrics).
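To make the central role of dbt_project.yml concrete, here is a minimal sketch of one; the project name, profile name, and folder-level materializations are placeholders, not prescriptions:

```yaml
# Minimal dbt_project.yml sketch (names are illustrative)
name: my_dbt_project
version: "1.0.0"
profile: my_databricks_profile   # must match an entry in profiles.yml

model-paths: ["models"]
seed-paths: ["seeds"]
macro-paths: ["macros"]
snapshot-paths: ["snapshots"]

models:
  my_dbt_project:
    staging:
      +materialized: view   # folder-level default for all staging models
    marts:
      +materialized: table  # folder-level default for all mart models
```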


dbt Models


Models form the core building blocks of any dbt project. A widely adopted best practice is to organize them using a layered architecture, as outlined below:


Staging


According to dbt guidance, staging models act as the atomic building blocks for all downstream analytics work. They are directly connected to source tables and are primarily responsible for cleaning, standardizing, and lightly transforming raw data into a consistent structure.
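Under this guidance, a staging model stays deliberately thin. A minimal sketch, assuming a hypothetical crm source with illustrative column names:

```sql
-- models/staging/crm/stg_crm__customer.sql
-- Light cleaning only: rename, cast, standardize. No joins or aggregations.
with source as (

    select * from {{ source('crm', 'customer') }}

)

select
    cast(customer_id as bigint)   as customer_id,
    trim(customer_name)           as customer_name,
    lower(email_address)          as email,
    cast(created_at as timestamp) as created_at
from source
```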


Intermediate


The intermediate layer is optional and should be introduced based on the complexity of your transformations. These models are designed for purpose-built transformation steps such as isolating complex logic, re-graining data (e.g., grouping or pivoting), and simplifying structures before they are exposed to business-facing layers.
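As an example of re-graining, an intermediate model might roll order lines up to one row per customer per day; model and column names here are hypothetical:

```sql
-- models/intermediate/supply_chain/int_supply_chain__orders_daily.sql
-- Re-grains order lines to a daily, per-customer level for downstream marts
select
    order_date,
    customer_id,
    count(distinct order_id) as order_count,
    sum(line_amount)         as total_amount
from {{ ref('stg_erp__order_lines') }}
group by
    order_date,
    customer_id
```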


Marts


Mart models represent business-defined entities and are typically the final layer consumed by BI tools such as Power BI. These models are curated for reporting and analytics use cases and reflect domain-specific logic aligned with business requirements.
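A simple mart sketch, composing upstream models into a business-facing fact; all model and column names are illustrative:

```sql
-- models/marts/finance/fact_sales.sql
-- Curated for BI consumption; depends only on intermediate/staging models
select
    o.order_date,
    c.customer_name,
    o.order_count,
    o.total_amount
from {{ ref('int_supply_chain__orders_daily') }} as o
left join {{ ref('stg_crm__customer') }} as c
    on o.customer_id = c.customer_id
```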


It is important to note that these layers are conceptual guidelines rather than strict rules. Their exact definitions can be adapted to suit your project’s needs. You are not limited to only these three layers—additional layers such as a dedicated sources (pre-staging) layer can be introduced where appropriate.


Designing for a Mostly-Linear DAG


A recommended best practice is to aim for a mostly linear DAG. In practical terms:


The goal is to keep the DAG structured and “flattened,” ensuring clear upstream and downstream relationships. In more complex projects, introducing additional transformation layers can actually improve clarity and maintainability rather than complicate the structure (conceptually depicted in Figure 1.1).


While achieving a perfectly linear DAG is not always possible—since models within the same layer may depend on each other—you should avoid patterns where models frequently bypass reusable staging or intermediate layers and reconnect directly to raw sources. Doing so reduces reusability, duplicates transformation logic, and limits scalability as data volume and complexity increase.


A well-structured dbt project emphasizes modularity, reusability, and clarity—ensuring that shared transformations are centralized and consistently leveraged across the DAG.



[Figure 1.1: Conceptual diagram of a mostly-linear dbt DAG architecture, illustrating data transformation flow from source systems to staging models, intermediate models, and final analytics marts.]


Organization of Modeling Layers


When working with multiple source systems and/or several business use cases, it is strongly recommended to introduce subfolder structures within each modeling layer. A clear folder hierarchy improves maintainability, readability, and ownership across teams.


Structuring the Staging Layer


Within models/staging, subfolders should be organized by source system. In other words, create one folder per source.

For example:

  • models/staging/fivetran/

  • models/staging/ax2009/


This approach keeps transformations logically grouped by their raw data origin and prevents cross-source confusion at the foundational level.


Structuring the Intermediate and Marts Layers


For intermediate and marts, subfolders should be organized by business domain or use case, rather than by source system.

For example:

  • models/marts/logistics/

  • models/marts/finance/

  • models/intermediate/supply_chain/


At this stage, the focus shifts from where the data comes from to how the business uses it. This domain-driven structure supports scalability and clearer ownership boundaries.


Anti-Patterns to Avoid


Maintaining a clean and scalable dbt project also means consciously avoiding common design mistakes:


1. Over-Layering (Unnecessary Layer Hops)


Avoid creating redundant transformations.

If a stg_* model already produces the exact structure required, do not introduce an int_* model that simply passes data through without adding value. Every layer should serve a clear transformation purpose.


2. Building Marts Directly from source()


Creating mart models directly from source() bypasses staging logic. This:

  • Skips standardized data hygiene,

  • Duplicates cleaning logic across marts,

  • Makes testing and documentation harder to maintain.


Always enforce transformation discipline through staging (and intermediate where necessary).


3. Circular Dependencies and “Diamond Spaghetti” DAGs


Avoid circular dependencies or complex “diamond-shaped” DAG structures where logic is scattered across multiple models without clear ownership.

If multiple marts require shared logic:

  • Extract the shared transformation into a single int_* model.

  • Make all dependent marts reference that intermediate model.


This preserves modularity and prevents duplication.
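A sketch of this extraction pattern, with hypothetical model names:

```sql
-- models/intermediate/finance/int_finance__revenue_base.sql
-- Shared revenue logic lives in exactly one place
select
    order_id,
    order_date,
    region,
    gross_amount - discount_amount as net_revenue
from {{ ref('stg_erp__orders') }}
```

Each dependent mart (say, fact_revenue_monthly and fact_revenue_by_region) then selects from {{ ref('int_finance__revenue_base') }} instead of re-deriving net revenue, so the calculation cannot drift between marts.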


In summary, thoughtful folder organization and disciplined layering are essential for maintaining a clean, scalable dbt project. A well-structured DAG ensures reusability, clear ownership, and long-term maintainability—while avoiding the structural pitfalls that can quickly turn a project into unmanageable complexity.


Configuration files (yml)


Every dbt folder should include YAML files to document, test, and declare metadata for the models inside it. YAMLs make your project self-describing, testable, and consistent and contain key configuration details for your sources and models.


dbt_project.yml


The most important yml file is the dbt_project.yml which is required for all dbt projects. Please refer to dbt_project.yml | dbt Developer Hub for details.


Models and sources yml files


In addition to dbt_project.yml, you should always have two additional types of yml files in your subfolders:


  • _<folder>__models.yml: yml files capturing configurations such as tags, materialization, etc. at the model level. As you can see in Fig 1.3, these yml files appear under folders containing your dbt models.


  • _<source_name>__sources.yml: yml files capturing configurations for your data sources. Note that these yml files appear only in folders hosting models built directly on top of your sources.



[Figure 1.3: dbt project directory structure illustrating the layered architecture, with staging, intermediate, and marts folders containing SQL models, sources, and YAML configuration files.]


For details on the structure of these yml files please refer to: Model configurations | dbt Developer Hub. As an example, a _<folder>__models.yml file can be specified as:


version: 2

models:
  - name: sessions_by_day
    description: model for sessions per day
    columns:
      - name: surrogate_key
        description: surrogate key for the table
        data_tests:
          - not_null
          - unique
      - name: date
        data_tests:
          - not_null
      - name: landing_page_path_content
        data_tests:
          - not_null
      - name: landing_page_path_topic
        data_tests:
          - not_null
      - name: landing_page_path_subtopic
      - name: sessions
        data_tests:
          - not_null
      - name: pageviews
        data_tests:
          - not_null
      - name: bounces
        data_tests:
          - not_null
If you enable persist_docs in dbt_project.yml:


models:
  +persist_docs:
    relation: true
    columns: true


Then all your created tables and columns will in fact be annotated with the descriptions you entered in the yml files.


It is highly recommended to use the dbt-codegen package for generating yml files for your dbt project. For details on the usage of this package please refer to: Automatically generate sources.yml, staging models, and schema.yml using dbt codegen - sambaiz-net.
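For example, dbt-codegen macros can be invoked via dbt run-operation; the schema and model names below are placeholders:

```shell
# Generate a sources.yml skeleton for an existing raw schema
dbt run-operation generate_source --args '{"schema_name": "raw"}'

# Generate a models.yml skeleton for an existing model
dbt run-operation generate_model_yaml --args '{"model_names": ["stg_crm__customer"]}'
```

The generated YAML is printed to the terminal; paste it into the appropriate _<folder>__models.yml or _<source_name>__sources.yml file and fill in descriptions.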


For Databricks-specific configurations please refer to: Databricks configurations | dbt Developer Hub.


Documentation


Documentation for models can be included in the specific models.yml. However, if descriptions are more detailed, it is recommended to add a docs.md to your project and reference it from your yml files.


Here’s an example:


{% docs keymetrics_by_factors_d %}
 This table contains key metrics including pageviews, unique pageviews,
 exits, total time on page and average time on page by page path and
 additional factors like device model, channel grouping and city
{% enddocs %}

{% docs user_path_d %} 
 This table shows the number of sessions associated with the first five
 page paths
{% enddocs %}
 
{% docs sessions_by_city_d %}
 This table shows sessions, bounces, pageviews for each landing page path
 by city and also contains columns showing total sessions, bounces and
 pageviews for all cities            
{% enddocs %}

And using this in your yml:

models:
  - name: sessions_by_city
    description: '{{ doc("sessions_by_city_d") }}'



dbt Project Structure and Naming Conventions


Project Structure Example


The most important part of a dbt project is the structure of the models folder. Below is how you should try to structure your dbt use case project (assuming you have staging, intermediate, marts layers):


use_case_logistics/
  models/
    staging/
      crm/
      erp/
      ...
    intermediate/
      business_domain1/
      business_domain2/
      ...
    marts/
      base/
        dimensions/
        facts/
      business_domain1/
      business_domain2/
      ...


Marts are the semantic layer of the use case. It is recommended to create one folder per business domain, e.g. inventory, procurement, master_data. A base folder is necessary only when you have domain-independent dimensions or facts in your use case.


Naming Conventions


Models


Staging:

stg_<source_system>__<table_or_subject>.sql

Example: stg_crm__customer


Intermediate:

int_<business_domain>__<subject>.sql

Example: int_sales__sales_channel


Marts:

Named by entity, optionally with a fact_/dim_ prefix for dimensional models

Example: fact_sales


yml Files


For a folder named <folder>:


  • Models YAML:

    _<folder>__models.yml.

    Examples:

    • models/marts/inventory/_inventory__models.yml

    • models/marts/master_data/_master_data__models.yml


  • In source-specific staging folders (e.g. staging/crm), we keep:

    • _<system>__sources.yml

    • _<system>__models.yml

    Example for the crm source:

    models/staging/crm/

      _crm__sources.yml   # defines dbt sources

      _crm__models.yml    # tests + descriptions for stg models


The leading underscore _ keeps YAMLs at the top of the folder in VS Code, and __models makes the purpose obvious.



Data Modeling and Materialization Best Practices


The "dbt mindset"


Words on how to think about modeling with dbt


The biggest value of using dbt is derived from exploiting synergies between the different components in your data model. Modeling with dbt is all about creating dependencies, modularizing your SQL (or Python) code, and making it reusable in your wider data model. dbt compels engineers not to rush through the creation of tables but to take a step back and think about transformation layers: how they fit into your overall modeling approach, which transformations can be modularized and made reusable, and so on.


In some sense this is nothing new. Anyone who has experience with dimensional modeling for instance knows that there is conceptualization that needs to be carried out before one begins with any real implementation. One should also not take dbt models to be synonymous with data models. Data models can be thought of as bigger recipes, as you may have read elsewhere online, while dbt models are simply ingredients that fit into that bigger recipe.


While creating dbt models, it is crucial to respect the modeling rules pertaining to each transformation layer. The subsections that follow summarize the best practices you should respect while creating dbt models for the three main modeling layers: staging, intermediate, and marts.


For your dbt project, these layers have a clear purpose and responsibility: 


  • staging: clean & standardize inputs

  • intermediate: business logic, joins & enrichment

  • marts: use-case semantic layer (facts/dims like kontingent, bestaende, etc.)


Words on how to think about model materialization


Please refer to the official documentation for an introduction to dbt materializations: Materializations | dbt Developer Hub. Note that for Python models, only table or incremental materializations are available.


The best way to think about materializations is in terms of a "materialization hierarchy promotion" model which can be summarized as follows:


Start with a lower level of materialization (ephemeral or view) and see if there are good reasons to move higher up the hierarchy (tables/incrementals). Of course if you know from the outset that you have a very complex model processing at least millions of rows, you can rule out ephemeral or even views.


Cost considerations and performance should always be taken into account when choosing an appropriate materialization. If, for example, you use an ephemeral model that feeds hundreds of downstream models and processes large volumes of data, you are likely incurring higher cloud costs because its code is executed repeatedly. On the other hand, if you have very lightweight logic on top of a table in your database, a "light" materialization option such as ephemeral suffices, and there is no merit in flooding your data warehouse with another table.
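Materializations can be set per folder in dbt_project.yml or overridden per model in a config block. As a sketch, promoting a single staging model to a view while the folder default stays ephemeral (source and column names are illustrative):

```sql
-- models/staging/crm/stg_crm__contact.sql
-- Override the folder-level default for this one model only
{{ config(materialized='view') }}

select
    cast(contact_id as bigint) as contact_id,
    trim(contact_name)         as contact_name
from {{ source('crm', 'contact') }}
```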


Staging


Modeling rules


  • Staging models should be set up for reusability, and all staging models should only use source() or staging models in base

  • It is generally important not to drop records at this stage

  • Recasting types, renaming columns, etc. are acceptable; beyond that, keep data transformations to a minimum

  • Typical wide transformations such as joins or complex aggregations should NOT be done at this stage


Materialization best practices


Default: ephemeral

Use ephemeral when the following are true:

  • Used by few models (rule of thumb: ≤ 2-3 downstream models)

  • Transform is light: projections, simple casts, nullif, renames, simple case

  • No heavy windows (row_number, lag, lead), no big regex, no joins to other big tables

  • Underlying source query runs < 10 seconds on serverless when filtered reasonably


Staging models' materialization can be promoted to view when:

  • A model is reused by 3+ downstream models

  • Query runtime is 10–30 seconds

  • There is “moderate logic”: regex parsing, several computed fields, or many columns

  • You want a canonical place for column definitions and cleaning


Staging models' materialization can be promoted to incremental or table for "complex" cases:

  • Source is > 200 GB total or reads > 50–100 GB per typical run

  • Source is > 200M rows (very rough heuristic)

  • Multi-join staging (especially to other large datasets)

  • Heavy regex work on big tables
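For such a "complex" case, a staging model promoted to incremental might be sketched as follows; the source, unique key, and watermark column are assumptions for illustration:

```sql
-- models/staging/erp/stg_erp__transactions.sql
-- Heavy source promoted to incremental: only new records are processed per run
{{ config(
    materialized='incremental',
    unique_key='transaction_id'
) }}

select
    cast(transaction_id as bigint) as transaction_id,
    cast(posted_at as timestamp)   as posted_at,
    amount
from {{ source('erp', 'transactions') }}

{% if is_incremental() %}
-- On incremental runs, read only records newer than what is already loaded
where posted_at > (select max(posted_at) from {{ this }})
{% endif %}
```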


Intermediate


Modeling rules


  • The intermediate layer can be used for a range of transformations that support the creation of marts. These transformations can include:

    • Normalization or denormalization of data

    • Truncation of CDCs

    • Pivoting

    • Joins 

  • Intermediate models should only depend on staging models


Materialization best practices


Default: view

Intermediate models by default should be materialized as views when:

  • The int model is not used by many downstream mart models (it is difficult to give a precise threshold here)

  • It’s mostly joins/filters that are stable

  • Runtime < 30–60 seconds


Use incremental when:

  • Int model is heavy (joins + windows, recursive patterns, dedup)

  • Output is large and expensive to recompute

  • It’s used by many downstream marts

  • The work scales with history


Marts*


Modeling rules


  • Denormalized and composed with different dimensions and facts

  • Designed for a specific domain e.g. finance and specific analytics purpose

  • Naming convention (Straight from the documentation): Name marts by entity (orders.sql etc).

  • May join:

    • local intermediate models

    • core dims/facts from common repo

    • thin staging models (only for simple lookups)

  • If you have dimensions and facts in your marts layer that will be consumed by business domain data models, store them under marts/base/dimensions + marts/base/facts

  • Must not reference source() directly. Models in marts should strictly only depend on staging or intermediate models.


Materialization best practices


Default: table

Use table when:

  • Dimension tables are small enough to rebuild (typical dims)

  • Facts are not huge or updates are rare

  • Incremental complexity isn’t worth it

  • Full-refresh is cheap and ensures correctness


In certain cases where the computational logic for a mart is simple, it can also suffice to materialize it as a view. 


Use incremental when:

  • Fact tables are large or growing

  • You have a stable unique_key

  • You can define an incremental watermark (e.g., DL_UPD_DT)

  • The "merge" incremental strategy with a unique key should be used selectively, only when less computationally expensive incremental strategies do not suffice (because of data update requirements, for example)

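Putting these conditions together, an incremental fact mart using the merge strategy could be sketched as follows; the model name and upstream ref are illustrative, and DL_UPD_DT serves as the watermark column per the example above:

```sql
-- models/marts/finance/fact_orders.sql
-- Large, growing fact with a stable unique key and an update watermark
{{ config(
    materialized='incremental',
    incremental_strategy='merge',
    unique_key='order_id'
) }}

select
    order_id,
    order_date,
    total_amount,
    dl_upd_dt
from {{ ref('int_finance__orders_enriched') }}

{% if is_incremental() %}
-- Merge in only rows updated since the last successful run
where dl_upd_dt > (select max(dl_upd_dt) from {{ this }})
{% endif %}
```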

*In case of the common repository, this layer is called "core". 



Conclusion


Designing a robust dbt project is less about individual models and more about the structure that connects them. By clearly separating staging, intermediate, and marts layers, organizing folders by source systems and business domains, and enforcing consistent naming and YAML conventions, you create a data transformation layer that is understandable, testable, and easy to evolve.


The “dbt mindset” emphasizes modularity, reusability, and clear responsibilities for each layer. Combined with a pragmatic materialization promotion approach—starting light and only promoting to heavier materializations when justified by performance, cost, or reuse—you can keep both your warehouse and your dbt project lean and maintainable.


In practice, these guidelines help teams:

  • Avoid brittle, ad‑hoc pipelines tied directly to raw sources

  • Centralize shared logic and reduce duplication

  • Keep DAGs readable even as complexity grows

  • Provide trustworthy, well‑documented marts to downstream consumers


In the next part of this series, we will build on this foundation and dive deeper into advanced patterns such as incremental models and macro best practices, showing how to further scale and industrialize analytics engineering with dbt on Databricks SQL.


A scalable dbt project isn’t about individual models, but about a clear, layered architecture (staging → intermediate → marts), consistent folder and naming conventions, and pragmatic materialization choices that keep your transformations modular, reusable, and easy to evolve as business needs grow.


Stay tuned for Part 3 – From Development to Production: Building Enterprise-Grade dbt Pipelines on Databricks SQL which focuses on building robust, enterprise-grade data transformation pipelines using dbt and Databricks SQL.


