Designing Effective Data Ingestion: Patterns and Technologies for the Modern Data Lakehouse
In today’s data-driven era, decisions must be supported by reliable data, and organizations increasingly depend on their analytics platforms to derive insights, monitor performance, and make predictions. The analytics platform serves as the central hub for all business intelligence (BI) and artificial intelligence (AI) initiatives within an organization. An enterprise-wide analytics solution must be deployed on this platform to ensure consistency and scalability.
One of the foundational capabilities of a data analytics platform is an easy, robust, and scalable data ingestion mechanism. Before BI and AI capabilities can deliver value, a seamless and robust data ingestion process must be in place. Simply put, data ingestion refers to the pipeline that transfers raw data from diverse sources into a unified environment—such as a data lakehouse—where it can be processed, analyzed, and visualized effectively.
Why is data ingestion needed? 📌
Data ingestion is the first entry point and a critical step in any analytics platform. Without it, no transformation, dashboard, or model can function. If there is no proper, well-defined approach for integrating and ingesting data from diverse sources, the burden of data ingestion often shifts to business users—diverting their focus from core business use cases. This not only reduces the effectiveness of the platform but may also lead to increased costs.
You might have heard of the term Medallion Architecture, which refers to a design approach for data ingestion and processing that organizes data into structured layers to prepare datasets for analytical purposes. It typically involves three layers: Bronze, Silver, and Gold. Each layer represents a stage of data refinement—Bronze being the rawest form, and Gold being the most curated and business-ready. This approach helps organizations build reliable and trustworthy data assets for analytics and other downstream applications. Raw data ingestion, therefore, forms the foundational step of any analytics initiative.
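As a rough illustration of how the layers build on each other, here is a minimal PySpark sketch using Delta tables; the schema, table, and column names (bronze.orders_raw, order_id, order_amount, and so on) are hypothetical examples, not a prescribed standard.
Python
# Minimal illustration of the Bronze/Silver/Gold flow using PySpark and Delta tables.
# Schema, table, and column names below are hypothetical examples, not a standard.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Bronze: land the raw data as-is from the source
bronze = spark.read.format("json").load("/landing/orders/")
bronze.write.format("delta").mode("append").saveAsTable("bronze.orders_raw")

# Silver: clean and deduplicate the raw records
silver = (
    spark.table("bronze.orders_raw")
    .dropDuplicates(["order_id"])
    .filter(F.col("order_amount") > 0)
)
silver.write.format("delta").mode("overwrite").saveAsTable("silver.orders")

# Gold: business-ready aggregate for BI dashboards
gold = (
    spark.table("silver.orders")
    .groupBy("order_date")
    .agg(F.sum("order_amount").alias("daily_revenue"))
)
gold.write.format("delta").mode("overwrite").saveAsTable("gold.daily_revenue")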

We need effective data ingestion for:
· Timely Decision-Making: Business operations depend on up-to-date data from varied sources to drive decisions across all business divisions.
· Data Centralization: Data from disparate sources is brought to a central location. Data governance, access controls, and other policies can then be effectively applied—allowing the analytics platform to serve as a single source of truth for all analytical (and sometimes operational) needs. Data from ERP, CRM, and IoT systems can be streamlined centrally.
· Scalability: Modern analytics platforms offer cost-effective storage. A well-designed data ingestion process enables analytical systems (BI, AI, and ML) to scale alongside business needs.
· Foundation for Advanced Analytics: Clean, optimized datasets with historical data loads are a prerequisite for advanced analytics. These datasets can feed into AI and ML models and enable actionable BI dashboards that deliver real-time insights.

Motivation for Structuring Ingestion Around Load Patterns
Different data sources and ingestion requirements demand different approaches. Therefore, it is essential to structure data ingestion around well-defined load patterns. This provides the following benefits:
· Efficiency: Minimizes unnecessary data movement and reduces load on source systems.
· Cost Optimization: Lowers compute and storage costs by ingesting only the required data, avoiding duplication or omission.
· Data Accuracy: Ensures the correct capture and update of necessary data without duplication or loss.
· Compliance: Supports audit, SLAs, data lineage, and regulatory requirements through historical data retention, record deletion, access control, and governance.
Ingestion Patterns in a Data Analytics Platform
1️⃣ Full Load or Batch Mode
Definition:
The entire dataset from the source system is copied to the target system.
Common Use Cases:
· The dataset to be copied is small and does not change frequently.
· The refresh frequency requirement for the dataset is flexible and lenient (e.g., once a day, once a week).
· The source system does not support the implementation of incremental pipeline logic.
Benefits:
· The data copy process is straightforward, with no overhead of implementing incremental or partitioning logic.
· All data is copied every time, which makes it easier to track changes (for example, by retaining a snapshot of the dataset for each of the last n days).
· If the source system does not provide change information and inserts/updates are made directly to the dataset, extracting the entire dataset helps track changes and updates.
· If tracking history is not required, the dataset at the target system can be overwritten on each run to save storage costs.
Challenges:
· Although it reduces overhead, this approach is not efficient for large datasets.
· It can be difficult to clearly define the boundary between full load and incremental load objects.
For example, should a dataset with 100k records be considered a full load object, and one with 101k records incremental? Record count alone is not a reliable criterion: a large dataset might not change frequently.
TIP: If the source system supports incremental load ingestion, it is recommended to use it regardless of dataset size.
Examples:
Suppose your data resides in an Oracle RDBMS and you want to use ADF: you would execute SELECT * FROM <your_table> in a Copy activity.

· In Databricks, you can use a CTAS statement:
SQL
CREATE TABLE new_table AS
SELECT * FROM source_table;
(find more here: https://learn.microsoft.com/en-gb/azure/databricks/sql/language-manual/sql-ref-syntax-ddl-create-table)
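If you prefer a code-based approach over SQL, a full load from an RDBMS source can also be expressed as a Spark JDBC read. The following is a minimal sketch, assuming the Oracle JDBC driver is available on the cluster; the connection URL, credentials, and table names are illustrative placeholders.
Python
# Minimal full-load sketch: copy an entire source table into a Delta table.
# The JDBC URL, credentials, and table names are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

full_df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:oracle:thin:@//source-host:1521/ORCLPDB")  # hypothetical endpoint
    .option("dbtable", "SALES.ORDERS")                              # hypothetical table
    .option("user", "ingest_user")
    .option("password", "********")
    .load()
)

# The full dataset is re-copied on every run, so the target is simply overwritten
full_df.write.format("delta").mode("overwrite").saveAsTable("bronze.orders_full")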
NOTE: We will discuss how to create pipelines using some of these technologies in upcoming blogs in this series, so stay tuned and let us know if a particular example would be helpful for you.
2️⃣ Incremental Load
Definition:
Only changed or newly inserted data in the source system is copied to the target system.
Common Use Cases:
· Used when the dataset is large and changes frequently.
· Performing a full load in this scenario is impractical, costly, and time-consuming.
Benefits:
· Ingesting only changed data optimizes processing and storage by avoiding duplicate records.
· Depending on the volume of changes, ingestion frequency can be increased (for example, every 10, 15, or 30 minutes) so that downstream systems stay up to date.
Challenges:
· Requires support for change-data tracking in the source dataset. This is typically implemented via a datetime column that records insert or update timestamps (often called a watermark, incremental, or audit column). In some cases, multiple columns (e.g., insert_date and update_date) must be evaluated.
· Implementing checkpoint logic in your ingestion tool is critical. Pipelines must be thoroughly tested to ensure reliability.
Tip: Always include an overlap between runs. For example, if a pipeline run checkpoints at X, use X minus two minutes as the starting checkpoint for the next run.
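To make the checkpoint-plus-overlap idea concrete, here is a rough PySpark sketch; the checkpoint table (ops.ingestion_checkpoints), watermark column (UPDATE_DATE), and connection details are assumptions for illustration, and the final MERGE into the curated table is omitted.
Python
# Sketch of watermark-based incremental ingestion with an overlapping checkpoint.
# The checkpoint table, watermark column, and connection details are illustrative.
from datetime import timedelta
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# 1. Read the last checkpoint for this object (stored here in a small Delta table)
last_checkpoint = (
    spark.table("ops.ingestion_checkpoints")
    .filter(F.col("object_name") == "orders")
    .agg(F.max("checkpoint_ts"))
    .collect()[0][0]
)

# 2. Subtract a small overlap so changes near the boundary are not missed
start_ts = last_checkpoint - timedelta(minutes=2)

# 3. Pull only the rows changed since the (overlapped) checkpoint
query = f"(SELECT * FROM SALES.ORDERS WHERE UPDATE_DATE >= TIMESTAMP '{start_ts}') src"
changed_df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:oracle:thin:@//source-host:1521/ORCLPDB")
    .option("dbtable", query)
    .option("user", "ingest_user")
    .option("password", "********")
    .load()
)

# 4. Land the changes; the MERGE into the curated table and the checkpoint update
#    (max watermark seen in this run) are omitted here for brevity
changed_df.write.format("delta").mode("append").saveAsTable("bronze.orders_changes")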
Examples:
Suppose your data resides in an Oracle RDBMS and you want to use ADF: you would execute 'SELECT * FROM your_table WHERE watermark_col >= your_checkpoint_value' in a Copy activity.

· In Databricks, you can use Auto Loader (find more about it here: https://learn.microsoft.com/en-us/azure/databricks/ingestion/cloud-object-storage/auto-loader/). A minimal sketch follows below.
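A minimal Auto Loader sketch, assuming new files land in cloud storage as Parquet; the storage path, schema/checkpoint locations, and target table name are placeholders.
Python
# Minimal Auto Loader sketch: incrementally pick up new files from cloud storage.
# The source path, file format, and target table are illustrative placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

stream = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "parquet")                         # format of incoming files
    .option("cloudFiles.schemaLocation", "/checkpoints/orders/schema")
    .load("abfss://landing@yourstorage.dfs.core.windows.net/orders/")
)

(
    stream.writeStream
    .option("checkpointLocation", "/checkpoints/orders/")           # tracks already-processed files
    .trigger(availableNow=True)                                     # run as a scheduled incremental batch
    .toTable("bronze.orders_autoloader")
)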
3️⃣ Initial Load
Definition: A one-time bulk activity to copy historical data from the source system to the target system, usually followed by incremental ingestion. It can involve small to very large datasets.
Common Use Case:
· Used during first-time data onboarding; the object is then moved to an incremental pipeline for periodic refresh.
· Ingesting data from a legacy system as part of a migration.
Challenges:
· Large datasets may require logic to process data in batches or chunks.
For example, with a two-year historical load, you could ingest one month’s data per pipeline run instead of loading all data at once.
· Without batching or chunking logic, an initial-load pipeline is effectively just a one-off full load and adds little beyond the incremental or full-load pipelines.
TIP: Implement batching logic by defining parameters that accept start and end timestamps in your pipeline. The same pipeline can then be used both for the initial load and as a self-healing mechanism if the regular ingestion fails after onboarding. This approach will save countless hours for your data engineering team, and they will thank you for it. It has helped me personally many times.
Examples:
Suppose your data resides in an Oracle RDBMS and you want to use ADF: execute 'SELECT * FROM your_table WHERE watermark_col >= start_timestamp AND watermark_col < end_timestamp' in a Copy activity, and loop that Copy activity inside a ForEach activity.

· In Databricks, you can use Auto Loader (the same approach as the incremental pipeline; compute size will need to scale with the dataset). A sketch of the batching approach follows below.
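To illustrate the batching tip above, here is a sketch of a windowed backfill that ingests one month of history per iteration; the two-year window, watermark column, connection details, and table names are illustrative assumptions.
Python
# Sketch of a chunked initial load: ingest the history one month at a time
# instead of a single huge extract. Window boundaries, the watermark column,
# and connection details are illustrative assumptions.
from datetime import date
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

def next_month(d: date) -> date:
    """Return the first day of the month following d."""
    return date(d.year + d.month // 12, d.month % 12 + 1, 1)

start, end = date(2023, 1, 1), date(2025, 1, 1)   # two-year historical window
window_start = start

while window_start < end:
    window_end = min(next_month(window_start), end)
    query = (
        f"(SELECT * FROM SALES.ORDERS "
        f"WHERE UPDATE_DATE >= DATE '{window_start}' "
        f"AND UPDATE_DATE < DATE '{window_end}') src"
    )
    chunk_df = (
        spark.read.format("jdbc")
        .option("url", "jdbc:oracle:thin:@//source-host:1521/ORCLPDB")  # hypothetical endpoint
        .option("dbtable", query)
        .option("user", "ingest_user")
        .option("password", "********")
        .load()
    )
    # Append one month of history per iteration; a failed window can be re-run in isolation
    chunk_df.write.format("delta").mode("append").saveAsTable("bronze.orders_history")
    window_start = window_end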
4️⃣ Partition-Based Load
Definition:
· Ingests or updates data at a partition level rather than loading the entire table.
· Similar to incremental pipelines—only changed data is copied. The difference lies in the column used: incremental pipelines use a watermark, audit, or incremental column (date, timestamp, integer, or LSN), whereas partition-load pipelines use the partition column.
Common Use Case:
· When the source dataset is updated by partition.
Benefits:
· Updates only specified partitions, reducing processing and query time.
· Efficient for time-series or region-based data.
· Optimizes both storage and processing resources.
Tip: If you can identify the partitions that have been updated or inserted, you can design a pipeline to ingest partitions in parallel and significantly reduce ingestion time.
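As one way to implement this, here is a sketch that re-extracts only the changed partitions and overwrites just those partitions in a Delta target using replaceWhere; the partition list, partition column, and connection details are illustrative, and the loop is shown sequentially for clarity (an orchestrator can run partitions in parallel).
Python
# Sketch of a partition-level refresh: re-extract only the changed partitions
# and overwrite just those partitions in the Delta target. All names are illustrative.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

changed_partitions = ["2025-07-01", "2025-07-02"]   # e.g. detected from source metadata

for part in changed_partitions:
    query = f"(SELECT * FROM SALES.ORDERS WHERE ORDER_DATE = DATE '{part}') src"
    part_df = (
        spark.read.format("jdbc")
        .option("url", "jdbc:oracle:thin:@//source-host:1521/ORCLPDB")
        .option("dbtable", query)
        .option("user", "ingest_user")
        .option("password", "********")
        .load()
    )
    # replaceWhere rewrites only the rows matching the predicate, i.e. this partition
    (
        part_df.write.format("delta")
        .mode("overwrite")
        .option("replaceWhere", f"order_date = '{part}'")
        .saveAsTable("bronze.orders_partitioned")
    )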
Challenges
· Identifying which partitions have been updated at the source.
· Many data engineers assume partitions are append-only and immutable, which can cause updated partitions to be missed.
TIP: Always monitor the impact of pipelines (especially incremental, initial, and partition) on source systems by coordinating with DBAs. DBAs can be a good—or a bad—friend to data engineers during the data ingestion process.
NOTE: The patterns discussed above are mostly batch ingestion patterns; streaming and CDC patterns are not covered here. We would be happy to discuss practical experiences and tips for streaming and CDC approaches in a dedicated blog.
Common Technologies for Data Ingestion in Lakehouse Platforms
Modern data lakehouses support a wide range of ingestion technologies, from low-code tools to code-driven frameworks.
Here’s an overview of the most popular options, highlighting some key points from our experience:
Azure Data Factory (ADF)
· Offers numerous out-of-the-box connectors for databases, SaaS applications, and cloud storage.
· Cost-effective for pure data copy if used correctly, even for high data volumes.
· Visual, low-code ETL/ELT tool with batch and event-based ingestion.
· Ideal for seamless integration with other Azure services.
· Limited customization capabilities.
· Limited support if you want to use Delta tables.
· Fabric Data Factory (which will eventually replace ADF) is relatively new compared to ADF but catching up fast.
· Fabric Data Factory uses a capacity-based consumption model rather than pure pay-as-you-go.
Databricks (Spark)
· Highly flexible and scalable, well suited for complex transformations and very large datasets.
· Enables custom ingestion frameworks using PySpark or Scala.
Databricks Lakeflow
· A newer, visual orchestration tool native to Databricks.
· Provides a growing library of ingestion connectors.
· Streamlines pipeline automation and event-driven ingestion.
Talend (Qlik)
· Comprehensive data integration suite with strong transformation and quality features.
· Excels in hybrid and cloud migration projects.
· User/developer-seat licensing can become costly at scale (especially post-Qlik acquisition).
Fivetran
· Fully managed ingestion service with prebuilt connectors to a wide variety of sources.
· Primarily designed for incremental syncs.
· Ideal for rapid onboarding of business systems (e.g., Salesforce, NetSuite).
Informatica
· Enterprise-grade cloud data integration platform.
· Rich support for data quality, governance, and security.
· Well suited for regulated industries.
Matillion
· Cloud-native ELT platform optimized for Snowflake, BigQuery, and Redshift.
· Combines a visual interface with advanced transformation logic.
· Supports parameterization and orchestration.
· Generally more expensive than ADF or code-based alternatives for pure ingestion.
We may have a preferred or recommended ingestion tool, but each project’s requirements demand an objective evaluation. We’ll dive deeper into some of these technologies and their best practices in upcoming blogs—stay tuned!
Common Data Ingestion Sources
Each company and team has a variety of systems, databases, and applications in its IT landscape from which you might need to extract data into the analytics platform. Here are some of the most commonly used data sources (not an exhaustive list):
· SQL Server
· Oracle
· Teradata
· REST APIs
· Salesforce
· MySQL
· PostgreSQL
· SAP
· Dynamics 365
· FTP/SFTP
· MongoDB
We will explore several of these sources, highlighting key aspects from a data ingestion and processing point of view. If you work extensively with any of them, let us know—we’d love to exchange experiences.
✅Summary
Efficient data ingestion isn’t just about moving data from point A to point B. There is no one-size-fits-all approach: you must adapt to source characteristics, target-system needs, and cost constraints. Structuring ingestion around load patterns—full, incremental, partition, and initial—provides clarity, efficiency, and resilience. Combined with the right tools—whether it’s ADF, Databricks, or Fivetran—you can build a robust foundation for your modern data lakehouse.
In upcoming posts, we’ll share our learnings, technology evaluations, and a few advanced optimizations drawn from practical experience on large-scale, real-world projects that you can use to level up your data ingestion game.
We hope you enjoyed this article and can leverage some of the tips shared here in your projects.
Please let us know which topics pique your interest or any pain points you’d like us to discuss or help with.