
Data acquisition best practices

By Blueprint Team

Introducing the first blog post in a new series from the Blueprint team. In this series, we will discuss the crucial elements that make up a successful data ecosystem framework and why a comprehensive data strategy is key to future-proofing your business. Be sure to follow our socials for updates when each post is published. This week we are covering data acquisition sources, patterns, and best practices for your data lakehouse.

Organizations looking to future-proof their business will find success or failure determined by a few critical elements of their strategy, the first being the development of a robust data estate. Prioritizing this has a significant impact on the business, particularly through cost savings and improved scalability. While a robust data estate delivers high-quality data for downstream processing and engineering, building one requires organizations to first acquire data from a wide range of sources, including traditional databases, SaaS platforms, APIs, files, and third-party data providers. To support this, modern data acquisition platforms like Fivetran, Azure Data Factory, and Informatica Cloud are designed to work with data lakehouses like Databricks. With this foundation in place for your digital transformation strategy, Blueprint enables you to leverage industry-leading services to adapt faster, innovate better, and expand more efficiently.

Data acquisition

Let’s dive deep into the data acquisition process, starting with the sources of data eligible for the data lake. These include:

  • Transactional databases like SQL Server, PostgreSQL, MySQL, and others
  • SaaS platforms like Workday and Salesforce
  • APIs
  • IoT time-series data
  • Semi-structured data like JSON, XML, and CSV

Once data sources have been identified, organizations can determine the most appropriate data acquisition pattern and interval for each source. There are several patterns to choose from, including batch incremental, full data loads, CDC, and file pick-up. 

Batch incremental

Batch incremental is a pattern where data is acquired periodically, and only new or modified data is retrieved from the source system. This pattern is useful when dealing with large data sets that are frequently updated. It often relies on a date key in the source system to establish the ‘last used location’ or watermark. Full data loads, on the other hand, involve acquiring all data from a source system. This pattern is useful when dealing with smaller data sets that are not updated frequently. Full data loads are also useful as an initial load to ensure that all data is captured.
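To make the batch incremental pattern concrete, here is a minimal PySpark sketch of a watermark-driven incremental pull. The control table, source connection, secret scope, and column names are illustrative assumptions, not references to a real environment.

```python
# Minimal sketch of a batch incremental load using a date-key watermark.
# Table, column, and connection names are placeholders, not a real system.
from pyspark.sql import functions as F

# 1. Look up the high-water mark from the previous run (stored in a small control table).
last_watermark = (
    spark.table("ops.acquisition_watermarks")
    .filter(F.col("source_table") == "sales.orders")
    .agg(F.max("last_modified_ts"))
    .collect()[0][0]
)

# 2. Pull only rows modified since that watermark from the source system over JDBC.
incremental_df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://source-host:5432/sales")   # placeholder connection
    .option("dbtable", f"(SELECT * FROM orders WHERE modified_at > '{last_watermark}') AS t")
    .option("user", dbutils.secrets.get("acquisition", "pg-user"))
    .option("password", dbutils.secrets.get("acquisition", "pg-password"))
    .load()
)

# 3. Append only the new slice to the Bronze table.
incremental_df.write.format("delta").mode("append").saveAsTable("bronze.orders")

# 4. A real job would then advance the watermark in the control table so the
#    next run picks up where this one left off (omitted here).
```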

Change data capture (CDC) involves capturing changes to a source system’s data in real time or near real time. CDC is useful in situations where immediate access to new data is necessary. File pick-up is a pattern where data is acquired by reading files from a specified location, such as a network file share or an FTP server. This pattern is useful when dealing with data that is generated by external systems or partners.
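For the file pick-up pattern on Databricks, one common option is Auto Loader, which incrementally discovers new files in a landing location. The sketch below assumes a CSV feed dropped by a partner into cloud storage; the path, checkpoint location, and table name are placeholders.

```python
# Minimal sketch of the file pick-up pattern using Databricks Auto Loader.
raw_stream = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "csv")            # files dropped by an external partner
    .option("cloudFiles.inferColumnTypes", "true")
    .option("header", "true")
    .load("abfss://landing@yourstorageaccount.dfs.core.windows.net/partner-feeds/")
)

(
    raw_stream.writeStream
    .option("checkpointLocation", "/checkpoints/partner_feeds_bronze")
    .trigger(availableNow=True)                    # run as a scheduled batch over new files only
    .toTable("bronze.partner_feeds")
)
```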

Other acquisition patterns or a hybrid of those above may be appropriate for specific sources or scenarios, and it is important to choose the best approach for the data being acquired.

It’s worth noting that the chosen acquisition pattern will also determine the frequency with which data is acquired. For example, some data sources may require frequent updates to stay up to date, while others may be static or infrequently updated. Understanding the different acquisition patterns and choosing the best one for each data source is crucial for building a successful data estate.

The Bronze layer

An important concept in the Databricks Lakehouse architecture is the Bronze layer, which lets organizations keep their data in its raw form, preserving its integrity and lineage, enabling recovery or rebuilds of downstream tables, and providing a point of audit should one be needed. When data is first acquired, it is stored in the Bronze layer in its original form, without any transformations or modifications, which makes it easy to access and analyze the data in its original state. The Bronze layer serves as a staging area for all incoming data, where it is immediately available to downstream data engineering and transformation processes within the data pipeline.
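As a rough illustration of landing data in the Bronze layer, the sketch below appends acquired records to a Delta table as-is, adding only lightweight audit columns. The path, source tag, and table name are hypothetical.

```python
from pyspark.sql import functions as F

# Read the acquired files exactly as they arrived (JSON in this illustrative example).
raw_df = spark.read.json("/Volumes/landing/salesforce/accounts/")   # placeholder path

# Add audit columns only; no business transformations are applied at this stage.
bronze_df = (
    raw_df
    .withColumn("_ingested_at", F.current_timestamp())
    .withColumn("_source_system", F.lit("salesforce"))   # illustrative source tag
)

bronze_df.write.format("delta").mode("append").saveAsTable("bronze.salesforce_accounts")
```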

Data is processed from the Bronze layer into the Silver layer, where it is transformed and optimized for analytical use. The ability to write data directly to storage as raw, without any transformation or modification, is a hallmark of the lakehouse architecture; it saves money on ingestion and processing and enables faster access to data for downstream processing and engineering.

Extract, load, transform

The E-L-T pattern, or extract, load, and transform, is a common data processing pattern used in the Lakehouse architecture. In this pattern, data is first extracted from various sources, then loaded into the data lake in its raw form, and finally transformed into higher quality tables as the data pipeline progresses. One of the benefits of writing data directly to storage as raw data in the Bronze layer is that it can help organizations save money on ingestion and processing costs. By not immediately transforming the data, the organization can avoid the cost of processing and transforming data that may not be needed in the future. 
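To show where the “T” in E-L-T happens, here is a hedged sketch of promoting Bronze rows into a cleaned Silver table. The deduplication key, quality rule, and column names are assumptions for illustration.

```python
# Minimal sketch of the "T" in E-L-T: refining Bronze rows into a Silver table.
from pyspark.sql import functions as F

silver_df = (
    spark.table("bronze.orders")
    .dropDuplicates(["order_id"])                         # remove replayed rows
    .filter(F.col("order_total").isNotNull())             # basic quality rule
    .withColumn("order_date", F.to_date("order_ts"))      # typed, analysis-ready column
    .select("order_id", "customer_id", "order_date", "order_total")
)

silver_df.write.format("delta").mode("overwrite").saveAsTable("silver.orders")
```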

 Additionally, the retention schedule of data in the Bronze layer is important to consider. Organizations need to determine how long they need to keep data in the Bronze layer before it is transformed or moved to a different layer. This retention schedule can be determined by factors such as compliance regulations, data usage patterns, and storage costs. By having a clear retention schedule, organizations can avoid unnecessary storage costs and ensure that data is retained for as long as it is needed. 
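One way to express such a retention schedule on Databricks is through Delta Lake table properties and periodic VACUUM runs. The 90-day window below is purely an example; the right interval depends on your compliance and cost requirements.

```python
# Minimal sketch of enforcing a Bronze retention schedule with Delta Lake.
# The 90-day interval is an example, not a recommendation.

# Keep removed data files recoverable for 90 days before they can be vacuumed away.
spark.sql("""
    ALTER TABLE bronze.orders SET TBLPROPERTIES (
        'delta.deletedFileRetentionDuration' = 'interval 90 days',
        'delta.logRetentionDuration' = 'interval 90 days'
    )
""")

# Periodically remove rows older than the retention window, then reclaim the files.
spark.sql("DELETE FROM bronze.orders WHERE _ingested_at < current_date() - INTERVAL 90 DAYS")
spark.sql("VACUUM bronze.orders RETAIN 2160 HOURS")   # 90 days expressed in hours
```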

Databricks and Python for data acquisition

There are many first-class data acquisition platforms, including Fivetran, Matillion, Azure Data Factory, and others. Blueprint is familiar with them and is equally skilled in leveraging Databricks natively for data acquisition routines when that makes the most sense from a cost management and labor point of view. Native Python routines in Databricks can provide a highly customizable approach to data acquisition. By writing Python code, organizations can acquire data from nearly any source, including JDBC connections, APIs, and other storage locations. This approach is a good fit when the client or firm is comfortable with a ‘build’ strategy rather than a ‘buy’ strategy centered on Fivetran or other data acquisition tools.
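As an example of the ‘build’ approach, the sketch below pulls records from a REST API with plain Python and lands them raw in the Bronze layer from a Databricks notebook. The endpoint, secret scope, response shape, and table name are hypothetical.

```python
# Minimal sketch of a custom acquisition routine written directly in Databricks:
# pull records from a REST API and land them raw in the Bronze layer.
import requests
from pyspark.sql import functions as F

token = dbutils.secrets.get("acquisition", "crm-api-token")    # hypothetical secret scope
response = requests.get(
    "https://api.example-crm.com/v1/accounts",                 # placeholder endpoint
    headers={"Authorization": f"Bearer {token}"},
    timeout=60,
)
response.raise_for_status()
records = response.json()["results"]                           # assumed payload shape

# Land the payload as-is; schema refinement happens later, in the Silver layer.
api_df = spark.createDataFrame(records).withColumn("_ingested_at", F.current_timestamp())
api_df.write.format("delta").mode("append").saveAsTable("bronze.crm_accounts")
```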

One of the benefits of using Python for data acquisition is the ability to customize data acquisition pipelines. Python can be used to acquire, transform, and process data in real-time, batch, or streaming mode. Additionally, Python provides a highly flexible approach to data acquisition as it can be customized to meet the specific requirements of each organization’s data ecosystem. 

At Blueprint, our team has extensive experience with leading data acquisition platforms and can help organizations replace legacy technologies to reduce data engineering labor and costs while increasing scalability.
