A data warehouse is a data storage system designed to ingest and analyze data in real-time typically to uncover business intelligence used in day-to-day operations. These systems draw relevant data from many other operational systems, like enterprise resource planning (ERP) systems, customer relationship management (CRM) systems, billing systems, and supply chain systems, and apply sophisticated visualization, reporting, and business analytics to help inform companies easily.
While many data warehouses do archive data for posterity, this role more frequently falls to data lakes because data lakes accommodate extensive data storage for multiple non-compatible data types. Data warehouse storage, however, is still a highly relevant enterprise storage solution, especially for sensitive customer information, proprietary business information, and storing refined data.
Data warehouses play a downstream role in enterprise data pipelines. Either the warehouse will dip into a data lake, drawing out relevant data from the pool of structured, unstructured, and semi-structured data, or it collects data from multiple data sources and operational systems.
Then, upon ingestion, the data warehouse will clean, transform, or otherwise prepare the data to fit into a data schema that it can store and use—unlike data lakes, data warehouses operate on hierarchical systems instead of flat. From the warehouse, or as data is moved inside, analytics and further transformations are applied to it, sending the refined insights to reporting, or on to another system like a data mart, to serve smaller specialized sets of users.
Data warehouses are complex, automated processes, but data warehouse key concepts are:
- Business Intelligence — The original main goal of data warehouses, business intelligence extracted from vast pools of business data, attempts to bring context and understanding to historical, current, and future views of the business operations.
- Extract, Transform, Load (ETL) — ETL is a common process for ingesting information into a data warehouse. It is concerned with collecting data from disparate sources, cleaning and transforming that data into usable formats, and loading that data into the warehouse for processing.
- Online Analytical Processing Servers (OLAP) — OLAP is a class of analytics server found in in data warehouse data pipelines: relational OLAP (ROLAP) caters to relational database, multidimensional OLAP (MOLAP) utilize multidimensional models, hybrid OLAP (HOLAP) combines ROLAP and MOLAP, and then there are specialized SQL servers.
- Data Mart — Data marts are downstream subsets of data warehouses, storing and presenting data for a particular silo. Typically, silos are created around business departments, like marketing, and sales, for those users to access relevant and current information about their subject area.
Data warehouses are designed around a three-tier approach. As technology continues to advance, the three-stage model may be implemented differently but the standard data warehouse architecture currently supports the collection, ingestion, transformation, and analysis of unprocessed or raw data into data usable for other systems, or immediately actionable by the company.
- Bottom Tier — In the bottom tier, the data warehouse is concerned with collecting data from multiple sources and systems, cleaning the data to eliminate data inaccuracies, corruptions, duplications, and incompletes, and then formats that data into a predefined schema. The main goal is to turn raw data into reliable data. The common process that data engineers use to achieve this is the extract, transform, and load process, or ETL. ETL aligns with the above process, extract data from sources, transform data into usable format, and then load that data into the data warehouse system for use by end users. Success in the bottom tier operations cannot be overstressed, as the saying goes, trash in, trash out.
- Middle Tier — The middle layer provides an access interface for data scientists and analysts to the data prepared in the bottom tier. Online analytical processing servers (OLAP) are used to rapidly query databases and return insights. Companies can choose from four types of OLAP servers: relational OLAP (ROLAP), multidimensional OLAP (MOLAP), hybrid OLAP (HOLAP), or specialized SQL servers.
- Top Tier — The top tier systems deliver reports, visualizations, and OLAP analysis. This is the most familiar layer of the data warehouse architecture, in some organizations, every department has access to this layer for insights in their respective areas. Leading solutions even grant significant data capabilities for end users, such as access to historical records, integrations between different systems, and, if applicable, real-time information.
Traditional data warehouses have increasingly moved to cloud supported architectures. The data warehouse value proposition combined with the scalable technologies in the cloud have created storage opportunities for enterprises to double down on the data efforts. Several of the cloud enabling advantages include:
- Virtually Infinite Scalability — The main draw to cloud architectures is the near infinite scalability afforded to cloud consumers on-demand. As enterprise systems become more data hungry, their data warehouses will need room to expand quickly.
- Infrastructure Cost Controls — Coupled with virtually infinite scalability are the cost controls that come with it. Because cloud services are usually based on subscription or pay-as-you-go models, organizations can anticipate and control their IT costs, pay more in peak times when needed, and pay less when there is low traffic.
- Near Instant Setup — Leading cloud vendor solutions, like Snowflake, Amazon Redshift or Azure Synapse Analytics, are typically much cheaper and faster to set up than on-premises systems. Because interfacing with the data warehouse is through the cloud, simple setups can be just a few clicks away.
- Access Data Lake Infrastructure — Moving to the cloud provides enterprise data warehouse access to data lake technologies. Most leading cloud providers have made the extension of data warehouses with data lakes straightforward and painless. Other configurations can include integrating on-premise data warehouses in a hybrid style with cloud data lakes.
Data warehouses, unlike data lakes, are considered scheme-write systems, meaning that when data is stored in a data warehouse, it is fitted into a predefined data scheme which helps in cataloging and organizing. This process alludes to the fact that data warehouses are designed to carefully prepare data before storage so that analysis can quickly follow.
Though data warehouses cannot store the same volume as data lakes, to try would be exceptionally cost-prohibitive, they are helpful in processing immediate, critical data metrics helpful to real-time business operations. Oftentimes, enterprises use data lakes as a base in their data stack, connecting it to data warehouses, or other AI and machine learning analytics through their data pipeline.
Data lakes are broader data repository systems with data ingestion as a primary concern over data analysis. Though analytics are developing around data lakes, data lakes are highly inclusive, accepting all data types, supporting all users, and easy to adapt. Because of these characteristics, data lakes potentially hold the deepest business insights. The challenge in drawing out those insights is defined by the very data lake characteristics that enable deep insights, so much data and the breadth of diversity requires time to process and analyze.
In contrast, data warehouses standardize data formats at ingestion so that insights can be quickly delivered about domain specific channels in a timely manner, such as marketing insights, or account billings. Conceptually, data warehouses represent an increase in data refinement at the sacrifice of data scope over data lakes.
Data warehouses can also be classified into three main types based on their use.
- Enterprise Data Warehouse (EDW) — EDW represents the common understanding of data warehouses, a central repository that stores business data for the purpose of drawing insight. These systems are full featured, offering data organization, categorization, and a unified approach to accessing and securing data.
- Operational Data Store (ODS) — ODS systems extend functionality by combining data sources, but typically in a snapshot or real-time fashion. These systems are geared for operational applications and tend to be designed to perform faster than EDWs.
- Data Marts — Data marts are subsets of data warehouses, normally receivers of processed or analyzed insights from the data warehouse, and grouped together underneath defined silos. These silos can be mapped for many purposes, but are normally set up to serve individual business departments of the enterprise, like billing, or customer service.
Enterprises deploy data warehouses as a storage solution to control and understand their data assets, and thereby make faster quality business decisions. For many reluctant businesses, basic data warehousing is a necessary cost to remain viable, but leading companies are leveraging them and integrating technologies to improve their competitive advantages.
- Higher Quality Data — Data warehouses ingest massive volumes of data from multiple systems and data sources. By aggregating this data into a central repository, removing bad data, and standardizing it into a usable format, organizations can apply any number of new operations to turn data into actionable insight. Data warehouses are specifically set up to help companies create higher quality data.
- Rapid Business Insights — Valuable business insights stem from the high-quality data produced by data warehouses. Companies that establish data warehouses know that rapid business insights have positively impacted their businesses. Through analytics and integrations, companies can quickly turn high quality data into valuable insights.
- Informed Decision Making — Data warehouses support integrations with game-changing analytics technologies, like data-mining, artificial intelligence, machine learning, and industry specific tools. These technologies can pull patterns out of the data that traditional analytics simply is not strong enough to do. Such advanced applications can have tremendous benefits, and cost savings, for larger business systems, like global supply chains.
- Increased Competitive Advantage — Data warehouses and the analytics they provide give a significant boost to a company’s competitive advantage. But more companies are adopting these technologies and equally benefiting from their capabilities. Yet, data warehouses are more than just the analytics. Data warehouses provide a hook for integrating novel technologies and processes and building on previous data technologies, like AI and machine learning, effectively forming a foundational platform for companies to augment their capabilities and increase competitive advantage.