DataOps, an umbrella term, refers to an organization's collection of technical practices, workflows, cultural norms, and architectural patterns governing how it integrates its data assets into its business goals and processes. Each organization's data pipelines are therefore likely to be configured differently; in general, however, DataOps efforts intend to enable four capabilities within the company:
1. Rapid innovation, experimentation, and testing to deliver data insights to users and customers.
2. High data quality with extremely low error rates.
3. Synchronized collaboration across teams of people, environments, and technologies.
4. Orchestrated monitoring of data pipelines to ensure clear measurement and transparent results.
Data pipelines are a key concept in DataOps and encompass all of the data processing steps needed to rapidly deliver comprehensive, curated data to data consumers. By analogy, pipelines are sometimes referred to as data factories, a term that lends tangibility to data assets and reinforces the idea of data as a raw material to be processed.
Data engineers may architect multiple data pipelines in their DataOps designs: pipelines between data providers and data preparers, and pipelines that serve data consumers. For example, one pipeline may flow application data into a data warehouse, another may move data from a data lake to an analytics platform, and further pipelines may feed back into their own source systems purely for processing, such as sales data flowing back into a sales platform.
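To make the pipeline idea concrete, the sketch below shows one hypothetical flow of raw application order records into a curated warehouse table. The record fields, table name, and in-memory SQLite "warehouse" are illustrative assumptions, not a prescribed DataOps design.

```python
# A minimal, hypothetical pipeline: application records -> curated warehouse table.
import sqlite3
from datetime import datetime, timezone

def extract_application_data():
    """Extract raw order records from an application source (stubbed here)."""
    return [
        {"order_id": 1, "amount": "19.99", "region": " US-East "},
        {"order_id": 2, "amount": "5.00", "region": "EU-WEST"},
    ]

def transform(records):
    """Curate raw records: normalize types and formats for downstream consumers."""
    return [
        {
            "order_id": r["order_id"],
            "amount": float(r["amount"]),
            "region": r["region"].strip().lower(),
            "loaded_at": datetime.now(timezone.utc).isoformat(),
        }
        for r in records
    ]

def load(records, warehouse):
    """Load curated records into a warehouse table."""
    warehouse.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(order_id INTEGER, amount REAL, region TEXT, loaded_at TEXT)"
    )
    warehouse.executemany(
        "INSERT INTO orders VALUES (:order_id, :amount, :region, :loaded_at)", records
    )

if __name__ == "__main__":
    warehouse = sqlite3.connect(":memory:")  # stand-in for a real data warehouse
    load(transform(extract_application_data()), warehouse)
    print(warehouse.execute("SELECT COUNT(*) FROM orders").fetchone()[0], "rows loaded")
```

Each stage here is a plain function, which keeps the flow easy to test and automate, the same properties the DataOps workflow stages described later depend on.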
At a high level, the mission of DataOps is to maximize data as an enterprise asset by turning it into business value. DataOps has been summarized as the practice of getting the right data to the right place at the right time.
Only over the last decade, however, have technological and business conditions set the stage for DataOps to emerge as a recognized set of practices, developing out of earlier disciplines such as DevOps. A formalized framework of best practices has yet to coalesce. Many observers believe DataOps is still moving through the early stages of the hype cycle, even though the marketplace has already welcomed many vendor solutions, including end-to-end DataOps platforms.
DataOps has foundations stemming from many processes historically grounded in DevOps. Today three schools of thought comprise the main foundational principles of DataOps: Agile Development, DevOps, and Lean Manufacturing.
Growing at a CAGR of 23%, global data is expected to exceed 180 zettabytes by 2025, a trend motivating numerous vendors to develop DataOps tools and platforms. Organizations can expand their DataOps practices to manage this anticipated flood of insight-rich data.
DataOps platforms provide end-to-end data control, encompassing everything from data ingestion to analytics and reporting, whereas DataOps tools each target one of the six capabilities of DataOps.
On a smaller scale, companies can improve their DataOps processes and accuracy with specialized data tools that strengthen individual data pipelines. The overarching mission of DataOps, however, is an organization-wide culture change that puts data first and strives to maximize the value of all data assets. The following DataOps framework elements help organizations think holistically about their people, processes, and technology.
1. Enabling Technologies — Use technologies such as IT automation and data management tools.
2. Adaptive Architecture — Deploy systems that allow for continuous integration (CI) and continuous deployment (CD).
3. Intelligent Metadata — Technology that automatically enriches incoming data with metadata (see the sketch after this list).
4. DataOps Methodology — A game plan for deploying analytics and data pipelines while adhering to data governance policies.
5. Culture and People — Cultivation of organizational ethos that appreciates and utilizes data and aims to maximize data assets.
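As one illustration of the intelligent-metadata element above, the sketch below tags incoming records with descriptive context such as source system and ingestion time. The field names and enrichment rules are assumptions chosen for illustration, not a standard DataOps API.

```python
# Hypothetical metadata enrichment: wrap each incoming record with descriptive
# context so downstream pipelines and data catalogs can reason about it automatically.
import hashlib
import json
from datetime import datetime, timezone

def enrich_with_metadata(record: dict, source_system: str) -> dict:
    """Return the record wrapped with illustrative metadata fields."""
    schema_fingerprint = hashlib.sha1(
        json.dumps(sorted(record.keys())).encode("utf-8")
    ).hexdigest()[:12]
    return {
        "payload": record,
        "metadata": {
            "source_system": source_system,
            "ingested_at": datetime.now(timezone.utc).isoformat(),
            "schema_fingerprint": schema_fingerprint,  # changes whenever fields change
            "field_count": len(record),
        },
    }

if __name__ == "__main__":
    raw = {"customer_id": 42, "plan": "pro", "mrr": 99.0}
    print(json.dumps(enrich_with_metadata(raw, source_system="billing_app"), indent=2))
```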
DataOps is not DevOps, but DataOps processes have benefited significantly from DevOps, one of its foundational methodologies. DevOps introduces two capabilities that enable Agile development within DataOps: continuous integration (CI) and continuous deployment (CD). Agile methods demand short development cycles in the form of sprints, yet running tests and deploying releases has traditionally been manual, which is slow and error-prone. With CI and CD, automation removes those obstacles to Agile thinking, namely the time-consuming and risky aspects of manual workflows.
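As a concrete, hypothetical illustration of CI in a data context, the test below checks a small transformation and could be run automatically on every code change by a test runner such as pytest; the function and field names are assumptions for the example.

```python
# Hypothetical CI check: a unit test for a data transformation that a CI server
# could run automatically on every commit, replacing a manual verification step.
def normalize_region(value: str) -> str:
    """Trim whitespace and lowercase a region code before it is loaded."""
    return value.strip().lower()

def test_normalize_region_handles_padding_and_case():
    # A test runner such as pytest would discover and execute this automatically.
    assert normalize_region("  US-East ") == "us-east"
    assert normalize_region("EU-WEST") == "eu-west"

if __name__ == "__main__":
    test_normalize_region_handles_padding_and_case()
    print("transformation checks passed")
```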
DataOps brings the CI and CD concepts into its own workflow, enabling the same Agile thinking in its data preparation and design while automating processes in the DevOps style, so that for data users the data factory seems to disappear. Common stages of a DataOps workflow are:
1. Sandbox Management — The process of creating an isolated environment for experimentation.
2. Development — The design and building of apps.
3. Orchestrate — Two orchestration stages occur: the first orchestrates a representative copy of the data for testing and development, and the second orchestrates the data factory itself.
4. Test — The testing stage targets code rather than data; testing the data itself is a primary task of the orchestration stages.
5. Deploy — As in DevOps, once code tests pass, the code is deployed to production.
6. Monitor — Monitoring occurs at every stage, with particular attention to the end of the data factory so that data exits pristine and trustworthy (a minimal sketch of such an exit check follows this list).
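To ground the test and monitor stages, here is one hypothetical exit check a data factory might run before curated data reaches consumers. The quality rules, threshold, and record fields are assumptions chosen for illustration.

```python
# Hypothetical end-of-pipeline monitor: verify curated records before they reach
# data consumers, and report simple quality metrics for transparency.
from dataclasses import dataclass

@dataclass
class QualityReport:
    total: int
    failed: int

    @property
    def error_rate(self) -> float:
        return self.failed / self.total if self.total else 0.0

def check_exit_quality(records: list, max_error_rate: float = 0.01) -> QualityReport:
    """Count records violating illustrative rules (missing id, negative amount)."""
    failed = sum(
        1 for r in records
        if r.get("order_id") is None or r.get("amount", 0) < 0
    )
    report = QualityReport(total=len(records), failed=failed)
    if report.error_rate > max_error_rate:
        # In a real data factory this would alert an on-call engineer or halt the load.
        raise ValueError(f"error rate {report.error_rate:.2%} exceeds threshold")
    return report

if __name__ == "__main__":
    curated = [{"order_id": 1, "amount": 19.99}, {"order_id": 2, "amount": 5.0}]
    print(check_exit_quality(curated))
```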
A defining characteristic of DataOps is the number of roles that interact with and contribute to the accumulation, processing, and use of a company's data assets. At the extreme, companies whose data assets are their main value proposition have the most immediate need to understand the people engaging with their proprietary information. These DataOps roles can be broadly classified as data consumers, data preparers, and data suppliers.
Data Suppliers — Data suppliers are the end data owners, such as database administrators, responsible for data management, processing, and user access control within a company's DataOps.
Data Preparers — As DataOps grows more complex, a middle ground of roles, including data engineers, is developing between data suppliers and data consumers. Data engineers build the pipelines that refine raw data into usable, valuable, and monetizable assets. Data curator is an emerging role that starts from consumer needs and optimizes DataOps content accordingly, giving the business the context it needs to enhance final data assets. Another emerging role, driven by heightened data governance requirements, is the data steward, who is responsible for developing company data governance policies and ensuring compliance.
Data Consumers — Data consumers receive the final data output and form the largest group interacting with DataOps assets. Many roles have emerged: data scientists apply data to solve business problems, data citizens are frontline workers who need real-time information, and data developers rely on accurate DataOps as they build business applications on top of those pipelines.