Driving Data Discovery and Reliability for Better Business Decision Making

Liam Yu
Senior Product Solutions Marketing Manager, Integrated Systems

September 25, 2023


Enterprises are drowning in data. For the modern, data-driven enterprise, data is everything, everywhere, all at once, whether it is structured, semi-structured or unstructured. That abundance is also a challenge for enterprises looking to transform their data into usable information for business success.

The sheer volume of data is challenging the ability of enterprises to find trustworthy, reliable data to drive their business decisions. Traditional data catalogs offer only structured data discovery. There is no end-to-end solution to help organizations discover trusted data across all data types.

What’s needed is a solution that brings together the three key components of the data challenge puzzle: data discovery, data observability and data reliability.

These three components also increasingly align with the needs of enterprises to apply data to improve work performance, make sound business decisions and derive value from the data they have.

Data Discovery

Whether data is in a PDF document, Word document, relational database, log, or machine-recorded telemetry, the typical enterprise has huge amounts of it. A recent study calculated the total volume of data stored by a typical enterprise at 10 petabytes (PB), an amount equivalent to more than 23 billion files, of which more than half (52%) was considered dark data, meaning data that has no value assigned to it at all.

The reason for that amount of unclassified data is simple: No one in any enterprise has enough time in the day to work out what percentage of its data is valuable to the organization. Collecting and integrating that data typically involves manually extracting it from many different sources, formats and system vendors, across a variety of locations on-prem, in multiple clouds and at the edge.

The concept of data discovery is synonymous with the reading or profiling of that data. New artificial intelligence (AI) and machine learning (ML) software tools, such as Pentaho Data Catalog, allow enterprises to automate the classification, tagging and management of data files and to understand the quality of their data. These tools help enterprises understand data content and context by generating insights about that data or its metadata. For example, a tool could reveal how many times somebody’s name is mentioned in a hospital’s patient medical records. Or it could uncover the number of times the phrase “interest rate” is used in a customer’s financial records.
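The phrase-counting example above can be sketched in a few lines of Python. This is only an illustration of the idea, not how Pentaho Data Catalog or any other product works internally; the function name `count_phrase` and the sample records are hypothetical.

```python
import re
from collections import Counter


def count_phrase(documents, phrase):
    """Count case-insensitive, whole-phrase matches in each document.

    `documents` maps a document name to its text content.
    """
    pattern = re.compile(r"\b" + re.escape(phrase) + r"\b", re.IGNORECASE)
    counts = Counter()
    for name, text in documents.items():
        counts[name] = len(pattern.findall(text))
    return counts


# Hypothetical sample records, invented for illustration only.
records = {
    "loan_2021.txt": "The interest rate is fixed. Interest rate changes apply after year five.",
    "statement_q3.txt": "No change to the account this quarter.",
}

counts = count_phrase(records, "interest rate")
total = sum(counts.values())
print(counts, total)
```

A real discovery tool would also have to extract text from PDFs, Word documents and databases before any counting could happen; that step is left out here.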

This is crucial for enterprises looking to determine how much of their data is of value to the business, influences positive customer outcomes or generates better business processes. Only by automating the data discovery process can enterprises take the first step toward capturing those kinds of insights about their data.

Data Observability

The second pillar of a successful data strategy is ensuring that data is visible and meaningful to business users. Data observability provides the capability to monitor data usage across the enterprise. That monitoring is critical to answering such questions as: Who’s using the data? Where did the data originate? Has it been changed? And if it has been changed, when, where, why and by whom?

Data observability supplies enterprises with the capability to track and record activity on each data file, document, or record. Armed with that information, enterprises can create a baseline for normal behavior. That understanding is critical to protecting the enterprise against cyberattacks, because it makes it easier to identify abnormal, potentially threatening behavior.
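One simple way to picture a baseline is as the average and spread of past activity, with anything far outside that range flagged for review. The sketch below assumes daily read counts per user and a z-score threshold of 3; the function name `find_anomalies`, the threshold and the sample numbers are all assumptions made for illustration, not a description of any particular product.

```python
from statistics import mean, stdev


def find_anomalies(baseline, observed, threshold=3.0):
    """Flag observed counts that deviate strongly from a historical baseline.

    `baseline` is a list of past daily read counts (at least two values);
    `observed` maps a user to today's count. A count is flagged when its
    z-score against the baseline exceeds `threshold` in absolute value.
    """
    mu = mean(baseline)
    sigma = stdev(baseline)
    flagged = {}
    for user, count in observed.items():
        # With zero spread in the baseline no z-score is defined; treat as 0.
        z = (count - mu) / sigma if sigma > 0 else 0.0
        if abs(z) > threshold:
            flagged[user] = z
    return flagged


# Hypothetical numbers, invented for illustration only.
baseline_reads = [40, 42, 38, 45, 41, 39, 43]
today = {"alice": 44, "bob": 37, "svc-backup": 400}

flagged = find_anomalies(baseline_reads, today)
print(sorted(flagged))
```

Production systems usually rely on richer models than a single z-score, but the principle is the same: normal behavior has to be measured before abnormal behavior can be recognized.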

Data observability also allows enterprises to identify data that, after it was collected, has never been touched or consumed. This is “dark” or “dead” data that may never have value to the business. Enterprises can determine if that unconsumed data should be moved to less costly storage media, archived or retired. And finally, data observability enables organizations to see how they use data in normal, day-to-day operations. It allows enterprises to monitor data in real time, improve business processes more nimbly and even contribute to achieving sustainability goals.
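The idea of finding data that has never been consumed can be sketched as a filter over access records. The example assumes a catalog that stores a last-accessed timestamp per item, with `None` meaning the item was never read; the field names, the 365-day cutoff and the sample entries are hypothetical.

```python
from datetime import datetime, timedelta


def find_unconsumed(entries, now, max_idle_days=365):
    """Return paths never accessed, or not accessed within `max_idle_days`."""
    cutoff = now - timedelta(days=max_idle_days)
    stale = []
    for entry in entries:
        last = entry["last_accessed"]
        if last is None or last < cutoff:
            stale.append(entry["path"])
    return stale


# Hypothetical catalog entries, invented for illustration only.
now = datetime(2023, 9, 25)
catalog = [
    {"path": "sales/2023_q2.csv", "last_accessed": datetime(2023, 9, 1)},
    {"path": "logs/2019_telemetry.json", "last_accessed": datetime(2020, 1, 15)},
    {"path": "scans/contract_0042.pdf", "last_accessed": None},
]

stale = find_unconsumed(catalog, now)
print(stale)
```

The resulting list is a set of candidates for cheaper storage, archiving or retirement. Whether any item should actually be moved remains a business decision.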

Data Reliability

The third pillar of a successful data strategy is determining whether the data is trustworthy and reliable. Can the data be trusted in making mission-critical decisions? Data reliability is enabled by the other two pillars of the strategy, and each of the three pillars plays its own part:

  • Data Discovery: The automation around understanding what data is.
  • Data Observability: Monitoring its usage across the organization.
  • Data Reliability: Where does the data come from? What is its quality? Is it accurate? Do I trust the source of that data? Who’s changed it? Is it consistent from end to end?
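One of the reliability questions above, whether data is consistent from end to end, can be illustrated with a checksum comparison between a source and a downstream copy. This is a minimal sketch covering only that one question; the function names and sample bytes are hypothetical, and a matching checksum says nothing about whether the source itself is accurate.

```python
import hashlib


def checksum(data: bytes) -> str:
    """Return the SHA-256 digest of the given bytes as a hex string."""
    return hashlib.sha256(data).hexdigest()


def is_consistent(source: bytes, copy: bytes) -> bool:
    """True when the copy is byte-for-byte identical to the source."""
    return checksum(source) == checksum(copy)


# Hypothetical payloads, invented for illustration only.
source = b"customer_id,balance\n1001,250.00\n"
good_copy = b"customer_id,balance\n1001,250.00\n"
bad_copy = b"customer_id,balance\n1001,205.00\n"

print(is_consistent(source, good_copy), is_consistent(source, bad_copy))
```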

A Single Version of Truth

A data strategy built on these three pillars allows enterprises to apply the data they hold to improve business operations, enable better business decision making and drive AI-assisted automation across their organizations.

The crucial first step is automating data discovery. Only through data discovery automation can enterprises obtain intelligence about all their data. It is the key to creating a single version of truth, because it identifies the correct and most accurate version of the data: the data that enterprises will know to be trustworthy and reliable enough to inform better business decision making and drive future business success.

Liam Yu is Senior Product Marketing Manager, Data Management, Hitachi Vantara.