Data virtualization is an approach to data analysis that overcomes the challenges of drawing on data stores in various physical locations. It creates a virtualized logical data layer that integrates data sources of multiple types from many locations, without the need to copy and manipulate the data in an additional data store, as data warehouses do. This approach eliminates the need for error-prone data replication or data migration that can lead to data corruption. For end users, data is presented in a single unified view, often with advanced visualizations.
Data virtualization was born as a solution to the challenges of data lakes. When data lakes began to fill up with output from technologies that produce massive amounts of data, such as social media, IoT-enabled devices, and mobile devices, organizations began searching for a way to integrate these data sources, which are often in unstructured formats, with their traditional structured business data.
Data warehouses take unstructured data from a data lake and, through an extract, transform, and load (ETL) process, structure it for later analysis. This process requires extra data replication and storage. It also carries a high potential for data inconsistencies, because numerous ETL processes are necessary and the repositories must be kept in sync. Data replication further raises security and governance concerns, namely where users' sensitive data should be stored. Data virtualization addresses these challenges with features that allow data consumers to view, access, and analyze disparate data sources regardless of location, without having to move the data.
Data federation is a subset of data virtualization. Conceptually, a data federation is like a virtual data warehouse: it logically maps remote data sources and then runs queries on those sources to draw data into a structured schema. This strict data model is created within a virtual environment, unlike a data warehouse, which replicates data within its own physical stores.
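To make the idea of a federated view concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not how any particular product works: the two sources (an in-memory SQLite table and a CSV export), the names `fetch_customers`, `fetch_orders`, and `customer_totals`, and the sample rows are all hypothetical. The point it shows is that each source is mapped into one shared schema and joined at query time, and nothing is copied into a separate store.

```python
import csv
import io
import sqlite3

# Hypothetical source 1: a relational store holding customers.
crm = sqlite3.connect(":memory:")
crm.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
crm.executemany("INSERT INTO customers VALUES (?, ?)", [(1, "Ada"), (2, "Grace")])

# Hypothetical source 2: a CSV export of orders from another system.
ORDERS_CSV = "customer_id,amount\n1,20.0\n1,5.5\n2,12.0\n"


def fetch_customers():
    """Map the relational source into the shared schema."""
    rows = crm.execute("SELECT id, name FROM customers ORDER BY id")
    return [{"customer_id": cid, "name": name} for cid, name in rows]


def fetch_orders():
    """Map the CSV source into the shared schema, casting text to types."""
    reader = csv.DictReader(io.StringIO(ORDERS_CSV))
    return [
        {"customer_id": int(r["customer_id"]), "amount": float(r["amount"])}
        for r in reader
    ]


def customer_totals():
    """Federated view: join both sources on demand; nothing is persisted."""
    totals = {}
    for order in fetch_orders():
        key = order["customer_id"]
        totals[key] = totals.get(key, 0.0) + order["amount"]
    return [
        {**customer, "total": totals.get(customer["customer_id"], 0.0)}
        for customer in fetch_customers()
    ]
```

Calling `customer_totals()` reads both sources each time it runs, so a change in either source shows up in the next result without any synchronization step.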
Two factors are pressuring enterprise data strategies: the massive volume of data generated by devices and systems used in day-to-day operations, and the inability of on-premises infrastructure to keep pace with those demands cost-effectively. While this has technologically cornered some organizations that have stalled in their efforts to migrate to the cloud, others have turned to data virtualization to corral disparate data sources and incorporate them into a unified real-time data view.
The main features of data virtualization middleware are:
Data virtualization requires software to connect with data sources and create the abstracted data layer where data scientists can then design data views and perform rich analytics. With many of the large cloud providers, such as Azure, IBM, and AWS, setting up data virtualization can be as easy as a three-step process that amounts to turning on the data virtualization service and making some configuration choices. These cloud providers offer data virtualization as an extension.
When companies run on their own systems rather than in the cloud, they may need to install server and client applications that perform the virtualization, and then make a few configuration choices, such as adding and selecting data sources.
Deploying a data virtualization architecture offers organizations several advantages.
Logical data warehouses (LDWs) and traditional data warehouses (or classical data warehouses) are complementary concepts. A traditional data warehouse is a physical set of servers and storage that draws data in, transforms it to fit its schema, and then analyzes and stores it for consumption. A logical data warehouse can be used to meet the challenges of using disparate data sources alongside data warehouses while avoiding the extra effort and capital costs of integrating a data lake.
Logical data warehouses are virtualization architectures that sit atop a data warehouse and extend its reach into other data repositories, such as enterprise data lakes, other data warehouses, Hadoop, or NoSQL stores. LDWs provide a holistic view of an organization’s data without the need to format or transform consolidated data. Essentially, LDWs give data warehouses the advantages of data virtualization, including the ability to access unstructured data and combine it with warehouse data without moving it.
Data warehouses and data lakes are two elements within a larger data storage ecosystem. While both offer massive storage, each stores data differently, which gives each its own advantages. Data lakes store data in unstructured, native, raw formats. Data warehouses hold their data in schema-structured formats, which requires processing the data first to align it to the schema.
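The difference between the two storage styles can be sketched in a few lines of Python. This is a simplified illustration, not a real lake or warehouse implementation: the `Lake` and `Warehouse` classes, the `SCHEMA` mapping, and the sensor-style field names are all assumptions made for the example. The lake accepts any raw payload and applies structure only when the data is read, while the warehouse rejects anything that does not fit its schema at load time.

```python
import json

# Hypothetical schema: field name -> function that casts to the target type.
SCHEMA = {"device_id": str, "temperature_c": float}


def to_schema(raw):
    """Parse a raw JSON string and cast it to SCHEMA; raises if it does not fit."""
    record = json.loads(raw)
    return {field: cast(record[field]) for field, cast in SCHEMA.items()}


class Lake:
    """Stores raw payloads untouched; structure is applied only on read."""

    def __init__(self):
        self.objects = []

    def store(self, raw):
        self.objects.append(raw)  # no validation, data simply accumulates

    def read(self):
        rows = []
        for raw in self.objects:
            try:
                rows.append(to_schema(raw))
            except (KeyError, TypeError, ValueError):
                continue  # nonconforming data stays in the lake, just skipped
        return rows


class Warehouse:
    """Stores only rows that were fitted to the schema before loading."""

    def __init__(self):
        self.rows = []

    def load(self, raw):
        self.rows.append(to_schema(raw))  # fails fast on bad input
```

In this sketch the warehouse ends up holding less data than the lake, because every record pays the processing cost up front and records that do not fit are never stored.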
Data lakes are optimized to store data in a manner that allows it to pile up. Data today is produced in such volumes that it can only be analyzed when resources allow, or on demand when a need arises, so it must be allowed to accumulate easily. In this way, data is never disposed of, just saved until needed. Data lakes use object storage, which treats large bodies of unstructured data as individual objects, unlike file systems, which treat data as files. Object storage facilitates massive data storage, usually in the petabyte range.
Data warehouses, on the other hand, can store huge amounts of data only after it has been fitted to the schema that structures the warehouse. Because data warehouses are more selective about their data, they store much less of it than data lakes do. This data manipulation requires resources and often adds a necessary extra step before insights can be drawn from the data. Because the function of the data warehouse is migrating to the cloud, where it is being replaced by data virtualization and data federation in combination with AI and data lakes, the use of on-premises data warehouses is likely to decline sharply in the coming years.
Generally, data virtualization, like other virtualizations such as application virtualization and server virtualization, provides IT departments with exceptional flexibility in designing and architecting solutions to larger business needs. Virtualization provides an abstracted separation between the user of services and those responsible for ensuring the services are available. With data virtualization, data consumers can draw upon data analytics unconcerned with where and how the underlying data is stored, while admins can organize how they provide that service without disrupting its use.
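The separation described above can be sketched as an interface that consumers depend on, with interchangeable implementations behind it. This is a minimal, hypothetical Python example: `SalesSource`, `OnPremSource`, `CloudSource`, `total_sales`, and the sample figures are invented for illustration and do not correspond to any vendor's API.

```python
from typing import Protocol


class SalesSource(Protocol):
    """What the data consumer sees: a view, not a storage location."""

    def monthly_sales(self) -> dict[str, float]: ...


class OnPremSource:
    """Hypothetical backend reading from an on-premises system."""

    def monthly_sales(self) -> dict[str, float]:
        return {"2024-01": 100.0, "2024-02": 150.0}


class CloudSource:
    """Hypothetical backend serving the same view from cloud storage."""

    def monthly_sales(self) -> dict[str, float]:
        return {"2024-01": 100.0, "2024-02": 150.0}


def total_sales(source: SalesSource) -> float:
    """Consumer code: unaware of where or how the data is stored."""
    return sum(source.monthly_sales().values())
```

An administrator could swap `OnPremSource` for `CloudSource` without any change to `total_sales`, which is the kind of non-disruptive reorganization the abstraction is meant to allow.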
Data virtualization supports several use cases. The following are ideal uses for data virtualization.
Data virtualization tools all aim to create a unified view for users through a logical collection of separate data sources. The following are identified by Gartner and G2 as the leaders in data virtualization tools.