Beyond the Data Lake: How Data Virtualization Drives Real Enterprise Agility

"Data lake" has become one of the most overused terms in enterprise technology. Organizations deploy data lakes with the expectation that consolidating their data assets in one place will unlock analytical power, enable faster decision-making, and reduce the friction that slows their response to market changes. The reality, for many, has been more complicated.

The fundamental problem with the data lake as a strategy is that the premise -- migrating all data to a single location -- has never been fully achievable and is becoming less so over time. Data sources multiply faster than integration projects can keep pace with them. Formats vary enormously, from structured transactional records to sensor streams to unstructured documents. Security and access constraints create additional barriers. And the work of normalizing and integrating disparate data sources before knowing how they will be used is both expensive and slow.

What organizations actually need is not a single repository but a coherent framework for accessing diverse data sources in context, maintaining quality and governance at scale, and supporting the pace of experimentation and adaptation that competitive markets now demand. Data virtualization, combined with master data management and sound governance practices, provides that framework.

This article originally appeared in the September/October 2016 issue of IT Pro, published by the IEEE Computer Society.

The Data Integration Challenge in Modern Enterprises

Enterprise data today is not a single thing. It is a mix of structured and unstructured sources, real-time and historical records, internal and external inputs, each with its own format, quality characteristics, and access requirements.

Structured transactional data from ERP and CRM systems needs to be combined with unstructured content from documents, email, and knowledge management tools. Sensor and clickstream data -- high-velocity, high-volume, often noisy -- requires preprocessing before it can be integrated with verified internal records. External data sources, which vary widely in quality, need cleansing before they can be trusted alongside internal sources. Performance dashboards draw on knowledge management systems as well as analytical platforms.

Attempting to normalize and integrate all of these streams in advance, to suit a limited set of anticipated use cases, is impractical. Sources change too quickly. New use cases emerge before the previous integration project is complete. And the investment required to build and maintain those integrations grows with every new source added to the landscape.

At the same time, leaving data unmanaged creates its own set of problems. When data quality is not maintained, when data supply chains are not governed, downstream users and applications cannot trust their results. Time and effort get consumed verifying, rechecking, and in some cases reproducing analyses that should have been reliable the first time.

The challenge is finding a path between these two failure modes: the brittle, expensive, pre-integrated approach on one side, and the ungoverned data swamp on the other.

What Data Virtualization Actually Does

Data virtualization addresses this challenge by inserting a logical layer between data sources and the applications and analysts that consume them. Rather than migrating data into a new physical location, sources remain in place. Quality standards, ownership definitions, and data transformations are managed at this logical layer. Governance controls are applied consistently across sources. A centralized catalog tracks and documents what is available, who owns it, and what its characteristics are.

From the user's perspective, diverse and distributed data sources appear as a unified repository. The complexity of the underlying landscape is hidden. Analysts and data scientists can access what they need without having to navigate the architectural details of each source system.
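
To make this concrete, the sketch below uses nothing but the Python standard library to place a single logical view in front of two stand-in sources: an in-memory transactional database and a stubbed external loyalty service. Every name in it is invented for the example, and a real virtualization platform would define such views declaratively rather than in application code.

import sqlite3

# Source 1: a structured transactional store (stand-in for an ERP or CRM database).
erp = sqlite3.connect(":memory:")
erp.execute("CREATE TABLE orders (customer_id TEXT, total REAL)")
erp.executemany("INSERT INTO orders VALUES (?, ?)",
                [("C001", 120.00), ("C001", 75.50), ("C002", 310.00)])

# Source 2: an external service, stubbed here as a dictionary keyed by customer.
loyalty_api = {"C001": {"tier": "gold"}, "C002": {"tier": "silver"}}

class VirtualCustomerView:
    """Presents two physically separate sources as one logical record set.
    Data stays where it lives; this layer only federates access to it."""

    def get(self, customer_id: str) -> dict:
        # Pull transactional history from the relational source, in place.
        row = erp.execute(
            "SELECT COALESCE(SUM(total), 0) FROM orders WHERE customer_id = ?",
            (customer_id,),
        ).fetchone()
        # Pull enrichment data from the external source, in place.
        loyalty = loyalty_api.get(customer_id, {})
        # The consumer sees one unified shape, not two source systems.
        return {
            "customer_id": customer_id,
            "lifetime_spend": row[0],
            "loyalty_tier": loyalty.get("tier", "unknown"),
        }

view = VirtualCustomerView()
print(view.get("C001"))  # {'customer_id': 'C001', 'lifetime_spend': 195.5, 'loyalty_tier': 'gold'}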

Virtualization also enables the separation of the semantic layer -- the terminology, definitions, translations, and business language that give data meaning -- from the physical data sources themselves. This separation is critical. When business terminology is embedded in individual source systems, changing a definition or reconciling different systems' use of the same term requires changes across multiple platforms. When that semantic layer is managed independently at the virtualization level, business language can be harmonized, updated, and applied consistently without touching underlying systems.
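
A rough sketch of that separation, again with invented terms and field bindings: the business definition lives in one place, each physical source is bound to it independently, and changing the definition touches only the semantic layer.

# Hypothetical semantic layer held apart from the physical sources.
SEMANTIC_LAYER = {
    "active_customer": {
        "definition": "Any customer with a purchase in the last 365 days",
        "bindings": {
            "erp": "orders.last_order_date >= CURRENT_DATE - 365",
            "crm": "contact.status = 'ACTIVE'",
        },
    },
    "net_revenue": {
        "definition": "Gross revenue minus returns and discounts",
        "bindings": {
            "erp": "invoices.gross - invoices.returns - invoices.discounts",
        },
    },
}

def resolve(term: str, source: str) -> str:
    """Translate a business term into the physical expression for one source.
    Changing the business definition happens here, once -- not in every system."""
    return SEMANTIC_LAYER[term]["bindings"][source]

print(resolve("active_customer", "crm"))  # contact.status = 'ACTIVE'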

The result is faster integration, more consistent access, and a governance framework that can keep pace with the rate of change in the data landscape rather than constantly lagging behind it.

Digital Agility Requires More Than Technology

Data virtualization matters because digital agility matters -- and digital agility is not simply a technology capability. It is an organizational capability that depends on how quickly an enterprise can take in new information, make sense of it, and translate insight into action.

Organizations are under continuous pressure to experiment with new offerings, processes, and business models. The underlying systems must support that experimentation rather than constrain it. When information flows are slow, when data sources cannot be recombined dynamically, when analytical results cannot be quickly disseminated to the people who need to act on them, the pace of innovation slows. Every layer of brittle integration, every data translation bottleneck, every disconnected process is friction that reduces the organization's clock speed.

Business requirements have always evolved faster than IT systems. The difference now is that the competitive cost of that gap has increased dramatically. Organizations that can absorb new data sources, create new applications, and turn analysis into operational action faster than their competitors have a structural advantage that compounds over time.

How Virtualization Supports Customer Intelligence

One of the most valuable applications of data virtualization is in developing a comprehensive, real-time view of the customer. Understanding what customers need, how they are experiencing the brand, and what is likely to influence their next decision requires synthesizing data from a wide range of sources: transaction systems, call center transcripts, social media activity, loyalty program feedback, net promoter scores, online behavior, email interactions, and more.

Building and maintaining that view demands three interconnected capabilities. First, the organization needs a solid baseline understanding of each customer relationship, which requires integrating and aggregating data across all externally facing systems and internal customer intelligence tools. Second, it needs the ability to experiment with offerings, promotions, and product configurations, which requires internal access to the information and collaboration systems that translate customer insights into tangible proposals. Third, it needs to capture the results of those experiments quickly and feed them back into the analytical cycle, which depends on data being readily accessible when needed.

The faster these cycles can turn, the faster the organization can identify what works, correct what does not, and deepen the connections that drive customer loyalty. Data virtualization provides the access layer that makes this possible, processing both real-time behavioral signals and historical records to produce synthesized outputs that inform the next decision.
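
As a simplified illustration of that synthesis -- the event types, thresholds, and decision rule below are hypothetical, chosen only to show a historical baseline and real-time signals feeding one decision:

from datetime import datetime, timedelta

# Historical record, e.g. aggregated from transaction and analytics systems.
historical = {"C001": {"lifetime_spend": 195.50, "churn_risk": 0.2}}

# Recent behavioral signals, e.g. from call center and clickstream feeds.
recent_events = [
    {"customer_id": "C001", "type": "support_complaint",
     "at": datetime.now() - timedelta(hours=3)},
    {"customer_id": "C001", "type": "pricing_page_view",
     "at": datetime.now() - timedelta(minutes=10)},
]

def next_best_action(customer_id: str) -> str:
    baseline = historical.get(customer_id, {"churn_risk": 0.5})
    window = datetime.now() - timedelta(hours=24)
    signals = {e["type"] for e in recent_events
               if e["customer_id"] == customer_id and e["at"] >= window}
    # Real-time signals adjust what the historical record alone would suggest.
    if "support_complaint" in signals and baseline["churn_risk"] > 0.1:
        return "route to retention specialist"
    if "pricing_page_view" in signals:
        return "offer upgrade consultation"
    return "no action"

print(next_best_action("C001"))  # route to retention specialist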

From Insight to Action at the Front Line

Generating analytical insight is only half the challenge. The other half is moving that insight through the organization quickly enough for it to change what front-line employees actually do.

Transformation means behavioral change at the operational level. Higher-level analytical findings must be translated into specific, actionable guidance and delivered to the people responsible for execution. Organizations need to analyze data, convert findings into usable knowledge, disseminate that knowledge in the form of updated processes, and then measure whether those processes are being followed. Each of these steps requires data to move through the organization reliably and quickly.

When this chain of activity runs through layers of fragile integrations, inconsistent terminology, or disconnected systems, the clock speed of change slows. Analytical cycles that should take days stretch into weeks. Employees who cannot locate the results of a recent analysis may run it again, wasting capacity. Insights that arrive too late to influence a decision lose their value.

Reducing the cognitive load on users -- by simplifying access, harmonizing business language across systems, and hiding the complexity of underlying data sources -- also improves the organization's capacity to absorb change itself. The limiting factor in transformation programs is increasingly not budget, resources, or technology. It is the human capacity to process and adapt to change while managing existing responsibilities. Anything that makes information easier to find and use directly expands that capacity.

Data Virtualization as the Foundation for Governance and MDM

Data virtualization is not a replacement for master data management or data governance. It is the mechanism through which the benefits of those programs are realized operationally.

With a virtualization layer in place, master data can be drawn from a verified enterprise source of record and combined with operational and real-time data to deliver a consistent, unified view of any target object or process. Data quality monitoring can be applied at the virtualization layer, with normalization operations handling routine transformations before data reaches consuming applications. Governance and compliance mechanisms can be embedded at the point of data onboarding rather than applied inconsistently downstream.
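
A sketch of what that can look like at the point of onboarding, with hypothetical field names and rules: normalization and validation run once at the layer, and the mastered record is merged with operational data before anything reaches a consuming application.

# Hypothetical "golden" master record from the enterprise source of record.
golden_record = {"C001": {"legal_name": "Acme Corp.", "country": "US"}}

def normalize(raw: dict) -> dict:
    """Routine transformations handled once, before consumers see the data."""
    return {
        "customer_id": raw["customer_id"].strip().upper(),
        "email": raw.get("email", "").strip().lower(),
        "open_balance": round(float(raw.get("open_balance", 0)), 2),
    }

def validate(record: dict) -> list:
    """Quality checks applied at the point of onboarding, not downstream."""
    issues = []
    if not record["customer_id"]:
        issues.append("missing customer_id")
    if record["email"] and "@" not in record["email"]:
        issues.append("malformed email")
    return issues

def unified_view(raw: dict) -> dict:
    record = normalize(raw)
    issues = validate(record)
    if issues:
        raise ValueError(f"rejected at onboarding: {issues}")
    # Master data supplies the verified identity; operational data supplies current state.
    return {**golden_record.get(record["customer_id"], {}), **record}

print(unified_view({"customer_id": " c001 ", "email": "AP@ACME.COM ",
                    "open_balance": "1042.5"}))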

The virtualization layer also serves as an enterprise data catalog -- a centralized point of visibility into what data exists, where it comes from, who owns it, and how it may be used. That visibility is foundational to both governance effectiveness and analytical productivity.
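
The sketch below shows the kind of metadata such a catalog entry might carry; the fields are illustrative assumptions rather than a prescribed schema.

from dataclasses import dataclass, field

@dataclass
class CatalogEntry:
    name: str                 # logical name consumers search for
    source_system: str        # where the data physically lives
    owner: str                # accountable business owner
    steward: str              # day-to-day data steward
    classification: str       # e.g. "public", "internal", "restricted"
    refresh: str              # how current the data is expected to be
    allowed_uses: list = field(default_factory=list)

catalog = {
    "customer_360": CatalogEntry(
        name="customer_360", source_system="virtualization layer",
        owner="VP Customer Experience", steward="Data Governance Office",
        classification="restricted", refresh="near real time",
        allowed_uses=["analytics", "service operations"],
    ),
}

entry = catalog["customer_360"]
print(entry.owner, "-", entry.classification)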

Improving enterprise data agility through this kind of infrastructure is not a technology project with a defined end state. It is an ongoing organizational capability -- one that becomes more valuable as the data landscape grows more complex and the competitive premium on speed and adaptability continues to increase.


This article was originally published in IT Pro by the IEEE Computer Society and has been revised for Earley.com.
