Every CIO is familiar with the promise of big data: deeper customer understanding, faster time to market, anomaly detection, competitive intelligence, and the data foundations that machine learning requires. The gap between that promise and what most organizations actually achieve is wide, and it tends to persist despite significant investment in data infrastructure.
The organizations that extract real value from big data share a common characteristic. They understand their objectives clearly, they know what questions they want to ask of the data, and they have invested in organizing their data and the processes that act on it. That investment begins with something deceptively simple: knowing what data you have and understanding how it can be applied to create value for the enterprise.
Most organizations fall short at this starting point. They are not fully aware of the range of data sources available to them, and they have not thought systematically about how those sources connect to business outcomes. The behavioral data generated by marketing automation platforms is a representative example. These systems capture a rich stream of user interactions and electronic body language, but if that signal is never routed back into content processes and engagement strategies, it produces no value. The data is collected and then effectively discarded.
Making productive use of data requires more than collecting it. It requires closing the loop between observation and action, and doing so at two distinct levels.
The first loop operates at the level of individual interventions: collecting data, observing patterns, taking action, and then measuring whether that action produced the intended effect. At this level, adjustments are incremental. A piece of content is revised. A product relationship is reconfigured. A cross-sell rule is modified. A promotional offer is retargeted to a different segment. The objective is fixed; the adjustment is in the specific lever being pulled.
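To make the first loop concrete, the sketch below shows the measurement step for a single revised piece of content, using hypothetical before-and-after figures. It is a minimal illustration, not a testing methodology; a real evaluation would also control for traffic mix, seasonality, and test duration.

```python
# A minimal sketch of the first loop: act, then measure whether the action
# produced the intended effect. All figures are hypothetical.

def conversion_rate(conversions: int, visitors: int) -> float:
    """Share of visitors who converted."""
    return conversions / visitors if visitors else 0.0

# Hypothetical counts before and after revising one piece of content
before = conversion_rate(conversions=240, visitors=12_000)
after = conversion_rate(conversions=310, visitors=11_500)

lift = (after - before) / before
print(f"Before: {before:.2%}  After: {after:.2%}  Lift: {lift:+.1%}")

# The objective stays fixed; only the lever (the content) changed. If the lift
# is not meaningful, the next cycle tries a different incremental adjustment.
```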
The second loop operates at a higher level of abstraction. Here, the lessons accumulated from multiple first-loop cycles inform a reassessment of the underlying hypothesis itself, and potentially a change in what types of interventions are even on the table. The macro objective may remain the same, but the model for achieving it is revised based on what has been learned. This is the level at which strategy adapts rather than tactics adjust.
The vapor trail of user behavior data is the raw material for both loops. Patterns of clicks, conversions, downloads, and abandonment signal which content relationships and merchandising structures are working and which are not. Small adjustments can be made quickly. Larger structural changes, requiring development, testing, quality review, and deployment, may take months from identification to implementation. The intervention timescales at the two levels are fundamentally different, and governance structures need to account for that difference.
When the number of variables is large or the volume of interactions exceeds what any analyst can process manually, these feedback loops require more automated approaches. Machine learning can operate on the signals at a scale and speed that manual analysis cannot match. But automation does not eliminate the need for human judgment about metrics, objectives, and hypotheses. It depends on them.
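As a rough illustration of what that automation might look like, the sketch below applies a standard anomaly-detection algorithm (scikit-learn's IsolationForest) to hypothetical daily interaction features. The features, figures, and contamination threshold are placeholders chosen for illustration, not a prescribed approach; the point is that the algorithm surfaces candidates, while the judgment about what to do with them remains human.

```python
# A minimal sketch of automating one feedback signal with machine learning.
# Data and threshold are hypothetical.
import numpy as np
from sklearn.ensemble import IsolationForest

# Hypothetical daily interaction features: [sessions, conversions, abandonment_rate]
daily_signals = np.array([
    [1200, 85, 0.31],
    [1150, 80, 0.33],
    [1300, 90, 0.30],
    [1250, 12, 0.78],   # a day that departs sharply from the usual pattern
    [1180, 82, 0.32],
])

# Flag days whose interaction pattern departs from the norm; an analyst (or a
# downstream rule) then decides which first-loop adjustment, if any, to make.
detector = IsolationForest(contamination=0.2, random_state=0)
flags = detector.fit_predict(daily_signals)   # -1 = anomalous, 1 = normal

for day, flag in enumerate(flags):
    if flag == -1:
        print(f"Day {day}: unusual interaction pattern, review recent content and promotion changes")
```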
The familiar definition of big data focuses on volume, variety, and velocity. But the more operationally useful distinction is between large data and big data.
Large data is high-volume structured output from established systems: transaction records from a major retailer, for example. It is well-formed, consistent, and governed by a defined schema. The elements are understood, normalized, and comparable across time periods and contexts. Analyses on large data are technically demanding but conceptually straightforward.
Big data is something different. It streams in from diverse sources, often with inconsistent definitions of core concepts. One system may define a customer as a household; another as an individual. One source may use one naming convention for a data element; another may use a different name for the same concept. Before any meaningful analysis can happen, these inconsistencies have to be reconciled. The most interesting insights typically come from combining disparate sources and looking for patterns across data sets, but that combination step is precisely where the work gets hard.
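As a simple illustration of that reconciliation step, the sketch below maps records from two hypothetical source systems (a CRM keyed by household and an e-commerce platform keyed by individual) onto a shared set of field names. The system names, fields, and values are illustrative only.

```python
# A minimal sketch of reconciling field names across two hypothetical sources
# before they can be combined for analysis.
FIELD_MAP = {
    "crm":       {"cust_id": "customer_id", "seg": "segment", "rev": "revenue"},
    "ecommerce": {"shopper": "customer_id", "tier": "segment", "sales": "revenue"},
}

def to_canonical(record: dict, source: str) -> dict:
    """Rename source-specific fields to the shared business vocabulary."""
    mapping = FIELD_MAP[source]
    return {mapping.get(k, k): v for k, v in record.items()}

crm_row = {"cust_id": "H-102", "seg": "household", "rev": 1840.0}
web_row = {"shopper": "U-9917", "tier": "individual", "sales": 260.0}

print(to_canonical(crm_row, "crm"))
print(to_canonical(web_row, "ecommerce"))

# Only after this mapping, plus a rule for rolling individuals up to households
# (or splitting households apart), can the two sources be combined meaningfully.
```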
Big data technologies address part of this challenge by enabling processing of large volumes on commodity hardware, distributing workloads to reduce cost and latency. But the technology does not resolve the underlying data quality and consistency problems. A data lake that accepts any format from any source without governance will accumulate data faster than any organization can make use of it.
The data lake concept is architecturally useful: a repository that accepts heterogeneous data types and structures without requiring the predefined schema that traditional data warehouses demand. Weather feeds, traffic sensors, click streams, social media output, and transactional records can all flow into the same environment. Machine learning algorithms can then process the accumulated data and look for patterns across what would otherwise be siloed sources.
The problem is that without organization, cataloguing, and governance, the data lake quickly becomes a data swamp. Data accumulates without sufficient metadata to make it retrievable. Lineage information is missing. Ownership is unclear. Quality assessments are absent. The data is technically present but practically inaccessible to anyone trying to answer a specific business question.
Just as data warehousing required reference data, the data lake requires something analogous: standard names for products, markets, customer types, promotion categories, demographic classifications, and data conditions. It requires metadata that documents history, lineage, ownership, usage rights, source information, and quality. It requires business terminology that is consistent across systems and comprehensible to the analysts and decision-makers who need to work with the data.
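A minimal sketch of those two building blocks follows: controlled lists of standard names, and a catalog entry recording source, lineage, ownership, usage rights, and quality for a data set landing in the lake. The field names and values are illustrative, not a prescribed standard.

```python
# A minimal sketch of reference data and a metadata catalog entry for the lake.
# Names and values are hypothetical.
from dataclasses import dataclass, field

# Reference data: controlled lists of standard names
PRODUCT_CATEGORIES = {"outerwear", "footwear", "accessories"}
CUSTOMER_TYPES = {"household", "individual", "business"}

@dataclass
class CatalogEntry:
    dataset: str
    source_system: str
    owner: str                      # accountable steward
    lineage: list[str]              # upstream datasets or feeds
    usage_rights: str               # e.g. "internal analytics only"
    quality_score: float            # result of the latest quality assessment
    tags: set[str] = field(default_factory=set)

foot_traffic = CatalogEntry(
    dataset="store_foot_traffic_daily",
    source_system="door-sensor feed",
    owner="retail-analytics",
    lineage=["raw_sensor_events"],
    usage_rights="internal analytics only",
    quality_score=0.92,
    tags={"footwear"},
)

# Standard names make simple consistency checks possible:
print(foot_traffic.tags <= PRODUCT_CATEGORIES)   # True: tags use the controlled vocabulary
```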
Big data does not interpret itself. The patterns that matter are not self-evident in the raw data stream. Analysts have to know what they are looking for, or at least what categories of patterns are worth surfacing.
Consider an omnichannel retail scenario. The objective is to understand how weather, pedestrian traffic, and mobile behavior interact with in-store and online promotions across different customer segments. The inputs include sensor data measuring foot traffic in stores, clickstream data from the retailer's website, mobile data from third parties correlated with anonymized demographic attributes, and promotional campaign records. New data points arrive continuously. The volume is substantial and the variety is significant.
To extract value from that combination, analysts need to form hypotheses first. If the goal is to identify conversations about specific products in social data, the system needs a definition of those products and the many ways customers might refer to them, including misspellings and informal variants. If sentiment analysis is the goal, the system needs to know which product characteristics customers typically call out positively or negatively. If the goal is to correlate foot traffic with sales lift from a promotion, the sensor data parameters from different systems need to be reconciled into a common definitional framework.
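The first of those requirements can be sketched simply: a dictionary of product names and their informal variants and misspellings, used to find product mentions in social text. The products and variants below are hypothetical.

```python
# A minimal sketch of the "define the product and its variants first" step.
import re

PRODUCT_TERMS = {
    "trailrunner-2": ["trailrunner 2", "trail runner 2", "trailruner 2", "tr2"],
    "stormshell-jacket": ["stormshell", "storm shell jacket", "stormshel"],
}

def find_product_mentions(text: str) -> set[str]:
    """Return canonical product ids whose known variants appear in the text."""
    lowered = text.lower()
    mentions = set()
    for product, variants in PRODUCT_TERMS.items():
        if any(re.search(r"\b" + re.escape(v) + r"\b", lowered) for v in variants):
            mentions.add(product)
    return mentions

post = "Loving my new trail runner 2, but the stormshel zipper keeps sticking"
print(find_product_mentions(post))   # {'trailrunner-2', 'stormshell-jacket'}
```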
The questions that produce value are not generic. They are specific: What are the purchasing patterns of high-value customers on days with poor weather? How does promotional response vary by demographic segment when the offer is delivered via mobile versus email? What product combinations are most frequently associated with high-margin sales for a particular customer profile? The answers to those questions are in the data. Getting to them requires knowing what to ask.
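Once the sources share a common vocabulary, a question like the first one above reduces to a fairly ordinary query. The sketch below uses small hypothetical transaction and weather tables to show the shape of that query; it is an illustration of the question, not a production pipeline.

```python
# A minimal sketch: spend by high-value customers on poor-weather days,
# assuming hypothetical tables already mapped to the shared vocabulary.
import pandas as pd

transactions = pd.DataFrame({
    "date": ["2024-03-01", "2024-03-01", "2024-03-02", "2024-03-02"],
    "customer_segment": ["high_value", "standard", "high_value", "high_value"],
    "channel": ["online", "in_store", "in_store", "online"],
    "amount": [180.0, 40.0, 220.0, 95.0],
})
weather = pd.DataFrame({
    "date": ["2024-03-01", "2024-03-02"],
    "condition": ["heavy_rain", "clear"],
})

joined = transactions.merge(weather, on="date")
poor_weather = joined[(joined["condition"] == "heavy_rain")
                      & (joined["customer_segment"] == "high_value")]

# Spend by channel for high-value customers on poor-weather days
print(poor_weather.groupby("channel")["amount"].agg(["count", "sum"]))
```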
The common thread running through all of these challenges is the need for a knowledge architecture: a consistent framework of language, concepts, and terminology that gives the organization a shared basis for defining, organizing, and interpreting its data.
A knowledge architecture is not simply a data dictionary or a metadata schema, though it includes both. It is the contextual scaffolding that allows diverse data sources to be understood in relation to each other and in relation to the business processes they support. It defines how products, customers, markets, and events are named and classified across systems. It establishes the organizing principles that machine learning algorithms can use as a foundation for pattern detection rather than having to derive those principles from scratch with every new analysis.
Building and maintaining that architecture requires governance: defined ownership, clear policies for managing changes as the business evolves, and processes for ensuring that the terminology remains consistent and current across the systems that depend on it. It also requires collaboration between technical teams who understand the data infrastructure and business teams who understand what the data needs to mean in operational terms. Neither group can do this work alone.
Big data has real potential. The organizations that realize it are not necessarily those with the most data or the most sophisticated algorithms. They are the ones that have invested in understanding their data, organizing it around consistent business concepts, and building the governance processes that keep that organization coherent over time. The knowledge architecture is not the exciting part of a big data initiative. It is the part that determines whether the initiative delivers.
This article originally appeared on CMSWire and has been revised for Earley.com.