Ignore AI Fearmongering and Start by Improving Data Quality

Organizations should use LLMs and Gen AI to fix their data for AI-powered transformation. Here's how.

I recently attended a webinar delivered by a large AI technology vendor. One of the presenters had a slide implying that without AI, your business is dead. I find that a little bit alarmist.

Yes, it’s important to get the AI train out of the station, and it’s worth considering the various ways AI can help your business. That said, alarmist proclamations and urgent admonishments are just not helpful.

What do you do with the statement “AI or Die”? It’s not actionable. The speaker also said, “If you don’t do it fast enough you will die,” suggesting that organizations need to do this, need to do it now, and need to go full bore.

This is why you’re seeing such insanity as paying $1,000,000 salaries to AI specialists. The problem is not AI. The fundamental problem is data quality, availability, and structure.

A recent Harvard Business Review article by Tom Davenport and Priyanka Tiwari reported that 46% of executives recognize that they have a data quality issue. My take is that the other 54% are lying or in denial. Although over 90% acknowledged the importance of a data strategy, more than half had made no changes to their data, and only 11% strongly agreed that they had the right data foundation for Gen AI. In other words, most organizations have trouble admitting that they have core data quality and completeness issues, yet they are adopting technologies that require quality data.

Data is the Foundation for AI-Powered Personalization

Much of AI is centered around deriving and contextualizing knowledge and insights from various sources. Getting the right information to the right person at the right time has been the mantra of personalization, contextualization, and knowledge management for decades. Personalization is about information in context. Improving search results is about giving people the information they need in the context in which they need it — another example of personalization. It’s all about improving the user experience.

Data is the foundation for any advanced personalization and contextualization efforts — whether for an employee or a customer. Data is used to describe the user — who they are and what they need. Data is also what describes the users’ real-time digital body language based on interactions with the organization — click streams, search results, responses to marketing campaigns, interactions with customer service, and more.

Putting information in context requires that the source information be organized and structured. A big misconception is that Gen AI — using Retrieval Augmented Generation, or RAG — will solve the problem of hallucinations (conclusions not based upon real information). RAG does have the potential to do this, but only if the retrieval step returns the right information. This is the old problem of search and retrieval.

The Search and Retrieval Problem

What search and retrieval come back to is the need to

  1. Have the information needed to answer questions and
  2. Structure and tag that information for retrieval.

This latter condition is the one that gives organizations the biggest headache. 

When people say “make it like Google,” what do they mean? Generally, it’s the ability to search a vast number of documents, images, and other content and quickly return relevant results. I respond with “If you spent as much time and money optimizing content for Google as people who get top rankings, then, yes, it would be like Google.” That is not the answer that leadership wants to hear.

Using AI to Fix the IA (Information Architecture)

Fortunately, certain tools can be used to improve data quality, which in turn makes Gen AI function more effectively.

When content is ingested into a vector space (vectors are mathematical representations of text that are operated on by algorithms), the chunks can be parsed somewhat arbitrarily, rather than semantically (meaning the chunks do not necessarily contain specific answers to questions). Or, content can be broken into pieces that answer specific questions or contain specific procedures. The difference is in the “pre-processing” of the content. If the chunks are tagged with metadata, the LLM will be more effective at answering specific questions, because greater meaning has been built into the content.
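As a minimal sketch of this pre-processing idea (the heading-based splitting rule, field names, and taxonomy terms below are hypothetical illustrations, not any particular vendor's pipeline), chunking on semantic boundaries and tagging each chunk might look like:

```python
# Illustrative sketch: split content at semantic boundaries (here, headings)
# instead of arbitrary character counts, then tag each chunk with taxonomy
# terms. All field names and terms are hypothetical examples.

def chunk_by_heading(document: str) -> list:
    """Split a document at '## ' headings so each chunk covers one topic."""
    chunks = []
    current_title, current_lines = "Untitled", []
    for line in document.splitlines():
        if line.startswith("## "):  # a heading marks a semantic boundary
            if current_lines:
                chunks.append({"title": current_title,
                               "text": "\n".join(current_lines).strip()})
            current_title, current_lines = line[3:].strip(), []
        else:
            current_lines.append(line)
    if current_lines:
        chunks.append({"title": current_title,
                       "text": "\n".join(current_lines).strip()})
    return chunks

def tag_chunk(chunk: dict, taxonomy: dict) -> dict:
    """Attach taxonomy terms whose keywords appear in the chunk, so
    retrieval can filter on metadata rather than raw similarity alone."""
    tags = [term for term, keywords in taxonomy.items()
            if any(k in chunk["text"].lower() for k in keywords)]
    return {**chunk, "tags": tags}
```

With this kind of pre-processing, each vectorized chunk carries both a coherent topic and explicit metadata, which is what lets the LLM retrieve an answer-bearing passage instead of an arbitrary slice of text.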

To do this correctly, the algorithm needs to have access to two things:

  1. A content model (the metadata that describes the structure of a piece of information)
  2. Taxonomies and ontologies (the knowledge scaffolding of the organization)

These are contained in a “reference architecture” — the big picture organizing principles and the detailed way that content and data are described.
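A minimal sketch of what those two artifacts might look like in practice follows; the content type, field names, and vocabulary terms are invented for illustration:

```python
# Hypothetical sketch of the two inputs a tagging algorithm needs.
# The content type, field names, and terms are illustrative, not a standard.

# 1. A content model: the metadata structure for one content type.
ARTICLE_CONTENT_MODEL = {
    "content_type": "support-article",
    "required_fields": ["title", "product", "audience", "task"],
    "optional_fields": ["region", "version"],
}

# 2. Taxonomies: the controlled vocabularies those fields draw from.
TAXONOMIES = {
    "product": ["laptop", "tablet", "phone"],
    "audience": ["customer", "field-technician", "call-center-agent"],
}

def validate_metadata(metadata: dict, model: dict, taxonomies: dict) -> list:
    """Return a list of problems; an empty list means the record conforms
    to the content model and uses only sanctioned taxonomy terms."""
    problems = [f"missing field: {f}" for f in model["required_fields"]
                if f not in metadata]
    for field, vocabulary in taxonomies.items():
        if field in metadata and metadata[field] not in vocabulary:
            problems.append(f"unknown {field} term: {metadata[field]}")
    return problems
```

The point of the validation step is governance: content that fails it gets fixed before ingestion, so downstream retrieval never has to guess what a field means.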

Here is a diagram of an example reference information architecture:

Ontology: The Knowledge and Data Scaffolding of the Enterprise

This figure illustrates the use of an ontology as the source of all the organizing principles used to power enterprise applications. An ontology consists of multiple taxonomies across departments and applications and the relationships between those taxonomies – relating a “product” taxonomy to a “services” taxonomy through the relationship “services for product” for example. Think of it as the knowledge and data scaffolding of the enterprise. 

For example, a CRM will include “customer types,” “content types,” “product categories,” “campaign types,” and more. An eCommerce system will require a source of consistent product categories and classes as well as the various descriptors that define product characteristics and attributes. Internal employee-facing tools require the same descriptors that are consistent with other tools and technologies in the enterprise (consistent product information and content from a knowledge base, for example) to provide lower friction information access. The organization cannot have acts of heroics upstream and expect a seamless customer experience; information must be consistent across multiple departments and functions so that employees can get the information they need to serve customers and the organization most effectively. 

Documenting a Reference Information Architecture

Documenting and optimizing a reference enterprise information architecture provides AI tools with the business concepts and entities that are important to the organization. It helps the AI determine what terms are used so that the correct information can be retrieved for customers and employees. (See “There’s No AI Without IA”.)

The reference architecture can be ingested into the LLM along with data that needs to be enriched or corrected. Other sources of information are also brought into the LLM that form additional signals for the vector representation, and a “templated prompt” is used to ask the LLM to “fill in the blanks” around needed data.
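A templated prompt of this kind might be assembled as follows. This is a hedged sketch: the template wording, field names, and vocabularies are hypothetical, and the resulting string is simply what would be sent to whatever LLM API is in use.

```python
# Hypothetical sketch of a "templated prompt" that asks an LLM to fill in
# missing fields, constrained by the reference architecture. The template
# wording, field names, and vocabularies are illustrative.

PROMPT_TEMPLATE = """You are enriching product data.
Use ONLY terms from these controlled vocabularies:
  categories: {categories}
  attributes: {attributes}

Given this incomplete record:
{record}

Fill in the blanks for: {missing_fields}.
Return the completed record as JSON."""

def build_prompt(record: dict, reference: dict) -> str:
    """Assemble the templated prompt from a record plus the reference
    architecture; constraining the LLM to sanctioned terms is what keeps
    the enriched data consistent with the rest of the enterprise."""
    missing = [f for f in reference["required_fields"] if not record.get(f)]
    return PROMPT_TEMPLATE.format(
        categories=", ".join(reference["categories"]),
        attributes=", ".join(reference["attributes"]),
        record=record,
        missing_fields=", ".join(missing),
    )
```

The key design choice is that the prompt carries the controlled vocabularies along with the record, so the model fills in blanks with terms the organization already governs rather than free-form text.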

The following diagram represents the various signals that can help the LLM fix corporate data (in this case, product descriptions and attributes).

Here is an example of before and after for a single product record. This example shows how product data can be cleansed and improved using an LLM and the correct reference information architecture. The revised version is much more useful to users than the original, which relied on product codes that were not meaningful to users and lacked a detailed product description. The same approach can be used for other types of information, including customer data and knowledge for a support organization.
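To make the before/after pattern concrete, here is a purely hypothetical record pair (the SKU, codes, and text are invented for illustration and are not from the article's figure):

```python
# Purely hypothetical illustration of the before/after pattern described
# above; the SKU, codes, and descriptions are invented examples.

BEFORE = {
    "sku": "HX-4411-B",
    "name": "HX-4411-B BLK 12PK",          # cryptic internal code
    "category": "MISC",
    "description": "",                      # no usable description
}

AFTER = {
    "sku": "HX-4411-B",                     # stable identifier preserved
    "name": "Hex Bolt, Black Oxide, 12-Pack",
    "category": "fasteners > bolts > hex-bolts",   # taxonomy path
    "attributes": {"finish": "black-oxide", "pack_size": 12},
    "description": ("Corrosion-resistant hex bolts with a black-oxide "
                    "finish, sold in packs of 12 for general assembly use."),
}
```

The enriched record keeps the original identifier but replaces opaque codes with taxonomy-governed categories, structured attributes, and a human-readable description that search, recommendations, and conversational interfaces can all use.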

With this improved data, Gen AI can provide many advanced capabilities such as a personalized experience, more appropriate recommendations, conversational commerce, etc.

Without good quality, well-structured data, no technology initiative will work to its potential. This is especially true of Generative AI and LLM-based tools. 

Yes, AI is critical. Yes, it needs to be integrated as a core capability. Yes, it can improve customer and employee experience. But don’t waste time worrying about “AI or die.” Take positive action, beginning with improving the data that will power these powerful algorithms.


This article was featured in Customer Think. https://customerthink.com/ignore-ai-fearmongering-and-start-by-improving-data-quality/

Meet the Author
Seth Earley

Seth Earley is the Founder & CEO of Earley Information Science and the author of the award-winning book The AI-Powered Enterprise: Harness the Power of Ontologies to Make Your Business Smarter, Faster, and More Profitable. He is an expert with more than 20 years of experience in knowledge strategy, data and information architecture, search-based applications, and information findability solutions. He has worked with a diverse roster of Fortune 1000 companies, helping them achieve higher levels of operating performance.