Digital Transformation, Corporate Data and Gen AI: LLMs and the Challenge of Retrieval

A major goal of any digital transformation is to digitize processes, or improve already-digitized ones, to speed information flows, and gen AI has the potential to accelerate that work significantly. The generative AI hype continues full-on in the marketplace, with new vendors entering seemingly by the minute. What many customers are now realizing, however, is that generative AI is not going to produce value for them simply by answering questions from the knowledge model inherent in the technologies that power LLMs. Those technologies contain models of the world based on relationships among terms and concepts learned through the ingestion of internet content. The understanding contained in language models is getting better all the time. Still, they are trained on public content, which is not sufficient for AI applications involving proprietary products, services, and solutions.

The Nature of Training Data

I was reading an article on LinkedIn recently by a chief data officer extolling the virtues of quality “training data.” What exactly is “training data”? Is it the data that you use to build a model — the foundation model as it’s called? Or is it the data that’s used to describe the terminology of a specific industry to create vertical-focused models for finance or life sciences, for example?

Training data can be either, but it is much more than that. From a practical viewpoint, training data often refers to the data used to ground answers so that questions are answered in a factually correct manner. That data must therefore be accurate and complete, which will not be the case if all the training is on public data.

Proprietary products may have significant amounts of customer-facing support data publicly available, but deeper engineering details, advanced configurations, solution architectures, and troubleshooting methods are frequently part of confidential IP and therefore are not publicly available. Sensitive, competitive, and differentiating information is part of the organization's core intellectual property, and making it publicly available creates a competitive risk.

Overcoming Hallucinations

Everyone following generative AI is familiar by now with the problem of hallucinations. This occurs when a large language model (LLM) generates an answer that sounds reasonable but is not factually correct. I learned this early on when I asked ChatGPT about my background. It produced lots of interesting accolades and credits, most of which were not true, but all of which sounded reasonable. Anyone with my background could potentially have those characteristics on their resume. When in doubt, generative AI doesn’t admit ignorance; it makes up what it thinks is a reasonable answer.

The best approach to overcoming the hazards of hallucinations is retrieval augmented generation (RAG), which builds on an LLM-based foundation model and enhances it with carefully curated and structured information. This does not change the model itself; instead of relying on the knowledge representation inside the model, the curated source supplies the material from which answers are drawn. That information may come from the organization's knowledge base or from an authoritative external source. If the model is instructed to answer only from the designated source, and to respond "I don't know based on this data source" when the answer is not there, the likelihood of hallucination drops sharply: the model is given only information that leads to a valid answer, or it admits that it has none.
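
To make the pattern concrete, here is a minimal sketch of that grounded-prompt approach. The names (call_llm, grounded_answer) are illustrative, not from any particular library, and the keyword retriever is a stand-in for the vector search a production system would use.

```python
def call_llm(prompt: str) -> str:
    """Hypothetical stand-in: wire this to your model provider's API."""
    raise NotImplementedError("connect to your LLM provider here")

def retrieve(query: str, knowledge_base: list[str], top_k: int = 3) -> list[str]:
    """Naive keyword-overlap retrieval. A production system would use
    vector search over a well-modeled, indexed corpus."""
    terms = set(query.lower().split())
    ranked = sorted(
        knowledge_base,
        key=lambda doc: len(terms & set(doc.lower().split())),
        reverse=True,
    )
    return ranked[:top_k]

def grounded_answer(query: str, knowledge_base: list[str]) -> str:
    """Constrain the model to the retrieved passages, with an explicit
    'I don't know' instruction to curb hallucination."""
    passages = retrieve(query, knowledge_base)
    prompt = (
        "Answer the question using ONLY the passages below. If the answer "
        "is not in the passages, reply exactly: "
        "\"I don't know based on this data source.\"\n\n"
        + "\n---\n".join(passages)
        + f"\n\nQuestion: {query}"
    )
    return call_llm(prompt)
```

The instruction in the prompt is doing the real work: the model is told where its answer may come from and what to say when the source falls short.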

The Snake Eating its Tail

Because so much content on the internet right now is being created by generative AI models, and those generative AI models are in turn feeding on the generated content, there is increasing concern that one day we will see what’s referred to as “model collapse,”[1] in which the quality degrades over time because all of the information is from a self-cycling source, kind of like the snake eating its tail. The way to avert this outcome is to keep injecting information that enables current, accurate responses.

The real value of generative AI lies both in its ability to answer questions and in its ability to make interactions more understandable and conversational for humans. This is truly an astounding feat, especially in the case of complex questions and answers. But note the "retrieval" in "retrieval augmented generation." Sophisticated answers are only possible when the information is properly structured and findable.

Building a House Requires an Architect (or at least a Design)

I frequently use the metaphor of building a house when I talk about the need to design an intentional information architecture. You don’t start with digging holes and pouring concrete. You begin with a design, an architecture for the project. In the information world, we do the same. Without pushing the metaphor too far, let’s assume we have multiple types of plans for a house (foundation, plumbing, HVAC, electrical, etc.). These plans lay out how the different components of the building will function and how they will interact with each other.

For the information world, one set of plans is a content model in which we define what I call the “is-ness” and “about-ness” of data and content. What is this thing? It’s a contract, for example. Well, if I have a thousand contracts in a pile, how do I tell them apart? It could be a labor contract, a work order, a consulting contract, an employment contract, a statement of work, a real estate contract, or a loan contract. That’s what I refer to as “is-ness.”

If there are a hundred employment contracts in that pile of a thousand contracts, how can they be distinguished in a way that makes it possible to search for the exact one we want? The employment contracts have to be identified by various attributes such as the job role, the name of the applicant or employee, the department, and so on.

These attributes constitute the “about-ness” of each contract. Now, this is important because we want our generative AI models to retrieve precise and specific corporate knowledge, not generic knowledge. All of those things are differentiators of that content and go into the content model. For support content, we need information such as the model of the device, configuration settings, etc. Otherwise, a support rep (whether human or AI) cannot locate what is needed to solve the problem.
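
As a purely illustrative sketch, the is-ness/about-ness distinction maps naturally onto a typed record: the document type captures what the thing is, and the attribute fields capture what it is about. The field names below are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class ContentRecord:
    doc_id: str
    doc_type: str  # "is-ness": what kind of thing is this?
    attributes: dict[str, str] = field(default_factory=dict)  # "about-ness"

# One of the hundred employment contracts in the pile of a thousand:
employment_contract = ContentRecord(
    doc_id="HR-2024-0417",  # hypothetical identifier
    doc_type="employment_contract",
    attributes={
        "job_role": "Solutions Architect",
        "employee_name": "J. Rivera",
        "department": "Professional Services",
    },
)
```

The same shape works for support content, where the type might be a troubleshooting guide and the attributes would carry the device model and configuration settings.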

Back to the Challenges of Search

Search results are only as good as the content, the structure, the data, and the architecture. Without those elements, we can't retrieve the correct information in context. It's the old problem of search and retrieval, and it remains the problem we're trying to solve even as we adopt advanced, AI-powered tools. It's about retrieving information. It's about getting answers. It's about personalizing the results.

Yes, it can be about generating new copy using generative AI, but it's also about retrieving something that's based on your brand, your brand voice, your strategy, your product differentiators, and so on. The point here is that without defining the nature of the content, retrieval will still not work, despite the advances of modular RAG.
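
To make that point concrete, here is a sketch of retrieval that honors the content model, reusing the hypothetical ContentRecord from earlier: the "is-ness" and "about-ness" attributes act as a metadata filter before any semantic ranking, which is how a RAG pipeline typically narrows its candidate set.

```python
def filtered_search(records: list[ContentRecord], doc_type: str,
                    **attribute_filters: str) -> list[ContentRecord]:
    """Narrow by type ("is-ness") and attributes ("about-ness") first;
    semantic or vector ranking then runs over a much smaller set."""
    return [
        r for r in records
        if r.doc_type == doc_type
        and all(r.attributes.get(k) == v
                for k, v in attribute_filters.items())
    ]

# Find the specific employment contract among a thousand documents:
all_contracts = [employment_contract]  # from the sketch above; normally the full corpus
matches = filtered_search(
    all_contracts,
    doc_type="employment_contract",
    department="Professional Services",
    job_role="Solutions Architect",
)
```

Without the content model, there is nothing to filter on, and no amount of generative sophistication downstream can compensate.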

In construction, there should be a balance between standards and differentiation. When you're building a house, you need to abide by certain standard codes. You must use piping, fixtures, plumbing, and electrical components that comply with interoperability standards. We all use standard voltages for appliances. The differentiation comes in the design of the layout, the fixtures, the rooms, and other details. Standardization brings efficiency, and differentiation provides a competitive advantage.

Standardization and Differentiation

In the case of information design, what should be standard and what should we be differentiating on? Should the foundation LLM be used as a differentiator? I would say no. Organizations differentiate based on knowledge — knowledge about customer needs, manufacturing processes, routes to market, messaging, and many other details about how they run their operations and interact with the marketplace. They should differentiate based on their processes, content, and data, readying them for retrieval and personalized, role-based, in-context access. That is the unique knowledge that can be surfaced by an AI that uses RAG. But organizations can standardize on other things — they don't need to invent or maintain their own document standards, APIs, internet protocols, or, I would argue, foundation models.

Generative AI will enable new tools and approaches that will improve the fundamental quality of the data. With that higher-quality data, a huge number of applications will be possible, from hyper-personalized customer and employee experiences to agents that will perform research and execute high-level instructions to achieve the goals and objectives of the user.

The future of LLMs and generative AI is scary to some and promising to others. But it will all build on the things that customers value and that make organizations and brands who they are — the fundamental knowledge that differentiates their operations internally and differentiates their presence in the marketplace.

[1] https://techcrunch.com/2024/07/24/model-collapse-scientists-warn-against-letting-ai-eat-its-own-tail/

This article was featured in Customer Think. https://customerthink.com/digital-transformation-corporate-data-and-gen-ai-llms-and-the-challenge-of-retrieval/

Meet the Author
Seth Earley

Seth Earley is the Founder & CEO of Earley Information Science and the author of the award-winning book The AI-Powered Enterprise: Harness the Power of Ontologies to Make Your Business Smarter, Faster, and More Profitable. He is an expert with 20+ years of experience in Knowledge Strategy, Data and Information Architecture, Search-based Applications, and Information Findability solutions. He has worked with a diverse roster of Fortune 1000 companies, helping them to achieve higher levels of operating performance.