Meaning to Data: Knowledge Graphs, Vector Databases, and Ontologies

There is No Free Lunch – The Perennial Problem of Information Curation 

Over the past few decades, organizations have been struggling with capturing, managing, and accessing unstructured data – the knowledge artefacts that embody how the organization serves customers and solves problems. This information can take the form of highly technical solutions, specifications, and manufacturing procedures, but it also includes go-to-market strategies and tactics, human resources policies, and everything else that is written down. While structured and transactional data has been managed and curated more effectively, many challenges remain around data silos, inconsistent data models, varying semantics and definitions, and how analytics are managed and operationalized. Each generation of new technology (knowledge portals, semantic search, data warehouses, data lakes, graph data, knowledge graphs, and now large language models (LLMs)) promises to solve the problem. But none of these approaches can fix what is fundamentally flawed – the data hygiene and content curation processes that have been perennially underfunded and under-resourced.

The Forces of Entropy 

Building on this messy world of poorly managed data and information is difficult, no matter what the application. Every organization has challenges with data quality at some level, although some are in better shape than others. Usually enough data architecture and content/data curation and cleansing is done to launch the system, but without ongoing curation and management processes to measure and remediate the data issues that naturally crop up over time, the new system will gradually succumb to the forces of entropy and become less and less useful. A new content or knowledge system starts out great – a nice, clean environment – but without the correct processes and measures, it gets messy over time.

Unfortunately, without sufficient funding to solve these challenges at their source, the IT organization is left with little choice but to build upon inconsistent, poor-quality, and/or missing data. A new project often catalyzes a cleanup effort, but without truly addressing the information governance gaps, entropy defeats the cleanup over time.

As a result, customer support organizations carry on with poorly curated knowledge, leading to higher costs and lower customer satisfaction, and sales organizations struggle to locate the most effective collateral. Because of variations in definitions and semantics, analytics teams produce the same analyses repeatedly rather than cataloging what they already have.

Enter AI  

Add to this the rapid advances in AI – specifically generative AI and LLMs – and the need to develop competitive advantage and not be left behind, and the urgency to do something increases.

In some cases, organizations are attempting to train models with their own content, but training a model from scratch can be costly, difficult, and prone to unexpected outcomes. The underlying problem lies in the quality of training content and data. What exactly is training data? Well-curated and well-structured content and data assets containing the correct descriptors (metadata) – in many cases, the very things whose absence people are compensating for when deploying LLMs.

There is a misconception that LLMs alone will deal with poor-quality and missing data. That is only partly true. An LLM can be used to improve product data, for example, but it requires the correct context to do so – product names, categories, and attributes, along with related content and knowledge artefacts.

What is RAG?  

Retrieval Augmented Generation (RAG) improves LLM performance by providing a source of truth for the model. When you ask an LLM a question (in ChatGPT or another LLM-based tool such as Gemini, Perplexity, Mixtral, etc.), the mechanism it uses is based on a representation of the world created by ingesting enormous amounts of content from the internet. This content is processed by deep neural networks and a variety of mechanisms for statistically predicting which words are most likely to relate to the query. That is an overly simplistic explanation; however, at the core, ChatGPT and other LLM-based tools are prediction mechanisms. Their ability to provide human-sounding answers is due to mathematical operations that treat text as a series of numeric values and then iteratively operate on those values to produce an answer.
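To make the “prediction mechanism” idea concrete, here is a toy sketch in Python – emphatically not how any production LLM is implemented – of turning raw next-word scores into probabilities and picking the statistically likeliest continuation. The words and scores are invented:

    import math

    # Toy scores a model might assign to candidate next words
    # after the prompt "The cat sat on the ..."
    logits = {"mat": 4.2, "roof": 2.8, "moon": 0.5}

    # Softmax converts raw scores into a probability distribution
    total = sum(math.exp(v) for v in logits.values())
    probs = {word: math.exp(v) / total for word, v in logits.items()}

    print(max(probs, key=probs.get))  # "mat" -- the likeliest continuation

A real model repeats a step like this token by token, over a vocabulary of tens of thousands of tokens, which is how fluent-sounding text emerges from pure prediction.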

However, if the answer is not contained in the LLM’s understanding of the world, it will still provide a response – one that sounds reasonable or is statistically likely – but is potentially factually incorrect. These are the so-called “hallucinations” – answers that do not have a basis in fact.

Some people refer to LLMs as “stochastic parrots”:

From ChatGPT with reference to https://dl.acm.org/doi/pdf/10.1145/3442188.3445922:  

Stochastic: Stochastic refers to the probabilistic methods these models use to generate text. They don't understand language in a human-like way but predict the next word in a sequence based on statistical patterns learned from vast amounts of text data. 

Parrot-like behavior: Like parrots, LLMs can repeat or mimic human language without understanding the meaning behind it. They can produce coherent and contextually appropriate responses but lack true comprehension or intentionality.

The LLM has a representation of the world that does not necessarily contain corporate information (unless that information has been made publicly available). When a question is asked (a prompt), it is ingested into a vector space. A vector is a mathematical representation of text; ingesting text and converting it into a series of numbers in the vector space is referred to as embedding. Vectors capture the nuances of language by modeling “features” of the content. Features are essentially metadata – the “about-ness” of a piece of information – and they represent multiple dimensions of information. Vector representations can have hundreds, thousands, or even tens of thousands of dimensions. It is difficult to think in more than three dimensions (four if we add time), but mathematically it is feasible.
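As a loose illustration of what “embedding” means – and only that; real embedding models are learned neural networks, not word counts – here is a toy function that maps text to a fixed-length vector of numbers:

    def toy_embed(text, dims=8):
        """Map text to a fixed-length list of numbers (a 'vector').
        Real models learn hundreds or thousands of dimensions."""
        vec = [0.0] * dims
        for word in text.lower().split():
            vec[hash(word) % dims] += 1.0  # each word nudges one dimension
        return vec

    print(toy_embed("reset my router password"))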

The greater the number of dimensions, the more complex the text the LLM can process, and the greater the nuance captured in the data. In vector similarity search, the prompt vector is compared to other terms and phrases in the vector space, and the vectors that are closest in “n-dimensional space” form the basis for the output. The more additional signals (essentially metadata) we provide – customer segment, industry, interests, behaviors, preferences, and more – the closer the model can get to the right conceptual “location” in multi-dimensional vector space.
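“Closest in n-dimensional space” is typically measured with cosine similarity. A minimal sketch, with made-up three-dimensional vectors standing in for real embeddings:

    import math

    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        return dot / norm

    # Pretend these tiny vectors came from a real embedding model
    docs = {
        "How to reset a router": [0.9, 0.1, 0.0],
        "Quarterly sales report": [0.1, 0.8, 0.3],
        "Router firmware update": [0.8, 0.2, 0.1],
    }
    query = [0.85, 0.15, 0.05]  # embedding of "my router won't restart"

    print(max(docs, key=lambda d: cosine(query, docs[d])))  # nearest document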

Think of the GPS in your car – it provides directions to a specific geographical location. If you want it to take you to a location with certain characteristics (say, a moderately priced Italian restaurant with good reviews), those additional clues will help the GPS guide you to the correct destination – the additional user preference signals give the GPS more context. It is, essentially, navigating in “n-dimensional” space – each characteristic (price, cuisine, rating, etc.) provides an additional dimension for the query. The same thing happens in the vector database – the more details we provide (through more specific prompts or through customer behaviors, preferences, configurations, prior requests, etc.), the closer we get to the correct output.
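The GPS analogy maps directly onto how vector databases commonly combine similarity with metadata filters – each field narrows the candidate set before (or alongside) the distance calculation. A schematic sketch; the records and field names are invented for illustration:

    restaurants = [
        {"name": "Trattoria Roma", "cuisine": "Italian", "price": "$$", "rating": 4.5},
        {"name": "Le Bistro", "cuisine": "French", "price": "$$$", "rating": 4.7},
        {"name": "Pasta Palace", "cuisine": "Italian", "price": "$", "rating": 3.9},
    ]

    # Each metadata field is one more dimension narrowing the search
    matches = [r for r in restaurants
               if r["cuisine"] == "Italian"
               and r["price"] == "$$"
               and r["rating"] >= 4.0]

    print(matches[0]["name"])  # Trattoria Roma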

Providing Ground Truth  

One way of reducing incorrect answers is to provide the LLM with a source of ground truth – the knowledge and facts from a support knowledge base, for example – so that customers or customer support reps get the correct information. The LLM retrieves the information from that source rather than from its own knowledge of the world represented in the vector space of the model. There are many ways to do this. One is to simply query a knowledge source using standard full-text or faceted search. The challenge is the same one that search has always had – missing or poor-quality content that is not well tagged or structured to return specific answers to questions.
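In skeletal form, that retrieve-then-generate step might look like the following sketch, where a crude keyword-overlap function and two invented articles stand in for a real search engine and knowledge base:

    KNOWLEDGE_BASE = {
        "kb-101": "To reset the router, hold the reset button for 10 seconds.",
        "kb-102": "Warranty claims must be filed within 30 days of purchase.",
    }

    def retrieve(question):
        # Crude full-text retrieval: pick the article with the most word overlap
        q_words = set(question.lower().split())
        return max(KNOWLEDGE_BASE.values(),
                   key=lambda text: len(q_words & set(text.lower().split())))

    def build_prompt(question):
        return (f"Answer ONLY from this source:\n{retrieve(question)}\n\n"
                f"Question: {question}")

    print(build_prompt("How do I reset my router?"))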

Another approach is to ingest the content into the vector space. Content needs to be broken up for ingestion, and if those components are well tagged and divided into semantically meaningful chunks (an answer to a specific question, for example), the results will be more accurate. Instructing the LLM to answer only from the knowledge source, and to answer “I don’t know” when the information needed to answer the question is missing, will significantly reduce, if not eliminate, hallucinations.
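A minimal sketch of both ideas – chunking on semantically meaningful boundaries and constraining the model through instructions. The article text and the instruction wording are illustrative, not canonical:

    # Each Q/A pair is a self-contained, semantically meaningful chunk
    article = ("Q: How do I reset the router? "
               "A: Hold the reset button for 10 seconds.\n"
               "Q: How do I update the firmware? "
               "A: Download the update from the support portal.")

    chunks = [line for line in article.split("\n") if line]

    SYSTEM_INSTRUCTION = ("Answer only from the supplied chunks. "
                          "If they do not contain the answer, say 'I don't know.'")

    print(SYSTEM_INSTRUCTION)
    for i, chunk in enumerate(chunks):
        print(i, chunk)  # each chunk would then be embedded and indexed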

The Value of Metadata 

Applying metadata is very important because metadata forms additional signals that provide nuance and context for content. In fact, in one study, my company found that with metadata-enriched embeddings, an LLM was able to answer questions from a knowledge source with up to 83% accuracy, compared with just 53% without them. The difference was the “knowledge architecture” – the content models and metadata used to structure the content. This study was based on a gold-standard set of 60 use cases with known correct answers against which the models were tested.
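One common pattern for metadata-enriched embeddings – a generic sketch, not the specific method used in the study – is to fold the metadata into the text before it is embedded, so the resulting vector carries the “about-ness” along with the words. The field names and values here are hypothetical:

    chunk = "Hold the reset button for 10 seconds."
    metadata = {"product": "Model X Router",
                "doc_type": "troubleshooting",
                "topic": "factory reset"}

    # Prepend the metadata so it becomes part of what gets embedded
    enriched = " | ".join(f"{k}: {v}" for k, v in metadata.items()) + " | " + chunk

    # vector = embed(enriched)   # embed() stands in for any embedding model
    print(enriched)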

Ontology as Scaffolding, Knowledge Graph as Reference Data 

An ontology serves as a reference source for the metadata and controlled vocabularies of a content architecture. The ontology describes a domain of information by considering the “big picture” organizing principles representing what is important to the organization. Different domains will have different “buckets” or categories making up the “domain model.” For example, the domain model for an insurance company will have products, services, content types, customer types, risks, operational regions, and so on. A pharmaceutical company will have biochemical pathways, generic drugs, branded drugs, chemical names, diseases, indications, treatments, symptoms, drug targets, and mechanisms of action. An industrial manufacturer will have product types, product attributes, industries, customer types, processes, environments, and other entities. When the various vocabularies (the terms that populate each entity in the domain model – the list of product categories, for example) are created, the result is an enterprise taxonomy. By describing relationships between taxonomies (indications for a disease, or risks in a region), we can build an ontology, which consists of all the vocabularies in the domain and the relationships among them.
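In data terms, the progression is straightforward: controlled vocabularies populate the entities of the domain model, and typed relationships connect their terms. A toy fragment for the insurance example, with invented terms:

    # Taxonomies: controlled vocabularies for each entity in the domain model
    taxonomy = {
        "Product": ["Auto Policy", "Home Policy", "Flood Policy"],
        "Region": ["Northeast", "Gulf Coast"],
        "Risk": ["Flood", "Theft", "Hurricane"],
    }

    # Ontology: the taxonomies plus the relationships among their terms
    relationships = [
        ("Hurricane", "is_risk_in", "Gulf Coast"),
        ("Flood Policy", "covers", "Flood"),
        ("Home Policy", "covers", "Theft"),
    ]

    for s, p, o in relationships:
        print(f"{s} {p.replace('_', ' ')} {o}")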

The ontology forms the knowledge scaffolding of the enterprise.  Using that scaffolding to access and organize data and content provides a knowledge graph. The knowledge graph becomes a reference for content models, tagging of information and the content and data itself.  When integrated with an LLM, the knowledge graph becomes the ground truth for the LLM.  It is an access point for reference information and the “source of truth” for retrieval using an LLM.    
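Retrieval against a knowledge graph can then be as simple as pulling every fact that mentions an entity and handing those facts to the LLM as context. A minimal sketch with hypothetical product facts:

    # Knowledge-graph facts as (subject, predicate, object) triples
    graph = [
        ("Model X Router", "has_accessory", "Wall Mount Kit"),
        ("Model X Router", "superseded_by", "Model Y Router"),
        ("Model Y Router", "supports", "Wi-Fi 7"),
    ]

    def ground(entity):
        facts = [f"{s} {p.replace('_', ' ')} {o}"
                 for s, p, o in graph if entity in (s, o)]
        return "Known facts:\n" + "\n".join(facts)

    # This grounding text would be prepended to the prompt sent to the LLM
    print(ground("Model X Router"))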

Improving Data Quality with RAG 

LLMs can also be used to improve data quality and data fill once the reference ontology is designed. Many sources of information can be normalized and contextualized using the ontology, including knowledge articles, specification sheets, web pages, troubleshooting guides, industry standards, style guides, user profiles, and user behaviors – the telemetry from clickstreams, searches, campaign responses, call center history, and more.

The mechanism for fixing the data uses a form of RAG where the prompt is unenriched data and the result is enriched data. We use an approach called “modular RAG,” which uses multiple state-of-the-art algorithms to process data and programmatically generate data enrichments.

The modular RAG approach uses ontologies as a reference point for controlled vocabularies and for relationships between products (accessories, related products, solution kits, and more), as well as for the attributes (specifications or additional elements) that describe product details and applications.
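Schematically, one such enrichment step might look like the following sketch, where the ontology’s controlled vocabulary acts as a guardrail on whatever the model proposes. The llm() stub, the vocabulary, and the product record are all invented for illustration:

    CATEGORY_VOCAB = {"Routers", "Switches", "Access Points"}  # from the ontology

    def llm(prompt):
        return "Routers"  # stub; a real call to a language model would go here

    def enrich(record):
        proposed = llm(f"Assign one category to this product: {record['name']}")
        # Accept the enrichment only if it is a term from the controlled vocabulary
        if proposed in CATEGORY_VOCAB:
            record["category"] = proposed
        return record

    print(enrich({"name": "MX-400 wireless router", "category": None}))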

Conclusion 

Industry experts have long advised organizations that AI runs on data.  Organizations are becoming more aware of the foundational requirements of AI applications from a data quality, architecture and governance perspective.  Disappointment resulting from AI project failures is leading to a re-evaluation of approaches and a reassessment of expectations about what AI can and cannot do.  The beginning point is an ontology that forms the “knowledge scaffolding” for the organization – the organizing principles and vocabularies used to tag content and structure data – and the ways that entities relate to one another.    

Building that reference is a starting point for making sense of and contextualizing information. The goal is to make information more contextually relevant – whether that is customer-facing content to assist in product selection, related products on an ecommerce site, troubleshooting information for field support, or strategic plans for a senior executive. Getting the right information to the right person is, at its core, a problem of context. With LLMs and generative AI, we have a greater ability to ingest signals and use “digital body language” to provide that in-context information. But the starting point is good data – or at the very least a good data architecture that can be leveraged to produce good data – to drive AI-powered applications.

A major goal of any digital transformation is to digitize processes, or improve already digitized processes, to speed information flows, and generative AI has the potential to do so significantly. However, the generative AI hype continues full-on in the marketplace, and new vendors enter seemingly by the minute. What many customers are now realizing is that generative AI will not produce value for them by answering questions using only the knowledge model inherent in the technologies that power LLMs. Those technologies contain models of the world based on relationships among terms and concepts learned through the ingestion of internet content. The understanding contained in language models is getting better all the time, but they are trained on public content, which is not effective for AI applications in the context of proprietary products, services, and solutions.

This article was featured in The Enterprise AI World Sourcebook. https://www.enterpriseaiworld.com/Articles/Editorial/Features/The-Enterprise-AI-Sourcebook-is-Here-165511.aspx

 

Meet the Author
Seth Earley

Seth Earley is the Founder & CEO of Earley Information Science and the author of the award-winning book The AI-Powered Enterprise: Harness the Power of Ontologies to Make Your Business Smarter, Faster, and More Profitable. He is an expert with more than 20 years of experience in Knowledge Strategy, Data and Information Architecture, Search-based Applications, and Information Findability solutions. He has worked with a diverse roster of Fortune 1000 companies, helping them achieve higher levels of operating performance.