Expert Insights | Earley Information Science

Semantic Tagging and Data Mining to Improve SEO

Written by Seth Earley | May 17, 2016 4:00:00 AM

What do we mean by “semantic tagging”?  Isn’t that just tagging? The answer is yes and no.

Semantics is the study of meaning. Tagging informs a user (or another computer) about the meaning and nature of a piece of content, or even a piece of data. It provides context. What is this object? It is a product page. What is it about? It is about a servo motor, or a bicycle, or a Cuisinart. When we tag content, we are telling users what the content is so that they can locate it. We are also telling search engines what the content is, so that people can find it when they type what they are looking for into the search box of Google or Bing. Are those the same thing? They are closely related but not the same.

Does Product Data Boost Organic Search?

Google and other web search engines don’t really care that much about your product data and they certainly don’t care about keywords. Google is looking for richness of content, incoming links and a host of other “signals” that tell the engine what content is most valuable. Increasingly, product pages are composed of structured data rather than unstructured text. This is done so that Google can render “rich snippets” and “product cards”: rather than just showing a title and description, the search result contains a variety of details that make it more likely that a user will click through to the page.
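As a rough illustration of what structured product data can look like, the sketch below builds a schema.org Product description and serializes it as JSON-LD, one of the syntaxes search engines accept for this kind of markup. The article does not prescribe a syntax, so JSON-LD is an assumption here, and the product name, brand, price and URL are all hypothetical.

```python
import json


def product_jsonld(name, brand, price, currency, url):
    """Build a schema.org Product description as a JSON-LD string."""
    data = {
        "@context": "https://schema.org",
        "@type": "Product",
        "name": name,
        "brand": {"@type": "Brand", "name": brand},
        "url": url,
        "offers": {
            "@type": "Offer",
            "price": f"{price:.2f}",
            "priceCurrency": currency,
            "availability": "https://schema.org/InStock",
        },
    }
    return json.dumps(data, indent=2)


# Hypothetical product; the values are placeholders, not real catalog data.
snippet = product_jsonld(
    "Example servo motor",
    "ExampleBrand",
    129.0,
    "USD",
    "https://example.com/servo-motor",
)
print(snippet)
```

The resulting string would typically be placed in a script element of type application/ld+json on the product page.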

The question is whether this content, which is structured with so-called “micro tags,” will rank better than untagged content. In fact, this topic brought about spirited debate with my colleagues. There is almost no direct evidence or documentation that Google’s algorithms boost content that is semantically tagged. One site that claims to list “all 200 signals” used by Google suggests that pages that support schema.org microformats may rank above pages that do not. Since pages that render with the additional data afforded by microformats have higher click-through rates, any boost could be due to those higher rates. There is no question that microformats (or micro tags) improve the visibility of content, which is a good thing.

Why Mine Content?

Content is tagged by an indexing process. The main signals that search engines look for come from the text itself (such as keyword density), along with incoming links and a host of other signals. How can we do a better job of optimizing content through mining? One way to think of this is to consider the different ways that users describe what they are looking for, and to use those term variants in the product page. If the user is looking for a cell phone and the product is listed under mobile phone, and the description does not contain “cell phone,” then search engines may miss the page. It would be possible to create content about “how to choose a cell phone” that would be better optimized for search. That page could contain a link to the mobile phone page and an offer for the consumer. Keyword analysis tools provide term usage frequency so that the correct terms can be optimized. Mining content for alternative terms can provide targets for content optimization.
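A minimal sketch of the term-variant check described above: given a page and a table of variant terms, report which variants the page never mentions. The variant table is hypothetical; in practice it would come from keyword analysis tools or mined search logs.

```python
import re

# Hypothetical variant groups, keyed by the term the catalog uses.
VARIANTS = {
    "mobile phone": ["cell phone", "cellphone", "smartphone"],
}


def missing_variants(page_text, variants=VARIANTS):
    """Return variant terms absent from a page that uses the canonical term."""
    text = page_text.lower()

    def present(term):
        # Whole-word match so that short terms do not match inside longer words.
        return re.search(r"\b" + re.escape(term) + r"\b", text) is not None

    gaps = {}
    for canonical, alternatives in variants.items():
        if present(canonical):
            absent = [term for term in alternatives if not present(term)]
            if absent:
                gaps[canonical] = absent
    return gaps


page = "Our latest mobile phone has a 6-inch display. This smartphone ships unlocked."
print(missing_variants(page))
```

Each reported gap is a candidate either for the product description itself or for a separate piece of content that links to the product page.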

Mining Content for Task Relationships

Another aspect of content mining pertains to creating product combinations and relationships. “Works with” might be an important relationship that would allow for related content to be aggregated with the target content. Google uses internal links and related content as signals for boosting search ranking, but this approach is also inherently valuable for the user. Content can be mined for processes that require particular products, and these product relationships can be surfaced in ways that improve the user experience by helping users solve their problems, as well as strengthening the signals used by ranking algorithms. Simple keyword relationships such as “works with” (e.g., this saw blade “works with” this reciprocating saw), “replaces” (e.g., this LED bulb “replaces” this incandescent bulb), or “is consumed by” (e.g., this ink cartridge “is consumed by” this inkjet printer) are precursors to a more process-based approach (e.g., how to replace the saw blade in your reciprocating saw, or how to lower your electric bill by installing high-efficiency replacement bulbs). We can get much of this precursor information right from the product data itself. Those concepts can be chained together to build a more comprehensive task-based ontology to solve more complex problems or to zero in on specific meanings of terms for particular audiences. This can be mined from publicly available information (like publications from standards organizations, how-to information and product manuals).
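One simple way to hold these relationships is as subject, relationship, object triples that can be looked up from either side. The sketch below is only an illustration of that idea; the triples and the function names are hypothetical.

```python
from collections import defaultdict

# (subject, relationship, object) triples; product names are illustrative.
TRIPLES = [
    ("saw blade", "works with", "reciprocating saw"),
    ("battery pack", "works with", "reciprocating saw"),
    ("LED bulb", "replaces", "incandescent bulb"),
    ("ink cartridge", "is consumed by", "printer"),
]


def index_by_object(triples):
    """Group triples by the product they point at."""
    index = defaultdict(list)
    for subject, relation, obj in triples:
        index[obj].append((relation, subject))
    return index


def related_products(product, triples=TRIPLES):
    """List (relationship, product) pairs that refer to the given product."""
    return index_by_object(triples).get(product, [])


print(related_products("reciprocating saw"))
```

A product page for the reciprocating saw could use this lookup to surface compatible blades and batteries alongside the main content.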

What Users Search for Versus What They Buy – Revealing Hidden Patterns

Entry points resulting from organic search may lead directly to conversions (purchase or further inquiry) or may lead to browsing behavior and conversion to another product.  Those relationships can also be mined and analyzed to derive related concepts that can then be associated with product data.  Imagine a user was searching for “hydraulic pump repair” and then purchased valves, cylinders, gaskets, sealants, servo motors, and pistons.  Those product relationships can be captured to assemble potential choices for future customers making similar searches. Related documents and literature can provide applications and solutions that can become additional metadata for products.  When instantiated in a product information management (PIM) system, content management system or ecommerce application, this information provides additional ways for customers to research and shop for products.  Schema.org does allow for some related product data which could differentiate one seller from another.
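The search-to-purchase pattern can be approximated by counting which products were bought in sessions that began with a given query. This is a sketch under simple assumptions: the session log below is invented for illustration, and a real system would also need to weigh session volume and recency.

```python
from collections import Counter, defaultdict

# Hypothetical session log: (search query, products purchased in that session).
SESSIONS = [
    ("hydraulic pump repair", ["valves", "gaskets", "sealants"]),
    ("hydraulic pump repair", ["gaskets", "cylinders"]),
    ("hydraulic pump repair", ["gaskets", "valves", "pistons"]),
    ("servo motor", ["servo motors"]),
]


def purchases_by_query(sessions):
    """Count, per query, how many sessions ended with each product purchased."""
    counts = defaultdict(Counter)
    for query, purchased in sessions:
        # Use a set so a product counts once per session.
        counts[query].update(set(purchased))
    return counts


def suggest(query, sessions=SESSIONS, top_n=2):
    """Return the products most often bought after the given query."""
    ranked = purchases_by_query(sessions)[query].most_common(top_n)
    return [product for product, _ in ranked]


print(suggest("hydraulic pump repair"))
```

The ranked products could then be stored as related-product metadata in the PIM or ecommerce system.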

Mining and Combining User Data  

Sales data can be combined with profile data, preferences, social graph and real-time clickstream behaviors to build a nuanced view of the customer. Browse sessions (what individuals search for, what they click on and which pages they view) can be used to make inferences about their interests and their intent. Affiliate content, social media feeds and cross-domain cookies can also be used to round out that profile in the absence of any declared information about preferences. Past purchase history is a good source of information, although that data needs to be tempered so we don’t end up trying to sell people things they have already bought. Consider a customer who comes to the site looking for motherboards, memory, CPUs, power supplies, disk drives and cases. We might infer that they are interested in building a computer. Another customer might share some similar but incomplete browse and search patterns (they only searched for memory). If their other characteristics place them in the same customer segment as the first, we can also infer their interest in building a computer, even with limited data. Over time, shopper patterns are derived and associated, which provides clues about what shoppers are interested in, what is important to them, and what they intend to buy.
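The computer-building example can be sketched as a coverage check: how much of a known category pattern has a shopper already browsed? The patterns, the threshold and the function name below are assumptions made for illustration.

```python
# Hypothetical intent patterns: each intent maps to the categories that signal it.
INTENT_PATTERNS = {
    "build a computer": {
        "motherboards", "memory", "CPUs",
        "power supplies", "disk drives", "cases",
    },
    "set up a home network": {"routers", "switches", "ethernet cables"},
}


def infer_intent(viewed_categories, patterns=INTENT_PATTERNS, threshold=0.5):
    """Return (intent, coverage) for the best-matching pattern, or None."""
    viewed = set(viewed_categories)
    best = None
    for intent, categories in patterns.items():
        # Fraction of the pattern's categories that the shopper has viewed.
        coverage = len(viewed & categories) / len(categories)
        if coverage >= threshold and (best is None or coverage > best[1]):
            best = (intent, coverage)
    return best


print(infer_intent(["motherboards", "memory", "CPUs", "cases"]))
print(infer_intent(["memory"]))
```

The second shopper, who only looked at memory, falls below the threshold. That is the case where segment-level characteristics would have to supply the missing evidence.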

Deriving Structure Using Natural Language Processing

We can apply natural language processing (NLP) to extract features and attributes from content and transform them into structured data. Unstructured product information contains entities, attributes, and values that can be extracted from the text using a variety of machine learning and NLP techniques. An example might be a descriptive paragraph about a set of safety goggles which states that the goggles have a wrap-around lens for an unobstructed field of view. NLP can decode the syntax and semantics of the sentence and draw an inference about those characteristics, which can then be used in search and merchandising. We can infer that “wrap-around lenses” improve visibility and then create a schema attribute (such as “field of view”) and value (such as unobstructed, or 180 degrees, etc.) to add to the product data. This approach is also effective when there are lists of features in a semi-structured form (e.g., a table in HTML or PDF, a bulleted list, etc.). Product features can also be combined into solutions: a solution can be inferred from a combination of attributes that meets its requirements, rather than being explicitly defined and declared.
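The goggles example can be sketched with hand-written pattern rules. This is a deliberately simple stand-in for the NLP and machine learning techniques mentioned above, not an implementation of them, and the rules themselves are hypothetical.

```python
import re

# Hypothetical phrase-to-attribute rules; a production system would learn or
# curate these rather than hard-code them.
RULES = [
    (re.compile(r"wrap-?around lens(es)?", re.IGNORECASE),
     ("field of view", "unobstructed")),
    (re.compile(r"anti-?fog", re.IGNORECASE),
     ("lens coating", "anti-fog")),
]


def extract_attributes(description, rules=RULES):
    """Map phrases found in free text to structured attribute/value pairs."""
    attributes = {}
    for pattern, (name, value) in rules:
        if pattern.search(description):
            attributes[name] = value
    return attributes


text = "These safety goggles have a wrap-around lens for an unobstructed field of view."
print(extract_attributes(text))
```

The extracted pairs can then be added to the product record alongside the attributes that were entered by hand.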

Deep Linking and Micro Tagging

Up until now we have been talking for the most part about building out a deep data model to serve up relevant content to users, drawing from multiple sources of evidence. The last point about semantic tagging is how customers can leverage all the inferred and synthesized metadata to express a set of semantic tags on their pages, making those pages better optimized for search. For instance, if we generate a set of several thousand landing pages, we can use the same metadata that we used to generate each page to tag the page and its components, reinforcing its relevance to the topic. This is an emerging area which will quickly take off, especially as more services parse page data and deliver “answers” rather than page views. Think Siri results that are tailored to a specific use case, rather than search results that make the user hunt down a way to interact with the content.

Products at the end of the long tail are a mash-up of attributes. A red Calvin Klein crewneck women’s size 3 Irish knit sweater combines color, brand, style, gender, size, weave, and product type. Micro tagging allows for detailed and specific entry to a product that meets the particular needs of a customer. This approach is also appropriate for detailed retrieval of context-specific products.
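To make the mash-up concrete, the sketch below decomposes the sweater description into its attributes using small controlled vocabularies. The vocabularies and the function name are hypothetical, and real catalogs would need far larger term lists.

```python
import re

# Hypothetical controlled vocabularies, one list of known terms per attribute.
VOCAB = {
    "color": ["red", "blue"],
    "brand": ["calvin klein"],
    "style": ["crewneck", "v-neck"],
    "gender": ["women's", "men's"],
    "weave": ["irish knit", "cable knit"],
    "product type": ["sweater", "cardigan"],
}


def decompose(title, vocab=VOCAB):
    """Split a long-tail product title into attribute/value pairs."""
    text = title.lower()
    attributes = {}
    for attribute, terms in vocab.items():
        for term in terms:
            # Whole-word match, so "men's" does not match inside "women's".
            if re.search(r"\b" + re.escape(term) + r"\b", text):
                attributes[attribute] = term
                break
    size = re.search(r"\bsize (\w+)\b", text)
    if size:
        attributes["size"] = size.group(1)
    return attributes


print(decompose("Red Calvin Klein crewneck women's size 3 Irish knit sweater"))
```

Each recovered pair is a candidate for its own tag on the page, which is what lets a shopper land on the product from any combination of those attributes.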

“Semantic micro tags” are metadata. Semantic tagging is simply tagging based on meaning. “Although the content of web pages has been capable of some ‘automated processing’ since the inception of the web, such processing is difficult because the markup tags used to display information on the web do not explain what the information means. Microformats can bridge this gap by attaching semantics, and thereby obviate other, more complicated, methods of automated processing, such as natural language processing or screen scraping. The use, adoption and processing of microformats enables data items to be indexed, searched for, saved, or cross-referenced, so that information can be reused or combined.”

Semantics can be approached from many different perspectives that range from adding micro-tags to conducting various analytical techniques that help optimize the search experience. Relationship information about products that are used together can help expedite and streamline the user’s progress. Entity extraction can be used to derive information about product features and use. Finally, combining information from multiple sources can be used to produce insights that are deeper than any single source, resulting in an enhanced user experience.