What’s caused this recent hubbub is a new research paper out of Google about an idea called “Knowledge-Based Trust,” which algorithmically determines the trustworthiness of web content based on the probability that it is factually correct.
The first thing you should know about Knowledge-Based Trust is that it is simply research at this point. John Mueller, who works at Google, has confirmed publicly that it is not currently used in Google Search. The second thing to note is that if Google does one day decide to implement something like this, it would likely augment, rather than replace, its existing methods for assessing the quality of web content.
With that said, it is my opinion that this latest research from Google is very important because it hints at where Google is headed in automating the extraction of knowledge from the web, and because it opens up some interesting possibilities for the future of artificial general intelligence.
What is Knowledge-Based Trust?
At its core, Knowledge-Based Trust is the idea that the quality of a website, meaning its reliability, can be measured by the factual accuracy of its information. That may seem obvious, but it’s not actually how search engines work today.
Today, a website’s credibility is largely determined by whether other authoritative websites link to it. Google’s initial search breakthrough was to essentially outsource credibility assessment to web authors, harnessing the collective wisdom of millions of people by trusting that what they linked to would be of value to web searchers.
For Google to augment that technique by factoring in the accuracy of a website’s underlying information, it would need to come up with a process for assessing that accuracy in a highly automated and web-scalable way – and that’s precisely what Google is exploring with its recently published paper, Knowledge-Based Trust: Estimating the Trustworthiness of Web Sources (PDF).
Extracting Facts from the Web
The first step in assessing this kind of information accuracy is to actually isolate and extract potential “facts.” In semantic technology circles, facts are structured as triples: a subject related to an object by a predicate (that’s three elements – i.e. a “triple”). As a simple example, “dog, fur color, black” states the proposition that a dog’s fur is black.
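To make the triple structure concrete, here’s a minimal sketch in Python. The `Triple` type and field names are my own illustration, not anything from Google’s papers:

```python
from typing import NamedTuple

class Triple(NamedTuple):
    """A candidate fact expressed as (subject, predicate, object)."""
    subject: str
    predicate: str
    obj: str  # "object" avoided only because it shadows a Python builtin

# The example from the text: a dog's fur is black.
fact = Triple("dog", "fur color", "black")
print(fact)  # Triple(subject='dog', predicate='fur color', obj='black')
```

Representing facts this uniformly is what makes them easy to store, compare, and count across millions of sources.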
In this latest research, Google used some sixteen different techniques, called “extractors,” to identify and extract these kinds of triples from various websites. An example is the pattern “A, married to B” which can crawl through millions of websites to identify and extract facts about people being married to one another. Extracting these kinds of facts from unstructured text, like this article, is tough work, and as we’ll see later, can generate a lot of errors in the process.
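A toy version of such a pattern extractor can be written as a regular expression. This is purely illustrative; Google’s actual extractors are far more sophisticated, and the function name and pattern here are my own invention:

```python
import re

# Hypothetical "A, married to B" extractor: matches capitalized name
# phrases on either side of the pattern and emits triples.
MARRIED_PATTERN = re.compile(
    r"([A-Z][a-z]+(?: [A-Z][a-z]+)*), married to ([A-Z][a-z]+(?: [A-Z][a-z]+)*)"
)

def extract_married_triples(text):
    """Return (subject, predicate, object) triples found in raw text."""
    return [(a, "married to", b) for a, b in MARRIED_PATTERN.findall(text)]

text = "Barack Obama, married to Michelle Obama, spoke yesterday."
print(extract_married_triples(text))
# [('Barack Obama', 'married to', 'Michelle Obama')]
```

Even this tiny example hints at why extraction is error-prone: a sentence like “No, she was never, married to him” would fool a naive pattern, which is exactly the kind of noise the research has to account for.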
To understand the rest of Google’s thinking on knowledge-based trust, it helps to draw on an earlier Google research paper, Knowledge Vault: A Web-Scale Approach to Probabilistic Knowledge Fusion (PDF). (I wrote about that earlier research in When Machines Know.) In that paper, in addition to extracting triples from websites using unstructured text, the researchers also tapped various forms of structured data, including Document Object Models, tables, and various annotation techniques like microformats and Schema.org. These sources are generally easier for machines to read, but unstructured text is really important because it’s so prevalent on the web.
The latest research refers to this earlier work as “knowledge fusion.” The researchers found that by combining, or “fusing,” the various extractors mentioned above, they were able to increase their confidence in a particular triple being factually correct. The fusion doesn’t stop there, though.
In 2010, Google acquired Metaweb, the company behind Freebase, a very large database of facts structured as triples. Google has since announced plans to shutter Freebase and help move its data over to Wikidata, but not before using the Freebase facts to help jumpstart its massive Knowledge Graph project, first revealed in 2012.
One of the interesting things about datasets like Freebase is that they can be represented as a kind of network of connections between triples. Google researchers analyzed connections between existing triples in ways that enabled them to assess the probability of a new triple’s accuracy. In other words, the patterns between existing facts can help to assess the probability of a new fact being true.
Where might Google find new potential facts that it might want to assess against the database? That’s right: from the triples it’s now learning to extract from websites. Web extractors might find articles, for example, that weakly suggest that a Gideon Rosenblatt once worked in China, but when that information is fused with a database fact showing that a Gideon Rosenblatt once worked for the US-China Business Council, and that organization has an office in Beijing, the confidence in the fact that Gideon Rosenblatt once worked in China goes up significantly.
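The intuition behind this kind of confidence boost can be sketched with a simple “noisy-OR” combination of independent evidence: the fact is false only if every piece of supporting evidence is a false positive. This is a toy model of my own; the papers use much richer probabilistic machinery that also models source and extractor error rates:

```python
# Toy illustration of fusing independent evidence for one candidate fact.
# Each source contributes a probability that the fact is true; assuming
# independence, the fused confidence is 1 minus the chance that every
# piece of evidence is wrong.

def fuse(probabilities):
    """Noisy-OR combination of independent evidence probabilities."""
    p_all_wrong = 1.0
    for p in probabilities:
        p_all_wrong *= (1.0 - p)
    return 1.0 - p_all_wrong

# Weak web-extracted evidence alone:
print(round(fuse([0.4]), 3))            # 0.4
# Fused with two supporting database-derived facts:
print(round(fuse([0.4, 0.7, 0.5]), 3))  # 0.91
```

Three individually unconvincing signals combine into high confidence, which is the essence of why fusing extractors with existing knowledge-base facts pays off.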
Information Extraction and Quality Assurance
One of the breakthroughs in the latest paper is the introduction of a new layer of analysis that distinguishes between factual errors residing in the underlying information from source websites and factual errors introduced by Google’s own information extraction efforts. The research team found that errors introduced by extraction were far more prevalent than actual errors in the underlying information, which makes this distinction significant. The researchers were also able to reduce the probability of extraction-induced errors by fine-tuning the size of the chunks of information being analyzed.
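A back-of-the-envelope model shows why separating the two error channels matters. If we assume (my simplification, not the paper’s actual model) that an observed triple is correct only when the source stated the fact correctly and the extractor read it correctly, and that the two channels are independent, then:

```python
# Toy model of the two error channels: a triple can be wrong because
# the website stated a wrong fact, or because the extractor garbled a
# correct statement. Under independence, the chance an observed triple
# is correct is the product of the two accuracies.

def p_observed_correct(p_source_accurate, p_extractor_accurate):
    return p_source_accurate * p_extractor_accurate

# Accurate sources, sloppy extraction: extraction dominates the error.
print(p_observed_correct(0.95, 0.70))  # 0.665
# Same sources after improving the extractor:
print(p_observed_correct(0.95, 0.95))  # 0.9025
```

With accurate sources but a sloppy extractor, most observed errors come from extraction, so blaming the websites would badly misjudge their trustworthiness. That is exactly the confusion the paper’s extra layer of analysis is designed to avoid.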
Both of these breakthroughs could prove quite important to one day operationalizing these techniques in Google Search.
I’ve done my level best here to simplify what are actually two very complex research papers. Here’s another, more visual way of looking at it:
Here you see both the structured and unstructured information pulled from the web and fused together in the lower left box labeled “Web Extracted Facts.” The structured information includes boxes for Document Object Models, tables, and annotations. The unstructured information is text-based and has a number of extractors that feed into the actual extraction of triples from raw text. Above these web-extracted facts is a layer of what I’m calling “quality control” that is aimed at separating error contributions due to source material, extractor performance, and the granularity of content being analyzed. To the right are Freebase Facts, which use “path-ranking algorithms” (PRA) and neural network models (MLP, or “multilayer perceptron”) to predict the probability of new facts. These two branches are then fused together into what the researchers call “knowledge fusion,” as a way to further boost confidence in the accuracy of a given fact.
To be clear, this is my own framing of how these two research papers fit together, because I think it helps show how these efforts build to something much bigger.
What Are Google’s Goals in Knowledge Fusion?
So, what are the likely goals of this research?
Improving Search Results
The first goal is to use “knowledge-based trust” to help surface quality content in search results. The researchers describe these techniques as “a valuable additional signal for web source quality” – an augmentation to, rather than replacement of, the existing signals Google Search already uses today.
What’s important about this latest round of research is that it demonstrates significant quality improvements through isolating errors caused by the underlying content on websites from errors caused by Google’s own information extraction processes. This is Google experimenting with machine learning features that will help it scale up information extraction to a typical Google “Internet scale,” as an extension of the crawling work Google already does for its core search business.
By the time something like this is rolled out, knowledge-based trust could end up looking quite different from what we’re seeing in these research papers. It will also have to be “fused” with existing search signals before it can be operationalized. It’s also quite likely that certain types of information domains will prove to be lower-hanging fruit than others, which is to say that assessing sports news websites for their factual accuracy will probably prove feasible far sooner than it will for philosophy websites.
Improving the Knowledge Graph
The other goal of this research is to help grow and improve Google’s Knowledge Graph. This database of facts is an asset that is going to be increasingly important to the company in the years ahead. As the researchers put it:
…Wikipedia growth has essentially plateaued, hence unsolicited contributions from human volunteers may yield a limited amount of knowledge going forward. Therefore, we believe a new approach is necessary to further scale up knowledge base construction. Such an approach should automatically extract facts from the whole Web, to augment the knowledge we collect from human input and structured data sources.
Note that while these innovations do reduce reliance on unsolicited human contributions in places like Wikipedia, Freebase, and now Wikidata, in the bigger picture, Google is simply expanding the methods through which it draws upon human knowledge contributions. Rather than pulling facts only from structured databases, it is now extracting them, on a massive scale, from hundreds of millions, if not billions, of websites. In a sense, this is really just an echo of Google’s first breakthrough success, crawling and indexing the web. With these new tools, it simply expands that work, using artificial intelligence techniques to harvest the collective knowledge of humanity, and “organize the world’s information and make it universally accessible and useful.”
Over the last few years we’ve grown used to seeing the Knowledge Graph show up in search results as facts in carousels, cards and quick answer boxes. Just recently, Google revealed the latest iteration of this work: some very useful new health-related information displayed as information cards right in search results. But that’s just the start.
Google Now is already well on its way to becoming a virtual personal assistant thanks to Google’s advances in speech recognition, natural language, and yes, the knowledge graph. Google Now is really the essence of Google; quite tellingly, when you install the app on your mobile device, it’s simply called “Google.”
Ultimately, where this work heads is artificial general intelligence. We are right now in the early stages of an AI boom, similar in some ways to the dot-com boom of the late nineties. Before long, we will be surrounded by countless applications of artificial intelligence, embedded in most of the services and products we interact with on a daily basis. Most of these applications will be quite narrow and specialized in focus, but not the one Google is setting out to build. For it to succeed, this AI will need to be general purpose in scope, and it will need to learn from us in a free and scalable way.
If the company is hesitant to talk openly about these ambitions today, that is understandable given the concerns some are now expressing about AI. But hints remain from Google’s early days, as with this revealing remembrance by Kevin Kelly of a conversation he once had with Google’s Larry Page back in 2002:
“Larry, I still don’t get it. There are so many search companies. Web search, for free? Where does that get you?” … Page’s reply has always stuck with me: “Oh, we’re really making an AI.”