20 million entities is kind of cool. You know what is really cool? A billion entities.
~ The Structured Search Engine
Photo credit: Flávio Santos
Before Google produced the knowledge graph, it worked on a project called the Fact Repository, run by a team working with Andrew Hogue, who gave the Structured Search Engine presentation quoted above (highly recommended watching). That video is a good introduction to knowledge graphs and knowledge bases. Hogue was also responsible for bringing ideas from Google’s acquisition of Metaweb and Freebase into Google, which is why the presentation covers aspects of that purchase. Freebase was a manual effort to build a knowledge graph.
We’ve seen later efforts at doing similar things, such as the one described in the Biperpedia paper, which uses query log streams to extract data and build ontologies about different topics. I wrote about that in a post called SEO Moves From Keywords to Ontologies and Query Patterns.
So, how does Google automate the building of knowledge graphs to make it something that is web scalable? They stopped the Google Directory after realizing that an effort like the Open Directory Project that the Google Directory was based upon couldn’t keep up with the growth of the Web. The Open Directory Project was also likely used by Google to run focused crawls on the web that covered different topics, as a seed source to make sure the search engine was covering as wide a range of topics as it could.
Freebase, which came to Google with the acquisition of Metaweb, was a manual effort, and like the Open Directory Project it was unlikely to scale.
Can Google use a different approach to build a knowledge graph in a way that is automated?
The Biperpedia project, which uses query streams to effectively crowdsource ontology building, is one way to cover a broader range of topics.
Google also acquired a company called Wavii a couple of years ago. I wrote about the acquisition in the post With Wavii, Did Google Acquire the Future of Web Search? One of the papers that came from Wavii is worth reading carefully because it describes how a crawler could learn from reading the Web. That paper is Open Information Extraction: the Second Generation (pdf).
While not working exactly the same way, I was reminded of the Open Information Extraction approach from Wavii by a new patent from Google about building knowledge bases using Context Clouds.
As the patent puts it, "the exemplary embodiments described herein relate to computerized systems and methods for building knowledge bases using context clouds."
Definitions from the Context Clouds Patent
A knowledge base – provides a repository of structured and unstructured data.
A structured knowledge base – may include, for example, one or more knowledge graphs. The data stored in a knowledge base may include:
- information related to entities
- facts about entities
- relationships between entities
Data stored in knowledge bases can be used for various purposes, including processing and responding to user search queries submitted to a search engine.
Sources of data in a knowledge base – may be created and expanded using information from a wide variety of sources, such as electronic documents accessible over a network, including the Internet. Examples of such documents include:
- Press releases
- News items
- Technical papers
- The like
Web pages and other documents may provide information on entities, as well as relationships between entities. Other sources, such as managed databases, may provide information on known entities and relationships between entities.
Building Knowledge Bases Using Context Clouds
The process behind this patent includes things such as:
(1) Parsing text in at least one document on the Internet and detecting a target object in unstructured portions of the parsed text.
(2) Identifying objects that are proximate to the target object,
(3) Determining one or more context clouds for the target object based on the proximate objects,
(4) Determining a relationship associated with the target object, based on an analysis of the proximate objects, the context clouds, and an analysis of other documents containing the target object.
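Steps (1) through (3) can be sketched in a few lines of Python. This is only an illustration under loud assumptions: the patent does not say how text is parsed or how "proximate" is measured, so the regex tokenizer, the fixed `window` of three tokens, and the function names `tokenize`, `proximate_objects`, and `context_cloud` are all hypothetical, and the two sample sentences are made up.

```python
import re
from collections import Counter


def tokenize(text):
    """Split text into lowercase word/number tokens (a stand-in for real parsing)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def proximate_objects(tokens, target, window=3):
    """Collect tokens within `window` positions of each occurrence of `target`."""
    nearby = []
    for i, tok in enumerate(tokens):
        if tok == target:
            start = max(0, i - window)
            nearby.extend(tokens[start:i] + tokens[i + 1:i + 1 + window])
    return nearby


def context_cloud(documents, target, window=3):
    """Count the objects that co-occur near `target` across all documents."""
    cloud = Counter()
    for doc in documents:
        cloud.update(proximate_objects(tokenize(doc), target, window))
    return cloud


# Made-up sample documents, for illustration only.
docs = [
    "Mozart was born in Salzburg in 1756.",
    "The composer Mozart died in Vienna.",
]
cloud = context_cloud(docs, "mozart")
```

Here `cloud` is a simple count of the words seen near the target. Step (4), determining a relationship, would need more than these counts, which is where the analysis of other documents comes in.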
A System for Generating Knowledge Graphs
The system works to:
(1) Detect a first data object in a document on the Internet,
(2) Detect a second data object proximate to the first data object in the document,
(3) Identify a third data object associated with the second data object, based on a frequency of co-occurrence of the second data object and the third data object in one or more stored occurrence lists,
(4) Generate, in a knowledge graph stored in a database, a first entry including the first data object and at least one of the third data object or a first predefined relationship between the second data object and the third data object.
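A minimal sketch of those four steps follows. It assumes the "stored occurrence lists" are simple co-occurrence counts keyed by a known object, and that the third data object is the most frequent co-occurring one; the patent does not spell out either choice. The names `OCCURRENCE_LISTS`, `PREDEFINED_RELATIONSHIPS`, `most_associated`, and `generate_entry` are hypothetical, and the counts are invented for the example.

```python
from collections import Counter

# Hypothetical stored occurrence lists: for a known object, how often
# other objects have been seen co-occurring with it (invented counts).
OCCURRENCE_LISTS = {
    "salzburg": Counter({"austria": 40, "mozart": 25, "festival": 10}),
}

# Hypothetical predefined relationships between pairs of known objects.
PREDEFINED_RELATIONSHIPS = {
    ("salzburg", "austria"): "located_in",
}


def most_associated(second, occurrence_lists):
    """Return the object that co-occurs most often with `second`, or None."""
    counts = occurrence_lists.get(second)
    if not counts:
        return None
    return counts.most_common(1)[0][0]


def generate_entry(first, second, occurrence_lists, relationships):
    """Build a knowledge graph entry for `first` from its proximate object."""
    third = most_associated(second, occurrence_lists)
    if third is None:
        return None
    return {
        "object": first,
        "proximate_object": second,
        "associated_object": third,
        "relationship": relationships.get((second, third)),
    }


knowledge_graph = []
entry = generate_entry(
    "mozarteum", "salzburg", OCCURRENCE_LISTS, PREDEFINED_RELATIONSHIPS
)
if entry is not None:
    knowledge_graph.append(entry)
```

The point of the sketch is that an object the system has never seen ("mozarteum") can pick up a candidate fact from what is already known about the object sitting next to it.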
This new context clouds patent is:
Computerized systems and methods for building knowledge bases using context clouds
Inventors: Sebastian Steiger, Christopher Semturs, Henrik Grimm, Lode Vandevenne, Danila Sinopalnikov, Nathanael Martin Scharli, David Lecomte and Alexander Lyashuk
Assignee: GOOGLE LLC
US Patent: 10,102,291
Granted: October 16, 2018
Filed: July 6, 2015
Computer-implemented systems and methods are disclosed for building knowledge bases, such as knowledge graphs, using context clouds. According to certain embodiments, a target object is identified in a portion of unstructured or semi-structured data in a target document, which does not conform to a predefined structure or pattern. A knowledge server may build a context cloud for the target document. The knowledge server may analyze one or more other documents stored in a networked database, to identify candidate documents that may include a meaning or relationship associated with the target object. The knowledge server may analyze one or more context clouds for the candidate documents to determine a meaning or relationship of the target object based on objects in the candidate document(s). The knowledge server may associate the determined meanings and/or relationships with the target object in the target document, thereby creating a new portion of a knowledge graph.
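One way to picture the "candidate documents" step in that abstract is to compare the context cloud of the target document against the clouds of stored documents. The sketch below uses Jaccard similarity over sets of objects, which is my own stand-in; the patent does not name a similarity measure. The function names, the 0.25 threshold, and the sample clouds are all assumptions made for illustration.

```python
def jaccard(a, b):
    """Overlap between two collections of objects, from 0.0 to 1.0."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def candidate_documents(target_cloud, stored_clouds, threshold=0.25):
    """Return ids of stored documents whose cloud overlaps the target's, best first."""
    scored = [
        (jaccard(target_cloud, cloud), doc_id)
        for doc_id, cloud in stored_clouds.items()
    ]
    return [
        doc_id
        for score, doc_id in sorted(scored, reverse=True)
        if score >= threshold
    ]


# Invented context clouds, reduced to sets of co-occurring objects.
target_cloud = {"born", "salzburg", "composer"}
stored_clouds = {
    "doc_a": {"composer", "salzburg", "symphony"},
    "doc_b": {"recipe", "flour", "oven"},
    "doc_c": {"born", "composer", "salzburg", "vienna"},
}
candidates = candidate_documents(target_cloud, stored_clouds)
```

Documents that share little context with the target, such as `doc_b` here, drop out before any meaning or relationship is looked up.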
What are Context Clouds?
This patent tells us about how Google might build knowledge graphs using context clouds.
Context clouds – may include co-occurring objects, such as words, numbers, characters, and groupings thereof.
This can involve analyzing unstructured information using a database of co-occurrences between objects of the unstructured information and known or structured objects, to determine relationships between the unstructured data objects and the known/structured objects.
A target object can have more than one context cloud, and each of those context clouds can be associated with a different definition or usage of the target object.
A target object which is a particular date (e.g., month, day, and year) could have:
(1) A first context cloud for people born on that date
(2) A second context cloud for people who passed away on that date,
(3) Additional context clouds for events that occurred on that date.
A context cloud holds what is known about different entities, so it can serve as seed knowledge for building and extending knowledge graphs:
Each context cloud can include other objects, such as entities (people, places, things), and attributes of the objects. Furthermore, each context cloud can include structured data with known relationships, semi-structured data with estimated relationships, and unstructured data with unknown relationships beyond a frequency of co-occurrence.
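To make the date example concrete, here is a small sketch of a target object with several context clouds, plus a way of picking the one that best fits the words around a particular mention. The `ContextCloud` class, the `best_cloud` function, the scoring rule (summing co-occurrence counts of nearby objects), and all of the counts are assumptions for illustration, not something the patent specifies.

```python
from dataclasses import dataclass, field


@dataclass
class ContextCloud:
    """One usage of a target object and the objects that co-occur with it."""
    usage: str
    objects: dict = field(default_factory=dict)  # object -> co-occurrence count


def best_cloud(clouds, proximate):
    """Pick the cloud whose objects best match the objects near the target."""
    def score(cloud):
        return sum(cloud.objects.get(obj, 0) for obj in proximate)
    return max(clouds, key=score)


# Invented clouds for a single date, one per usage of that date.
date_clouds = [
    ContextCloud("birth date", {"born": 12, "birthday": 7}),
    ContextCloud("death date", {"died": 9, "funeral": 4}),
    ContextCloud("event date", {"signed": 5, "treaty": 6}),
]
chosen = best_cloud(date_clouds, ["the", "treaty", "was", "signed"])
```

With those nearby words, the "event date" cloud scores highest, so the date would be read as the date of an event and not as someone's birthday.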
Google is working on ways to automate the building of knowledge graphs, using things such as context clouds. I wanted to summarize this approach here and discuss some of the efforts that have led up to where we are now. Please visit the patent, which goes into more detail on how this approach can be used to build knowledge graphs in a web scalable manner.