© Fabian M. Suchanek

Mining Incompleteness of Knowledge Bases

The problem of incompleteness

A knowledge base (KB) is a computer-processable collection of knowledge about the real world. An example of such a KB is YAGO, which we co-develop at Télécom Paris University together with the Max Planck Institute for Informatics. The content of these KBs is usually correct, i.e., it corresponds to reality: If the KB says that Elvis Presley was married to Priscilla Presley, then this is most likely the case. However, KBs are usually not complete: There may be information missing from the KB. For example, the KB may not contain all albums of Elvis Presley. This is a problem for downstream applications of the KB, which will receive answers to queries that do not correspond to reality (e.g., when asking how many albums Elvis released, or when asking whether Elvis released a particular album that happens to be missing from the KB).

The problem is exacerbated by the fact that KBs usually do not store negative information. For example, a KB will not store the fact that Elvis did not (co-)release the album “Brothers in Arms”. While this fact is not in the KB, this does not mean that it is false: KBs operate under an open world assumption, which means that we may not conclude that statements that are absent from the KB are false. The reason is that even the KB creators usually only know (or can extract) the true statements — they do not know which statements are false, or which claims are incomplete. Thus, there is no way to detect from the KB alone whether a statement that is not in the KB is true in reality.
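The difference between the open world assumption and a closed world reading can be illustrated with a minimal sketch. The toy KB, the relation names, and the two helper functions below are assumptions made for illustration; they are not part of YAGO or of any particular KB system.

```python
from enum import Enum


class Answer(Enum):
    TRUE = "true"
    UNKNOWN = "unknown"


# Toy KB: a set of (subject, relation, object) triples (illustrative names).
KB = {
    ("Elvis_Presley", "marriedTo", "Priscilla_Presley"),
    ("Elvis_Presley", "releasedAlbum", "Blue_Hawaii"),
}


def ask_open_world(kb, triple):
    """Open world: a triple absent from the KB is 'unknown', not 'false'."""
    return Answer.TRUE if triple in kb else Answer.UNKNOWN


def ask_closed_world(kb, triple):
    """Closed world: a triple absent from the KB is read as false."""
    return triple in kb


query = ("Elvis_Presley", "releasedAlbum", "Brothers_in_Arms")
open_answer = ask_open_world(KB, query)
closed_answer = ask_closed_world(KB, query)
```

Here the closed world reading happens to give the right answer (False), but it would give the same answer for any real Elvis album that is merely missing from the KB. The open world reading avoids that error at the price of answering "unknown" in both cases.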

It is relatively easy to compute what percentage of the entities in the KB have a certain attribute. But that alone does not tell us whether, in reality, a higher percentage of entities have that attribute. In our research team, we develop methods that estimate where the KB is incomplete with respect to the real world. These methods estimate the incompleteness of a KB from the KB itself, without the use of external information.
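The in-KB percentage mentioned above can be computed in a few lines. The following is a minimal sketch on a toy set of triples; the function name attribute_coverage and the relation name birthDate are assumptions chosen for illustration.

```python
def attribute_coverage(triples, entities, relation):
    """Fraction of `entities` with at least one `relation` fact in the KB."""
    entities = set(entities)
    if not entities:
        return 0.0
    with_attribute = {s for (s, r, _o) in triples if r == relation and s in entities}
    return len(with_attribute) / len(entities)


# Toy KB: the birth date of Lisa Marie Presley is deliberately left out.
triples = {
    ("Elvis_Presley", "birthDate", "1935-01-08"),
    ("Priscilla_Presley", "birthDate", "1945-05-24"),
    ("Elvis_Presley", "marriedTo", "Priscilla_Presley"),
}
people = ["Elvis_Presley", "Priscilla_Presley", "Lisa_Marie_Presley"]

birth_date_coverage = attribute_coverage(triples, people, "birthDate")
spouse_coverage = attribute_coverage(triples, people, "marriedTo")
```

The two scores look alike, but they mean different things: every person has a birth date in reality, so any coverage below 100% signals missing facts, whereas a low spouse coverage may simply reflect that not everyone is married. The score alone cannot tell these cases apart.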

Detecting missing facts

The KB may be missing facts. For a given subject (say, Elvis Presley) and a given relation (say, releasedAlbum), we can detect whether objects are missing (i.e., whether there are albums in reality that are not in the KB). We use the rule mining system AMIE for this purpose.

Mining missing facts

If the KB is missing facts, we can try to predict them automatically with the help of rule mining. For example, we can automatically mine the rule that people usually live in the same city as their spouse. If the KB is missing the city of residence of Priscilla Presley, we can then predict that she lives in the same city as Elvis Presley.
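The spouse rule above can be sketched as follows. This is not the AMIE implementation: it hard-codes a single rule, marriedTo(x, y) AND livesIn(x, c) => livesIn(y, c), and scores it on a toy KB with support, standard confidence, and a confidence computed under the partial completeness assumption (PCA), which counts a prediction as a counterexample only if the KB already knows some residence for that person. The entities Alice, Bob, Carla, and Dario, all residence facts, and the function names are made up for illustration.

```python
# Toy KB of (subject, relation, object) triples (illustrative facts only).
KB = {
    ("Elvis_Presley", "marriedTo", "Priscilla_Presley"),
    ("Elvis_Presley", "livesIn", "Memphis"),
    ("Alice", "marriedTo", "Bob"),
    ("Alice", "livesIn", "Paris"),
    ("Bob", "livesIn", "Paris"),
    ("Carla", "marriedTo", "Dario"),
    ("Carla", "livesIn", "Rome"),
    ("Dario", "livesIn", "Milan"),
}


def body_matches(kb):
    """Return all (y, c) such that marriedTo(x, y) and livesIn(x, c) are in the KB."""
    spouses = [(s, o) for (s, r, o) in kb if r == "marriedTo"]
    homes = [(s, o) for (s, r, o) in kb if r == "livesIn"]
    return {(y, c) for (x, y) in spouses for (x2, c) in homes if x == x2}


def evaluate_rule(kb):
    """Score the rule marriedTo(x, y) AND livesIn(x, c) => livesIn(y, c)."""
    matches = body_matches(kb)
    has_home = {s for (s, r, _o) in kb if r == "livesIn"}
    # Support: predictions that are already confirmed by the KB.
    support = sum((y, "livesIn", c) in kb for (y, c) in matches)
    # PCA denominator: only predictions about people with a known residence.
    pca_body = sum(y in has_home for (y, _c) in matches)
    # New facts: predictions for people whose residence is unknown.
    predictions = {(y, "livesIn", c) for (y, c) in matches if y not in has_home}
    return {
        "support": support,
        "std_confidence": support / len(matches) if matches else 0.0,
        "pca_confidence": support / pca_body if pca_body else 0.0,
        "predictions": predictions,
    }


result = evaluate_rule(KB)
```

In this sketch, the rule is confirmed once (Bob), contradicted once (Dario), and undecided once (Priscilla Presley). The standard confidence treats the undecided case as a counterexample, whereas the PCA confidence ignores it, which fits the open world setting better. Restricting predictions to people without any known residence is a design choice of this sketch.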

Detecting obligatory attributes

Some attributes are obligatory for a class, i.e., all instances of that class have the attribute in the real world. For example, all people must have a birth date. Other attributes are not obligatory. For example, not all people have a spouse. In the KB, these attributes cannot easily be distinguished, because both of them may be highly incomplete. We have developed a probabilistic method that can guess whether an attribute is obligatory or not, even if the KB is highly incomplete.

Detecting missing entities

The KB may be missing entities. For example, the KB may not contain all cities in France. With the help of statistical methods, we can give a lower bound for the number of missing entities per class.