Linked Data

Wikidata - Q41483

April 1, 2018
6 mins
Thoughts on working with Wikidata from practical experience - its benefits and drawbacks.

For those of you who have decoded the title, Q41483 is probably self-evident. Having spent several weeks building a practical application on Wikidata, I thought it was time for an honest assessment. There are indeed good, bad, and ugly facets to Wikidata. I have my tin hat on.

The good

Always best to start with the positives. Wikidata does what it says on the tin. It works. It is a broad, extensive knowledge base built on the principles of linked data, a superset of the data machine-extracted from Wikipedia. It is crowd- and bot-curated. Anybody can contribute via its RDF-enhanced MediaWiki UI. In the true spirit of linked data it has outbound links to other open knowledge bases - the Google Knowledge Graph, BBC Things, GeoNames, and MusicBrainz to name a few - and also links to the information and images served up by its older sisters, Wikipedia and Wikimedia.

Provenance claims can be asserted on any Wikidata statement/fact - in the form of references and sources.

It gets better - Wikidata has a public SPARQL 1.1 endpoint, and a rather nice UI for querying it. Again, this works, and it seems to deal with contention well. Over the course of several weeks and a lot of querying against this endpoint, I never experienced any issues.

However...

The bad

It’s the model. I am struggling to work out the thought process that went into this. For those of you familiar with RDF, you will know that RDF is a W3C framework and standard for creating models that describe some domain (ontologies), and for describing things in that domain using statements that conform to those models. The RDF schema has patterns and rules for defining classes of Things, and the properties that describe them. Statements such as
<A> rdf:type <C> declares that resource <A> is an instance of some class <C>.
<D> rdfs:subClassOf <C> declares <D> to be a subclass of <C>.
Object-oriented software developers have a natural affinity for this form of modelling.
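
To make that concrete, here is a minimal sketch (in SPARQL Update, with made-up ex: names) of how a small class hierarchy and an instance of it would be declared using plain RDF schema:

PREFIX rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX ex:   <http://example.org/>

INSERT DATA {
  ex:Actor rdfs:subClassOf ex:Person .   # <D> rdfs:subClassOf <C>
  ex:clint rdf:type        ex:Actor .    # <A> rdf:type <C>
}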

Wikidata is built upon the RDF framework. But instead of describing Wikidata resources (things) using the RDF schema, Wikidata has built its own resource description framework out of RDF (the Resource Description Framework), thus creating an entire abstraction layer between its own model of the world and RDF.

Wikidata resources are essentially divided into two distinct sets: a set of property resources and a set of instance resources, known as items. The items set is the set of all the things in the realm of human knowledge. The set of properties is the collection of semantic predicates that items can be described by. The property resources all have unique IDs prefixed with P, and the item IDs are prefixed with Q.
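
A quick sketch of what this looks like in practice, pulling a few raw statements about the item in this article's title (Q41483) from the query UI:

# A few direct claims about Q41483: the predicates come back as opaque P IDs
# and most of the values as opaque Q IDs
SELECT ?predicate ?value WHERE {
  wd:Q41483 ?predicate ?value .
  FILTER(STRSTARTS(STR(?predicate), "http://www.wikidata.org/prop/direct/"))
}
LIMIT 10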

Now here is the kicker. Instead of using the RDF schema to define the properties and classes, Wikidata has defined its own schema that sort of mirrors the RDF schema. There is a Wikidata property P31 “instance of” that is semantically equivalent to rdf:type. Property P279 “subclass of” is semantically equivalent to rdfs:subClassOf. Classes themselves are declared as items, and items are then described using these properties (not using the RDF schema).
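
In practice this means that, when querying Wikidata, the familiar rdf:type / rdfs:subClassOf patterns are replaced by the wdt:P31 / wdt:P279 predicates. A minimal sketch against the public endpoint (the wd: and wdt: prefixes are predefined there):

# Items that are an instance of (P31) some subclass of (P279) human (Q5) -
# the Wikidata equivalent of the usual rdf:type / rdfs:subClassOf* pattern
SELECT ?item WHERE {
  ?item wdt:P31/wdt:P279* wd:Q5 .
}
LIMIT 10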

If you are now confused, you probably should be, as anyone with experience of RDF being introduced to Wikidata will be. Why does Wikidata not just use the RDF schema, you will ask yourself. I did, and I have no good answer (yet). This abstraction layer means Wikidata loses all the benefits of the RDF schema, and gains a frankly painful amount of confusion. Any consumer of Wikidata first needs to understand the Wikidata schema. Semantics in Wikidata have been redefined - badly, in my opinion.

and the ugly

It’s the obfuscation. So what’s wrong with opaque identifiers, I hear you say. Indeed nothing; it makes perfect sense that all the items (the things) should be identified by globally unique opaque IDs - the dataset is large and crowd-curated, conciseness and consistency work, names of things may change, and hackability is not a concern. In fact, the one in the title of this article I have no problem with. But I do have a problem with the Wikidata properties being obfuscated too. The entire set of Wikidata properties, including those that are the schema itself (the one that is not the RDF schema), is opaque and not human-understandable. This makes building queries and applications on Wikidata really painful and time-consuming.
Ontology models are like a programming language, informing the user how to build things using the model. Obfuscating the model is akin to obfuscating a programming language. Imagine if all the lemmas, grammars, and reserved words in Java or Python were not human-readable, but encoded as a set of opaque IDs. It is the same with ontologies: they need to be human-understandable too (at least until human developers are entirely replaced by AIs). Wikidata is essentially a giant obfuscated graph of nodes and edges. You have no idea of the semantic meaning of an edge without dereferencing it via a query or the wiki.
The reason we write code that is self-documenting is well understood. Thoughtful naming of variables and functions is critical in delivering code that is deemed to be self-documenting. The same principle should apply to ontologies.
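
For example, to discover what the edge P279 from the previous section actually means, you have to dereference it - in the wiki, or with something like this sketch in the query UI:

# What does the edge P279 actually mean? Ask Wikidata for its label and description.
SELECT ?label ?description WHERE {
  wd:P279 rdfs:label ?label ;
          schema:description ?description .
  FILTER(LANG(?label) = "en" && LANG(?description) = "en")
}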

For a few dollars more...

More bad - There is some fairly unusual ontology modelling when you dig below the surface. For example, Clint Eastwood (Q43203), an instance of human (Q5), has occupation (P106) actor (Q33999). Actor is an instance of profession - all good so far. But actor (Q33999) is also a subclass of artist, creator and person. Thus, semantically, the range of occupation includes the class “person”. Clint Eastwood is an instance of human, and has the job of being a person.
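
You can check this yourself by walking the subclass hierarchy above actor (a quick sketch):

# Everything that actor (Q33999) is transitively a subclass of (P279)
SELECT ?class ?classLabel WHERE {
  wd:Q33999 wdt:P279+ ?class .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" }
}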

More good - back to the rather good SPARQL query UI. The UI provides very nice hover and autocomplete features on Wikidata properties and resources that de-obfuscate the opaque IDs. This provides some mitigation. If it wasn’t for this, building queries would be difficult.

The SPARQL service (provided by Blazegraph, the graph database that Wikidata is served from) has some very handy plugin services enabled, not least the label & description binding service. This allows you to write a SPARQL query and bind instance labels very easily, in any available language, without having to write all the joins and language filters in your query. For example:


# get actors in the film "The Good, The Bad, and the Ugly"
SELECT ?actor ?actorLabel ?actorDescription WHERE {
  ?actor wdt:P106 wd:Q33999 .      # ?actor has occupation (P106) actor (Q33999)
  wd:Q41483 wdt:P161 ?actor .      # the film (Q41483) has cast member (P161) ?actor
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en,es,it" }
}

In the query above (try it), the label service automatically binds label and description variables for any named variable in the query. If you are querying for many variables, this is a big time saver.
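
For comparison, this is roughly what the same query looks like without the label service, writing the label join and language filter by hand:

# The same actors query, binding English labels manually
SELECT ?actor ?actorLabel WHERE {
  ?actor wdt:P106 wd:Q33999 .
  wd:Q41483 wdt:P161 ?actor .
  ?actor rdfs:label ?actorLabel .
  FILTER(LANG(?actorLabel) = "en")
}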

Roll credits...

I do have reservations about allowing the crowd to curate the underlying model as well as the knowledge itself. The model is something that needs strong governance, and a solid grasp of information architecture, to ensure quality and semantic correctness.

All up, I have enjoyed working with Wikidata. If you can invest the time in getting to grips with the schema and model, it is a superb, extensive source of machine-readable information, the best of the current breed, with comprehensive links to open linked data sets (the primary reason we used it as a knowledge source). If only the model were self-documenting, and it used the RDF schema for the purpose it was designed for.
