Advogato: Blog for Skud

Open food interoperability: entities, unique IDs, and semantic equivalence

This is a post I made on Growstuff Talk to propose some initial steps towards interoperability for open food projects. If you have comments, probably best to make them on that post.

I wanted to post about some concepts from my past open data work which have been very much in my mind when working on Growstuff, but which I’m not sure I’ve ever expressed in a way that helps everyone understand their importance.

Just for background: from 2007-2011 I worked on Freebase, a massive general-purpose open data repository which was acquired by Google in 2010 and now forms part of their “Knowledge” area. While working at Google I also worked as a liaison between Google search/knowledge and the Wikimedia Foundation, and presented at a Wikimedia data summit where we proposed the first stages of what would become Wikidata — an entity-based data store for all of Wikimedia’s other projects.

Freebase and Wikidata are part of what is broadly known as the Semantic Web, which has to do with providing data and meaning via web technologies, using common data formats etc.

The Semantic Web movement has several different branches, ranging from the extremely abstract and academic, to the quite mundane and pragmatic. Some of the more common bits of Semantic Web technology you might have come across are microformats, for instance, which let you add semantic meaning to your HTML markup, for instance for defining the meanings of links to things like licenses or for marking up recipes on food blogs and the like. There is also Semantic Mediawiki which adds some semantic features on top of a wiki, to allow you to query for information in interesting ways; Practical Plants uses SMW and its search is based on this semantic data.

At the more academic end of the Semantic Web world are things like RDF which creates a directed graph of semantic data which can be queried via a language called SPARQL, and attempts to define data standards and ontologies for a wide range of purposes. These are generally heavyweight and mostly of interest to researchers, academics, etc, though some aspects of this work are starting to seep through into consumer technology.

This is all background, however. What I wanted to talk about was the single most important thing we learned while working on Freebase, which is this:

Entities must have unique identifiers.

Here’s what I mean. Let’s say you know three people all called Mary Smith. Then someone says, “It’s Mary Smith’s birthday today.” Which one are they referring to? You don’t know. In any system based around knowledge, you need to have some kind of unique ID for each entity to avoid ambiguity. So instead you might say, “Mary Smith, whose employee number is E453425″ or “Mary Smith, whose email address is mary@example.com”, or “Mary Smith, whose primary key in our database is 789″.

When working on our proposal for phase 1 of Wikidata, one of the things we realised is that the Wikimedia community — all the languages of Wikipedia, the Wikimedia Commons, etc — lacked unique identifiers for real-world entities. For instance, Barack Obama was http://en.wikipedia.org/wiki/Barack_Obama on English Wikipedia and http://de.wikipedia.org/wiki/Barack_Obama on German Wikipedia and http://commons.wikimedia.org/wiki/Barack_Obama on Wikimedia Commons and http://en.wikinews.org/wiki/Category:Barack_Obama on Wikinews, but none of these was his definitive identifier.

Meanwhile, interwiki links — the links between English and German and French and Swahili and Korean wikipedias — were maintained by hand (or, actually, by a bot) that had to update every wikipedia whenever a page was added or changed on any of them. This was a combinatoric exercise: with 2 wikis, there are two links (A -> B and B <- A). With 5 wikis there are (4 + 3 + 2 + 1) * 2 links. With N wikis, there are N-1! * 2 links, or to put it another way, 50 wikis would mean 1.2165637e+63 links between them. This was wildly inefficient to maintain!

Wikidata’s “phase 1″ was to create an entity store for Wikimedia projects, where each concept or entity — “Barack Obama” or “semantic web” or “tomato” — would have a central identity which could be linked to. Then, each Wikimedia project could say “This page describes entity XYZ”, or conversely Wikidata could say “this entity is described on these pages”, and suddenly the work of the interwiki bot became much easier: it meant that each new wiki added would only mean one new link, not an exponentially-expanding web of links.

We are in a similar position with open food data at present. There are dozens of open source food projects and that list doesn’t even touch on the ones that are more connected to recipes/eating/nutrition. We’re talking about how to interoperate between our various projects, but the key to interoperability is entity identification. If someone wants to mash up Growstuff’s harvest data with Openrecipes recipe search or the US FDA’s nutrition data, they need to know that Growstuff’s tomato is the same as the tomato you use in spaghetti sauce or the tomato that contains some percent of your RDA of potassium.

So how do we do this? None of our projects are sufficiently established, mature, or complete to claim the right to be the central ID repository. Apart from that, many of us have different focuses — edible plants, all types of plants, all types of living things, and all types of food (including non-animal/non-plant food) are some of the scopes I can mention offhand. Even the wide-ranging species databases like the Encyclopedia of Life don’t capture such information as crop varieties (eg. roma tomato, habanero pepper) that are important to veggie gardeners like Growstuff’s members.

Here’s what I would propose as an interim measure.

All open food projects need to link their major entities (eg. “crops” in Growstuff’s case) to one or more large, open, API-accessible data stores.

Examples of these include:

Wikipedia (any language, but English has the most articles)
Wikidata
Freebase
Encyclopedia of Life

By doing this, we can match data between projects. For instance, if Growstuff’s “tomato” links to the same entity as OpenFarm’s “tomato” and OpenFoodNetwork’s “tomato” and OpenRecipes’ “tomato” then we can reasonably assume they’re all talking about the same thing.

Also, some of the above data sources provide APIs which allow us to pivot easily between data sets. For instance, Freebase’s query language allows you to ask questions like “given an entity that is identified as ‘tomato’ on English Wikipedia, what is its identify on the Encyclopedia of Life?”

To see this in action, paste the following query into Freebase’s interactive query editor:

    [{
      "a:key": [{
        "namespace": "/wikipedia/en",
        "value": "Tomato"
      }],
      "b:key": [{
        "namespace": "/biology/eol",
        "value": null
      }]    
    }]

As you’ll see, the result is “392557” or to put it another way http://eol.org/pages/392557 — the EOL page on tomatoes.

From day 1, Growstuff has been tracking Wikipedia links for all our crops, to enable this sort of query against Freebase and so easily pivot to other data sets that Freebase knows about. If other projects take similar steps, this means that we are well on our way toward interoperability.

(As an aside, this is why we’re also having this other discussion about what to do about crop varieties that don’t have their own Wikipedia page, as this messes up the 1-to-1 relationship between Wikipedia entities and Growstuff entities. This may be something we just have to deal with, however, as no external data set will exactly match ours.)

Next steps

I strongly encourage all open food projects to link their “crops” or similar entities to one or more major, open-licensed, API-accessible data source (ideally one which has its keys in Freebase).
We should all expose these links via our APIs, data dumps, or whatever other mechanisms we use to make our open data available.
Developers should be able to request data from our APIs based on these identifiers, either through query parameters or through REST API resources like eg. /crops/eol/392557.json
We should use semantic markup/links to denote this entity equivalence on our webpages, eg. if Growstuff links to a Practical Plants page on the same crop, there should be a standard way to say “we consider these pages to refer to the same entity”. I’m not sure exactly what this is, yet, but if we do this it will benefit web crawlers, search engines, and other non-API consumers of our websites.
We should look into developing a microformat for expressing crop information on a webpage, in collaboration with microformats.org. I expect, however, that it will be very hard to develop a workable ontology, since (for instance) some of our projects are interested in planting information and some aren’t, some are interested in sale and distribution and others aren’t, some are dealing with non-edible plants and others aren’t, etc. It may have to be as simple as “this is a crop and here are the names we have for it”.
It would be great to put together some kind of visualisation like the linked open data cloud to show which open food projects are providing interoperable identities and how they connect to each other.

I’d like to get buy-in from other open food data projects on at least the general idea of matching our “crop” entities (whatever we call them) against some of the big databases. Who’s in?

Syndicated 2014-09-30 02:11:13 from Infotropism

30 Sep 2014 Skud » (Master)