Older blog entries for mhausenblas (starting at number 52)

Open Data – a virtual natural resource

A virtual natural resource? Doesn’t make sense, does it?

Let me explain.

Natural resources are derived from the environment. Many of them are essential for our survival while others are used for satisfying our wants.

… is what Wikipedia says about natural resources.

Now, some 150 years ago a handful of people saw the potential of petroleum, which is nowadays the basis of a multi-billion-dollar industry. Roughly the same holds for electricity. It’s not that the crude resource itself is of much interest – FWIW, it might even be dangerous to handle: ever dipped your fingers into crude oil? Ever touched a power outlet with bare hands?

However, as I already said a while ago, the applications on top of the natural resources are valuable, and our modern society couldn’t do without them.

Back to Wikipedia. Homo Digitalis lives in a digital environment, producing data almost as a side-product of daily activities and depending on it, because it is what drives the applications.

In this sense, yes, Open Data is the virtual natural resource #1 and here to stay.

When will you realise the potential of (Linked) Open Data?


Filed under: Big Data, Cloud Computing, Linked Data

Syndicated 2012-01-30 09:13:54 from Web of Data

… you end up with a graph

Quite often I hear people come up with rather strange explanations for why we use graphs or, to be more specific in the Web case, RDF. Some think the reason is to make the developer’s life harder – right, because a key-value structure is so much easier to understand. Then there are those who claim we use graphs because the W3C says so (RDF is a W3C standard). Others say graphs are used because they are the most generic and powerful data structure, and you can represent any simpler data structure, such as a tree (think XML), with them.

The real reason why graphs are in use is much simpler: when you have data sources and you start connecting data items across them, you end up with a graph. You can like it or not, but it’s inevitable.

Now, graphs have a number of desirable properties: there are ‘no JOIN costs’, they are flexible (you can add or remove stuff anywhere) and, for some graph formats, including RDF and Topic Maps, there exist standardised query languages (SPARQL, TMQL). The latter is something many other sorts of NoSQL datastores lack, and we see people re-inventing the wheel over and over again, as there is a natural desire to state a query declaratively and leave the details of the execution to the machine.
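To make the ‘no JOIN costs’ point concrete, here is a tiny JavaScript sketch – all URIs, property names and values are made up – of two data sources whose items link to each other. Answering a cross-source question is just a matter of following edges, no JOIN required:

// two 'data sources', keyed by (made-up) URIs
var people = {
  "http://example.org/people/alice": {
    name: "Alice",
    worksFor: "http://example.org/orgs/acme"      // link into the orgs source
  }
};
var orgs = {
  "http://example.org/orgs/acme": {
    name: "Acme Ltd.",
    basedIn: "http://example.org/places/galway"   // link to yet another source
  }
};

// 'Where is Alice's employer based?' – just follow the edges:
var alice = people["http://example.org/people/alice"];
var employer = orgs[alice.worksFor];
console.log(employer.basedIn);                    // -> http://example.org/places/galway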

The community around GraphDBs is growing, so the next time you have a Big Data problem at hand, take a moment to contemplate whether your data might be graph-shaped, and consider joining us in the cool kids club.


Filed under: Big Data, Cloud Computing, FYI, Linked Data, NoSQL

Syndicated 2012-01-29 21:23:46 from Web of Data

Libraries – an important and vibrant Linked Data application domain

In late 2009 I was contacted by Tom Baker, Emmanuelle Bermes and Antoine Isaac to help found the Library Linked Data Incubator Group (XG) at W3C, and although I personally didn’t actively contribute (more hurler-on-the-ditch-style commenting, really), I am really, really happy with the outcome. To be fair, DERI was very active after all and provided great input – thanks to Jodi Schneider.

After its regular year, the XG was extended by three months to complete its multitude of deliverables. A very ambitious and, as it looks, successful endeavor – bravo to the chairs and group members for this achievement. I gather that a transition to a W3C Community Group is being discussed, something I applaud and support.

Why do I mention this now? Well, first of all, (digital) libraries are an important and vibrant Linked Data application domain. I predict that after the attention the eGovernment domain has witnessed, something similar could happen soon in the DL domain as well … hm, on second thought, it is actually already happening ;)

In this context I stumbled upon two interesting new publications I’d like to share with you:

It’s an exciting time to work in the Linked Data area: activities in many application domains and at the systems level (NoSQL, SEO, etc.) are popping up all over the place.

What do you think? Is the time ripe for a large-scale deployment of Linked Data in the DL domain? What other good resources do you know of?


Filed under: FYI, Linked Data, W3C

JSON, data and the REST

Tomorrow, 8.8., is International JSON Day. Why? Because I say so!

Is there a better way to say ‘thank you’ to a person who gave us so much – yeah, I’m talking about Doug Crockford – and to acknowledge how handy, useful and cool the piece of technology is that he ‘discovered’?

From its humble beginnings some 10 years ago, JSON is now the lightweight data lingua franca. Of the nine Web APIs I had a look at recently for the REST: From Research to Practice book, seven offered their data in JSON. These days it is possible to access and process JSON data from virtually any programming language – check out the list at json.org if you doubt that. I guess the rise of JSON and its continuing success story is at least partially due to its inherent simplicity: all you get are key/value pairs and lists, and in 80% or more of the use cases that is likely all you need. Heck, even I prefer to consume JSON in my Web applications over any sort of XML-based data source or any given RDF serialization.

But the story doesn’t end here. People and organisations nowadays take JSON as a given and either try to make it ‘better’ or leverage it for certain purposes. Let’s have a look at three examples …

JSON Schema

I reckon one of the first and most obvious things people were discussing once JSON reached a certain level of popularity was how to validate JSON data. And what do we do as good engineers? We invent a schema language, of course! So, there you go: json-schema.org tries to establish a schema language for JSON. The IETF Internet-Draft by Kris Zyp states:

JSON Schema provides a contract for what JSON data is required for a given application and how to interact with it. JSON Schema is intended to define validation, documentation, hyperlink navigation, and interaction control of JSON data.

One rather interesting bit, besides the obvious validation use case, is the support for ‘hyperlink navigation’. We’ll come back to this later.
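To give a flavour, a minimal schema in the spirit of that draft might look roughly like this (the property names and the link relation are made up for illustration):

{
  "description": "A company record",
  "type": "object",
  "properties": {
    "id":       { "type": "string", "required": true },
    "name":     { "type": "string", "required": true },
    "homepage": { "type": "string", "format": "uri" }
  },
  "links": [
    { "rel": "self", "href": "/companies/{id}" }
  ]
}

The links section is exactly where the ‘hyperlink navigation’ bit comes in.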

Atom-done-right: OData

I really like the Atom format as well as the Atom Publishing Protocol (APP). A classic, in REST terms. I just wonder, why on earth is it based on XML?

Enter OData. Microsoft, in a very clever move, adopted Atom and APP and made them the core of OData; but they didn’t stop there – Microsoft uses JSON as one of the two official formats for OData. They got this one dead right.

OData is an interesting beast, because here we find an attempt to address one of the (perceived) shortcomings of JSON – it is not very ‘webby’. I hear you saying: ‘Huh? What’s that and why does it matter?’ … well, it matters to some of us RESTafarians who respect and apply HATEOAS. In short: as JSON uses a rather restricted ‘data type’ system, there is no explicit support for URIs and (typed) links. Of course you can use JSON to represent and transport a URI (or many, FWIW), but the way you choose to represent, say, a hyperlink might look different from the way I or someone else does it, meaning that there is no interoperability. I guess, as long as HATEOAS is a niche concept not grokked by many people, this might not be such a pressing issue; however, there are cases where it is vital to be able to deal with URIs and (typed) links unambiguously. More in the next example …
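To illustrate the interoperability problem, both of the following made-up snippets carry ‘the same’ hyperlink, yet nothing in JSON itself tells a generic client that either of them is a link at all:

{ "homepage": "http://example.org/" }

{ "links": [ { "rel": "homepage", "href": "http://example.org/" } ] }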

Can I squeeze a graph into JSON? Sir, yes, Sir!

Some time ago, Manu Sporny and others started an activity called JSON-LD (JavaScript Object Notation for Linking Data) that has gained some momentum over the past year or so; at the time of writing, support for some popular languages, incl. C++, JavaScript, Ruby and Python, is available. JSON-LD is designed to be able to express RDF, microformats as well as Microdata. With the recent introduction of Schema.org, this means JSON-LD is something you might want to keep on your radar …
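For a taste, a minimal JSON-LD-style document could look like the following sketch (written with the @context/@id keywords; the exact keywords have varied between drafts, and the FOAF mapping is just an example):

{
  "@context": {
    "name": "http://xmlns.com/foaf/0.1/name",
    "homepage": { "@id": "http://xmlns.com/foaf/0.1/homepage", "@type": "@id" }
  },
  "@id": "http://example.org/people/michael",
  "name": "Michael",
  "homepage": "http://example.org/"
}

The @context maps the plain keys to URIs, which is what turns otherwise ordinary-looking JSON into Linked Data – and, incidentally, gives us the unambiguous (typed) links discussed above.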

On a related note: initially, the W3C planned to standardize how to serialize RDF in JSON. Once the respective Working Group was in place, this was dropped. I think they made a wise decision. Don’t get me wrong, I’d also have loved to get an interoperable way to deal with RDF in JSON out there, and there are certainly enough ways one could do it, but I guess we’re simply not there yet. And JSON-LD? Dunno, to be honest – I mean, I like and support it and do use it; very handy, indeed. Will it be the solution for HATEOAS and Linked Data? Time will tell.

Wrapping up: JSON is an awesome piece of technology, largely due to its simplicity and universality – and, we should not forget, due to a man who rightly identified its true potential and never stopped telling the world about it.

Tomorrow, 8.8., is International JSON Day. Join in, spread the word, and say thank you to Doug as well!


Filed under: Announcement, Big Data, Cloud Computing, IETF, Linked Data, NoSQL, W3C

Syndicated 2011-08-07 07:51:12 from Web of Data

Towards Networked Data

This is the second post in the solving-tomorrow’s-problems-with-yesterday’s-tools series.

In his seminal article If You Have Too Much Data, then “Good Enough” Is Good Enough, Pat Helland calls for a ‘new theory for data’ – I’d like to call this networked data (meaning: consuming and manipulating distributed data at Web scale).

In this post, now, I’m going to elaborate on the first of his points in the context of Linked Data:

We need a new theory and taxonomy of data that must include:

  • Identity and versions. Unlocked data comes with identity and optional versions.

If you take a 10,000-foot view of the Linked Data principles, they read essentially as follows (the stuff in bold is what I added here):

  1. Use URIs as names for things – entity identity
  2. Use HTTP URIs so that people can look up those names – entity access
  3. When someone looks up a URI, provide useful information, using the standards – entity structure
  4. Include links to other URIs, so that they can discover more things – entity integration

One word of caution before we dive in: Linked Data, as we speak, is pretty well defined for the read-only case (the write-enabled case is still subject to research and standardisation).

If you compare the Linked Data principles above with what Pat demands from the ‘new theory for data’, I think it is fair to state that both the entity identity part and the entity access part are well covered. The versioning part might be a bit tricky, but doable – for example with Named Graphs, quads, etc.

Concerning the entity structure, it occurs to me that there are two schools of thought: on the one hand, ‘purists’ who demand that only RDF serialisations be allowed for representing an entity’s structure; on the other, a more liberal interpretation that includes technologies such as OData and, only recently (triggered by the introduction of Schema.org), also Microdata. Time will tell the uptake and success of the mentioned technologies, but when in doubt I prefer to be inclusive rather than exclusive on this question.

The entity integration part is not explicitly mentioned by Pat – I wonder why? ;)


Filed under: FYI, Linked Data, NoSQL

Syndicated 2011-06-08 08:03:36 from Web of Data

Ye shall not DELETE data!

This is the first post in the solving-tomorrow’s-problems-with-yesterday’s-tools series.

Alex Popescu recently reviewed a post by Mikayel Vardanyan on Picking the Right NoSQL Database Tool and was puzzled by the following statement of Mikayel’s:

[Relational database systems] allow versioning or activities like: Create, Read, Update and Delete. For databases, updates should never be allowed, because they destroy information. Rather, when data changes, the database should just add another record and note duly the previous value for that record.

I don’t find it puzzling at all. As Pat Helland rightly says:

In large-scale systems, you don’t update data, you add new data or create a new version.

OK, I guess arguing this on an abstract level serves nobody. Let’s get our hands dirty and have a look at a concrete example. I pick an example from the Linked Data world, but there is nothing really specific to it – it just happens to be the data language I speak and dream in ;)

Look at the following piece of data:
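For illustration, a minimal sketch in JSON – the URI, the property name and the address value are all made up:

{
  "@id": "http://example.org/people/michael",
  "address": "Galway, Ireland"
}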

… and now let’s capture the fact that my address has changed …
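Again as a made-up sketch – a plain UPDATE simply overwrites the old value in place:

{
  "@id": "http://example.org/people/michael",
  "address": "somewhere new"
}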

This looks normal at first sight, but there are two drawbacks attached to it:

  1. If I ask the question ‘Where has Michael been living previously?’, I can’t get an answer anymore once the update has been performed, unless I have kept a local copy of the old piece of data.
  2. Whenever I ask the question: ‘Where does Michael live?’ I need to implicitly add ‘at the moment’, as the information is not scoped.

There are a few ways one can deal with this, though. As a consequence, here is what I demand:

  • Never ever DELETE data – it’s slow and lossy; also updating data is not good, as UPDATE is essentially DELETE + INSERT and hence lossy as well.
  • Each piece of data must be versioned – in the Linked Data world one could, for example, use quads rather than triples to capture the context of the assertion expressed in the data; see the sketch below.
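To make the second demand concrete, here is a minimal JavaScript sketch (prefixes, graph names and values are made up): instead of overwriting the address, every assertion is kept as a quad whose fourth element names the graph – and thereby the context or version – it belongs to:

// (subject, predicate, object, context) – the context URI scopes each assertion
var quads = [
  ["ex:michael", "ex:address", "Galway, Ireland", "ex:graphs/2009-06"],
  ["ex:michael", "ex:address", "somewhere new",   "ex:graphs/2011-05"]
];

// 'Where does Michael live at the moment?' – take the quad with the latest context;
// 'Where has Michael been living previously?' – the older quad is still right there.

Nothing ever gets deleted; answering a question simply means picking the appropriate context.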

Oh, BTW, my dear colleagues from the SPARQL Working Group – having said this, I think SPARQL Update is heading in the wrong direction. Can we still change this, pretty please?

PS: disk space is cheap these days, as nicely pointed out by Dorian Taylor ;)


Filed under: Big Data, Cloud Computing, Linked Data, NoSQL, Proposal, W3C

Syndicated 2011-05-29 09:07:39 from Web of Data

Solving tomorrow’s problems with yesterday’s tools

Q: What is the difference between efficiency and effectiveness?
A: 42.

Why? Well, as we all know, 42 is the answer to the ultimate question of life, the universe, and everything. But did you know that in 2012 it will be 42 years since Codd introduced ‘A Relational Model of Data for Large Shared Data Banks‘?

OK, now for a more serious attempt to answer the above question:

Efficiency is doing things right, effectiveness is doing the right thing.

This gem of wisdom was originally coined by the marvelous Peter Drucker (in his book The Effective Executive – read it, it’s worth every page) and nicely explains, IMO, what is going on: relational database systems are efficient. They are well suited to a certain type of problem: dealing with clearly defined data in a rather static way. Are they effectively helping us deal with big, messy data? I doubt it.

How come?

Pat Helland’s recent ACM Queue article If You Have Too Much Data, then “Good Enough” Is Good Enough offers us some very digestible and enlightening insights into why SQL struggles with big data:

We can no longer pretend to live in a clean world. SQL and its Data Definition Language (DDL) assume a crisp and clear definition of the data, but that is a subset of the business examples we see in the world around us. It’s OK if we have lossy answers—that’s frequently what business needs.

… and also …

All data on the Internet is from the “past.” By the time you see it, the truthful state of any changing values may be different. [...] In loosely coupled systems, each system has a “now” inside and a “past” arriving in messages.

… and on he goes …

I observed that data that is locked (and inside a database) is seminally different from data that is unlocked. Unlocked data comes in clumps that have identity and versioning. When data is contained inside a database, it may be normalized and subjected to DDL schema transformations. When data is unlocked, it must be immutable (or have immutable versions).

These were just some quotes from Pat’s awesome paper. I really encourage you to read it yourself and maybe discover even more insights.

Coming back to the initial question: I think NoSQL is effective for big, messy data. It has yet to prove that it is efficient in terms of usability, optimization, etc. – due to the large number of competing solutions, the respective communities in NoSQLand are smaller and more fragmented, but I guess it will undergo a consolidation process in the next couple of years.

Summing up: let’s not try to solve tomorrow’s problems with yesterday’s tools.


Filed under: Big Data, Cloud Computing, FYI, NoSQL

Syndicated 2011-05-29 06:07:56 from Web of Data

Why we link …

The incentives to put structured data on the Web seem to be slowly seeping in, but why does it make sense to link your data to other data? Why invest time and resources to offer 5-star data? Even though interlinking itself is becoming more of a commodity these days – for example, the 24/7 platform we’re deploying in LATC is an interlinking cloud offering – the motivation for dataset publishers to set links to other datasets is, in my experience, not obvious.

I think it’s important to have a closer look at the motivation for interlinking data on the Web from a data integration perspective. Traditionally, you would download data from, say, Infochimps, or find it via CKAN or the many other places that either directly offer data or provide a data catalog. Then you would put it in your favorite (NoSQL) database and use it in your application. Simple, isn’t it?

Let’s say you’re using a dataset about companies such as the Central Contractor Registration (CCR). These companies typically have a physical address (or: location) attached:
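For illustration, such a record might look roughly like this (field names and values are made up):

{
  "company": "Acme Ltd.",
  "address": "Main Street 1, Galway, Ireland"
}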

Now, imagine I ask you to render the location of a selection of companies on a map. This requires you to look up the geographical coordinates of a company in a service such as Geonames:
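Again as a made-up sketch, the looked-up coordinates then get merged into your local copy, something like:

{
  "company": "Acme Ltd.",
  "address": "Main Street 1, Galway, Ireland",
  "geo": { "lat": 53.27, "long": -9.05 }
}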

I bet you can automate this, right? Maybe a bit of manual work involved, but not too much, I guess. So, all is fine, right?

Not really.

The next developer who comes along and wants to use the company data and map it nicely has to go through the exact same process: figure out which geo service to use, write some look-up/glue code, import the data, and so on.

Wouldn’t it make more sense, from a re-usability point of view, if the original dataset provider (CCR in our example) had a look at its data, identified what entities (such as companies) are in there, and provided the links to other datasets (such as location data) up front? This is, in a nutshell, what Tim says concerning the 5th star of Open Data deployment:

Link your data to other people’s data to provide context.

To sum up: if you have data, think about providing this context – link it to other data in the Web and you make your data more useful and more usable and, in the long run, more used.

PS: the working title of this blog post was ‘As we may link’, to pay homage to Vannevar Bush, but then I thought that might be a bit too cheesy ;)


Filed under: FYI, Linked Data

Syndicated 2011-05-22 20:37:30 from Web of Data

Can NoSQL help us in processing Linked Data?

This is an announcement and a call for feedback. Over the past couple of days I’ve compiled a short review article in which I look into NoSQL solutions and the extent to which they can be used to process Linked Data.

I’d like to extend and refine this article, but this only works if you share your experiences and let me know what I’m missing and where I’m maybe totally wrong.

If you just want to read it, use the following link: NoSQL solutions for Linked Data processing (read-only Web page).

If you want to provide feedback or rectify stuff I wrote, use: NoSQL solutions for Linked Data processing (Google Docs with discussion enabled).

Thanks, and enjoy reading as well as commenting on the article!


Filed under: Announcement, Linked Data

Syndicated 2011-05-02 20:30:55 from Web of Data

From CSV data on the Web to CSV data in the Web

In our daily work with Government data such as statistics, geographical data, etc., we often deal with Comma-Separated Values (CSV) files. Now, they are really handy, as they are easy to produce and to consume: almost any language and platform I have come across so far has some support for parsing CSV files, and I can export CSV files from virtually any sort of (serious) application.

There is even a – probably not widely known – standard for CSV files (RFC 4180) that specifies the grammar and registers the normative MIME media type text/csv for CSV files.

So far, so good.

From a Web perspective, CSV files really are data objects, which, however, are rather coarse-grained. If I want to use a CSV file, I always have to use the entire file; there is no agreed-upon concept that allows me to refer to a certain cell, row or column. This was my main motivation to start working on what I called Addrable (from Addressable Table) earlier this year. I essentially hacked together a rather simple implementation of Addrables in JavaScript that understands URI fragment identifiers such as:

  • #col:temperature
  • #row:10
  • #where:city=Galway,reporter=Richard

Let’s have a closer look at what the result of processing such a fragment identifier against an example CSV file could be. I’m going to use the last one in the list above, that is, addressing a slice where the city column has the value ‘Galway’ and the reporter column has the value ‘Richard’.
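For the sake of the example, assume the CSV file looks something like this (made up, but consistent with the JSON output shown further below):

city,reporter,date,temperature
Galway,Richard,2011-03-01,4
Galway,Richard,2011-03-02,10
Galway,Richard,2011-03-03,5
Berlin,Sarah,2011-03-01,2

The #where:city=Galway,reporter=Richard fragment then selects the first three data rows, leaving the date and temperature columns, as in the JSON output below.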

The client-side implementation in jQuery provides a visual rendering of the selected part – see the screenshot below (if you want to toy around with it, either clone or download it and open it locally in your browser):

There is also a server-side implementation using node.js available (deployed at addrable.no.de), outputting JSON:

{
  "header":
    ["date","temperature"],
  "rows":
    [
      ["2011-03-01", "2011-03-02", "2011-03-03"],
      ["4","10","5"]
    ]
}

Note: the processing of the fragment identifier is meant to be performed by the User Agent after the retrieval action has been completed. However, the server-side implementation demonstrates a workaround for the fact that the fragment identifier is not sent to the Server (see also the related W3C document on Repurposing the Hash Sign for the New Web).

Fast forward a couple of weeks.

Now, having an implementation is fine, but why not push the envelope and take it a step further, in order to help make the Web a better place?

Enter Erik Wilde, who did ‘URI Fragment Identifiers for the text/plain Media Type’ aka RFC 5147 some three years ago; and yes, I admit I was already a bit biased through my previous contributions to the Media Fragments work. We decided to join forces to work on ‘text/csv Fragment Identifiers’, based on the Addrable idea.

As a first step (well, besides the actual writing of the Internet-Draft to be submitted to the IETF), I had a quick look at what we can expect in terms of deployment – that is, a rather quick and naive survey based on some 60 CSV files manually harvested from the Web. The following figure gives you a rough idea of what is going on:

To sum up the preliminary findings: almost half of the CSV files are (wrongly) served as text/plain, followed by some other non-conforming and partially exotic media types such as text/x-comma-separated-values. The bottom line is: only 10% of the CSV files are served correctly as text/csv. Why do we care, you ask? Well, for example, because the spec says that the header row is optional, but its presence can be flagged by an optional parameter on the media type. Just wondering what the chances are ;)
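Concretely, RFC 4180 defines this as a parameter carried in the Content-Type header, along these lines:

Content-Type: text/csv; header=present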

Now, I admit that my sample here is rather small, but I think the distribution will roughly stay the same. By the way, is anyone aware of a good way to find CSV files, besides filetype:csv in Google or contains:csv in Bing, as I did?

We’d be glad to hear from you – do you think this is useful for your application? If yes, why? How would you use it? Or, maybe you want to do a proper CSV crawl to help us with the analysis?


Filed under: Announcement, FYI, Idea, IETF

Syndicated 2011-04-16 12:43:35 from Web of Data

