Older blog entries for mhausenblas (starting at number 58)

Linked Data – the best of two worlds

On the one hand you have structured data sources such as relational databases, NoSQL datastores, OODBs and the like that allow you to query and manipulate data in a structured way. This typically involves schemata (defined upfront in an RDBMS or more dynamically in NoSQL systems) that describe the data layout and the types of the fields, a notion of object identity (for example a unique row ID in an RDBMS or a document ID in a document store) that lets you refer to data items in other containers (e.g. via a foreign key), and the possibility to create and use indices to speed up look-ups and queries.

On the other hand you have the Web, a globally distributed hypermedia system, mainly built for consumption by humans. Its main primitive is an enormous collection of hyperlinked documents, served over the Internet by millions of servers to billions of clients (desktops, mobile devices, etc.), and at its core it rests on simple standards: URL, HTTP, HTML.

Now, the idea with Linked Data is a simple one: take the best of both worlds and combine it, yielding large-scale structured data (incl. schema and object identity to allow straightforward manipulation) based on established Web standards (in order to benefit from the deployed infrastructure).

Sounds easy? In fact it is – the devil is in the detail. As with any piece of technology, once you start implementing it, questions arise. For example, must Linked Data be based solely on RDF, or are other wire formats such as JSON, Microdata or Atom ‘allowed’? Should we use distributed vocabulary management (as mandated by the Semantic Web) or is it OK to use Schema.org? Depending on whom you ask you may currently get different answers, but in this case I lean towards diversity – at the end of the day what matters are URIs (object identity), HTTP (data access) and some way to represent the data in a structured format.
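
To make those three ingredients concrete, here is a minimal sketch in Python (against a hypothetical example.org URI): the URI names the data item, HTTP fetches it, and a structured representation comes back.

```python
import json
import urllib.request

# A hypothetical URI that identifies a data item (object identity).
resource_uri = "http://example.org/dataset/item/42"

# Data access over plain HTTP; content negotiation asks for a structured
# representation (JSON here, but it could just as well be an RDF format).
req = urllib.request.Request(resource_uri, headers={"Accept": "application/json"})
with urllib.request.urlopen(req) as resp:
    item = json.load(resp)

print(item)  # structured data, addressable and linkable via its URI
```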


Filed under: Big Data, FYI, Linked Data, NoSQL

Syndicated 2012-04-02 06:06:04 from Web of Data

Why I luv JSON …

… because it’s simple, agnostic and an end-to-end solution.

Wat?

OK, let’s slow down a bit and go through the above keywords step by step.

Simple

Over 150 frameworks, libraries and tools directly support JSON, across over 30 (!) languages. This might well be because the entire specification (incl. ToC, all the legal stuff and contact information) is only 10 pages long when printed. Implementing support for JSON in any given language – that is, parsing and mapping to native objects and types – is very, very cheap and straightforward.
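
Just to illustrate how cheap that is: in Python (as in most languages nowadays) parsing JSON onto native types is a one-liner in the standard library.

```python
import json

doc = '{"name": "Linked Data", "tags": ["web", "data"], "rating": 5}'

# One call maps the JSON text onto native types: dict, list, str, int.
obj = json.loads(doc)
print(obj["name"], obj["tags"][1], obj["rating"] + 1)

# And back again.
print(json.dumps(obj, indent=2))
```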

Agnostic

Just as HTTP is agnostic to the payload – you can transfer HTML over HTTP but also any other kind of representation, incl. binary stuff – with JSON you have something really agnostic at hand. Want to encode a key-value list? JSON can do it. Need to represent a tree? No problem. A graph serialised in JSON? Of course possible! I suppose this flexibility is what makes JSON attractive to a lot of different people with a multitude of use cases in mind.
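
Here is a quick sketch of all three shapes; the adjacency-list convention for the graph is my own, since JSON itself doesn't prescribe one.

```python
import json

# A key-value list ...
kv = {"host": "example.org", "port": 8080}

# ... a tree (nested objects and arrays) ...
tree = {"root": {"left": {"value": 1}, "right": {"value": 2}}}

# ... and a graph, here as an adjacency list over node IDs.
graph = {
    "nodes": ["a", "b", "c"],
    "edges": [["a", "b"], ["b", "c"], ["c", "a"]],
}

for shape in (kv, tree, graph):
    print(json.dumps(shape))
```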

End-to-end

What I mean by this is that JSON is available and used throughout, from front-end to back-end:

  • Front-end examples: jQuery, Dojo, etc.
  • Back-end examples: MongoDB, CouchDB, elasticsearch, Node.js, etc.

OK, I reckon it is time to say ‘Thank you, Doug!’, in case you haven’t done so today yet ;)


Filed under: Big Data, Cloud Computing, FYI, NoSQL

Syndicated 2012-03-24 09:43:47 from Web of Data

Hosted NoSQL

I admit I dunno how I got here in the first place … ah, right, yesterday was Paddy’s day and I was sitting at home with a sick child. Now, I tinkered around a bit with a hosted CouchDB solution to store/query JSON output from a side-project of mine.

Then I thought: where are we regarding hosted NoSQL in general? Seems others had that question as well. So I sat down, and here is a (naturally incomplete) list of so-called NoSQL datastores that are available ‘in the cloud’. Most of them come with an established freemium model, a few of them are in (public) beta. In terms of type (K/V, wide-column, document, graph) we find pretty much everything, incl. proprietary types – like Amazon and Google have – where it’s sorta hard to tell what kind of beasts they are. Not that it matters, but for completeness ;)

OK, nuff time wasted, here we go:

Amazon’s hosted NoSQL datastores

Both SimpleDB and DynamoDB are sorta key-value stores, where the latter seems to be aimed at more serious business (scale-out). Amazon explains the difference between SimpleDB and DynamoDB in detail. Pricing is in place and looks sensible. I have not tried either of them yet.
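
For what it's worth, talking to DynamoDB is the usual put/get-item affair. A minimal sketch using the boto3 SDK (which is newer than this post), assuming a hypothetical ‘datasets’ table with an ‘id’ hash key and AWS credentials configured in the environment:

```python
import boto3  # AWS SDK for Python; used here purely for illustration

# Assumes a DynamoDB table named 'datasets' with hash key 'id' already exists.
dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("datasets")

table.put_item(Item={"id": "eurostat-001", "format": "TSV", "size_gb": 11})
item = table.get_item(Key={"id": "eurostat-001"})["Item"]
print(item)
```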

Google’s hosted NoSQL datastores

Tightly integrated with Google App Engine (GAE) comes the datastore, with its own query language. If you're on GAE this is what you get and what you have to use anyway. And then, for a bit more than a year now, there has been BigQuery, which I've been toying around with for a year or so. Very performant and powerful, but not the most obvious and clear pricing strategy.
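
For illustration, here is roughly how the GAE (Python) datastore and its GQL query language look, assuming a hypothetical ‘Dataset’ model; this sketch only runs inside an App Engine application:

```python
# Assumption: this code runs inside a GAE Python application.
from google.appengine.ext import db

class Dataset(db.Model):
    title = db.StringProperty()
    size_gb = db.IntegerProperty()

# Store an entity in the datastore.
Dataset(title="eurostat", size_gb=11).put()

# GQL looks like SQL but queries the App Engine datastore.
results = db.GqlQuery("SELECT * FROM Dataset WHERE size_gb > :1", 5)
for d in results:
    print(d.title)
```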

Joyent’s Riak

Joyent offers a so-called Riak Smartmachine. I toyed around with Riak a while ago but haven't found time to test Joyent's Riak offering (though I'm pleased with their Node.js offering, hence I assume a similar level of service, documentation, etc.).

Cassandra in the cloud

I only found one hosted Cassandra offering. Can that be? Didn’t look closer. Anyone?

CouchDB

So, both cloudno.de and Cloudant offer hosted CouchDB instances (the former also offers Redis). I am currently using the free plan (‘Oxygen’) with Cloudant and find it very straightforward and easy to use. Pricing looks OK in both cases, though I sometimes find it hard to pick the ‘best fit’ for a given workload. Could anyone please write an app that does this for me? :)
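
Part of what makes hosted CouchDB so pleasant is that the whole API is just HTTP plus JSON. A sketch with the Python requests library against a hypothetical Cloudant account (names and credentials made up):

```python
import requests  # third-party HTTP library

# Hypothetical Cloudant/CouchDB endpoint and credentials.
base = "https://ACCOUNT.cloudant.com"
auth = ("ACCOUNT", "PASSWORD")

# Create a database, then store a JSON document under a chosen ID.
requests.put(f"{base}/sideproject", auth=auth)
requests.put(f"{base}/sideproject/run-2012-03-17",
             json={"source": "side-project", "items": 42},
             auth=auth)

# Read it back.
doc = requests.get(f"{base}/sideproject/run-2012-03-17", auth=auth).json()
print(doc["items"])
```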

MongoDB

Also for MongoDB I was able to spot two offerings: MongoHQ seems to be the established player in the field, with nice docs and sensible pricing. Apparently Joyent is also offering a MongoDB Smartmachine – anyone tried it?

Graph datastores

There are quite a few offerings in this area: the general-purpose sort of graph datastores and the RDF-focused ones. In the former category there is Neo4j's Heroku add-on, which I had the pleasure to test drive and found very usable and useful. And then there is an OrientDB-based offering called Nuvolabase; I signed up and tried it out some weeks ago and I must say I really like it. Disclaimer: I know the main person behind OrientDB, as we did a joint (research) project some years ago.

Last but not least: RDF-focused graph datastores in the cloud. I guess my absolute favourite is still Dydra, which I've been using both manually (SPARQL endpoint, curl, etc.) and programmatically, in applications. I think they are still in beta and pricing has not been announced yet. And then there is the good old Talis Platform, the established cloud RDF store for a couple of years now. Any plans known?
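
Using such a hosted store programmatically boils down to one HTTP request per query, thanks to the SPARQL protocol. A sketch in Python against a hypothetical Dydra-style endpoint URL:

```python
import requests

# Hypothetical SPARQL endpoint of a hosted repository.
endpoint = "https://dydra.com/ACCOUNT/REPO/sparql"
query = "SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 10"

# The SPARQL protocol is plain HTTP: pass the query as a parameter and
# ask for JSON results via the Accept header.
resp = requests.get(endpoint,
                    params={"query": query},
                    headers={"Accept": "application/sparql-results+json"})
for row in resp.json()["results"]["bindings"]:
    print(row["s"]["value"], row["p"]["value"], row["o"]["value"])
```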


Filed under: Big Data, Cloud Computing, NoSQL

Syndicated 2012-03-18 07:26:50 from Web of Data

Large-Scale Linked Data Processing: Cloud Computing to the Rescue?

At the upcoming 2nd International Conference on Cloud Computing and Services Science (CLOSER 2012) we – Robert Grossman, Andreas Harth, Philippe Cudré-Mauroux and myself – will present a paper with the title Large-Scale Linked Data Processing: Cloud Computing to the Rescue? and the following abstract:

Processing large volumes of Linked Data requires sophisticated methods and tools. In the recent years we have mainly focused on systems based on relational databases and bespoke systems for Linked Data processing. Cloud computing offerings such as SimpleDB or BigQuery, and cloud-enabled NoSQL systems including Cassandra or CouchDB as well as frameworks such as Hadoop offer appealing alternatives along with great promises concerning performance, scalability and elasticity. In this paper we state a number of Linked Data-specific requirements and review existing cloud computing offerings as well as NoSQL systems that may be used in a cloud computing setup, in terms of their applicability and usefulness for processing datasets on a large-scale.

A pre-print is available now and if you have any suggestions please let me know.


Filed under: Big Data, Cloud Computing, FYI, Linked Data, NoSQL

Syndicated 2012-03-01 13:14:22 from Web of Data

Synchronising dataspaces at scale

So, I have a question for you – how would you approach the following (engineering) problem? Imagine you have two dataspaces, a source dataspace, such as Eurostat with some 5000+ datasets that can take up to several GB in the worst case, and a target dataspace (for example, something like what we’re deploying in the LATC, currently). You want to ensure that the data in the target dataspace is as fresh as possible, that is, providing a minimal temporal delay between the contents of source and target dataspaces.

Don’t get me wrong, this has exactly nothing to do with Linked Data, RDF or the like. This is simply the question of how often one should ‘sample’ the source in order to make sure that the target is ‘always’ up-to-date.

Now, would you say that Shannon's sampling theorem is of any help? Or do you look at the source's update frequency and decide, based on that, how often you hammer the server?

Step back.

It turns out that one should also take into account what happens in the target dataspace. In our case this is mainly the conversion of the XML or TSV into some RDF serialisation – a non-trivial issue when the source dataset weighs in at, say, some 11 GB. In addition, we see some ~1000 datasets changing within a couple of days' time, which would leave us, in the worst case, with a situation where we are still converting parts of the dataspace while updated versions of the datasets are already pending.

On the other hand we know, based on our experience with the Eurostat data, that we can rebuild the entire dataspace – that is, download all 5000+ files incl. metadata, convert them to RDF and load the metadata into the SPARQL endpoint – in some 11+ days. Wouldn't it make sense to simply look at the updates only every 10 or so days?
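
A back-of-the-envelope calculation with the numbers above makes the point; the change window is my rough reading of ‘a couple of days’:

```python
# Rough figures from the post, purely illustrative.
total_datasets     = 5000   # size of the Eurostat dataspace
full_rebuild_days  = 11     # observed time for a complete rebuild
changed_datasets   = 1000   # datasets changing ...
change_window_days = 3      # ... within roughly this many days

conversion_rate = total_datasets / full_rebuild_days      # ~455 datasets/day
change_rate     = changed_datasets / change_window_days   # ~333 datasets/day

print(f"convert ~{conversion_rate:.0f}/day vs ~{change_rate:.0f}/day changing")
# The margin is thin: polling far more often than the conversion throughput
# allows only piles up pending work, which argues for a coarse schedule.
```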

We discussed this today and settled on a weekly (weekend) update policy. Let's see where this takes us – I promise to keep you posted …


Filed under: FYI, Linked Data

Syndicated 2012-02-13 23:11:08 from Web of Data

JSON, HTTP and data links

In late 2011, Mark Nottingham, whom I very much admire on a personal and professional level, posted ‘Linking in JSON’, which triggered quite some discussion (see the comments there).

Back then I already sensed that the community at large was ready for the next aspect of the Web: a scalable, machine-targeted way to realise a global dataspace. No, I'm not talking about the Semantic Web. I'm talking about something real. And it's happening as we speak.

Take JSON and HTTP (some use REST for marketing purposes) and add the capability of following (typed) links that lead you to more data (context, definitions, related stuff, whatever).
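
In code, ‘following typed links’ is little more than: look at the link object, check its relation type, dereference its href. A sketch against a hypothetical API, using one made-up links convention (each of the three proposals below defines its own):

```python
import requests

# Hypothetical resource that carries typed links alongside its payload, e.g.
# {"name": "Alice",
#  "links": [{"rel": "employer", "href": "http://api.example.org/orgs/acme"}]}
doc = requests.get("http://api.example.org/people/alice",
                   headers={"Accept": "application/json"}).json()

# Follow your nose: pick the typed link and dereference it for more data.
for link in doc.get("links", []):
    if link["rel"] == "employer":
        employer = requests.get(link["href"]).json()
        print(employer)
```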

And here are the three current contenders in this space (in order of stage appearance) – Microsoft's OData JSON Format, The Object Network: Linking up our APIs, and – as I learned from Charl van Niekerk on the #whatwg IRC channel tonite – A Convention for HTTP Access to JSON Resources.

What they all have in common is that they define ways to read, create, update and delete data objects, in the Web, based on JSON, using HTTP.

OData/JSON

OData: JavaScript Object Notation (JSON) Format

Totally-objective-and-unbiased-verdict: around for some years now, great community, backed by big bucks, heavy-weight (they squeezed friggin APP into it), rather RESTful and becoming more and more a shadow Semantic Web.

The Object Network

The Object Network: Linking up our APIs

Totally-objective-and-unbiased-verdict: too early to tell, really. Seems like a one-man show; nice idea in theory, and time will tell regarding uptake. Many things, incl. the link semantics, seem half-baked and unclear. Good motivation and marketing, but too few ‘apps’ or demos to be of much interest yet.

A Convention for HTTP Access to JSON Resources

Internet Draft - A Convention for HTTP Access to JSON Resources

Totally-objective-and-unbiased-verdict: just learned about it, but it seems to be influenced by CouchDB developments and experiences, which means it can't be that bad, can it? :) Yeah, I guess I'll have a closer look at this one.

Now, which one is your favourite? Did I forget any? Before you shout out JSON-LD or the like now … hold your breath – my #1 requirement is that it does the Full Monty: I want to be able to CRUD, to follow my nose through the data, and all of this over HTTP. Anyone?


Filed under: FYI, IETF, Linked Data, NoSQL

Syndicated 2012-02-05 21:21:08 from Web of Data

Open Data – a virtual natural resource

A virtual natural resource? Doesn’t make sense, does it?

Let me explain.

Natural resources are derived from the environment. Many of them are essential for our survival while others are used for satisfying our wants.

… is what Wikipedia says about natural resources.

Now, some 150 years ago a handful of people saw the potential of petroleum, which nowadays is the basis for a multi-billion dollar industry. Roughly the same holds for electricity. It's not that the crude resource itself is of much interest – it might even be dangerous to handle: ever dipped your fingers into crude oil? Ever touched a power outlet with bare hands?

However, as I already said a while ago, it is the applications on top of the natural resources that are valuable, and our modern society couldn't do without them.

Back to Wikipedia. Homo Digitalis lives in a digital environment, producing data almost as a side-product of daily activities and depending on it, since data is what drives the applications.

In this sense, yes, Open Data is the virtual natural resource #1 and here to stay.

When will you realise the potential of (Linked) Open Data?


Filed under: Big Data, Cloud Computing, Linked Data

Syndicated 2012-01-30 09:13:54 from Web of Data

… you end up with a graph

Quite often I hear people coming up with rather strange explanations of why we use graphs or, to be more specific in the Web case, RDF. Some think the reason is to make the developer's life harder. Right. It's so much easier to understand a key-value structure. Then there are the ones who claim that we use graphs because the W3C says so (RDF is a W3C standard). Some others say graphs are used because they are the most generic, powerful data structure and you can represent any other, simpler data structure, such as a tree (think XML), with them.

The real reason why graphs are in use is much simpler: when you have data sources and you start connecting data items across them, you end up with a graph. You can like it or not, but it’s inevitable.
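
A toy sketch of that inevitability: three containers that look perfectly flat on their own, until you make the cross-references between them explicit – at which point you are traversing a graph:

```python
# Three toy 'containers', each flat on its own, referencing each other by ID.
papers  = {"p1": {"title": "Linked Data processing", "author": "a7"}}
authors = {"a7": {"name": "Jane Doe", "affiliation": "org3"}}
orgs    = {"org3": {"name": "DERI"}}

# Resolving the cross-references turns the records into edges of a graph.
edges = [("p1", "author", "a7"), ("a7", "affiliation", "org3")]

def neighbours(node):
    """Everything directly reachable from a node."""
    return [(prop, target) for (src, prop, target) in edges if src == node]

print(neighbours("p1"))  # [('author', 'a7')]
print(neighbours("a7"))  # [('affiliation', 'org3')]
```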

Now, graphs have a number of desirable properties: there are ‘no JOIN costs‘, they are flexible (you can add/remove stuff anywhere) and, for some of the graph formats, including RDF and TopicMaps, there exist standardised query languages (SPARQL, TMQL). The latter is something that many other sorts of NoSQL datastores are lacking, and we see people re-inventing the wheel over and over again, as there is a natural desire to declaratively state a query and leave the details of the execution to a machine.

The community around graph databases is growing, and next time you have a Big Data problem at hand, take a moment to consider whether your data is of graph shape – and consider joining us in the cool kids club.


Filed under: Big Data, Cloud Computing, FYI, Linked Data, NoSQL

Syndicated 2012-01-29 21:23:46 from Web of Data

Libraries – an important and vibrant Linked Data application domain

In late 2009 I was contacted by Tom Baker, Emmanuelle Bermes and Antoine Isaac to help launch the Library Linked Data Incubator Group (XG) at W3C, and although I personally didn't actively contribute (more hurler-on-the-ditch style commenting, really) I am really, really happy with the outcome. To be fair, DERI was very active after all and provided great input – thanks to Jodi Schneider.

After the regular year, the XG was extended for three months to complete the multitude of deliverables. A very ambitious and, as it looks, successful endeavor – bravo to the chairs and group members for this achievement. I gather that a transition to a W3C Community Group is being discussed, something I applaud and support.

Why do I mention this now? Well, first of all, (digital) libraries are an important and vibrant Linked Data application domain. I predict that, after the attention the eGovernment domain has received, something similar could happen soon in the DL domain as well … hm, on second thought, it is actually already happening ;)

In this context I stumbled upon two interesting new publications I’d like to share with you:

It's an exciting time to work in the Linked Data area – activities in many application domains and at the systems level (NoSQL, SEO, etc.) are popping up everywhere.

What do you think? Is the time ripe for a large-scale deployment of Linked Data in the DL domain? What other good resources do you know of?


Filed under: FYI, Linked Data, W3C

JSON, data and the REST

Tomorrow, on 8.8., is International JSON Day. Why? Because I say so!

Is there a better way to say ‘thank you’ to a person who gave us so much – yeah, I'm talking about Doug Crockford – and to acknowledge how handy, useful and cool the piece of technology is that this person ‘discovered‘?

From its humble beginnings some 10 years ago, JSON has become the lightweight data lingua franca. Of the nine Web APIs I had a look at recently for the REST: From Research to Practice book, seven offered their data in JSON. These days it is possible to access and process JSON data from virtually any programming language – check out the list at json.org if you doubt that. I guess the rise of JSON and its continuing success story is at least partially due to its inherent simplicity – all you get are key/value pairs and lists, and in 80% or more of the use cases that is likely all you need. Heck, even I prefer to consume JSON in my Web applications over any sort of XML-based data source or any given RDF serialization.

But the story doesn’t end here. People and organisations nowadays in fact take JSON as a given basis and either try to make it ‘better’ or to leverage it for certain purposes. Let’s have a look at three of these examples …

JSON Schema

I reckon one of the first and most obvious things people were discussing, once JSON reached a certain level of popularity, was how to validate JSON data. And what do we do as good engineers? We invent a schema language, for sure! So, there you go: json-schema.org tries to establish a schema language for JSON. The IETF Internet Draft by Kris Zyp states:

JSON Schema provides a contract for what JSON data is required for a given application and how to interact with it. JSON Schema is intended to define validation, documentation, hyperlink navigation, and interaction control of JSON data.

One rather interesting bit, beside the obvious validation use case, is the support for ‘hyperlink navigation’. We’ll come back to this later.
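
To give a flavour of the validation use case, here is a minimal sketch using the third-party Python jsonschema package; note it uses a newer schema draft than the one described in Zyp's Internet Draft:

```python
from jsonschema import ValidationError, validate

schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "rating": {"type": "integer", "minimum": 0},
    },
    "required": ["name"],
}

validate({"name": "CouchDB", "rating": 5}, schema)  # passes silently

try:
    validate({"rating": "five"}, schema)            # wrong type, missing name
except ValidationError as e:
    print("invalid:", e.message)
```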

Atom-done-right: OData

I really like the Atom format as well as the Atom Publishing Protocol (APP). A classic, in REST terms. I just wonder, why on earth is it based on XML?

Enter OData. Microsoft, in a very clever move, adopted Atom and APP and made them the core of OData; but they didn't stop there – Microsoft is using JSON as one of the two official formats for OData. They got this one dead right.

OData is an interesting beast, because here we find one attempt to address one of the (perceived) shortcomings of JSON: it is not very ‘webby’. I hear you saying: ‘Huh? What's that and why does it matter?’ … well, it matters to some of us RESTafarians who respect and apply HATEOAS. Put short: as JSON uses a rather restricted ‘data type’ system, there is no explicit support for URIs and (typed) links. Of course you can use JSON to represent and transport a URI (or many, FWIW), but the way you choose to represent, say, a hyperlink might look different from the way I or someone else does, meaning that there is no interoperability. I guess, as long as HATEOAS is a niche concept not grokked by many people, this might not be such a pressing issue; however, there are cases where it is vital to be able to unambiguously deal with URIs and (typed) links. More in the next example …
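
To make the interoperability point concrete, here are two perfectly reasonable but mutually incompatible ways (both invented for illustration) to put the very same link into JSON:

```python
# The same hyperlink, in two equally plausible ad-hoc JSON shapes.
mine = {
    "name": "Alice",
    "employer": "http://example.org/orgs/acme",  # bare URI in an ordinary field
}
yours = {
    "name": "Alice",
    "links": [{"rel": "employer", "href": "http://example.org/orgs/acme"}],
}

# A client written against one convention silently misses the other's link.
def links_of(doc):
    return [link["href"] for link in doc.get("links", [])]

print(links_of(yours))  # ['http://example.org/orgs/acme']
print(links_of(mine))   # [] -- the link is there, just not where we looked
```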

Can I squeeze a graph into JSON? Sir, yes, Sir!

Some time ago Manu Sporny and others started an activity called JSON-LD (JavaScript Object Notation for Linking Data) that has gained some momentum over the past year or so; as of the time of writing, support for some popular languages, incl. C++, JavaScript, Ruby and Python, is available. JSON-LD is designed to be able to express RDF, microformats as well as Microdata. With the recent introduction of Schema.org, this means JSON-LD is something you might want to keep on your radar …

On a related note: initially, the W3C planned to standardize how to serialize RDF in JSON. Once the respective Working Group was in place, this was dropped. I think they made a wise decision. Don't get me wrong, I'd also have loved to end up with an interoperable way to deal with RDF in JSON, and there are certainly enough ways one could do it, but I guess we're simply not there yet. And JSON-LD? Dunno, to be honest – I mean, I like and support it and do use it; very handy indeed. Will it be the solution for HATEOAS and Linked Data? Time will tell.

Wrapping up: JSON is an awesome piece of technology, largely due to its simplicity and universality and, we should not forget, due to a man who rightly identified its true potential and never stopped telling the world about it.

Tomorrow, on 8.8., is International JSON Day. Join in, spread the word and say thank you to Doug as well!


Filed under: Announcement, Big Data, Cloud Computing, IETF, Linked Data, NoSQL, W3C

Syndicated 2011-08-07 07:51:12 from Web of Data
