Older blog entries for mhausenblas (starting at number 63)

MapReduce for and with the kids

Last week was Halloween and of course we went trick-or-treating with our three kids which resulted in piles of sweets in the living room. Powered by the sugar, the kids would stay up late to count their harvest and while I was observing them at it, I was wondering if it possible to explain the MapReduce paradigm to them, or even better: doing MapReduce with them.

Now, it turns out that Halloween and counting kinds of sweets are a perfect setup. Have a look at the following:

MapReduce for counting kinds of sweet after Halloween harvest.

So, the goal was to figure how many sweets of a certain kind (like, Twix) we now have available overall, for consumption.

We started off with every child having her or his pile of sweets in front of them. Now, in the first step I’d ask the kids to shout how many of the sweet X they have in their own pile. So one kid would go like I’ve got 4 fizzers, etc. … and then we’d gather all the same sweets and their respective counts together. Second, we’d add up the individual counts for each kind of sweet which would give us the desired result: number of X in total.

Lesson learned: MapReduce is a child’s play. Making kids sharing sweets is certainly not – believe me, I speak out of experience ;)


Filed under: Big Data, Cloud Computing, Experiment, NoSQL

Syndicated 2012-11-05 11:22:21 from Web of Data

Denormalizing graph-shaped data

As nicely pointed out by Ilya Katsov:

Denormalization can be defined as the copying of the same data into multiple documents or tables in order to simplify/optimize query processing or to fit the user’s data into a particular data model.

So, I was wondering, why is – in Ilya’s write-up – denormalization not considered to be applicable for GraphDBs?

I suppose the main reason is that the relationships (or links as we use to call them in the Linked Data world) are typically not resolved or dereferenced, which means traversing the graph is fast, but for a number of operations such as range queries, denormalized data would be better.

Now, the question is: can we achieve this in GraphDBs, incl. RDF stores? I would hope so. Here are some design ideas:

  • Up-front: when inserting new data items (nodes), immediately dereference the links (embedded links).
  • Query-time: apply database cracking.

Here is the question for you, dear reader: are you aware of people doing this? My google skills have failed me so far – happy to learn about it in greater detail!


Filed under: Big Data, Cloud Computing, Idea, Linked Data, NoSQL

Syndicated 2012-09-20 10:26:23 from Web of Data

Interactive analysis of large-scale datasets

The value of large-scale datasets – stemming from IoT sensors, end-user and business transactions, social networks, search engine logs, etc. – apparently lies in the patterns buried deep inside them. Being able to identify these patterns, analyzing them is vital. Be it for detecting fraud, determining a new customer segment or predicting a trend. As we’re moving from the billions to trillions of records (or: from the terabyte to peta- and exabyte scale) the more ‘traditional’ methods, including MapReduce seem to have reached the end of their capabilities. The question is: what now?

But a second issue has to be addressed as well: in contrast to what current large-scale data processing solutions provide for in batch-mode (arbitrarily but in line with the state-of-the-art defined as any query that takes longer than 10 sec to execute) the need for interactive analysis increases. Complementary, visual analytics may or may not be helpful but come with their own set of challenges.

Recently, a proposal for a new Apache Incubator group called Drill has been made. This group aims at building a:

… distributed system for interactive analysis of large-scale datasets [...] It is a design goal to scale to 10,000 servers or more and to be able to process petabyes of data and trillions of records in seconds.

Drill’s design is supposed to be informed by Google’s Dremel and wants to efficiently process nested data (think: Protocol Buffers). You can learn more about requirements and design considerations from Tomer Shiran’s slide set.

In order to better understand where Drill fits in in the overall picture, have a look at the following (admittedly naïve) plot that tries to place it in relation to well-known and deployed data processing systems:

BTW, if you want to test-drive Dremel, you can do this already today; it’s an IaaS service offered in Google’s cloud computing suite, called BigQuery.


Filed under: Big Data, Cloud Computing, FYI, NoSQL

Syndicated 2012-09-02 19:23:07 from Web of Data

Schema.org + WebIntents = Awesomeness

Imagine you search for a camera, say a Canon EOS 60D, and in addition to the usual search results you’re as well offered a choice of actions you can perform on it, for example share the result with a friend, write a review for the item or, why not directly buy it?

Enhancing SERP with actions

Sounds far fetched? Not at all. In fact, all the necessary components are available and deployed. With Schema.org we have a way to describe the things we publish on our Web pages, such as books or cameras and with WebIntents we have a technology at hand that allows us to interact with these things in a flexible way.

Here are some starting points in case you want to dive into WebIntents a bit:

PS: I started to develop a proof of concept for mapping Schema.org terms to WebIntents and will report on the progress, here. Stay tuned!


Filed under: Experiment, Idea, Linked Data, W3C

Syndicated 2012-07-08 18:21:19 from Web of Data

Turning tabular data into entities

Two widely used data formats on the Web are CSV and JSON. In order to enable fine-grained access in an hypermedia-oriented fashion I’ve started to work on Tride, a mapping language that takes one or more CSV files as inputs and produces a set of (connected) JSON documents.

In the 2 min demo video I use two CSV files (people.csv and group.csv) as well as a mapping file (group-map.json) to produce a set of interconnected JSON documents.

So, the following mapping file:

{
 "input" : [
  { "name" : "people", "src" : "people.csv" },
  { "name" : "group", "src" : "group.csv" }
 ],
 "map" : {
  "people" : {
   "base" : "http://localhost:8000/people/",
   "output" : "../out/people/",
   "with" : { 
    "fname" : "people.first-name", 
    "lname" : "people.last-name",
    "member" : "link:people.group-id to:group.ID"
   }
 },
 "group" : {
  "base" : "http://localhost:8000/group/",
  "output" : "../out/group/",
   "with" : {
    "title" : "group.title",
    "homepage" : "group.homepage",
    "members" : "where:people.group-id=group.ID link:group.ID to:people.ID"
   }
  }
 }
}

… produces JSON documents representing groups. One concrete example output is shown below:


Filed under: demo, Experiment, FYI

Syndicated 2012-05-10 15:03:40 from Web of Data

Linked Data – the best of two worlds

On the one hand you have structured data sources such as relational DB, NoSQL datastores or OODBs and the like that allow you to query and manipulate data in a structured way. This typically involves schemata (either upfront with RDB or sort of dynamically with NoSQL that defines the data layout and the types of the fields), a notion of object identity (for example a unique row ID in RDB or a document ID in a document store) and with it to be able to refer to data items in different containers (e.g. a foreign key in a RDB) as well as the possibility to create and use indices to speed up look-up/query.

On the other hand you have the Web, a globally distributed hypermedia system, mainly for consumption by humans. There the main primitives are: an enormous collection of hyperlinked documents over the Internet with millions of servers and billions of clients (desktop, mobile devices, etc.), in its core based on simple standards: URL, HTTP, HTML.

Now, the idea with Linked Data is a simple one: take the best of both worlds and combine it, yielding large-scale structured data (incl. schema and object identity to allow straightforward manipulation) based on established Web standards (in order to benefit from the deployed infrastructure).

Sounds easy? In fact it is. The devil is in the detail. As with any piece of technology, once you start implementing it, questions arise. For example, must Linked Data be solely based on RDF or are other wire formats such as JSON, Microdata or Atom ‘allowed’? Should we use distributed vocabulary management (as mandated by the Semantic Web) or is it OK to use Schema.org? Depending on whom you ask you currently may get different answers but in this case I lean towards diversity – at the end of the day what matters are URIs (object identity), HTTP (data access) and some way to represent the data in a structured format.


Filed under: Big Data, FYI, Linked Data, NoSQL

Syndicated 2012-04-02 06:06:04 from Web of Data

Why I luv JSON …

… because it’s simple, agnostic and an end-to-end solution.

Wat?

OK, let’s slow down a bit and go through the above keywords step by step.

Simple

Over 150 frameworks, libraries and tools directly support JSON in over 30 (!) languages. This might well be because the entire specification (incl. ToC, all the legal stuff and contact information) is only 10 pages long, printed. To implement support for JSON in any given language, that is, parsing/mapping to native objects/types, is very very cheap and straight forward.

Agnostic

Just as HTTP is agnostic to the payload – you can transfer HTML over HTTP but also any other kind of representation incl. binary stuff – with JSON you have something really agnostic at hand. Want to encode a Key-Value list, JSON can do it. Need to represent any given tree in JSON – no problem. A graph serialised in JSON? Of course possible! I suppose this flexibility makes JSON attractive for a lot of different people, having a multitude of use cases in mind.

End-to-end

What I mean with this is that JSON is available and used throughout, from front-end to back-end:

  • Front-end examples: jQuery, Dojo, etc.
  • Back-end examples: MongoDB, CouchDB, elasticsearch, Node.js, etc.

OK, I reckon it is time to say: ‘Thank you, Doug!’ in case you haven’t done it today, yet ;)


Filed under: Big Data, Cloud Computing, FYI, NoSQL

Syndicated 2012-03-24 09:43:47 from Web of Data

Hosted NoSQL

I admit I dunno how I got here in the first place … ah, right, yesterday was Paddy’s day and I was sitting at home with a sick child. Now, I tinkered around a bit with a hosted CouchDB solution to store/query JSON output from a side-project of mine.

Then I thought: where are we re hosted NoSQL in general. Seems others had that questions as well. So I sat down and here is a (naturally incomplete) list of so called NoSQL datastores that are available ‘in the cloud’. Most of them with an established freemium model, few of them in (public) beta. In terms of type (K/V, wide-column, doc, graph) we find quite everything, incl. proprietary types – like Amazon and Google have – where it’s sorta hard to tell what kind of beasts they are. Not that it matters, but for completeness ;)

OK, nuff time wasted, here we go:

Amazon’s hosted NoSQL datastores

Both SimpleDB and DynamoDB are sorta key-value stores where the latter seems to be for more serious business (scale out). They explain the difference between SimpleDB and DynamoDB in detail. Pricing is in place, looks sensible. I have not tried any of these yet.

Google’s hosted NoSQL datastores

Tightly integrated with Google App Engine (GAE) comes the datastore with its own query language. If you’re on GAE, this is what you get and what you have to use, anyways. And then, since a bit more than a year there is BigQuery with which I’ve been toying around now for a year or so. Very performant and powerful but not the most obvious and clear pricing strategy.

Joyent’s Riak

Joyent offers a so called Riak Smartmachine. I have been toying around with Riak a while ago but haven’t found time to test Joyent’s Riak offering (though I’m pleased with their Node.js offering, hence assuming similar level of service, documentation, etc.).

Cassandra in the cloud

I only found one hosted Cassandra offering. Can that be? Didn’t look closer. Anyone?

CouchDB

So, both cloudno.de and Cloudant offer hosted CouchDB instances (the former also offers Redis). I am currently using the free plan (‘Oxygen’) with Cloudant and find it very straight-forward and easy to use. Prizing looks OK in both cases though I sometimes find it hard to pick the ‘best fit’ for a given workload. Could anyone write an app please that does this for me? :)

MongoDB

Also for MongoDB I was able to spot two offerings: MongoHQ seems somewhat to be the established player in the field, nice docs and sensible princing. Apparently, Joyent is also offering a MongoDB Smartmachine – anyone tried it?

Graph datastores

There are quite some offerings in this area: the general-purpose sort of graph data stores and the RDF-focused ones. In the former category there is Neo4j’s Heroku add-on which I had the pleasure to test drive and found it very useable and useful. And then there is an OrientDB-based offering called Nuvolabase; I have signed up and tried it out some weeks ago and I must say I really like it. Disclaimer: I know the main person behind OrientDB as we’ve done a joint (research) project some years ago.

Last but not least: RDF-focused graph datastores in the cloud. I guess my absolute favourite still is Dydra which I’ve been using manually (SPARQL endpoint, curl, etc.) and in programmatically, inapplications. I think they are still in beta and pricing is not yet announced. And then there is the good old Talis Platform, the established cloud-RDF-store for a couple of years now. Any plans known?


Filed under: Big Data, Cloud Computing, NoSQL

Syndicated 2012-03-18 07:26:50 from Web of Data

Large-Scale Linked Data Processing: Cloud Computing to the Rescue?

At the upcoming 2nd International Conference on Cloud Computing and Services Science (CLOSER 2012) we – Robert Grossman, Andreas Harth, Philippe Cudré-Mauroux and myself – will present a paper with the title Large-Scale Linked Data Processing: Cloud Computing to the Rescue? and the following abstract:

Processing large volumes of Linked Data requires sophisticated methods and tools. In the recent years we have mainly focused on systems based on relational databases and bespoke systems for Linked Data processing. Cloud computing offerings such as SimpleDB or BigQuery, and cloud-enabled NoSQL systems including Cassandra or CouchDB as well as frameworks such as Hadoop offer appealing alternatives along with great promises concerning performance, scalability and elasticity. In this paper we state a number of Linked Data-specific requirements and review existing cloud computing offerings as well as NoSQL systems that may be used in a cloud computing setup, in terms of their applicability and usefulness for processing datasets on a large-scale.

A pre-print is available now and if you have any suggestions please let me know.


Filed under: Big Data, Cloud Computing, FYI, Linked Data, NoSQL

Syndicated 2012-03-01 13:14:22 from Web of Data

Synchronising dataspaces at scale

So, I have a question for you – how would you approach the following (engineering) problem? Imagine you have two dataspaces, a source dataspace, such as Eurostat with some 5000+ datasets that can take up to several GB in the worst case, and a target dataspace (for example, something like what we’re deploying in the LATC, currently). You want to ensure that the data in the target dataspace is as fresh as possible, that is, providing a minimal temporal delay between the contents of source and target dataspaces.

Don’t get me wrong, this has exactly nothing to do with Linked Data, RDF or the like. This is simply the question of how often one should ‘sample’ the source in order to make sure that the target is ‘always’ up-to-date.

Now, would you say that Shannon’s theorem is of any help? Or, you look at the given source update frequency and decide based on this how often you hammer the server?

Step back.

It turns out that one should also take into account what happens in the target dataspace. In our case this is mainly the conversion of the XML or TSV into some RDF serialisation. This is, in cases where the source dataset has, say, some 11GB, a non-trivial issue to address. In addition, we see some ~1000 datasets changing in a couple of days time. Which would leave us, in the worst case, with a situation where we would still be in the conversion process of parts of the dataspace while already updated versions of the datasets would be pending.

On the other hand we know, based on our experience with the Eurostat data, that we can rebuild the entire dataspace – that is, downloading all 5000+ files incl. metadata, converting it to RDF and loading the metadata into the SPARQL endpoint – in some 11+ days. Wouldn’t it make sense to simply only look at the update every 10-or-so days?

We discussed this today and settled for a weekly (weekend) update policy. Let’s see where this takes us and I promise that I keep you posted …


Filed under: FYI, Linked Data

Syndicated 2012-02-13 23:11:08 from Web of Data

54 older entries...

New Advogato Features

New HTML Parser: The long-awaited libxml2 based HTML parser code is live. It needs further work but already handles most markup better than the original parser.

Keep up with the latest Advogato features by reading the Advogato status blog.

If you're a C programmer with some spare time, take a look at the mod_virgule project page and help us with one of the tasks on the ToDo list!