mhausenblas is currently certified at Journeyer level.

Name: Michael Hausenblas
Member since: 2008-12-14 18:20:45
Last Login: 2012-11-02 06:06:56



Recent blog entries by mhausenblas


Cloud Cipher Capabilities

… or, the lack of it.

A recent discussion at a customer made me take a closer look at support for encryption in the context of XaaS cloud service offerings as well as Hadoop. In general, this can be broken down into over-the-wire (cf. SSL/TLS) and back-end encryption. While the former is widely used, the latter is rarely found.

Different reasons might exist why one wants to encrypt her data, ranging from preserving a competitive advantage to end-user privacy issues. No matter why someone wants to encrypt the data, the question is whether systems support this (transparently) or whether developers are forced to code it in the application logic.

IaaS-level. Especially in this category, file storage for app development, one would expect wide support for built-in encryption.

On the PaaS level things look pretty much the same: for example, AWS Elastic Beanstalk provides no support for encryption of the data (unless you consider S3), and concerning Google’s App Engine, good practices for data encryption seem to be only just emerging.

Offerings on the SaaS level provide an equally poor picture:

  • Dropbox offers encryption via S3.
  • Google Drive and Microsoft SkyDrive do not seem to offer any encryption options for storage.
  • Apple’s iCloud is a notable exception: not only does it provide support but also nicely explains it.
  • For many if not most of the above SaaS-level offerings there are plug-ins that enable encryption, such as those provided by Syncdocs or CloudFlogger.

In Hadoop-land things also look rather sobering: there are few activities around making HDFS or the like do encryption, such as ecryptfs or Gazzang’s offering. Last but not least: for Hadoop in the cloud, encryption is available via AWS’s EMR by using S3.

Filed under: Big Data, Cloud Computing, FYI

Syndicated 2013-03-24 16:44:55 from Web of Data

Elephant filet

At the end of January I participated in a panel discussion on Big Data, held during the CISCO live event in London. One of my fellow panelists, I believe it was Sean of CISCO, said something along the lines of:

… ideally the cluster is at 99% utilisation, concerning CPU, I/O, and network …

This stuck in my head and I gave it some thought. In the following I will elaborate a bit on this in the context where Hadoop is used in a shared setup, for example in hosted offerings or, say, within an enterprise that runs different systems such as Storm, Lucene/Solr, and Hadoop on one cluster.

In essence, we witness two competing forces: from the perspective of a single user who expects performance vs. the view of the cluster owner or operator who wants to optimise throughput and maximise utilisation. If you’re not familiar with these terms you might want to read up on Cary Millsap’s Thinking Clearly About Performance (part 1 | part 2).
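A back-of-the-envelope sketch of this tension, assuming the simplest single-server queueing model (M/M/1). The model choice is an assumption made here for illustration; neither the panel nor a real shared cluster is this simple. Under that assumption, the mean response time a user sees is the service time divided by the idle fraction of the resource, so it grows sharply as utilisation approaches 100%.

```python
def response_time(service_time, utilisation):
    """Mean response time in an M/M/1 queue (illustrative assumption).

    service_time: time a request needs on an otherwise idle resource
    utilisation:  fraction of time the resource is busy, in [0, 1)
    """
    if not 0 <= utilisation < 1:
        raise ValueError("utilisation must be in [0, 1)")
    # Standard M/M/1 result: R = S / (1 - rho)
    return service_time / (1.0 - utilisation)


if __name__ == "__main__":
    # How many times longer than the bare service time does a user wait?
    for rho in (0.5, 0.8, 0.9, 0.99):
        factor = response_time(1.0, rho)
        print(f"utilisation {rho:.0%}: response time = {factor:.1f} x service time")
```

In this toy model a resource that is 99% busy, which is what the operator would like, makes every request take roughly a hundred times its bare service time, which is not what the single user would like.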

Now, in such a shared setup we may experience a spectrum of loads: from compute intensive over I/O intensive to communication intensive, illustrated in the following, not overly scientific figure:

Here are some observations and thoughts for potential starting points of deeper research or experiments.

Multitenancy. We see more and more deployments that require strong support for multitenancy; check out the CapacityScheduler, learn from best practices or use a distribution that natively supports the specification of topologies. Additionally, you might still want to keep an eye on Serengeti – VMware’s Hadoop virtualisation project – that seems to have gone quiet in the past months, but I still have hope for it.

Software Defined Networks (SDN). See Wikipedia’s definition for it, it’s not too bad. CISCO, for example, is very active in this area, and the February 2013 issue of IEEE Communications Magazine was a special issue covering SDN research. I can perfectly see (and indeed this was also briefly discussed on our CISCO live panel back in January) how SDN can enable new ways to optimise throughput and performance. Imagine an SDN that is dynamically workload-aware, in the sense that it knows the difference between a node that runs a task tracker, a data node, and a Solr shard. It should then be possible to transparently improve the operational parameters so that everyone involved, both the users and the cluster owner, benefits from it.

As usual, I’m very interested in what you think about the topic and look forward to learning about resources in this space from you.

Filed under: Big Data, Cloud Computing, FYI, NoSQL

Syndicated 2013-03-10 13:37:46 from Web of Data

MapR, Europe and me

You might have already heard that MapR, the leading provider of enterprise-grade Hadoop and friends, is launching its European operations.

Guess what? I’m joining MapR Europe as of January 2013 in the role of Chief Data Engineer EMEA and will support our technical and sales teams throughout Europe. Pretty exciting times ahead!

As an aside: as I recently pointed out, I very much believe that Apache Drill and Hadoop offer great synergies. If you want to learn more about this, come and join us at the Hadoop Summit, where my Drill talk has been accepted for the Hadoop Futures session.

Filed under: Announcement, Big Data, FYI, NoSQL

Syndicated 2013-01-01 20:39:46 from Web of Data

Hosted MapReduce and Hadoop offerings

Hadoop in the cloud

Today’s question is: where are we regarding MapReduce/Hadoop in the cloud? That is, what are the offerings of Hadoop-as-a-Service or other hosted MapReduce implementations, currently?

A year ago, InfoQ ran a story Hadoop-as-a-Service from Amazon, Cloudera, Microsoft and IBM which will serve us as a baseline here. This article contains the following statement:

According to a 2011 TDWI survey, 34% of the companies use big data analytics to help them making decisions. Big data and Hadoop seem to be playing an important role in the future.

One year later, we learn from a recent MarketsAndMarkets study, Hadoop & Big Data Analytics Market – Trends, Geographical Analysis & Worldwide Market Forecasts (2012 – 2017) that …

The Hadoop market in 2012 is worth $1.5 billion and is expected to grow to about $13.9 billion by 2017, at a [Compound Annual Growth Rate] of 54.9% from 2012 to 2017.

In the past year there have also been some quite lively discussions around the topic ‘Hadoop in the cloud’.

So, here are some current offerings and announcements I’m aware of:

… and now it’s up to you dear reader – I would appreciate it if you could point me to more offerings and/or announcements you know of, concerning MapReduce and Hadoop in the cloud!

Filed under: Big Data, Cloud Computing, FYI

Syndicated 2012-11-08 09:34:47 from Web of Data

MapReduce for and with the kids

Last week was Halloween and of course we went trick-or-treating with our three kids, which resulted in piles of sweets in the living room. Powered by the sugar, the kids would stay up late to count their harvest, and while I was observing them at it, I was wondering if it is possible to explain the MapReduce paradigm to them, or even better: to do MapReduce with them.

Now, it turns out that Halloween and counting kinds of sweets are a perfect setup. Have a look at the following:

MapReduce for counting kinds of sweet after Halloween harvest.

So, the goal was to figure out how many sweets of a certain kind (like, Twix) we now have available overall, for consumption.

We started off with every child having her or his pile of sweets in front of them. Now, in the first step I’d ask the kids to shout how many of the sweet X they have in their own pile. So one kid would go like I’ve got 4 fizzers, etc. … and then we’d gather all the same sweets and their respective counts together. Second, we’d add up the individual counts for each kind of sweet which would give us the desired result: number of X in total.
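For the grown-ups, the two steps above can be sketched in a few lines of Python. This is a toy, in-memory illustration and not Hadoop code; the kids' names, the piles and the function names are all made up for the example.

```python
from collections import defaultdict

# Hypothetical piles of sweets, one per kid (illustrative data only).
piles = {
    "kid_1": ["Twix", "Fizzers", "Twix", "Fizzers", "Fizzers", "Fizzers"],
    "kid_2": ["Mars", "Twix", "Fizzers"],
    "kid_3": ["Mars", "Mars", "Twix"],
}


def map_pile(pile):
    """Map: a kid shouts (sweet, count) for each kind in their own pile."""
    counts = defaultdict(int)
    for sweet in pile:
        counts[sweet] += 1
    return list(counts.items())


def shuffle(shouts):
    """Gather all counts for the same kind of sweet together."""
    grouped = defaultdict(list)
    for sweet, count in shouts:
        grouped[sweet].append(count)
    return grouped


def reduce_counts(grouped):
    """Reduce: add up the individual counts for each kind of sweet."""
    return {sweet: sum(counts) for sweet, counts in grouped.items()}


def count_sweets(all_piles):
    """Run map, shuffle and reduce over every kid's pile."""
    shouts = [pair for pile in all_piles.values() for pair in map_pile(pile)]
    return reduce_counts(shuffle(shouts))


if __name__ == "__main__":
    for sweet, total in sorted(count_sweets(piles).items()):
        print(f"{sweet}: {total}")
```

Each kid only ever looks at their own pile in the map step, which is exactly why the counting can happen in parallel.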

Lesson learned: MapReduce is child’s play. Making kids share sweets is certainly not. Believe me, I speak from experience ;)

Filed under: Big Data, Cloud Computing, Experiment, NoSQL

Syndicated 2012-11-05 11:22:21 from Web of Data



mhausenblas certified others as follows:

  • mhausenblas certified benadida as Master
  • mhausenblas certified danbri as Journeyer
  • mhausenblas certified wikier as Journeyer
  • mhausenblas certified timbl as Master
  • mhausenblas certified connolly as Master
  • mhausenblas certified dajobe as Master

Others have certified mhausenblas as follows:

  • wikier certified mhausenblas as Journeyer
  • dajobe certified mhausenblas as Apprentice
  • LotR certified mhausenblas as Journeyer
  • sye certified mhausenblas as Master
  • ittner certified mhausenblas as Journeyer
