Recent blog entries for amits

KVM Forum 2014 Schedule

The 2014 edition of KVM Forum is less than a week away.  The schedule of the talks is available at this location.  Use this link to add the schedule to your calendar.  A few slides have already been uploaded for some of the talks.

As with last year, we’ll live-stream and record all talks; keep an eye on the wiki page for details.

One notable observation about the schedule is that it’s much more relaxed than in the last few years, with far fewer talks in parallel this time around.  There’s a lot of time for interaction, networking, and socializing.  If you’re in Dusseldorf next week, please come by and say ‘hello!’

Syndicated 2014-10-09 19:34:42 (Updated 2014-10-09 19:51:08) from Think. Debate. Innovate.

OpenStack Pune Meetup

I participated in the OpenStack Meetup at the Red Hat Pune office a few weekends ago.  I have been too caught up in the lower-level KVM/QEMU layers of the virt stack, and I know there aren’t too many people involved in those layers in Pune (or even India); I was curious to learn more about OpenStack, and to find out more about the OpenStack community in Pune.  The event was on a Saturday, which meant sacrificing a day of rest and relaxation – but I went along because curiosity got the better of me.

This was a small, informal event where we had a few talks and several hallway discussions.  Praveen has already blogged about his experiences, here are my notes about the meetup.

There were a few scheduled talks for the day; speakers nominated themselves on the meetup page and the event organizers allotted slots for them.  The proceedings started off with configuring and setting up OpenStack via DevStack.  For the sake of the audience present, I wished there had been an introductory talk before a deep-dive into DevStack.  I could spot a few newbies in the crowd, and they would have benefited from an intro.

In a few discussions with the organizers, I learnt one of their pain points for such meetups: there inevitably are newbies at each meetup, and they can’t move on to advanced topics because they have to start from scratch every time.  I suggested they have a clear focus for each meetup: state explicitly what each meetup is about, and the expertise level that’s going to be assumed.  For example, there’s nothing wrong with a newbie-focused event; but some other event could focus on the networking part of OpenStack, and assume people are familiar with configuring and deploying OpenStack, as well as with basic networking principles.  This suggestion is based on the Pune FADs we want to conduct and have in the pipeline, and it was welcomed by the organizers.

Other talks followed; and I noticed a trend: not many people understood, or even knew about, the lower layers that make up the infrastructure beneath OpenStack.  I asked the organizers if they could spare 10 mins for me to provide a peek into the lower levels, and they agreed.  Right after a short working-lunch break, I took the stage.

I spoke about Linux, KVM and QEMU; dove into details of how each of them co-operates, and how libvirt drives the interactions between the upper and lower layers.  I also spoke a little about the alternative hypervisors libvirt supports, and the advantages the default QEMU/KVM hypervisor has over the others.  I then spoke about how improvements in Linux in general (e.g. in the memory management layer) benefit the thousands of people running Linux and the thousands of people running the KVM hypervisor, and in effect benefit all the OpenStack deployments.  I also mentioned a bit about how features flow from upstream into distributions, and how all the advantages trickle down naturally, without anyone having to bother about particular parts of the infrastructure.

The short talk was well received, and judging by the questions I was asked, it was apparent that some people didn’t know the dynamics involved; the way I presented it was very helpful to them and they wanted to learn more.  I also got asked a few hypervisor comparison questions.  I had to cut the interaction short because I easily overflowed the 15 mins allotted to me, and asked people to follow up with me later, which several did.

One of the results of all those conversations was that I got volunteered to do more in-depth talks on the topic at future meetups.  The organizers lamented there’s a dearth of such talks and subject-matter experts; and many meetups generally end up being just talks from people who have read or heard about things rather than real users or implementers of the technology.  They said they would like to have more people from Red Hat talking about the work we do upstream and all the contributions we make.  I’m just glad our contributions are noticed :-)

Another related topic that came up during discussions with the organizers was hackathons, and getting people to contribute and actually do stuff.  I expect a hackathon to be proposed soon.

I had a very interesting conversation with Sajid, one of the organizers.  He mentioned Reliance Jio are setting up data centres across India, and are going to launch cloud computing services with their 4G rollout.  Their entire infrastructure is based on OpenStack.

There were other conversations as well, but I’ll perhaps talk about them in other posts.

Internally at Red Hat, we had a few discussions on how to improve our organization of such events (even though they’re community events, we should be geared up to serve the attendees better).  These mostly covered making it easier to get people in (i.e. working with security), getting the AV equipment in place, and so on.  All of this worked fine during this event; the point is to ensure that the things that do go right are also on the list of things to look at while organizing events, so we don’t slip up.

Syndicated 2014-10-05 07:09:10 from Think. Debate. Innovate.

KVM Forum 2014

The KVM Forums are a great way to learn and talk about the future of KVM virtualization. The KVM Forum has been co-located with the Linux Foundation’s LinuxCon events for the past several years, and this year too it will be held along with LinuxCon EU in Dusseldorf, Germany.

The KVM Forums are also a great documentation resource on several features, and the slides and videos from past KVM Forums are freely available online. This year’s Forum will be no different, and we’ll have all the material on the KVM wiki.

Syndicated 2014-09-29 07:39:55 from Think. Debate. Innovate.

Planet Virt

For a long time various people have been telling me there’s not much information on the low-level / plumbing details of the virt stack on Linux. Especially information related to qemu and its various settings, devices, and so on.

Documentation surely is difficult to come by, but a quick and straightforward solution is to syndicate all of the blog posts that people doing virt development write into a common stream: a planet virt. I started hosting and testing such an instance on openshift, but was quickly pointed to the existing Virt Tools Planet by Rich Jones and Dan Berrange. Dan added the list of people whose blogs I followed for virt development to that instance.

I updated the KVM and QEMU wikis to ensure the Planet gets more visibility, and hope this goes a small way to quell the complaints of not enough available information.

Syndicated 2014-09-29 07:34:10 (Updated 2014-09-29 08:09:54) from Think. Debate. Innovate.

24 Aug 2014 (updated 29 Sep 2014 at 08:13 UTC) »

Fedora Activity Day Pune Report

I participated in the Fedora Activity Day at the RH office in Pune yesterday. There was a decent turnout, 20+ people, and it was fun to test the in-progress version of the upcoming F21 release along with other folks.

Siddhesh came up with the idea of rebooting Fedora-related activities in Pune, and a few of us showed interest. We quickly agreed on what to focus on for the first such activity: testing the upcoming release. This would give us an opportunity to improve the F21 experience, and also be a low-barrier-to-entry activity for first-time contributors. We have had some FADs in the past, but the people who turn up usually tend to be familiar with Fedora or particular aspects of the OS; so focussing on using the OS, and filing bugs along the way, seemed a great way to initiate newcomers without necessarily diving deep into technical details.

FAD Photo 3

In that respect, I’d like to think the FAD was a success. We had people testing the installer, the GNOME and KDE desktops within VMs and via live images on laptops, and also a few specific items like VM snapshots and DNSSEC.

Before the FAD, I downloaded and tested two nightly images in a VM – the default workstation image with the GNOME desktop, and the KDE spin. The Aug 20 nightlies of both images worked fine, so we declared them the gold images for the FAD. Most people had already downloaded them before they came, which helped us start the FAD as soon as our laptops were booted.

We started off at about 9 AM, and I was around till a bit after 4 PM. I tested the default GNOME live image on an X200 laptop, and also inside a VM. I found a couple of suspend-to-RAM related issues in GNOME and in the kernel. Quite a few people tested the installer, and I liked how we kept conversing about the issues people were seeing, with others attempting to replicate them; once there were 2 or 3 +1s for a particular kind of issue, we knew it was a fairly reproducible bug.

FAD Photo
FAD Photo 2

Some people ran through some Fedora Test Day test cases, and some went through ON_QA bugs to provide karma on bodhi.

PJP took the lead on educating us about DNSSEC, walking us through setting it up as well as testing whether everything works fine.
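
That kind of check can be repeated from any client with dig (a sketch; the domains are just well-known examples, and it assumes a validating resolver, e.g. a local unbound, is in use):

```shell
# Ask for DNSSEC records (RRSIG) along with the answer; an 'ad'
# (authenticated data) flag in the reply means the resolver
# validated the response.
dig +dnssec +multi fedoraproject.org A

# A validating resolver should return SERVFAIL for a zone with a
# deliberately broken signature, e.g. the dnssec-failed.org test zone.
dig dnssec-failed.org A
```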

Kashyap then spoke a bit about how VM snapshots are a great tool for testing destructive things, showed a couple of different snapshot techniques, and demonstrated how to set them up and use them with libvirt. Quite a few people agreed this was really cool, and I expect them to start using snapshots in their regular $DAYJOB activities.
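
For reference, the two libvirt snapshot techniques look roughly like this with virsh (a sketch; ‘f21-test’ is a hypothetical domain name):

```shell
# Internal snapshot (qcow2 images only): guest state is stored
# inside the image itself.
virsh snapshot-create-as f21-test clean-state "before destructive testing"

# List snapshots, run the destructive test, then roll back.
virsh snapshot-list f21-test
virsh snapshot-revert f21-test clean-state

# External disk-only snapshot: further writes go to a new overlay
# file, leaving the original image untouched.
virsh snapshot-create-as f21-test overlay1 --disk-only --atomic
```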

We kept recording our progress, bugs found, etc., on an etherpad.  The #fedora-india channel helped us exchange links and was handy for general chit-chat.  (Edit: the contents of the etherpad as of the day of the FAD are archived here.)

Through the day, quite a few categories and components were tested, as noted in the etherpad.

The FAD wiki page’s status area is still being populated, but the etherpad has links to bugs found and filed.

The initial mails about hosting a FAD to the fedora-india list showed there was interest in a lot of topics to cover for FADs, so I’m sure we’ll have more such FADs organized with more topics in the future.

Since this was mostly a volunteer-driven event, everything was quite informal. Red Hat sponsored the venue and snacks, and everything else was handled by us on the fly – like lunch being ordered when people started feeling hungry, with preferences decided by show-of-hands. Since this FAD was cobbled together very quickly, we didn’t have time to engage with the Fedora contacts for a budget for swag or food; hopefully we will get that sorted out soon – especially since there is interest in the FADs and we have topics lined up to work on.

Syndicated 2014-08-24 08:50:40 (Updated 2014-09-29 07:42:12) from Think. Debate. Innovate.

31 Jan 2014 (updated 31 Jan 2014 at 10:13 UTC) »

Use of Piwik Analytics

I run Piwik on OpenShift to collect stats on visits to this blog.  I’m not really interested in knowing who visits my site.  I’m only interested in what people are visiting for, and how: which pages are viewed most?  Where are people landing on my site from?  How long after a post is published do people still visit it?  And so on.

One of the ways this is also helpful is to track 404 (page not found) errors that pop up for visitors.  After migrating my previous posts from blogger, I kept monitoring for any posts that may have been missed by the automatic migration process, and manually moved them.

These days, though, the 404 tracking turns up interesting data.  Someone recently tried to access a page on this blog which resulted in a 404 error:

/oxmax/admin/includes/javascript/ckeditor/filemanager/swfupload/upload.php/From =

A quick search on the net revealed it’s a relatively recent vulnerability discovered in some PHP-based e-commerce suite, which gives root access to the server hosting the software.  Thankfully, I don’t run any e-commerce software, and I run on OpenShift, which gives the servers quite a bit of protection.  In the worst case, some WordPress vulnerability might affect my blog, but the other software hosted on the same server as this blog will be protected (even in the case of a root exploit).

Syndicated 2014-01-31 08:12:09 (Updated 2014-01-31 09:41:28) from Think. Debate. Innovate.

Backing Up Data on Android Phones

Experimenting with the new CyanogenMod builds for Android 4.3 (cm-10.2) resulted in a disaster: my phone was set up for encryption, and the updater messed up the USB storage such that the phone wouldn’t recognise the built-in sdcard on the Nexus S anymore.  I tried several things – a factory reset, formatting via the ClockworkMod recovery, etc. – to no avail.  The recovery wouldn’t recognize the /sdcard partition either.

Good thing I had a backup, so I wasn’t worried too much.

I could use adb to navigate around when the CWM recovery was booted.  Using fdisk, I could see the /sdcard partition was intact, but it wouldn’t get recognized by either CWM or the kernel.  I deleted the partition, and created a new one with the same parameters.  I also used the opportunity to try out ext4 instead of the default FAT.  CWM still wouldn’t recognize or mount this partition, but the Android kernel does recognize it.  However, mounting the card as USB storage still doesn’t work.

So I’ve now fallen back to using adb + rsync as my backup solution: USB-tether the phone to the laptop, note the IP address the laptop got, and then, from an adb shell, just issue

rsync -av /sdcard/ user@laptop-ip:/path/to/backup/

This is working fine.  adb push/pull also work quite well, and I don’t really miss the ‘mount as usb storage’ functionality much.  I’ll still try fixing this issue, though, since encryption isn’t working either — the key would be to get the CWM recovery to identify the partition.  I’m guessing that if that works fine, the remaining bits would be fine too (mounting usb storage, encrypting it, etc.)
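
Spelled out, the round-trip looks like this (a sketch; the interface name, address, and paths are examples, and it assumes sshd and rsync on the laptop and an rsync binary available in the phone’s shell):

```shell
# On the laptop, after enabling USB tethering on the phone:
# find the address the tethered interface was assigned.
ip addr show usb0

# Get a shell on the phone, then push the sdcard contents over ssh.
adb shell
rsync -av /sdcard/ user@192.168.42.130:/path/to/backup/

# Individual files can also be pulled directly, without tethering:
adb pull /sdcard/DCIM /path/to/backup/DCIM
```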

I use GOBackup from the Play store to back up apps and data.  oandbackup from the F-Droid store looks nice, but crashes a lot.  It’s being constantly updated, though, so it has promise to become a nice backup app.

Syndicated 2014-01-15 17:36:06 (Updated 2014-01-15 17:37:10) from Think. Debate. Innovate.

Red-Whiskered Bulbul

A few weeks back, a strange bird call started waking me up.  Though red-whiskered bulbuls are supposed to be pretty common, I hadn’t heard or seen one up close.


There were two of them making rounds throughout the day, and they frequently visited one plant in my terrace-garden.  I took that as a sign that they were building a nest, and ensured they didn’t get disturbed when they visited.  I tried taking a few pictures, but couldn’t manage good ones: these birds are very shy, and they’re very alert to any movement or human presence.  There’s a better photo at Wikipedia.

Exactly one week after they started arriving, they had built a nest and they didn’t make as much noise as earlier.  That made me curious.  I checked out their nest, and was quite delighted to see an egg:

Red whiskered bulbul's nest with egg

Red whiskered bulbul's nest with egg -- camera flash on.

The next day I woke up to mayhem: lots of bird noises near the plant.  I didn’t feel like disturbing anything there.  When the commotion died down, I went to check, and saw the egg was missing.  That was a sad end to the week-long activity around the nest.  I initially suspected the pigeons, permanent residents on the terrace, to have caused the damage.  However, later in the day, a big crow came by near the same plant (crows had never come inside the terrace before).  I wonder what this all means, and where the egg vanished.

Syndicated 2013-05-12 07:41:57 (Updated 2013-05-12 08:05:15) from Think. Debate. Innovate.

25 Jan 2013 (updated 22 May 2013 at 11:16 UTC) »

Session notes from the Virtualization microconf at the 2012 LPC

The Linux Plumbers Conf wiki seems to have made the discussion notes for the 2012 conf read-only, as well as visible only to people who have logged in.  I suspect this is due to the spam problem, so I’ll put those notes here so that they’re available without needing a login.  The source is here.

These are the notes I took during the virtualization microconference at the 2012 Linux Plumbers Conference.

Virtualization Security Discussion – Paul Moore


  • threats to virt system
  • 3 things to worry about
    • attacks from host – has full access to guest
    • attacks from other guests on the host
      • break out from guest, attack host and other guests (esp. in multi-tenant situations)
    • attacks from the network
      • traditional mitigation: separate networks, physical devices, etc.
  • protecting guest against malicious hosts
  • host has full access to guest resources
  • host has ability to modify guest stuff at will; w/o guest knowing it
  • how to solve?
    • no real concrete solutions that are perfect
    • guest needs to be able to verify / attest host state
      • root of trust
    • guests need to be able to protect data when offline
      • (discussion) encrypt guests – internally as well as qcow2 encryption
  • decompose host
    • (discussion) don’t run services as root
  • protect hosts against malicious guests
  • just assume all guests are going to be malicious
  • more than just qemu isolation
  • how?
    • multi-layer security
    • restrict guest access to guest-owned resources
    • h/w passthrough – make sure devices are tied to those guests
    • limit avl. kernel interfaces
      • system calls, netlink, /proc, /sys, etc.
    • if a guest doesn’t need an access, don’t give it!
  • libvirt+svirt
    • MAC in host to provide separation, etc.
    • addresses netlink, /proc, /sys
  • (discussion) aside: how to use libvirt w/o GUI?
    • there is ‘virsh’, documentation can be improved.
  • seccomp
    • allows to selectively turn off syscalls; addresses syscalls in list above.
  • priv separation
    • libvirt handles n/w, file desc. passing, etc.
  • protecting guest against hostile networks
  • guests vulnerable directly and indirectly
  • direct: buggy apache
  • indirect: host attacked
  • qos issue on loaded systems
  • host and guest firewalls can solve a lot of problems
  • extend guest separation across network
    • network virt – for multi-tenant solutions
    • guest ipsec and vpn services on host
  • (discussion) blue pill vulnerability – how to mitigate?
    • lot of work being done by trusted computing group – TPM
    • maintain a solid root of trust
  • somebody pulling rug beneath you, happens even after boot
  • you’ll need h/w support?
    • yes, TPM
    • UEFI, secure boot
  • what about post-boot security threats?
    • let’s say booted securely. other mechanisms you can enable – IMA – extends root of trust higher. signed hashes, binaries.
    • unfortunately, details beyond scope for a 20-min talk
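
On the ‘libvirt without a GUI’ aside above: virsh covers the common lifecycle operations from the command line (a sketch; ‘myguest’ is a hypothetical domain name):

```shell
virsh list --all         # all defined guests, running or not
virsh start myguest      # boot a guest
virsh console myguest    # attach to its serial console
virsh dumpxml myguest    # inspect the full domain XML definition
virsh shutdown myguest   # request a graceful (ACPI) shutdown
```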

Storage Virtualization for KVM: Putting the pieces together – Bharata Rao


  • Different aspects of storage mgmt with kvm
    • mainly using glusterfs as storage backend
    • integrating with libvirt, vdsm
  • problems
    • multiple choices for fs and virt mgmt
      • libvirt, ovirt, etc.
  • not many fs’es are virt-ready
    • virt features like snapshots, thin-provisioning, cloning not present as part of fs
    • some things done in qemu: snapshot, block mig, img fmt handling are better handled outside
  • storage integration
    • storage device vendor doesn’t have well-defined interfaces
  • gluster has potential: leverage its capabilities, and solve many of these problems.
  • intro on glusterfs
    • userspace distributed fs
    • aggregates storage resources from multiple nodes and presents a unified fs namespace
  • glusterfs features
    • replication, striping, distribution, geo-replication/sync, online volume extension
  • (discussion) why gluster vs ceph?
    • gluster is modular; pluggable, flexible.
    • keeps storage stack clean. only keep those things active which are needed
    • gluster doesn’t have metadata.
      • unfortunately, gluster people not around to answer these questions.
  • by having backend in qemu, qemu can already leverage glusterfs features
    • (discussion) there is a rados spec in qemu already
      • yes, this is one more protocol that qemu will now support
  • glusterfs is modular: details
    • translators: convert requests from users into requests for storage
    • open/read/write calls percolate down the translator stack
      • any plugin can be introduced in the stack
  • current status: enablement work to integrate gluster-qemu
    • start by writing a block driver in qemu to support gluster natively
    • add block device support in gluster itself via block device translator
  • (discussion) do all features of gluster work with these block devices?
    • not yet, early stages. Hope is all features will eventually work.
  • interesting use-case: replace qemu block dev with gluster translators
  • would you have to re-write qcow2?
    • don’t need to, many of qcow2 features already exist in glusterfs common code
  • slide showing perf numbers
  • future
    • is it possible to export LUNs to gluster clients?
    • creating a VM image means creating a LUN
    • exploit per-vm storage offload – all this using a block device translator
    • export LUNs as files; also export files as LUNs.
  • (discussion) why not use raw files directly instead of adding all this overhead? This looks like a perf disaster (ip network, qemu block layer, etc.) – combination of stuff increasing latency, etc.
    • all of this is experimentation, to go where we haven’t yet gone – explore new opportunities. this is just the initial work; more interesting stuff can be built upon this platform later.
  • libvirt, ovirt, vdsm support for glusterfs added – details in slides
  • (discussion) storage array integration (slide) – question
    • way vendors could integrate san storage into virt stack.
    • we should have capability to use array-assisted features to create lun.
    • from ovirt/vdsm/libvirt/etc.
  • (discussion) we already have this in scsi. why add another layer? why in userspace?
    • difficult, as per current understanding: send commands directly to storage: fast copy from lun-to-lun, etc., not via scsi T10 extensions.
    • these are out-of-band mechanisms, in mgmt path, not data path.
  • why would someone want to do that via python etc.?
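
The native gluster block driver discussed above surfaces in qemu as a gluster:// URI; usage looks roughly like this (a sketch; the server and volume names are examples):

```shell
# Create a qcow2 image directly on a gluster volume, no FUSE mount needed.
qemu-img create -f qcow2 gluster://server1/volname/test.qcow2 10G

# Boot a guest off that image over the native gluster protocol.
qemu-system-x86_64 -drive file=gluster://server1/volname/test.qcow2,if=virtio
```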

Next generation interrupt virtualization for KVM – Joerg Roedel


  • presenting new h/w tech today that accelerates guests
  • current state
    • kvm emulates local apic and io-apic
    • all reads/writes intercepted
    • interrupts can be queued from user or kernel
    • ipi costs high
  • h/w support
    • limited
    • tpr is accelerated by using cr8 register
    • only used by 64 bit guests
  • shiny new feature: avic
  • avic is designed to accelerate most common interrupt system features
    • ipi
    • tpr
    • interrupts from assigned devs
  • ideally none of those require intercept anymore
  • avic virtualizes apic for each vcpu
    • uses an apic backing page
    • guest physical apic id table, guest logical apic id table
    • no x2apic in first version
  • guest vapic backing page
    • store local apic contents for one vcpu
    • writes to accelerated registers won’t intercept
    • writes to non-accelerated registers cause intercepts
  • accelerated:
    • tpr
    • EOI
    • ICR low
    • ICR high
  • physical apic id table
    • maps guest physical apic id to host vapic pages
    • (discussion) what if guest cpu is not running
      • will be covered later
  • table maintained by kvm
  • logical apic id table
    • maps guest logical apic ids to guest physical apic ids
      • indexed by guest logical apic id
  • doorbell mechanism
    • used to signal avic interrupts between physical cpus
      • src pcpu figures out physical apic id of the dest.
      • when dest. vcpu is running, it sends doorbell interrupt to physical cpu
  • iommu can also send doorbell messages to pcpus
    • iommu checks if vcpu is running too
    • for not running vcpus, it sends an event log entry
  • imp. for assigned devices
  • msr can also be used to issue doorbell messages by hand – for emulated devices
  • running and not running vcpus
    • doorbell only when vcpu running
  • if target pcpu is not running, sw notified about a new interrupt for this vcpu
  • support in iommu
    • iommu necessary for avic-enabled device pass-through
    • (discussion) kvm has to maintain? enable/disable on sched-in/sched-out
  • support can mostly be implemented in kvm-amd module
    • some outside support in apic emulation
    • some changes to lapic emulation
      • change layout
      • kvm x86 core code will allocate vapic pages
      • (discussion) instead of kvm_vcpu_kick(), just run doorbell
  • vapic page needs to be mapped in nested page table
    • likely requires changes to kvm softmmu code
  • open question wrt device passthrough
    • changes to vfio required
      • ideally fully transparent to userspace

Reviewing Unused and New Features for Interrupt/APIC Virtualization – Jun Nakajima


  • Intel is going to talk about a similar hardware feature
  • intel: have a bitmap, and they decide whether to exit or not.
  • amd: hardcoded. apic timer counter, for example.
  • q to intel: do you have other things to talk about?
    • yes, coming up later.
  • a paper on the net showed perf improved from 5-6 Gbit/s to wire speed, using emulation of this tech.
  • intel have numbers on their slides.
  • they used sr-iov 10gbe; measured vmexit
  • interrupt window: when the hypervisor wants to inject an interrupt, the guest may not be running; the hypervisor has to enter the vm. when the guest is ready to receive the interrupt, it comes back with a vmexit. problem: as you need to inject interrupts, more vmexits, and the guest becomes busier. so: they wanted to eliminate them.
    • read case: if you have something in advance (apic page), hyp can just point to that instead of this exit dance
    • more than 50% exits are interrupt-related or apic related.
  • new features for interrupt/apic virt
    • reads are redirected to apic page
    • writes: vmexit after write; not intercepted. no need for emulation.
  • virt-interrupt delivery
    • extend tpr virt to other apic registers
    • eoi – no need for vm exits (using new bitmap)
      • this looks different from amd
    • but for eoi behaviour, intel/amd can have common interface.
  • intel/amd comparing their approaches / features / etc.
    • most notably, intel have support for x2apic, not for iommu. amd have support for iommu, not for x2apic.
  • for apic page, approaches mostly similar.
  • virt api can have common infra, but data structures are totally different. intel spec will be avl. in a month or so (update: already available now). amd spec shd be avl in a month too.
  • they can remove interrupt window, meaning 10% optimization for 6 VM case
  • net result
    • eliminate 50% of vmexits
    • optimization of 10% vmexits.
  • intel also supports x2apic h/w.
    • this can hide info from other vcpus
    • secure channel between guest and host; can do whatever hypervisor wants.
    • vcpu executes vmfunc instruction in special thread
  • usecases:
    • allow hvm guests to share pages/info with hypervisor in secure fashion
  • (discussion) why not just add to ept table
  • (discussion) does intel’s int. virt. have an iommu component too?
    • doesn’t want to commit.

CoLo – Coarse-grained Lock-stepping VM for non-stop service – Will Auld


  • non-stop service with VM replication
    • client-server
    • Compare and contrast with Remus – Xen’s solution
      • xen: remus
        • buffers responses until checkpoint to secondary server completes (once per epoch)
        • resumes secondary only on failover
        • failover at anytime
      • xen: colo
        • runs two VMs in parallel comparing their responses, checkpoints only on miscompare
        • resumes after every checkpoint
        • failover at anytime
  • CoLo runs VMs on primary and secondary at same time.
    • both machines respond to requests; they check for similarity. When they agree, one of the responses is sent to the client
  • diff. between two models:
    • remus: assumes machine states have to be the same. This is the reason to buffer responses until the checkpoint has completed.
    • in colo; no such req. only requirement is request stream must be the same.
  • CoLo non-stop service focus on server response, not internal machine state (since multiprocessor environment is inherently nondeterministic)
  • there’s heartbeat, checkpoint
  • colo managers on both machines compare requests.
    • when they’re not same, CoLo does checkpoint.
  • (discussion) why will response be same?
    • int. machine state shouldn’t matter for most responses.
    • some exceptions, like tcp/ip timestamps.
    • minor modifications to tcp/ip stacks
      • coarse grain time stamp
      • highly deterministic ack mechanism
    • even then, state of machine is dissimilar.
  • resume of machine on secondary node:
    • another stumbling block.
  • (slides) graph on optimizations
  • how do you handle disk access? network is easier – n/w stack resumes on failover. if you don’t do failover in a state where you know disk is in a consistent state, you can get corruption.
    • Two solutions
      • For NAS, do same compares as with responses (this can also trigger checkpoints).
      • On local disks, buffer the original state of changed pages, revert to the original, and then checkpoint with the primary node’s disk writes included. This is equivalent to how the memory image is updated. (This was not described completely enough during the session.)
  • that sounds dangerous. client may have acked data, etc.
    • will have to look closer at this. (More complete explanation above counters this)
  • how often do you get mismatch?
    • depends on workload. some runs were like 300-400 good packets, then a mismatch.
  • during that, are you vulnerable to failure?
    • no, can failover at any point. internal state doesn’t matter. Both VMs, provide consistent request streams from their initial state and match responses up to the moment of failover.

NUMA – Dario Faggioli, Andrea Arcangeli

NUMA and Virtualization, the case of Xen – Dario Faggioli


  • Intro to NUMA
    • access costs to memory differ, based on which processor accesses it
    • remote mem is slower
  • in context of virt, want to avoid accessing remote memory
  • what we used to have in xen
    • on creation of VM, memory was allocated on all nodes
  • to improve: automatic placement
    • at vm1 creation time, pin vm1 to first node,
    • at vm2 create time, pin vm2 to second node since node1 already has a vm pinned to it
  • then they went ahead a bit, because pinning was inflexible
    • lots of idle cpus and memory
  • what they will have in xen 4.3
    • node affinity
      • instead of static cpu pinning, preference to run vms on specific cpus
  • perf evaluation
    • specjbb in 3 configs (details in slides)
    • they get 13-17% improvements in 2vcpus in each vm
  • open problems
    • dynamic memory migration
    • io numa
      • take into consideration io devices
    • guest numa
      • if vm bigger than 1 node, should guest be aware?
    • ballooning and sharing
      • sharing could cause remote access
      • ballooning causes local pressures
    • inter-vm dependencies
    • how to actually benchmark and evaluate perf, to tell whether they’re improving
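
The automatic placement heuristic described above can be sketched in a few lines. This is an illustrative model only (the function and field names are mine, and the "least loaded" metric is a stand-in), not Xen's actual placement code:

```python
# Illustrative sketch of greedy NUMA placement: a new VM goes to a node
# that has enough free memory, preferring the node with the fewest vcpus
# already placed on it. Not Xen's real algorithm.

def place_vm(nodes, vm_mem, vm_vcpus):
    """nodes: list of dicts with 'free_mem' and 'vcpus' (already placed).
    Returns the index of the chosen node, or None if nothing fits."""
    candidates = [i for i, n in enumerate(nodes) if n["free_mem"] >= vm_mem]
    if not candidates:
        return None  # VM doesn't fit on a single node
    # Prefer the least loaded node (fewest vcpus already placed there).
    best = min(candidates, key=lambda i: nodes[i]["vcpus"])
    nodes[best]["free_mem"] -= vm_mem
    nodes[best]["vcpus"] += vm_vcpus
    return best
```

With two empty nodes, the first VM lands on node 0 and the second on node 1, which is exactly the vm1/vm2 behaviour described in the bullets above.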

AutoNUMA – Andrea Arcangeli


  • solving similar problem for Linux kernel
  • implementation details avail in slides, will skip now
  • components of autonuma design
    • novel approach
      • mem and cpu migration were tried earlier using different approaches; optimistic about this one.
    • core design rests on two novel ideas
      • introduce numa hinting pagefaults
        • works at thread-level, on thread locality
      • false sharing / relation detection
  • autonuma logic
    • cpu follows memory
    • memory in b/g slowly follows cpu
    • actual migration is done by knuma_migrated
      • all this is async and b/g, doesn’t stall memory channels
  • benchmarking
    • developed a new benchmark tool, autonuma-benchmark
      • generic to measure alternative approaches too
    • comparing to gravity measurement
    • put all memory in single node
      • then drop pinning
      • then see how memory spreads by autonuma logic
  • see slides for graphics on convergence
  • perf numbers
    • also includes comparison with alternative approach, sched numa.
    • graphs show autonuma is better than schednuma, which is better than vanilla kernel
  • criticism
    • complex
    • takes more memory
      • takes 12 bytes per page, Andrea thinks it’s reasonable.
      • it’s done to try to decrease risk of hitting slowdowns (is faster than vanilla already)
    • stddev shows autonuma is pretty deterministic
  • why is autonuma so important?
    • even 2 sockets show differences and improvements.
    • but 8 nodes is where autonuma really shines


  • looks like Andrea is focusing on 2 nodes / sockets, not more? looks like it will have a bad impact on bigger systems
    • points to graph showing 8 nodes
    • on big nodes, distance is more.
    • agrees autonuma doesn’t worry about distances
    • currently worries only about convergence
    • distance will be taken as future optimisation
    • same for Xen case
      • access to 2 node h/w is easier
    • as Andrea also mentioned, improvement on 2 node is lower bound; so improvements on bigger ones should be bigger too; don’t expect to be worse
  • not all apps just compute; they do io. and they migrate to the right cpu to where the device is.
    • are we moving memory to cpu, or cpu to device, etc… what should the heuristic be?
      • 1st step should be to get cpu and mem right – they matter the most.
      • doing for kvm is great since it’s in linux, and everyone gets the benefit.
      • later, we might want to take other tunables into account.
    • crazy things in enterprise world, like storage
    • for high-perf networking, use tight binding, and then autonuma will not interfere.
      • this already works.
    • xen case is similar
      • this is also something that’s going to be workload-dependent, so custom configs/admin is needed.
  • did you have a chance to test on AMD Magny-Cours (many more nodes)
    • hasn’t tried autonuma on specific h/w
    • more nodes, better perf, since upstream is that much worse.
    • xen
      • he did, and at least placement was better.
      • more benchmarking is planned.
  • suggestion: do you have a way to measure imbalance / number of accesses going to correct node
    • to see if it’s moving towards convergence, or not moving towards convergence, maybe via tracepoints
    • essentially to analyse what the system is doing.
    • exposing this data so it can be analysed.
  • using printks right now for development, there’s a lot of info, all the info you have to see why the algo is doing what it’s doing.
  • good to have in production so that admins can see
    • what autonuma is doing
    • how much is it converging
      • to decide to make it more aggressive, etc.
  • overall, all such stats can be easily exported; it’s already available via printk, but has to be moved to something more structured and standard.
  • xen case is same; trying to see how they can use perf counters, etc. for statistical review of what is going on, but not precise enough
    • tells how many remote memory accesses are happening, but not from where and to where
    • something more in s/w is needed to enable this information.

One balloon for all – towards unified balloon driver – Daniel Kiper


  • wants to integrate the various balloon drivers available in Linux
  • currently 3 separate drivers
    • virtio
    • xen
    • vmware
  • despite impl. differences, their core is similar
    • feature difference in drivers (xen has selfballooning)
    • overall lots of duplicate code
  • do we have an example of a good solution?
    • yes, generic mem hotplug code
    • core functionality is h/w independent
    • arch-specific parts are minimal, most is generic
  • solution proposed
    • core should be hypervisor-independent
    • should co-operate on h/w independent level – e.g. mem hotplug, tmem, movable pages to reduce fragmentation
    • selfballooning ready
    • support for hugepages
    • standard api and abi if possible
    • arch-specific parts should communicate with underlying hypervisor and h/w if needed
  • crazy idea
    • replace ballooning with mem hot-unplug support
    • however, ballooning operates on single pages whereas hotplug/unplug works on groups of pages that are arch-dependent.
      • not flexible at all
      • have to use userspace interfaces
        • can be done via udev scripts, which is a better way
  • discussion: does acpi hotplug work seamlessly?
    • on x86 baremetal, hotplug works like this:
      • hotplug mem
      • acpi signals to kernel
      • acpi adds to mem space
      • this is not visible to processes directly
      • has to be enabled via sysfs interfaces, by onlining every section that has been hotplugged
  • is selfballooning desirable?
    • kvm isn’t looking at it
    • guest wants to keep mem to itself, it has no interest in making host run faster
    • you paid for mem, but why not use all of it
    • if there’s a tradeoff for the guest: you pay less, you get more mem later, etc., guests could be interested.
    • essentially, what is guest admin’s incentive to give up precious RAM to host?
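
As a concrete illustration of the sysfs step in the hotplug discussion above: on x86, newly added memory sections appear under /sys/devices/system/memory and must be onlined explicitly before processes can use them (a udev rule can automate this). A minimal sketch:

```shell
# Online every hotplugged memory section; until this is done, the new
# memory is known to the kernel but not usable by processes.
# Already-online sections reject the write, which is harmless here.
for section in /sys/devices/system/memory/memory*/state; do
    echo online > "$section" 2>/dev/null
done
```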

ARM – Marc Zyngier, Stefano Stabellini

KVM ARM – Marc Zyngier


  • ARM architecture virtualization extensions
    • recent introduction in arch
    • new hypervisor mode PL2
    • traditionally secure state and non-secure state
    • Hyp mode is in non-secure side
  • higher privilege than kernel mode
  • adds second stage translation; adds extra level of indirection between guests and physical mem
    • tlbs are tagged by VMID (like EPT/NPT)
  • ability to trap accesses to most system registers
  • can handle irqs, fiqs, async aborts
    • e.g. guest doesn’t see interrupts firing
  • hyp mode: not a superset of SVC
    • has its own pagetables
    • only stage 1, not 2
    • follows LPAE, new physical extensions.
    • one translation table register
      • so difficult to run Linux directly in Hyp mode
      • therefore they use Hyp mode only to context switch between host and guest modes (unlike x86)
    • exits guest on physical interrupt firing
    • access to a few privileged system registers
    • WFI (wait for interrupt)
      • (discussion) WFI is trapped and then we exit to host
    • etc.
    • on guest exit, control restored to host
    • no nesting; arch isn’t ready for that.
  • MM
    • host in charge of all MM
    • has no stage2 translation itself (saves tlb entries)
    • guests are in total control of page tables
    • becomes easy to map a real device into the guest physical space
    • for emulated devices, accesses fault, generates exit, and then host takes over
    • 4k pages only
  • instruction emulation
    • trap on mmio
    • most instructions described in HSR
    • added complexity due to having to handle multiple ISAs (ARM, Thumb)
  • interrupt handling
    • redirect all interrupts to hyp mode only while running a guest. This only affects physical interrupts.
    • leave it pending and return to host
    • pending int will kick in when returns to guest mode?
      • No, it will be handled in host mode. Basically, we use the redirection to HYP mode to exit the guest, but keep the handling on the host.
  • inject two ways
    • manipulating arch. pins in the guest?
      • The architecture defines virtual interrupt pins that can be manipulated (VI→I, VF→F, VA→A). The host can manipulate these pins to inject interrupts or faults into the guest.
  • using virtual GIC extensions,
  • booting protocol
    • if you boot in HYP mode, and if you enter a non-kvm kernel, it gracefully goes back to SVC.
    • if kvm-enabled kernel is attempted to boot into, automatically goes into HYP mode
    • If a kvm-enabled kernel is booted in HYP mode, it installs a HYP stub and goes back to SVC. The only goal of this stub is to provide a hook for KVM (or another hypervisor) to install itself.
  • current status
    • pending: stable userspace ABI
    • pending: upstreaming
      • stuck on reviewing

Xen ARM – Stefano Stabellini


  • Why?
    • arm servers
    • smartphones
    • 288 cores in a 4U rack – a serious maintenance headache
  • challenges
    • traditional way: port xen, and port hypercall interface to arm
    • from Linux side, using PVOPS to modify setpte, etc., is difficult
  • then, armv7 came.
  • design goals
    • exploit h/w as much as possible
    • limit to one type of guest
      • (x86: pv, hvm)
      • no pvops, but pv interfaces for IO
    • no qemu
      • lots of code, complicated
    • no compat code
      • 32-bit, 64-bit, etc., complicated
    • no shadow pagetables
      • most difficult code to read ever
  • NO emulation at all!
  • one type of guest
    • like pv guests
      • boot from a user supplied kernel
      • no emulated devices
      • use PV interfaces for IO
    • like hvm guests
      • exploit nested paging
      • same entry point on native and xen
      • use device tree to discover xen presence
      • simple device emulation can be done in xen
        • no need for qemu
  • exploit h/w
    • running xen in hyp mode
    • no pv mmu
    • hypercall
    • generic timer
      • export timer int. to guest
  • GIC: general interrupt controller
    • int. controller with virt support
    • use GIC to inject event notifications into any guest domains with Xen support
      • x86 taught us this provides a great perf boost (event notifications on multiple vcpus simultaneously)
      • on x86, there was a pci device to inject event notifications into the guest as a legacy interrupt
  • hypercall calling convention
    • hvc (hypercall)
    • pass params on registers
    • hvc takes an argument: 0xEA1 – means it’s a xen hypercall.
  • 64-bit ready abi (another lesson from x86)
    • no compat code in xen
      • 2600 fewer lines of code
  • had to write a 1500-line patch of mechanical substitutions so that a 32-bit host can run all guest types fine
  • status
    • xen and dom0 boot
    • vm creation and destruction work
    • pv console, disk, network work
    • xen hypervisor patches almost entirely upstream
    • linux side patches should go in next merge window
  • open issues
    • acpi
      • will have to add acpi parsers, etc. in device table
      • linux has 110,000 lines of acpi code – it would all have to be merged
  • uefi
    • grub2 on arm: multiboot2
    • need to virtualise runtime services
    • so only hypervisor can use them now
  • client devices
    • lack ref arch
    • difficult to support all tablets, etc. in market
    • uefi secure boot (is required by win8)
    • windows 8


  • who’s responsible for device tree mgmt for xen?
    • xen takes the dt from hw, changes it for mem mgmt, then passes it to dom0
    • at present, have to build the dt binary
  • at the moment, linux kernel infrastructure doesn’t support interrupt priorities.
    • needed to prevent a guest doing a DoS on host by just generating interrupts non-stop
    • xen does support int. priorities in GIC

VFIO – Are we there yet? – Alex Williamson


  • are we there yet? almost
  • what is vfio?
    • virtual function io
    • not sr-iov specific
    • userspace driver interface
      • kvm/qemu vm is a userspace driver
    • iommu required
      • visibility issue with devices in iommu, guaranteeing devices are isolated and safe to use – different from uio.
    • config space access is done from kernel
      • adds to safety requirement – can’t have userspace doing bad things on host
  • what’s different from last year?
    • 1st proposal shot down last year, and got revised at last LPC
    • allow IOMMU driver to define device visibility – not per-device, but the whole group exposed
    • more modular
  • what’s different from pci device assignment
    • x86 only
    • kvm only
    • no iommu grouping
    • relies on pci-sysfs
    • turns kvm into a device driver
  • current status
    • core pci and iommu drivers in 3.6
    • qemu will be pushed for 1.3
  • what’s next?
    • qemu integration
    • legacy pci interrupts
      • more of a qemu-kvm problem, since vfio already supports this, but these are unique since they’re level-triggered; host has to mask interrupt so it doesn’t cause a DoS till guest acks interrupt
        • like to bypass qemu directly – irqfd for edge-triggered. now exposing irqfd for level
  • (lots of discussion here)
  • libvirt support
    • iommu grps changed the way we do device assignment
    • sysfs entry point; move device to vfio driver
    • do you pass group by file descriptor?
    • lots of discussion on how to do this
    • existing method needs name for access to /sys
    • how can we pass file descriptors from libvirt for groups and containers to work in different security models?
      • The difficulty is in how qemu assembles the groups and containers. On the qemu command line, we specify an individual device, but that device lives in a group, which is the unit of ownership in vfio and may or may not be connectable to other containers. We need to figure out the details here.
  • POWER support
    • already adding
  • PowerPC
    • freescale looking at it
    • one api for x86, ppc was strange
  • error reporting
    • better ability to inject AER etc to guest
    • maybe another ioctl interrupt
    • If we do get PCIe AER errors showing up at a device, what is the guest going to be able to do (for instance, can it reset links)?
      • We’re going to have to figure this out and it will factor into how much of the AER registers on the device do we expose and allow the guest to control. Perhaps not all errors are guest serviceable and we’ll need to figure out how to manage those.
  • better page pinning and mapping
    • gup issues with ksm running in b/g
  • PRI support
  • graphics support
    • issues with legacy io port space and mmio
    • can be handled better with vfio
  • NUMA hinting
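
For reference, the sysfs entry point mentioned in the libvirt discussion above looks roughly like this, following the flow in the kernel's VFIO documentation. The PCI address and vendor:device ID below are placeholders:

```shell
# Hand a PCI device (and thus its whole IOMMU group) to vfio-pci.
# 0000:06:0d.0 and "1102 0002" are placeholder IDs for illustration.
echo 0000:06:0d.0 > /sys/bus/pci/devices/0000:06:0d.0/driver/unbind
echo 1102 0002 > /sys/bus/pci/drivers/vfio-pci/new_id

# A /dev/vfio/<group> node appears; that is what userspace
# (qemu, or libvirt on its behalf) opens to access the group.
ls /dev/vfio/
```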

Semiassignment: best of both worlds – Alex Graf


  • b/g on device assignment
  • best of both worlds
    • assigned device during normal operation
    • emulated during migration
  • xen solution – prototype
    • create a bridge in domU
    • guest sees a pv device and a real device
    • guest changes needed for bridge
    • migration is guest-visible, since real device goes away and comes back (hotplug)
      • security issue if VM doesn’t ack hot-unplug
  • vmware way
    • writing a new driver for each n/w device they want to support
    • this new driver calls into vmxnet
    • binary blob is mapped into your address space
    • migration is guest exposed
      • new blob needed for destination n/w card
  • alex way
    • emulate real device in qemu
    • e.g. expose emulated igbvf if passing through igbvf
    • need to write migration code for each adapter as well
  • demo
    • doesn’t quite work right now
  • is it a good idea?
  • how much effort really?
    • doesn’t think it’s much effort
    • current choices in datacenters are igbvf and <something else>
      • that’s not true!
      • easily a dozen adapters avl. now
      • lots of examples given why this claim isn’t true
        • no one needs single-vendor/card dependency in an entire datacenter
  • non-deterministic network performance
  • more complicated network configuration
  • discussion
    • Another solution suggested by Benjamin Herrenschmidt: use s3; remove ‘live’ from ‘live migration’.
    • AER approach
  • General consensus was to just do bonding+failover

KVM performance: vhost scalability – John Fastabend


  • current situation: one kernel thread per vhost
  • if we create a lot of VMs and a lot of virtio-net devices, perf doesn’t scale
  • not numa aware
  • Main grouse is it doesn’t scale.
  • instead of having a thread for every vhost device, create a vhost thread per cpu
  • add some numa-awareness scheduling – pick best cpu based on load
  • perf graphs
    • for 1 VM, number of instances of netperf increase, per-cpu-vhost doesn’t shine.
    • another tweak: use 2 threads per cpu: perf is better
  • for 4 VMs, results are good for 1-thread: much better than 2-thread (2-thread does worse than current). With 4 VMs, per-cpu-vhost was nearly equivalent.
  • on 12 VMs, 1-thread works better, and 2-thread works better than baseline. Per-cpu vhosts shine here, outperforming baseline and the 1-thread/2-thread cases.
  • tried tcp, udp, inter-guest, all netperf tests, etc.
    • this result is pretty standard for all the tests they’ve done.
  • RFC
    • should they continue?
    • strong objections?
  • discussion
    • were you testing with raw qemu or libvirt?
      • as libvirt creates its own cgroups, and that may interfere.
    • pinning vs non-pinning
      • gives similar results
  • no objections!
  • in a cgroup – roundrobin the vhost threads – interesting case to check with pinning as well.
  • transmit and receive interfere with each other – so perf improvement was seen when they pinned transmit side.
  • try this on bare-metal.
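
The scheduling idea can be modelled in a few lines. This is a toy sketch of my own, not the actual patch: work for a virtio-net queue goes to the least-loaded per-cpu worker on the device's NUMA node, rather than to a dedicated per-device thread:

```python
# Toy model of per-cpu vhost workers with NUMA-aware selection.
# Purely illustrative; names and the load metric are assumptions.

def pick_worker(load_per_cpu, numa_node_cpus, node):
    """load_per_cpu: dict cpu -> pending work items.
    numa_node_cpus: dict node -> list of cpus on that node.
    Returns the cpu whose worker thread should take the work."""
    cpus = numa_node_cpus[node]
    return min(cpus, key=lambda c: load_per_cpu[c])
```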

Network overlays – Vivek Kashyap

  • want to migrate machines from one place to another in a data center
    • don’t want physical limitations (programming switches, routers, mac addr, etc)
  • idea is to define a set of tunnels which are overlaid on top of networks
    • vms migrate within tunnels, completely isolated from physical networks
  • Scaling at layer 2 is limited by the need to support broadcast/multicast over the network
  • overlay networks
    • when migrating across domains (subnets), have to re-number IP addresses
      • when migrating need to migrate IP and MAC addresses
      • When migrating across subnets might need to re-number or find another mechanism
    • solution is to have a set of tunnels
    • every end-user can view their domain/tunnel as a single virtual network
      • they only see their own traffic, no one else can see their traffic.
  • standardization is required
    • being worked on at IETF
    • MTU seen by the VM is not the same as what is on the physical network (because of headers added by extra layers)
    • vxlan adds udp headers
    • one option is to have large(er) physical MTU so it takes care of this otherwise there will be fragmentation
      • Proposal
        • If guest does pathMTU discovery let tunnel end point return the ICMP error to reduce the guest’s view of the MTU.
        • Even if the guest has not set the DF (don’t fragment) bit, return an ICMP error. The guest will handle the ICMP error and update its view of the MTU on the route.
          • having the hypervisor co-operate so guests do path MTU discovery and things work fine
          • no guest changes needed, only hypervisor needs small change
  • (discussion) Cannot assume much about guests; guests may not handle ICMP.
  • Some way to avoid flooding
    • extend to support an ‘address resolution module’
    • Stephen Hemminger supported the proposal
  • Fragmentation
    • can’t assume much about guests; they may not like packets getting fragmented if they set DF
    • fragmentation highly likely since new headers are added
      • The above comment is wrong: if DF is set, we do path MTU discovery and the packet won’t be fragmented. Also, fragmentation, if done, is on the tunnel; the VMs don’t see fragmentation, but it is not performant to fragment and reassemble at the end points.
      • Instead the proposal is to use path MTU discovery to make the VMs send packets that won’t need to be fragmented.
  • PXE, etc., can be broken
  • Distributed Overlay Ethernet Network
    • DOVE module for tunneling support
      • use 24-bit VNI
  • patches should be coming to netdev soon enough.
  • possibly using checksum offload infrastructure for tunneling
  • question: layer 2 vs layer 3
    • There is interest in the industry to support overlay solutions for layer 2 and layer 3.
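
The MTU and VNI arithmetic in the notes above is easy to sanity-check. This assumes VXLAN over IPv4 with no outer VLAN tag (the commonly quoted 50-byte overhead):

```python
# VXLAN overhead per packet, relative to the guest's usable MTU:
# the encapsulated inner Ethernet header plus the new outer headers.
INNER_ETH = 14    # original frame's Ethernet header, now payload
VXLAN_HDR = 8     # VXLAN header (carries the 24-bit VNI)
UDP_HDR = 8       # vxlan adds udp headers, as noted above
OUTER_IPV4 = 20   # outer IP header

overhead = INNER_ETH + VXLAN_HDR + UDP_HDR + OUTER_IPV4   # 50 bytes
guest_mtu = 1500 - overhead                               # 1450

# The 24-bit VNI gives far more isolated networks than 12-bit VLAN IDs.
vni_space = 2 ** 24    # 16777216 virtual networks
vlan_space = 2 ** 12   # 4096 VLANs
```

This is why the talk suggests either a larger physical MTU or having the tunnel endpoint steer guests toward a smaller MTU via path MTU discovery.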

Lightning talks

QEMU disaggregation – Stefano Stabellini


  • dom0 is a privileged VM
  • better model is to split dom0 into multiple service VMs
    • disk domain, n/w domain, everything else
      • no bottleneck, better security, simpler
  • hvm domain needs device model (qemu)
  • wouldn’t it be nice if one qemu does only disk emulation
    • second does network emulation
    • etc.
  • to do this, they moved pci decoder in xen
    • traps on all pci requests
    • hypervisor de-multiplexes to the ‘right’ qemu
  • open issues
    • need flexibility in qemu to start w/o devices
    • modern qemu better
      • but: always uses PCI host bridge, PIIX3, etc.
    • one qemu uses this all, others have functionality, but don’t use it
  • multifunction devices
    • PIIX3
  • internal dependencies
    • one component pulls others
      • vnc pulls keyboard, which pulls usb, etc.
  • it is in demoable state

Xenner — Alex Graf

  • intro
    • guest kernel module that allows a xen pv kernel to run on top of kvm – messages to xenbus go to qemu
  • is anyone else interested in this at all?
    • xen folks last year did show interest for a migration path to get rid of pv code.
    • xen is still interested, but not in short time. – few years.
    • do you guys want to work together and get it rolling?
      • no one commits to anything right now


Syndicated 2013-01-25 11:36:01 (Updated 2013-05-22 10:22:52) from Think. Debate. Innovate. - Amit Shah's blog

10 Jan 2013 (updated 22 May 2013 at 11:16 UTC)

About Random Numbers and Virtual Machines

Several applications need random numbers for correct and secure operation.  When ssh-server gets installed on a system, public and private key pairs are generated.  Random numbers are needed for this operation.  Same with creating a GPG key pair.  Initial TCP sequence numbers are randomized.  Process PIDs are randomized.  Without such randomization, we’d get a predictable set of TCP sequence numbers or PIDs, making it easy for attackers to break into servers or desktops.


On a system without any special hardware, Linux seeds its entropy pool from sources like keyboard and mouse input, disk IO, network IO, and any other sources whose kernel modules indicate they are capable of adding to the kernel’s entropy pool (i.e. the interrupts they receive are from sufficiently non-deterministic sources).  For servers, keyboard and mouse inputs are rare (most don’t even have a keyboard / mouse connected).  This makes getting true random numbers difficult: applications requesting random numbers from /dev/random have to wait for indefinite periods to get the randomness they desire (for example when creating ssh keys, typically during firstboot).


Applications that need random numbers instantaneously, but can make do with slightly lower-quality random numbers, have the option of getting their randomness from /dev/urandom, which doesn’t block to serve random numbers — it’s just not guaranteed that the numbers one receives from /dev/urandom truly reflect pure randomness.  Indiscriminate reading of /dev/urandom will reduce the system’s entropy levels and starve applications that need true random numbers.  Random numbers in a system are a rare resource, so applications should only fetch them when needed, and only read as many bytes as needed.
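
The difference is visible from userspace: `os.urandom()` draws from the kernel's non-blocking pool and returns immediately, while a direct read of /dev/random may stall, which is exactly the indefinite wait described above:

```python
import os

# /dev/urandom semantics: this returns immediately, regardless of how
# much entropy the kernel has collected.
key_material = os.urandom(32)
assert len(key_material) == 32

# A read of /dev/random, by contrast, may block until the kernel's
# entropy pool is replenished (left commented out, since on an
# entropy-starved system this can stall indefinitely):
# with open("/dev/random", "rb") as f:
#     data = f.read(32)
```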


There are a few random number generator devices that can be plugged into computers.  These can be PCI or USB devices, and are fairly popular add-ons on servers.  The Linux kernel has a hwrng (hardware random number generator) abstraction layer to select an active hwrng device among several that might be present, and ask the device to give random data when the kernel’s entropy pool falls below the low watermark.  The rng-tools package comes with rngd, a daemon, that reads input from hwrngs and feeds them into the kernel’s entropy pool.


Virtual machines are similar to server setups: there is very little going on in a VM’s environment for the guest kernel to source random data.  A server that hosts several VMs may still have a lot of disk and network IO happening as a result of all the VMs it hosts, but a single VM may not be doing much itself to generate enough entropy for its applications.  One solution, therefore, to sourcing random numbers in VMs is to ask the host for a portion of the randomness it has collected, and feed it into the guest’s entropy pool.  A paravirtualized hardware random number generator exists for KVM VMs.  The device is called virtio-rng, and as the name suggests, the device sits on top of the virtio PV framework.  The Linux kernel gained support for virtio-rng devices in kernel 2.6.26 (released in 2008).  The QEMU-side device was added in the recent 1.3 release.


On the host side, the virtio-rng device (by default) reads from the host’s /dev/random and feeds that into the guest.  The source of this data can be modified, of course.  If the host lacks any hwrng, /dev/random is the best source to use.  If the host itself has a hwrng, using input from that device is recommended.
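
For illustration, wiring this up on the QEMU (1.3 or newer) command line looks like the sketch below; point `filename` at /dev/hwrng instead if the host has a hardware RNG:

```shell
# Attach a virtio-rng device backed by the host's /dev/random.
# (Sketch only -- the rest of the VM's command line is elided.)
qemu-system-x86_64 \
    -object rng-random,id=rng0,filename=/dev/random \
    -device virtio-rng-pci,rng=rng0 \
    ...
```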


Newer Intel architectures (IvyBridge onwards) have an instruction, RDRAND, that provides random numbers.  This instruction can be directly exposed to guests.  Guests probe for the presence of this instruction (using CPUID) and use it if available.  This doesn’t need any modification to the guest.  However, there’s one drawback to exposing this instruction to guests: live migration.  If not all hosts in a server farm have the same CPU, live-migrating a guest from one host that exposes this instruction to another that doesn’t, will not work.  In this case, virtio-rng in the host can be configured to use RDRAND as its source, and the guest can continue to work as in the previous example.  This is still sub-optimal, as we’ll be passing random numbers to the guest (as in the case of /dev/random), instead of real entropy.  The RDSEED instruction, to be introduced later (Broadwell onwards) will provide entropy that can be safely passed on to a guest via virtio-rng as a source of true random entropy, eliminating the need to have a physical hardware random number generator device.


It looks like QEMU/KVM is the only hypervisor that has the support for exposing a hardware random number generator to guests.  (One could pass through a real hwrng to a guest, but that doesn’t scale and isn’t practical for all situations — e.g. live migration.)  Fedora 19 will have QEMU 1.4, which has the virtio-rng device, and even older guests running on top of F19 will be able to use the device.


For more information on virtio-rng, see the QEMU feature page and the Fedora feature page.  There’s also an excellent article on random numbers, based on H. Peter Anvin’s talk at LinuxCon EU 2012.


Updated 2013 May 22: Added info about RDSEED and the Fedora feature page, corrected few typos.

Syndicated 2013-01-10 20:57:30 (Updated 2013-05-22 10:17:40) from Think. Debate. Innovate. - Amit Shah's blog
