Linux Container Security
First, read these slides. Done? Good.
Hypervisors present a smaller attack surface than containers. This is somewhat mitigated in containers by using seccomp, selinux and restricting capabilities in order to reduce the number of kernel entry points that untrusted code can touch, but even so there is simply a greater quantity of privileged code available to untrusted apps in a container environment when compared to a hypervisor environment.
Does this mean containers provide reduced security? That's an arguable point. In the event of a new kernel vulnerability, container-based deployments merely need to upgrade the kernel on the host and restart all the containers. Full VMs need to upgrade the kernel in each individual image, which takes longer and may be delayed due to the additional disruption. In the event of a flaw in some remotely accessible code running in your image, an attacker's ability to cause further damage may be restricted by the existing seccomp and capabilities configuration in a container. They may be able to escalate to a more privileged user in a full VM.
I'm not really compelled by either of these arguments. Both argue that the security of your container is improved, but in almost all cases exploiting these vulnerabilities would require that an attacker already be able to run arbitrary code in your container. Many container deployments are task-specific rather than running a full system, and in that case your attacker is already able to compromise pretty much everything within the container. The argument's stronger in the Virtual Private Server case, but there you're trading that off against losing some other security features - sure, you're deploying seccomp, but you can't use selinux inside your container, because the policy isn't per-namespace.
So that seems like kind of a wash - there's maybe marginal increases in practical security for certain kinds of deployment, and perhaps marginal decreases for others. We end up coming back to the attack surface, and it seems inevitable that that's always going to be larger in container environments. The question is, does it matter? If the larger attack surface still only results in one more vulnerability per thousand years, you probably don't care. The aim isn't to get containers to the same level of security as hypervisors, it's to get them close enough that the difference doesn't matter.
I don't think we're there yet. Searching the kernel for bugs triggered by Trinity shows plenty of cases where the kernel screws up from unprivileged input. A sufficiently strong seccomp policy plus tight restrictions on the ability of a container to touch /proc, /sys and /dev helps a lot here, but it's not full coverage. The presentation I linked to at the top of this post suggests using the grsec patches - these will tend to mitigate several (but not all) kernel vulnerabilities, but there's tradeoffs in (a) ease of management (having to build your own kernels) and (b) performance (several of the grsec options reduce performance).
But this isn't intended as a complaint. Or, rather, it is, just not about security. I suspect containers can be made sufficiently secure that the attack surface size doesn't matter. But who's going to do that work? As mentioned, modern container deployment tools make use of a number of kernel security features. But there's been something of a dearth of contributions from the companies who sell container-based services. Meaningful work here would include things like:
- Strong auditing and aggressive fuzzing of containers under realistic configurations
- Support for meaningful nesting of Linux Security Modules in namespaces
- Introspection of container state and (more difficult) the host OS itself in order to identify compromises
These aren't easy jobs, but they're important, and I'm hoping that the lack of obvious development in areas like this is merely a symptom of the youth of the technology rather than a lack of meaningful desire to make things better. But until things improve, it's going to be far too easy to write containers off as a "convenient, cheap, secure: choose two" tradeoff. That's not a winning strategy.
 Companies using hypervisors! Audit your qemu setup to ensure that you're not providing more emulated hardware than necessary to your guests. If you're using KVM, ensure that you're using sVirt (either selinux or apparmor backed) in order to restrict qemu's privileges.
 There's apparently some support for loading per-namespace Apparmor policies, but that means that the process is no longer confined by the sVirt policy
 To be fair, last time I ran Trinity under Docker under a VM, it ended up killing my host. Glass houses, etc.