Followup: Automatic disk management - Down With (Un)Mount!

Posted 6 Oct 2004 at 18:28 UTC by lkcl

This article is a followup to the "Desktop Linux: Choices for simplicity" article: a technical description of the components needed to be able to pull USB drives out with impunity. The components are awkward, to say the least, and integrated across the board; they are necessary because of a bug in the 2.6 kernel where "umount -l" does not work properly.

The goal is to make it possible for ordinary users to rip USB cables out, remove USB media cards, press the button that ejects the CD and generally do what THEY want, not what YOU, the software writer, dictate that they SHOULD do.

User convenience is the priority - not technical superiority of speed or performance. With that in mind, I feel quite comfortable in justifying, rather horrifyingly, the installation of the "fuse" example program (fusexmp) on a production system, as part of the components.

... but why??? why install a userspace filesystem program, and why is it necessary? I'll answer that in a minute, after first describing the components:

  • fusexmp - File system in userspace "example" program.
  • HAL - Hardware Abstraction Layer (modified), with a modified version of fstab-sync which writes AutoFS config files
  • AutoFS - with a modification in the kernel header file reducing the "negative timeout" from 60 to 4 seconds (very important!)
  • KVM - KDE Volume manager (from kdenonbeta cvs), a program that reacts to D-Bus commands from HAL, and responds to changes in media.
  • SE/Linux kernel - although the SE/Linux bit is irrelevant here, it required some extensions to fuse to get it to work; it's mentioned here for completeness.

The implications of this set of... awfulness - a stack of non-standard modifications across the board - are startling to contemplate: WHY, WHY do this???

well, it began with a quite serious kernel bug, the symptom of which is that after ripping out a USB drive or a USB card without doing an unmount, that USB drive or USB card can never be seen again unless you shut down all applications, insert it and remove it twice, and even then you're not guaranteed to see it, and may have to do a shutdown of your machine.

... and yes, i do have to emphasise, this IS linux we're talking about.

the point is that umount -l of the USB card by the HAL daemon, just after it gets a hotplug event that detects that the card has gone, doesn't work as expected: an ioctl on the scsi device which tells it to re-read the partition table reacts with "Device or resource busy".
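To make the failing step concrete, here is a minimal sketch of a "re-read the partition table" request. It assumes the ioctl in question is the standard Linux BLKRRPART request; the helper name reread_partition_table is hypothetical and this is not HAL's actual code.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <linux/fs.h>   /* BLKRRPART */
#include <sys/ioctl.h>
#include <unistd.h>

/* Hypothetical helper (not taken from HAL): ask the kernel to re-read the
 * partition table of the block device `dev`.
 * Returns 0 on success or a negative errno value, for example -EBUSY when
 * something still holds the device. */
int reread_partition_table(const char *dev)
{
    assert(dev != NULL);

    int fd = open(dev, O_RDONLY);
    if (fd < 0)
        return -errno;

    /* Assumption: BLKRRPART is the request the text refers to. */
    int rc = (ioctl(fd, BLKRRPART) < 0) ? -errno : 0;

    close(fd);
    return rc;
}
```

On an affected system, a call on the card reader's device node would presumably come back as -EBUSY (the "Device or resource busy" above) for as long as an application still holds a handle on the old mount.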

Until you close all applications that used to be using that drive, and the directory handles are all closed, the scsi device cannot be released, hotplug events are not generated, and HAL cannot respond to non-existent hotplug events to remove the kernel module.

using umount -lf actually makes the problem worse, not better.

So, rather than fix _that_ bug, I introduced the possibility of a whole boat-load more.

So, why fuse? Well, fusexmp provides a proxy view of your filesystem - the entire filesystem. It's the equivalent of running NFS onto your own filesystem, or of running samba and smbfs - you get the picture.

Importantly, fusexmp provides a number of key benefits:

  • stateless file access (no file or directory handles left open)
  • independent inode numbering irrespective of the underlying fs
  • it doesn't give a monkey's about what it's proxying

The first of these is crucial to being able to pull USB drives and other media out without warning or unmount. For a file open, a stat is performed. For a file read, an open, a pread, and a close are performed. For a directory open, an opendir, a readdir of the entire directory, and a closedir are performed. For a directory read, the results cached at open time are returned. So it goes on: at no time are any handles kept open for significant periods of time.
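The file-read case can be sketched as follows. This is an illustration of the open / pread / close pattern described above, not fusexmp's actual source; the function name stateless_read is hypothetical.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: read up to `size` bytes at `offset` from `path`
 * without keeping a file descriptor open between calls.
 * Returns the number of bytes read, or -1 with errno set. */
ssize_t stateless_read(const char *path, char *buf, size_t size, off_t offset)
{
    assert(path != NULL && buf != NULL);

    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    ssize_t n = pread(fd, buf, size, offset);
    int saved_errno = errno;    /* close() may overwrite errno */

    close(fd);                  /* the handle never outlives this call */
    errno = saved_errno;
    return n;
}
```

Because every call opens and closes the file itself, nothing is left holding the underlying device once the call returns, which is the property the article relies on. The cost is an extra open and close per read.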

The second is crucial for being able to reinsert media: fusexmp receives the full path name of anything it is opening or accessing and creates its own internal inode numbering, and that numbering is what KDE's "directory notification", and presumably FAM as well, rely on.
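The idea of inode numbers keyed on path names can be sketched like this. It is a deliberately naive illustration: the fixed-size table, the names path_inode_table, table_init and inode_for_path, and the linear search are all assumptions made for the example, and fuse's real bookkeeping is certainly more elaborate.

```c
#include <assert.h>
#include <string.h>

#define MAX_ENTRIES  64
#define MAX_PATH_LEN 256

/* Hypothetical table mapping full path names to invented inode numbers. */
struct path_inode_table {
    char paths[MAX_ENTRIES][MAX_PATH_LEN];
    int  count;
};

void table_init(struct path_inode_table *t)
{
    assert(t != NULL);
    t->count = 0;
}

/* Return the invented inode number for `path`, allocating a new one the
 * first time the path is seen. Returns 0 if the table is full or the path
 * is too long. The number depends only on the path, never on the inode of
 * whatever filesystem currently sits underneath. */
unsigned long inode_for_path(struct path_inode_table *t, const char *path)
{
    assert(t != NULL && path != NULL);

    for (int i = 0; i < t->count; i++)
        if (strcmp(t->paths[i], path) == 0)
            return (unsigned long)i + 1;

    if (t->count == MAX_ENTRIES || strlen(path) >= MAX_PATH_LEN)
        return 0;

    strcpy(t->paths[t->count], path);
    t->count++;
    return (unsigned long)t->count;
}
```

With a scheme like this, a path keeps the same number across removal and reinsertion of the media, so anything watching that number does not notice the swap.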

Some people would view being able to remove media and insert it again as a distinct disadvantage (I don't), in particular where different media could be inserted and a file save operation performed by a running application on a completely different disk: it's entirely up to the user to deal with that. A technical solution is to have the mount point named after the volume serial number, or the volume label name in the case of a DVD - this _can_ be done by placing an appropriate label-aware fstab program in /etc/hal/devices.d/.

Anyway - it works, that's the main thing. There are a few things that could be done better, and there are things that might not need to be done (using autofs for example).

What could be done better? Well, implementing fusexmp as a kernel module, for a start - as a kernel module called proxyfs. I've made a start on this, but the fact that fuse implements an inode cache in userspace (!) has me a bit stumped: I am presently examining smbfs and the kernel module it is based on (ncpfs) as an alternative "starting" point. The important thing about ncpfs and smbfs is that the filesystems they access don't support inodes, so both these kernel modules need to "invent" inode numbers, in exactly the same way that fuse does [but in userspace].

Implementing proxyfs is quite straightforward: all it consists of doing is remapping the VFS calls to sys_XXXX calls! for example vfs_proxyfs_rename() consists of calling sys_rename, but first allocating some userspace memory, determining the full path name (using d_path), prepending the proxy mount point to that path name, copying that path name (which will have been created in kernel memory) into the userspace memory and then calling sys_rename. The messy bit is ensuring that the dcache entries for inodes remain up-to-date with unique inode numbers (which is where cut/paste of code from ncpfs comes in handy).
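As a userspace analogue of the rename case, here is a minimal sketch of the "prepend the prefix, then call the real operation" step. The names build_proxied_path, proxy_rename and PROXY_PATH_MAX are hypothetical, and the kernel-side details (d_path, copying into userspace memory, sys_rename) are deliberately left out.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>

#define PROXY_PATH_MAX 4096

/* Hypothetical helper: write `prefix` followed by `path` into `out`.
 * Returns 0 on success, -1 if the result does not fit in `outlen` bytes. */
int build_proxied_path(const char *prefix, const char *path,
                       char *out, size_t outlen)
{
    assert(prefix != NULL && path != NULL && out != NULL);

    size_t plen = strlen(prefix);
    size_t len  = strlen(path);
    if (plen + len + 1 > outlen)
        return -1;

    memcpy(out, prefix, plen);
    memcpy(out + plen, path, len + 1);   /* includes the trailing NUL */
    return 0;
}

/* Hypothetical analogue of vfs_proxyfs_rename(): rewrite both names, then
 * hand the work to the ordinary rename call. */
int proxy_rename(const char *prefix, const char *from, const char *to)
{
    char src[PROXY_PATH_MAX], dst[PROXY_PATH_MAX];

    if (build_proxied_path(prefix, from, src, sizeof src) != 0 ||
        build_proxied_path(prefix, to, dst, sizeof dst) != 0) {
        errno = ENAMETOOLONG;
        return -1;
    }
    return rename(src, dst);
}
```

For example, proxy_rename("/tmp", "/a.txt", "/b.txt") ends up calling rename("/tmp/a.txt", "/tmp/b.txt"). The dcache and inode-number bookkeeping mentioned above is the hard part and is not shown.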

Even if the bug in 2.6 is fixed, I still don't believe that fusexmp (or proxyfs) will be superseded: if you pull a filesystem out from under a user program such as konqueror, by using umount -l, how do you get it back???

Only by following the age-old computing adage "Got a problem? Add another layer of indirection" can the required decoupling be achieved, and i doubt very much whether _any_ linux kernel, let alone 2.6, is ever going to have "another layer of indirection" integrated seamlessly behind the scenes.

If anyone has ever successfully achieved the same goals as above (with NFS over localhost, with samba plus smbfs or cifsfs, or other) I would love to hear about it.

Proxy File Systems, posted 7 Oct 2004 at 12:43 UTC by sarum » (Journeyer)

IIRC there used to be an encrypted file system that produced its interface as local loop back NFS share, but I never got it to work so I can't tell you more.

Didn't Plan 9 have each process having its own user level version of the file system, i.e. every process including the user went through a view level which could be remapped? This was to totally hide the location of file systems from the user. When you moved from machine to machine the user saw the same view of their files. Maybe if Linux (et al.) had this sort of proxy fs built for every process then the problem could be fixed simply?

Regards Dave

user convenience, technical superiority, posted 8 Oct 2004 at 12:20 UTC by yeupou » (Master)

User convenience is the priority - not technical superiority of speed or performance. With that in mind, I feel quite comfortable in justifying, rather horrifyingly, the installation of the "fuse" example program (fusexmp) on a production system, as part of the components.

Ideally, it should be configurable... What you say is right about the average user. But not all users (by users, I do not only mean individuals).

supermount-ng, posted 8 Oct 2004 at 14:36 UTC by lkcl » (Master)

apparently, supermount is supposed to solve this very same problem: it's a little more integrated into the linux kernel, though. looking at the source code, on every access (read, opendir etc.) a check is made to find out if the mount point is still present.
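A rough userspace analogue of an "is it still mounted?" check is sketched below. It uses the common trick of comparing a directory's device number with its parent's; this is an assumption chosen for illustration, not supermount's in-kernel logic, and the name is_mount_point is hypothetical.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <stdio.h>
#include <sys/stat.h>

/* Hypothetical helper: returns 1 if `dir` looks like a mount point,
 * 0 if it does not, and -1 if it cannot be examined.
 * A directory is treated as a mount point when it lives on a different
 * device from its parent, or when it is its own parent (the root). */
int is_mount_point(const char *dir)
{
    assert(dir != NULL);

    char parent[4096];
    struct stat here, up;

    int n = snprintf(parent, sizeof parent, "%s/..", dir);
    if (n < 0 || (size_t)n >= sizeof parent)
        return -1;

    if (stat(dir, &here) != 0 || stat(parent, &up) != 0)
        return -1;

    return here.st_dev != up.st_dev || here.st_ino == up.st_ino;
}
```

Note that this heuristic misses bind mounts within a single filesystem, so it is only a sketch of the idea of checking before every access.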

fuse+fusexmp vs supermount, posted 8 Oct 2004 at 19:41 UTC by lkcl » (Master)

the differences are subtle and interesting:

- supermount mounts and takes care of access to a "sub"-filesystem, such that you can simply prepend "none supermount ..." and a few options to your fstab entries and expect it to work without changing anything else (esp. where your home directory is!)

- fusexmp gives a "second" view onto an existing filesystem, starting from "/" including, rather dangerously, itself [but not in the modified version i'm using, which proxies the user's home directory to /Documents].

fusexmp therefore potentially accesses multiple mountpoints whereas supermount manages only one [per supermount mount point, if that makes any sense].

- supermount matches every VFS call with an access to the inode functions of the underlying "sub"-filesystem it is managing... but first it does a check to see if the "sub"-filesystem is still mounted.

- fusexmp's VFS read function, by contrast, does an open, read and close: so does write, and so does readdir.

the difference is significant: i'm not yet certain as to how supermount expires (validates) its inodes properly, whereas fusexmp doesn't have to - if appropriate it does a getattr (a stat), and revalidates the inode as appropriate.

- supermount relies on the inodes of the underlying "sub" filesystem.

- fusexmp relies on the pathnames.

the difference here is that with fuse, you can rip out one disk and replace it with another (with a different directory structure) and then rip it out again, and put the original one back... and a file save will work!

i really couldn't tell you if the same thing would work under supermount!

Down to earth, posted 12 Oct 2004 at 10:50 UTC by yeupou » (Master)

Apart from that, I never understood why umount is not called unmount. Does someone have a reasonable explanation for that?

umount is 6 letters, posted 12 Oct 2004 at 22:09 UTC by splork » (Master)

ancient unix and C (and fortran, probably others) had a limit of 6 characters for identifiers in some place or another. Look at all functions in the C standard library. find any >6 letters?

abandoning fuse, investigating lufs, posted 14 Oct 2004 at 10:14 UTC by lkcl » (Master)

an attempt to combine fuse plus its example fusexmp into a kernel module proved difficult because the inode allocation for the kernel is done in userspace (!)

lufs, another userspace filesystem, does the opposite: it allocates inodes in kernel... but like fuse, it does have a userspace cache of filename/directoryname entries: unlike fuse, however, it doesn't have the inode numbers associated with them.

consequently, fuse is able to pass back the userspace-allocated-inode-numbers to the userspace program, for _userspace_ to perform a lookup of that inode into a full path name, whilst lufs reconstructs the full path name in kernelspace and passes _that_ across to the userspace daemon.

anyway - supermount incurs a quite dramatic performance hit (lots more disk access for evvveerryything) but the fuse+fusexmp approach only has a performance hit for the mountpoint that the user is accessing media through.

Um..., posted 15 Oct 2004 at 16:53 UTC by pphaneuf » (Journeyer)

On the SuSE 9.1 that I have on my box, they seem to use submount in order to support this perfectly well. I just plop in a CompactFlash in my USB reader, go in the directory and play with it, then pluck it out.

yes, um :), posted 17 Oct 2004 at 19:11 UTC by lkcl » (Master)

do you have:

- a 2.6 kernel
- kde 3.3.0

if so, have you tried running the little c program attached to the bugreport above, on a directory of your usb card? (it does an opendir() and then sleeps for an hour)
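for anyone without the bug report to hand, a test along those lines might look like the sketch below. it is a reconstruction from the one-line description above, not the attached program itself, and the function name hold_directory_open is hypothetical.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <dirent.h>
#include <unistd.h>

/* Hypothetical reconstruction: open a directory handle on `path` and keep
 * it open for `seconds` before closing it.
 * Returns 0 on success, -1 if the directory could not be opened. */
int hold_directory_open(const char *path, unsigned int seconds)
{
    assert(path != NULL);

    DIR *d = opendir(path);
    if (d == NULL)
        return -1;

    sleep(seconds);   /* pull the media out while this is sleeping */

    closedir(d);
    return 0;
}
```

calling hold_directory_open() with 3600 seconds on a directory of the usb card, then pulling the card, reproduces the "handle still open" situation described in the article.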

thanks!, posted 17 Oct 2004 at 19:15 UTC by lkcl » (Master)

hm, submount is not a lot of code: i wonder what's so special about it...
