Recent blog entries for monkeyiq

libferris and ego meet Google Earth. The new release 1.1.96 allows nice integration with Google Earth, letting you do desktop searches from within Google Earth as well as show where a file is located by clicking on it in ego.

See the top of this page for a screen animation.

Desktop search?

I noticed on planet KDE this post on desktop search. It mentioned not using xattr for metadata because some filesystems don't support it. I'd say that most filesystems don't: ISO filesystems, NFS (depending on setup), HTTP, FTP and XEmacs buffers among them. The simple solution to all this in libferris is to virtualise xattr just like the filesystem itself is virtual. So you store the xattr in RDF when the underlying filesystem doesn't allow it.
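The idea can be sketched in a few lines. This is a minimal, Linux-only illustration, not libferris code: an in-memory map stands in for the RDF side store, and the set_ea/get_ea names are made up for the example.

```cpp
#include <sys/xattr.h>
#include <map>
#include <string>

// Hypothetical side store standing in for the RDF store libferris uses.
static std::map<std::string, std::map<std::string, std::string> > g_sideStore;

// Try the kernel xattr interface first; if the underlying filesystem
// rejects it (or the "file" is a virtual object with no kernel path at
// all), fall back to the side store keyed by path.
bool set_ea( const std::string& path,
             const std::string& name,
             const std::string& value )
{
    if( ::setxattr( path.c_str(), name.c_str(),
                    value.data(), value.size(), 0 ) == 0 )
        return true;                    // filesystem supports xattr natively
    g_sideStore[ path ][ name ] = value; // ENOTSUP, ENOENT etc: virtualise it
    return true;
}

std::string get_ea( const std::string& path, const std::string& name )
{
    char buf[ 4096 ];
    ssize_t sz = ::getxattr( path.c_str(), name.c_str(), buf, sizeof(buf) );
    if( sz >= 0 )
        return std::string( buf, sz );
    return g_sideStore[ path ][ name ];  // fall back to the side store
}
```

From the caller's point of view the attribute just sticks, whether the backing filesystem is ext3, an ISO image, or an XEmacs buffer.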

I should also highlight that the tagging mentioned in the post referenced by the above post is already available and usable with libferris :) You can attach arbitrary metadata to virtual filesystem objects, index it and search based on that metadata. Indexing can be done in many formats: Lucene, PostgreSQL, RDF using Redland (db4, SQLite, PostgreSQL) or an LDAP server.

For indexing see here. For EA stuff here.

FUSE.sf.net meets libferris

Yes folks, it's true: you can now mount libferris through the kernel using FUSE. The goodness of your XEmacs session becoming a kernel filesystem, mounting Firefox through ferris and FUSE... mmm, filesystems ease the pain.

I'm still trying to get a more advanced article about libferris usage out there. Things are starting to get rather interesting now, because the stacked filesystems in libferris, and their ultimate exposure through FUSE, let you do some rather funky things with data that comes from (and returns to) many and varied places.

Fighto time.

Recently a question was posed to me to which I offered a reasonably off-the-cuff response. This led to an interesting debate about whether set&lt;string&gt; would be hugely slower than hash_set&lt;string&gt; for the exact case where hash_set&lt;&gt; should whip a balanced tree's butt: direct lookups.

So without going into that conversation, I decided to benchmark the two containers from both libstdc++ and STLport 4.x. This is using gcc 4.0.2, which is shameful as I should have a more recent gcc. I'll likely rerun it on icc and 4.1.x as well.

The core of the code reads strings from cin and shoves them into a std::list. For the set&lt;&gt; parts I build a set from the list (the list will have dups) and then iterate the list 50 times, looking up each entry (including dups again) in the built set&lt;&gt; or hash_set&lt;&gt;.

There is of course some cruft there to select the right container from libstdc++ and STLport, because hash_set is non-standard.

    if( use_hash )
    {
        // hash_set<> branch: insert every word from the list, then look
        // each one up LOOKUPS times.
        l_t::iterator e = l.end();
        for( l_t::iterator iter = l.begin(); iter != e; ++iter )
            hstrset.insert( *iter );
        for( int i=0; i<LOOKUPS; ++i )
            for( l_t::iterator iter = l.begin(); iter != e; ++iter )
                hstrset.find( *iter );
    }
    else
    {
        // set<> branch: identical workload against the tree-based container.
        l_t::iterator e = l.end();
        for( l_t::iterator iter = l.begin(); iter != e; ++iter )
            strset.insert( *iter );
        for( int i=0; i<LOOKUPS; ++i )
            for( l_t::iterator iter = l.begin(); iter != e; ++iter )
                strset.find( *iter );
    }
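For reference, here is a self-contained modern sketch of the same harness. It uses std::unordered_set, the standardised descendant of hash_set, and std::chrono for timing; the structure mirrors the snippet above but the names and setup are mine, not the original benchmark's.

```cpp
#include <chrono>
#include <list>
#include <set>
#include <string>
#include <unordered_set>

static const int LOOKUPS = 50;

// Run the insert-then-find workload against any set-like container and
// return the elapsed milliseconds. 'found' counts successful lookups so
// the optimiser cannot discard the find() calls.
template< class Set >
long benchmark( const std::list<std::string>& l, std::size_t& found )
{
    typedef std::chrono::steady_clock clock;
    clock::time_point start = clock::now();

    Set s;
    for( std::list<std::string>::const_iterator it = l.begin(); it != l.end(); ++it )
        s.insert( *it );

    found = 0;
    for( int i = 0; i < LOOKUPS; ++i )
        for( std::list<std::string>::const_iterator it = l.begin(); it != l.end(); ++it )
            if( s.find( *it ) != s.end() )
                ++found;

    return std::chrono::duration_cast<std::chrono::milliseconds>(
        clock::now() - start ).count();
}
```

Calling benchmark&lt; std::set&lt;std::string&gt; &gt;( l, found ) versus benchmark&lt; std::unordered_set&lt;std::string&gt; &gt;( l, found ) on the same word list gives the tree-versus-hash comparison directly; every lookup should succeed, so found ends up as l.size() * LOOKUPS in both cases.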

So, the benchmarks, all compiled with -O9 (other gcc options didn't seem to have any real effect). I created the input from Project Gutenberg files; l.size is the number of words read. The hash_set&lt;&gt; methods are quicker for the completely degenerate case of only doing direct lookups, and doing each of them at least 50 times per unique word in the input.

Perhaps the most interesting point is the difference in speed between STLport and libstdc++ here. I am now very interested to see how STLport 5.x compares.


    # Using stdc++::set<>
    foo$ time cat /tmp/largetxt.txt | ./string_xset
    l.size:273435 use_hash:0
    real    0m16.980s
    user    0m16.493s
    sys     0m0.028s

    # Using stlport::set<>
    foo$ time cat /tmp/largetxt.txt | ./string_xset_stlport
    l.size:273435 use_hash:0
    real    0m10.184s
    user    0m9.821s
    sys     0m0.084s

    # Using stdc++::hash_set<>
    foo$ time cat /tmp/largetxt.txt | ./string_xset 1
    l.size:273435 use_hash:1
    real    0m4.061s
    user    0m3.868s
    sys     0m0.024s

    # Using stlport::hash_set<>
    foo$ time cat /tmp/largetxt.txt | ./string_xset_stlport 1
    l.size:273435 use_hash:1
    real    0m2.430s
    user    0m2.328s
    sys     0m0.012s

Moving back to blogging here for a while... it seems using university equipment for this is not so optimal: loss of control over the network environment, etc.

Moving my blog over to UQ because it's much easier to update and I can play with blosxom as well :)

Read my article at xml.com on Using libferris with XML.

Enjoy.

Novell's VFS

At Edd's blog there is mention of Novell open sourcing their VFS and the possibility of integrating it into GNOME. Very interesting; I should add support for mounting Simias to libferris, hopefully sometime shortly after they release it. I suspect it might be somewhat hard to install Simias at first though (maybe they want to beat libferris for difficulty of installation ;)

It does make me wonder a little about the whole 'sharing VFS' idea that comes up from time to time (search freedesktop.org's mailing lists): whether there can be code sharing at the level of an Extended Attribute extractor (like something that extracts EXIF data from EXIF/JFIF images). I think this is the most realistic level at which the VFS systems can interoperate, as there is an EA interface and a binary chunk that the data is extracted from. Though even at this 'grass roots' level libferris differs from C based VFS systems in that data is accessed through the C++ IOStreams interface... though bolting an obj-&gt;read( offset, count ) onto that should be trivial.
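A minimal sketch of that bolt-on, assuming only a std::istream underneath; the ferris_read name is illustrative, not a real libferris or gnome-vfs entry point:

```cpp
#include <istream>
#include <string>

// Adapt the C++ IOStreams interface to the read( offset, count ) style
// a C based VFS would expect. Returns the bytes actually read, which
// may be fewer than 'count' near end of stream.
std::string ferris_read( std::istream& ss, std::streamoff offset, std::size_t count )
{
    ss.clear();                       // reset any eof/fail state from earlier reads
    ss.seekg( offset, std::ios::beg );
    std::string buf( count, '\0' );
    ss.read( &buf[0], count );
    buf.resize( ss.gcount() );        // trim to what was actually available
    return buf;
}
```

A C wrapper could then hand out (buffer, length) pairs from this without the caller ever knowing an IOStream was involved.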

On the flipside, I've been adding support for Adobe's XMP metadata embedding to libferris. This should make for a reasonably kickass /usr/local/doc search engine, with the new EA from PDF files and the fulltext indexing of HTML and PDF etc. that ferris already does. I think I've managed to sniff out most of the embedded RDF standards now; after XMP there is the JPEG COMMENT1 block RDF standard that I've yet to code support for.

I might have to make mod_ferris for apache soon so that a web interface to the EA/indexes can be created... maybe.

Dude, where's my filesystem

Trying to use evolution-data-server in little clients, not going all that gleefully as yet... must mount evolution... must mount ...

Have now returned from ICFCA04, the international conference on Formal Concept Analysis, which was in Sydney this year. Interesting mix of computer science folks and raw maths guys there. I suspect that next year I'll make my talk involve many more pictures and mathematical descriptions of what I am thinking, rather than the near-code-level talks that I tend to be attracted to.

Went along to SLUG while I was there too. Some very interesting chats always seem to be waiting at that user group... Chatted with jdub for a while over dinner about GNOME, explained some of the crack that I've sprinkled over libferris, and showed some lattices from FCA that I plan to make into the future of n-dimensional filesystem browsing (though not in those terms ;). From what I've heard there are some very interesting plans for the future of gnome-vfs... It would be interesting to chat with some of the dudes planning to extend gnome-vfs, as I think the open source community needs to move fairly quickly to create an opposing system to WinFS. Parts of WinFS seem really cool and we definitely want that coolness too (though hopefully we can improve some parts of WinFS that are not so cool).
