Advogato: Blog for monkeyiq

Fighto time.

Recently a question was posed to me in which I tended to offer a reasonably off the cuff response for. This led to an interesting debate about if set<string> was going to be hugely slower than hash_set<string> for the exact case where hash_set<> should whip an AVL tree's butt: direct lookups.

So without going into that conversation I decided to benchmark the two std::collections from both stdc++ and stlport 4.x. This is using gcc 4.0.2 which is shameful as I should have a more recent gcc. I'll likely rereun it on icc and 4.1.x as well.

The core of the code is to read strings from cin and shove them into a std::list. During the set<> parts I create a set with the list (which will have dups) and then iterate the list 50 times looking for each entry (including dups again) in the built set<> or hash_set<>.

There is of course some cruft there to select the right container from stdc++ and stlport because hash_set is non standard.

    if( use_hash )
    {
        l_t::iterator e = l.end();
        for( l_t::iterator iter = l.begin(); iter != e; ++iter )
            hstrset.insert( *iter );
        for( int i=0; i<LOOKUPS; ++i )
            for( l_t::iterator iter = l.begin(); iter != e; ++iter )
                hstrset.find( *iter );
    }
    else
    {
        l_t::iterator e = l.end();
        for( l_t::iterator iter = l.begin(); iter != e; ++iter )
            strset.insert( *iter );
        for( int i=0; i<LOOKUPS; ++i )
            for( l_t::iterator iter = l.begin(); iter != e; ++iter )
                strset.find( *iter );
    }

So the benchmarks, all compiled with -O9. Other gcc options don't seem to make any real effect. I created input from Gutenberg files, l.size is the number of words read. The hash_set methods are quicker for the completely degenerate case of only doing direct lookups and doing each of them at least 50 times per uniq word in the input.

Perhaps the most interesting point is the difference in speed between stlport and libstdc++ for this. I am now very interested to see how stlport5.x compares.

# Using stdc++::set<> foo$ time cat /tmp/largetxt.txt | ./string_xset l.size:273435 use_hash:0

real 0m16.980s user 0m16.493s sys 0m0.028s

# Using stlport::set<> foo$ time cat /tmp/largetxt.txt | ./string_xset_stlport l.size:273435 use_hash:0

real 0m10.184s user 0m9.821s sys 0m0.084s

# Using stdc++::hash_set<> foo$ time cat /tmp/largetxt.txt | ./string_xset 1 l.size:273435 use_hash:1

real 0m4.061s user 0m3.868s sys 0m0.024s

# Using stlport::hash_set<> foo$ time cat /tmp/largetxt.txt | ./string_xset_stlport 1 l.size:273435 use_hash:1

real 0m2.430s user 0m2.328s sys 0m0.012s

18 Mar 2006 monkeyiq » (Journeyer)