Fighto time.
Recently a question was posed to me in which I
tended to offer a reasonably off the cuff response for. This
led to an interesting debate about if set<string> was going to be
hugely slower than hash_set<string> for the exact case where
hash_set<> should whip an AVL tree's butt: direct lookups.
So without going into that conversation I decided to benchmark the two
std::collections from both stdc++ and stlport 4.x. This is using gcc
4.0.2 which is shameful as I should have a more recent gcc. I'll likely rereun it on icc and 4.1.x
as well.
The core of the code is to read strings from cin and shove them into a
std::list. During the set<> parts I create a set with the list (which
will have dups) and then iterate the list 50 times looking for each
entry (including dups again) in the built set<> or hash_set<>.
There is of course some cruft there to select the right container from
stdc++ and stlport because hash_set is non standard.
if( use_hash )
{
l_t::iterator e = l.end();
for( l_t::iterator iter = l.begin(); iter != e; ++iter )
hstrset.insert( *iter );
for( int i=0; i<LOOKUPS; ++i )
for( l_t::iterator iter = l.begin(); iter != e; ++iter )
hstrset.find( *iter );
}
else
{
l_t::iterator e = l.end();
for( l_t::iterator iter = l.begin(); iter != e; ++iter )
strset.insert( *iter );
for( int i=0; i<LOOKUPS; ++i )
for( l_t::iterator iter = l.begin(); iter != e; ++iter )
strset.find( *iter );
}
So the benchmarks, all compiled with -O9. Other gcc options don't seem
to make any real effect. I created input from Gutenberg files, l.size
is the number of words read. The hash_set methods are quicker for the
completely degenerate case of only doing direct lookups and doing each
of them at least 50 times per uniq word in the input.
Perhaps the most interesting point is the difference in speed between
stlport and libstdc++ for this. I am now very interested to see how
stlport5.x compares.
# Using stdc++::set<>
foo$ time cat /tmp/largetxt.txt | ./string_xset
l.size:273435
use_hash:0
real 0m16.980s
user 0m16.493s
sys 0m0.028s
# Using stlport::set<>
foo$ time cat /tmp/largetxt.txt | ./string_xset_stlport
l.size:273435
use_hash:0
real 0m10.184s
user 0m9.821s
sys 0m0.084s
# Using stdc++::hash_set<>
foo$ time cat /tmp/largetxt.txt | ./string_xset 1
l.size:273435
use_hash:1
real 0m4.061s
user 0m3.868s
sys 0m0.024s
# Using stlport::hash_set<>
foo$ time cat /tmp/largetxt.txt | ./string_xset_stlport 1
l.size:273435
use_hash:1
real 0m2.430s
user 0m2.328s
sys 0m0.012s