One of the most annoying things about running traffic statistics over a proxy log is the use of "magic hostnames". There's a few examples of this worth mentioning:
- blogspot/wordpress - one hostname per blog site
- pheedo.com - using [md5].img.pheedo.com for an image URL
- youtube/googlevideo/limewire/bitgravity/etc - CDNs which have many, many content servers which requests are directed to via some method
Caching issues aside, this makes it difficult to do straight "per-host" statistics as one can have an entire site "hidden" underneath many, many hostnames which individually serve a small amount of traffic.
Anyway. The naive way of working around this is to craft rules to aggregate specific domains as needed. I've done this in my stats software so I can generate daily statistic sets that are manageable. This gives me live data to play with. :)
Another way is to simply figure out the top level / second level domains and aggregate them at that level. So, you'd aggregate *.[domain].com ; *.[domain].net ; but not *.[domain].au. For .au you would aggregate *.[domain].com.au and so on. This should work fine (as the country domain name structure is reasonably static these days) but it does mean you end up hiding a lot of potential information from the domain name. For example, a CDN may have [server].[client].domain.com - seeing the specific per-client traffic statistics may help identify which websites are causing the traffic, versus just seeing *.domain.com.
Out of curiousity, I decided to come up with some way of figuring out where these domain names branch out into multiple sites. Partially so I can generate some aggregation rules, but partially so I can later on begin accounting traffic to both individual hosts and all parts of the domain name.
Anyway. This was easy to solve in a very, very naive way. I simply built a tree in memory based on the reversed hostname (so foo.domain.com -> moc.niamod.oof) so I could easily identify where in the name the branching occurs. I then just created one vertex per character.
Every time I added a hostname to the tree I incremented a reference count for all vertex nodes in the path.
Finally, to figure out which prefixes were the most prevalent, I simply depth-first searched the tree via recursion, looking for nodes that met a certain criteria - specifically, the node character was "." and the refcount was above some arbitrary hard coded limit (8). I then sorted the resulting list based on refcount.
The result:
844611: .com 82937: .net 51478: .org 36302: .blogspot.com 18525: .uk 17246: .wordpress.com 16527: .co.uk 15297: .info 15237: .ru 14790: .ningim.com 12359: .pheedo.com 12355: .img.pheedo.com 12328: .edu 9992: .files.wordpress.com 9980: .live.com 9578: .de 8430: .us 7171: .deviantart.com 6484: .photofile.ru 6481: .users.photofile.ru 6112: .profile.live.com 5197: .yuku.com 5044: .stats.esomniture.com 5044: .esomniture.com 4960: .avatar.yuku.com 4817: .bigtube.com 4659: .ss.bigtube.com 4541: .llnwd.net 4246: .au 4161: .vo.llnwd.net
.. so, as expected really. This host list doesn't include all of the seen hosts over a month of proxy access as I had done some pre-aggregation so the host list database was managable.
Now, the hacky part.
This wasn't a string B*Tree - the vertex children were -not- sorted. This means searches aren't as efficient as they could have been but for the data set in question (11 million hosts) the algorithm ran in O(perfectly sensible) time and didn't degrade noticably as the data set increased. Adding the extra code to do proper insertion and lookup optimisation would make it faster sure but I wanted to see if I had to care or not. No, I didn't have to care.
It was written in C. Yes, with no pre-defined tree datatypes. This is why I wanted to do the minimum effort required. :)
I initially had domain name fragments for the vertex nodes (ie, foo.domain.com became "foo"->"domain"->"com") but due to the distribution of the strings (ie, a -lot- under .com), the search speed (and thus insert speed) was very bad. I knew the worst case performance of the unsorted-node B*tree would be bad (ie, O(tree depth * number of entries per node)) and I absolutely, positively hit it with .com .
Finally, I needed some way of printing out the hostname given a particular vertex node. A typical method of doing this via recursion is to pass in a "prefix string" into the recursive function which gets modified as nodes are visited. I then realised I could simply traverse the tree backwards to assemble the string when needed. I didn't need to try and manage a variable-length string buffer; or artificially limit how long the string could be (and track that.)
In summary, this is all basic first year computer science data structures clue. It was an amusing way to spend an hour. It will be much more fun to merge this with the statistics code to provide domain-based aggregate statistics...