Poisoned cookies

Posted 6 Sep 2000 at 20:27 UTC by mskala

Marketers invade our privacy to compile databases of our interests and buying habits, and waste our bandwidth with advertising. The technologies they use get more baroque daily - from cookies to Web bugs to :CueCat and the Amazon price-changing scheme that made Slashdot today. The usual solution is passive: disable all the "phone home" features on the software we use, filter out the spam, and complain a lot. But what about a more active approach? With cable modems and DSL we've got lots of receive bandwidth to play with. Why not write a fake client to simulate the behaviour of a real human surfer, while leaking false marketing data to anyone who snoops on it? Could such a program poison the databases and make privacy invasion an unprofitable exercise?

I'm not the first to suggest this, of course, but I haven't seen anyone actually build it yet. Perhaps the time has come. For a long time, a lot of us have been entering fake data in marketing surveys and so forth - but when it's a small-scale individual effort, the resulting data points will just be filtered as noise. To really make a difference, I think the fake data has to be a big enough fraction of total traffic to be caught by data mining (perhaps five or six hits on a site per day) and consistent enough to look like a trend. Can we create one or more fake "market segments" and convince the advertisers to target them?

The system I envision would allow one to write a "persona template" describing a fake person. You install the client on your system with one or more persona templates, and it generates some imaginary people based on those templates. Each imaginary person (or "persona") would have his or her own complete set of personal data and Web surfing habits. Then the software would assume the roles of the personae and go surfing - downloading Web pages, accepting cookies, making search engine queries, etc. The marketers would assemble their profiles of the personae and never know which profiles were human beings and which were robots.
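A minimal sketch of what a "persona template" and the personae generated from it might look like. Nothing here comes from an existing program: the class names, the field names, and the example template are all hypothetical, and the actual surfing (page downloads, cookies, search queries) is left out.

```python
import random
from dataclasses import dataclass, field


@dataclass
class PersonaTemplate:
    """Describes a family of fake people (all field names are hypothetical)."""
    name: str
    age_range: tuple               # (min_age, max_age), inclusive
    interests: list                # topics this fake market segment cares about
    hits_per_day: tuple = (5, 6)   # "five or six hits on a site per day"


@dataclass
class Persona:
    """One imaginary person generated from a template."""
    template: str
    age: int
    interests: list
    hits_per_day: int
    cookies: dict = field(default_factory=dict)  # each persona keeps its own cookie jar


def generate_personas(template, count, seed=None):
    """Create `count` personae that vary individually but share the template's trend."""
    rng = random.Random(seed)
    personas = []
    for _ in range(count):
        # Each persona gets a non-empty random subset of the template's interests.
        k = rng.randint(1, len(template.interests))
        personas.append(Persona(
            template=template.name,
            age=rng.randint(*template.age_range),
            interests=rng.sample(template.interests, k),
            hits_per_day=rng.randint(*template.hits_per_day),
        ))
    return personas


# A made-up template, purely for illustration.
template = PersonaTemplate(
    name="retired-unicyclists",
    age_range=(65, 80),
    interests=["unicycles", "opera", "gardening"],
)
for persona in generate_personas(template, 3, seed=1):
    print(persona)
```

Keeping the template separate from the generated personae is what would let people trade templates while every installation still produces its own distinct imaginary people.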

We could make it fun to design imaginative templates, and there'd be an incentive to distribute your template because the more people using it, the more likely the marketers would target it. The system would become a sort of a sport: you win if you can convince the marketers to do something stupid. Such an unequal contest might seem cruel to the marketers (the same kinds of objection raised to activities like "sport fishing"), but I don't have much sympathy with marketers and don't object to breaking their "business models" for my own amusement. Apart from the use of making trouble for marketers, this kind of system could be useful in testing intranet servers. If you want a realistic load on your server, it could be a big step up from the usual "download a page a thousand times" script you might hack up in ten minutes.

Legal consequences would be worth thinking about. There should not be anything illegal about downloading Web pages - it's not a denial of service attack, the activities involved are by definition intended to be indistinguishable from normal use of the Web - but once the marketers figured out what was going on, they might well try to sue somebody. I suggest that any such system ought to be explicitly GPLed and the copyrights assigned to the FSF. Public domain anonymous publication might be less hair-raising, but could be seen as an admission by the authors that there was something questionable about the system. A lot of weird issues could conceivably be raised - for instance, if the program claims falsely to be IE or Netscape when it downloads the pages, then the makers of those packages could allege some sort of trademark infringement.

The ethical side also merits discussion - is it okay to pollute marketing databases? My view is an emphatic yes, but it would be interesting to hear the other side. A user could easily subvert their own copy to abuse a "you're paid for hits on this page" scheme. They'd be hard-pressed to get others to help them with that (if I'm gonna abuse such a system I'll do it for my own profit, not yours) but it could still be a problem. I see no way to prevent it and am not sure it needs to be prevented.

So... if we had enough participants, could this work? If it could work given enough participants, could we get that many participants? How much trouble could we get into? Who would volunteer to help write it?

You may want to read a message I wrote to the local Linux club's mailing list about this concept and of course Neal Stephenson's Hack the Spew is mandatory.

Income, posted 6 Sep 2000 at 21:06 UTC by deekayen » (Master)

I think you're leaving one important thing out of the picture. Advertisers are willing to pay as much as they do because ad companies are getting better at assuring them that their ads will make it to the correct audience. If burstnet or imigs can't assure people that their banners will hit the correct audience, then sites that have ads for funding are hurt. Sure a lot of us put up sites and we could probably run the bandwidth off a 56K modem (although we don't), but what if Altavista shut down because advertisers wouldn't pay them enough to keep their equipment going?

Don't get me wrong, I'm not working for an ad company or something, but I think ads in most cases are a necessary evil. I don't think any webmaster puts them on their site because he thinks they're fun to look at. If you hate information collection so bad, I think it might just be easier to put up a cookie crusher program that singles out ad cookies... and I'm sure there are plenty out there somewhere.

One area of weakness, posted 6 Sep 2000 at 22:04 UTC by stimuli » (Journeyer)

It sounds like a lovely idea. One area of marketing that it cannot subvert is the tracking of actual purchases -- I don't imagine we'd let these fake personas actually buy anything. Now that data, who is willing to spend how much for what, is perhaps the most valuable of all.

ultimately, posted 7 Sep 2000 at 05:06 UTC by darkewolf » (Journeyer)

Ultimately it's not that they are doing these demographics that is the problem. The problem is that they are more often than not doing demographics and data profiling but not telling the customer. I am going to lay odds that many of these ad companies keep a fairly good track of who looks at what over multiple sites, using a mixture of 'webbugs' and IP address tracking (yes, proxies stuff this up a fair bit, but there is the Originating-IP header).

If someone is tracked, that's okay, iff they are informed about this and given the option not to be tracked. As mentioned, this also occurs in software (even on GNU/Linux); for instance MTV (a commercial MPEG player) uses /usr/bin/mail to email usage to the central office, whether you register the software or warez it.

Poisoning the demographics might actually be an advantage: rather than product decisions being made on the norm or the mean, the marketers might actually ask a cross section of people and get a range of options.

Go fer it: it's your privacy, your life, and ultimately the individual and the community have more power than the corporate.

Re: Poisoned cookies, posted 7 Sep 2000 at 07:48 UTC by jLoki » (Master)

mskala: the idea of data poisoning is not new. If I got that right, it all got big in the late eighties when people would trade their grocery savings cards amongst each other to confuse the living crap out of the - back then - barely networked databases Safeway and co. had up to track customer behavior.

Nowadays the web is data mining source number one (but just barely outnumbering the Safeway Savings Card and mail-in rebates) and it's not all cookies we're dealing with here. Anyone with a bit of a brain and a powerful database (Oracle will do for now) is able to serve banners over his own server and analyze IP addresses and clickthroughs based on a simple CLF logfile.
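To make the point about logfiles concrete, here is a rough sketch of how little is needed to tally banner impressions and clickthroughs per IP address from Common Log Format lines, with no cookies involved. The URL layout (banners under /banners/, clicks routed through /click/) and the sample log lines are assumptions made up for the example, not any real ad server's scheme.

```python
import re
from collections import Counter

# Common Log Format: host ident authuser [date] "request" status bytes
CLF_LINE = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" (?P<status>\d{3}) (?P<size>\S+)'
)


def summarize(lines, click_prefix="/click/"):
    """Count banner impressions and clickthroughs per client IP.

    The "/click/" prefix is a hypothetical convention for this sketch.
    """
    impressions = Counter()
    clicks = Counter()
    for line in lines:
        match = CLF_LINE.match(line)
        if match is None:
            continue  # skip lines that are not valid CLF
        host = match.group("host")
        if match.group("path").startswith(click_prefix):
            clicks[host] += 1
        else:
            impressions[host] += 1
    return impressions, clicks


# Invented sample data using documentation-only IP addresses.
sample_log = [
    '192.0.2.1 - - [07/Sep/2000:07:48:00 +0000] "GET /banners/a.gif HTTP/1.0" 200 2326',
    '192.0.2.1 - - [07/Sep/2000:07:48:05 +0000] "GET /click/a HTTP/1.0" 302 0',
    '192.0.2.2 - - [07/Sep/2000:07:49:10 +0000] "GET /banners/b.gif HTTP/1.0" 200 1811',
]
impressions, clicks = summarize(sample_log)
for host in impressions:
    print(host, "impressions:", impressions[host], "clicks:", clicks[host])
```

This is also why cookie poisoning alone would not be enough: the IP address and request path are recorded whether or not a cookie is ever set.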

While cookies play a bigger and bigger role in this game, it's still the good old methods that are used to cross-verify cookie generated impressions. By poisoning cookies you'll introduce worthless data into the databases, inevitably generating glitches towards a false impression, but today's mining tools are smart enough not to let these impressions screw up the whole analysis.

Let's dive one step deeper into this scenario and take a look at, say, freshmeat. At first glance, Andover's "take over" seems like the friendly gesture from your neighbor offering you money and a safe place to live because he likes what you do. Look at it from a different angle and all of a sudden there are Linux vendors who'd kill to get Freshmeat's data in a processed form: which applications are most wanted, which ones are uploaded when, etc. - suddenly one realizes that the ability to datamine has made sites the size of Freshmeat or Slashdot possible, not just pure friendliness (there is no friendliness in Pre-IPO-Land).

Back to the topic. To poison data you need to do more than just fake surfing behavior, like shopping irregularly and completely against pattern, or simply giving wrong answers on surveys. Even 200 bots cannot introduce any significant poisoning into the databases considering the amount of "dumb" people out there that will give them right data.

I, personally, handle this issue like the Safeway Savings Card - I do not contribute. A nice little Squid does its share of the work in keeping me out of the records and basically I don't mind others contributing or even making money off these mined data.

Cookies are not all bad, I certainly don't want to expose the legitimate users of them (think advogato, Sourceforge, THATware or even foundation) to some bot running amok.

A more proportionate response is needed, posted 7 Sep 2000 at 12:34 UTC by PaulJohnson » (Journeyer)

I'm not sure that database poisoning is either useful or necessary. I take steps to guard my own privacy on-line, but I don't necessarily see data-mining activities of this sort as being inimical to my interests.

There are two main costs to the consumer of this kind of activity:

  • Invasion of privacy. This is hard to quantify, but there is no doubt that people dislike the idea of having large chunks of their "private" lives available to the inspection of unknown and unaccountable individuals.
  • Spam handling. Receiving an advert has a cost in bandwidth (AtGuard currently tells me it has blocked some 15Mb of banner adverts since my last reboot) and in mindwidth. I need to decide what to do with it, even if only to delete it or otherwise ignore it. Also many banner adverts are animated in order to grab attention. This is seriously horrible (and AtGuard nicely rewrites them to run once, BTW).

(Aside: I gather AtGuard is now integrated in Norton Internet Tools 2000 or some such name).

Advertisers are in a position to impose these costs on us without much consideration as to whether we want to pay them. What we need is a way to return these costs to sender in order to punish the baddies and reward the goodies. Blanket punishment of everyone doesn't improve behaviour because it just becomes a part of the cost of doing business.

Bear in mind that the point of such activities is to offer me goods that I want to buy. An advert which merely annoys me has done the advertiser no good at all. The problem at the moment is that this process is so hit and miss that the cost of receiving and comprehending an advert outweighs the probable benefit of finding something I want. If the accuracy of advertising can rise to the point that I find the adverts useful and informative then it becomes a net benefit to me instead of a net cost. If, for example, Amazon deduce from my purchases that I am a Terry Pratchett fan then they can email me with news of the next TP book, and I would be interested to see that. OTOH if they sell the list to someone selling TP merchandise then I won't be interested because I don't generally buy fan merchandise.


And about surveys, posted 7 Sep 2000 at 14:09 UTC by stimuli » (Journeyer)

And concerning surveys: subverting them is possible with some concerted effort. Keep in mind, when someone puts a form on their page asking a series of questions about background and salary and whatnot, I'm sure they are hoping to find that the more affluent folks are using their site. They'd much rather be able to sell ad space to big money car makers than to phone psychics.

Randomly filling in the forms is not enough; such "noise" will just disappear in the shuffle. We should all fill in the forms as if we were underemployed, young, ethnic, welfare moms. Now, if enough people did that, we'd actually push the aggregate in a downward direction, at least with regard to perceived wealth. This would act, I think, to subvert the goals of the survey more than anything else.
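The difference between random and coordinated answers can be shown with a little arithmetic. The numbers below are invented for illustration: income is reported on a 1-to-5 bracket scale, 1000 genuine respondents average 3.0 (the middle of the scale), and 200 fake responses are added. Uniformly random fakes also average 3.0 in expectation, so the aggregate does not move, while fakes that all pick the lowest bracket pull it down.

```python
def blended_mean(real_mean, real_count, fake_mean, fake_count):
    """Mean of the combined survey when fake responses are mixed with real ones."""
    total = real_mean * real_count + fake_mean * fake_count
    return total / (real_count + fake_count)


# All figures are assumptions chosen for the example.
REAL_MEAN = 3.0     # genuine respondents average the middle bracket
REAL_COUNT = 1000
FAKE_COUNT = 200

# Random answers over brackets 1..5 have an expected mean of 3.0.
random_fill = blended_mean(REAL_MEAN, REAL_COUNT, 3.0, FAKE_COUNT)

# Coordinated answers: every fake respondent reports the lowest bracket.
coordinated = blended_mean(REAL_MEAN, REAL_COUNT, 1.0, FAKE_COUNT)

print("random fill:", random_fill)
print("coordinated:", round(coordinated, 3))
```

If the genuine answers were not centered on the scale, random fakes would nudge the mean toward the middle, but still far less than the same number of fakes all answering the same way.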

How would poisoned data help individual privacy?, posted 7 Sep 2000 at 16:05 UTC by slothrop » (Apprentice)

Creating "fake" clickthrough and adview data by sending web-user bots stumbling all over your least favorite online businesses does not seem beneficial to the privacy of the individual who's dispatched the bots. Say a site's marketing data has been horribly munged by a legion of well-meaning anti-demographics generators. Now everyone visiting the site gets totally random, useless, annoying ads, the advertisers drop their business with the site because they are getting no clickthroughs, and traffic to the site dries up as people are driven away by pointless ads and the site's increasingly bland services, decaying from lack of ad revenue. Even if you block ads already, you still get pissed when the site runs down, and stop using it (why do you care about a site's marketing data if you don't use the site?).

Legitimate sites, used by many people (even people who willingly submit to a site's privacy policy and have read it in full) use banner ads and cookies to provide real, convenient services and (occasionally) well-targeted, useful advertising. It's dangerous to assume that what you reasonably consider to be an invasion of privacy must obviously be equally onerous to the other users of a site. Corrupting the marketing data of a company, no matter how much you dislike their policies or site, hurts the collective user population.

There is plenty of software out there to block ads and squish cookies, and ISPs have massively improved their ability to keep spam from hitting your inbox. Anyone who cares enough about privacy to want to implement punitive measures against marketroids.com (sorry if this is an actual site :o) has probably already switched to a cleaner ISP and filters everything. So what's the point of making the spam and ads you barely receive now any more randomly-targeted and useless? There are countless more effective ways of improving your privacy than engaging in a quixotic battle with terabytes of data-warehoused marketing information.

Some Collections Surprise me..., posted 7 Sep 2000 at 16:54 UTC by cbbrowne » (Master)

Notably, it surprises me that grocery stores feel the need to use a card to get an "ID token" about the gentle purchaser.

After all, many of the purchasers use credit cards or ATM cards, which contain, on their little magnetic strips, the individual's name.

Given the name, and/or other ID info on the credit card, they should be able to do all the correlations that they may feel the need to do without collecting any information beyond what they truly need to have in order to complete the transaction. For those that buy using cheques, there's even more personal information that the grocery store must collect to give out a "cheque cashing card."

The point here is that there are some pretty potent ways that significant demographic and psychometric data forcibly gets through when it comes to real transactions, and since those transactions led to people deciding to SPEND MONEY, they are the activities involving psychometric data that you can really bank on.

You can create a web bot that goes off and throws garbage at the "demographers," but all this really does is to demonstrate that transactions that don't lead to spending money are vastly less "honest" than those that do.

That being said, I think that it's a neat idea to create such a web bot, and possibly of some small utility. If the "bot" were sufficiently simple to install, configure, and use, some of the "demographic sector" that were willing to contribute CPU to things like the distributed.net contest would be willing to contribute a bit of network bandwidth to the NukeDemographics.net Project. And while this wouldn't do much to influence the psychometric analysis at the point of sale, it could indeed go somewhere in discouraging companies from working real hard to collect the largely worthless psychometric statistics that come from "just browsing the web."

Some responses, posted 7 Sep 2000 at 18:49 UTC by mskala » (Journeyer)

deekayen says: if we put the advertisers out of business, then people who depend on advertising revenue will be harmed. True, but isn't that the point? Yes, there are Web sites that I like that currently depend upon advertising revenue. But I hate marketing. I'm willing to risk endangering the Web sites I like that do have ads, if it could mean eliminating ads. I believe the sites that deserve to survive would be able to survive on other income sources if advertising weren't available. I also think you're overestimating the possible success rate of my plan - which is flattering, but not my own view. I don't think this could really, by itself, shut down the entire concept of Internet marketing. I'd be very glad if it could, but the expected result is just to make it a little trickier.

stimuli points out that I'm unlikely to be willing to let my robots buy things I don't want, just to mess with the sellers' minds. True, but there are plenty of things I can do that don't cost. The commercial interests have very carefully structured the Web to make it easy and cheap for me to view their advertising. Thus, I can view a lot of advertising without incurring significant costs to myself.

jLoki appears to be taking my article title too literally. I don't mean that HTTP cookies should be the only or even the primary target, I'm suggesting that the robots should do everything they can to leak their fake personal information. Poisoned cookies is a nice catchy name, but I agree that Web page downloading, form filling, and similar activities are probably more useful as specific tactics. As far as the scale necessary to have an effect, I don't know enough about data mining to give specific quantitative answers on that. I'm not sure even the commercial data miners know - I believe that just like in all other fields of comp sci, commercial practice is at least 20 years behind what the academics are doing. It calls for research. But it's not necessary to truly render the entire database useless just to have a good effect. To carry the poison analogy further, maybe we can't really poison the database but we can make it taste bad. 200 bots will have an effect on the marketers' behaviour if they really look like a new market segment.

PaulJohnson and slothrop question whether marketing data collection is really so bad after all. I submit that yes, it is really that bad. Saying "The sites I like collect my data to serve me better, and I don't care what data the sites I don't visit might collect" is a dangerously short-sighted view because it assumes that the data is only ever going to be used by the site that collects it, and only ever used in the way it was meant to be used at the time it was collected. Data is a superfluid: it flows with zero viscosity and gets everywhere. It also lies dormant on backup tapes and then surfaces unexpectedly at any time in the future. The world changes. The site that collected your data for an innocent reason today may go bankrupt on account of idiots like me spoiling the game for everyone, and then it may sell its database to someone else. The site you visit might somehow become associated with illegal activities and its list fall into the hands of the police. I expand on "reasons you don't want your surfing behaviour recorded" at considerable length in a recent post to the VLUG mailing list, which you might want to read.

On the topic of "what is the payoff for consumers to interfere with marketing databases?", I can only speak for myself, but: when I first started using the Net however many years ago, it was really cool. Now it's a lot less cool and a lot more scary. I believe the biggest single cause for the change in the Net is the ease with which commercial interests can use the Net to make money. If I can make it just a little bit more difficult for commercial interests to make money on the Net, that's a positive result for me. I think some other users of the Net agree with my feelings. If you aren't one of them, feel free to go write robot-detection software for the data mining companies. I'm sure they'll pay you a lot of money for it. There is also the privacy motivation, of course, but as jLoki and others point out, individuals who just care about their own privacy can defend it adequately with passive measures.

The Scale of Things., posted 8 Sep 2000 at 20:54 UTC by Uruk » (Apprentice)

Poisoning databases sounds like a very good idea to me, but I don't really think that it would work, simply because of the scale of things. There are millions of web surfers all over the world, surfing every day. If we really wanted to make the data in those databases useless, everybody that reads slashdot on a daily basis would probably have to run 100 or more of those "poison" clients. Companies expect when collecting huge amounts of data to have a certain percentage wrong, misleading, incomplete, or otherwise no good. You can make it unprofitable for them if that percentage gets high enough, but they have a VERY high tolerance. I have worked for companies that do direct mailings, and it is considered a VERY good run if you get a 3% response rate. Similarly with telemarketing, the percentage of people who weren't interested in your product would have to be well over 90% to make telemarketing unprofitable.

So basically, in order to poison those databases to the extent necessary to make it unprofitable, you would need to use so many poison clients and so many connections and so much bandwidth that you'd probably just bring the internet to a grinding halt. (If you're targeting one specific corporation though, this probably isn't true. If you're just looking to screw doubleclick, that might be achievable if you limit your goal to something very specific.)
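The scale argument can be put in rough numbers. If genuine profiles respond at rate r and fake profiles never respond, then a database in which a fraction f of profiles are fake shows an observed response rate of r(1 - f). The sketch below solves for the f needed to drag the observed rate down to a break-even rate b. Only the 3% figure comes from the text above; the 1% break-even rate and the one million genuine profiles are assumptions picked for illustration.

```python
from fractions import Fraction


def fake_fraction_needed(real_rate, break_even_rate):
    """Fraction f of fake profiles with real_rate * (1 - f) == break_even_rate."""
    return 1 - break_even_rate / real_rate


def fakes_per_real(real_rate, break_even_rate):
    """Number of fake profiles required for every genuine one."""
    f = fake_fraction_needed(real_rate, break_even_rate)
    return f / (1 - f)


real_rate = Fraction(3, 100)    # the "VERY good run" response rate quoted above
break_even = Fraction(1, 100)   # assumed rate below which a campaign loses money
real_profiles = 1_000_000       # assumed number of genuine profiles

f = fake_fraction_needed(real_rate, break_even)
ratio = fakes_per_real(real_rate, break_even)
print("fake fraction needed:", f)
print("fake profiles per real profile:", ratio)
print("fake profiles needed:", ratio * real_profiles)
```

Under these assumed figures the fakes would have to outnumber the genuine profiles, which is the point about tolerance: a campaign that is profitable at a few percent response can absorb a great deal of bad data before it stops paying.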

Also, how long do you think it would be before the marketers built in code to look for patterns common among poisoning software and toss out that data? Which leads to changes in the poisoning software, which leads to changes in their database apps...it's basically a battle of numbers of people working on a project, and I'm not convinced that 1,000 people who hate marketers working 1 hour a week to foil them could even come close to 10 guys in the marketer's IT dept working full time to keep the data clean.

The laws of the land and the technologies that we're using just have problems. What I do is vote with my dollars. When it really comes down to it, that's the only thing capitalists understand. If you're serious about this, you can boycott companies associated with marketers who violate privacy.

If you want DoubleClick cookies..., posted 9 Sep 2000 at 22:10 UTC by dmarti » (Master)

ddccss, the Distributed Doubleclick Cookie Snarfing System, has collected 1,211,222 unique doubleclick.net cookies. They're available for download.

Where's the money, then?, posted 10 Sep 2000 at 11:46 UTC by argent » (Master)

Mskala writes, "I believe the sites that deserve to survive would be able to survive on other income sources if advertising weren't available."

$200+ a month for a colo box to provide anything like adequate bandwidth for a reasonably popular site.

For a site that's not actually selling anything directly, what income sources were you thinking of?
