Most organisations that rely on web-scraped data treat it like any other data source: ingest it, clean it, build dashboards on top of it. The implicit assumption is that the target site served the same content to your bot as it would to a human. That assumption is often wrong.
In a previous role I spent years on the other side of this problem – building cloud-based defences designed to detect, throttle, and mislead automated traffic.
Bot management platforms don’t just block scrapers; they can slow them down or return plausible-but-wrong data. The goal isn’t always to stop collection. Sometimes it’s to make you confident in bad intelligence.
In my role at On The Beach, I lead on Price Intelligence. We process millions of competitor prices every day, drawn from a range of sources.
When routine checks revealed systematic discrepancies between bot-collected and human-observed prices, I had to ask a question I already knew the technical answer to: could the target be feeding us what they wanted us to see?
This talk is an experience report from the data trenches. I’ll cover what the defensive toolkit actually looks like from the inside – fingerprinting, session-based content variation, selective degradation – and then walk through how we validated our scraped data in practice.
We’ll look at how to design lightweight experiments to distinguish legitimate volatility from adversarial manipulation, how to bound what’s plausible using domain constraints, and why the answer was more nuanced than either “yes, they’re poisoning us” or “no, we’re fine.”
Whether you’re consuming scraped data, building defences against scraping, or simply making decisions from intelligence you didn’t gather yourself – the underlying question is the same: how do you validate data that someone else had every incentive to corrupt?
Technical Level of Session: Technical practitioner