Brightspot’s digital content management platform serves millions of requests per day for some of the world’s best-known media and corporate brands. In recent months, we’ve observed a sharp rise in non-human web traffic hitting our customers’ sites, and a significant portion of this surge comes from automated “bots” — in particular, content-scraping programs likely harvesting data to train large language models (LLMs) and other AI systems.
With bots now accounting for nearly half of all web traffic (per our own internal survey data and Imperva’s Bad Bot Report), and LLM-related scrapers growing in volume and sophistication, publishers face rising costs, data exploitation and degraded analytics integrity. Here, we present analysis, findings and recommendations from a 2025 Brightspot initiative to understand, evaluate and respond to the growing risk and impacts of “bad bot” traffic on our customers’ digital properties.
This in-depth report examines the nature of this automated traffic, the risks it poses to infrastructure and intellectual property, and key findings from Brightspot’s evaluation of bot management solutions. We’ll also reference other insights on the rise of LLM-related web scraping, the limitations of current anti-bot measures and best practices for managing bot traffic on content-rich websites.
The rise of automated web traffic and LLM scraping
The explosion of generative AI has created a kind of “gold rush” for data. Bots that crawl the web to index and copy content are not new — search engines have done this for decades — but the scale and aggressiveness of recent crawlers is unprecedented. Companies, students, state actors and other organizations are now deploying bots to scrape text and media to feed LLM training processes. Industry research confirms that automated traffic is reaching new heights,1 and Brightspot’s own 2025 survey of customer sites found that over 40% of traffic on some major media websites was generated by automated agents rather than human readers.
This surge is directly tied to the demand for training data. According to Imperva’s 2024 Bad Bot Report, the rapid adoption of generative AI led to a significant jump in basic web scraping bots — simple “LLM feeder” crawlers increased to nearly 40% of overall traffic in 2023 (up from ~33% in 2022).2
What is driving the surge in automated traffic?
The explosive growth of generative AI has triggered a spike in bots scraping web content to train large language models (LLMs). These bots operate around the clock and increasingly ignore traditional controls like robots.txt, leading to a surge in non-human traffic across content-rich websites.
Why is this a problem for publishers?
Excessive bot traffic strains infrastructure, inflates cloud and CDN costs, disrupts analytics accuracy and threatens intellectual property. Some bots mimic human behavior to avoid detection, making them difficult to block with standard network filters.
Can existing perimeter defenses stop these bots?
Not entirely. Brightspot’s evaluation found that perimeter tools like WAFs and CDN-based solutions often fail against advanced scrapers. More effective defenses are tightly integrated into the CMS or application layer, where behavioral signals provide richer context for detection.
Should all bots be blocked?
Not necessarily. Some bots deliver business value (e.g., search engine crawlers, monitoring services). Bot management should be strategic — deciding which bots to allow, block or throttle based on business priorities, content value and partnership opportunities.
What defensive measures work best?
Adopt a multi-layered defense strategy combining network, application and client-side protections. Continuously monitor traffic and tune defenses. Use adaptive serving for suspected bots and maintain clear Terms of Service and allow/deny lists to support enforcement.
How should organizations approach bot traffic strategically?
Treat bots as another user segment. Collaborate across security, operations, editorial and legal teams to define content exposure policies. Decide whether your goal is strict IP protection, broader visibility or selective access, and configure your bot controls accordingly.
Ignoring bot guidelines and intellectual property concerns
Traditional controls meant to govern web crawling — such as the robots.txt standard and honest self-identification via user-agent strings — are increasingly being ignored by these new scraping bots. Under the Robots Exclusion Protocol, website owners can publish a robots.txt file to indicate which parts of the site should not be crawled and by which bots, and well-behaved bots (like Google’s crawler) generally comply. Adherence to robots.txt is purely voluntary, however, and malicious or opportunistic scrapers often disregard it entirely, along with other directives like nofollow and noindex. Recent evidence shows multiple AI companies deliberately bypassing robots.txt rules to grab content without permission.
In mid-2024, Reuters reported that analytics from a content licensing startup revealed “numerous AI agents are bypassing the robots.txt protocol” across publisher websites.3 In one case, an AI search startup was found likely ignoring Forbes’ robots.txt directives in order to scrape articles, sparking accusations of plagiarism. These incidents underscore that bad actors can easily crawl wherever they please, since the robots standard has no legal enforcement and is essentially based on an honor system.4
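To see how thin this protection is, consider what a compliant crawler actually does: before fetching a URL, it parses the site’s robots.txt and honors the verdict. Python’s standard library ships such a parser. The sketch below, using invented rules, shows the check that well-behaved bots perform and scrapers simply skip.

```python
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt rules (invented for this sketch): opt one AI
# crawler out of the whole site, and keep everyone out of /admin/.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Disallow: /admin/
"""

rp = RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

# A compliant crawler runs this check before every fetch; a scraper
# simply skips it, since nothing technically enforces the answer.
assert not rp.can_fetch("GPTBot", "https://example.com/articles/story")  # opted out
assert rp.can_fetch("SomeBot", "https://example.com/articles/story")     # permitted
assert not rp.can_fetch("SomeBot", "https://example.com/admin/login")    # disallowed path
```

The asymmetry is the whole problem: the file expresses the site owner’s wishes, but the decision to call the check at all rests entirely with the crawler.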
Compounding the issue, sophisticated scrapers employ techniques to evade detection and blocking. Many disguise themselves by spoofing popular browser user-agent strings or cycling through random IP addresses and residential proxies to appear as ordinary visitors. Some bots even run actual browsers, mimicking human-like interactions (such as mouse movements or realistic pauses between page loads) to slip past basic bot filters. These tactics make it difficult for network-level defenses that rely on simple patterns (like IP reputation or user-agent blocking) to reliably distinguish scrapers from real users.
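The weakness of pattern-based filtering is easy to demonstrate. The hypothetical filter below (invented tokens; IP addresses drawn from reserved TEST-NET ranges) catches clients that honestly announce themselves, but a spoofed browser User-Agent arriving from a fresh proxy IP passes untouched:

```python
# A naive perimeter filter of the kind the text describes: block requests
# whose User-Agent contains a known bot token or whose IP is on a static
# denylist. Tokens and addresses here are invented for illustration.

KNOWN_BOT_TOKENS = ("python-requests", "curl", "scrapy", "gptbot")
IP_DENYLIST = {"203.0.113.7"}

def naive_filter(ip: str, user_agent: str) -> bool:
    """Return True if the request should be blocked."""
    ua = user_agent.lower()
    return ip in IP_DENYLIST or any(token in ua for token in KNOWN_BOT_TOKENS)

# An honest client that announces itself is caught:
assert naive_filter("198.51.100.2", "python-requests/2.31")

# A stealth scraper spoofs a mainstream browser UA and rotates to a fresh
# residential-proxy IP; the identical filter now waves it through:
spoofed_ua = ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
              "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0 Safari/537.36")
assert not naive_filter("198.51.100.99", spoofed_ua)
```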
Beyond the technical circumvention of guidelines, the unbridled scraping of content raises serious copyright and intellectual property (IP) concerns. News and media outlets, whose content is a prime target for LLM training, are increasingly alarmed that AI developers are harvesting their articles without compensation or attribution. In late 2023, The New York Times filed a landmark lawsuit against OpenAI, alleging that the company infringed copyright by using Times articles to train ChatGPT without permission.5 OpenAI argued that mass web scraping for AI training falls under “fair use,” but the Times and other publishers strongly dispute this, claiming there is nothing “transformative” about using their content wholesale in a new AI product. This legal battle highlights the growing tension between content creators and AI firms.6
In fact, a backlash against AI scraping has begun.7 Over the last year, many major websites have taken active measures to shield their content. A recent analysis by MIT’s Data Provenance Initiative found a “rapid crescendo of data restrictions” being implemented — about 5% of 14,000 sampled websites (and 28% of the most actively updated sites) have now added rules in robots.txt to block AI-specific crawlers, and many sites have also updated their Terms of Service to explicitly forbid AI training uses of their data. These numbers jumped dramatically from mid-2023 to mid-2024, indicating a new emphasis on protecting content from uncompensated AI mining. Some high-profile sites went as far as blocking OpenAI’s own GPTBot crawler when it was introduced in 2023. However, the landscape continues to evolve — notably, several publishers that initially blocked AI scrapers reversed course after securing licensing deals with the AI companies. For example, when firms like OpenAI struck agreements with media companies (Dotdash Meredith, Vox Media, Condé Nast and others), those publishers promptly removed or eased the blocks on OpenAI’s crawler in their robots.txt. This seems to indicate that content owners are willing to allow AI access on their own terms — i.e. if there is a fair exchange of value, or strategic benefit, for doing so.8
The scale of the problem: Automated traffic by the numbers
The data make it clear that automated bot traffic is not a minor nuisance but a material (and growing) component of web activity. Brightspot’s internal survey found that some major news and e-commerce sites are seeing over 40% of their total requests coming from bots rather than human users. This aligns with broader industry findings. Imperva reports that in 2023 bots accounted for 49.9% of all website traffic, exceeding human traffic for the first time in their records.9 Within that, about one-third of total traffic was attributed to “bad bots” engaged in malicious or unwanted actions (as opposed to “good” bots like legitimate search-engine crawlers).
The rise of LLM-related scraping is a major contributor to these trends. Simple scraping bots used for AI data collection grew sharply in prevalence as generative AI took off, and the rush to build ever-larger AI models has unleashed an army of crawlers on the web. These bots often operate 24/7 at high request rates, far beyond what a human browser would do, in order to vacuum up as much data as possible. For popular content sites, the result is that a large chunk of their traffic — in some cases approaching or exceeding half — now comes from automated agents that provide no direct business value (no ad impressions, no product purchases, no newsletter sign-ups) and often ignore the site’s rules.
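One of the simplest signals for spotting such always-on crawlers is raw request rate. As an illustration (the thresholds below are invented, not production values), a sliding-window counter can flag clients that request pages faster than any human reader would:

```python
from collections import deque

class RateWindow:
    """Sliding-window request counter. Flags clients whose request rate
    exceeds what a human reader plausibly produces; the defaults here
    are invented for illustration, not production thresholds."""

    def __init__(self, max_requests: int = 30, window_seconds: float = 10.0):
        self.max_requests = max_requests
        self.window = window_seconds
        self.hits = deque()  # timestamps of recent requests

    def record(self, now: float) -> bool:
        """Record a request at time `now`; return True if the client
        now looks automated (too many requests inside the window)."""
        self.hits.append(now)
        # Drop timestamps that have aged out of the window.
        while self.hits and now - self.hits[0] > self.window:
            self.hits.popleft()
        return len(self.hits) > self.max_requests

# A crawler firing 10 requests per second trips the limit within seconds:
w = RateWindow()
assert any(w.record(t * 0.1) for t in range(100))  # 100 requests in 10 seconds

# A human loading a page every 20 seconds never does:
h = RateWindow()
assert not any(h.record(t * 20.0) for t in range(100))
```

Rate checks alone are easy to dodge (bots can crawl slowly or spread load across many IPs), which is one reason the evaluation described below favors layering them with behavioral signals.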
Operational impacts: Infrastructure strain and site reliability
I’ve spoken to a lot of our customers about this topic — customers that are concerned about reliability, cost and content ownership rights. This kind of thing is so new they don’t even know if they should consider it a security issue, an operational issue or a strategy issue — I think it’s all three.
For organizations running content-rich websites, this flood of non-human traffic can have significant operational and business impacts. Infrastructure costs can grow disproportionately to revenue when a sizable portion of traffic is essentially unwanted load. Web servers, databases and CDNs must scale to handle the volume of requests, meaning extra capacity (and cost) is needed merely to serve bots that likely shouldn’t be there in the first place.11
Site performance and reliability can also suffer. Automated scrapers tend to hit pages as fast as they can, and multiple bots may crawl in parallel. Because scrapers sweep through the long tail of older or rarely visited pages rather than the popular few, they defeat network and application caches, which are designed around the typical bell curve of content popularity on a given site. Caches save money as well as improve performance, and when they get bypassed, serving costs climb quickly.
It’s not just the content misappropriation and the cost/revenue imbalance. It’s also things like analytics pollution — a surge in traffic that appears to be human could be good news, or it could be the latest round of scraping; article popularity data or A/B testing results can become worthless if the nature of the site visitors is in question.
In short, the surge in automated traffic translates to real costs and risks: extra infrastructure spend, potential outages or slowdowns and indirect business harm (like lost engagement or tainted analytics). It is a problem that threatens both the technical performance and the content value proposition of digital publishers. This is why Brightspot and its clients have made addressing malicious and excessive bots a top priority.
Evaluating bot management solutions: Brightspot’s approach
To combat the rising tide of non-human traffic, Brightspot undertook an evaluation of leading bot detection and mitigation tools. The goal was to identify how well current solutions can handle the new wave of sophisticated scrapers — and to guide Brightspot’s strategy for protecting its platform and customers. Rather than rely on vendor claims alone,12 Brightspot set up a real-world test using a commercial web scraper known for its ability to bypass bot defenses. This tool, essentially a stealth crawling engine marketed to researchers and grey-area software engineers, was configured to mimic human-like browsing (randomized headers, varying click rates) and to rotate through numerous IP addresses. Brightspot then unleashed this simulated “bad bot” against our own website and evaluated multiple bot management solutions on their ability to detect or block the scraper.
“This thing is nasty,” said Chris Cover, Program Director and the head of the ‘Red’ team in our experiment. “They claim right up front that they can bypass the big-name filters out there — and from what I can see I don’t doubt it.”
The solutions tested included both edge-level defenses (such as a leading CDN’s bot management add-ons and a cloud WAF service) and an application-level approach integrated within the Brightspot CMS (leveraging JavaScript, a specialized server and a SaaS component). Over several weeks of testing, the team gathered data on detection rates, false positives, performance impact and the effort required to tune each solution. The results were telling. Most tools failed to catch any of the unwanted traffic out of the box, and none came close to seeing all of it. The differences in approach yielded notable trade-offs.
Key findings from Brightspot’s bot management evaluation
No “set-it-and-forget-it” solution: Ongoing tuning is essential.
A clear lesson from the evaluation is that effective bot defense requires active management and tuning over time. None of the tested solutions could simply be enabled and left alone without oversight. In initial runs, every tool let almost all scraper traffic through and produced false positives (blocking a bit of legitimate traffic) until adjustments were made. This aligns with industry intelligence that shows that bots are constantly evolving and adapting; rules that worked last month might miss a new bot variant this month. Brightspot observed that regular tuning of bot signatures, thresholds and allow/block lists was necessary with all solutions. This finding underscores that bot management is an ongoing process, not a one-time deployment.13 Organizations should plan for continuous calibration – whether by internal teams, vendor support or automated learning — to adapt to new bot behaviors and minimize false positives. Set-it-and-forget-it is not realistic in the face of determined adversaries.
Edge vs. integrated: CMS-integrated solutions show better effectiveness.
Another takeaway is that bot defenses deployed outside the application (at the CDN, load balancer or firewall layer) were generally less effective against advanced scrapers than solutions tightly integrated with the CMS. The evaluation found that the Brightspot CMS-integrated prototype caught more of the stealth bot traffic and did so with fewer inadvertent blocks of real users. Why? The integrated approach could leverage application-specific knowledge — for instance, understanding normal content fetch patterns, user session behaviors and CMS-specific query patterns — which allowed more nuanced detection. In contrast, the edge solutions had a limited view (mostly network and HTTP metadata) and struggled to flag the bot when it behaved very much like a human browser. External tools often rely on generic heuristics (like known bad IP lists or anomaly detection at the network layer). These are important techniques, but sophisticated bots that mimic human behavior can easily slip past purely edge-based defenses.
The integrated solutions, on the other hand, are able to analyze user behavior in context — for example, detecting that the scraper never loaded images or executed certain client-side scripts that real users would, and spotting anomalous navigation paths. This deeper, context-aware analysis gave them an edge in identification. External solutions also tended to be reactive (block after a threshold is exceeded), whereas the CMS could proactively challenge suspicious clients (e.g. serve a CAPTCHA or a slowly loading page to suspected bots). The bottom line: defenses closer to the application and content can make more fine-grained decisions, so integrating bot mitigation into the CMS or application logic can outperform solely perimeter-based tools.
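As a rough sketch of this kind of context-aware scoring (the signals, weights and thresholds below are illustrative assumptions, not Brightspot’s actual detection logic), behavioral evidence can be accumulated into a score that drives a graduated response:

```python
# Hypothetical behavioral scoring for a CMS-integrated defense. Signals,
# weights and thresholds are assumptions chosen for illustration only.

def bot_score(session: dict) -> float:
    """Accumulate suspicion from in-context behavioral signals."""
    score = 0.0
    if not session.get("loaded_images"):         # real browsers fetch page images
        score += 0.3
    if not session.get("ran_client_js"):         # e.g. a client-side beacon never fired
        score += 0.4
    if session.get("pages_per_minute", 0) > 30:  # far beyond human reading speed
        score += 0.2
    if session.get("sequential_paths"):          # e.g. walking article URLs in order
        score += 0.1
    return score

def action_for(score: float) -> str:
    """Graduated response: challenge before blocking to limit false positives."""
    if score >= 0.7:
        return "block"
    if score >= 0.4:
        return "challenge"  # CAPTCHA, or a deliberately slow page
    return "allow"

scraper = {"loaded_images": False, "ran_client_js": False,
           "pages_per_minute": 120, "sequential_paths": True}
assert action_for(bot_score(scraper)) == "block"

human = {"loaded_images": True, "ran_client_js": True, "pages_per_minute": 4}
assert action_for(bot_score(human)) == "allow"
```

The graduated response matters: challenging a borderline client costs a suspected bot time, while a mistakenly challenged human can still get through.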
Strategic bot management: Align policies with business goals.
A final key finding is that decisions about which bots to block, throttle or allow should be guided by an organization’s broader business strategy and content goals. There is no one-size-fits-all answer to the question of “block all bots or not?” — it truly depends on the type of content, the value of that content and the company’s objectives (and obligations) around it. Brightspot’s evaluation highlighted that the most successful bot management programs were those that were purposeful and selective about automated traffic, rather than applying a blanket ban without nuance. For example, one media group may decide to allow a particular news-aggregator bot that drove traffic to their site (essentially treating it as a partner), while blocking other bots that simply republished their content without benefit. In practice, this means maintaining an allowlist of “good” bots (search engines, monitoring bots, authorized partners, etc.) and a dynamic denylist of unwanted bots.14
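In code terms, such a policy layer can be as simple as a lookup that encodes the business decision per bot. The names and groupings below are illustrative; and because User-Agent strings are trivially spoofed, allowlisted identities should also be verified (for example, via the reverse-then-forward DNS check that Google documents for confirming Googlebot):

```python
# Hypothetical per-bot policy table encoding business decisions. Bot names
# and groupings are illustrative assumptions, not a recommended list.

ALLOWLIST = {"googlebot", "bingbot", "uptime-monitor"}  # reach and operational value
DENYLIST = {"gptbot", "ccbot", "bytespider"}            # AI trainers being opted out
THROTTLED = {"partner-aggregator"}                      # welcome, but rate-limited

def policy_for(bot_name: str) -> str:
    """Map a recognized bot to allow/deny/throttle; unknowns fall through
    to behavioral detection rather than receiving a blanket verdict."""
    name = bot_name.lower()
    if name in ALLOWLIST:
        return "allow"
    if name in DENYLIST:
        return "deny"
    if name in THROTTLED:
        return "throttle"
    return "inspect"

assert policy_for("Googlebot") == "allow"
assert policy_for("GPTBot") == "deny"
assert policy_for("UnknownCrawler-1.0") == "inspect"
```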
Business stakeholders should be involved in classifying bots: is a given crawler helping our business (by increasing our reach or visibility), or hurting it (by exploiting our IP or straining our systems)? For instance, some companies may choose to permit certain AI scrapers because they want their content to be visible in AI-driven search results or chat answers — which can result in inclusion in reports like this one, where we made use of AI to help with our research — essentially an investment in future discoverability.
Others will decide to block all AI training bots to protect proprietary data or seek licensing fees. Both approaches are valid; what’s important is that the bot management strategy aligns with the company’s goals and risk tolerance.15
Conclusion
The rise of automated web traffic — fueled in large part by AI’s voracious appetite for data — presents a complex challenge for digital content platforms. On one hand, bots are overwhelming infrastructure and quietly exfiltrating the hard-earned content that organizations produce. On the other, not all bots are inherently bad — and in fact some level of automated access is integral to the open web and to business growth. As we have explored, there is no silver-bullet solution.
Liz Burgess, Brightspot’s Senior Manager, Service Delivery, who led our Blue team, said, “I told our Red team to ‘come at me!’ and was repeatedly disappointed by the performance of single-source solutions that promised comprehensive bot management. I was pretty chuffed when our multi-layer integrated approach proved effective.”
Combating unwanted bots requires a combination of smart technology and smart policy. Technical defenses must be multi-faceted and continuously updated to keep up with evolving scrapers. Equally, organizational strategies must be well-defined to distinguish between the automated traffic that should be welcomed and that which must be shown the door.
Brightspot’s investigation into bot management tools revealed that while technology is improving (with advanced behavior analysis, AI-driven detection, etc.), human oversight and tuning remain vital. It also highlighted the advantages of weaving bot awareness into the fabric of the CMS and application itself, rather than treating it as an external afterthought. Ultimately, effective mitigation will come from a layered defense and an adaptive mindset. We encourage IT leaders to audit their current traffic and ask: How much of it is non-human, and what is it doing? From there, business leaders and technologists can collaborate on a plan to protect their content’s value, ensure their sites stay reliable, and decide how — or if — they want their data to contribute to the AI ecosystems that are emerging.17
In navigating this new era of automated web traffic, knowledge is power. By understanding the scale, knowing the tools available and aligning actions with goals, organizations can regain control of their web traffic mix. Brightspot remains committed to helping our customers face these challenges. Through continued research, platform innovations and partnerships with leading bot mitigation providers, we aim to ensure that your valuable content serves your human audience first and foremost, while unwanted bots are kept at bay. The web may be increasingly populated by bots, but with the right approach, we can keep the bots in check and the digital experience thriving for everyone involved.
Sources:
1. Imperva – 2024 Bad Bot Report
2. ManagedServer – Bots make up approximately 50% of global web traffic.
3. Reuters – Exclusive: Multiple AI companies bypassing web standard to scrape publisher sites, licensing firm says
4. PromptCloud – Read and Respect Robots Txt File
5. Imperva – The New York Times vs. OpenAI: A Turning Point for Web Scraping?
6. Reuters – Exclusive: Multiple AI companies bypassing web standard to scrape publisher sites, licensing firm says
7. 404 Media – The Backlash Against AI Scraping Is Real and Measurable
8. Wired – The Race to Block OpenAI’s Scraping Bots Is Slowing Down
9. ManagedServer – Bots make up approximately 50% of global web traffic.
10. ManagedServer – Bots make up approximately 50% of global web traffic.
11. AI Journal – Google and OpenAI Are Slowing Down Your Website
12. SecureAuth – Elevate Your Bot Detection: Why Your WAF Needs Our Intelligent Risk Engine
13. Approov – Streamlining the Defense Against Mobile App Bots
14. Akamai – Managing AI Bots as Part of Your Overall Bot Management Strategy
15. Cloudflare – How to manage good bots | Good bots vs. bad bots - Cloudflare; United States Cybersecurity Magazine - Bots: to Block or Not to Block? Effective Bot Management Strategy
16. Computerworld – IETF hatching a new way to tame aggressive AI website scraping
17. AWS Documentation – Example scenarios of false positives with AWS WAF Bot Control; Akamai - Top 10 Considerations for Bot Management