By Jason Casden, David Romani, Tim Shearer, and Jeff Campbell
Introduction
Since the release of multiple competing generative AI platforms, the University of North Carolina at Chapel Hill University Libraries has seen increasingly overwhelming and service-disrupting traffic from web crawlers, including those from previously well-behaved hosts such as Google and Microsoft. This traffic consists of an unprecedented flood of evasive and extremely aggressive requests, which caused multiple consecutive days of intermittent service disruption for our public catalog interface (and later for various digital collections services). Although most of our peers in libraries, archives, and the broader open web have also been acutely impacted[1a][2a], our institution was among the first to experience severe service disruptions from this activity. This article describes these impacts, as well as our evolving and escalating attempts to mitigate the denial of service caused by these automated crawls. We will discuss our novel facet-based bot detection and blocking technique, practices identified by other institutions, and opportunities for collaborative detection and blocking of “bad bots” (i.e., bots that aggressively query our services without regard for robots.txt or other policies) among libraries and archives.
Community Context
When UNC Libraries first began to see our systems under an unprecedented and unsustainable wave of traffic in the Spring of 2024, we started checking in across our professional networks. Within traditional IT academic communities (i.e., campus schools and departments), no systems were adversely affected. However, within library and archival IT communities nationwide, it was clear that many systems were under strain. This excessive traffic was successfully mitigated by UNC Libraries software and system engineers in fairly short order, ensuring systems were available to serve researchers. When our systems detected a second, and seismic, wave of automated web crawls in late fall 2024, we found, mostly through informal communications, the same national pattern, but far more intense and almost exclusively affecting memory institutions. These informal communications took a number of forms. As a group was moving into a meeting agenda or winding one down with informal chat, one or two people would mention that their systems were receiving traffic that seemed similar to distributed denial-of-service (DDoS) attacks, and others would chime in with similar experiences. For example, LYRASIS-hosted Fedora Repository Program governance calls in 2024 had so many side conversations about bot activity that they spun off an interest group that is meeting regularly and has begun to compile resources. In the Code4Lib community Slack instance[3], a new messaging channel was established, but not publicized, as a forum for an international community of technical colleagues in libraries and archives to discuss their attempts to mitigate crawler-driven downtime. Additionally, as our colleagues shared our internal communications with their professional networks, we started receiving requests for consultations with other institutions.
The resulting information sharing is organic, inconsistent, and clearly not reaching all affected institutions. This lack of comprehensive awareness and coordinated response is exacerbated by the fact that systems are not predictably affected. Staff may only need and seek out a shared understanding and response after their own system is affected. Additionally, unaware of the global pattern, individuals often attempt to resolve the issue locally and only come to understand the broader nature of the attacks after spending weeks or months trying to restore access to their impacted systems. In many cases, this is the first time IT staff in libraries and archives have experienced a significant and coordinated security event.
As mentioned, there are efforts to bridge these gaps. We conducted an informal survey in late March 2025, and we are aware of at least one other similar survey. Our goal was to begin documenting the extent of the impact, better understand which systems are affected, identify solutions that seem to be working, and gauge interest in building a community around the topic. We found that roughly a third of respondents were from outside academic libraries and a quarter were from outside the U.S., hinting at the global impact of this bot traffic. Sixty-five percent of respondents are actively battling bot traffic or experiencing degraded services. The Engelberg Center on Innovation Law & Policy and the Confederation of Open Access Repositories (COAR) asked similar questions in surveys conducted in April 2025 and found that 74.4% and 75.5% of respondents, respectively, are actively engaging in bot mitigation[1b][2b]. More than 95% of respondents to our survey are interested in coming together collectively to share and learn from each other. The LYRASIS-hosted group is actively exploring ways to further this goal.
Strategic Partnership
As the same software and systems engineers tackled the second wave of activity in the fall of 2024, they found it was clearly different from the first wave. The apparent source of the traffic was masked, and behavior evolved to outflank each effort to block it. We soon realized we needed help and began sharing data and ideas with our regional academic library IT colleagues. We also initiated talks with our university IT security professionals, leading to a call in early 2025 between Libraries IT staff and both security and networking staff from the university IT division. As we described the problem and expressed uncertainty about whether this qualified as a DDoS attack, a security colleague responded with a clarifying message that “authorial intent does not matter” and both offices generously offered their help. Over the next week, a cross-departmental group with representatives of Libraries IT, the campus security office, and the campus networking office determined that implementing additional network infrastructure was the most promising intervention. The security office was also able to confirm that, in their professional networks, the issues were not widespread across academic institution systems, but were specifically affecting memory institutions located within those campuses. Our central campus networking office referred us to a Web Application Firewall (WAF) vendor and worked closely with us to configure and deploy a cloud-based load balancer and WAF. We were fortunate to have available funding and support from the University Libraries Administration to pursue this vendor-based approach.
Escalation of the Crawling and Mitigation
Our initial response to early reports of sluggish responses from the catalog was to focus on scaling and optimization. We increased server resources, tuned web server configurations, profiled the application, and implemented targeted performance improvements. Although we did not have a load balancer in place at the time, colleagues at other institutions scaled more dramatically behind load balancers, adding more and more servers as the traffic worsened. Our attempts to control this overwhelming traffic, with significant contributions from many of our colleagues, began with more familiar blocking methods and evolved into a layered suite of mitigation methods.
User agent and IP blocking
We have always been willing to block IP addresses in our local host-based firewalls when we observe malicious requests. These typically involve repeated attempts to break into the web server or naïve denial-of-service attacks. While we have automation in place to deploy configuration changes to the firewall, this was typically a one-off manual process, triggered when the activity crossed a threshold and generated alerts in our monitoring system. Other sources of blocks include IP address and user agent lists shared by our colleagues at NCSU Libraries and Princeton University Libraries and lists compiled from other sources beyond libraries[4][5].
In the spring of 2024, we noticed a new kind of traffic: very high-volume requests from a single IP address or a narrow range of addresses, initially targeting full-text resources and eventually expanding to images and other audiovisual content. The traffic originated from commercial cloud providers, often Amazon AWS or Microsoft Azure, making it unlikely to be human-generated. If we banned one IP address, the traffic would often shift to an adjacent one. The scale of this activity quickly overwhelmed our capacity for manual intervention, and if left unchecked, it would have exhausted our server resources. This traffic also included user agents tied to commercial AI crawlers, such as Anthropic’s “ClaudeBot.” This led us to our first attempt at an automated response: fail2ban[6] continuously monitors log files and, when it detects a user-defined pattern, inserts a firewall block that is later removed after a configurable duration. It also supports escalating penalties for repeat offender IPs. Recognizing that user-agent strings are a soft identifier, we used available lists to build a fail2ban “jail” for known AI crawler user agents while we developed a strategy for providing data to AI clients. Short bans of just a few hours proved insufficient, so we adopted a progressive ban cycle for multiple offenses: 1 day, 7 days, 1 month, 6 months, and 1 year. After discussions among internal stakeholders, this ban of known AI user agents was extended to all our public-facing websites and continues to identify addresses to re-ban or ban for the first time.
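As a rough illustration of this approach, the fail2ban configuration below sketches a user-agent “jail” with escalating ban times. The file names, log path, sample bot list, and exact settings are illustrative assumptions rather than our production configuration, and it assumes the user agent is the final quoted field of a combined-format access log.

```
# /etc/fail2ban/filter.d/ai-user-agents.conf (illustrative)
[Definition]
# Match lines whose final quoted field (the user agent in the combined
# log format) contains a known AI crawler token; this list is a small sample.
failregex = ^<HOST> .*"[^"]*(?:ClaudeBot|GPTBot|CCBot|Bytespider)[^"]*"\s*$
ignoreregex =

# /etc/fail2ban/jail.d/ai-user-agents.local (illustrative)
[ai-user-agents]
enabled  = true
filter   = ai-user-agents
logpath  = /var/log/nginx/access.log
maxretry = 1
findtime = 600
bantime  = 1d
# Escalating bans for repeat offenders (fail2ban 0.11+), approximating a
# 1-day, 7-day, 1-month, 6-month, 1-year cycle.
bantime.increment   = true
bantime.maxtime     = 1y
bantime.multipliers = 1 7 30 180 365
```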
Request throttling
We began to see a new kind of web crawling traffic in the late fall of 2024. It first showed up on our catalog discovery layer rather than sites with full text or other similar resources, and it was more aggressive than previous crawlers. It sidestepped our fail2ban user agent solution, as it utilized the common, generic user agent “Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/136.0.0.0 Safari/537.36 Edg/136.0.0.0.” It also no longer originated from commercial cloud providers but from mainstream ISPs such as AT&T, Comcast, Spectrum, and Verizon-Home.
Again, the volume of these requests overwhelmed our servers, defeating our attempts to tune application performance and consuming any new resources we added. Analysis of our logs showed several odd characteristics. There were identical simultaneous requests from multiple disparate IP addresses, which indicated that a deep pool of addresses was available; we suspected a residential proxy network was in use. Most of these hundreds of thousands of clients were making roughly 10 requests per minute in bursts of 3-5 minutes, effectively functioning as a DDoS attack. When we attempted to block at 80 requests per 5 minutes, and later at 60 requests per 5 minutes, we experienced brief respites before new bots would return with slightly slower streams of requests.
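For illustration, this kind of per-IP throttling can be expressed in fail2ban by treating every logged request as a countable event. The sketch below is a minimal example assuming a combined-format access log; the file names and log path are placeholders rather than our production settings.

```
# /etc/fail2ban/filter.d/request-rate.conf (illustrative)
[Definition]
# Count every request from a host as one hit for rate-limiting purposes.
failregex = ^<HOST> -
ignoreregex =

# /etc/fail2ban/jail.d/request-rate.local (illustrative)
[request-rate]
enabled  = true
filter   = request-rate
logpath  = /var/log/nginx/access.log
# 5-minute window; ban once a client reaches 80 requests within it
findtime = 300
maxretry = 80
bantime  = 3600
```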
Facet blocking
After standard IP blocks and request throttling both failed, a deeper examination of the logs revealed that these bots were requesting huge numbers of very large and unusual combinations of facets. These queries can be very resource-intensive and slow to return, contributing to a vicious cycle of slow responses, large request queues, and repeated query attempts. Many of these queries involved subject headings that were far outside our collecting strengths. To illustrate: in the entire month of November, we received 15 searches using the terms “Finnish” and “Music.” On December 4th, 2024 alone, we received 11,329 of these searches. More importantly, we discerned that many, if not most, of these requests used a far higher number of facets than we would expect from a human user. For example, in all of November 2024, we received 294,398 requests with 15 or more facets applied (many of which were likely also bots). On December 4th alone, we received 140,104. Also on December 4th, we received 17,527 requests with 20 or more facets applied and 203 requests with 25 or more facets applied. That day’s 648,861 requests came from 22,861 distinct IP addresses with no clear relationship between them. These numbers do not include any requests from the millions of IP addresses we had already blocked in the previous phase.
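The kind of log analysis that surfaced these counts can be approximated with a short script. The sketch below is a simplified example, not our actual analysis code: it assumes a combined-format access log at a placeholder path and Blacklight-style f[field][]=value facet parameters, possibly URL-encoded.

```python
# Count facet parameters per request in an access log (illustrative sketch).
import re
from collections import Counter

FACET_PARAM = re.compile(r"f(?:%5B|\[)", re.IGNORECASE)  # f[...]= or f%5B...%5D=
LOG_LINE = re.compile(r'^(?P<ip>\S+) \S+ \S+ \[[^\]]+\] "(?P<method>\S+) (?P<path>\S+)')

facet_histogram = Counter()   # number of requests observed per facet count
heavy_request_ips = Counter() # IPs issuing requests with 15 or more facets

with open("/var/log/nginx/access.log") as log:
    for line in log:
        m = LOG_LINE.match(line)
        if not m:
            continue
        n_facets = len(FACET_PARAM.findall(m.group("path")))
        facet_histogram[n_facets] += 1
        if n_facets >= 15:
            heavy_request_ips[m.group("ip")] += 1

print("requests with 15+ facets:", sum(v for k, v in facet_histogram.items() if k >= 15))
print("distinct IPs issuing them:", len(heavy_request_ips))
```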
Less than two weeks later, Google published faceted navigation SEO best practices[7], stating that “faceted navigation is by far the most common source of overcrawl issues site owners report to us.” We later read about the development of “bot traps,”[8] and realized that libraries may have unwittingly created bot traps by creating broadly and deeply filterable discovery interfaces. We began to explore the idea of using faceted interfaces as a kind of security honeypot[9] where we could detect signs of non-human interaction and establish bans that would propagate to other systems where that traffic may be more difficult to detect.
We developed a regular expression filter for fail2ban to isolate requests with more than a certain number of facets. We collaborated with our User Experience department and catalog committee to determine the ideal facet threshold, settling on a range of 8 to 12, depending on the severity of traffic, while allowing all queries from on-campus and VPN networks. After the facet-based bot blocking method went live on the afternoon of December 5th, 2024, we almost immediately saw large numbers of IP addresses being banned: 700 in the first 30 minutes, over 10,000 after 24 hours, and 16,000 after 48 hours. This reflects, in part, the very deep pool of IP addresses that the attackers were able to draw on. The 3,604,924 requests from the week before we deployed the facet-based blocks (November 28th-December 4th) originated from 207,351 distinct IP addresses. In contrast, the week after the facet-based blocks were deployed (December 6th-December 12th) saw 673,031 requests (a drop of over 81%) from 220,110 distinct IP addresses. The distinct IP address count was likely sustained due to the ability to submit a successful “bad” query before being banned. We began sharing the details of our method with other institutions the next day.
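A simplified version of such a rule is sketched below. It assumes Blacklight-style f[field][]=value facet parameters (URL-encoded as f%5B...%5D) in a combined-format access log; the threshold, log path, and exempt networks shown are placeholders rather than our production values, with RFC 5737 documentation ranges standing in for campus and VPN networks.

```
# /etc/fail2ban/filter.d/facet-flood.conf (illustrative)
[Definition]
# Flag any GET request whose query string contains 12 or more facet
# parameters; a literal percent sign is doubled (%%) in fail2ban configs.
failregex = ^<HOST> .*"GET [^"]*\?(?:[^"]*?f(?:\[|%%5B)){12,}[^"]* HTTP[^"]*"
ignoreregex =

# /etc/fail2ban/jail.d/facet-flood.local (illustrative)
[facet-flood]
enabled  = true
filter   = facet-flood
logpath  = /var/log/nginx/access.log
# Initially: a 2-minute ban, triggered on the second flagged query
maxretry = 2
findtime = 86400
bantime  = 120
# Exempt on-campus and VPN ranges (placeholder documentation networks shown)
ignoreip = 127.0.0.1/8 192.0.2.0/24 198.51.100.0/24
```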
We also saw adaptive behavior based on the length of the original ban, which started at just 2 minutes and began after the second “bad” faceted query. The initial ban eventually grew to 12 hours after the first offense, with longer bans for repeat offenses. Although this solution was sufficient for nearly a month, the crawlers eventually managed to evade the bans by limiting the number of facets in their queries and drawing from larger sets of IP addresses. We have continued to find this practice helpful as a complement to the other measures described below, as we experienced renewed outages when it was removed. Additionally, in the five months since this change, we have received only one user report of a problem with this limit.
Subnet-based blocks
After several weeks of stability, we began to struggle under the weight of huge numbers of neighboring IP addresses. We modified our facet-based ban rules so that an offense by any single IP address triggered a block of its entire /24 subnet, that is, the 256 addresses that share the first three octets and differ only in the last dotted quad. We also lowered the facet limit and started banning clients after a single flagged query. We were seeking to expand our protection to stabilize our servers with the least possible collateral damage to human users.
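One way to approximate subnet-wide bans with fail2ban is to override the ban and unban commands of the stock iptables action so the offending address is masked to its /24. The override below is an illustrative sketch, assuming the iptables-allports ban action, and is not necessarily the mechanism used in our production deployment.

```
# /etc/fail2ban/action.d/iptables-allports.local (illustrative override)
[Definition]
# Appending /24 makes iptables apply the rule to the entire 256-address
# block containing the offending IP rather than to the single address.
actionban   = <iptables> -I f2b-<name> 1 -s <ip>/24 -j <blocktype>
actionunban = <iptables> -D f2b-<name> -s <ip>/24 -j <blocktype>
```

Because a single offender now bans 255 of its neighbors, pairing a rule like this with generous ignoreip ranges and conservative initial ban times helps limit collateral damage to human users.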
Regional bans
Finally, the traffic moved offshore to East Asia, notably the People’s Republic of China (PRC). While this allowed us to programmatically ban large networks, we quickly found ourselves playing a game of whack-a-mole again. That is, once again managing an ongoing sequence of emergent problems, each requiring immediate, but likely temporary, resolution. After discussions with stakeholders, we decided to briefly ban all addresses from the PRC. We reasoned that the catalog was down anyway, and this might provide some temporary breathing room for us to develop a new solution. This had the surprising effect of almost immediately moving traffic to Brazil, and then to various locations in South America and Europe when we blocked Brazil. At this point, we were unable to restrict queries any further without significantly impacting human users. We knew we needed a non-local solution that could sit between the traffic and our server and examine the traffic to either block bots or prioritize on-campus and VPN users.
Commercial Web Application Firewall
We approached our campus networking and security colleagues for assistance. They worked through several possible options and then introduced us to their Web Application Firewall (WAF) vendor. A WAF examines and filters traffic to a web application based on the source, content, or other attributes of a request before the request reaches the application server. We worked in close partnership with our campus networking unit and the vendor to rapidly implement a trial of a cloud-based, GDPR-compliant WAF for the catalog server. On several occasions, this required involving senior engineers at the vendor, with one apologizing for smiling because “it’s just interesting that I haven’t seen something like this before.”
Once the WAF was implemented and configured, we started to see some relief from the traffic. The WAF included a bot filtering mechanism based on ratings of client “trustworthiness,” and also allowed us to more easily monitor and block traffic from troublesome sources. For the fourth time, we spent weeks feeling that we had found a solution. We licensed the WAF for a year, and we started planning to move other services behind it. After nearly a month of stability, however, large numbers of bots began finding their way through the WAF.
Commercial AI-driven bot detection
We worked with the WAF vendor to evaluate their “AI-driven” bot detection system that can be included as a component of the WAF. This seemed to solve the problem again, but the cost of this system would more than double the WAF price. This was prohibitively expensive, so we began investigating ways to prioritize traffic regionally (e.g., rate-limiting sources of traffic outside the campus and our VPN). The cloud WAF infrastructure made it impossible to include the VPN in these rules since the WAF was located at a non-campus IP address. We began discussing VPN exceptions with campus networking and security colleagues, while also evaluating a recent publication by Jonathan Rochkind[10], which described the use of the Cloudflare Turnstile CAPTCHA-style system to detect and block bot traffic.
In-browser client verification (i.e., Cloudflare Turnstile)
One common comment from our Libraries colleagues as we shared updates on the situation was something along the lines of, “Oh, no, please, I hope we don’t need a CAPTCHA…” Unfortunately, despite legitimate concerns regarding accessibility and user annoyance, the severity of the crawler situation forced us to explore this option. Fortunately, newer technologies like the free-to-use Cloudflare Turnstile offered a promising middle path. Unlike traditional CAPTCHAs that explicitly challenge users to identify traffic lights or crosswalks, Turnstile operates automatically for most users through a combination of browser fingerprinting, behavioral analysis, and lightweight challenges that typically complete automatically without user intervention. Cloudflare also asserts that Turnstile is WCAG 2.1 Level AA compliant[11].
We implemented Turnstile selectively, focusing initially on the most resource-intensive endpoints in our catalog, particularly those involving complex faceted searches that had proven vulnerable to crawler abuse. This targeted approach minimized the impact on typical user workflows while providing robust protection for our most susceptible service points. We configured the system to exempt on-campus and VPN users entirely, ensuring that many of our primary users experienced no change in service. We did not issue challenges for record views or searches without facets, and we only challenge a client once per day.
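As a rough, framework-agnostic sketch of this gating logic (not our production code, which sits in front of a Blacklight catalog), the example below exempts placeholder campus and VPN ranges, challenges only facet-heavy searches, honors a once-per-day pass, and validates tokens against Cloudflare’s documented siteverify endpoint. The networks, threshold, secret, and helper names are assumptions.

```python
# Illustrative Turnstile gating and verification logic; the networks,
# threshold, secret key, and helper names are placeholders.
import ipaddress
import json
import urllib.parse
import urllib.request

SITEVERIFY_URL = "https://challenges.cloudflare.com/turnstile/v0/siteverify"
TURNSTILE_SECRET = "0x0000000000000000000000000000000000000000"  # placeholder
EXEMPT_NETWORKS = [ipaddress.ip_network(n) for n in ("192.0.2.0/24", "198.51.100.0/24")]
FACET_THRESHOLD = 8  # challenge only facet-heavy catalog searches


def needs_challenge(client_ip: str, query_params: dict, passed_today: bool) -> bool:
    """Decide whether this request should see a Turnstile interstitial."""
    ip = ipaddress.ip_address(client_ip)
    if any(ip in net for net in EXEMPT_NETWORKS):  # on-campus / VPN traffic
        return False
    if passed_today:  # one successful check per client per day
        return False
    facet_count = sum(1 for key in query_params if key.startswith("f["))
    return facet_count >= FACET_THRESHOLD  # skip record views and facet-free searches


def verify_turnstile_token(token: str, client_ip: str) -> bool:
    """Validate a Turnstile response token via Cloudflare's siteverify API."""
    data = urllib.parse.urlencode({
        "secret": TURNSTILE_SECRET,
        "response": token,
        "remoteip": client_ip,
    }).encode()
    with urllib.request.urlopen(SITEVERIFY_URL, data=data, timeout=5) as resp:
        return json.load(resp).get("success", False)
```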
Results were immediate and dramatic. Server load returned to normal levels within hours of implementation, and we experienced very few further outages on protected endpoints. Those outages were eliminated by restoring our facet-based ban mechanism. In the extremely rare cases where legitimate users encountered service disruptions due to Turnstile (i.e., two incidents in the first four months of use), our cross-departmental response team quickly guided users to solutions.
Lessons Learned
Our sustained efforts, along with those of the wider libraries and archives community, to counter the evolving practices of these crawlers have yielded several valuable lessons. These insights, developed through a continuous cycle of implementing defenses and observing crawler adaptations, reveal the distinct challenges posed by this modern bot activity and the inherent limitations of conventional security and traffic management measures.
Banning blocks of IP addresses is insufficient
The traffic we have observed originates from a vast number of unrelated IP addresses, often from consumer internet service providers rather than from recognized corporate servers. This distributed network of bots has proven to be adept at evading our mitigation measures, as it undermines the traditional approach of banning attributable blocks of IP addresses. It appears that these requests were coming from a residential proxy network, essentially a legitimized botnet. These are services that can be hired to distribute network traffic over a wide range of (sometimes tens of millions of) unrelated individual internet addresses. These proxy nodes can be established in various ways, but one method is to conceal them within mobile apps, browser extensions, or other software[12]. Others include paying users to run this software on their computers and building massive collections of proxied addresses in data centers.
Rule-based controls are insufficient
We have found that static, rule-based defenses are insufficiently flexible to respond to crawlers that exhibit a kind of intelligent adaptation. As we implemented countermeasures, we observed crawler behavior rapidly evolving to circumvent our defenses. When we blocked based on user agents or specific request patterns, the crawlers quickly adapted by modifying these identifiers. Unlike traditional web crawlers that follow predictable paths through sites, these new crawlers exhibited what appeared to be machine learning-based adaptability. They would probe different access vectors, analyze our responses, and modify their approach accordingly.
While we are aware of efforts to implement complex bot fingerprinting in libraries, our attempts to identify patterns in request headers have proved largely unsuccessful, as the crawlers routinely rotate or randomize header values. Even when we identified consistent patterns in behavior, these would shift within days or sometimes hours of implementing blocking measures. This high level of adaptation suggests sophisticated orchestration beyond typical web crawling tools, pointing to well-funded operations with significant resources at their disposal.
We are dramatically out-resourced
The crawlers producing this massive wave of traffic are as inefficient as they are evasive. Rather than following responsible web crawling practices, such as respecting robots.txt, utilizing sitemaps, or implementing reasonable rate limits, these crawlers appear to be designed to extract maximum data, regardless of the computational or infrastructure costs imposed on libraries and archives. We observed crawlers repeatedly retrieving the same resources tens of thousands of times per day, ignoring cache headers, and generating needlessly expensive (i.e., slow) queries applying exhaustive combinations of facets. This “brute force” approach suggests either a lack of technical sophistication or, more likely, a business model where the value of comprehensive data extraction outweighs any concern for efficiency or ethical crawling standards. These inefficient patterns, combined with the highly sophisticated adaptive and evasive behavior, support our hypothesis that the entities behind these crawlers prioritize rapid and comprehensive data acquisition over efficient resource utilization, likely because the current commercial value of training AI models outweighs the costs of infrastructure. The contrast in sophistication between evasive and harvesting techniques also suggests the pairing of advanced, massively distributed, and evasive network traffic infrastructure with immature crawling technology. A common refrain among our colleagues was that we would love to share our data with all comers, in any preferred format, if it meant avoiding this abuse.
Scaling and performance optimization have limited impacts
Our initial response focused on scaling infrastructure and optimizing application performance. However, we quickly discovered fundamental limits to this approach. Even after increasing server resources, optimizing the catalog application, and carefully tuning server configurations, the sheer volume of traffic overwhelmed our systems. Although institutions with cloud-based infrastructure were able to scale much more aggressively by adding servers behind a load balancer, they eventually reported similar experiences as well as concerns related to the fees associated with cloud-based scaling. With on-premises infrastructure, we faced physical hardware limitations that couldn’t be quickly addressed.
This experience has demonstrated that technical resource optimization alone, although helpful, cannot solve the problem of aggressive crawler traffic. Comprehensively addressing this issue requires fundamentally rethinking how we prioritize human access to our systems.
Libraries and archives may be uniquely vulnerable
While organizations across various sectors have experienced increased crawler activity, libraries appear uniquely vulnerable to these aggressive bots. Our discovery interfaces combine deep faceting capabilities, highly structured metadata, extensive collections of unique assets of profound research value, and typically lower access barriers than those of commercial entities, creating an ideal target for data harvesting operations. This vulnerability is further compounded by libraries’ commitment to open access to information, a principle that can sometimes conflict with the need for robust security and business continuity measures.
Future Directions and Collaboration
Community knowledge sharing
The Code4Lib Slack channel quickly became a valuable resource for those aware of its existence, while other critical information sharing occurred informally at the margins of unrelated community calls and events. However, we discovered a significant information gap as we communicated with peer institutions: many were completely unaware of this community resource despite experiencing identical issues. We found ourselves repeatedly directing colleagues from other institutions to this channel, highlighting the uneven distribution of community knowledge. When information travels primarily through informal networks, some institutions, particularly smaller ones, are most likely to be left without effective practices.
The ad hoc nature of our collective response highlights the need for more formalized knowledge-sharing mechanisms that can reach libraries of all sizes and technical capacities. While the Code4Lib Slack channel represents an important first step, expanding communication channels and developing shared repositories of code (of which there are several recent examples[13][14][15][16]) and documentation specifically designed for institutions with limited technical resources would strengthen our community’s collective resilience against these evolving threats.
Proof-of-work checks
One promising approach involves implementing lightweight “proof of work” checks. Unlike traditional CAPTCHA-style mechanisms, these checks operate invisibly to users while requiring browsers to complete small computational tasks that are trivial for individual sessions but may become prohibitively expensive at crawler scale. Cloudflare Turnstile includes proof-of-work checks as one component of its verification suite. One prominent open-source example is Xe Iaso’s Anubis software[17], which “weighs the soul of your connection using a proof-of-work challenge in order to protect upstream resources from scraper bots.” This approach may provide an improved balance between accessibility and security. The computational requirements can be dynamically adjusted based on system load, increasing the barrier during suspected crawler activity while remaining negligible during normal operations. Several academic libraries and archives have recently implemented Anubis to control bot traffic[18][19][20][21], a process documented by Sean Aery at Duke University Libraries[22].
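To illustrate the general mechanism (this is a toy example, not Anubis’s actual algorithm or parameters), a hash-based proof-of-work check asks the client to find a nonce whose hash of the challenge meets a difficulty target, while the server verifies the result with a single hash:

```python
# Toy hash-based proof-of-work, illustrating the general idea behind tools
# like Anubis; not their actual algorithm or parameters.
import hashlib
import itertools


def solve(challenge: str, difficulty_bits: int) -> int:
    """Client side: find a nonce so SHA-256(challenge + nonce) starts with
    `difficulty_bits` zero bits. Expected cost grows as 2**difficulty_bits."""
    target = 1 << (256 - difficulty_bits)
    for nonce in itertools.count():
        digest = hashlib.sha256(f"{challenge}{nonce}".encode()).digest()
        if int.from_bytes(digest, "big") < target:
            return nonce


def verify(challenge: str, nonce: int, difficulty_bits: int) -> bool:
    """Server side: a single hash confirms the client did the work."""
    digest = hashlib.sha256(f"{challenge}{nonce}".encode()).digest()
    return int.from_bytes(digest, "big") < (1 << (256 - difficulty_bits))


# A human browser session pays this cost once; a crawler rotating through
# many identities pays it again and again, which is where the deterrent lies.
nonce = solve("session-abc123", difficulty_bits=16)
assert verify("session-abc123", nonce, difficulty_bits=16)
```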
Regional service prioritization
Our experiments with geographic blocking, while effective in the short term, highlighted the need for more nuanced regional service prioritization. Rather than outright blocking entire countries or regions, future infrastructure could implement tiered access channels that prioritize traffic from the institution’s primary service areas during periods of high load.
These sorts of tiered access models conflict with our professional commitment to open access to information[23], but the intensity of these “attacks” has prompted surprising conversations along these lines. Implementation would require careful consideration of equity and access principles, potentially including allowlisting mechanisms for known international research and academic partners, as well as transparent communication about resource allocation during periods of peak demand.
Impact on web analytics
It is also important to note the effect this may have on web analytics reporting. Although bot traffic filtering provided by analytics platforms has always been imperfect, the volume of undetectable machine-driven traffic is likely to massively distort any reporting for periods before the implementation of effective mitigation methods.
Conclusion
The unprecedented surge in aggressive crawler traffic, coinciding with the release of commercial generative AI models, has forced libraries to rapidly evolve our approach to network and application security. While the timing strongly suggests connections to AI training, we cannot definitively attribute these activities to specific organizations or purposes. Regardless of intent, the impact on library services has been substantial, requiring significant financial and human resources to maintain stable access for our human users.
The libraries and archives community has developed a multi-layered defense strategy that has proven, to date, to be somewhat effective against even the most aggressive and evasive crawlers. We have continued to experience intermittent service disruptions related to this traffic, and some of our peers now require logins for catalog access and block access to increasingly broad sets of IP addresses, sometimes representing entire countries. Although client checks such as Cloudflare Turnstile or Anubis are currently the most effective layer of the strategy, the facet-aware detection method introduced in this article represents a novel approach designed explicitly for library discovery interfaces. We strive to transform a potential vulnerability into a detection opportunity by identifying non-human interaction patterns unique to faceted interfaces.
As the generative AI landscape continues to evolve, we anticipate ongoing adaptation from both crawlers and our institutions. By establishing structured channels for sharing detection patterns and mitigation strategies, libraries and archives can collectively build a more resilient infrastructure that maintains our commitment to open access to information while ensuring service stability for our communities. As we look to the future, the international community that formed out of necessity and desperation may serve as a valuable model for a rapidly changing technological landscape.
Acknowledgments
This work would not have been possible without the cross-departmental team at UNC-Chapel Hill that responded to these incidents over the past year. In addition to the authors of this article, this team includes Sidney Stafford, Alex Everett, Dean Farrell, Jamie McGarty, Joseph Moran, Zachary Tewell, Emily Brassell, Benjamin Pennell, Chad Haefele, and others, with invaluable strategic support from María Estorino and Michael Barker. We also send our thanks to our regional colleagues at the Triangle Research Libraries Network, and specifically to Adam Constabaris at NCSU Libraries for introducing us to residential proxy networks and to Sean Aery for sharing Duke University Libraries’ experience with the Anubis system.
We are grateful for the Code4Lib Slack #bots channel community for its curiosity, creativity, and generosity. We are particularly appreciative of Jonathan Rochkind’s work demonstrating the integration of Cloudflare Turnstile in a Blacklight application.
About the authors
Jason Casden is the Head of Software Development at the University of North Carolina at Chapel Hill University Libraries. He leads engineers working on a diverse portfolio of projects dedicated to building, maintaining, and innovating the Libraries’ digital services, including digital collections and repositories, archival information systems, digital preservation systems, and the public catalog discovery layer.
David Romani is the recently retired Senior Linux System Administrator at the University of North Carolina at Chapel Hill University Libraries. David’s professional focus at UNC was system architecture and automation.
Tim Shearer is the Associate University Librarian for Digital Strategies and IT at the University of North Carolina at Chapel Hill University Libraries. In this role he leads the Library Information Technology division and has responsibility for commodity computing, infrastructure, user experience, software development, repository development, discovery, digitization, and information security. Tim is interested in how technology can play a transformative role in supporting discovery, delivery, use, and preservation of information. He holds an MSLS from UNC Chapel Hill’s School of Information and Library Science.
Jeff Campbell is the Head of Infrastructure Management Services at the University of North Carolina at Chapel Hill University Libraries. He leads a skilled team of systems professionals responsible for the strategic planning and management of the Libraries’ IT infrastructure, including oversight of server environments, network architecture, cybersecurity measures, and disaster recovery planning.
References
[1a][1b] Weinberg M. Are AI Bots Knocking Cultural Heritage Offline? Engelberg Center on Innovation Law & Policy; 2025. https://glamelab.org/products/are-ai-bots-knocking-cultural-heritage-offline/
[2a][2b] Shearer K, Walk P. The impact of AI bots and crawlers on open repositories: Results of a COAR survey, April 2025. COAR: Confederation of Open Access Repositories; 2025. https://coar-repositories.org/news-updates/open-repositories-are-being-profoundly-impacted-by-ai-bots-and-other-crawlers-results-of-a-coar-survey/
[3] Code4Lib Slack page. Code4Lib. [accessed 2025 May 22]. http://code4lib.org/irc/#slack
[4] networksdb.io: Company to IP, IP address owners, Reverse Whois + DNS, free tools & API. [accessed 2025 May 22]. https://networksdb.io/
[5] Dark Visitors – Track the AI Agents and Bots Crawling Your Website. [accessed 2025 May 22]. https://darkvisitors.com/
[6] fail2ban/fail2ban: Daemon to ban hosts that cause multiple authentication errors. 2025 [accessed 2025 May 22]. https://github.com/fail2ban/fail2ban
[7] Illyes G. Crawling December: Faceted navigation | Google Search Central Blog. Google for Developers. [accessed 2025 May 22]. https://developers.google.com/search/blog/2024/12/crawling-december-faceted-nav
[8] Koebler J. Developer Creates Infinite Maze That Traps AI Training Bots. 404 Media. 2025 Jan 23 [accessed 2025 May 19]. https://www.404media.co/developer-creates-infinite-maze-to-trap-ai-crawlers-in/
[9] Provos N, Holz T. Virtual honeypots: from botnet tracking to intrusion detection. Addison-Wesley; 2010.
[10] Rochkind J. Using CloudFlare Turnstile to protect certain pages on a Rails app – Bibliographic Wilderness. [accessed 2025 May 19]. https://bibwild.wordpress.com/2025/01/16/using-cloudflare-turnstile-to-protect-certain-pages-on-a-rails-app/
[11] FAQ · Cloudflare Turnstile docs. [accessed 2025 May 21]. https://developers.cloudflare.com/turnstile/frequently-asked-questions/
[12] Mi X et al. Resident Evil: Understanding Residential IP Proxy as a Dark Service. In: 2019 IEEE Symposium on Security and Privacy (SP). 2019. p 1185–1201. https://ieeexplore.ieee.org/abstract/document/8835239. https://doi.org/10.1109/SP.2019.00011
[13] Rochkind J. samvera-labs/bot_challenge_page: Show a bot challenge interstitial for Rails, usually using Cloudflare Turnstile. [accessed 2025 May 22]. https://github.com/samvera-labs/bot_challenge_page/
[14] Corall J. libops/captcha-protect: Traefik middleware to add an anti-bot challenge to individual IPs in a subnet when traffic spikes are detected from that subnet. [accessed 2025 May 23]. https://github.com/libops/captcha-protect
[15] Cliff D. cerberus/config/initializers/zoo_rack_attack.rb at support/1.x · NEU-Libraries/cerberus. [accessed 2025 May 22]. https://github.com/NEU-Libraries/cerberus/blob/support/1.x/config/initializers/zoo_rack_attack.rb
[16] ai-robots-txt/ai.robots.txt: A list of AI agents and robots to block. 2025 [accessed 2025 May 22]. https://github.com/ai-robots-txt/ai.robots.txt
[17] Iaso X. TecharoHQ/anubis: Weighs the soul of incoming HTTP requests to stop AI crawlers. 2025 [accessed 2025 May 22]. https://github.com/TecharoHQ/anubis
[18] Archives & Manuscripts at Duke University Libraries. David M. Rubenstein Rare Book & Manuscript Library. [accessed 2025 Jun 22]. https://archives.lib.duke.edu/
[19] Connecticut’s Archives Online – Repositories. Western Connecticut State University Archives. [accessed 2025 Jun 22]. https://archives.library.wcsu.edu/arclight/caoSearch/repositories/
[20] UCLA Library Digital Collections. [accessed 2025 Jun 22]. https://digital.library.ucla.edu/
[21] Columbia University Libraries Finding Aids. [accessed 2025 Jun 22]. https://findingaids.library.columbia.edu/
[22] Aery S. Anubis Pilot Project Report – June 2025. 2025 Jun 26 [accessed 2025 Jul 8]. https://hdl.handle.net/10161/32990
[23] Hellman E. AI bots are destroying Open Access. Go To Hellman. [accessed 2025 May 20]. https://go-to-hellman.blogspot.com/2025/03/ai-bots-are-destroying-open-access.html