Issue 61, 2025-10-21

Editorial

Edward M. Corrado

Welcome to the 61st issue of Code4Lib Journal. We hope that you enjoy the variety of articles published in this issue.

Mitigating Aggressive Crawler Traffic in the Age of Generative AI: A Collaborative Approach from the University of North Carolina at Chapel Hill Libraries

Jason Casden, David Romani, Tim Shearer, and Jeff Campbell

The rise of aggressive, adaptive, and evasive web crawlers is a significant challenge for libraries and archives, causing service disruptions and overwhelming institutional resources. This article details the experiences of the University of North Carolina at Chapel Hill University Libraries in combating an unprecedented flood of crawler traffic. It describes the escalating mitigation efforts, from traditional client blocking to the implementation of more advanced techniques such as request throttling, regional traffic prioritization, novel facet-based bot detection, commercial Web Application Firewalls (WAFs), and ultimately, in-browser client verification with Cloudflare Turnstile. The article highlights the adaptive nature of these crawlers, the limitations of isolated institutional responses, and the critical lessons learned from mitigation efforts, including the issues introduced by residential proxy networks and the extreme scale of the traffic. Our experiences demonstrate the effectiveness of a multi-layered defense strategy that includes both commercial and library-specific solutions, such as facet-based bot detection. The article emphasizes the importance of community-wide collaboration, proposing future directions such as formalized knowledge sharing and the ongoing development of best practices to collectively address this evolving threat to open access and the stability of digital library services.
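The article itself describes the facet-based detection UNC developed; purely as a hedged sketch of the general idea, the snippet below flags clients that enumerate an implausible number of distinct facet combinations within a short time window. The log fields, parameter prefix, window size, and threshold are illustrative assumptions, not the authors' actual rules or implementation.

    # Hypothetical sketch: flag clients that enumerate far more distinct facet
    # combinations than a human searcher plausibly would. Window, threshold,
    # and parameter naming are illustrative assumptions, not UNC's rules.
    from collections import defaultdict, deque
    import time

    WINDOW_SECONDS = 300      # look-back window (assumed)
    FACET_THRESHOLD = 50      # distinct facet combinations per window (assumed)

    recent = defaultdict(deque)  # client_ip -> deque of (timestamp, facet signature)

    def facet_signature(query_params):
        """Normalize the facet portion of a search request into a hashable key."""
        facets = {k: tuple(sorted(v)) for k, v in query_params.items()
                  if k.startswith("f[")}          # assumed facet parameter prefix
        return tuple(sorted(facets.items()))

    def looks_like_facet_crawler(client_ip, query_params, now=None):
        now = now or time.time()
        window = recent[client_ip]
        window.append((now, facet_signature(query_params)))
        # drop entries that have aged out of the window
        while window and now - window[0][0] > WINDOW_SECONDS:
            window.popleft()
        distinct = {sig for _, sig in window}
        return len(distinct) > FACET_THRESHOLD

A check like this would sit alongside, not replace, the throttling, WAF, and Turnstile layers the article describes.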

Liberation of LMS-siloed Instructional Data

Hyung Wook Choi, Jonathan Wheeler, Weimao Ke, Lei Wang, Jane Greenberg, and Mat Kelly

This paper presents an initiative to extract and repurpose instructional content from a series of Blackboard course shells associated with IMLS-funded boot camp events conducted in June of 2021, 2022, and 2023. These events, facilitated by ten faculty members and attended by 68 fellows, generated valuable educational materials currently confined within proprietary learning management system environments. The objective of this project is to enable broader access and reuse of these resources by migrating them to a non-siloed, static website independent of the original Blackboard infrastructure. We describe our methodology for acquiring and validating the data exports, outline the auditing procedures implemented to ensure content completeness and integrity, and discuss the challenges encountered throughout the process. Finally, we report on the current status of this ongoing effort and its implications for future dissemination and reuse of educational materials.
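The auditing procedures are described in the paper itself; as a hedged sketch only, the snippet below shows one way such an audit might be approached: opening a Blackboard course export package (a ZIP archive containing an imsmanifest.xml) and checking that every file the manifest references is actually present. The manifest handling is simplified and the filename is a placeholder, not the authors' workflow.

    # Hedged sketch: audit a Blackboard course export ZIP by checking that every
    # file referenced in imsmanifest.xml exists in the archive. Namespace and
    # path handling are simplified; real exports may vary.
    import zipfile
    import xml.etree.ElementTree as ET

    def audit_export(export_zip_path):
        with zipfile.ZipFile(export_zip_path) as zf:
            members = set(zf.namelist())
            manifest = ET.fromstring(zf.read("imsmanifest.xml"))
            # collect every <file href="..."> regardless of namespace prefix
            referenced = {el.attrib["href"]
                          for el in manifest.iter()
                          if el.tag.split("}")[-1] == "file" and "href" in el.attrib}
            return sorted(referenced - members)

    missing = audit_export("boot_camp_2021_export.zip")  # placeholder filename
    if missing:
        print(f"{len(missing)} referenced files are missing from the export:")
        for path in missing:
            print("  ", path)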

Extracting A Large Corpus from the Internet Archive, A Case Study

Eric C. Weig

The Internet Archive was founded on May 10, 1996, in San Francisco, CA. Since its inception, the archive has amassed an enormous corpus of content, including over 866 billion web pages, more than 42.5 million print materials, 13 million videos, and 14 million audio files. It is relatively easy to upload content to the Internet Archive. It is also easy to download individual objects by visiting their pages and clicking on specific links. However, downloading a large collection, such as thousands or even tens of thousands of items, is not as easy. This article outlines how the University of Kentucky Libraries downloaded over 86,000 previously uploaded newspaper issues from the Internet Archive for local use. The process leveraged ChatGPT to generate Python scripts that accessed the Internet Archive via its API (Application Programming Interface).
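The article's scripts were generated with ChatGPT against the Internet Archive API; as a simple hedged illustration of the same bulk-download pattern, the sketch below uses the internetarchive Python package to search a collection and download each item. The collection identifier and format filter are placeholders, not the values used at the University of Kentucky.

    # Hedged sketch: bulk-download items from an Internet Archive collection with
    # the `internetarchive` Python package. The collection name and format filter
    # are placeholders, not the University of Kentucky's actual values.
    from internetarchive import search_items, download

    COLLECTION = "example-newspaper-collection"  # placeholder identifier

    for result in search_items(f"collection:{COLLECTION}"):
        identifier = result["identifier"]
        print(f"Downloading {identifier} ...")
        download(
            identifier,
            formats=["Text PDF"],    # restrict to one derivative format (assumed)
            destdir="ia_downloads",  # local target directory
            retries=3,
            verbose=True,
        )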

Retrieval-Augmented Generation for Web Archives: A Comparative Study of WARC-GPT and a Custom Pipeline

Corey Davis

Large Language Models (LLMs) are reshaping digital preservation and access in libraries, but their limitations (hallucinations, opacity, and resource demands) remain significant. Retrieval-Augmented Generation (RAG) offers a promising mitigation strategy by grounding LLM outputs in specific digital collections. This article compares the performance of WARC-GPT's default RAG implementation, using unfiltered WARC files from Archive-It, against a custom-built RAG solution that applies optimization strategies to both modeling and data (WARC) preprocessing. Tested on a collection of thousands of archived pages from the Bob’s Burgers fan wiki, the study analyzes trade-offs in preprocessing, embedding strategies, retrieval accuracy, and system responsiveness. Findings suggest that while WARC-GPT lowers barriers to experimentation, custom RAG pipelines offer substantial improvements for institutions with the technical capacity to implement them, especially in terms of data quality, efficiency, and trustworthiness.
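The specifics of both WARC-GPT's pipeline and the custom pipeline are in the article itself; purely as a hedged sketch of the general pattern being compared, the snippet below extracts HTML responses from a WARC file with warcio, embeds text chunks with a sentence-transformers model, and retrieves the closest chunks for a question before they would be passed to an LLM. The model choice, chunking, and filenames are illustrative assumptions.

    # Hedged sketch of RAG over a WARC file: extract HTML text, embed chunks,
    # retrieve the nearest chunks for a question. Library and model choices are
    # illustrative assumptions, not the article's actual pipeline.
    import numpy as np
    from warcio.archiveiterator import ArchiveIterator
    from bs4 import BeautifulSoup
    from sentence_transformers import SentenceTransformer

    def warc_chunks(warc_path, chunk_size=1000):
        """Yield plain-text chunks from HTML response records in a WARC file."""
        with open(warc_path, "rb") as stream:
            for record in ArchiveIterator(stream):
                if record.rec_type != "response":
                    continue
                ctype = record.http_headers.get_header("Content-Type") or ""
                if "text/html" not in ctype:
                    continue
                html = record.content_stream().read()
                text = BeautifulSoup(html, "html.parser").get_text(" ", strip=True)
                for i in range(0, len(text), chunk_size):
                    yield text[i:i + chunk_size]

    model = SentenceTransformer("all-MiniLM-L6-v2")          # assumed embedding model
    chunks = list(warc_chunks("fan-wiki-crawl.warc.gz"))     # placeholder filename
    embeddings = model.encode(chunks, normalize_embeddings=True)

    def retrieve(question, k=5):
        """Return the k chunks most similar to the question (cosine similarity)."""
        q = model.encode([question], normalize_embeddings=True)[0]
        scores = embeddings @ q
        return [chunks[i] for i in np.argsort(scores)[::-1][:k]]

    context = "\n\n".join(retrieve("Who voices Bob Belcher?"))
    # `context` would then be inserted into the prompt sent to the LLM.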

Building and Deploying the Digital Humanities Quarterly Recommender System

Haining Wang, Joel Lee, John A. Walsh, Julia Flanders, and Benjamin Charles Germain Lee

Since 2007, Digital Humanities Quarterly has published over 750 scholarly articles, constituting a significant repository of scholarship within the digital humanities. As the journal’s corpus of articles continues to grow, it is no longer possible for readers to manually navigate the title and abstract of every article in order to stay apprised of relevant work or conduct literature reviews. To address this, we have implemented a recommender system for the Digital Humanities Quarterly corpus, generating recommendations of related articles that appear below each article on the journal’s website with the goal of improving discoverability. These recommendations are generated via three different methods: a keyword-based approach based on a controlled vocabulary of topics assigned to articles by editors; a TF-IDF approach applied to full article text; and a deep learning approach using the Allen Institute for Artificial Intelligence’s SPECTER2 model applied to article titles and abstracts. In this article, we detail our process of creating this recommender system, from the article pre-processing pipeline to the front-end implementation of the recommendations on the Digital Humanities Quarterly website [1]. All of the code for our recommender system is publicly available in the Digital Humanities Quarterly GitHub repository [2].
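Of the three methods, the TF-IDF approach is the most straightforward to reproduce; the hedged sketch below shows the general pattern of ranking related articles by cosine similarity of TF-IDF vectors over full text. It uses scikit-learn and placeholder data rather than the project's actual code, which is available in the Digital Humanities Quarterly GitHub repository [2].

    # Hedged sketch of TF-IDF "related articles": rank every other article by
    # cosine similarity of TF-IDF vectors over full text. Placeholder data;
    # the project's real implementation lives in the DHQ GitHub repository.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    articles = {                      # article id -> full text (placeholders)
        "000001": "text of the first article ...",
        "000002": "text of the second article ...",
        "000003": "text of the third article ...",
    }

    ids = list(articles)
    tfidf = TfidfVectorizer(stop_words="english")
    matrix = tfidf.fit_transform(articles[i] for i in ids)
    similarity = cosine_similarity(matrix)

    def related(article_id, k=2):
        """Return the k most similar articles, excluding the article itself."""
        row = similarity[ids.index(article_id)]
        ranked = sorted(zip(ids, row), key=lambda pair: pair[1], reverse=True)
        return [(other, score) for other, score in ranked if other != article_id][:k]

    print(related("000001"))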

What it Means to be a Repository: Real, Trustworthy, or Mature?

Seth Shaw

Archivists occasionally describe digital repositories as being “not real,” suggesting that their technical digital preservation infrastructure is inadequate to the task of digital preservation. This article discusses the concept of digital repositories, highlighting the distinction between digital repository technical infrastructure and institutions collecting digital materials, and what it means to be a “real” digital repository. It argues that the Open Archival Information System Reference Model and notions of Trustworthy Digital Repositories are inadequate for determining the “realness” of a digital repository and advocates using maturity models as a framework for discussing repository capability.

From Notes to Networks: Using Obsidian to Teach Metadata and Linked Data

Kara Long and Erin Yunes

In this article, we describe a novel use of the note-taking software Obsidian as a method for users without formal training in metadata creation to develop culturally relevant data literacies across two digital archiving projects. We explain how Obsidian’s built-in use of linked data provides an open-source, flexible, and potentially scalable way for users to creatively interact with digitized materials, navigate and create metadata, and model relationships between digital objects. Furthermore, we demonstrate how Obsidian’s local and offline hosting features can be leveraged to include team members with low or unreliable internet access.
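The projects themselves are described in the article; as a hedged illustration of how Obsidian's wiki-links can be mined for relationship data, the sketch below walks a local vault and extracts [[wiki-link]] targets from each note as simple (source, target) pairs. The vault path and note conventions are assumptions, not the authors' setup.

    # Hedged sketch: extract [[wiki-link]] relationships from an Obsidian vault so
    # note-to-note links can be exported as (source, target) pairs. The vault path
    # is a placeholder; real projects may follow other conventions.
    import re
    from pathlib import Path

    WIKILINK = re.compile(r"\[\[([^\]|#]+)")   # link target, ignoring aliases/anchors

    def vault_links(vault_path):
        """Yield (source note, linked note) pairs for every note in the vault."""
        for note in Path(vault_path).rglob("*.md"):
            text = note.read_text(encoding="utf-8")
            for target in WIKILINK.findall(text):
                yield note.stem, target.strip()

    for source, target in vault_links("my-archive-vault"):   # placeholder path
        print(f"{source} -> {target}")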

ISSN 1940-5758