by Corey Davis
Introduction
Large Language Models (LLMs) have begun reshaping how libraries approach digital preservation and access. We are already seeing LLMs that help transcribe handwritten manuscripts and extract entities from historical texts, among other tasks (Hiltmann et al. 2024; Humphries et al. 2024). At their best, these systems can bridge technical skill gaps and enhance efficiency; at worst, they hallucinate misinformation and act as inscrutable black boxes. Retrieval-Augmented Generation (RAG) has emerged as a promising technique to mitigate some of these issues by grounding LLM outputs in our own unique digital collections. The idea is to map a digital collection into a high-dimensional vector space and let an LLM “chat” with the collection using natural language queries, backed by real source material.
Web ARChive (WARC) files pose a particularly interesting test case for this approach. WARC files are notoriously difficult to search with traditional methods. Keyword searches in tools like the Wayback Machine or Archive-It often fail to surface meaningful results from the mass of HTML, scripts, images, and other files that go into creating a website (Milligan 2012; Costa 2021). To investigate whether LLMs in a RAG pipeline could improve web archive access, I conducted two related case studies during my 2024–25 research leave at the University of Victoria Libraries. I first experimented with WARC-GPT, an open-source tool developed by the Harvard Library Innovation Lab (Cargnelutti et al. 2024), to understand its capabilities out of the box. I then built a customized RAG pipeline, applying targeted data cleaning and architectural adjustments. Both approaches were tested on a collection of thousands of archived pages from the Bob’s Burgers fan wiki, a site rich in unstructured narrative content [1]. This article explores the difference between untuned (i.e. out-of-the-box) and tuned RAG implementations, and reflects on the promise, challenges, and significant effort required to optimize LLM-assisted access to web archives.
Figure 1. A conceptual diagram of the RAG workflow used in these experiments. User queries are converted to embedding vectors and matched against a vector database of the archived collection. Relevant “context chunks” of text are retrieved and passed into an LLM, along with system prompts, which then generates a context-informed answer. Source: https://en.wikipedia.org/wiki/Retrieval-augmented_generation#/media/File:RAG_diagram.svg
WARC-GPT: RAG for web archives
Released in early 2024, WARC-GPT is an open-source tool developed by the Harvard Library Innovation Lab that enables users to create a conversational chatbot over a set of WARC files. Instead of manually browsing or keyword searching a web archive, a user can query it in plain language and receive answers with cited sources. WARC-GPT ingests WARC files, converts the text into vector embeddings, and at query time retrieves relevant text snippets to feed into a Large Language Model (LLM) for answer generation. Notably, the system provides source attribution, listing which archived pages and snippets were used to generate the answer: a critical feature for building trust in an AI-driven research tool [2].
In hands-on testing, untuned WARC-GPT showcased the potential of conversational web archive search, but several significant challenges emerged. Foremost among these was the inherent noisiness of WARC files themselves, which encompass not only the main textual content but also raw HTML, CSS and JavaScript, boilerplate, navigation menus, and duplicate text. While WARC-GPT’s ingestion pipeline uses a standard configuration of BeautifulSoup for HTML parsing, it does not apply site-specific logic to filter out irrelevant elements. As a result, semantically important content (often marked up in distinctive ways on community-curated sites) was frequently diluted by unrelated or repetitive text in the resulting embeddings. This weak HTML filtering allowed low-value content to be embedded alongside meaningful text, increasing vector noise and degrading retrieval accuracy. Consequently, the retrieval process often surfaced spurious or misleading snippets, leading to fragmented or incomplete responses.
A second major challenge was scale and performance. Converting a large web archive into vector embeddings is computationally intensive and quickly exposes infrastructure limitations. In my case, WARC-GPT’s default pipeline took over a week to process what would be considered a modest-sized collection by research library standards. Embedding with transformer-based models, particularly those optimized for semantic search like E5 or Ada, was especially taxing. On an Apple M3 Pro with 18 GB of unified memory, the process easily saturated both CPU/GPU capacity and memory, leading to slowdowns during batch processing.
Effective RAG pipelines require experimentation and iteration: embedding strategies and models are not one-size-fits-all. They need to be tailored to the collection, which means testing different models (e.g., text-embedding-ada-002, intfloat/e5-base-v2), tuning chunk sizes and overlaps, exploring sparse vs. dense vs. hybrid retrieval, and comparing vector stores like FAISS, Chroma, or Weaviate with different indexing strategies. But when a single embedding run takes days or weeks, this kind of iterative tuning becomes impractical, especially for institutions lacking dedicated compute resources or flexible pipelines. Offloading to commercial cloud infrastructure is an option, but one that carries substantial cost, especially when using proprietary APIs from OpenAI, Anthropic, and others.
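To make the scale of this tuning problem concrete, here is a minimal sketch (in Python, with illustrative model names and parameter values rather than recommendations) that enumerates a modest grid of embedding models, chunk sizes, and overlaps. Every combination implies a full re-embedding of the corpus, which is exactly what becomes impractical when a single run takes days.

from itertools import product

# Illustrative tuning grid; embed_corpus() and evaluate_retrieval() are
# hypothetical stand-ins for a full pipeline run and its evaluation.
embedding_models = ["intfloat/e5-base-v2", "text-embedding-ada-002"]
chunk_sizes = [256, 512, 1024]   # tokens per chunk
overlaps = [0, 50, 100]          # tokens shared between adjacent chunks

for model_name, chunk_size, overlap in product(embedding_models, chunk_sizes, overlaps):
    config = {"model": model_name, "chunk_size": chunk_size, "overlap": overlap}
    print("Would re-embed the corpus with:", config)
    # scores = evaluate_retrieval(embed_corpus(config))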
For collections that change or grow regularly, the time and expense of re-indexing may be a non-starter. Without access to efficient, scalable infrastructure, the promise of AI-assisted discovery can quickly run up against hard practical limits. Despite these challenges, using WARC-GPT felt like a glimpse into the future of web archive access. It allowed exploratory Q&A across an entire collection in a way that traditional search simply could not, surfacing connections between documents and providing serviceable narrative answers.
However, naive RAG systems like WARC-GPT still face significant limitations. Because each query is handled independently, they struggle to maintain context across documents or support multi-turn conversations. There is no memory of previous prompts, and little capacity for narrative continuity or thematic buildup across interactions (Kannappan 2023; Merola and Singh 2025; Anthropic 2024). Even so, the inclusion of source links for each response remained a critical feature, allowing users to verify claims and explore the original web pages in more depth.
Ultimately, WARC-GPT showed that conversational AI can meaningfully augment web archives, but it also made clear that to deploy such a tool in production, I would need to improve data preprocessing, retrieval accuracy, and efficiency. This led to the question: could I build a more tailored RAG solution for our web archives?
Creating and tuning a RAG pipeline
Before discussing my custom RAG pipeline below, it’s important to make clear that the tuning I did (better preprocessing, chunking strategies, and model selection) can be done in WARC-GPT as well. That, however, was not the goal of my research. WARC-GPT is flexible and capable of accommodating these kinds of tweaks, but I wanted to build something from the ground up to better understand how each component (chunking, embedding, retrieval, generation, etc.) actually works. This hands-on approach helped clarify why my early results with WARC-GPT fell flat: not because the tool is lacking, but because RAG pipelines are highly sensitive to design and preprocessing choices.
My goal was to leverage open-source components to address the data quality and performance issues I had observed in the untuned instance of WARC-GPT. In essence, I wanted to see whether rethinking the ingestion process could yield cleaner and faster results. The solution combined a suite of lightweight tools and custom scripts (available on GitHub) [3]. Here’s an overview of the approach and the optimizations I implemented.
Cleaning the input data
Instead of working with existing WARC files full of “noise” (at least for the machines that are trying to process the text), I re-crawled the target websites using wget to generate fresh WARC files, deliberately excluding non-text content like images, video, CSS, and other page assets. This approach produced a leaner, text-focused corpus, optimized for semantic search rather than preserving the visual and structural integrity of entire web pages. Unlike the comprehensive, high-fidelity crawls performed by services like Archive-It, which prioritize preserving full web pages for human browsing, my captures focused solely on the content a reader would actually engage with: the main HTML and visible text. As a result, I began with a much cleaner dataset for embedding.
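The sketch below shows roughly how such a text-focused re-crawl can be scripted; it shells out to GNU wget from Python, and the reject list, crawl depth, and target URL are illustrative rather than the exact parameters I used.

import subprocess

# Re-crawl a site into a fresh WARC, skipping images, video, CSS, JS, and fonts.
# wget writes the capture to bobs-burgers-text.warc.gz.
subprocess.run([
    "wget",
    "--recursive", "--level=3",
    "--no-parent",
    "--wait=1",
    "--reject=jpg,jpeg,png,gif,svg,webp,css,js,mp4,webm,woff,woff2,ico",
    "--warc-file=bobs-burgers-text",
    "https://bobs-burgers.fandom.com/",
], check=False)  # wget can return non-zero when some requests are rejected or fail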
I then wrote parsing scripts using BeautifulSoup and regular expressions to extract meaningful text from the WARCs and discard the rest. This involved stripping out boilerplate, navigation menus, and scripts, and keeping only the core textual content and headings from each page. I also filtered out any non-English pages or sections and normalized whitespace and punctuation for consistency. The result was a corpus of reasonably clean, human-readable text ready for indexing.
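A condensed version of that extraction step looks something like the following. It uses the warcio library to iterate over response records and BeautifulSoup to strip boilerplate; the specific tags removed here are generic examples, whereas my actual scripts used selectors tailored to the wiki’s markup.

import re
from bs4 import BeautifulSoup                         # pip install beautifulsoup4
from warcio.archiveiterator import ArchiveIterator    # pip install warcio

def extract_text(warc_path):
    """Yield (url, cleaned_text) pairs for HTML response records in a WARC file."""
    with open(warc_path, "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type != "response":
                continue
            content_type = record.http_headers.get_header("Content-Type") or ""
            if "text/html" not in content_type:
                continue
            url = record.rec_headers.get_header("WARC-Target-URI")
            soup = BeautifulSoup(record.content_stream().read(), "html.parser")
            # Remove non-content elements before extracting visible text.
            for tag in soup(["script", "style", "nav", "header", "footer", "aside"]):
                tag.decompose()
            text = re.sub(r"\s+", " ", soup.get_text(separator=" ", strip=True))
            if text:
                yield url, text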
By extracting only the meaningful text from each page and discarding scripts, navigation elements, and other noise, the system embeds a far higher proportion of useful content. This improved the quality of the vector index and significantly reduced storage requirements. In short, nothing was indexed that wasn’t needed for retrieval or for generating answers from the retrieved content (a process known as inference in generative AI systems).
Customizing chunk size and vector model
As seen in figure 1 above, in retrieval-augmented generation, documents are divided into “chunks”: small, self-contained segments of text that are individually embedded and stored in a vector database for semantic search. I experimented to find a chunk size that balanced contextual completeness with vector granularity, and landed on chunks of about 512 tokens (roughly 400–500 words), with a ~50-token overlap between chunks to avoid splitting sentences and losing context. Smaller chunks helped limit irrelevant content per embedding, which improved retrieval precision, while the overlap helped ensure that relevant information wasn’t fragmented across multiple chunks. Together, these parameters struck a good balance between preserving context and keeping each embedding focused enough to support accurate retrieval at query time.
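In code, the chunking step can be as simple as a sliding window over token IDs. The minimal sketch below uses the Hugging Face tokenizer for the embedding model introduced in the next section; the 512/50 values mirror the parameters discussed above.

from transformers import AutoTokenizer   # pip install transformers

tokenizer = AutoTokenizer.from_pretrained("intfloat/e5-large-v2")

def chunk_text(text, chunk_size=512, overlap=50):
    """Split text into ~chunk_size-token chunks that overlap by ~overlap tokens."""
    token_ids = tokenizer.encode(text, add_special_tokens=False)
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(token_ids), step):
        window = token_ids[start:start + chunk_size]
        chunks.append(tokenizer.decode(window))
        if start + chunk_size >= len(token_ids):
            break
    return chunks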
For embedding the chunks themselves, I used the intfloat/e5-large-v2 model from Hugging Face, an open-source transformer developed by Microsoft and fine-tuned for semantic search (Wang et al. 2022). It generates 1024-dimensional vectors and runs locally without API fees, which made it preferable to OpenAI’s text-embedding-ada-002, which requires paid API usage (OpenAI 2023). All chunk embeddings were stored in a ChromaDB vector database, along with metadata linking each chunk back to its source page, mirroring WARC-GPT’s approach (though admittedly in a less sophisticated form).
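Put together, the indexing step looks roughly like this: a minimal sketch using sentence-transformers and ChromaDB, in which the collection name, chunk schema, and database path are placeholders. Note that E5 models expect a “passage: ” prefix on indexed text and a “query: ” prefix on queries.

import chromadb                                           # pip install chromadb
from sentence_transformers import SentenceTransformer    # pip install sentence-transformers

model = SentenceTransformer("intfloat/e5-large-v2")       # 1024-dimensional embeddings
client = chromadb.PersistentClient(path="./chroma_db")
collection = client.get_or_create_collection("web_archive")

def index_chunks(chunks):
    """chunks: list of dicts with 'id', 'text', and 'url' keys (illustrative schema)."""
    embeddings = model.encode(
        ["passage: " + c["text"] for c in chunks],   # E5 passage prefix
        normalize_embeddings=True,
    )
    collection.add(
        ids=[c["id"] for c in chunks],
        embeddings=embeddings.tolist(),
        documents=[c["text"] for c in chunks],
        metadatas=[{"url": c["url"]} for c in chunks],   # link each chunk to its source page
    )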
Selecting an LLM chatbot to generate results
For the question-answering stage, I integrated the pipeline with OpenAI’s GPT-4 via API. When a user submits a question, the system embeds the query using the E5 model and retrieves the top relevant text chunks from ChromaDB. Those chunks (typically 5 to 10) are then appended to the user’s question as context, forming an “augmented query” that is sent to GPT-4 to generate a conversational response (OpenAI 2024). GPT-4 was selected for its strong language understanding and generation capabilities, which are especially valuable when working with messy web archives. That said, the pipeline, like WARC-GPT itself, was built to be model-agnostic: one can (and should) swap in different LLMs during testing, including open-source alternatives like Mistral, LLaMA 3, or DeepSeek, though likely with some trade-offs in coherence or fluency.
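The query path, in simplified form, is sketched below: the question is embedded with the same E5 model (using its “query: ” prefix), the nearest chunks are pulled from ChromaDB, and the question plus retrieved context is sent to GPT-4 through OpenAI’s chat completions API. The system message here is a bare-bones placeholder; the fuller prompt design is discussed in the next section.

import chromadb
from openai import OpenAI                       # pip install openai
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("intfloat/e5-large-v2")
collection = chromadb.PersistentClient(path="./chroma_db").get_collection("web_archive")
llm = OpenAI()                                  # expects OPENAI_API_KEY in the environment

def answer(question, k=8):
    """Retrieve the top-k chunks and generate a grounded answer with cited sources."""
    query_vec = model.encode("query: " + question, normalize_embeddings=True)
    hits = collection.query(query_embeddings=[query_vec.tolist()], n_results=k)
    context = "\n\n".join(hits["documents"][0])
    sources = [m["url"] for m in hits["metadatas"][0]]

    response = llm.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system",
             "content": "Answer using only the provided context. If the context does not "
                        "contain the answer, say you could not find that information."},
            {"role": "user",
             "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return response.choices[0].message.content, sources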
The role of prompt engineering: System prompts
A further optimization available in this pipeline is the system prompt. In a RAG pipeline, system prompt design shapes how well the entire retrieval system works; it differs from the user prompt in that it is applied globally, behind the scenes, to any query a user enters. Because user queries are embedded and semantically compared against the chunk embeddings from the web archive collection, adding instructions, context-setting, or overly verbose phrasing directly to the user prompt can distort the semantic signature of the actual question and risks retrieving irrelevant or less optimal chunks. In contrast, the system prompt wraps the user query in a carefully designed context that supports more accurate retrieval and generation. By clearly separating the user’s original question from surrounding formatting or instructions, and using the system prompt strictly for downstream generation at inference time, I was able to significantly improve both retrieval accuracy and output quality.
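As an illustration, a system prompt along these lines was used to constrain generation. The wording below is a paraphrased example rather than my exact production prompt; the key point is that only the user’s bare question is ever embedded for retrieval, while this text is attached behind the scenes at generation time.

# Applied globally at generation time; never mixed into the text that gets embedded.
SYSTEM_PROMPT = (
    "You are a research assistant answering questions about an archived web collection. "
    "Use only the context chunks supplied with each question. "
    "If the context does not contain the answer, reply that you could not find that "
    "information; do not guess or fabricate. "
    "Cite the source URL for every claim you make."
)

def build_messages(question, context_chunks):
    """Assemble the chat messages: system prompt plus the user's question and retrieved context."""
    context = "\n\n".join(context_chunks)
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
    ]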
Hardware Acceleration
I optimized the embedding step to run on an Apple Silicon GPU (specifically, a MacBook Pro with an M3 Pro chip). Leveraging GPU parallelism for batch embedding generation dramatically reduced processing time: what took nearly a week with the untuned pipeline was now done in a few hours, a gain also helped by the move to cleaner text and smaller chunks.
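In practice, this amounted to making sure the embedding model ran on the M3 Pro’s GPU through PyTorch’s Metal (MPS) backend rather than on the CPU, along the lines of the sketch below. The batch size shown is indicative and depends on available memory.

import torch
from sentence_transformers import SentenceTransformer

# Prefer Apple Silicon's GPU (Metal Performance Shaders) when available.
device = "mps" if torch.backends.mps.is_available() else "cpu"
model = SentenceTransformer("intfloat/e5-large-v2", device=device)

texts = ["passage: example chunk one", "passage: example chunk two"]  # stand-in chunk texts
embeddings = model.encode(
    texts,
    batch_size=64,                 # larger batches keep the GPU busy
    normalize_embeddings=True,
    show_progress_bar=True,
)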
Implications of Optimization
More accurate results, fewer hallucinations
The impact of these optimizations was most evident on harder questions. For straightforward fact-based queries, both the bespoke RAG system and WARC-GPT performed well. But when the answer depended on surfacing a rare or deeply buried detail (like identifying a single quote or fact buried in a blog post among thousands of pages) the tuned pipeline performed better: it was more effective at surfacing the right snippet, thanks to a cleaner index and more precise retrieval (i.e. there was not as much noise interfering with the semantic signature of each chunk, so a user’s query could more accurately be matched to chunks in the vector database).
And when the tuned system couldn’t find a relevant chunk, it often returned nothing or explicitly deferred, responding with phrases like “I couldn’t find that information,” rather than attempting to fabricate an answer. This behavior was the result of more carefully engineered system prompts, which explicitly instructed the generative model (in this case, GPT-4) not to guess or generate speculative content in the absence of supporting evidence from the enhanced query text containing the retrieved chunks. By aligning the model’s behavior with the principles of grounded generation, these prompts significantly reduced hallucinations (or confabulations), especially in cases where the retrieval step failed to return a relevant chunk at all. Rather than leaving the model to fill in the blanks (which LLMs are apt to do), the prompts encouraged transparency about the system’s limits, a small but critical safeguard when working with incomplete or noisy collections like web archives.
Indexing Efficiency (resource implications)
Another benefit was efficiency. The vector database generated by the custom pipeline was around 240 MB, compared to roughly 10 GB for WARC-GPT’s index on the same content. This ~40× reduction in index size was primarily achieved by stripping out noisy content using a BeautifulSoup configuration tailored to the specific structure of the target site. A smaller index saves storage and speeds up retrieval (fewer vectors to scan) and reduces memory overhead. In practice, queries in the custom pipeline felt faster and the system was more responsive overall. This kind of efficiency could also translate to cost savings, particularly if running in the cloud or on constrained hardware.
Again, it’s worth noting that WARC-GPT was designed to be flexible with model choices, including the ability to run fully locally using open-source LLMs via the Ollama framework. In my tests, I opted to use GPT-4 for its higher answer quality, whereas WARC-GPT defaults to open models preconfigured in Ollama. A strong language model can generate better answers from the same context. That said, using GPT-4 comes with trade-offs: it introduces cost and reliance on an external API, which may not be acceptable for every library setting. For now, the key takeaway is that by feeding the LLM cleaner, more focused data (and doing so more efficiently) I was able to significantly boost the system’s performance.
Implications
Comparing an untuned to a custom-created and -tuned RAG pipeline raises a central question for libraries exploring AI: should we adopt off-the-shelf tools, or invest in building custom systems? WARC-GPT offered an accessible starting point (it is open-source and relatively easy to set up) but my deep dive revealed that achieving reliable performance depends heavily on data quality, retrieval precision, and processing efficiency. Any library considering a similar deployment will need to weigh the convenience of a pre-built framework against the benefits of tailoring a system to local needs, and the staff time and technical skills that effort requires. In my case, the custom approach paid off in performance, but it came with a hands-on cost that not every institution may be positioned to absorb.
More broadly, these experiments underscore both exciting opportunities and serious challenges that come with bringing LLMs and RAG into library workflows:
Transparent source attribution: Unlike standalone LLMs that generate answers without revealing where the information came from (although that is changing as the big frontier models increasingly access live web content during inference), RAG systems can point users back to the exact documents or text snippets used to construct a response. This traceability not only supports verification and citation, but also aligns with core library values around transparency, accountability, and intellectual honesty. For users, it means they can explore the source material themselves with more confidence (although, as with all things LLM, it is important to always maintain vigilance).
Conversational access to collections: RAG-powered chat interfaces can make large, unstructured collections like web archives more intuitively searchable. They allow users to ask complex questions and get narrative answers that would be impractical with traditional search engines. This can reveal connections and insights that keyword searches might miss.
Automated metadata generation: An LLM that’s allowed to read through an entire collection could help summarize content, extract key entities, suggest topics, or even generate draft descriptions. This could assist librarians and archivists in creating metadata or finding aids, especially for collections lacking detailed descriptions. In our web archive scenario, one could imagine the LLM summarizing the main themes of a set of websites or identifying frequently mentioned names and issues across the collection.
Semantic search capabilities: By embedding content in a vector space, I enabled semantic similarity searches: finding documents that are topically related even if they don’t share the same keywords. This goes beyond the “string matching” of conventional search. For researchers, this means a query about, say, climate change controversies might retrieve relevant pages even if those pages don’t use that exact phrasing, because retrieval matches on conceptual similarity rather than literal wording.
Technical and resource barriers: Setting up and maintaining RAG pipelines requires technical expertise in LLMs, data processing, and scripting; many libraries are still in the process of developing these capabilities. There are also computational costs, including robust hardware (GPUs, lots of memory, fast storage) and ongoing expenses when using commercial APIs. Smaller institutions may find this prohibitive, and even large libraries will need to justify the local infrastructure and/or cloud computing costs needed to run RAG services at scale (Huang et al. 2023; AI Now Institute 2023).
Residual hallucinations: Even with well-tuned RAG pipelines, hallucinations remain a persistent risk. While RAG aims to ground LLM outputs in source material, this only works reliably if the model faithfully adheres to system prompts and if the retrieved content is relevant and sufficiently informative. In practice, even state-of-the-art models sometimes ignore instructions not to guess or fabricate answers, especially when prompted with vague, underspecified, or speculative queries. This failure to follow directives can result in confident, fabricated responses that are not grounded in the retrieved evidence. In library and archival settings, where precision and verifiability are paramount, these confabulations pose a direct challenge to the trustworthiness of AI-assisted services. Guardrails like refusal prompts, deferred responses, and citation requirements help mitigate, but will not eliminate, this behavior. As of now, no prompt or retrieval strategy can fully guarantee hallucination-free outputs.
Ethical and legal concerns: AI-generated answers raise important questions about transparency, accuracy, and bias. In a library context, providing fabricated or misleading information can erode trust, a resource that is more critical than ever in an era marked by rising misinformation, political polarization, and growing public skepticism toward institutions. As geopolitical tensions distort narratives and destabilize information ecosystems, libraries must recommit to their role as trusted stewards of credible knowledge. We must consider how to verify, correct, and contextualize LLM outputs, and how to clearly communicate the provenance of any given response. WARC-GPT’s inclusion of source citations is one step in the right direction, but not all tools offer this. There are also legal considerations: large-scale text analysis and embedding may raise copyright concerns, and use of third-party AI services can conflict with privacy or data ownership policies.
Bias at scale: Another concern is the amplification of existing bias in our collections. Archival and library materials reflect the perspectives, exclusions, and structural inequities of the societies that produced them. When these materials are fed into LLMs without sufficient context or critical framing, there’s a risk that biased or harmful content gets reproduced as neutral or even authoritative. A RAG system (or any LLM) doesn’t “know” the provenance or politics of a source unless we explicitly encode that context. How to do this is the great unsolved question for LLMs and other deep neural networks. We risk building pipelines that surface and summarize legacy bias in ways that flatten nuance or reinforce harmful assumptions. This isn’t a new problem in libraries, but in an AI-powered environment, the speed and scale at which these biases can circulate makes it newly urgent.
Maintenance and sustainability: Deploying an AI tool isn’t a one-time event. Models require periodic updates, underlying collections evolve, and dependencies shift. There’s also a risk in relying too heavily on proprietary services, which may change terms or shut down unexpectedly. Libraries will need long-term plans to manage these systems, such as whether to invest in in-house training and GPU infrastructure, or to lean more heavily on external, commercial services both from publishers and cloud computing providers. A hybrid approach may be ideal, with frequently accessed or sensitive data handled locally, while less critical tasks leverage the cloud.
In navigating these trade-offs, a thoughtful approach with human oversight is essential. At this point, AI tools should augment human expertise in archives and libraries rather than replace it. This kind of oversight is foundational to responsible implementation, especially in these early stages of integrating AI into cultural memory work.
One of the most persistent challenges in integrating AI is the inscrutable nature of the models themselves. Deep neural networks like LLMs function as black boxes: we can observe their inputs and outputs, but their internal reasoning remains largely opaque, even to their creators. This lack of transparency becomes especially troubling when combined with the well-documented issue of hallucinations (Ji et al. 2023). In library and archival settings, where trust is foundational and verifiability is non-negotiable (on paper at least), this presents a serious risk. As we experiment with these tools, ensuring source attribution, maintaining human oversight, and building systems that fail gracefully (rather than confidently lying) must remain core design priorities.
Conclusion
LLMs and RAG signal a fundamental shift in how users can discover and interact with digital collections. My work with WARC-GPT and a custom RAG pipeline revealed both the promise and the pitfalls of applying these tools to web archives. I saw firsthand how AI can synthesize information from across an entire collection and return it in a conversational, human-readable format, lowering barriers for researchers and surfacing connections that keyword search alone might miss.
But the most important lesson was this: RAG pipelines are highly sensitive to design and preprocessing choices. Performance didn’t come from the language model alone, it depended on everything that came before it: cleaner data, tuned chunking, targeted filtering, and efficient infrastructure. These foundational steps had a greater impact on retrieval accuracy and output quality than any one prompt or model tweak. With thoughtful design, I found I could significantly improve responsiveness, relevance, and usefulness in a library context, but only by first doing the unglamorous work of preparing the data pipeline.
Recognizing this sensitivity to design choices is key to implementing LLM-assisted services responsibly. Success depends not just on the capabilities of the language model, but on the rigor and care applied throughout the pipeline. Issues of data quality, hallucinated answers, and the compute demands of large models and RAG infrastructure remain real and pressing. Any library considering a “GPT for web archives” approach will need to invest in the slow work: building preprocessing pipelines, testing retrieval methods, refining prompts, and continuously checking for accuracy. It is a serious undertaking, but it’s also one that could enable entirely new forms of access, research, and engagement.
Looking ahead, the future of AI in libraries is likely to be hybrid and human-centred. We will mix local infrastructure for collections that require control with cloud-based models where scale matters. We will blend automation with the discernment of librarians and archivists. And no one-size-fits-all solution will emerge: what works for one collection may be completely wrong for another. But what’s already clear is that the human role is not going away. If anything, our expertise in context, ethics, and stewardship will become even more critical as these tools gain traction.
And yet, I would be remiss not to acknowledge the bigger picture. The same technologies powering this promising moment are part of a much larger transformation, one that includes growing warnings from AI researchers and ethicists about the risks posed by superintelligent systems. As we build small, useful tools for cultural memory work, we do so in the shadow of something much larger, more powerful, and less predictable. It’s a strange time to be hopeful, and a necessary time to be cautious.
RAG can help unlock our collections in remarkable new ways. But doing so thoughtfully–openly, critically, and with humility–might just help us hold onto the trust and wisdom we’ll need in the face of what’s coming.
Bibliography
AI Now Institute. 2023. Computational power and AI [Internet]. [accessed 2025 Jun 4]. Available from: https://ainowinstitute.org/publications/compute-and-ai
Anthropic. 2024. Introducing contextual retrieval [Internet]. [accessed 2025 Jun 4]. Available from: https://www.anthropic.com/news/contextual-retrieval
Cargnelutti M, Mukk K, Stanton C. 2024. WARC-GPT: An open-source tool for exploring web archives using AI [Internet]. Harvard Library Innovation Lab. [accessed 2025 Jun 4]. Available from: https://lil.law.harvard.edu/blog/2024/02/12/warc-gpt-an-open-source-tool-for-exploring-web-archives-with-ai/
Charleston Hub. 2023. In libraries we trust [Internet]. [accessed 2025 Jun 4]. Available from: https://www.charleston-hub.com/2025/01/in-libraries-we-trust/
Costa M. 2021. Full-text and URL search over web archives [Internet]. arXiv preprint arXiv:2108.01603. [accessed 2025 Jun 4]. Available from: https://arxiv.org/abs/2108.01603
Cottier B, Rahman R, Fattorini L, Maslej N, Besiroglu T, Owen D. 2024. The rising costs of training frontier AI models [Internet]. arXiv preprint arXiv:2405.21015. [accessed 2025 Jun 4]. Available from: https://arxiv.org/abs/2405.21015
EAB. 2023. Public trust in higher education has reached a new low [Internet]. [accessed 2025 Jun 4]. Available from: https://eab.com/resources/blog/strategy-blog/americans-trust-higher-ed-reached-new-low/
Hiltmann T, Dröge M, Dresselhaus N, Grallert T, Althage M, Bayer P, Eckenstaler S, Mendi K, Schmitz JM, Schneider P, Sczeponik W, Skibba A. 2025. NER4all or context is all you need: using LLMs for low-effort, high-performance NER on historical texts. A humanities informed approach. arXiv [Preprint]. Available from: https://arxiv.org/abs/2502.04351
Humphries M, Leddy LC, Downton Q, Legace M, McConnell J, Murray I, Spence E. 2024. Unlocking the Archives: Using Large Language Models to Transcribe Handwritten Historical Documents. arXiv [Preprint]. Available from: https://arxiv.org/abs/2411.03340
Ji Z, Lee N, Frieske R, Yu T, Su D, Xu Y, Ishii E, Bang Y, Madotto A, Fung P. 2023. A survey on hallucination in large language models. ACM Comput Surv [Internet]. [accessed 2025 Jun 4];56(1):1–38. Available from: https://dl.acm.org/doi/10.1145/3703155
Kannappan G. 2023. The Achilles’ heel of naive RAG: When retrieval isn’t enough [Internet]. Medium. [accessed 2025 Jun 4]. Available from:
Merola C, Singh J. 2025. Reconstructing context: Evaluating advanced chunking strategies for retrieval-augmented generation [Internet]. arXiv preprint arXiv:2504.19754. [accessed 2025 Jun 4]. Available from: https://arxiv.org/abs/2504.19754
Milligan I. 2012. WARC files: A challenge for historians, and finding needles in haystacks [Internet]. [accessed 2025 Jun 4]. Available from: https://ianmilli.wordpress.com/2012/12/12/warc-files-a-challenge-for-historians-and-finding-needles-in-haystacks/
OpenAI. 2023. OpenAI platform: Vector embedding [Internet]. [accessed 2025 Jun 4]. Available from: https://platform.openai.com/docs/guides/embeddings
Wang L, Yang N, Huang X, Jiao B, Yang L, Jiang D, Majumder R, Wei F. 2022. Text embeddings by weakly-supervised contrastive pre-training [Internet]. arXiv preprint arXiv:2212.03533. [accessed 2025 Jun 4]. Available from: https://arxiv.org/abs/2212.03533
Notes
[1] https://bobs-burgers.fandom.com. This site serves as an unexpectedly rich tool for exploring web archives as a non-traditional, community-driven form of scholarship. As a fan-curated knowledge base centered on the long-running animated series, it exemplifies the kind of vernacular documentation that libraries and archives have historically overlooked. Yet fandom wikis like this one are often meticulously maintained, deeply intertextual, and laden with cultural meaning: qualities that make them ideal test cases for assessing how well AI systems can interpret and navigate messy, user-generated web content. From a practical standpoint, this site also lends itself to “ground-truthing”: as someone who comfort-watches the show, I’m intimately familiar with its characters, plot arcs, and in-jokes, which makes it easier to spot hallucinations or subtle misrepresentations generated by AI tools.
[2] Trust in libraries and universities has long underpinned our supposed authority as stewards of knowledge and public goods, but that trust is increasingly fragile. Libraries still enjoy relatively high public regard (Charleston Hub 2023), yet universities face mounting skepticism amid political polarization and questions about institutional neutrality (EAB 2023). As misinformation spreads and civic discourse fractures, our ability to function as credible, trusted spaces is under threat. For libraries embedded within higher education, the stakes are especially high: we must not only preserve and provide access to knowledge, but do so transparently, responsibly, and with renewed urgency. In this climate, even technical choices–like ensuring LLM-generated responses include source attribution–can become conscious acts of trust-building, reaffirming the library’s role as a safeguard against the erosion of shared knowledge and truth, regardless of how quixotic this all might seem right now.
[3] https://github.com/coreyleedavis/libguides-rag. I relied heavily on ChatGPT and Claude to help write the code. This pipeline wouldn’t exist without them, to be honest. That said, I’d really welcome any feedback from folks who actually know how to write Python from scratch. Suggestions for improving the scripts or overall structure are more than welcome.
About the Author
Corey Davis is the Digital Preservation Librarian at the University of Victoria Libraries, where he leads initiatives on AI, web archives, and digital preservation infrastructure.