[{"id":18575,"date":"2025-10-21T15:58:59","date_gmt":"2025-10-21T19:58:59","guid":{"rendered":"https:\/\/journal.code4lib.org\/?p=18575"},"modified":"2025-10-21T16:46:49","modified_gmt":"2025-10-21T20:46:49","slug":"editorial-8","status":"publish","type":"post","link":"https:\/\/journal.code4lib.org\/articles\/18575","title":{"rendered":"Editorial"},"content":{"rendered":"<p>by Edward M. Corrado<\/p>\n<p>Welcome to the sixty-first issue of the Code4Lib Journal. When I look back to the first issue published in December 2007, I am amazed at how long the journal has lasted as a community effort that has no formal (and barely any informal) structure. The Code4Lib Journal is not an organization, but instead it is an effort created and maintained by people who are passionate about \u201cthe intersection of libraries, technology, and the future.&#8221; <a href=\"#ref1\" id=\"note1\">[1]<\/a> The numerous editorial committee members and technical administrators have enabled the journal to exist and thrive for almost 18 years, and they are all owed a great deal of thanks. However, without the article authors and readers of the Code4Lib Journal who appreciate what we are doing, it would all be for naught. Thank you for your continued support. <\/p>\n<p>Issue 61 contains seven articles that we believe will help continue Code4Lib Journal\u2019s mission \u201cto foster community and share information among those interested in the intersection of libraries, technology, and the future.\u201d <a href=\"#ref2\" id=\"note2\">[2]<\/a> In no particular order, they are:<\/p>\n<ul>\n<li>What it Means to be a Repository: Real, Trustworthy, or Mature? by Seth Shaw<\/li>\n<li>Building and Deploying the <em>Digital Humanities Quarterly<\/em> Recommender System by Haining Wang, Joel Lee, John A. Walsh, Julia Flanders, and Benjamin Charles Germain Lee<\/li>\n<li>From Notes to Networks: Using Obsidian to Teach Metadata and Linked Data by Kara Long and Erin Yunes<\/li>\n<li>Retrieval-Augmented Generation for Web Archives: A Comparative Study of WARC-GPT and a Custom Pipeline by Corey Davis<\/li>\n<li>Extracting A Large Corpus from the Internet Archive, A Case Study by Eric C. Weig<\/li>\n<li>Liberation of LMS-siloed Instructional Data by Hyung Wook Choi, Jonathan Wheeler, Weimao Ke, Lei Wang, Jane Greenberg, and Mat Kelly<\/li>\n<li>Mitigating Aggressive Crawler Traffic in the Age of Generative AI: A Collaborative Approach from the University of North Carolina at Chapel Hill Libraries by Jason Casden, David Romani, Tim Shearer, and Jeff Campbell<\/li>\n<\/ul>\n<p>We hope you enjoy these articles and find them useful. Thanks again for supporting Code4Lib Journal!  <\/p>\n<h2>Notes<\/h2>\n<p><a href=\"#note1\" id=\"ref1\">[1]<\/a> &#8220;Mission,&#8221; Code4Lib Journal, accessed October 10, 2025, <a href=\"https:\/\/journal.code4lib.org\/mission\">https:\/\/journal.code4lib.org\/mission<\/a>.<\/p>\n<p><a href=\"#note2\" id=\"ref2\">[2]<\/a> &#8220;Mission,&#8221; Code4Lib Journal.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Welcome to the 61st issue of Code4Lib Journal. 
We hope that you enjoy the variety of articles published in this issue.<\/p>\n","protected":false},"author":18,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[132],"tags":[],"class_list":["post-18575","post","type-post","status-publish","format-standard","hentry","category-issue61"],"_links":{"self":[{"href":"https:\/\/journal.code4lib.org\/wp-json\/wp\/v2\/posts\/18575","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/journal.code4lib.org\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/journal.code4lib.org\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/journal.code4lib.org\/wp-json\/wp\/v2\/users\/18"}],"replies":[{"embeddable":true,"href":"https:\/\/journal.code4lib.org\/wp-json\/wp\/v2\/comments?post=18575"}],"version-history":[{"count":1,"href":"https:\/\/journal.code4lib.org\/wp-json\/wp\/v2\/posts\/18575\/revisions"}],"predecessor-version":[{"id":18583,"href":"https:\/\/journal.code4lib.org\/wp-json\/wp\/v2\/posts\/18575\/revisions\/18583"}],"wp:attachment":[{"href":"https:\/\/journal.code4lib.org\/wp-json\/wp\/v2\/media?parent=18575"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/journal.code4lib.org\/wp-json\/wp\/v2\/categories?post=18575"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/journal.code4lib.org\/wp-json\/wp\/v2\/tags?post=18575"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}},{"id":18489,"date":"2025-10-21T15:58:58","date_gmt":"2025-10-21T19:58:58","guid":{"rendered":"https:\/\/journal.code4lib.org\/?p=18489"},"modified":"2025-10-22T09:18:29","modified_gmt":"2025-10-22T13:18:29","slug":"mitigating-aggressive-crawler-traffic-in-the-age-of-generative-ai-a-collaborative-approach-from-the-university-of-north-carolina-at-chapel-hill-libraries","status":"publish","type":"post","link":"https:\/\/journal.code4lib.org\/articles\/18489","title":{"rendered":"Mitigating Aggressive Crawler Traffic in the Age of Generative AI: A Collaborative Approach from the University of North Carolina at Chapel Hill Libraries"},"content":{"rendered":"<p>By Jason Casden, David Romani, Tim Shearer, and Jeff Campbell<\/p>\n<h2>Introduction<\/h2>\n<p>Since the release of multiple competing generative AI platforms, the University of North Carolina at Chapel Hill University Libraries has seen increasingly overwhelming and service-disrupting traffic from web crawlers, including those from previously well-behaved hosts such as Google and Microsoft. This traffic consists of an unprecedented flood of evasive and extremely aggressive requests, which caused multiple consecutive days of intermittent service disruption for our public catalog interface (and later for various digital collections services). Although most of our peers in libraries, archives, and the broader open web have also been acutely impacted[<a id=\"ref1a\" href=\"#note1a\">1a<\/a>][<a id=\"ref2a\" href=\"#note2a\">2a<\/a>], our institution was among the first to experience severe service disruptions from this activity. This article will describe these impacts, as well as our evolving and escalating attempts to mitigate the denial-of-service caused by these automated crawls. 
We will discuss our novel facet-based bot detection and blocking technique, practices identified by other institutions, and opportunities for collaborative detection and blocking of &#8220;bad bots&#8221; (i.e., bots that aggressively query our services without regard for robots.txt or other policies) among libraries and archives.<\/p>\n<h2>Community Context<\/h2>\n<p>When UNC Libraries first began to see our systems under an unprecedented and unsustainable wave of traffic in the Spring of 2024, we began to check in across our professional networks. Within traditional IT academic communities (i.e., campus schools and departments), no systems were adversely affected. However, within library and archival IT communities nationwide, it was clear many systems were under strain. This excessive traffic was successfully mitigated by UNC Libraries software and system engineers in fairly short order, ensuring systems were available to serve researchers. When our systems detected a second, and seismic, wave of automated web crawls in late fall 2024, we found, mostly through informal communications, the same national patterns, but far more intense, almost exclusively in memory institutions. These informal communications took a number of forms. As a group was moving into a meeting agenda or winding one down with informal chat, one or two people would mention that their systems were receiving traffic that seemed similar to distributed denial-of-service (DDoS) attacks, and others would chime in with similar experiences. For example, LYRASIS-hosted Fedora Repository Program governance calls in 2024 had so many side conversations about bot activity that it spun off an interest group that is meeting regularly and has begun to compile resources. In the Code4lib community Slack instance[<a id=\"ref3\" href=\"#note3\">3<\/a>], a new messaging channel was established, but not publicized, as a forum for an international community of technical libraries and archives colleagues to discuss their attempts to mitigate crawler-driven downtime. Additionally, as our colleagues shared our internal communications with their professional networks, we started receiving requests for consultations with other institutions.<\/p>\n<p>The resulting information sharing is organic, inconsistent, and clearly not reaching all affected institutions. This lack of comprehensive awareness and coordinated response is exacerbated by the fact that systems are not predictably affected. Staff may only need and seek out a shared understanding and response after their own system is affected. Additionally, unaware of the global pattern, individuals often attempt to resolve the issue locally and only come to understand the broader nature of the attacks after spending weeks or months trying to restore access to their impacted systems. In many cases, this is the first time IT staff in libraries and archives have experienced a significant and coordinated security event.<\/p>\n<p>As mentioned, there are efforts to bridge these gaps. We conducted an informal survey in late March 2025, and we are aware of at least one other similar survey. Our goal was to begin documenting the extent of the impact, better understand which systems are affected, identify solutions that seem to be working, and gauge interest in building a community around the topic. We found that roughly a third of respondents were from outside academic libraries and a quarter were from outside the U.S., hinting at the global impact of this bot traffic. 
Sixty-five percent of respondents are actively battling bot traffic or experiencing degraded services. The Engelberg Center on Innovation Law &amp; Policy and the Confederation of Open Access Repositories (COAR) asked similar questions in surveys conducted in April 2025 and found that 74.4% and 75.5% of respondents, respectively, are actively engaging in bot mitigation[<a id=\"ref1b\" href=\"#note1b\">1b<\/a>][<a id=\"ref2b\" href=\"#note2b\">2b<\/a>]. More than 95% of respondents to our survey are interested in coming together collectively to share and learn from each other. The LYRASIS-hosted group is actively exploring ways to further this goal.<\/p>\n<h2>Strategic Partnership<\/h2>\n<p>As the same software and systems engineers tackled the second wave of activity in the fall of 2024, they found it was clearly different from the first wave. The apparent source of the traffic was masked, and behavior evolved to outflank each effort to block it. We soon realized we needed help and began sharing data and ideas with our regional academic library IT colleagues. We also initiated talks with our university IT security professionals, leading to a call in early 2025 between Libraries IT staff and both security and networking staff from the university IT division. As we described the problem and expressed uncertainty about whether this qualified as a DDoS attack, a security colleague responded with a clarifying message that &#8220;authorial intent does not matter&#8221; and both offices generously offered their help. Over the next week, a cross-departmental group with representatives of Libraries IT, the campus security office, and the campus networking office determined that implementing additional network infrastructure was the most promising intervention. The security office was also able to confirm that, in their professional networks, the issues were not widespread across academic institution systems, but were specifically affecting memory institutions located within those campuses. Our central campus networking office referred us to a Web Application Firewall (WAF) vendor and worked closely with us to configure and deploy a cloud-based load balancer and WAF. We were fortunate to have available funding and support from the University Libraries Administration to pursue this vendor-based approach.<\/p>\n<h2>Escalation of the Crawling and Mitigation<\/h2>\n<p>Our initial response to early reports of sluggish responses from the catalog was to focus on scaling and optimization. We increased server resources, tuned web server configurations, profiled the application, and implemented targeted performance improvements. Although we did not have a load balancer in place at the time, colleagues at other institutions scaled more dramatically behind load balancers, adding more and more servers as the traffic worsened. Our attempts to control this overwhelming traffic, with significant contributions from many of our colleagues, began with more familiar blocking methods and evolved into a layered suite of mitigation methods.<\/p>\n<h3>User agent and IP blocking<\/h3>\n<p>We have always been willing to block IP addresses in our local host-based firewalls when we observe malicious requests. These typically involve repeated attempts to break into the web server or na\u00efve denial-of-service attacks. 
While we have automation in place to deploy configuration changes to the firewall, this was typically a one-off manual process, triggered when the activity crossed a threshold and generated alerts in our monitoring system. Other sources of blocks include IP address and user agent lists shared by our colleagues at NCSU Libraries and Princeton University Libraries and lists compiled from other sources beyond libraries[<a id=\"ref4\" href=\"#note4\">4<\/a>][<a id=\"ref5\" href=\"#note5\">5<\/a>].<\/p>\n<p>In the spring of 2024, we noticed a new kind of traffic: very high-volume requests from a single IP address or a narrow range of addresses, initially targeting full-text resources and eventually expanding to images and other audiovisual content. The traffic originated from commercial cloud providers, often Amazon AWS or Microsoft Azure, making it unlikely to be human-generated. If we banned one IP address, the traffic would often shift to an adjacent one. The scale of this activity quickly overwhelmed our capacity for manual intervention, and if left unchecked, it would have exhausted our server resources. This traffic also included user agents tied to commercial AI crawlers, such as Anthropic&#8217;s &#8220;ClaudeBot.&#8221; This led us to our first attempt at an automated response: fail2ban[<a id=\"ref6\" href=\"#note6\">6<\/a>] continuously monitors log files and, when it detects a user-defined pattern, inserts a firewall block that is later removed after a configurable duration. It also supports escalating penalties for repeat offender IPs. Recognizing that user-agent strings are a soft identifier, we used available lists to build a fail2ban &#8220;jail&#8221; for known AI crawler user agents while we developed a strategy for providing data to AI clients. Short bans of just a few hours proved insufficient, so we adopted a progressive ban cycle for multiple offenses: 1 day, 7 days, 1 month, 6 months, and 1 year. After discussions among internal stakeholders, this ban of known AI user agents was extended to all our public-facing websites and continues to identify addresses to re-ban or ban for the first time.<\/p>\n<h3>Request throttling<\/h3>\n<p>We began to see a new kind of web crawling traffic in the late fall of 2024. It first showed up on our catalog discovery layer rather than sites with full text or other similar resources, and it was more aggressive than previous crawlers. It sidestepped our fail2ban user agent solution, as it utilized the common, generic user agent &#8220;Mozilla\/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit\/537.36 (KHTML, like Gecko) Chrome\/136.0.0.0 Safari\/537.36 Edg\/136.0.0.0.&#8221; It also no longer originated from commercial cloud providers but from mainstream ISPs such as AT&amp;T, Comcast, Spectrum, and Verizon-Home.<\/p>\n<p>Again, the volume of these requests overwhelmed server resources. It defeated our attempts to tune application performance and consumed any new server resources. Analysis of our logs showed several odd characteristics. There were identical simultaneous requests from multiple disparate IP addresses. This indicated a deep pool of addresses was available and we suspected a residential proxy network was in use. Most of these hundreds of thousands of clients were making something like ~10 requests per minute in bursts of 3-5 minutes, effectively functioning as a DDoS attack. 
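<\/p>\n<p>As an illustration of how this kind of log-driven blocking can also enforce a request-rate limit, the following is a minimal sketch of a fail2ban jail that bans clients exceeding a per-window request threshold and escalates bans for repeat offenders. It is not our production configuration: the filter name and log path are hypothetical, and the bantime.increment options assume fail2ban 0.11 or later.<\/p>\n<pre>\n# \/etc\/fail2ban\/jail.d\/catalog-throttle.local -- illustrative sketch only\n[catalog-throttle]\nenabled  = true\n# \"catalog-request\" is a hypothetical filter matching any catalog request line\nfilter   = catalog-request\nlogpath  = \/var\/log\/apache2\/access.log\n# ban a client that makes more than 80 requests in a 5-minute window\nfindtime = 300\nmaxretry = 80\n# the first ban lasts one day; repeat offenses escalate toward roughly\n# 1 day, 7 days, 1 month, 6 months, and 1 year\nbantime  = 86400\nbantime.increment   = true\nbantime.multipliers = 1 7 30 180 365\n<\/pre>\n<p>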
When we attempted to block at 80 requests per 5 minutes, and later at 60 requests per 5 minutes, we experienced brief respites before new bots would return with slightly slower streams of requests.<\/p>\n<h3>Facet blocking<\/h3>\n<p>After standard IP blocks and request throttling both failed, a deeper examination of the logs revealed that these bots were requesting huge numbers of very large and unusual combinations of facets, producing costly and unusual requests. These types of queries can be very resource-intensive and slow to return, contributing to a vicious cycle of slow responses, large request queues, and repeated query attempts. Many of these queries involved subject headings that were far outside our collecting strengths. To illustrate: in the entire month of November, we received 15 searches using the terms &#8220;Finnish&#8221; and &#8220;Music.&#8221; On December 4th, 2024 alone, we received 11,329 of these searches. More importantly, we also discerned that many, if not most, of these requests used a far higher number of facets than we would expect from a human user. For example, in all of November 2024, we received 294,398 requests with 15 or more facets applied (many of which were likely also bots). On December 4th alone, we received 140,104. Also on December 4th, we received 17,527 requests with 20 or more facets applied and 203 requests with 25 or more facets applied. That day&#8217;s 648,861 requests came from 22,861 distinct IP addresses with no clear relationship between them. These numbers do not include any requests from the millions of IP addresses we had already blocked in the previous phase.<\/p>\n<p>Less than two weeks later, Google published faceted navigation SEO best practices[<a id=\"ref7\" href=\"#note7\">7<\/a>], stating that &#8220;faceted navigation is by far the most common source of overcrawl issues site owners report to us.&#8221; We later read about the development of &#8220;bot traps,&#8221;[<a id=\"ref8\" href=\"#note8\">8<\/a>] and realized that libraries may have unwittingly created bot traps by creating broadly and deeply filterable discovery interfaces. We began to explore the idea of using faceted interfaces as a kind of security honeypot[<a id=\"ref9\" href=\"#note9\">9<\/a>] where we could detect signs of non-human interaction and establish bans that would propagate to other systems where that traffic may be more difficult to detect.<\/p>\n<p>We developed a regular expression filter for fail2ban to isolate requests with more than a certain number of facets. We collaborated with our User Experience department and catalog committee to determine the ideal facet threshold, settling on a range of 8 to 12, depending on the severity of traffic, while allowing all queries from on-campus and VPN networks. After the facet-based bot blocking method went live on the afternoon of December 5th, 2024, we almost immediately saw large numbers of IP addresses being banned, 700 in the first 30 minutes, over 10,000 after 24 hours, and 16,000 after 48 hours. This reflects, in part, the very deep pool of IP addresses that the attackers were able to draw on. The 3,604,924 requests from the week before we deployed the facet-based blocks (November 28th-December 4th) originated from 207,351 distinct IP addresses. In contrast, the week after the facet-based blocks were deployed (December 6th-December 12th) saw 673,031 requests (an over 81% drop) from 220,110 distinct IP addresses. 
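<\/p>\n<p>The core of this filter is a single pattern that counts facet parameters in the logged request line. The following is an illustrative sketch rather than our exact production rule: the filter name, the \/catalog path, and the threshold of twelve are examples, and the pattern assumes Blacklight-style f[...] parameters that appear URL-encoded (f%5B) in the access log.<\/p>\n<pre>\n# \/etc\/fail2ban\/filter.d\/catalog-facet-abuse.conf -- illustrative sketch only\n[Definition]\n# Flag requests whose query string carries 12 or more facet parameters.\n# Percent signs are doubled because fail2ban INI files use %-interpolation.\nfailregex = ^&lt;HOST&gt; .*\"GET \/catalog[^ ]*(?:f%%5B[^ ]*?){12}\nignoreregex =\n<\/pre>\n<p>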
The distinct IP address count was likely sustained because each client could submit one successful &#8220;bad&#8221; query before being banned. We began sharing the details of our method with other institutions the next day.<\/p>\n<p>We also saw adaptive behavior based on the length of the original ban, which started at just 2 minutes and began after the second &#8220;bad&#8221; faceted query. The initial ban eventually grew to 12 hours after the first offense, with longer bans for repeat offenses. Although this solution was sufficient for nearly a month, the crawlers eventually managed to evade the bans by limiting the number of facets in their queries and drawing from larger sets of IP addresses. We have continued to find this practice helpful as a complement to the other measures described below, as we experienced renewed outages when it was removed. Additionally, in the five months since this change, we have received only one user report of a problem with this limit.<\/p>\n<h3>Subnet-based blocks<\/h3>\n<p>After several weeks of stability, we began to struggle under the weight of huge numbers of neighboring IP addresses. We modified our facet-based ban rules from blocking just the offending IP address to blocking the whole \/24 subnet; that is, the pool of 256 IP addresses that share the first three dotted quads of the address, for an offense by any IP in that subnet. We also lowered the facet limit and started banning clients after a single flagged query. We were seeking to expand our protection to stabilize our servers with the least possible collateral damage to human users.<\/p>\n<h3>Regional bans<\/h3>\n<p>Finally, the traffic moved offshore to East Asia, notably the People&#8217;s Republic of China (PRC). While this allowed us to programmatically ban large networks, we quickly found ourselves playing a game of whack-a-mole again; that is, once again managing an ongoing sequence of emergent problems, each requiring immediate, but likely temporary, resolution. After discussions with stakeholders, we decided to briefly ban all addresses from the PRC. We reasoned that the catalog was down anyway, and this might provide some temporary breathing room for us to develop a new solution. This had the surprising effect of almost immediately moving traffic to Brazil, and then to various locations in South America and Europe when we blocked Brazil. At this point, we were unable to restrict queries any further without significantly impacting human users. We knew we needed a non-local solution that could sit between the traffic and our server and examine the traffic to either block bots or prioritize on-campus and VPN users.<\/p>\n<h3>Commercial Web Application Firewall<\/h3>\n<p>We approached our campus networking and security colleagues for assistance. They worked through several possible options and then introduced us to their Web Application Firewall (WAF) vendor. A WAF examines and filters traffic to a web application based on the source, content, or other attributes of a request before the request reaches the application server. We worked in close partnership with our campus networking unit and the vendor to rapidly implement a trial of a cloud-based, GDPR-compliant WAF for the catalog server. 
On several occasions, this required involving senior engineers at the vendor, with one apologizing for smiling because &#8220;it&#8217;s just interesting that I haven&#8217;t seen something like this before.&#8221;<\/p>\n<p>Once the WAF was implemented and configured, we started to see some relief from the traffic. The WAF included a bot filtering mechanism based on ratings of client &#8220;trustworthiness,&#8221; and also allowed us to more easily monitor and block traffic from troublesome sources. For the fourth time, we spent weeks feeling that we had found a solution. We licensed the WAF for a year, and we started planning to move other services behind it. After nearly a month of stability, however, large numbers of bots began finding their way through the WAF.<\/p>\n<h3>Commercial AI-driven bot detection<\/h3>\n<p>We worked with the WAF vendor to evaluate their &#8220;AI-driven&#8221; bot detection system that can be included as a component of the WAF. This seemed to solve the problem again, but the cost of this system would more than double the WAF price. This was prohibitively expensive, so we began investigating ways to prioritize traffic regionally (e.g., rate-limiting sources of traffic outside the campus and our VPN). The cloud WAF infrastructure made it impossible to include the VPN in these rules since the WAF was located at a non-campus IP address. We began discussing VPN exceptions with campus networking and security colleagues, while also evaluating a recent publication by Jonathan Rochkind[<a id=\"ref10\" href=\"#note10\">10<\/a>], which described the use of the Cloudflare Turnstile CAPTCHA-style system to detect and block bot traffic.<\/p>\n<h3>In-browser client verification (i.e., Cloudflare Turnstile)<\/h3>\n<p>One common comment from our Libraries colleagues as we shared updates on the situation was something along the lines of, &#8220;Oh, no, please, I hope we don&#8217;t need a CAPTCHA\u2026&#8221; Unfortunately, despite legitimate concerns regarding accessibility and user annoyance, the severity of the crawler situation forced us to explore this option. Fortunately, newer technologies like the free-to-use Cloudflare Turnstile offered a promising middle path. Unlike traditional CAPTCHAs that explicitly challenge users to identify traffic lights or crosswalks, Turnstile operates automatically for most users through a combination of browser fingerprinting, behavioral analysis, and lightweight challenges that typically complete automatically without user intervention. Cloudflare also asserts that Turnstile is WCAG 2.1 Level AA compliant[<a id=\"ref11\" href=\"#note11\">11<\/a>].<\/p>\n<p>We implemented Turnstile selectively, focusing initially on the most resource-intensive endpoints in our catalog, particularly those involving complex faceted searches that had proven vulnerable to crawler abuse. This targeted approach minimized the impact on typical user workflows while providing robust protection for our most susceptible service points. We configured the system to exempt on-campus and VPN users entirely, ensuring that many of our primary users experienced no change in service. We did not issue challenges for record views or searches without facets, and we only challenge a client once per day.<\/p>\n<p>Results were immediate and dramatic. Server load returned to normal levels within hours of implementation, and we experienced very few further outages on protected endpoints. Those outages were eliminated by restoring our facet-based ban mechanism. 
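<\/p>\n<p>For readers considering a similar integration, the server-side half of the flow is small. The sketch below, in Python using the requests library, is illustrative rather than a copy of our integration: the widget posts a token back to the application, which confirms it against Cloudflare&#8217;s siteverify endpoint before running the protected (for us, heavily faceted) query.<\/p>\n<pre>\n# Minimal sketch of server-side Cloudflare Turnstile verification (illustrative only).\n# The client-side widget typically submits its token in a form field named\n# \"cf-turnstile-response\"; the application verifies that token with Cloudflare\n# before executing the expensive request.\nimport requests\n\nVERIFY_URL = \"https:\/\/challenges.cloudflare.com\/turnstile\/v0\/siteverify\"\n\ndef turnstile_passed(token, secret_key, client_ip=None):\n    \"\"\"Return True if Cloudflare confirms the Turnstile token is valid.\"\"\"\n    payload = {\"secret\": secret_key, \"response\": token}\n    if client_ip:\n        payload[\"remoteip\"] = client_ip\n    result = requests.post(VERIFY_URL, data=payload, timeout=5).json()\n    return bool(result.get(\"success\"))\n<\/pre>\n<p>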
In the extremely rare cases where legitimate users encountered service disruptions due to Turnstile (i.e., two incidents in the first four months of use), our cross-departmental response team quickly guided users to solutions.<\/p>\n<h2>Lessons Learned<\/h2>\n<p>Our sustained efforts, along with those of the wider libraries and archives community, to counter the evolving practices of these crawlers have yielded several valuable lessons. These insights, developed through a continuous cycle of implementing defenses and observing crawler adaptations, reveal the distinct challenges posed by this modern bot activity and the inherent limitations of conventional security and traffic management measures.<\/p>\n<h3>Banning blocks of IP addresses is insufficient<\/h3>\n<p>The traffic we have observed originates from a vast number of unrelated IP addresses, often from consumer internet service providers rather than from recognized corporate servers. This distributed network of bots has proven to be adept at evading our mitigation measures, as it undermines the traditional approach of banning attributable blocks of IP addresses. It appears that these requests were coming from a residential proxy network, essentially a legitimized botnet. These are services that can be hired to distribute network traffic over a wide range of (sometimes tens of millions of) unrelated individual internet addresses. These proxy nodes can be established in various ways, but one method is to conceal them within mobile apps, browser extensions, or other software[<a id=\"ref12\" href=\"#note12\">12<\/a>]. Others include paying users to run this software on their computers and building massive collections of proxied addresses in data centers.<\/p>\n<h3>Rule-based controls are insufficient<\/h3>\n<p>We have found that static, rule-based defenses are insufficiently flexible to respond to crawlers that exhibit a kind of intelligent adaptation. As we implemented countermeasures, we observed crawler behavior rapidly evolving to circumvent our defenses. When we blocked based on user agents or specific request patterns, the crawlers quickly adapted by modifying these identifiers. Unlike traditional web crawlers that follow predictable paths through sites, these new crawlers exhibited what appeared to be machine learning-based adaptability. They would probe different access vectors, analyze our responses, and modify their approach accordingly.<\/p>\n<p>While we are aware of efforts to implement complex bot fingerprinting in libraries, our attempts to identify patterns in request headers have proved largely unsuccessful, as the crawlers routinely rotate or randomize header values. Even when we identified consistent patterns in behavior, these would shift within days or sometimes hours of implementing blocking measures. This high level of adaptation suggests sophisticated orchestration beyond typical web crawling tools, pointing to well-funded operations with significant resources at their disposal.<\/p>\n<h3>We are dramatically out-resourced<\/h3>\n<p>The crawlers producing this massive wave of traffic are as inefficient as they are evasive. Rather than following responsible web crawling practices, such as respecting robots.txt, utilizing sitemaps, or implementing reasonable rate limits, these crawlers appear to be designed to extract maximum data, regardless of the computational or infrastructure costs imposed on libraries and archives. 
We observed crawlers repeatedly retrieving the same resources tens of thousands of times per day, ignoring cache headers, and generating needlessly expensive (i.e., slow) queries applying exhaustive combinations of facets. This &#8220;brute force&#8221; approach suggests either a lack of technical sophistication or, more likely, a business model where the value of comprehensive data extraction outweighs any concern for efficiency or ethical crawling standards. These inefficient patterns, combined with the highly sophisticated adaptive and evasive behavior, support our hypothesis that the entities behind these crawlers prioritize rapid and comprehensive data acquisition over efficient resource utilization, likely because the current commercial value of training AI models outweighs the costs of infrastructure. The contrast in sophistication between evasive and harvesting techniques also suggests the pairing of advanced, massively distributed, and evasive network traffic infrastructure with immature crawling technology. A common refrain from our colleagues was that we would love to share our data with all comers, in any preferred format, if it meant avoiding this abuse.<\/p>\n<h3>Scaling and performance optimization have limited impacts<\/h3>\n<p>Our initial response focused on scaling infrastructure and optimizing application performance. However, we quickly discovered fundamental limits to this approach. Even after increasing server resources, optimizing the catalog application, and carefully tuning server configurations, the sheer volume of traffic overwhelmed our systems. Although institutions with cloud-based infrastructure were able to scale much more aggressively by adding servers behind a load balancer, they eventually reported similar experiences as well as concerns related to the fees associated with cloud-based scaling. With on-premises infrastructure, we faced physical hardware limitations that couldn&#8217;t be quickly addressed.<\/p>\n<p>This experience has demonstrated that technical resource optimization alone, although helpful, cannot solve the problem of aggressive crawler traffic. Comprehensively addressing this issue requires fundamentally rethinking how we prioritize human access to our systems.<\/p>\n<h3>Libraries and archives may be uniquely vulnerable<\/h3>\n<p>While organizations across various sectors have experienced increased crawler activity, libraries appear uniquely vulnerable to these aggressive bots. Our discovery interfaces combine deep faceting capabilities, highly structured metadata, extensive collections of unique assets of profound research value, and typically lower access barriers than commercial entities, which creates an ideal target for data harvesting operations. This vulnerability is further compounded by libraries&#8217; commitment to open access to information, a principle that can sometimes conflict with the need for robust security and business continuity measures.<\/p>\n<h2>Future Directions and Collaboration<\/h2>\n<h3>Community knowledge sharing<\/h3>\n<p>The Code4Lib Slack channel quickly became a valuable resource for those aware of its existence, while other critical information sharing occurred informally at the margins of unrelated community calls and events. However, we discovered a significant information gap as we communicated with peer institutions: many were completely unaware of this community resource despite experiencing identical issues. 
We found ourselves repeatedly directing colleagues from other institutions to this channel, highlighting the uneven distribution of community knowledge. When information travels primarily through informal networks, some institutions, particularly smaller ones, are most likely to be left without effective practices.<br \/>\nThe ad hoc nature of our collective response highlights the need for more formalized knowledge-sharing mechanisms that can reach libraries of all sizes and technical capacities. While the Code4Lib Slack channel represents an important first step, expanding communication channels and developing shared repositories of code (of which there are several recent examples[<a id=\"ref13\" href=\"#note13\">13<\/a>][<a id=\"ref14\" href=\"#note14\">14<\/a>][<a id=\"ref15\" href=\"#note15\">15<\/a>][<a id=\"ref16\" href=\"#note16\">16<\/a>]) and documentation specifically designed for institutions with limited technical resources would strengthen our community&#8217;s collective resilience against these evolving threats.<\/p>\n<h3>Proof-of-work checks<\/h3>\n<p>One promising approach involves implementing lightweight &#8220;proof of work&#8221; checks. Unlike traditional CAPTCHA-style mechanisms, these checks operate invisibly to users while requiring browsers to complete small computational tasks that are trivial for individual sessions but may become prohibitively expensive at crawler scale. Cloudflare Turnstile includes proof-of-work checks as one component of its verification suite. One prominent open-source example is Xe Iaso\u2019s Anubis software[<a id=\"ref17\" href=\"#note17\">17<\/a>], which &#8220;weighs the soul of your connection using a proof-of-work challenge in order to protect upstream resources from scraper bots.&#8221; This approach may provide an improved balance between accessibility and security. The computational requirements can be dynamically adjusted based on system load, increasing the barrier during suspected crawler activity while remaining negligible during normal operations. Several academic libraries and archives have recently implemented Anubis to control bot traffic[<a id=\"ref18\" href=\"#note18\">18<\/a>][<a id=\"ref19\" href=\"#note19\">19<\/a>][<a id=\"ref20\" href=\"#note20\">20<\/a>][<a id=\"ref21\" href=\"#note21\">21<\/a>], a process documented by Sean Aery at Duke University Libraries[<a id=\"ref22\" href=\"#note22\">22<\/a>].<\/p>\n<h3>Regional service prioritization<\/h3>\n<p>Our experiments with geographic blocking, while effective in the short term, highlighted the need for more nuanced regional service prioritization. Rather than outright blocking entire countries or regions, future infrastructure could implement tiered access channels that prioritize traffic from the institution&#8217;s primary service areas during periods of high load.<\/p>\n<p>These sorts of tiered access models conflict with our professional commitment to open access to information[<a id=\"ref23\" href=\"#note23\">23<\/a>], but the intensity of these &#8220;attacks&#8221; has prompted surprising conversations along these lines. Implementation would require careful consideration of equity and access principles, potentially including allowlisting mechanisms for known international research and academic partners, as well as transparent communication about resource allocation during periods of peak demand.<\/p>\n<h3>Impact on web analytics<\/h3>\n<p>It is also important to note the effect this may have on web analytics reporting. 
Although bot traffic filtering provided by analytics platforms has always been imperfect, the volume of undetectable machine-driven traffic is likely to massively distort any reporting for periods before the implementation of effective mitigation methods.<\/p>\n<h2>Conclusion<\/h2>\n<p>The unprecedented surge in aggressive crawler traffic, coinciding with the release of commercial generative AI models, has forced libraries to rapidly evolve our approach to network and application security. While the timing strongly suggests connections to AI training, we cannot definitively attribute these activities to specific organizations or purposes. Regardless of intent, the impact on library services has been substantial, requiring significant financial and human resources to maintain stable access for our human users.<\/p>\n<p>The libraries and archives community has developed a multi-layered defense strategy that has proven, to date, to be somewhat effective against even the most aggressive and evasive crawlers. We have continued to experience intermittent service disruptions related to this traffic, and some of our peers now require logins for catalog access and block access to increasingly broad sets of IP addresses, sometimes representing entire countries. Although client checks such as Cloudflare Turnstile or Anubis are currently the most effective layer of the strategy, the facet-aware detection method introduced in this article represents a novel approach designed explicitly for library discovery interfaces. We strive to transform a potential vulnerability into a detection opportunity by identifying non-human interaction patterns unique to faceted interfaces.<\/p>\n<p>As the generative AI landscape continues to evolve, we anticipate ongoing adaptation from both crawlers and our institutions. By establishing structured channels for sharing detection patterns and mitigation strategies, libraries and archives can collectively build a more resilient infrastructure that maintains our commitment to open access to information while ensuring service stability for our communities. As we look to the future, the international community that formed out of necessity and desperation may serve as a valuable model for a rapidly changing technological landscape.<\/p>\n<h2>Acknowledgments<\/h2>\n<p>This work would not have been possible without the cross-departmental team at UNC-Chapel Hill that responded to these incidents over the past year. In addition to the authors of this article, this team includes Sidney Stafford, Alex Everett, Dean Farrell, Jamie McGarty, Joseph Moran, Zachary Tewell, Emily Brassell, Benjamin Pennell, Chad Haefele, and others, with invaluable strategic support from Mar<span data-huuid=\"9108846393841904543\">\u00ed<\/span>a Estorino and Michael Barker. We also send our thanks to our regional colleagues at the Triangle Research Libraries Network, and specifically to Adam Constabaris at NCSU Libraries for introducing us to residential proxy networks and to Sean Aery for sharing Duke University Libraries&#8217; experience with the Anubis system.<\/p>\n<p>We are grateful for the Code4Lib Slack #bots channel community for its curiosity, creativity, and generosity. 
We are particularly appreciative of Jonathan Rochkind&#8217;s work demonstrating the integration of Cloudflare Turnstile in a Blacklight application.<\/p>\n<h2 class=\"abouttheauthor\">About the authors<\/h2>\n<p><em>Jason Casden<\/em> is the Head of Software Development at the University of North Carolina at Chapel Hill University Libraries. He leads engineers working on a diverse portfolio of projects dedicated to building, maintaining, and innovating the Libraries&#8217; digital services, including digital collections and repositories, archival information systems, digital preservation systems, and the public catalog discovery layer.<\/p>\n<p><em>David Romani<\/em> is the recently retired Senior Linux System Administrator at the University of North Carolina at Chapel Hill University Libraries. David&#8217;s professional focus at UNC was system architecture and automation.<\/p>\n<p><em>Tim Shearer<\/em> is the Associate University Librarian for Digital Strategies and IT at the University of North Carolina at Chapel Hill University Libraries. In this role he leads the Library Information Technology division and has responsibility for commodity computing, infrastructure, user experience, software development, repository development, discovery, digitization, and information security. Tim is interested in how technology can play a transformative role in supporting discovery, delivery, use, and preservation of information. He holds an MSLS from UNC Chapel Hill\u2019s School of Information and Library Science.<\/p>\n<p><em>Jeff Campbell<\/em> is the Head of Infrastructure Management Services at the University of North Carolina at Chapel Hill University Libraries. He leads a skilled team of systems professionals responsible for the strategic planning and management of the Libraries\u2019 IT infrastructure, including oversight of server environments, network architecture, cybersecurity measures, and disaster recovery planning.<\/p>\n<h2>References<\/h2>\n<p>[<a id=\"note1a\" href=\"#ref1a\">1a<\/a>][<a id=\"note1b\" href=\"#ref1b\">1b<\/a>] Weinberg M. Are AI Bots Knocking Cultural Heritage Offline? Engelberg Center on Innovation Law &amp; Policy; 2025. <a href=\"https:\/\/glamelab.org\/products\/are-ai-bots-knocking-cultural-heritage-offline\/\">https:\/\/glamelab.org\/products\/are-ai-bots-knocking-cultural-heritage-offline\/<\/a><\/p>\n<p>[<a id=\"note2a\" href=\"#ref2a\">2a<\/a>][<a id=\"note2b\" href=\"#ref2b\">2b<\/a>] Shearer K, Walk P. The impact of AI bots and crawlers on open repositories: Results of a COAR survey, April 2025. COAR: Confederation of Open Access Repositories; 2025. <a href=\"https:\/\/coar-repositories.org\/news-updates\/open-repositories-are-being-profoundly-impacted-by-ai-bots-and-other-crawlers-results-of-a-coar-survey\/\">https:\/\/coar-repositories.org\/news-updates\/open-repositories-are-being-profoundly-impacted-by-ai-bots-and-other-crawlers-results-of-a-coar-survey\/<\/a><\/p>\n<p>[<a id=\"note3\" href=\"#ref3\">3<\/a>] Code4Lib Slack page. Code4Lib. [accessed 2025 May 22]. <a href=\"http:\/\/code4lib.org\/irc\/#slack\">http:\/\/code4lib.org\/irc\/#slack<\/a><\/p>\n<p>[<a id=\"note4\" href=\"#ref4\">4<\/a>] networksdb.io: Company to IP, IP address owners, Reverse Whois + DNS, free tools &amp; API. [accessed 2025 May 22]. <a href=\"https:\/\/networksdb.io\/\">https:\/\/networksdb.io\/<\/a><\/p>\n<p>[<a id=\"note5\" href=\"#ref5\">5<\/a>] Dark Visitors &#8211; Track the AI Agents and Bots Crawling Your Website. [accessed 2025 May 22]. 
<a href=\"https:\/\/darkvisitors.com\/\">https:\/\/darkvisitors.com\/<\/a><\/p>\n<p>[<a id=\"note6\" href=\"#ref6\">6<\/a>] fail2ban\/fail2ban: Daemon to ban hosts that cause multiple authentication errors. 2025 [accessed 2025 May 22]. <a href=\"https:\/\/github.com\/fail2ban\/fail2ban\">https:\/\/github.com\/fail2ban\/fail2ban<\/a><\/p>\n<p>[<a id=\"note7\" href=\"#ref7\">7<\/a>] Illyes G. Crawling December: Faceted navigation | Google Search Central Blog. Google for Developers. [accessed 2025 May 22]. <a href=\"https:\/\/developers.google.com\/search\/blog\/2024\/12\/crawling-december-faceted-nav\">https:\/\/developers.google.com\/search\/blog\/2024\/12\/crawling-december-faceted-nav<\/a><\/p>\n<p>[<a id=\"note8\" href=\"#ref8\">8<\/a>] Koebler J. Developer Creates Infinite Maze That Traps AI Training Bots. 404 Media. 2025 Jan 23 [accessed 2025 May 19]. <a href=\"https:\/\/www.404media.co\/developer-creates-infinite-maze-to-trap-ai-crawlers-in\/\">https:\/\/www.404media.co\/developer-creates-infinite-maze-to-trap-ai-crawlers-in\/<\/a><\/p>\n<p>[<a id=\"note9\" href=\"#ref9\">9<\/a>] Provos N, Holz T. Virtual honeypots: from botnet tracking to intrusion detection. Addison-Wesley; 2010.<\/p>\n<p>[<a id=\"note10\" href=\"#ref10\">10<\/a>] Rochkind J. Using CloudFlare Turnstile to protect certain pages on a Rails app \u2013 Bibliographic Wilderness. [accessed 2025 May 19]. <a href=\"https:\/\/bibwild.wordpress.com\/2025\/01\/16\/using-cloudflare-turnstile-to-protect-certain-pages-on-a-rails-app\/\">https:\/\/bibwild.wordpress.com\/2025\/01\/16\/using-cloudflare-turnstile-to-protect-certain-pages-on-a-rails-app\/<\/a><\/p>\n<p>[<a id=\"note11\" href=\"#ref11\">11<\/a>] FAQ \u00b7 Cloudflare Turnstile docs. [accessed 2025 May 21]. <a href=\"https:\/\/developers.cloudflare.com\/turnstile\/frequently-asked-questions\/\">https:\/\/developers.cloudflare.com\/turnstile\/frequently-asked-questions\/<\/a><\/p>\n<p>[<a id=\"note12\" href=\"#ref12\">12<\/a>] Mi X et al. Resident Evil: Understanding Residential IP Proxy as a Dark Service. In: 2019 IEEE Symposium on Security and Privacy (SP). 2019. p 1185\u20131201. <a href=\"https:\/\/ieeexplore.ieee.org\/abstract\/document\/8835239\">https:\/\/ieeexplore.ieee.org\/abstract\/document\/8835239<\/a>. <a href=\"https:\/\/doi.org\/10.1109\/SP.2019.00011\">https:\/\/doi.org\/10.1109\/SP.2019.00011<\/a><\/p>\n<p>[<a id=\"note13\" href=\"#ref13\">13<\/a>] Rochkind J. samvera-labs\/bot_challenge_page: Show a bot challenge interstitial for Rails, usually using Cloudflare Turnstile. [accessed 2025 May 22]. <a href=\"https:\/\/github.com\/samvera-labs\/bot_challenge_page\/\">https:\/\/github.com\/samvera-labs\/bot_challenge_page\/<\/a><\/p>\n<p>[<a id=\"note14\" href=\"#ref14\">14<\/a>] Corall J. libops\/captcha-protect: Traefik middleware to add an anti-bot challenge to individual IPs in a subnet when traffic spikes are detected from that subnet. [accessed 2025 May 23]. <a href=\"https:\/\/github.com\/libops\/captcha-protect\">https:\/\/github.com\/libops\/captcha-protect<\/a><\/p>\n<p>[<a id=\"note15\" href=\"#ref15\">15<\/a>] Cliff D. cerberus\/config\/initializers\/zoo_rack_attack.rb at support\/1.x \u00b7 NEU-Libraries\/cerberus. [accessed 2025 May 22]. 
<a href=\"https:\/\/github.com\/NEU-Libraries\/cerberus\/blob\/support\/1.x\/config\/initializers\/zoo_rack_attack.rb\">https:\/\/github.com\/NEU-Libraries\/cerberus\/blob\/support\/1.x\/config\/initializers\/zoo_rack_attack.rb<\/a><\/p>\n<p>[<a id=\"note16\" href=\"#ref16\">16<\/a>] ai-robots-txt\/ai.robots.txt: A list of AI agents and robots to block. 2025 [accessed 2025 May 22]. <a href=\"https:\/\/github.com\/ai-robots-txt\/ai.robots.txt\">https:\/\/github.com\/ai-robots-txt\/ai.robots.txt<\/a><\/p>\n<p>[<a id=\"note17\" href=\"#ref17\">17<\/a>] Iaso X. TecharoHQ\/anubis: Weighs the soul of incoming HTTP requests to stop AI crawlers. 2025 [accessed 2025 May 22]. <a href=\"https:\/\/github.com\/TecharoHQ\/anubis\">https:\/\/github.com\/TecharoHQ\/anubis<\/a><\/p>\n<p>[<a id=\"note18\" href=\"#ref18\">18<\/a>] Archives &amp; Manuscripts at Duke University Libraries. David M. Rubenstein Rare Book &amp; Manuscript Library. [accessed 2025 Jun 22]. <a href=\"https:\/\/archives.lib.duke.edu\/\">https:\/\/archives.lib.duke.edu\/<\/a><\/p>\n<p>[<a id=\"note19\" href=\"#ref19\">19<\/a>] Connecticut\u2019s Archives Online &#8211; Repositories. Western Connecticut State University Archives. [accessed 2025 Jun 22]. <a href=\"https:\/\/archives.library.wcsu.edu\/arclight\/caoSearch\/repositories\/\">https:\/\/archives.library.wcsu.edu\/arclight\/caoSearch\/repositories\/<\/a><\/p>\n<p>[<a id=\"note20\" href=\"#ref20\">20<\/a>] UCLA Library Digital Collections. [accessed 2025 Jun 22]. <a href=\"https:\/\/digital.library.ucla.edu\/\">https:\/\/digital.library.ucla.edu\/<\/a><\/p>\n<p>[<a id=\"note21\" href=\"#ref21\">21<\/a>] Columbia University Libraries Finding Aids. [accessed 2025 Jun 22]. <a href=\"https:\/\/findingaids.library.columbia.edu\/\">https:\/\/findingaids.library.columbia.edu\/<\/a><\/p>\n<p>[<a id=\"note22\" href=\"#ref22\">22<\/a>] Aery S. Anubis Pilot Project Report &#8211; June 2025. 2025 Jun 26 [accessed 2025 Jul 8]. <a href=\"https:\/\/hdl.handle.net\/10161\/32990\">https:\/\/hdl.handle.net\/10161\/32990<\/a><\/p>\n<p>[<a id=\"note23\" href=\"#ref23\">23<\/a>] Hellman E. AI bots are destroying Open Access. Go To Hellman. [accessed 2025 May 20]. <a href=\"https:\/\/go-to-hellman.blogspot.com\/2025\/03\/ai-bots-are-destroying-open-access.html\">https:\/\/go-to-hellman.blogspot.com\/2025\/03\/ai-bots-are-destroying-open-access.html<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>The rise of aggressive, adaptive, and evasive web crawlers is a significant challenge for libraries and archives, causing service disruptions and overwhelming institutional resources. This article details the experiences of the University of North Carolina at Chapel Hill University Libraries in combating an unprecedented flood of crawler traffic. It describes the escalating mitigation efforts, from traditional client blocking to the implementation of more advanced techniques such as request throttling, regional traffic prioritization, novel facet-based bot detection, commercial Web Application Firewalls (WAFs), and ultimately, in-browser client verification with Cloudflare Turnstile. The article highlights the adaptive nature of these crawlers, the limitations of isolated institutional responses, and the critical lessons learned from mitigation efforts, including the issues introduced by residential proxy networks and the extreme scale of the traffic. 
Our experiences demonstrate the effectiveness of a multi-layered defense strategy that includes both commercial and library-specific solutions, such as facet-based bot detection. The article emphasizes the importance of community-wide collaboration, proposing future directions such as formalized knowledge sharing and the ongoing development of best practices to collectively address this evolving threat to open access and the stability of digital library services.<\/p>\n","protected":false},"author":202,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[132],"tags":[],"class_list":["post-18489","post","type-post","status-publish","format-standard","hentry","category-issue61"],"_links":{"self":[{"href":"https:\/\/journal.code4lib.org\/wp-json\/wp\/v2\/posts\/18489","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/journal.code4lib.org\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/journal.code4lib.org\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/journal.code4lib.org\/wp-json\/wp\/v2\/users\/202"}],"replies":[{"embeddable":true,"href":"https:\/\/journal.code4lib.org\/wp-json\/wp\/v2\/comments?post=18489"}],"version-history":[{"count":2,"href":"https:\/\/journal.code4lib.org\/wp-json\/wp\/v2\/posts\/18489\/revisions"}],"predecessor-version":[{"id":18585,"href":"https:\/\/journal.code4lib.org\/wp-json\/wp\/v2\/posts\/18489\/revisions\/18585"}],"wp:attachment":[{"href":"https:\/\/journal.code4lib.org\/wp-json\/wp\/v2\/media?parent=18489"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/journal.code4lib.org\/wp-json\/wp\/v2\/categories?post=18489"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/journal.code4lib.org\/wp-json\/wp\/v2\/tags?post=18489"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}},{"id":18462,"date":"2025-10-21T15:58:57","date_gmt":"2025-10-21T19:58:57","guid":{"rendered":"https:\/\/journal.code4lib.org\/?p=18462"},"modified":"2025-10-21T16:00:00","modified_gmt":"2025-10-21T20:00:00","slug":"liberation-of-lms-siloed-instructional-data","status":"publish","type":"post","link":"https:\/\/journal.code4lib.org\/articles\/18462","title":{"rendered":"Liberation of LMS-siloed Instructional Data"},"content":{"rendered":"<p>By Hyung Wook Choi, Jonathan Wheeler, Weimao Ke, Lei Wang, Jane Greenberg, and Mat Kelly<\/p>\n<h2>Introduction<\/h2>\n<p>Data liberation is fundamental to enabling data reuse. The concepts of data reuse, data sharing, and open data have gained significant attention in academic and research communities, with the goal of making research outputs more widely accessible for purposes such as education, business, and further scientific inquiry [<a id=\"ref1\" href=\"#note1\">1<\/a>][<a id=\"ref2\" href=\"#note2\">2<\/a>]. By promoting open access, researchers can ensure that valuable data sets contribute well beyond their original scope of analysis [<a id=\"ref3\" href=\"#note3\">3<\/a>]. 
However, several barriers hinder effective data reuse, including technical difficulties [<a id=\"ref4\" href=\"#note4\">4<\/a>], the absence of standardized practices and supporting infrastructure [<a id=\"ref5\" href=\"#note5\">5<\/a>], and concerns about privacy [<a id=\"ref6\" href=\"#note6\">6<\/a>].<\/p>\n<p>The advancement of Information Technology (IT) has significantly influenced the growth of e-learning systems, enabling more flexible and accessible opportunities for education across various domains. These systems have supported not only formal education but also cultural engagement and both professional and personal skill development, thereby shaping modern educational infrastructures [<a id=\"ref7\" href=\"#note7\">7<\/a>]. Since much educational content has become available online, especially post-COVID [<a id=\"ref8\" href=\"#note8\">8<\/a>], sharing and reusing educational content holds the potential to enhance teaching and learning experiences, promote collaboration among educators, learners, and instructional designers, and ultimately improve educational outcomes. Reused educational resources can be adapted and customized to different contexts and strategies to meet specific needs in the field. Despite the significant benefits, challenges such as copyright concerns, quality assurance, and technical interoperability continue to hinder the sharing and reuse of educational resources and must be systematically addressed [<a id=\"ref9\" href=\"#note9\">9<\/a>]. For instance, Poole pointed out the short supply of faculty and the lack of shared and sustainable national infrastructure, including datasets, tools, techniques, and instructional demonstrations, for supporting data science education in the Library and Information Science (LIS) field [<a id=\"ref10\" href=\"#note10\">10<\/a>]. The study emphasized that promoting a national digital platform requires coordinated tools and technologies across the educational lifecycle.<\/p>\n<p>In this paper, we describe a preliminary process of liberating instructional material from the Blackboard learning management system (LMS), representative of efforts from grant-funded, multi-year boot camps that occurred in June of 2021, 2022, and 2023. These boot camps preceded 6-month internships by each fellow at institutions in the United States, including university libraries, non-profit organizations, and data science-driven collectives, among others. The instructional material was developed and interactively presented by 10 teaching, tenure-track, and tenured professors at the lead university and attended by 68 fellows over three years of the project (24, 26, and 18 fellows in each of the respective years). The topics covered in the boot camp included Data Management, Cleaning, and Mining; Information Retrieval; Data Curation, Preservation, Metadata, and Ontologies Search; and Machine Learning and ChatGPT, among others. The purpose of this data liberation effort is both to share the extensive, well-informed instructional material widely beyond the original cohorts of fellows and to explore the nuances that need to be considered beyond the technical efforts, such as privacy, rights restrictions, and the need to disseminate the material on an LMS-agnostic, sustainable, usable platform.<\/p>\n<h2>Related Work<\/h2>\n<p>Several studies have been conducted to associate digital archiving with learning management systems (LMS). Vitaglione et al. 
proposed a new software model for archiving slide-based presentations from the Internet, built around the Lecture Object format designed for storing lectures online [<a id=\"ref11\" href=\"#note11\">11<\/a>]. This project was developed in collaboration with the CERN HR Division Training and the University of Michigan, aiming to respond to the growing integration of the World Wide Web in educational contexts. Building on this initiative, Bousdira et al. documented the Web Lecture Archiving Project (WLAP), which involved the implementation of web-based archiving technologies to record a series of content-rich presentations at CERN [<a id=\"ref12\" href=\"#note12\">12<\/a>]. This was accomplished using Sync-O-Matic, a tool that generates slide-based web lectures viewable through a standard web browser. They introduced the concept of a Learning Object, a standard format for archiving web lectures and for exchanging them between different archives. Herr et al. published another study on lecture archiving at CERN and the University of Michigan as a part of the University of Michigan ATLAS Collaboratory Project (UMACP) [<a id=\"ref13\" href=\"#note13\">13<\/a>]. They reported that a new system adopting the Lecture Object architecture would be implemented, with the goal of creating a very long-term archive of rich, high-quality media and associated metadata.<\/p>\n<p>Chu et al. examined the advantages and disadvantages of learning management systems (LMS) and lecture capture (LC) technologies by introducing a blended learning model called the START program, which integrated e-learning with virtual mentorship and was implemented at Stanford University [<a id=\"ref14\" href=\"#note14\">14<\/a>]. Moodle was selected as the LMS platform for this initiative. Developed by the Anesthesia Informatics and Media Laboratory in the Department of Anesthesia, the program aimed to enhance the preparedness of interns transitioning into anesthesia residency. Their findings indicated that online educational programs can be effective but require substantial technical expertise during the initial setup phase. Jamieson and Verhaart presented three case studies documenting the transition from Blackboard to Moodle [<a id=\"ref15\" href=\"#note15\">15<\/a>]. Their analysis revealed that Moodle\u2019s features were less specific than Blackboard\u2019s, leading to issues such as errors during content import. More recently, Santoso et al. proposed an archival system, implemented using Docker, to store and manage historical course content and activities, with the goal of alleviating storage demands on the Moodle server [<a id=\"ref16\" href=\"#note16\">16<\/a>]. Following implementation, they conducted three separate evaluation tests. The results demonstrated that variations in course size significantly impacted the time required to restore backup files.<\/p>\n<p>Given that the primary goal of this effort is to archive content for future reuse, we draw on research in web archiving, particularly efforts related to preserving content behind authentication barriers. While Kelly et al. focused on archiving private and personal content [<a id=\"ref17\" href=\"#note17\">17<\/a>], their earlier work explored tool development for ad hoc approaches to small-scale data extraction [<a id=\"ref18\" href=\"#note18\">18<\/a>], which closely aligns with the scope of our project.
More recent initiatives, such as the Webrecorder project [<a id=\"ref19\" href=\"#note19\">19<\/a>], have further democratized personal web archiving, making it more accessible regardless of privacy constraints. This body of work has informed our data extraction methodology and will continue to guide our approach as the archiving project evolves.<\/p>\n<p class=\"caption\">\n<img decoding=\"async\" src=\"\/media\/issue61\/choi\/figure1.png\"><br \/>\n<strong>Figure 1.<\/strong> The Bootcamp data was organized in Blackboard by day in the leftmost navigation. Within each day, the respective instructors posted their contents for the fellows. The contents of this image have been masked to remove identifying information.\n<\/p>\n<h2>Methodology<\/h2>\n<p>While we had direct, practical familiarity with the contents of the bootcamp Blackboard instance, we needed to liberate the contents from the authentication-protected system in a manner that maintained the file and course structure as well as the metadata that organizes each iteration of the boot camp. This required a process of data extraction, inventory, and auditing, which we describe in this section. We began this process in the Fall of 2023 and the effort is currently underway (see the timeline below).<\/p>\n<ul>\n<li>November 2020-April 2021 Boot camp materials developed by instructors<\/li>\n<li>May 2021-August 2021 Boot camp<\/li>\n<li>June 2021 Boot camp (2021)<\/li>\n<li>July 2021-December 2021 Fellows&#8217; internship at one of 21 different sites<\/li>\n<li>June 2022 Boot camp (2022)<\/li>\n<li>July 2022-December 2022 Fellows&#8217; internship at one of 19 different sites<\/li>\n<li>June 2023 Boot camp (2023)<\/li>\n<li>July 2023-December 2023 Fellows&#8217; internship at one of 13 different sites<\/li>\n<li>September 2023-December 2023 Boot camp dump and analysis begin<\/li>\n<li>January-February 2024 Preliminary Audit performed<\/li>\n<li>February-March 2024 Extraction and validation procedure performed and documented<\/li>\n<\/ul>\n<h3>Data Extraction and Inventory<\/h3>\n<p>We exported course data using the functionality built into Blackboard and took inventory of the contents of the export using the DROID file format identification tool from the UK National Archives [<a id=\"ref20\" href=\"#note20\">20<\/a>]. We then parsed the CSV output of the scan using an R script to generate a count of files by type, with the intent of getting an overview of the number of scripts, notebooks, PDFs, and binary types. The export included files in PowerPoint and Word formats, as well as operating-system-specific but otherwise contextually unnecessary files like *_MACOS* directories. Given the large number of files in the export, the DROID report also served to confirm assumptions about the structure of the data: one top-level directory containing structural XML files and a small number of subdirectories, primarily used for Blackboard configuration. Among these was a single dedicated subdirectory serving as the content store, which contained all of the embedded course materials.<\/p>\n<p>Since all files within the course export are named according to a schema that is internal to a single Blackboard course instance, with site-specific identifiers replacing human-readable file names, our next step was to mine the XML for filenames. Within the content store directory, each content file was represented by two files: the content in its original format and an XML file containing system information, including the original file name. The shared identifier between the two files was the system-generated filename. We developed a Python script to read the XML files and output a lookup table with two columns: the system filename and the corresponding original filename.<\/p>
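<p>As a rough illustration of this step, the sketch below (our own reconstruction for this article, not the actual script; the directory path, element selection, and attribute name are placeholders, since the Blackboard export schema is internal to the system) walks the content store, pulls a human-readable name out of each descriptor XML file, and writes a two-column lookup table:</p>
<pre>
import csv
import xml.etree.ElementTree as ET
from pathlib import Path

# Placeholder paths; the real content store sits inside the Blackboard export.
CONTENT_STORE = Path("export/csfiles/home_dir")
LOOKUP_CSV = Path("filename_lookup.csv")

rows = []
for xml_file in CONTENT_STORE.glob("*.xml"):
    root = ET.parse(xml_file).getroot()
    # Hypothetical: assume the original (human-readable) filename appears as a
    # NAME attribute somewhere in the descriptor XML.
    name_el = root.find(".//*[@NAME]")
    original_name = name_el.get("NAME") if name_el is not None else ""
    # The system-generated filename is shared by the descriptor and the content
    # file, so the descriptor's own stem identifies the pair.
    rows.append((xml_file.stem, original_name))

with LOOKUP_CSV.open("w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["system_filename", "original_filename"])
    writer.writerows(rows)
</pre>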
<h3>Traversing the Course Structure<\/h3>\n<p>The initial process produced a list of content files using their original filenames. However, conducting a thorough audit required matching these files with the corresponding lesson descriptions within each module. One potential method was to set up a sandboxed Blackboard instance, import the exported data, and compare it with the live version of the course while it remained accessible online.<\/p>\n<p>Instead, a scripted solution was developed. This script parsed the XML files to generate a weekly overview of course content, aligning the schedule with specific XML data to map course materials to each week. To streamline this process, lookup tables were also used rather than relying solely on the raw XML files.<\/p>\n<p>The institution running the Blackboard system uses Kaltura as a long-term video hosting solution, rather than relying on Zoom for video storage. While Zoom recordings are deleted after 18 weeks [<a id=\"ref21\" href=\"#note21\">21<\/a>], an automated process transfers videos created through the institution\u2019s Zoom account into Kaltura. These videos can then be embedded within Blackboard without being physically copied. Instructors reference the videos via Blackboard, but the actual files remain hosted on Kaltura.<\/p>\n<h3>Kaltura Caveat<\/h3>\n<p>In December 2023, the grant-funded institution hosting the Blackboard instance notified all instructors that the prior embedded instances of Kaltura videos would no longer function and needed to be reset in Blackboard. Given that all iterations of the boot camp had since concluded, a manual process of adapting the embeds might have been needed to keep the videos accessible and playable. We mitigated this effort by extracting the URLs of the videos in the import rather than relying on the embed markup itself, which only allows the videos to be used in their original form, i.e., within the Blackboard interface.<\/p>\n<h2>Discussion and Future Work<\/h2>\n<p>As the process of data extraction and migration progresses, it is essential to critically consider the broader implications associated with the dissemination of the resulting materials. Chief among these concerns is the issue of instructional authorization: the participation of instructors in the boot camp events does not inherently imply consent for public distribution of their contributions. Accordingly, explicit permission from each instructor must be obtained prior to dissemination.<\/p>\n<p>Furthermore, a comprehensive review and analysis of the extracted content is required to ensure that copyrighted materials, whether included inadvertently or permitted within the credential-protected Blackboard environment, are not redistributed without proper authorization. For materials that are publicly available but still under copyright, our dissemination strategy includes verifying their presence in the Internet Archive\u2019s repository and providing external links rather than hosting the content directly.<\/p>\n<p>For resources for which reuse permissions have been granted, particularly interactive elements such as Jupyter notebooks, we intend to integrate and adapt these materials into existing Library Carpentry workshop formats.
This will support their continued use and extension by broader audiences, contingent upon the prior acquisition of all necessary permissions from the original content creators.<\/p>\n<h3>Audit Redux<\/h3>\n<p>At the time of this writing, we have performed a preliminary audit of boot camp materials that includes instructional video recordings of the instructors\u2019 lectures, PowerPoint\/PDF presentation files, and Python-based notebooks. While some of the notebooks are Jupyter-based and stored in the Blackboard shell itself, some instructors opted for notebooks on other platforms like Google Colab. While the offerings of these two mechanisms for executing Python scripts using notebooks are similar, there is no guarantee that one notebook will be portable to other systems due to differences in the availability of modules and resources on each platform. We plan to evaluate both the notebooks and their linked versions (e.g., in Colab), and manually verify that the execution results match expectations.<\/p>\n<h3>Presentation Medium<\/h3>\n<p>Dissemination of the bootcamp material need not maintain the original web-based representation, that is, organized by the structure and style of the Blackboard template. We are looking into effective means of organizing the content, both to make the content from the multiple iterations of the boot camp available and to facilitate accessibility. Our intention is for the content to be widely available and permissively licensed for reuse. We are considering using platforms like GitHub Pages to accomplish this, with external links to captures at the Internet Archive for copyrighted, publicly accessible, web-based materials.<\/p>\n<h3>Evaluation of Our Approach cf. Web Archiving \/ Auto-crawling Tools<\/h3>\n<p>Our initial attempts at data liberation have been somewhat ad hoc, relying primarily on the built-in capabilities of the system. As we have seen with other web-based platforms, data exports (&#8220;dumps&#8221;) are often incomplete. While we continue to assess how comprehensive and representative the Blackboard data dump is in capturing the full content of the boot camps, we are also exploring practitioner-oriented web archiving tools and methods to extract data in a more portable and consistent manner. We plan to use these archiving tools in the near future to implement a more systematic approach and to benchmark their performance against our current baseline. These tools will be especially valuable for capturing hard-to-preserve resources and will also support the portability of the data in its original structure, such as the Blackboard course template.<\/p>\n<h2>Conclusion<\/h2>\n<p>This study aims to liberate educational content for reuse, working with Blackboard data from the summers of 2021, 2022, and 2023. Making the liberated data into reusable content required multiple steps, including obtaining copyright permissions, organizing the data dumps for consistency, and assuring the quality of the contents. However, if the demonstrated methods were made more systematic, they would offer tremendous potential for educators to share and reuse quality educational content without licensing concerns.<\/p>\n<h2>Acknowledgements<\/h2>\n<p>For support of the 3-year grant that made both the fellowships and boot camps possible, we would like to recognize the Institute of Museum and Library Services (IMLS), grant #RE-246450-OLS-20.
We would also like to thank the instructors at Drexel University for their cooperation and willingness to make their valuable instructional data widely available.<\/p>\n<h2>References<\/h2>\n<p>[<a id=\"note1\" href=\"#ref1\">1<\/a>] National Institutes of Health. (2023). Data Management &#038; Sharing Policy Overview | Data Sharing. <a href=\"https:\/\/sharing.nih.gov\/data-management-and-sharing-policy\/about-data-management-and-sharing-policies\/data-management-and-sharing-policy-overview#after\">https:\/\/sharing.nih.gov\/data-management-and-sharing-policy\/about-data-management-and-sharing-policies\/data-management-and-sharing-policy-overview#after<\/a><\/p>\n<p>[<a id=\"note2\" href=\"#ref2\">2<\/a>] National Science Foundation. (2023). Preparing Your Data Management Plan &#8211; Funding at NSF <a href=\"https:\/\/new.nsf.gov\/funding\/data-management-plan#nsfs-data-sharing-policy-1c8\">https:\/\/new.nsf.gov\/funding\/data-management-plan#nsfs-data-sharing-policy-1c8<\/a><\/p>\n<p>[<a id=\"note3\" href=\"#ref3\">3<\/a>] Piwowar, H. A., &#038; Vision, T. J. (2013). Data reuse and the open data citation advantage. <em>PeerJ<\/em>, 1, e175. <a href=\"https:\/\/doi.org\/10.7717\/peerj.175\">doi:10.7717\/peerj.175<\/a><\/p>\n<p>[<a id=\"note4\" href=\"#ref4\">4<\/a>] Borgman, C. L. (2012). The conundrum of sharing research data. <em>Journal of the American Society for Information Science and Technology<\/em>, 63(6), 1059-1078. <a href=\"https:\/\/doi.org\/10.1002\/asi.22634\">doi:10.1002\/asi.22634<\/a><\/p>\n<p>[<a id=\"note5\" href=\"#ref5\">5<\/a>] Wicherts, J. M., Borsboom, D., Kats, J., &#038; Molenaar, D. (2006). The poor availability of psychological research data for reanalysis. <em>American psychologist<\/em>, 61(7), 726. <a href=\"https:\/\/doi.org\/10.1037\/0003-066X.61.7.726\">doi:10.1037\/0003-066X.61.7.726<\/a><\/p>\n<p>[<a id=\"note6\" href=\"#ref6\">6<\/a>] El Emam, K., Rodgers, S., &#038; Malin, B. (2015). Anonymising and sharing individual patient data. <em>bmj<\/em>, 350. <a href=\"https:\/\/doi.org\/10.1136\/bmj.h1139\">doi:10.1136\/bmj.h1139<\/a><\/p>\n<p>[<a id=\"note7\" href=\"#ref7\">7<\/a>] Oliveira, P. C. D., Cunha, C. J. C. D. A., &#038; Nakayama, M. K. (2016). Learning Management Systems (LMS) and e-learning management: an integrative review and research agenda. <em>JISTEM-Journal of Information Systems and Technology Management<\/em>, 13, 157-180. <a href=\"https:\/\/doi.org\/10.4301\/S1807-17752016000200001\">doi:10.4301\/S1807-17752016000200001<\/a><\/p>\n<p>[<a id=\"note8\" href=\"#ref8\">8<\/a>] Camilleri, M. A., &#038; Camilleri, A. C. (2022). The acceptance of learning management systems and video conferencing technologies: Lessons learned from COVID-19. <em>Technology, Knowledge and Learning<\/em>, 27(4), 1311-1333. <a href=\"https:\/\/doi.org\/10.1007\/s10758-021-09561-y\">doi:10.1007\/s10758-021-09561-y<\/a><\/p>\n<p>[<a id=\"note9\" href=\"#ref9\">9<\/a>] Hilton III, J., Wiley, D., Stein, J., &#038; Johnson, A. (2010). The four \u2018R\u2019s of openness and ALMS analysis: frameworks for open educational resources. <em>Open learning: The journal of open, distance and e-learning<\/em>, 25(1), 37-44. <a href=\"https:\/\/doi.org\/10.1080\/02680510903482132\">doi:10.1080\/02680510903482132<\/a><\/p>\n<p>[<a id=\"note10\" href=\"#ref10\">10<\/a>] Poole, A. H. (2021). LEADING the way: A new model for data science education. Proceedings of the Association for Information Science and Technology, 58(1), 525-531. 
<a href=\"https:\/\/doi.org\/10.1002\/pra2.491\">doi:10.1002\/pra2.491<\/a><\/p>\n<p>[<a id=\"note11\" href=\"#ref11\">11<\/a>] Vitaglione, G., Bousdira, N., Goldfarb, S., Neal, H. A., Severance, C., &#038; Storr, M. (2001). Lecture Object: an architecture for archiving lectures on the Web. <em>International Journal of Modern Physics C<\/em>, 12(04), 533-547. <a href=\"https:\/\/doi.org\/10.1142\/S012918310100219X\">doi:10.1142\/S012918310100219X<\/a><\/p>\n<p>[<a id=\"note12\" href=\"#ref12\">12<\/a>] Bousdira, N., Storr, K. M., Myers, E., Goldfarb, S., Neal, H. A., Severance, C., &#038; Vitaglione, G. (2001). <em>WLAP the web lecture archive project: The development of a web-based archive of lectures, tutorials, meetings and events at CERN and at the University of Michigan<\/em> (No. CERN-OPEN-2001-066). <a href=\"https:\/\/cds.cern.ch\/record\/516632\">https:\/\/cds.cern.ch\/record\/516632<\/a><\/p>\n<p>[<a id=\"note13\" href=\"#ref13\">13<\/a>] Herr, J., Lougheed, R., &#038; Neal, H. A. (2010, April). Lecture archiving on a larger scale at the University of Michigan and CERN. In <em>Journal of Physics: Conference Series<\/em> (Vol. 219, No. 8, p. 082003). IOP Publishing. <a href=\"https:\/\/doi.org\/10.1088\/1742-6596\/219\/8\/082003\">doi:10.1088\/1742-6596\/219\/8\/082003<\/p>\n<p>[<a id=\"note14\" href=\"#ref14\">14<\/a>] Chu, L. F., Young, C. A., Ngai, L. K., Cun, T., Pearl, R. G., &#038; Macario, A. (2010). Learning management systems and lecture capture in the medical academic environment. <em>International Anesthesiology Clinics<\/em>, 48(3), 27-51. <a href=\"https:\/\/doi.org\/10.1097\/AIA.0b013e3181e5c1d5\">doi:10.1097\/AIA.0b013e3181e5c1d5<\/a><\/p>\n<p>[<a id=\"note15\" href=\"#ref15\">15<\/a>] Jamieson J., Verhaart M. Issues Surrounding Course Content Migration: Blackboard to Moodle. In: Proceedings of the 18th annual conference of the national advisory committee on computing qualifications (NACCQ), Tauranga, New Zealand, 2005. Available at: <a href=\"https:\/\/citrenz.org.nz\/citrenz\/conferences\/2005\/concise\/jamieson_moodle.pdf\">https:\/\/citrenz.org.nz\/citrenz\/conferences\/2005\/concise\/jamieson_moodle.pdf<\/a><\/p>\n<p>[<a id=\"note16\" href=\"#ref16\">16<\/a>] Santoso, B. J., Ijtihadie, R. M., &#038; Millah, Z. (2023, September). A Docker Container-Based Solution for Course Archival on Moodle: Implementation and Evaluation. In <em>2023 8th International Conference on Electrical, Electronics and Information Engineering (ICEEIE)<\/em> (pp. 1-6). IEEE. <a href=\"https:\/\/doi.org\/10.1109\/ICEEIE59078.2023.10334688\">doi:10.1109\/ICEEIE59078.2023.10334688<\/a><\/p>\n<p>[<a id=\"note17\" href=\"#ref17\">17<\/a>] Kelly, M. and Weigle, M. C., \u201cWARCreate &#8211; Create Wayback-Consumable WARC Files from Any Webpage,\u201d In <em>Proceedings of the ACM\/IEEE Joint Conference on Digital Libraries (JCDL)<\/em>. 2012, pp. 437\u2013438. <a href=\"https:\/\/doi.org\/10.1145\/2232817.2232930\">doi:10.1145\/2232817.2232930<\/a><\/p>\n<p>[<a id=\"note18\" href=\"#ref18\">18<\/a>] Kelly, M., Nelson, M.L., and Weigle, M.C, \u201cA Framework for Aggregating Private and Public Web Archives,\u201d In <em>Proceedings of the ACM\/IEEE Joint Conference on Digital Libraries (JCDL)<\/em>, 2018, pp. 273\u2013282. <a href=\"https:\/\/doi.org\/10.1145\/3197026.3197045\">doi:10.1145\/3197026.3197045<\/a><\/p>\n<p>[<a id=\"note19\" href=\"#ref19\">19<\/a>] Webrecorder: web archiving for all. (n.d.). Webrecorder. 
Retrieved April 1, 2024, <a href=\"https:\/\/webrecorder.net\/\">https:\/\/webrecorder.net\/<\/a><\/p>\n<p>[<a id=\"note20\" href=\"#ref20\">20<\/a>] DROID: file format identification tool (n.d.), Retrieved April 4, 2024, <a href=\"https:\/\/www.nationalarchives.gov.uk\/information-management\/manage-information\/preserving-digital-records\/droid\/\">https:\/\/www.nationalarchives.gov.uk\/information-management\/manage-information\/preserving-digital-records\/droid\/<\/a><\/p>\n<p>[<a id=\"note21\" href=\"#ref21\">21<\/a>] Drexel University Information Technology, (n.d.), Frequently Asked Questions for Drexel Zoom Users, <a href=\"https:\/\/drexel.edu\/it\/help\/a-z\/zoom\/zoomfaq\/\">https:\/\/drexel.edu\/it\/help\/a-z\/zoom\/zoomfaq\/<\/a> Accessed: April 7, 2024<\/p>\n<h2>About the authors<\/h2>\n<p><em>Hyung Wook Choi<\/em> is a PhD candidate at Drexel University\u2019s College of Computing and Informatics. She holds Master degrees in Data Science and Library and Information Science from Drexel and Ewha Womans University. Her primary research is centered on detecting and tracking semantic shifts with NLP and Information Retrieval.<\/p>\n<p><em>Jon Wheeler<\/em> is a Data Curation Librarian within the University of New Mexico&#8217;s College of University Libraries and Learning Sciences. Jon&#8217;s role in the Libraries&#8217; Data Services initiatives include the development of research data ingest, packaging and archiving work flows. Jon&#8217;s research interests include the requirements and usability of sustainable architectures for long term data preservation and the disposition of research data in response to funding requirements.<\/p>\n<p><em>Weimao Ke<\/em> is an Associate Professor in Information Science at Drexel University. His research is centered on information retrieval (IR), with emphasis on computational models that support effective and intelligent human-information interaction. His work explores the application of machine learning, multi-agent systems, and information-theoretic frameworks to the modeling and analysis of complex information environments. He is also interested in simulation-based approaches for studying emergent behavior and dynamics in distributed systems, as well as the design of scalable and adaptive AI systems in information networks.<\/p>\n<p><em>Lei Wang<\/em> is an Assistant Teaching Professor at Drexel University&#8217;s College of Computing and Informatics. Her work focuses on clinical data analysis, natural language processing with large language models, and biomedical optical imaging. At Drexel, she teaches graduate-level courses in data science and leads research initiatives at the intersection of healthcare and AI. Her research interests include pediatric concussion, neuroimaging, and the application of machine learning in clinical informatics. She actively contributes to academic publishing and peer review in neuroscience and biomedical engineering.<\/p>\n<p><em>Jane Greenberg<\/em> is the Alice B. Kroeger Professor and Director of the Metadata Research Center at the College of Computing &#038; Informatics, Drexel University. Her research activities focus on metadata, knowledge organization\/semantics, linked data, data science, and information economics. 
Her research has been funded by the NSF, NIH, IMLS, NEH, Microsoft Research, GlaxoSmithKline, Santander Bank, Library of Congress, as well as other agencies and organizations.<\/p>\n<p><em>Mat Kelly<\/em> is an Assistant Professor in the Department of Information Science at Drexel University\u2019s College of Computing &#038; Informatics. He earned his Ph.D. in Computer Science from Old Dominion University. Dr. Kelly&#8217;s research centers on digital preservation, with a particular emphasis on the archiving and retrieval of private web content. His work has been supported by the Institute of Museum and Library Services and the National Science Foundation.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>This paper presents an initiative to extract and repurpose instructional content from a series of Blackboard course shells associated with IMLS-funded boot camp events conducted in June of 2021, 2022, and 2023. These events, facilitated by ten faculty members and attended by 68 fellows, generated valuable educational materials currently confined within proprietary learning management system environments. The objective of this project is to enable broader access and reuse of these resources by migrating them to a non-siloed, static website independent of the original Blackboard infrastructure. We describe our methodology for acquiring and validating the data exports, outline the auditing procedures implemented to ensure content completeness and integrity, and discuss the challenges encountered throughout the process. Finally, we report on the current status of this ongoing effort and its implications for future dissemination and reuse of educational materials.<\/p>\n","protected":false},"author":508,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[132],"tags":[],"class_list":["post-18462","post","type-post","status-publish","format-standard","hentry","category-issue61"],"_links":{"self":[{"href":"https:\/\/journal.code4lib.org\/wp-json\/wp\/v2\/posts\/18462","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/journal.code4lib.org\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/journal.code4lib.org\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/journal.code4lib.org\/wp-json\/wp\/v2\/users\/508"}],"replies":[{"embeddable":true,"href":"https:\/\/journal.code4lib.org\/wp-json\/wp\/v2\/comments?post=18462"}],"version-history":[{"count":0,"href":"https:\/\/journal.code4lib.org\/wp-json\/wp\/v2\/posts\/18462\/revisions"}],"wp:attachment":[{"href":"https:\/\/journal.code4lib.org\/wp-json\/wp\/v2\/media?parent=18462"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/journal.code4lib.org\/wp-json\/wp\/v2\/categories?post=18462"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/journal.code4lib.org\/wp-json\/wp\/v2\/tags?post=18462"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}},{"id":18510,"date":"2025-10-21T15:58:56","date_gmt":"2025-10-21T19:58:56","guid":{"rendered":"https:\/\/journal.code4lib.org\/?p=18510"},"modified":"2025-10-21T16:00:10","modified_gmt":"2025-10-21T20:00:10","slug":"extracting-a-large-corpus-from-the-internet-archive-a-case-study","status":"publish","type":"post","link":"https:\/\/journal.code4lib.org\/articles\/18510","title":{"rendered":"Extracting A Large Corpus from the Internet Archive, A Case Study"},"content":{"rendered":"<p>By Eric C. 
Weig<\/p>\n<h2>Background<\/h2>\n<p>Since 2015, the University of Kentucky Libraries have uploaded historic newspaper issues to the Internet Archive. We are now changing our newspaper program to host newspapers locally, and expanding the collection scope to include not only issues scanned from analog sources but also born-digital and web-harvested HTML (HyperText Markup Language) newspaper content.<\/p>\n<p>Gathering local copies of content we placed in the Internet Archive would have been cumbersome, as it would have involved accessing long-term, tape-based archival storage systems, which meant time and effort from multiple staff. Alternatively, an automated process could be used to pull files back down from the Internet Archive, and AI could play a role in making the needed development faster. I had some experience with Python and anticipated that AI would be more successful crafting small, narrowly focused scripts than more complex code. [<a id=\"ref1\" href=\"#note1\">1<\/a>] Curiosity played a role here as well. With so much interest growing around the use of AI, giving it a try on a small and well-defined coding project sounded like fun, with the option of abandoning it if results were not successful.<\/p>\n<p>Initial research also uncovered that the Internet Archive encourages web crawling via its robots.txt file and offers access to an API to gather metadata and files. [<a id=\"ref7\" href=\"#note7\">7<\/a>] [<a id=\"ref4\" href=\"#note4\">4<\/a>]<\/p>\n<h2>Gathering Metadata for Objects<\/h2>\n<p>The first step in the process involved assembling all the metadata needed to gather the Internet Archive objects from the Kentucky Digital Newspaper collection (see Figure 1).<\/p>\n<p class=\"caption\"><img decoding=\"async\" src=\"\/media\/issue61\/weig\/figure_01.png\" \/><br \/>\n<strong>Figure 1.<\/strong> Collection landing page for Kentucky Digital Newspapers on the Internet Archive. Shows results for 86,819 items.<\/p>\n<p>At a minimal level, this requires the Internet Archive-specific identifiers that were assigned to the objects. When looking at a URL for an object, the identifier is located at the very end of the URL, following the last forward slash. For example, the following object URL https:\/\/archive.org\/details\/xt74qr4nm47j has an object identifier of \u2018xt74qr4nm47j\u2019. In the Internet Archive, various URLs relate to an object and follow specific patterns that reference the identifier value. For example, to download a .zip file containing all of the files associated with the object, including what was originally uploaded, the following URL pattern is used [<a id=\"ref5\" href=\"#note5\">5<\/a>]:<em> https:\/\/archive.org\/compress\/xt74qr4nm47j.<\/em><\/p>\n<p>Additional metadata was desirable to facilitate renaming and sorting the objects once they were downloaded as well as to calculate the total storage requirements for files prior to downloading. Specifically, four fields were retrieved: creator, identifier, date, and item_size. The \u2018creator\u2019 metadata values were useful in sorting downloaded content into directories organized by newspaper title. The \u2018identifier\u2019 metadata was used to construct the download URLs, and \u2018date\u2019 metadata was prepended to the downloaded .zip file names in order to add semantic information. The \u2018item_size\u2019 metadata was used to calculate how much local storage space we needed.<\/p>\n<p>Another aspect of getting the metadata involved ensuring it was in a parsable format.
Fortunately, the Internet Archive has several options for gathering metadata formatted as structured data. For collections with fewer than ten thousand items, a URL using \/scrape\/ will retrieve data (see Figure 2). [<a id=\"ref4\" href=\"#note4\">4<\/a>]<\/p>\n<pre>https:\/\/archive.org\/services\/search\/v1\/scrape?fields=identifier,item_size,creator,date&amp;q=collection%3Akentuckynewspapers<\/pre>\n<p><strong>Figure 2.<\/strong> Sample Scrape API URL that gathers the first 10,000 metadata records the kentuckynewspapers collection.<\/p>\n<p>Since our collection included far more than 10,000 items, this would not work for us. Instead, I used the advanced search web interface to search for what I needed, as the advanced search results page also provides the option to export results as structured data (see Figure 3). [<a id=\"ref6\" href=\"#note6\">6<\/a>]<\/p>\n<p class=\"caption\"><img decoding=\"async\" src=\"\/media\/issue61\/weig\/figure_03.png\" \/><br \/>\n<strong>Figure 3.<\/strong> Internet Archive Web Interface Advanced Search GUI with field inputs used for generating the JSON formatted metadata output for all 86,819 items.<\/p>\n<p>I chose to use JSON as the output format (see Figure 4).<\/p>\n<p>callback({&#8220;responseHeader&#8221;:{&#8220;status&#8221;:0,&#8221;QTime&#8221;:469,&#8221;params&#8221;:{&#8220;query&#8221;:&#8221;collection:kentuckynewspapers&#8221;,&#8221;qin&#8221;:&#8221;collection:&#8221;kentuckynewspapers&#8221;&#8221;,&#8221;fields&#8221;:&#8221;creator,date,identifier,item_size&#8221;,&#8221;wt&#8221;:&#8221;json&#8221;,&#8221;sort&#8221;:&#8221;creatorSorter asc&#8221;,&#8221;rows&#8221;:1000,&#8221;json.wrf&#8221;:&#8221;callback&#8221;,&#8221;start&#8221;:0}},&#8221;response&#8221;:{&#8220;numFound&#8221;:86819,&#8221;start&#8221;:0,&#8221;docs&#8221;:[{&#8220;creator&#8221;:&#8221;Adair County news.(The)&#8221;,&#8221;date&#8221;:&#8221;1917-11-14T00:00:00Z&#8221;,&#8221;identifier&#8221;:&#8221;xt744j09xt7p&#8221;,&#8221;item_size&#8221;:13630085},{&#8220;creator&#8221;:&#8221;Adair county community voice&#8221;,&#8221;date&#8221;:&#8221;2012-10-18T00:00:00Z&#8221;,&#8221;identifier&#8221;:&#8221;kd99882j6j5s&#8221;,&#8221;item_size&#8221;:91957159},{&#8220;creator&#8221;:&#8221;Adair county community voice&#8221;,&#8221;date&#8221;:&#8221;2012-09-20T00:00:00Z&#8221;,&#8221;identifier&#8221;:&#8221;kd9tm71v5w8w&#8221;,&#8221;item_size&#8221;:93942368},{&#8220;creator&#8221;:&#8221;Adair county community voice&#8221;,&#8221;date&#8221;:&#8221;2012-10-04T00:00:00Z&#8221;,&#8221;identifier&#8221;:&#8221;kd9dz02z1d9w&#8221;,&#8221;item_size&#8221;:92310777},{&#8220;creator&#8221;:&#8221;Adair county community voice&#8221;,&#8221;date&#8221;:&#8221;2012-09-27T00:00:00Z&#8221;,&#8221;identifier&#8221;:&#8221;kd9jq0sq9446&#8243;,&#8221;item_size&#8221;:95768429},{&#8220;creator&#8221;:&#8221;Adair county community voice&#8221;,&#8221;date&#8221;:&#8221;2012-09-20T00:00:00Z&#8221;,&#8221;identifier&#8221;:&#8221;kd9pg1hh6t91&#8243;,&#8221;item_size&#8221;:93942985},{&#8220;creator&#8221;:&#8221;Adair county community voice&#8221;,&#8221;date&#8221;:&#8221;2012-09-13T00:00:00Z&#8221;,&#8221;identifier&#8221;:&#8221;kd9610vq304m&#8221;,&#8221;item_size&#8221;:95027685},{&#8220;creator&#8221;:&#8221;Adair progress (The)&#8221;,&#8221;date&#8221;:&#8221;2015-06-11T00:00:00Z&#8221;,&#8221;identifier&#8221;:&#8221;kd91n7xk8900&#8243;,&#8221;item_size&#8221;:118977672},{&#8220;creator&#8221;:&#8221;Adair progress 
(The)&#8221;,&#8221;date&#8221;:&#8221;2012-08-16T00:00:00Z&#8221;,&#8221;identifier&#8221;:&#8221;kd90z70v8f74&#8243;,&#8221;item_size&#8221;:83747017},{&#8220;creator&#8221;:&#8221;Adair progress (The)&#8221;,&#8221;date&#8221;:&#8221;2014-02-06T00:00:00Z&#8221;,&#8221;identifier&#8221;:&#8221;kd9x34mk6r4d&#8221;,&#8221;item_size&#8221;:89747837},{&#8220;creator&#8221;:&#8221;Adair progress (The)&#8221;,&#8221;date&#8221;:&#8221;2012-07-26T00:00:00Z&#8221;,&#8221;identifier&#8221;:&#8221;kd9d79571k58&#8243;,&#8221;item_size&#8221;:107745767},<\/p>\n<p><strong>Figure 4.<\/strong> Sample JSON output generated from the Internet Archive Web Interface Advanced Search. Fields gathered included identifier, date, creator, and item_size.<\/p>\n<h2>Calculating Storage Needs<\/h2>\n<p>A calculation of storage requirements was essential for planning so that an automated process could run without encountering insufficient storage space.<\/p>\n<p>Calculating the storage space needed was relatively simple, as all the information needed was represented in the JSON output, in the item_size metadata element.<\/p>\n<p>As I approached the use of AI for small script code generation, articles on effective prompt engineering were helpful. I chose to use the free ChatGPT (GPT-4o mini) for its ease of use and established accuracy with Python coding: a 2024 study on Python code generation and AI found that \u201cof all the models tested, GPT-4 exhibited the highest proficiency in code generation tasks, achieving a success rate of 86.23% on the small subset that was tested. GPT-based models performed the best compared to the other two models, Bard and Claude.\u201d [<a id=\"ref2\" href=\"#note2\">2<\/a>] My approach was to start very simple, with prompts initially aimed at producing code to perform simple processing tasks, then branching out to create more sophisticated prompts that incorporated multiple processes. I also spent time experimenting with prompts and then reviewing the resulting code, using few-shot learning by providing example elements within the prompt, sometimes performing multiple iterations with different phrasing and levels of granularity, and striving to cover all necessary details with as few words as possible. See Figure 5 for the completed prompt used to generate code. [<a id=\"ref9\" href=\"#note9\">9<\/a>] [<a id=\"ref3\" href=\"#note3\">3<\/a>]<\/p>\n<p><span style=\"color: #3366ff;\">Write a Python script that runs from the commandline and processes a .json file at a full path specified in the code. <em>The JSON file contains data wrapped in a JavaScript-style callback function, like callback({&#8230;}).<\/em> The script should extract the JSON content, parse it, and sum the values of the item_size field found in the list located at response.docs. The script should print and write to a log file listing the total size in bytes, KB, MB, GB, and TB. The log file path should be specified in the code. Include error handling for if the file is malformed and print a message indicating when the process is completed successfully.<\/span><\/p>\n<p><strong>Figure 5.<\/strong> The ChatGPT (Chat Generative Pre-trained Transformer) prompt that generated storage.py. [<a id=\"ref14\" href=\"#note14\">14<\/a>] The italicized text indicates prompt content added after the initial prompt was conducted.
The second prompt was then submitted in its entirety.<\/p>\n<p>One quick fix I needed to make in the resulting code was to make sure that the file paths were set as raw strings in order to ignore escape character sequences. [<a id=\"ref10\" href=\"#note10\">10<\/a>] I made this change directly to the code after it was generated.<\/p>\n<p class=\"caption\"><img loading=\"lazy\" decoding=\"async\" class=\"\" src=\"\/media\/issue61\/weig\/figure_06.png\" width=\"731\" height=\"105\" \/><br \/>\n<strong>Figure 6.<\/strong> Command-line output of the storage.py script. Shows calculated storage size in bytes, KB, MB, GB and TB.<\/p>\n<h2>Downloading Objects<\/h2>\n<p>Now that I had all the URLs to download the objects, I began considering a responsible approach to retrieving the data. I intended to heed the guidelines established in the Internet Archive\u2019s robots.txt file. [<a href=\"#note7\">7<\/a>] Incorporating a 10 to 30 second pause between each download seemed reasonable. [<a id=\"ref11\" href=\"#note11\">11<\/a>]<\/p>\n<p><span style=\"color: #000000;\"><strong>Initial Prompt:<\/strong><\/span><br \/>\n<span style=\"color: #3366ff;\">Write a Python script that loads a JSON file located at a path specified in the code. The JSON file contains data wrapped in a JavaScript-style callback function, like callback({&#8230;}). The script should extract the JSON content, parse it, and read the values of identifier, creator, and date. for each identifier construct a download URL in the format https:\/\/archive.org\/compress\/{identifier}. Create a subdirectory under a base path specified in the code and named after the creator with characters that are invalid or problematic replaced with underscores and leading\/trailing whitespace stripped. Download the ZIP file corresponding to the identifier and save it as {date}_{identifier}.zip in the appropriate subdirectory. Wait for a random interval between 10 and 30 seconds before proceeding to the next download. If a download fails, log the failed URL to a log file specified in the code.<\/span><\/p>\n<p><span style=\"color: #000000;\"><strong>Refining Prompt:<\/strong><\/span><br \/>\n<span style=\"color: #3366ff;\">Add code that groups the downloads into sets of 1000 and creates subdirectories in the downloads directory that are named 0001, 0002, etc.<\/span><\/p>\n<p><span style=\"color: #000000;\"><strong>Refining Prompt:<\/strong><\/span><br \/>\n<span style=\"color: #3366ff;\">Add Python code so that after each .zip file is downloaded, it is verified using zipfile.is_zipfile() to ensure it is a valid ZIP archive. If is_zipfile() returns False, an error message should be recorded in a log file (e.g., download.log) indicating that the file is not a valid ZIP. If the file passes this check, it should then be opened with zipfile.ZipFile() and tested using the testzip() method. If testzip() returns a file name, indicating a corrupted file inside the archive, this should also be logged with an appropriate warning or error message<\/span><\/p>\n<p><strong>Figure 7.<\/strong> ChatGPT prompt and refining prompts that generated download_zips.py.<\/p>\n<p>Lessons learned from earlier prompts led me to include the example callback function\u2014formatted like callback({&#8230;})\u2014as a few-shot prompt so the script would parse the JSON effectively. Post code generation, I again set file paths as raw strings.<\/p>
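<p>To make those two details concrete, the following condensed sketch (my own illustration rather than the generated download_zips.py; the local paths are hypothetical) strips the callback wrapper, parses the JSON, and builds the download URL and target filename for each item, using raw strings for the Windows paths:</p>
<pre>
import json
import re
from pathlib import Path

# Hypothetical Windows paths, written as raw strings so backslashes are not
# treated as escape sequences.
JSON_PATH = Path(r"C:\ia_downloads\kentuckynewspapers.json")
BASE_DIR = Path(r"C:\ia_downloads\zips")

raw = JSON_PATH.read_text(encoding="utf-8").strip()

# The advanced search export is wrapped in a JavaScript-style callback,
# e.g. callback({...}); remove the wrapper before parsing.
payload = re.sub(r"^callback\(", "", raw)
payload = re.sub(r"\)$", "", payload)
docs = json.loads(payload)["response"]["docs"]

for doc in docs:
    identifier = doc["identifier"]
    date = doc.get("date", "")[:10]  # e.g. 2012-10-18
    creator = re.sub(r"[^A-Za-z0-9]", "_", doc.get("creator", "unknown"))
    url = f"https://archive.org/compress/{identifier}"
    target = BASE_DIR / creator / f"{date}_{identifier}.zip"
    print(url, "->", target)
    # The full script also pauses 10 to 30 seconds between downloads and
    # validates each .zip with zipfile, as described in Figure 7.
</pre>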
<p>Additionally, the prompt specified grouping downloads into manageable sets within sequentially ordered subdirectories and included validation for downloaded .zip files.<\/p>\n<p>Tests were conducted utilizing small sections of the JSON file. One error that occurred during some downloads was that the Python script could not create a valid directory name. Values in the creator metadata contained characters not valid for directory naming. A quick manual adjustment to the Python code to sanitize the creator metadata addressed this issue.<\/p>\n<pre><span style=\"color: #000000;\"><strong>ORIGINAL CODE:<\/strong>
# Subdirectory under group: creator name
save_dir = os.path.join(group_dir, creator)<\/span>

<span style=\"color: #000000;\"><strong>UPDATED CODE:<\/strong>
# Sanitize creator
sanitized_creator = re.sub(r'[^A-Za-z0-9]', '_', creator)
# Subdirectory under group: sanitized creator name
save_dir = os.path.join(group_dir, sanitized_creator)<\/span><\/pre>\n<p><strong>Figure 8.<\/strong> Manual fix to AI-generated code to sanitize text for use in naming Windows file system directories.<\/p>\n<p>Additional tests with more small subsections of the JSON file were successful, and the script was then run in earnest (see Figure 9).<\/p>\n<p class=\"caption\"><img loading=\"lazy\" decoding=\"async\" class=\"\" src=\"\/media\/issue61\/weig\/figure_09.png\" width=\"735\" height=\"565\" \/><br \/>\n<strong>Figure 9.<\/strong> Command-line output from the download_zips.py script.<\/p>\n<h2>Results<\/h2>\n<p>The download_zips.py script ran intermittently for nearly two months. There were three interruptions due to local machine reboots for Windows updates. There were 6,489 failed downloads due to temporary network connection issues and what appeared to be corrupt files.<\/p>\n<p class=\"caption\"><img decoding=\"async\" class=\"\" src=\"\/media\/issue61\/weig\/figure_10.png\" \/><br \/>\n<strong>Figure 10.<\/strong> Calculation of failure rate for downloads.<\/p>\n<p>Downloaded files are being reviewed one title at a time. Any items missing due to unsuccessful downloads are downloaded a second time. The Internet Archive uses a file manifest that includes checksum values for all the files contained within the .zip archive, and this was useful for reviewing files and assessing whether or not they were corrupted.<\/p>\n<p class=\"caption\"><img decoding=\"async\" class=\"\" src=\"\/media\/issue61\/weig\/figure_11.png\" \/><br \/>\n<strong>Figure 11.<\/strong> An XML manifest file is present within each .zip download. This file includes checksum values for each file within the .zip archive.<\/p>\n<h2>Conclusion<\/h2>\n<p>Downloading our previously uploaded content from the Internet Archive via automation was straightforward and easy to accomplish with some careful planning. The Internet Archive is tailored to allow for harvesting content. URLs can be formulated using the web interface\u2019s advanced search options, and these URLs return results formatted as structured data such as JSON and XML. When used to generate small, narrowly focused Python scripts, ChatGPT yielded well-formed Python that accomplished development milestones, allowing for quick testing, with only minor problems to fix via debugging or refining prompts.<\/p>\n<h2>References<\/h2>\n<p>[<a id=\"note1\" href=\"#ref1\">1<\/a>] Chang Y, et al. 2024. A survey on evaluation of large language models. ACM Trans Intell Syst Technol. 15(3):1\u201345.
<a href=\"https:\/\/doi.org\/10.1145\/3641289\">https:\/\/doi.org\/10.1145\/3641289<\/a><\/p>\n<p>[<a id=\"note2\" href=\"#ref2\">2<\/a>] Coello CEA, et al. 2024. Effectiveness of ChatGPT in coding: a comparative analysis of popular large language models. Digital. 4(1):114\u2013125. <a href=\"https:\/\/doi.org\/10.3390\/digital4010005\">https:\/\/doi.org\/10.3390\/digital4010005<\/a><\/p>\n<p>[<a id=\"note3\" href=\"#ref3\">3<\/a>] Cooper N, et al. 2024. Harnessing large language models for coding, teaching and inclusion to empower research in ecology and evolution. Methods Ecol Evol. 15(10):1757\u20131763. <a href=\"https:\/\/doi.org\/10.1111\/2041-210X.14325\">https:\/\/doi.org\/10.1111\/2041-210X.14325<\/a><\/p>\n<p>[<a id=\"note4\" href=\"#ref4\">4<\/a>] Internet Archive. Archive.org about search [Internet]. [accessed 2025 May 6]. <a href=\"https:\/\/archive.org\/help\/aboutsearch.htm\">https:\/\/archive.org\/help\/aboutsearch.htm<\/a><\/p>\n<p>[<a id=\"note5\" href=\"#ref5\">5<\/a>] Internet Archive. Downloading \u2013 a basic guide \u2013 Internet Archive Help Center [Internet]. [accessed 2025 May 6]. <a href=\"https:\/\/help.archive.org\/help\/downloading-a-basic-guide\/\">https:\/\/help.archive.org\/help\/downloading-a-basic-guide\/<\/a><\/p>\n<p>[<a id=\"note6\" href=\"#ref6\">6<\/a>] Internet Archive. Search \u2013 a basic guide \u2013 Internet Archive Help Center [Internet]. [accessed 2025 May 6]. <a href=\"https:\/\/help.archive.org\/help\/search-a-basic-guide\/\">https:\/\/help.archive.org\/help\/search-a-basic-guide\/<\/a><\/p>\n<p>[<a id=\"note7\" href=\"#ref7\">7<\/a>] Internet Archive. robots.txt [Internet]. [accessed 2025 May 6]. <a href=\"https:\/\/archive.org\/robots.txt\">https:\/\/archive.org\/robots.txt<\/a><\/p>\n<p>[<a id=\"note8\" href=\"#ref8\">8<\/a>] Internet Archive. Internet Archive: digital library of free &amp; borrowable texts, movies, music &amp; Wayback Machine [Internet]. [accessed 2025 May 6]. <a href=\"https:\/\/archive.org\/about\/\">https:\/\/archive.org\/about\/<\/a><\/p>\n<p>[<a id=\"note9\" href=\"#ref9\">9<\/a>] Knoth N, et al. 2024. AI literacy and its implications for prompt engineering strategies. Comput Educ Artif Intell. 6:100225. <a href=\"https:\/\/doi.org\/10.1016\/j.caeai.2024.100225\">https:\/\/doi.org\/10.1016\/j.caeai.2024.100225<\/a><\/p>\n<p>[<a id=\"note10\" href=\"#ref10\">10<\/a>] Real Python. What are Python raw strings? \u2013 Real Python [Internet]. [accessed 2025 May 6]. <a href=\"https:\/\/realpython.com\/python-raw-strings\/\">https:\/\/realpython.com\/python-raw-strings\/<\/a><\/p>\n<p>[<a id=\"note11\" href=\"#ref11\">11<\/a>] Infatica. 2021. Responsible web scraping: an ethical way of data collection [Internet]. [accessed 2025 May 6]. <a href=\"https:\/\/infatica.io\/blog\/responsible-web-scraping\/\">https:\/\/infatica.io\/blog\/responsible-web-scraping\/<\/a><\/p>\n<h2>Notes<\/h2>\n<p>[<a id=\"note14\" href=\"#ref14\">14<\/a>] Code referenced in this article is available at:<br \/>\n<a href=\"https:\/\/github.com\/libmanuk\/IADownload\/\">https:\/\/github.com\/libmanuk\/IADownload\/<\/a><\/p>\n<h2>About the Author<\/h2>\n<p>Eric C. Weig (eweig@uky.edu) has been an academic librarian at the University of Kentucky Libraries since 1998. His current title is Web Development Librarian. Eric manages the University of Kentucky Libraries Drupal based website and intranet. 
He also assists with design and management of digital libraries for the University of Kentucky Special Collections Research Center.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>The Internet Archive was founded on May 10, 1996, in San Francisco, CA.  Since its inception, the archive has amassed an enormous corpus of content, including over 866 billion web pages, more than 42.5 million print materials, 13 million videos, and 14 million audio files. It is relatively easy to upload content to the Internet Archive.  It is also easy to download individual objects by visiting their pages and clicking on specific links.  However, downloading a large collection, such as thousands or even tens of thousands of items, is not as easy.  This article outlines how The University of Kentucky Libraries downloaded over 86 thousand previously uploaded newspaper issues from the Internet Archive for local use. The process leveraged ChatGPT to automate the process of generating Python scripts that accessed the Internet Archive via its API (Application Programming Interface). <\/p>\n","protected":false},"author":510,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[132],"tags":[],"class_list":["post-18510","post","type-post","status-publish","format-standard","hentry","category-issue61"],"_links":{"self":[{"href":"https:\/\/journal.code4lib.org\/wp-json\/wp\/v2\/posts\/18510","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/journal.code4lib.org\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/journal.code4lib.org\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/journal.code4lib.org\/wp-json\/wp\/v2\/users\/510"}],"replies":[{"embeddable":true,"href":"https:\/\/journal.code4lib.org\/wp-json\/wp\/v2\/comments?post=18510"}],"version-history":[{"count":0,"href":"https:\/\/journal.code4lib.org\/wp-json\/wp\/v2\/posts\/18510\/revisions"}],"wp:attachment":[{"href":"https:\/\/journal.code4lib.org\/wp-json\/wp\/v2\/media?parent=18510"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/journal.code4lib.org\/wp-json\/wp\/v2\/categories?post=18510"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/journal.code4lib.org\/wp-json\/wp\/v2\/tags?post=18510"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}},{"id":18555,"date":"2025-10-21T15:58:55","date_gmt":"2025-10-21T19:58:55","guid":{"rendered":"https:\/\/journal.code4lib.org\/?p=18555"},"modified":"2025-10-21T16:00:20","modified_gmt":"2025-10-21T20:00:20","slug":"retrieval-augmented-generation-for-web-archives-a-comparative-study-of-warc-gpt-and-a-custom-pipeline","status":"publish","type":"post","link":"https:\/\/journal.code4lib.org\/articles\/18555","title":{"rendered":"Retrieval-Augmented Generation for Web Archives: A Comparative Study of WARC-GPT and a Custom Pipeline"},"content":{"rendered":"<p>by Corey Davis<\/p>\n<h2>Introduction<\/h2>\n<p>Large Language Models (LLMs) have begun reshaping how libraries approach digital preservation and access. We are already seeing LLMs that help transcribe handwritten manuscripts and extract entities from historical texts, among other tasks (Hiltmann et al. 2024; Humphries et al. 2024). At their best, these systems can bridge technical skill gaps and enhance efficiency; at worst, they hallucinate misinformation and act as inscrutable black boxes. 
Retrieval-Augmented Generation (RAG) has emerged as a promising technique to mitigate some of these issues by grounding LLM outputs in our own unique digital collections. The idea is to map a digital collection into a high-dimensional vector space and let an LLM \u201cchat\u201d with the collection using natural language queries, backed by real source material.<\/p>\n<p>Web ARChive (WARC) files pose a particularly interesting test case for this approach. WARC files are notoriously difficult to search with traditional methods. Keyword searches in tools like the Wayback Machine or Archive-It often fail to surface meaningful results from the mass of HTML, scripts, images, and other files that go into creating a website (Milligan 2012; Costa 2021). To investigate whether LLMs in a RAG pipeline could improve web archive access, I conducted two related case studies during my 2024\u201325 research leave at the University of Victoria Libraries. I first experimented with WARC-GPT, an open-source tool developed by the Harvard Library Innovation Lab (Cargnelutti et al. 2024), to understand its capabilities out of the box. I then built a customized RAG pipeline, applying targeted data cleaning and architectural adjustments. Both approaches were tested on a collection of thousands of archived pages from the Bob\u2019s Burgers fan wiki, a site rich in unstructured narrative content [<a id=\"ref1\" href=\"#note1\">1<\/a>]. This article explores the difference between untuned (i.e. out-of-the-box) and tuned RAG implementations, and reflects on the promise, challenges, and significant effort required to optimize LLM-assisted access to web archives.<\/p>\n<p class=\"caption\"><img decoding=\"async\" src=\"https:\/\/upload.wikimedia.org\/wikipedia\/commons\/thumb\/1\/14\/RAG_diagram.svg\/1024px-RAG_diagram.svg.png\" \/><br \/>\n<strong>Figure 1.<\/strong> A conceptual diagram of the RAG workflow used in these experiments. User queries are converted to embedding vectors and matched against a vector database of the archived collection. Relevant \u201ccontext chunks\u201d of text are retrieved and passed into an LLM, along with system prompts, which then generates a context-informed answer. Source: <i>https:\/\/en.wikipedia.org\/wiki\/Retrieval-augmented_generation#\/media\/File:RAG_diagram.svg<\/i><\/p>\n<h2>WARC-GPT: RAG for web archives<\/h2>\n<p>Released in early 2024, WARC-GPT is an open-source tool developed by the Harvard Library Innovation Lab that enables users to create a conversational chatbot over a set of WARC files. Instead of manually browsing or keyword searching a web archive, a user can query it in plain language and receive answers with cited sources. WARC-GPT ingests WARC files, converts the text into vector embeddings, and at query time retrieves relevant text snippets to feed into a Large Language Model (LLM) for answer generation. Notably, the system provides source attribution, listing which archived pages and snippets were used to generate the answer: a critical feature for building trust in an AI-driven research tool [<a id=\"ref2\" href=\"#note2\">2<\/a>].<\/p>\n<p>In hands-on testing, untuned WARC-GPT showcases the potential of conversational web archive search, but several significant challenges emerged. 
Foremost among these was the inherent noisiness of WARC files themselves, which encompass not only the main textual content but also raw HTML, CSS\/JS, boilerplate, navigation menus, scripts, and duplicate text. While WARC-GPT\u2019s ingestion pipeline uses a standard configuration of BeautifulSoup for HTML parsing, it does not apply site-specific logic to filter out irrelevant elements. As a result, semantically important content (often marked up in distinctive ways on community-curated sites) was frequently diluted by unrelated or repetitive text in the resulting embeddings. Poor HTML filtering allowed low-value content to be embedded alongside meaningful text, increasing vector noise and degrading retrieval accuracy. Consequently, the retrieval process often surfaced spurious or misleading snippets, leading to fragmented or incomplete responses.<\/p>\n<p>A second major challenge was scale and performance. Converting a large web archive into vector embeddings is computationally intensive and quickly exposes infrastructure limitations. In my case, WARC-GPT\u2019s default pipeline took over a week to process what would be considered a modest-sized collection by research library standards. Embedding with transformer-based models, particularly those optimized for semantic search, like E5 or ADA, was especially taxing. On an Apple M3 Pro (with 18 GB of unified memory), the process easily saturated both CPU\/GPU capacity and memory, leading to slowdowns during batch processing.<\/p>\n<p>Effective RAG pipelines require experimentation and iteration: embedding strategies and models are not one-size-fits-all. They need to be tailored to the collection, which means testing different models (e.g., text-embedding-ada-002, intfloat\/e5-base-v2), tuning chunk sizes and overlaps, exploring sparse vs. dense vs. hybrid retrieval, and comparing vector stores like FAISS, Chroma, or Weaviate with different indexing strategies. But when a single embedding run takes days or weeks, this kind of iterative tuning becomes impractical, especially for institutions lacking dedicated compute resources or flexible pipelines. Offloading to commercial cloud infrastructure is an option, but one that carries substantial cost, especially when using proprietary APIs from OpenAI, Anthropic, and others.<\/p>\n<p>For collections that change or grow regularly, the time and expense of re-indexing may be a non-starter. Without access to efficient, scalable infrastructure, the promise of AI-assisted discovery can quickly run up against hard practical limits. Despite these challenges, using WARC-GPT felt like a glimpse into the future of web archive access. It allowed exploratory Q&amp;A across an entire collection in a way that traditional search simply could not, surfacing some connections between documents and providing serviceable narrative answers.<\/p>\n<p>However, naive RAG systems like WARC-GPT still face significant limitations. Because each query is handled independently, they struggle to maintain context across documents or support multi-turn conversations. There is no memory of previous prompts, and little capacity for narrative continuity or thematic buildup across interactions (Kannappan 2023; Merola and Singh 2025; Anthropic 2024).
The inclusion of source links for each response was a critical feature, allowing users to verify claims and explore the original web pages in more depth.<\/p>\n<p>Ultimately, WARC-GPT showed that conversational AI can meaningfully augment web archives, but it also made clear that to deploy such a tool in production, I would need to improve data preprocessing, retrieval accuracy, and efficiency. This led to the question: could I build a more tailored RAG solution for our web archives?<\/p>\n<h2><b>Creating and tuning a RAG pipeline<\/b><\/h2>\n<p>Before discussing my custom RAG pipeline below, it\u2019s important to make clear that the tuning I did around better preprocessing, chunking strategies, and model selection can be done in WARC-GPT as well. That, however, was not the goal of my research. WARC-GPT is flexible and capable of accommodating these kinds of tweaks, but I wanted to build something from the ground up to better understand how each component (chunking, embedding, retrieval, generation, etc.) actually works. This hands-on approach helped clarify why my early results with WARC-GPT fell flat: not because the tool is lacking, but because <b>RAG pipelines are highly sensitive to design and preprocessing choices.<\/b><\/p>\n<p>My goal was to leverage open-source components to address the data quality and performance issues I had observed in the untuned instance of WARC-GPT. In essence, I wanted to see whether rethinking the ingestion process could yield cleaner and faster results. The solution combined a suite of lightweight tools and custom scripts (available on GitHub) [<a id=\"ref3\" href=\"#note3\">3<\/a>]. Here\u2019s an overview of the approach and optimizations I implemented.<\/p>\n<h2><b>Cleaning the input data<\/b><\/h2>\n<p>Instead of working with existing WARC files full of \u201cnoise\u201d (at least for the machines that are trying to process the text), I re-crawled the target websites using wget to generate fresh WARC files, deliberately excluding non-text content like images, video, CSS, and other page assets. This approach produced a leaner, text-focused corpus, optimized for semantic search rather than preserving the visual and structural integrity of entire web pages. Unlike the comprehensive, high-fidelity crawls performed by services like Archive-It, which prioritize preserving full web pages for human browsing, my captures focused solely on the content a reader would actually engage with: the main HTML and visible text. As a result, I began with a much cleaner dataset for embedding.<\/p>\n<p>I then wrote parsing scripts using BeautifulSoup and regular expressions to extract meaningful text from the WARCs and discard the rest. This involved stripping out boilerplate, navigation menus, and scripts, and keeping only the core textual content and headings from each page. I also filtered out any non-English pages or sections and normalized whitespace and punctuation for consistency. The result was a corpus of reasonably clean, human-readable text ready for indexing.<\/p>\n<p>By extracting only the meaningful text from each page and discarding scripts, navigation elements, and other noise, the system embeds only the content that matters. This improved the quality of the vector index and significantly reduced storage requirements. 
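<\/p>\n<p>To make the cleaning step concrete, the short sketch below strips non-content elements from a captured page and collapses whitespace. It is a minimal illustration rather than the exact script from my pipeline: the tag list and function name are placeholders, and it assumes the HTML for each captured page has already been pulled out of the WARC.<\/p>\n<pre><code>from bs4 import BeautifulSoup\n\ndef clean_page(html):\n    # Parse the captured HTML and drop elements that carry no readable content.\n    soup = BeautifulSoup(html, 'html.parser')\n    for tag in soup(['script', 'style', 'nav', 'header', 'footer', 'aside', 'form']):\n        tag.decompose()\n    # Keep the visible text and collapse runs of whitespace into single spaces.\n    text = soup.get_text(separator=' ')\n    return ' '.join(text.split())<\/code><\/pre>\n<p>The output of a function along these lines, one clean text document per captured page, is what gets chunked and embedded in the next step.<\/p>\n<p>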
In short, I did not index what I (or the machines, in this case) don\u2019t need for retrieval or for generating answers based on that content (a process known as inference in generative AI systems).<\/p>\n<h2><b>Customizing chunk size and vector model<\/b><\/h2>\n<p>As seen in figure 1 above, in retrieval-augmented generation, documents are divided into \u201cchunks\u201d, small, self-contained segments of text that are individually embedded and stored in a vector database for semantic search. I experimented to find a chunk size that balanced contextual completeness with vector granularity. I landed on chunks of about 512 tokens (roughly 400\u2013500 words), with a ~50-token overlap between chunks to avoid splitting sentences and damaging context. Smaller chunks helped limit irrelevant content per embedding, which improved retrieval precision. Using these chunk and overlap parameters, I was able to strike a good balance between preserving context and maintaining retrieval precision. This setup helped ensure that relevant information wasn\u2019t fragmented across multiple chunks, while keeping each embedding focused enough to support accurate retrieval during query time.<\/p>\n<p>For embedding the chunks themselves, I used the intfloat\/e5-large-v2 model from Hugging Face, an open-source transformer developed by Microsoft and fine-tuned for semantic search (Wang et al. 2022). It generates 1024-dimensional vectors and runs locally without API fees, which made it preferable over OpenAI\u2019s text-embedding-ada-002, which requires paid API usage (OpenAI 2023). All chunk embeddings were stored in a ChromaDB vector database, along with metadata linking, mirroring WARC-GPT\u2019s approach (but not, admittedly, as sophisticated).<\/p>\n<h2><b>Selecting an LLM chat bot to generate results<\/b><\/h2>\n<p>For the question-answering stage, I integrated the pipeline with OpenAI\u2019s GPT-4 via API. When a user submits a question, the system embeds the query using the E5 model and retrieves the top relevant text chunks from ChromaDB. Those chunks (typically 5 to 10) are then appended to the user\u2019s question as context, forming an \u201caugmented query\u201d that\u2019s then sent to GPT-4 to generate a conversational response (OpenAI 2024). GPT-4 was selected for its strong language understanding and generation capabilities, which are especially valuable when working with messy web archives. That said, the pipeline\u2013like WARC-GPT itself\u2013was built to be model-agnostic: one can (and should) swap in different LLMs during testing, including open-source alternatives like Mistral, LLaMA 3, DeepSeek, etc., though likely with some trade-offs in coherence or fluency.<\/p>\n<h2><b>The role of prompt engineering: System prompts<\/b><\/h2>\n<p>A further optimization available in this pipeline is the system prompt. In a RAG pipeline, system prompt design shapes how well the entire retrieval system works and is differentiated from the user prompt as it is applied globally, and behind the scenes, to any query entered by a user. User queries are embedded and semantically compared against the chunk embeddings from the web archives collection, so adding instructions, context-setting, or overly verbose phrasing directly to the user prompt can distort the semantic signature of the actual question. This risks retrieving irrelevant or less optimal chunks.\u00a0 In contrast, the system prompt wraps the user query in a carefully designed context that supports more accurate retrieval and generation. 
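<\/p>\n<p>A compressed sketch of this query-time flow, and of where the system prompt sits in it, is below. It assumes the chunks were already embedded with the same E5 model (using its \u201cpassage: \u201d prefix) and stored in a ChromaDB collection; the collection name, prompt wording, and retrieval count are illustrative rather than the exact values from my pipeline.<\/p>\n<pre><code>import chromadb\nfrom openai import OpenAI\nfrom sentence_transformers import SentenceTransformer\n\nembedder = SentenceTransformer('intfloat\/e5-large-v2')\nstore = chromadb.PersistentClient(path='chroma_db')\nchunks = store.get_or_create_collection('warc_chunks')\nllm = OpenAI()  # reads OPENAI_API_KEY from the environment\n\nSYSTEM_PROMPT = (\n    'You answer questions using only the supplied context from a web archive. '\n    'If the context does not contain the answer, say that you could not find it.'\n)\n\ndef answer(question, n_results=8):\n    # E5 expects a 'query: ' prefix on search queries; stored chunks used 'passage: '.\n    query_vec = embedder.encode(['query: ' + question], normalize_embeddings=True)[0]\n    hits = chunks.query(query_embeddings=[query_vec.tolist()], n_results=n_results)\n    context = ' '.join(hits['documents'][0])\n    # Instructions live in the system message; the user turn stays close to the raw question.\n    response = llm.chat.completions.create(\n        model='gpt-4',\n        messages=[\n            {'role': 'system', 'content': SYSTEM_PROMPT},\n            {'role': 'user', 'content': 'Context: ' + context + ' Question: ' + question},\n        ],\n    )\n    return response.choices[0].message.content<\/code><\/pre>\n<p>Because only the bare question is embedded for retrieval, the instructions never distort the semantic match against the chunk vectors, which is the separation described next.<\/p>\n<p>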
By clearly separating the user\u2019s original question from surrounding formatting or instructions (and using the system prompt strictly for downstream generation at inference time) I was able to significantly improve both retrieval accuracy and output quality.<\/p>\n<h2><b>Hardware Acceleration<\/b><\/h2>\n<p>I optimized the embedding step to run on an Apple Silicon GPU (specifically, a MacBook Pro with an M3 Pro chip). Leveraging GPU parallelism for batch embedding generation reduced processing time: what used to take nearly a week with the old pipeline was now done in just a few hours, also supported by the move to cleaner text and smaller chunks.<\/p>\n<h2><b>Implications of Optimization<\/b><\/h2>\n<h3><b>More Accurate results, fewer hallucinations<\/b><\/h3>\n<p>The impact of these optimizations was most evident on harder questions. For straightforward fact-based queries, both the bespoke RAG system and WARC-GPT performed well. But when the answer depended on surfacing a rare or deeply buried detail (like identifying a single quote or fact buried in a blog post among thousands of pages) the tuned pipeline performed better: it was more effective\u00a0 at surfacing the right snippet, thanks to a cleaner index and more precise retrieval (i.e. there was not\u00a0 as much noise interfering with the semantic signature of each chunk, so a user\u2019s query could more accurately be matched to chunks in the vector database).<\/p>\n<p>And when the tuned system couldn\u2019t find a relevant chunk, it often returned nothing or explicitly deferred, responding with phrases like \u201cI couldn\u2019t find that information,\u201d rather than attempting to fabricate an answer. This behavior was the result of more carefully engineered system prompts, which explicitly instructed the generative model (in this case, GPT-4) not to guess or generate speculative content in the absence of supporting evidence from the enhanced query text containing the retrieved chunks. By aligning the model\u2019s behavior with the principles of grounded generation, these prompts significantly reduced hallucinations (or confabulations), especially in cases where the retrieval step failed to return a relevant chunk at all. Rather than leaving the model to fill in the blanks (which LLMs are apt to do), the prompts encouraged transparency about the system\u2019s limits, a small but critical safeguard when working with incomplete or noisy collections like web archives.<\/p>\n<h3><b>Indexing Efficiency (resource implications)<\/b><\/h3>\n<p>Another benefit was efficiency. The vector database generated by the custom pipeline was around 240 MB, compared to roughly 10 GB for WARC-GPT\u2019s index on the same content. This ~40\u00d7 reduction in index size was primarily achieved by stripping out noisy content using a BeautifulSoup configuration tailored to the specific structure of the target site. A smaller index saves storage and speeds up retrieval (fewer vectors to scan) and reduces memory overhead. In practice, queries in the custom pipeline felt faster and the system was more responsive overall. This kind of efficiency could also translate to cost savings, particularly if running in the cloud or on constrained hardware.<\/p>\n<p>Again it\u2019s worth noting that WARC-GPT was designed to be flexible with model choices, including the ability to run fully locally using open-source LLMs via the Ollama framework. 
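<\/p>\n<p>For a sense of what that looks like, the fragment below swaps the generation step onto a locally served model. It is a hedged sketch, assuming the ollama Python client and a locally pulled model such as llama3; the model name, prompt, and message structure are illustrative, not WARC-GPT\u2019s own configuration.<\/p>\n<pre><code>import ollama\n\ndef answer_locally(question, context):\n    # Same augmented-query pattern as before, but generation stays on local hardware.\n    response = ollama.chat(\n        model='llama3',\n        messages=[\n            {'role': 'system', 'content': 'Answer only from the supplied context.'},\n            {'role': 'user', 'content': 'Context: ' + context + ' Question: ' + question},\n        ],\n    )\n    return response['message']['content']<\/code><\/pre>\n<p>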
In my tests, I opted to use GPT-4 for its higher answer quality, whereas WARC-GPT defaults to open models preconfigured in Ollama. A strong language model can generate better answers from the same context. That said, using GPT-4 comes with trade-offs: it introduces cost and reliance on an external API, which may not be acceptable for every library setting. For now, the key takeaway is that by feeding the LLM cleaner, more focused data (and doing so more efficiently), I was able to significantly boost the system\u2019s performance.<\/p>\n<h3>Implications<\/h3>\n<p>Comparing an untuned to a custom-created and -tuned RAG pipeline raises a central question for libraries exploring AI: should we adopt off-the-shelf tools, or invest in building custom systems? WARC-GPT offered an accessible starting point (it is open-source and relatively easy to set up), but my deep dive revealed that achieving reliable performance depends heavily on data quality, retrieval precision, and processing efficiency. Any library considering a similar deployment will need to weigh the convenience of a pre-built framework against the benefits of tailoring a system to local needs, and the staff time and technical skills that effort requires. In my case, the custom approach paid off in performance, but it came with a hands-on cost that not every institution may be positioned to absorb.<\/p>\n<p>More broadly, these experiments underscore both exciting opportunities and serious challenges that come with bringing LLMs and RAG into library workflows:<\/p>\n<p><b>Transparent source attribution<\/b>: Unlike standalone LLMs that generate answers without revealing where the information came from (although that is changing as the big frontier models increasingly access live web content during inference), RAG systems can point users back to the exact documents or text snippets used to construct a response. This traceability not only supports verification and citation, but also aligns with core library values around transparency, accountability, and intellectual honesty. For users, it means they can explore the source material themselves with more confidence (although, as with all things LLM, it is important to always maintain vigilance).<\/p>\n<p><b>Conversational access to collections: <\/b>RAG-powered chat interfaces can make large, unstructured collections like web archives more intuitively searchable. They allow users to ask complex questions and get narrative answers that would be impractical with traditional search engines. This can reveal connections and insights that keyword searches might miss.<\/p>\n<p><b>Automated metadata generation: <\/b>An LLM that\u2019s allowed to read through an entire collection could help summarize content, extract key entities, suggest topics, or even generate draft descriptions. This could assist librarians and archivists in creating metadata or finding aids, especially for collections lacking detailed descriptions. In our web archive scenario, one could imagine the LLM summarizing the main themes of a set of websites or identifying frequently mentioned names and issues across the collection.<\/p>\n<p><b>Semantic search capabilities: <\/b>By embedding content in a vector space, I enabled semantic similarity searches: finding documents that are topically related even if they don\u2019t share the same keywords. This goes beyond the \u201cstring matching\u201d of conventional search. 
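<\/p>\n<p>The effect is easy to demonstrate with a toy example. Below, a query and two passages are embedded with the same E5 model used elsewhere in the pipeline, and cosine similarity ranks the conceptually related passage higher even though it shares no vocabulary with the query; the example texts are invented purely for illustration.<\/p>\n<pre><code>from sentence_transformers import SentenceTransformer, util\n\nmodel = SentenceTransformer('intfloat\/e5-large-v2')\n\nquery = 'query: debates over global warming policy'\npassages = [\n    'passage: residents argued about the carbon tax at the town hall meeting',\n    'passage: the diner added a new veggie burger to the lunch menu',\n]\n\nquery_vec = model.encode(query, normalize_embeddings=True)\npassage_vecs = model.encode(passages, normalize_embeddings=True)\n\n# Cosine similarity: the first passage scores higher despite sharing no keywords with the query.\nprint(util.cos_sim(query_vec, passage_vecs))<\/code><\/pre>\n<p>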
For researchers, this means a query about, say, climate change controversies might retrieve relevant pages even if those pages don\u2019t explicitly use that phrasing, or anything close to it, because the LLM can match on conceptual similarity.<\/p>\n<p><b>Technical and resource barriers: <\/b>Setting up and maintaining RAG pipelines requires technical expertise in LLMs, data processing, and scripting; many libraries are still in the process of developing these capabilities. There are also computational costs, including robust hardware (GPUs, lots of memory, fast storage) and ongoing expenses when using commercial APIs. Smaller institutions may find this prohibitive, and even large libraries will need to justify the local infrastructure and\/or cloud computing costs needed to run RAG services at scale (Huang et al. 2023; AI Now Institute 2023).<\/p>\n<p><b>Residual hallucinations: <\/b>Even with well-tuned RAG pipelines, hallucinations remain a persistent risk. While RAG aims to ground LLM outputs in source material, this only works reliably if the model faithfully adheres to system prompts and if the retrieved content is relevant and sufficiently informative. In practice, even state-of-the-art models sometimes ignore instructions not to guess or fabricate answers, especially when prompted with vague, underspecified, or speculative queries. This failure to follow directives can result in confident, fabricated responses that are not grounded in the retrieved evidence. In library and archival settings, where precision and verifiability are paramount, these confabulations pose a direct challenge to the trustworthiness of AI-assisted services. Guardrails like refusal prompts, deferred responses, and citation requirements help mitigate, <i>but will not eliminate<\/i>, this behavior. As of now, no prompt or retrieval strategy can fully guarantee hallucination-free outputs.<\/p>\n<p><b>Ethical and legal concerns:<\/b> AI-generated answers raise important questions about transparency, accuracy, and bias. In a library context, providing fabricated or misleading information can erode trust, a resource that is more critical than ever in an era marked by rising misinformation, political polarization, and growing public skepticism toward institutions. As geopolitical tensions distort narratives and destabilize information ecosystems, libraries are left to recommit to their role as trusted stewards of credible knowledge. We must consider how to verify, correct, and contextualize LLM outputs, and how to clearly communicate the provenance of any given response. WARC-GPT\u2019s inclusion of source citations is one step in the right direction, but not all tools offer this. There are also legal considerations: large-scale text analysis and embedding may raise copyright concerns, and use of third-party AI services can conflict with privacy or data ownership policies.<\/p>\n<p><b>Bias at scale:<\/b> Another concern is the amplification of existing bias in our collections. Archival and library materials reflect the perspectives, exclusions, and structural inequities of the societies that produced them. When these materials are fed into LLMs without sufficient context or critical framing, there\u2019s a risk that biased or harmful content gets reproduced as neutral or even authoritative. A RAG system (or any LLM) doesn\u2019t \u201cknow\u201d the provenance or politics of a source unless we explicitly encode that context. How to do this is the great unsolved question for LLMs and other deep neural networks. 
We risk building pipelines that surface and summarize legacy bias in ways that flatten nuance or reinforce harmful assumptions. This isn\u2019t a new problem in libraries, but in an AI-powered environment, the speed and scale at which these biases can circulate makes it newly urgent.<\/p>\n<p><b>Maintenance and sustainability: <\/b>Deploying an AI tool isn\u2019t a one-time event. Models require periodic updates, underlying collections evolve, and dependencies shift. There\u2019s also a risk in relying too heavily on proprietary services, which may change terms or shut down unexpectedly. Libraries will need long-term plans to manage these systems, such as whether to invest in in-house training and GPU infrastructure, or to lean more heavily on external, commercial services both from publishers and cloud computing providers. A hybrid approach may be ideal, with frequently accessed or sensitive data handled locally, while less critical tasks leverage the cloud.<\/p>\n<p>In navigating these trade-offs, a thoughtful approach with human oversight is essential. At this point AI tools should augment human expertise in archives and libraries. This kind of oversight is foundational to responsible implementation, especially in these early stages of integrating AI into cultural memory work.<\/p>\n<p>One of the most persistent challenges in integrating AI is the inscrutable nature of the models themselves. Deep neural networks like LLMs function as black boxes: we can observe their inputs and outputs, but their internal reasoning remains largely opaque, even to their creators. This lack of transparency becomes especially troubling when combined with the well-documented issue of hallucinations (Ji et al. 2023). In library and archival settings, where trust is foundational and verifiability is non-negotiable (on paper at least), this presents a serious risk. As we experiment with these tools, ensuring source attribution, maintaining human oversight, and building systems that fail gracefully (rather than confidently lying) must remain core design priorities.<\/p>\n<p><b>Conclusion<\/b><\/p>\n<p>LLMs and RAG signal a fundamental shift in how users can discover and interact with digital collections. My work with WARC-GPT and a custom RAG pipeline revealed both the promise and the pitfalls of applying these tools to web archives. I saw firsthand how AI can synthesize information from across an entire collection and return it in a conversational, human-readable format, lowering barriers for researchers and surfacing connections that keyword search alone might miss.<\/p>\n<p>But the most important lesson was this: <b>RAG pipelines are highly sensitive to design and preprocessing choices<\/b>. Performance didn\u2019t come from the language model alone, it depended on everything that came before it: cleaner data, tuned chunking, targeted filtering, and efficient infrastructure. These foundational steps had a greater impact on retrieval accuracy and output quality than any one prompt or model tweak. With thoughtful design, I found I could significantly improve responsiveness, relevance, and usefulness in a library context, but only by first doing the unglamorous work of preparing the data pipeline.<\/p>\n<p>Recognizing this sensitivity to design choices is key to implementing LLM-assisted services responsibly. 
Success depends not just on the capabilities of the language model, but on the rigor and care applied throughout the pipeline. Issues of data quality, hallucinated answers, and the compute demands of large models and RAG infrastructure generally remain real and pressing. Any library considering a \u201cGPT for web archives\u201d approach will need to invest in the slow work: building preprocessing pipelines, testing retrieval methods, refining prompts, and continuously checking for accuracy. It is a serious undertaking, but it\u2019s also one that could enable entirely new forms of access, research, and engagement.<\/p>\n<p>Looking ahead, the future of AI in libraries is likely to be hybrid and human-centred. We will mix local infrastructure for collections that require control with cloud-based models where scale matters. We will blend automation with the discernment of librarians and archivists. And no one-size-fits-all solution will emerge: what works for one collection may be completely wrong for another. But what\u2019s already clear is that the human role is not going away. If anything, our expertise in context, ethics, and stewardship will become even more critical as these tools gain traction.<\/p>\n<p>And yet, I would be remiss not to acknowledge the bigger picture. The same technologies powering this promising moment are part of a much larger transformation, one that includes growing calls from AI researchers and ethicists about the risks posed by superintelligent systems. As we build small, useful tools for cultural memory work, we do so in the shadow of something much larger, more powerful, and less predictable. It\u2019s a strange time to be hopeful, and a necessary time to be cautious.<\/p>\n<p>RAG can help unlock our collections in remarkable new ways. But doing so thoughtfully\u2013openly, critically, and with humility\u2013might just help us hold onto the trust and wisdom we\u2019ll need in the face of what\u2019s coming.<\/p>\n<p>&nbsp;<\/p>\n<p><b>Bibliography<\/b><\/p>\n<p>AI Now Institute. 2023. Computational power and AI [Internet]. [accessed 2025 Jun 4]. Available from: <a href=\"https:\/\/ainowinstitute.org\/publications\/compute-and-ai\">https:\/\/ainowinstitute.org\/publications\/compute-and-ai<\/a><\/p>\n<p>Anthropic. 2024. Introducing contextual retrieval [Internet]. [accessed 2025 Jun 4]. Available from: https:\/\/<a href=\"http:\/\/www.anthropic.com\/news\/contextual-retrieval\">www.anthropic.com\/news\/contextual-retrieval<\/a><\/p>\n<p>Cargnelutti M, Mukk K, Stanton C. 2024. WARC-GPT: An open-source tool for exploring web archives using AI [Internet]. Harvard Library Innovation Lab. [accessed 2025 Jun 4]. Available from: <a href=\"https:\/\/lil.law.harvard.edu\/blog\/2024\/02\/12\/warc-gpt-an-open-source-tool-for-exploring-web-archives-with-ai\/\">https:\/\/lil.law.harvard.edu\/blog\/2024\/02\/12\/warc-gpt-an-open-source-tool-for-exploring-web-archives-with-ai\/<\/a><\/p>\n<p>Charleston Hub. 2023. In libraries we trust [Internet]. [accessed 2025 Jun 4]. Available from: https:\/\/<a href=\"http:\/\/www.charleston-hub.com\/2025\/01\/in-libraries-we-trust\/\">www.charleston-hub.com\/2025\/01\/in-libraries-we-trust\/<\/a><\/p>\n<p>Costa M. 2021. Full-text and URL search over web archives [Internet]. arXiv preprint arXiv:2108.01603. [accessed 2025 Jun 4]. Available from: <a href=\"https:\/\/arxiv.org\/abs\/2108.01603\">https:\/\/arxiv.org\/abs\/2108.01603<\/a><\/p>\n<p>Cottier B, Rahman R, Fattorini L, Maslej N, Besiroglu T, Owen D. 2024. 
The rising costs of training frontier AI models [Internet]. arXiv preprint arXiv:2405.21015. [accessed 2025 Jun 4]. Available from: <a href=\"https:\/\/arxiv.org\/abs\/2405.21015\">https:\/\/arxiv.org\/abs\/2405.21015<\/a><\/p>\n<p>EAB. 2023. Public trust in higher education has reached a new low [Internet]. [accessed 2025 Jun 4]. Available from: <a href=\"https:\/\/eab.com\/resources\/blog\/strategy-blog\/americans-trust-higher-ed-reached-new-low\/\"> https:\/\/eab.com\/resources\/blog\/strategy-blog\/americans-trust-higher-ed-reached-new-low\/<\/a><\/p>\n<p>Hiltmann T, Dr\u00f6ge M, Dresselhaus N, Grallert T, Althage M, Bayer P, Eckenstaler S, Mendi K, Schmitz JM, Schneider P, Sczeponik W, Skibba A. 2025. NER4all or context is all you need: using LLMs for low-effort, high-performance NER on historical texts. A humanities informed approach. arXiv [Preprint]. Available from: <a href=\"https:\/\/arxiv.org\/abs\/2502.04351\">https:\/\/arxiv.org\/abs\/2502.04351<\/a><\/p>\n<p>Humphries M, Leddy LC, Downton Q, Legace M, McConnell J, Murray I, Spence E. 2024. Unlocking the Archives: Using Large Language Models to Transcribe Handwritten Historical Documents. arXiv [Preprint]. Available from: <a href=\"https:\/\/arxiv.org\/abs\/2411.03340\">https:\/\/arxiv.org\/abs\/2411.03340<\/a><\/p>\n<p>Ji Z, Lee N, Frieske R, Yu T, Su D, Xu Y, Ishii E, Bang Y, Madotto A, Fung P. 2023. A survey on hallucination in large language models. ACM Comput Surv [Internet]. [accessed 2025 Jun 4];56(1):1\u201338. Available from: <a href=\"https:\/\/dl.acm.org\/doi\/10.1145\/3703155\">https:\/\/dl.acm.org\/doi\/10.1145\/3703155<\/a><\/p>\n<p>Kannappan G. 2023. The Achilles&#8217; heel of naive RAG: When retrieval isn&#8217;t enough [Internet]. Medium. [accessed 2025 Jun 4]. Available from:<\/p>\n<p><a href=\"https:\/\/medium.com\/@ganeshkannappan\/the-achilles-heel-of-naive-rag-when-retrieval-isn-t-enough-3c1ab812abbb\">https:\/\/medium.com\/@ganeshkannappan\/the-achilles-heel-of-naive-rag-when-retrieval-isnt-enough-3c1ab812abbb<\/a><\/p>\n<p>Merola C, Singh J. 2025. Reconstructing context: Evaluating advanced chunking strategies for retrieval-augmented generation [Internet]. arXiv preprint arXiv:2504.19754. [accessed 2025 Jun 4]. Available from: <a href=\"https:\/\/arxiv.org\/abs\/2504.19754\">https:\/\/arxiv.org\/abs\/2504.19754<\/a><\/p>\n<p>Milligan I. 2012. WARC files: A challenge for historians, and finding needles in haystacks [Internet]. [accessed 2025 Jun 4]. Available from: <a href=\"https:\/\/ianmilli.wordpress.com\/2012\/12\/12\/warc-files-a-challenge-for-historians-and-finding-needles-in-haystacks\/\">https:\/\/ianmilli.wordpress.com\/2012\/12\/12\/warc-files-a-challenge-for-historians-and-finding-needles-in-haystacks\/<\/a><\/p>\n<p>OpenAI. 2023. OpenAI platform: Vector embedding [Internet]. [accessed 2025 Jun 4]. Available from: <a href=\"https:\/\/platform.openai.com\/docs\/guides\/embeddings\">https:\/\/platform.openai.com\/docs\/guides\/embeddings<\/a><\/p>\n<p>Wang L, Pradeep R, Ainslie J, Shi K, Zettlemoyer L, Zitnick CL. 2022. Text embeddings by weakly-supervised contrastive pre-training [Internet]. arXiv preprint arXiv:2212.03533. [accessed 2025 Jun 4]. Available from: <a href=\"https:\/\/arxiv.org\/abs\/2212.03533\">https:\/\/arxiv.org\/abs\/2212.03533<\/a><\/p>\n<p>&nbsp;<\/p>\n<p><b>Notes<\/b><\/p>\n<p>[<a id=\"note1\" href=\"#ref1\">1<\/a>] <a href=\"https:\/\/bobs-burgers.fandom.com\/\">https:\/\/bobs-burgers.fandom.com<\/a>. 
This site serves as an unexpectedly rich tool for exploring web archives as a non-traditional, community-driven form of scholarship. As a fan-curated knowledge base centered on the long-running animated series, it exemplifies the kind of vernacular documentation that libraries and archives have historically overlooked. Yet fandom wikis like this one are often meticulously maintained, deeply intertextual, and laden with cultural meaning: qualities that make them ideal test cases for assessing how well AI systems can interpret and navigate messy, user-generated web content. From a practical standpoint, this site also lends itself to \u201cground-truthing\u201d: as someone who comfort-watches the show, I\u2019m intimately familiar with its characters, plot arcs, and in-jokes, which makes it easier to spot hallucinations or subtle misrepresentations generated by AI tools.<\/p>\n<p>[<a id=\"note2\" href=\"#ref2\">2<\/a>] Trust in libraries and universities has long underpinned our supposed authority as stewards of knowledge and public goods, but that trust is increasingly fragile. Libraries still enjoy relatively high public regard (Charleston Hub 2023), yet universities face mounting skepticism amid political polarization and questions about institutional neutrality (EAB 2023). As misinformation spreads and civic discourse fractures, our ability to function as credible, trusted spaces is under threat. For libraries embedded within higher education, the stakes are especially high: we must not only preserve and provide access to knowledge, but do so transparently, responsibly, and with renewed urgency. In this climate, even technical choices\u2013like ensuring LLM-generated responses include source attribution\u2013can become conscious acts of trust-building, reaffirming the library\u2019s role as a safeguard against the erosion of shared knowledge and truth, regardless of how quixotic this all might seem right now.<\/p>\n<p>[<a id=\"note3\" href=\"#ref3\">3<\/a>] <a href=\"https:\/\/github.com\/coreyleedavis\/libguides-rag\">https:\/\/github.com\/coreyleedavis\/libguides-rag<\/a>. I relied heavily on ChatGPT and Claude to help write the code. This pipeline wouldn\u2019t exist without them, to be honest. That said, I\u2019d really welcome any feedback from folks who actually know how to write Python from scratch. Suggestions for improving the scripts or overall structure are more than welcome.<\/p>\n<h2><b>About the Author<\/b><\/h2>\n<p>Corey Davis is the Digital Preservation Librarian at the University of Victoria Libraries, where he leads initiatives on AI, web archives, and digital preservation infrastructure.<\/p>\n<p><a href=\"mailto:coreyd@uvic.ca\">coreyd@uvic.ca<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Large Language Models (LLMs) are reshaping digital preservation and access in libraries, but their limitations (hallucinations, opacity, and resource demands) remain significant.<br \/>\nRetrieval-Augmented Generation (RAG) offers a promising mitigation strategy by grounding LLM outputs in specific digital collections. This article compares the performance of WARC-GPT\u2019s default RAG implementation with unfiltered WARC files from Archive-It against a custom-built RAG solution utilizing optimization strategies in both modelling and data (WARC) preprocessing. Tested on a collection of thousands of archived pages from the Bob\u2019s Burgers fan wiki, the study analyzes trade-offs in preprocessing, embedding strategies, retrieval accuracy, and system responsiveness. 
Findings suggest that while WARC-GPT lowers barriers to experimentation, custom RAG pipelines offer substantial improvements for institutions with the technical capacity to implement them, especially in terms of data quality, efficiency, and trustworthiness.<\/p>\n","protected":false},"author":510,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[132],"tags":[],"class_list":["post-18555","post","type-post","status-publish","format-standard","hentry","category-issue61"],"_links":{"self":[{"href":"https:\/\/journal.code4lib.org\/wp-json\/wp\/v2\/posts\/18555","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/journal.code4lib.org\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/journal.code4lib.org\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/journal.code4lib.org\/wp-json\/wp\/v2\/users\/510"}],"replies":[{"embeddable":true,"href":"https:\/\/journal.code4lib.org\/wp-json\/wp\/v2\/comments?post=18555"}],"version-history":[{"count":0,"href":"https:\/\/journal.code4lib.org\/wp-json\/wp\/v2\/posts\/18555\/revisions"}],"wp:attachment":[{"href":"https:\/\/journal.code4lib.org\/wp-json\/wp\/v2\/media?parent=18555"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/journal.code4lib.org\/wp-json\/wp\/v2\/categories?post=18555"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/journal.code4lib.org\/wp-json\/wp\/v2\/tags?post=18555"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}},{"id":18501,"date":"2025-10-21T15:58:54","date_gmt":"2025-10-21T19:58:54","guid":{"rendered":"https:\/\/journal.code4lib.org\/?p=18501"},"modified":"2025-10-21T16:00:30","modified_gmt":"2025-10-21T20:00:30","slug":"building-and-deploying-the-digital-humanities-quarterly-recommender-system","status":"publish","type":"post","link":"https:\/\/journal.code4lib.org\/articles\/18501","title":{"rendered":"Building and Deploying the <em>Digital Humanities Quarterly<\/em> Recommender System"},"content":{"rendered":"<p>By Haining Wang, Joel Lee, John A. Walsh, Julia Flanders, and Benjamin Charles Germain Lee<\/p>\n<h2>Background<\/h2>\n<p>The <em>Digital Humanities Quarterly<\/em> journal (DHQ) was founded in 2005 by the Alliance of Digital Humanities Organizations (ADHO) and the Association for Computers and the Humanities (ACH) as an open-access journal of digital humanities, responding to a need for a journal that could represent the field of digital humanities to the wider academic public that was beginning to discover the field and take an interest in it. The first issue was published in 2007, and since its inception, the journal has published over 750 articles as of 2025. The journal is broadly scoped to cover the entire field of digital humanities, including relevant topics from closely adjacent fields. However, rather than reinforcing the boundaries between subdomains and the increasing specialization of the field, the journal has sought to make the many different areas of digital humanities discoverable and intelligible to one another and to a broad audience of engaged non-specialists, including both experts and novices. As an open-access journal that is often the first point of reference for readers, DHQ thus serves as a gateway through which multiple domains of digital humanities research activity can be discovered and brought into conversation. 
However, existing methods on the DHQ website, such as searching by keyword, title, author, or journal issue, did not succeed in fully opening the DHQ corpus of published articles to readers. Being able to identify relationships between articles plays an important role for readers in this cross-domain exploration, as it enables readers who find an article on a topic of interest to find their way to other articles on the same topic, without needing expert knowledge of the field and its many specializations, which are often difficult to clearly delineate. There are also different kinds of \u201crelatedness\u201d that may be useful: related methodology, related humanities research area, co-citation, and potentially others as well.<\/p>\n<p>In 2022, DHQ formed a new data analytics working group which began tackling the question of how to provide article recommendations as part of a larger overhaul and expansion of DHQ\u2019s technical infrastructure. This article describes the different approaches the group developed, and how they have been implemented within the journal\u2019s publication platform, in conjunction with DHQ\u2019s broader editorial team.<\/p>\n<h2>Methods<\/h2>\n<h3>Recommendation Methods<\/h3>\n<p>Recommender systems are ubiquitous online, facilitating discoverability of content across platforms ranging from digital libraries to e-commerce to music platforms. In the context of academic papers, sites such as Semantic Scholar [<a id=\"ref3\" href=\"#note3\">3<\/a>] and Google Scholar [<a id=\"ref4\" href=\"#note4\">4<\/a>] offer \u201cRelated Papers\u201d and  \u201cRecommended Articles\u201d features to help researchers discover new papers of interest. We drew inspiration from this functionality and implemented a Python-based article recommendation system for DHQ\u2019s diverse content and audience to enhance engagement and discovery. Rather than selecting just one method of generating recommendations, we opted for three distinct methods, each with its own tradeoffs. This section outlines our three approaches used to generate recommendations, and the principles followed when developing the recommendation system. These three methods are complementary, in that the articles recommended by each approach will often be different, encouraging distinct paths for the reader to explore.<\/p>\n<p><strong>Keyword-based approach (#1):<\/strong> DHQ has a controlled vocabulary, DHQ topic keywords[<a id=\"ref5\" href=\"#note5\">5<\/a>], for the description of the area and subject of their papers. Each submission is assigned around three to six of these keywords by our editors. The DHQ topic keyword list is designed around concepts rather than specific terms, recognizing that there are often multiple terms for the same broad concept area. DHQ uses topic identifiers encompassing multiple related terms to identify broad concept areas for each article (e.g. \u201ccode_studies\u201d, \u201ccultural_heritage\u201d) rather than trying to determine which more precise term might best apply in a specific case. For instance, the \u201ccode_studies\u201d identifier encompasses closely related but not synonymous terms like \u201ccode studies\u201d, \u201ccritical code studies\u201d, and \u201csoftware studies\u201d for better description of the work. DHQ has 88 topic keywords as of 2024. 
To better reflect the evolving landscape of scholarship, the taxonomy expands as DHQ editors regularly review the keywords submitted by authors and assess whether new terms should be integrated into the DHQ-wide keywording system.<\/p>\n<p>We used the heuristic that similar papers should share more keywords and, hence, retrieve for each paper the other DHQ articles that share the highest number of keywords. In cases where multiple articles have the same number of shared keywords, a random selection is made from those articles each time the recommendations are regenerated, and this random selection is shown in the interface (in the future, the full set of recommendations could be displayed). Given consistent annotation across articles, the resulting recommendations from this approach offer our readers a disciplinary view of relevant articles.<\/p>\n<p><strong>Term Frequency-Inverse Document Frequency (TF-IDF) Approach (#2):<\/strong> TF-IDF measures the importance of specific words in a document, adjusted for the overall frequency distribution of words. Using this method, we discover similarities between articles by focusing on words that are rare across the entire corpus but appear frequently within related articles. Specifically, we use the Best Matching 25 (BM25) algorithm, which evaluates the full text of articles\u2014including titles, abstracts, and body texts\u2014to compute similarity scores that reflect the frequency and importance of words across documents (Robertson and Zaragoza, 2009)[<a id=\"ref6\" href=\"#note6\">6<\/a>]. Compared to vanilla TF-IDF, BM25 adjusts for term frequency and document length, allowing it to handle the natural variation in document sizes and the informativeness of terms within DHQ. BM25 serves as a strong baseline in information retrieval across diverse domains. It achieved second place in the 2021 COLIEE legal retrieval competition (Rosa et al., 2021)[<a id=\"ref7\" href=\"#note7\">7<\/a>] and performed robustly in climate-related IR tasks (Diggelmann et al., 2020)[<a id=\"ref8\" href=\"#note8\">8<\/a>]. In addition, research demonstrates that combining BM25 with vector search (as we do with SPECTER2 below) significantly outperforms either approach alone (Formal et al., 2021)[<a id=\"ref9\" href=\"#note9\">9<\/a>]. This makes it an ideal complement to our deep learning method.<\/p>\n<p><strong>Deep learning approach (#3):<\/strong> In addition to relying on taxonomy and surface-level similarity, we implemented a deep learning-based approach to capture semantic relationships between manuscripts. This approach uses the <a href=\"https:\/\/arxiv.org\/abs\/2004.07180\">SPECTER2<\/a> model, created by the Allen Institute for Artificial Intelligence (Singh et al., 2022)[<a id=\"ref10\" href=\"#note10\">10<\/a>]. SPECTER2 is a transformer-based language model specifically designed for scientific papers. Trained on a vast corpus of scientific literature, SPECTER2 learns relationships between papers by considering both the textual content and citation patterns, helping it to \u201cunderstand\u201d how articles relate to one another. The model encodes each paper\u2019s title and abstract into a dense, fixed-length vector\u2014essentially a numerical summary of the paper&#8217;s content and its conceptual \u201cposition\u201d in the field. This summarizing vector allows us to compare papers by calculating the cosine similarity between their vectors, identifying those with the highest similarity as the most related. 
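<\/p>\n<p>A minimal sketch of that comparison is shown below. It loads the base SPECTER2 model from Hugging Face, embeds each paper\u2019s title and abstract, and ranks candidates by cosine similarity. The sample records are invented, and the production pipeline in the DHQ-similar-papers repository handles batching, the full corpus, and output formatting that this sketch leaves out.<\/p>\n<pre><code>import torch\nfrom transformers import AutoTokenizer, AutoModel\n\ntokenizer = AutoTokenizer.from_pretrained('allenai\/specter2_base')\nmodel = AutoModel.from_pretrained('allenai\/specter2_base')\n\npapers = [\n    {'title': 'Mapping correspondence networks',\n     'abstract': 'We visualize letters exchanged among Enlightenment scholars.'},\n    {'title': 'Topic modeling dime novels',\n     'abstract': 'Topic models applied to a corpus of nineteenth-century popular fiction.'},\n    {'title': 'Network analysis of early modern letters',\n     'abstract': 'Graph methods for exploring epistolary archives.'},\n]\n\n# SPECTER-style input: title and abstract joined by the tokenizer separator token.\ntexts = [p['title'] + tokenizer.sep_token + p['abstract'] for p in papers]\ninputs = tokenizer(texts, padding=True, truncation=True, max_length=512, return_tensors='pt')\nwith torch.no_grad():\n    outputs = model(**inputs)\nembeddings = outputs.last_hidden_state[:, 0, :]  # one CLS vector per paper\n\n# Cosine similarity of the first paper against the other two; the higher score is the closer match.\nprint(torch.nn.functional.cosine_similarity(embeddings[0:1], embeddings[1:]))<\/code><\/pre>\n<p>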
By focusing on these vectors, we can recommend articles that capture deeper thematic connections beyond just shared keywords or common phrases, providing readers with robust recommendations that reflect scholarly proximity. We generate SPECTER2 embeddings on titles and abstracts because SPECTER2 itself is optimized for titles and abstracts, rather than full papers. The pros and cons of the described approaches are listed in Table 1.<\/p>\n<div class = \"caption\">\n<p><strong>Table 1.<\/strong> The pros and cons of each approach used by the DHQ recommendation system.<\/p>\n<table>\n<tr>\n<th><\/th>\n<th>DHQ Keyword-Based<\/th>\n<th>TF-IDF-Based<\/th>\n<th>Deep Learning-Based<\/th>\n<\/tr>\n<tr>\n<th>Rationale<br \/><span style=\"font-weight: normal;\">(&#8220;Papers are similar because they&#8221;)<\/span><\/th>\n<td>share more DHQ keywords.<\/td>\n<td>share more representative words.<\/td>\n<td>share a similar position in a large body of scholarly works.<\/td>\n<\/tr>\n<tr>\n<th>Materials<\/th>\n<td>DHQ keywords<\/td>\n<td>Full text (with references stripped)<\/td>\n<td>Title and abstract<\/td>\n<\/tr>\n<tr>\n<th>Pros<\/th>\n<td>High explainability<\/td>\n<td>Comprehensive coverage &amp; high explainability<\/td>\n<td>Fine-grained semantic similarity<\/td>\n<\/tr>\n<tr>\n<th>Cons<\/th>\n<td>Coarse granularity<\/td>\n<td>Focuses on surface-level similarity<\/td>\n<td>Not explainable<\/td>\n<\/tr>\n<\/table>\n<\/div>\n<p>When implementing the recommendation system, we kept three principles in mind:<\/p>\n<ul>\n<li><strong>Transparency:<\/strong> DHQ\u2019s practice of encoding articles in TEI format allows for easy extraction of relevant texts. DHQ uses a light customization of the TEI Guidelines, and our encoding practices and schema are publicly available and documented, together with all TEI-encoded articles, in DHQ\u2019s GitHub repository, and we have committed the code and recommendations to the public domain under the CC0 1.0 Universal Public Domain Dedication, waiving copyright to ensure unrestricted use by the community. The system is fully open-sourced, with detailed documentation available to support reproduction, reflecting DHQ\u2019s community-driven ethos. Additionally, the recommendations are refreshed bi-weekly to ensure they remain current. The code and related resources are available at <a href=\"https:\/\/github.com\/Digital-Humanities-Quarterly\/DHQ-similar-papers\">https:\/\/github.com\/Digital-Humanities-Quarterly\/DHQ-similar-papers<\/a>.<\/li>\n<li><strong>Robustness:<\/strong> We wrote (and are still writing) more tests to make sure core functionality will work as intended. We also implemented a continuous integration pipeline using GitHub Actions to pull changes from DHQ\u2019s main GitHub repository to update the recommendations on a bi-weekly basis.<\/li>\n<li><strong>Pedagogy:<\/strong> As mentioned, when implementing the system, we used no internal information, which allows anyone\u2014especially humanists starting their first Python project, those interested in understanding how different generations of recommendation systems work, or those curious about how a deep learning model \u201ccompresses\u201d a few hundred words into machine-understandable vectors\u2014to easily plug and play with the system using only a few lines of code [<a id=\"noteref\" href=\"#note\">Note<\/a>].  
Additionally, the code is well-annotated for pedagogical use and is compatible with the <a href=\"https:\/\/peps.python.org\/pep-0008\/\">PEP8<\/a> standard and the <a href=\"https:\/\/google.github.io\/styleguide\/pyguide.html\">Google Python Style Guide<\/a>.<\/li>\n<\/ul>\n<h3>Integrating the Recommendations into the <em>Digital Humanities Quarterly<\/em> Website<\/h3>\n<p>The output of each recommendation system is a TSV file which lists all article IDs with metadata and ten different article IDs which represent the top ten related articles, in descending order (a sample TSV for the deep learning-based recommendations can be found here: <a href=\"https:\/\/github.com\/Digital-Humanities-Quarterly\/DHQ-similar-papers\/blob\/main\/dhq-recs-zfill-spctr.tsv\">https:\/\/github.com\/Digital-Humanities-Quarterly\/DHQ-similar-papers\/blob\/main\/dhq-recs-zfill-spctr.tsv<\/a>). These TSV files are moved back into the DHQ main journal repository so that they can be refreshed on a regular basis.<\/p>\n<p>To display the recommendations, we chose to add a box at the bottom of articles that shows the top five recommended articles from each of the three recommendation systems. We use those same TSV files to retrieve information about each recommended article and display it inside the box. In addition, we include some explanatory text about the recommendations, and link to an explore page where readers can learn more about the methodology.<\/p>\n<p class = \"caption\">\n<a href=\"\/media\/issue61\/wang\/wang_figure_1_lg.png\"><img decoding=\"async\" src=\"\/media\/issue61\/wang\/wang_figure_1.png\"><\/a><br \/>\n<strong>Figure 1.<\/strong> Top of an article with the \u201cSee Recommendations\u201d button on the toolbar\n<\/p>\n<p class = \"caption\">\n<a href=\"\/media\/issue61\/wang\/wang_figure_2_lg.png\"><img decoding=\"async\" src=\"\/media\/issue61\/wang\/wang_figure_2.png\"><\/a><br \/>\n<strong>Figure 2.<\/strong> Recommendation box at the bottom of the article\n<\/p>\n<p>The above figures show how this is implemented within a DHQ article: a \u201cSee Recommendations\u201d button was placed at the top of the toolbar (Figure 1) that brings the user down to the bottom of the page with the recommendations from each of the three recommendation systems, as well as the explanatory text (Figure 2). Note that the recommendations mostly differ from one system to another, touching on aspects of the article such as technical methodology in OCR and archival medium in historical newspapers. Recommendations in this example also range across the entirety of DHQ\u2019s publication history. Note further that Figure 2 shows the descriptive text surrounding the recommendation methods for the end-user, which also links to a more in-depth description page for the end-user to learn more if desired.<\/p>\n<p>Lastly, the development of the recommendations coincided with DHQ\u2019s broader transition to utilizing static site generation via Apache Ant. Thus, the creation of the recommendation box and the population of recommendations for each article were integrated into the general XSLT code that generates each article page statically. This ensured that the integration of the recommendation box was a relatively seamless addition consistent with the general workflow of DHQ\u2019s current and future update process. Now, when new articles are accepted and prepared for publication, the recommendation system can be triggered manually or rerun on a biweekly basis via GitHub Actions. 
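<\/p>\n<p>For readers who want to inspect those files directly, a small sketch of loading one of the published TSVs is below. It assumes only that the file is tab-separated with a header row, an article identifier in the first column, and the ranked recommendation IDs in later columns; consult the sample file linked above for the actual layout before relying on specific column positions.<\/p>\n<pre><code>import csv\n\ndef load_recommendations(tsv_path):\n    # Map each article ID to the ranked list of values in the remaining columns.\n    recs = {}\n    with open(tsv_path, newline='', encoding='utf-8') as handle:\n        reader = csv.reader(handle, dialect='excel-tab')  # built-in tab-separated dialect\n        next(reader)  # skip the header row\n        for row in reader:\n            recs[row[0]] = row[1:]\n    return recs\n\nrecs = load_recommendations('dhq-recs-zfill-spctr.tsv')\nprint(next(iter(recs.items())))<\/code><\/pre>\n<p>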
This will generate new TSV recommendation files which include forthcoming articles, and potentially update and refresh the recommendations for existing articles. Then, the static site can be regenerated to display the recommendation box for the new articles.<\/p>\n<h2>Discussion &amp; Future Work<\/h2>\n<p>In this article, we have detailed our efforts at <em>Digital Humanities Quarterly<\/em> to develop and deploy a paper recommender system for the over 750 papers in our corpus. We have detailed our implementation of three different and complementary recommendation methods: a keyword-based approach, a TF-IDF-based approach, and a deep learning-based approach, each of which has distinct advantages. We have also detailed the new interface for browsing the paper recommendations and contextualized it within DHQ\u2019s new static site generation.<\/p>\n<p>We have introduced this case study as a road map for other journals and libraries interested in implementing their own recommender systems. Given that <em>Digital Humanities Quarterly<\/em> operates as an open source, non-profit journal, we recognize the importance of designing systems that can be built, deployed, and maintained within shoestring budgets and limited staff time. The preprocessing scripts for the recommender system can easily be run on a laptop or the equivalent computing power. We hope that our code may be of use to other practitioners. Along these lines, if readers have questions regarding the implementation of an analogous system for another corpus, we would be more than happy to have further discussions.<\/p>\n<p>We recognize the importance of evaluating the quality of the recommendations we are currently serving on the journal\u2019s website. Accordingly, we plan to issue a survey to <em>Digital Humanities Quarterly<\/em> readers soliciting feedback surrounding these recommendations, including both Likert scale evaluations and qualitative responses. We will then incorporate the feedback and improve our recommender system. Also, we aim to continue experimenting with different approaches to generating recommendations, including adopting state-of-the-art embeddings models as they are released. Lastly, we welcome feedback from <em>Digital Humanities Quarterly<\/em> and <em>Code4Lib<\/em> readers.<\/p>\n<h2>Acknowledgments<\/h2>\n<p>We would like to thank Ash Clark in the Digital Scholarship Group at Northeastern University for eir help in reviewing and editing the XSLT code for the recommendation display on article pages. We would also like to thank former <em>Digital Humanities Quarterly<\/em> Collaborative Development Editor Hoyeol Kim for their initial work on keyword extraction and paper recommendations, as well as former <em>Digital Humanities Quarterly<\/em> General Editor Emily Edwards for their input and feedback throughout the development of the recommendations. Lastly, we would like to warmly acknowledge the work done by the <em>Digital Humanities Quarterly<\/em> Managing Editors to develop the journal\u2019s keywording system, with particular thanks to Dave DeCamp, RB Faure, and Benjamin Grey, and also to Lindsay Day who did the initial keyword research during an internship with the Northeastern University Digital Scholarship Group. 
<\/p>\n<h2 class=\"abouttheauthor\">About the Authors<\/h2>\n<p><em>Haining Wang<\/em> (<a href=\"mailto:hw56@iu.edu\">hw56@iu.edu<\/a>) is a postdoctoral researcher at Indiana University School of Medicine and serves as a Data Analytics Editor at Digital Humanities Quarterly.<\/p>\n<p><em>Joel Lee<\/em> (<a href=\"mailto:joe.lee@northeastern.edu\">joe.lee@northeastern.edu<\/a>) is a data engineer in the Digital Scholarship Group at Northeastern University Library and serves as a Collaborative Development Editor at Digital Humanities Quarterly.<\/p>\n<p><em>John A. Walsh<\/em> (<a href=\"mailto:jawalsh@iu.edu\">jawalsh@iu.edu<\/a>) is the Director of the HathiTrust Research Center, an Associate Professor of Information and Library Science in the Luddy School of Informatics, Computing, and Engineering at Indiana University and serves as a General Editor of Digital Humanities Quarterly.<\/p>\n<p><em>Julia Flanders<\/em> (<a href=\"mailto:j.flanders@northeastern.edu\">j.flanders@northeastern.edu<\/a>) is a Professor of the Practice and the Director of the Digital Scholarship Group at Northeastern University, and serves as Editor in Chief of Digital Humanities Quarterly.<\/p>\n<p><em>Benjamin Charles Germain Lee<\/em> (<a href=\"mailto:bcgl@uw.edu\">bcgl@uw.edu<\/a>) is an Assistant Professor in the Information School at the University of Washington and serves as a General Editor at Digital Humanities Quarterly.<\/p>\n<h2>Notes<\/h2>\n<p>[<a id=\"note\" href=\"#noteref\">Note<\/a>] For anyone who wants to make their own IR project using our approach, they can reproduce our method on the DHQ corpus (i.e., all DHQ articles publicly shared on GitHub). For those who have Python 3.10 installed (which most computers do; check by opening Terminal and typing &#8220;python -V&#8221;), they can git clone our repository (for necessary utilities), create a Python virtual environment and install necessary dependencies (for supporting libraries so we don&#8217;t need to build the wheels ourselves), and then run scripts by calling &#8220;python run_bm25_recs.py&#8221;. We have detailed instructions at: <a href=\"https:\/\/github.com\/Digital-Humanities-Quarterly\/DHQ-similar-papers?tab=readme-ov-file#reproduction\">https:\/\/github.com\/Digital-Humanities-Quarterly\/DHQ-similar-papers?tab=readme-ov-file#reproduction<\/a>.<\/p>\n<h2>References<\/h2>\n<p>[<a id=\"note1\" href=\"#ref1\">1<\/a>] The <em>Digital Humanities Quarterly<\/em> website: <a href=\"https:\/\/dhq.digitalhumanities.org\">https:\/\/dhq.digitalhumanities.org<\/a><\/p>\n<p>[<a id=\"note2\" href=\"#ref2\">2<\/a>] The <em>Digital Humanities Quarterly<\/em> similar papers codebase: <a href=\"https:\/\/github.com\/Digital-Humanities-Quarterly\/DHQ-similar-papers\">https:\/\/github.com\/Digital-Humanities-Quarterly\/DHQ-similar-papers<\/a><\/p>\n<p>[<a id=\"note3\" href=\"#ref3\">3<\/a>] Semantic Scholar website: <a href=\"https:\/\/www.semanticscholar.org\/\">https:\/\/www.semanticscholar.org\/<\/a><\/p>\n<p>[<a id=\"note4\" href=\"#ref4\">4<\/a>] Google Scholar website: <a href=\"https:\/\/scholar.google.com\/\">https:\/\/scholar.google.com\/<\/a><\/p>\n<p>[<a id=\"note5\" href=\"#ref5\">5<\/a>] DHQ Topic Keywords: <a href=\"https:\/\/github.com\/Digital-Humanities-Quarterly\/dhq-journal\/wiki\/DHQ-Topic-Keywords\">https:\/\/github.com\/Digital-Humanities-Quarterly\/dhq-journal\/wiki\/DHQ-Topic-Keywords<\/a><\/p>\n<p>[<a id=\"note6\" href=\"#ref6\">6<\/a>] Stephen Robertson and Hugo Zaragoza. 2009. 
\u201cThe Probabilistic Relevance Framework: BM25 and Beyond.\u201d Found. Trends Inf. Retr. 3, 4 (April 2009), 333\u2013389. <a href=\"https:\/\/doi.org\/10.1561\/1500000019\">https:\/\/doi.org\/10.1561\/1500000019<\/a><\/p>\n<p>[<a id=\"note7\" href=\"#ref7\">7<\/a>] Guilherme Moraes Rosa, Ruan Chaves Rodrigues, Roberto Lotufo, and Rodrigo Nogueira. 2021. \u201cYes, BM25 is a Strong Baseline for Legal Case Retrieval,\u201d ArXiv.  <a href=\"https:\/\/arxiv.org\/abs\/2105.05686\">https:\/\/arxiv.org\/abs\/2105.05686<\/a><\/p>\n<p>[<a id=\"note8\" href=\"#ref8\">8<\/a>] Thomas Diggelmann, Jordan Boyd-Graber, Jannis Bulian, Massimiliano Ciaramita, and Markus Leippold. 2020. \u201cCLIMATE-FEVER: A Dataset for Verification of Real-World Climate Claims,\u201d ArXiv. <a href=\"https:\/\/arxiv.org\/abs\/2012.00614\">https:\/\/arxiv.org\/abs\/2012.00614<\/a><\/p>\n<p>[<a id=\"note9\" href=\"#ref9\">9<\/a>] Thibault Formal, Carlos Lassance, Benjamin Piwowarski, and St\u00e9phane Clinchant. 2021. \u201cSPLADE v2: Sparse Lexical and Expansion Model for Information Retrieval,\u201d ArXiv. <a href=\"https:\/\/arxiv.org\/abs\/2109.10086\">https:\/\/arxiv.org\/abs\/2109.10086<\/a><\/p>\n<p>[<a id=\"note10\" href=\"#ref10\">10<\/a>] Singh, Amanpreet, Mike D&#8217;Arcy, Arman Cohan, Doug Downey and Sergey Feldman. 2022. \u201cSciRepEval: A Multi-Format Benchmark for Scientific Document Representations.\u201d Conference on Empirical Methods in Natural Language Processing. <a href=\"https:\/\/aclanthology.org\/2023.emnlp-main.338\/\">https:\/\/aclanthology.org\/2023.emnlp-main.338\/<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Since 2007, <em>Digital Humanities Quarterly<\/em> has published over 750 scholarly articles, constituting a significant repository of scholarship within the digital humanities. As the journal\u2019s corpus of articles continues to grow, it is no longer possible for readers to manually navigate the title and abstract of every article in order to stay apprised of relevant work or conduct literature reviews. To address this, we have implemented a recommender system for the <em>Digital Humanities Quarterly<\/em> corpus, generating recommendations of related articles that appear below each article on the journal\u2019s website with the goal of improving discoverability. These recommendations are generated via three different methods: a keyword-based approach based on a controlled vocabulary of topics assigned to articles by editors; a TF-IDF approach applied to full article text; and a deep learning approach using the Allen Institute for Artificial Intelligence\u2019s SPECTER2 model applied to article titles and abstracts. In this article, we detail our process of creating this recommender system, from the article pre-processing pipeline to the front-end implementation of the recommendations on the <em>Digital Humanities Quarterly<\/em> website [<a id=\"ref1\" href=\"#note1\">1<\/a>]. 
All of the code for our recommender system is publicly available in the <em>Digital Humanities Quarterly<\/em> GitHub repository [<a id=\"ref2\" href=\"#note2\">2<\/a>].<\/p>\n","protected":false},"author":202,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[132],"tags":[],"class_list":["post-18501","post","type-post","status-publish","format-standard","hentry","category-issue61"],"_links":{"self":[{"href":"https:\/\/journal.code4lib.org\/wp-json\/wp\/v2\/posts\/18501","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/journal.code4lib.org\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/journal.code4lib.org\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/journal.code4lib.org\/wp-json\/wp\/v2\/users\/202"}],"replies":[{"embeddable":true,"href":"https:\/\/journal.code4lib.org\/wp-json\/wp\/v2\/comments?post=18501"}],"version-history":[{"count":0,"href":"https:\/\/journal.code4lib.org\/wp-json\/wp\/v2\/posts\/18501\/revisions"}],"wp:attachment":[{"href":"https:\/\/journal.code4lib.org\/wp-json\/wp\/v2\/media?parent=18501"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/journal.code4lib.org\/wp-json\/wp\/v2\/categories?post=18501"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/journal.code4lib.org\/wp-json\/wp\/v2\/tags?post=18501"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}},{"id":18541,"date":"2025-10-21T15:58:53","date_gmt":"2025-10-21T19:58:53","guid":{"rendered":"https:\/\/journal.code4lib.org\/?p=18541"},"modified":"2025-10-21T16:00:40","modified_gmt":"2025-10-21T20:00:40","slug":"what-it-means-to-be-a-repository-real-trustworthy-or-mature","status":"publish","type":"post","link":"https:\/\/journal.code4lib.org\/articles\/18541","title":{"rendered":"What it Means to be a Repository: Real, Trustworthy, or Mature?"},"content":{"rendered":"<p>by Seth Shaw<\/p>\n<h2>Current notions of &#8220;Digital Repositories&#8221;<\/h2>\n<p>Once I heard an archivist describe their institution\u2019s existing digital infrastructure, a respectable setup with network-based storage and well-organized metadata files, as &#8220;not a real repository.&#8221; Why not? What does it mean to be a &#8220;real repository?&#8221; Are there set qualifications a digital repository must meet before it can be &#8220;real&#8221;, much like Pinocchio&#8217;s quest (in the Disney retelling) of proving himself brave, truthful, and unselfish, to become a &#8220;real boy&#8221;? How many archivists wish their repository infrastructure was a real repository much as Geppetto wished his wooden puppet was a real boy? &#8220;A very lovely thought&#8230; but not at all practical,&#8221; Geppetto lamented. Are archivists waiting for an information technology department or third-party provider, a Blue Fairy, to turn their repository into a real one? They do not need a Blue Fairy, their repository is already &#8220;real,&#8221; if immature. 
This article discusses what it means to be a &#8220;real&#8221; digital repository and advocates shifting the discussion from realness to trustworthiness and maturity.<\/p>\n<h2>Digital Archival Repositories as Organization, Practices, and Infrastructure<\/h2>\n<p>In the most basic sense a repository is &#8220;a place where things can be stored and maintained; a storehouse&#8221; <a href=\"#ref1\" id=\"note1\">[1]<\/a> and, thus, a digital repository is a place to store and maintain digital &#8220;things.&#8221; If being &#8220;real&#8221; only requires fulfilling this definition then, in its simplest form, a digital repository could be any digital storage media, such as a USB flash-drive, which an archivist uses to maintain digital content. Yet this will not satisfy most, if any, archivists\u2019 definition of a digital repository. What other qualifications must be met?<\/p>\n<p>John Mark Ockerbloom notes while discussing the fundamental nature of a repository that, at the level of individual consumer devices, &#8220;new kinds of repositories that have little or no connection to traditional libraries&#8230; [are] on their users&#8217; own computers&#8230; [and] on Internet sites.&#8221; He continues by qualifying &#8220;personal digital repositories.&#8221; They cannot simply be a &#8220;glorified filesystem or website;&#8221; they must support &#8220;managing metadata, supporting discovery, provid[e] for access control [&#8230; ]and [support] long-term access and use of the content.&#8221; Although these attributes appear reasonable and might fit the mental model some archivists have of a digital repository, I imagine that many, if not most, archivists would not consider Ockerbloom&#8217;s examples of personal digital repositories, such as iTunes or GoogleDocs, appropriate applications to use for their digital repository. <a href=\"#ref2\" id=\"note2\">[2]<\/a> <\/p>\n<p>When archivists discuss repositories, generally speaking, they are referring to either &#8220;the building (or portion thereof) housing archival collections&#8221; (the repository is an infrastructure) and\/or organizations that collect archival materials and administer them according to archival principles and practices (the repository is a function). <a href=\"#ref3\" id=\"note3\">[3]<\/a> By extension, when one discusses digital repositories, they could be referring either to the software and hardware storing digital archival content (infrastructure) and\/or the organization collecting it (function). Digital repository infrastructure (storing digital material) should therefore incorporate as much of the archival repository function as possible (administering content according to archival principles and practices) something which none of our previous examples, necessarily, do. As Adrian Cunningham warned the profession: &#8220;digital archives are at risk of being managed just like vanilla digital libraries, thus dumbing down the peculiar challenges and complexities of preserving records.&#8221; <a href=\"#ref4\" id=\"note4\">[4]<\/a>  When discussing &#8216;digital archives&#8217; the emphasis should be on &#8216;archives&#8217; rather than on &#8216;digital.\u2019<\/p>\n<p>Joanne Kaczmarek, et. 
al., in their article on evaluating repository software applications, note that:<\/p>\n<blockquote><p>These implementations have at their core software applications that are often called &#8220;repositories.&#8221; Yet, the traditional notion of a repository is one that conjures up images of well known institutions such as the British Library, the National Archives of the Netherlands, the Library of Congress or the Getty Museum. This notion of a repository implies a larger organizational commitment to the trustworthy oversight of valued and valuable resources that goes far beyond specific software implementations and underlying hardware and software architectures[&#8230;] <a href=\"#ref5\" id=\"note5\">[5]<\/a><\/p>\n<p>We have spent much time reiterating for ourselves places of overlap and distinction between these two generalized notions of a repository. As a clarification aid, we have coined the terms &#8216;Big R&#8217; and &#8216;little r,&#8217; and applied them in project meetings to distinguish between a repository at the institutional level (&#8216;Big R&#8217;) and an actual repository software application (&#8216;little r&#8217;).<\/p><\/blockquote>\n<p>Here then is the essential point: a &#8220;real&#8221; digital repository is the archival institution (&#8216;Big R&#8217;) managing digital materials within their digital repository infrastructure (&#8216;little r&#8217;), regardless of implementation, according to archival principles and practices. An archival institution does not require complex software for their digital repository infrastructure. Archivists need not feel that their digital archival repository is not &#8220;real&#8221; because it is simple.<\/p>\n<h2>OAIS, as a Reference Model<\/h2>\n<p>The &#8220;Reference Model for an Open Archival Information System (OAIS)&#8221; <a href=\"#ref6\" id=\"note6\">[6]<\/a>  is well known to the archival community and is commonly used to describe what it means to be a digital repository. <a href=\"#ref7\" id=\"note7\">[7]<\/a>  The OAIS reference model is often used in the literature as the standard for describing digital repositories even if only as a brief reference. <a href=\"#ref8\" id=\"note8\">[8]<\/a>  Many articles that describe implementing an OAIS generally focus on the repository software architecture rather than the repository as an institution, and even these may consider only a subset of the standard such as the ingest function or the information packages. <a href=\"#ref9\" id=\"note9\">[9]<\/a>  One of the few exceptions is Mumma, Dingwall, and Bigelow&#8217;s article, which describes their use of Archivematica and ICA-AtoM in terms of the reference model but also describes aspects of institutional policy such as migration strategy implications, an aspect of the preservation planning function. <a href=\"#ref10\" id=\"note10\">[10]<\/a>  The general emphasis in the literature on software systems obscures the fact that complying with the standard requires having appropriate policies and procedures in place to govern repository functions, not necessarily having complex software to automate them.
This can lead to the impression that an OAIS-conformant repository requires complex software environments, seemingly out of reach for smaller archival institutions, which is not the case.<\/p>\n<p>OAIS specifies six essential responsibilities <a href=\"#ref11\" id=\"note11\">[11]<\/a>  (paraphrased):<\/p>\n<ol>\n<li>Negotiating and accepting content from producers <a href=\"#ref12\" id=\"note12\">[12]<\/a> <\/li>\n<li>Gaining the necessary control for long term preservation (legal right to preserve, permission to migrate formats, and collaborative agreements with other archives for preservation expertise)<\/li>\n<li>Identifying primary user groups (designated communities) to whom services are tailored<\/li>\n<li>Ensuring these designated communities have what they need to understand the content (i.e. documented access copies)<\/li>\n<li>Ensuring policies and procedures are documented and followed and materials are, accordingly, preserved<\/li>\n<li>Ensuring users have access to preservation metadata so they can trace derivative files to the originals.<\/li>\n<\/ol>\n<p>These responsibilities are familiar to archivists who have a long tradition of meeting them for their analog collections. OAIS defines models for information packages (content and their metadata) and the functions necessary to fulfill these responsibilities. Notably, as a reference model, OAIS &#8220;does not define or require any particular method of implementation of these concepts.&#8221; <a href=\"#ref13\" id=\"note13\">[13]<\/a> <\/p>\n<p>The following sections describe the OAIS information packages  and functions to illustrate that their &#8220;requirements&#8221; for compliance, although they may appear very technical, can be implemented using simple methods, and can often employ existing archival practices, workflows, and tools.<\/p>\n<h2>Packaging Information<\/h2>\n<p>Archivists are familiar with collecting and maintaining metadata to administrate and provide access to archival materials. Archival metadata and documentation have taken many forms over the years including inventories, collection control files, shelf lists, and finding aids. In the past few decades metadata and documentation have migrated to digital forms including spreadsheets, custom databases, encoded files, and archival data management systems such as AtoM and ArchivesSpace. <a href=\"#ref14\" id=\"note14\">[14]<\/a>  <\/p>\n<p>OAIS sets standards for packaging information that can be implemented without substantial additional effort. For example, a Package Description is descriptive metadata that &#8220;enables the Consumer to locate information of potential interest, analyze that information, and order desired information,&#8221; for example, a finding aid or catalog record. An archives can store package descriptions in their existing archival management systems whether they be spreadsheets, ArchivesSpace, or something else.<\/p>\n<p>Similarly, a Content Object is the combination of the Data Object <a href=\"#ref15\" id=\"note15\">[15]<\/a>  and Representation Information. Representation Information is &#8220;the information that maps a Data Object into more meaningful concepts,&#8221; i.e. what someone needs to know to render the object and make it comprehensible. A basic method to record Representation Information is to note a file&#8217;s format. There are a number of relatively easy-to-use tools available today for this purpose such as Droid and FITS. 
<a href=\"#ref16\" id=\"note16\">[16]<\/a>  Therefore, to fulfill these requirements, all an archives needs to do is acquire the data, run a format identification tool such as Droid, and save the results in a consistent location. Ideally an archives would also acquire a copy of the format&#8217;s specification and\/or software capable of interpreting the file format. An archives would also document the data&#8217;s semantic structure for datasets as needed, such as the meanings of spreadsheet columns. <a href=\"#ref17\" id=\"note17\">[17]<\/a> <\/p>\n<p>OAIS also specifies five categories of Preservation Description Information (PDI): provenance, reference, fixity, context, and access rights information. PDI is administrative metadata, and all of the categories save one (fixity) are common to archival practice for analog materials. Archivists document the provenance of their materials as well as context and access rights in collection control files and finding aids. Archivists use reference information to identify collections, series, boxes, folders, and even individual items and link our metadata to our materials. Archivists can use these existing descriptive tools to fulfill the requirements of an OAIS digital repository. <\/p>\n<p>Fixity is the one preservation description information category that does not fit well into existing practices. As with representation information, there are a number of readily available tools such as RapidCRC <a href=\"#ref18\" id=\"note18\">[18]<\/a>  and DataAccessioner <a href=\"#ref19\" id=\"note19\">[19]<\/a>  that produce manifests which record the objects identifier (usually the current file path) and the checksum digest. At a minimal level for fixity, all an archives needs to do is run a checksum generation tool (or validation tool if fixity information was provided by the donor) and store the checksums in a consistent location. <\/p>\n<p>Packaging Information, &#8220;either actually or logically, binds or relates the components of the package into an identifiable entity.&#8221; <a href=\"#ref20\" id=\"note20\">[20]<\/a>  This can be done by specifying which parts of the information package are stored in which files in a directory structure using a consistent naming scheme. Archivematica&#8217;s AIP specification serves as their Packaging Information by describing how they store their Archival Information Package components (a directory and naming scheme and METS file within a Bag-It bag within a 7-Zip file). <a href=\"#ref21\" id=\"note21\">[21]<\/a> In most cases the various components will be stored in different systems. This is true not only of simple repositories that keep digital objects and manifests on external media with other preservation description information in a tool like ArchivesSpace but also the complex digital repositories. The important point is that an archives needs to document this packaging information, however it is implemented.<\/p>\n<h2>Digital Repository Functions<\/h2>\n<p>Section 4.2 of the OAIS reference model describes the six digital repository functional entities: ingest, archival storage, data management, access, preservation planning, and administration.    Each of these entities represent someone or something (e.g. software) which performs the functions associated with it. 
<\/p>\n<p>The first subsection describing these functional entities in detail (4.2.3.2) discusses common services, a section that is usually omitted in conversations of the standard because it underlies and supports everything else, thus it is not explicitly listed as a functional area. The common services section recognizes that a technological baseline of computing capability is required for working with digital records, such as operating systems, networking, and system security. An archives must identify the hardware and software infrastructure necessary to meet their own digital repository needs. As OAIS states in their section on security, &#8220;the appropriate level of protection [or any capability] is determined based upon the value of the information to the application end-users and the perception of threats to it.&#8221; <a href=\"#ref22\" id=\"note22\">[22]<\/a>   Acquiring technical infrastructure is as much about cost-benefit and risk analysis as it is functional requirements. Digital archiving can be done simply and relatively inexpensively or in a very sophisticated and costly manner. What are the risks, what are the benefits, and how much do the stakeholders care to spend on it? Infrastructure and security requirements for local history centers will differ from requirements for corporate and government archives holding restricted but commercially valuable records attractive to attackers.  <\/p>\n<p>The ingest functional entity <a href=\"#ref23\" id=\"note23\">[23]<\/a>   has five functions: receive submissions, quality assurance, generate archival information packages, generate descriptive information, and coordinate updates. Receiving submissions and quality assurance require copying the data and ensuring it is what the archives expected to receive\u2014either by paging through the files to verify content, much as one would do with boxes of paper records received, or validating checksums if they are provided. Generating the archival information package involves gathering the representation information and preservation description information and recording them according to the structure determined by the repository&#8217;s packaging information. This could be a basic checksum manifest, a list of identified formats, and a text file recording everything else in a few file directories. Generating descriptive information is similar to creating archival finding aids for analog collections: record a title, dates, and extents along with any other appropriate fields in whatever archival information management system the archives currently uses. Coordinating updates involves moving the archival information package into long-term storage and making the descriptive information available with all the other collection descriptions. In other words, the ingest is the same as traditional archival accessioning and baseline processing, only with digital materials.<\/p>\n<p>The archival storage functional entity <a href=\"#ref24\" id=\"note24\">[24]<\/a> is much like managing archival stacks, although in a digital context. The first function, receive data, ensures the archival information package is stored on appropriate media which is much like finding a shelf in the stacks for a collection&#8217;s boxes and recording their locations. The second, managing storage hierarchy, ensures that the storage used meets the organization&#8217;s storage policy requirements, left open to the organization&#8217;s discretion. 
Are the digital stacks large enough, close enough to users, and secure enough to protect the materials they house? The third function, replace media, is the least like physical stacks as archivists do not throw out and build new stacks when the existing ones wear out (very often, at least), but most digital media is inherently unstable and so will need to be replaced on a regular basis according to institutional policy, which is left to the institution to determine. The fourth function, error checking, monitors data\/media for errors. Similar to how an archives monitors heat and humidity levels, and examines records for signs of mold, mildew, red rot, pests, and other signs of deterioration, archivists must regularly check for digital degradation. This section of OAIS does include a rare case of recommending specific technologies, in this case for fixity checking (it suggests Cyclic Redundancy Checks and Reed-Solomon encoding), although different ones are often used by most archives (previously MD5 and now more commonly SHA-256). What matters is that the task gets done, no matter which specific technology is used. <a href=\"#ref25\" id=\"note25\">[25]<\/a> The disaster recovery function requires an archives to plan for the unfortunate reality that disasters occur by placing at least one copy in a physically separate facility (however it gets there). Finally, the provide data function requires the archive to provide data to the requester if the requester is authorized to access it. An archives retrieves boxes from the stacks for researchers and needs to get data from digital storage when necessary as well. As per the standard, the details of how this is done are left for the archives to determine. <\/p>\n<p>The data management functional entity <a href=\"#ref26\" id=\"note26\">[26]<\/a> maintains a &#8220;data management database&#8221; which stores descriptive information (finding aids, catalog records, accession records, etc.) and system information (administrative metadata). Descriptive information for digital materials can be maintained in the same descriptive systems an archives already uses, be it a library catalog, a collection of EAD finding aids, a home-grown relational database, or an Archival Description Management System such as ArchivesSpace. <a href=\"#ref27\" id=\"note27\">[27]<\/a> Similarly, system information can be stored in a series of text files, eXtensible Markup Language (XML) files, a relational database, a triple-store, or any other sort of datastore the archives deems appropriate for maintaining administrative metadata. Each system will have its own affordances for management and querying. An archives administers the database (by ensuring it is working properly and meets the archives&#8217; needs), performs queries (either by manually looking through records or preset queries), generates reports as needed (such as summaries of holdings or usage statistics by, again, either manual or programmatic means), and receives updates (loading new or updating existing collection descriptions and administrative metadata). <\/p>\n<p>The administration and preservation planning functional entities <a href=\"#ref28\" id=\"note28\">[28]<\/a>, naturally, deal with administrative and planning tasks, not technical ones. An archives administration needs to negotiate submissions of materials, set policies, ensure policies are being followed, manage requests, and so on.
Preservation planning monitors technology, tracks user needs, develops strategies, and plans implementation.<\/p>\n<p>The access functional entity <a href=\"#ref29\" id=\"note29\">[29]<\/a> allows for the discovery and delivery of digital archival materials. Although the language used to describe coordinating access activities appears complex, it is actually a very abstract way of describing reference requests archivists get day-to-day: receiving a request for information, answering the request, receiving a request to access content, and then coordinating that access. Generating the DIP (dissemination information package) involves gathering the data and metadata necessary to fulfill the reference request and performing any necessary conversions. For example, if the researcher only needs a meeting minutes series that are all PDFs, generating the DIP only requires gathering the PDFs and a copy of, or link to, the relevant finding aid. The deliver response function could be accomplished by sending an email (assuming the archives has the right to distribute copies) with the PDFs and a link to the finding aid. If the minutes are in an older document format the requester cannot open, then the generate DIP function can include making access versions. The challenges of providing effective access to digital materials, including rights issues and systems obsolescence, are significant, but OAIS &#8220;compliance&#8221; is not one of them.<\/p>\n<p>Being a functionally complete \u201creal repository\u201d does not necessarily mean that a repository is effective or scalable. It may be better than nothing but still not enough to satisfy an archivist\u2019s desires. What then can an archives do to ascertain the quality of their repository?<\/p>\n<h2>Trustworthy Digital Repositories<\/h2>\n<p>Rather than asking what makes a digital repository \u201creal,\u201d archivists may ask what makes a digital repository trustworthy. Being a real repository is, in the stricter sense, relatively simple; being a trustworthy repository of digital materials is much more difficult. Archives commit to being trustworthy custodians of archival materials, but born-digital materials require additional reassurances due to their high degree of mutability and their relative lack of artifactual qualities suitable for traditional diplomatic analysis. Heather MacNeil states that \u201cfor the preservers of electronic records to function effectively as trusted custodians [&#8230;]it is not sufficient that they simply declare that the records in their custody are presumptively authentic; they also provide grounds for such declaration.\u201d <a href=\"#ref30\" id=\"note30\">[30]<\/a> Matthew Kirschenbaum, et al., also state that \u201can institution that has a proven track record with regard to conserving, processing, and making available paper manuscripts\u2014in other words, is trusted to handle traditional archival materials\u2014is not necessarily a trustworthy custodian of digital objects.\u201d <a href=\"#ref31\" id=\"note31\">[31]<\/a>  What then makes a repository &#8220;a trustworthy custodian of digital objects&#8221;?<\/p>\n<p>In 2002 RLG and OCLC provided seven recommendations to help repositories ensure their trustworthiness. They focus primarily on good management practices including institutional stability and careful documentation of policies and digital holdings. Like the OAIS reference model, there is no requirement for a certain degree of technological sophistication.
The closest it gets to a technical requirement is the declaration that systems must include routine integrity checking and data redundancy, both of which can be managed using external media and fixity checking software. <a href=\"#ref32\" id=\"note32\">[32]<\/a><\/p>\n<p>The first recommendation in the RLG\/OCLC report is to \u201cdevelop a framework and process to support the certification of digital repositories\u201d <a href=\"#ref33\" id=\"note33\">[33]<\/a>, which led to the development of the \u201cTrusted Repositories Audit &#038; Certification: Criteria and Checklist\u201d (TRAC), <a href=\"#ref34\" id=\"note34\">[34]<\/a> which served as the basis for ISO 16363:2012 \u201cAudit and certification of trustworthy digital repositories.\u201d <a href=\"#ref35\" id=\"note35\">[35]<\/a>  ISO 16363 includes three main areas of interest: organizational infrastructure, digital object management, and technical infrastructure and security. All three sections are policy-centric and focus on managing and documenting implementation rather than dictating complex technology infrastructure. Marks notes in &#8220;Becoming a Trusted Digital Repository&#8221; that &#8220;ISO 16363 is not a blueprint [&#8230;]but it does provide a set of useful requirements to keep in mind during the design process.&#8221; <a href=\"#ref36\" id=\"note36\">[36]<\/a> <\/p>\n<p>ISO 16363:2012 and its predecessors are not the only standards to establish requirements for digital repository trustworthiness. The Data Seal of Approval has sixteen core requirements, <a href=\"#ref37\" id=\"note37\">[37]<\/a> the Nestor Seal of Approval extended self-assessment (also related to TRAC) has thirty-four criteria, <a href=\"#ref38\" id=\"note38\">[38]<\/a>  and the Center for Research Libraries attempted to summarize the various requirements in 2007 into their \u201cTen Principles\u201d. <a href=\"#ref39\" id=\"note39\">[39]<\/a> Again, the fundamental requirements across all of these systems are that policies, procedures, and documentation are all in place rather than specific or complex systems. A repository\u2019s attention to effective policy, procedure, and documentation is the driving factor in these standards for trustworthiness, not the sophistication of its hardware or software. <\/p>\n<p>The bar these standards set for certifiable trustworthiness is quite high. This can be rather disheartening to those developing repositories. Certifiable trustworthiness is a laudable goal to strive for, but it takes a great deal of work. It can seem unattainable given the competing demands for an archivist\u2019s attention, but <em>certifiable<\/em> trustworthiness is not necessarily required.<\/p>\n<p>Heather MacNeil states in <em>Trust and Professional Identity: Narratives, Counter-Narratives and Lingering Ambiguities<\/em> \u201cthat the trustworthiness of records is socially negotiated.\u201d <a href=\"#ref40\" id=\"note40\">[40]<\/a>  Some parts of society will trust almost anyone while others will trust almost no one, regardless of the efforts to prove themselves. By extension, a repository\u2019s trustworthiness is also socially negotiated. Greg Bak in his article &#8220;Trusted by whom? TDRs, standards culture and the nature of trust&#8221; illustrates how the existing standards of trust rely on a technocratic view of trust rather than one based on building relationships.
<a href=\"#ref41\" id=\"note41\">[41]<\/a>  Thus, it isn&#8217;t even clear how much influence these standards will have on the trust a repository will receive from their user community. Some research has been done on users&#8217; perception of trust in scientific data repositories, <a href=\"#ref42\" id=\"note42\">[42]<\/a> academic institutional repositories, <a href=\"#ref43\" id=\"note43\">[43]<\/a> and government archives <a href=\"#ref44\" id=\"note44\">[44]<\/a>  but not in other institutional archives nor manuscript repositories.<\/p>\n<p>At one archive a potential donor asked them to describe their capabilities for preserving electronic records. A relatively brief statement (a few paragraphs) was prepared in response to the request which outlined current policies, an admission of gaps, and plans for improvement. Despite an admission of imperfection this honest and open response engendered sufficient trust for the donor to deposit their records with the repository. Although this example is anecdotal, it illustrates that, perhaps, a good faith effort at forward progression can, for some communities, be sufficient to engender trust in the archives within their producers and consumers.<\/p>\n<p>An archives may decide not to formally pursue certified trustworthiness, but trustworthy digital repository standards, such as ISO 16363, can serve as guides for improvement. A repository can perform a self-audit to identify gaps. For example, The University Archives at University of Illinois at Urbana-Champaign performed a self-assessment based on ISO 16363 guided by Steve Marks\u2019 &#8220;Module 8: Becoming a Trustworthy Digital Repository&#8221; and concluded that<\/p>\n<blockquote><p>&#8220;conducting an informal self-assessment not only helped us to better understand our own preservation operations, it also provided significant insight into the frameworks, policies, and practices that UA and DP will need to put in place if we wish to be successful and trustworthy.&#8221; <a href=\"#ref45\" id=\"note45\">[45]<\/a> <\/p><\/blockquote>\n<p>Self-assessments are useful exercises, but they may still leave the archivist wondering which of their many gaps they should address first.<\/p>\n<h2>Where to From Here? Developing Repository Maturity<\/h2>\n<p>Digital repositories mature through multiple stages of development. All digital repositories begin with the inclination that something needs to be done. Capability grows gradually until a robust digital repository, in terms of both organizational and technical infrastructure, has matured into a full digital preservation program.  The three main repository maturity models used to describe the process of becoming a mature digital repository are Kenney and McGovern\u2019s Five Organizational Stages, <a href=\"#ref46\" id=\"note46\">[46]<\/a> the Digital Preservation Coalition\u2019s Rapid Assessment Model (DPC RAM), <a href=\"#ref47\" id=\"note47\">[47]<\/a> and the National Digital Stewardship Alliance&#8217;s (NDSA) Levels of Preservation. <a href=\"#ref48\" id=\"note48\">[48]<\/a> <\/p>\n<h3>Five Organizational Stages of Digital Preservation<\/h3>\n<p>Kenny and McGovern&#8217;s Five Organizational Stages focus squarely on the maturity of a repository&#8217;s organizational infrastructure. 
They note &#8220;a real key in assessing digital preservation is to understand the organizational impediments to digital preservation practice&#8221; and that &#8220;technology is not the solution, only part of it.&#8221; As the name indicates, their maturity model is divided into five stages of organizational development: Acknowledge, Act, Consolidate, Institutionalize, and Externalize. In addition to defining these stages, McGovern and Kenney provide key indicators to help repositories recognize when they have reached a particular stage, including policy and planning, technological infrastructure (again, as principles rather than specific technologies), and the content and use being supported.<\/p>\n<p>A repository reaches the Acknowledge stage when they openly and honestly accept that digital preservation is not only a theoretical concern, but a practical concern that their repository must address. This stage reflects a repository&#8217;s acceptance of their responsibility for the digital materials within their stewardship. How many readers have received digital media with a records transfer or accession and then turned a blind eye? Or been asked about capturing a website or social media account and then simply denied having the capability and\/or quickly changed the subject? In 2008, Susan Davis quoted a survey respondent who stated &#8220;We are passively accepting born-digital materials [&#8230;] All planning, policy, etc. take a back seat to day-to-day efforts to keep up with basic activities.&#8221; <a href=\"#ref49\" id=\"note49\">[49]<\/a> Although this repository accepted digital materials, they ignored them. Kenney and McGovern wrote &#8220;ownership by itself [&#8230;]is insufficient.&#8221; This repository had not yet acknowledged a responsibility for preservation. A repository can acknowledge their responsibility by having an honest conversation among staff and with donors about the implications of accepting born-digital materials, either in a theoretical overview or as a specific case study, and then committing resources (for example, staff time for drafting policy, if nothing else) to move forward.<\/p>\n<p>The Act stage, doing something to address digital preservation challenges, naturally flows from acknowledgement. McGovern and Kenney describe this stage as being project-oriented. An institution could start with an inventory project (where are all the media? what websites or social media accounts need to be captured?) and then move on to projects such as acquiring adequate storage, migrating media, and web-capture, as appropriate. This is the bootstrapping stage where the initial policies and infrastructure are being developed. At least two Presidents of the Society of American Archivists issued calls-to-action which embody this stage. The first came from Richard Pearce-Moses in 2006, who stated &#8220;many [archivists] feel paralyzed because they don\u2019t know what to do. The key is: We cannot let the perfect be the enemy of the possible. Ask yourself, What can I do today? No matter how small, do something.&#8221;  <a href=\"#ref50\" id=\"note50\">[50]<\/a>  Helen Tibbo reiterated the challenge in 2011 to &#8220;take some steps and do something to preserve digital content important to your collection and your users.&#8221; <a href=\"#ref51\" id=\"note51\">[51]<\/a><\/p>\n<p>The Consolidate stage is where the dust of initial project-based development begins to settle and ad-hoc activities transition into programmatic policies and workflows built on now-established infrastructure.
The institution has transitioned from &#8220;we should be doing this&#8221; and &#8220;we are doing something about it&#8221; to &#8220;this is a regular part of what we do.&#8221; It does not mean everything is perfect, ideal, or sophisticated. It may (and initially, probably will) be terribly insufficient compared to where an archives wants to be. That is fine, for now. The point is that digital preservation activities are now regular operations rather than exceptional projects. Improvement at this stage often comes from small progressive enhancements to existing operations. An archives may still engage in large projects, but there is the expectation that the project results will be integrated with current infrastructure and workflows, not left as a stand-alone proof-of-concept.<\/p>\n<p>In the Institutionalize stage, digital preservation grows beyond the archives and incorporates the parent institution as well. The metaphorical airplane has lost cabin pressure (digital preservation is a problem for everyone now), the archives has put on their oxygen mask (digital preservation in the archive is programmatic), and now it can help others. For archives situated inside academic libraries, such as McGovern and Kenney&#8217;s Cornell example, this involves engaging the larger library in digital repository efforts to support a broader range of content such as electronic theses and dissertations, scholarly publications, and research data. For corporate, government, and other archives this stage may be characterized as digital preservation outreach to the organization&#8217;s constituent parts to encourage good digital preservation practices.  <a href=\"#ref52\" id=\"note52\">[52]<\/a> McGovern and Kenney also include in this stage aligning local digital preservation policies with external standards and guidelines such as OAIS and TDR, as well as gap analysis; although gap analysis, at least informally, may be introduced as early as the Act stage as it can inform first steps. <\/p>\n<p>The Externalize stage focuses on building partnerships with other like-minded institutions and consortia. McGovern and Kenney cite a number of consortia as examples such as LOCKSS and the California Digital Library. This stage reflects a confidence in a repository&#8217;s organizational and technical capability and a trust that their partners are sufficiently capable to cooperate with them.  <a href=\"#ref53\" id=\"note53\">[53]<\/a> Although McGovern and Kenney place these collaborations in the last stage of maturity, an archives may choose to find a consortium early on in their development in order to leverage their members\u2019 experience and capabilities. In either case, the organization is making a commitment to abide by the standards and expectations of those consortia. An archivist must ensure the repository understands the implications of any such association before committing to it.<\/p>\n<p>McGovern and Kenney&#8217;s stages describe an archive&#8217;s administrative capacity to accept their responsibility for born-digital content, take action, systematize, expand, and collaborate. The focus is on institutional readiness and maturity.
An archivist should consider their own repository in light of these stages, and review the key indicators McGovern and Kenney provide in their chapter.<\/p>\n<h3>National Digital Stewardship Alliance Levels of Digital Preservation<\/h3>\n<p>In 2012 the National Digital Stewardship Alliance Levels team concluded that none of the existing repository assessment models &#8220;specifically addressed the need for practical technical guidance when a preservationist takes preliminary first steps or builds on steps already taken.&#8221; They then determined to produce a non-judgmental &#8220;matrix of activities that were detailed enough to be meaningful, while still being succinct enough to fit on a single page&#8221;. <a href=\"#ref54\" id=\"note54\">[54]<\/a> The goal was to emphasize steps a practitioner can take, today if possible, to improve their digital repository, &#8220;not the social or legal structure that would be in place to sustain those activities,&#8221; a departure from other policy and workflow-heavy models. <a href=\"#ref55\" id=\"note55\">[55]<\/a><\/p>\n<p>As with the other models, the NDSA Levels are &#8220;agnostic towards both content type and technology&#8221; while setting &#8220;a community-approved minimum level of prerequisites.&#8221;  <a href=\"#ref56\" id=\"note56\">[56]<\/a> The first level  <a href=\"#ref57\" id=\"note57\">[57]<\/a> can be implemented with simple infrastructure. More sophisticated technology environments may be necessary for the higher levels which focus on automation, but a lack of technical resources should not prevent an archive from doing something beneficial now.<\/p>\n<p>The Levels of Preservation matrix describes four numbered levels (or tiers)  <a href=\"#ref58\" id=\"note58\">[58]<\/a> that &#8220;progressively reduce various risks to digital materials&#8221;  <a href=\"#ref59\" id=\"note59\">[59]<\/a> across five functional areas (or categories): Storage and Geographic Location, File Fixity and Data Integrity, Information Security, Metadata, and File Formats.<\/p>\n<p>The first functional area, Storage and Geographic Location, begins with migrating data off the original media into a consistent storage system. The levels progress by increasing copies and geographic diversity along with long-term planning and system monitoring. While no specific storage technology is recommended, nor a particular infrastructure required, a &#8220;nearline or online system using &#8230; some combination of spinning disk and magnetic tape&#8221; is suggested as an appropriate option. A repository can effectively make use of offline storage at these earlier stages but doing so requires some manual administration, whereas the nearline and online systems enable the automation described by the more advanced stages.<\/p>\n<p>The File Fixity and Data Integrity functional area addresses data fixity. Keeping data fixed (i.e. immutable) helps to ensure record authenticity. Fixity in current repositories is usually achieved with checksums. The first stage of this functional area requires verifying donor-provided checksums, if any were provided; otherwise an archives generates their own checksums. The levels progress by adding other related activities such as write blockers, virus checking, automation, logging, and ensuring no one person can modify all the copies (either intentionally or unintentionally).
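<\/p>\n<p>A scheduled validation pass over a manifest like the one sketched earlier can be equally small. Again, this is a hypothetical illustration with placeholder file names, not a replacement for purpose-built tools:<\/p>\n<pre><code>
import hashlib
from pathlib import Path

# Same hypothetical directory and manifest names as the earlier sketch.
accession = Path('accession-2025-001')
problems = 0

with open('manifest-sha256.txt', encoding='utf-8') as manifest:
    for line in manifest:
        if not line.strip():
            continue
        recorded, name = line.split(None, 1)
        name = name.strip()
        target = accession.joinpath(name)
        if not target.is_file():
            print('MISSING', name)
            problems += 1
            continue
        # Re-hash the file and compare it to the recorded digest.
        sha256 = hashlib.sha256()
        with target.open('rb') as stream:
            for block in iter(lambda: stream.read(1048576), b''):
                sha256.update(block)
        if sha256.hexdigest() != recorded:
            print('CHANGED', name)
            problems += 1

print('files with fixity problems:', problems)
<\/code><\/pre>\n<p>Off-the-shelf tools cover the same ground with less custom code.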
For example, an archives may begin by creating checksums using a tool like RapidCRC, then setting a regular schedule for validating those checksums, then progress to automated checking on a regular basis using a tool like AV Preserve&#8217;s Fixity, <a href=\"#ref60\" id=\"note60\">[60]<\/a> before adopting a more comprehensive repository software that includes regular fixity auditing as a feature.<\/p>\n<p>In line with the data integrity section, the Information Security functional area secures materials from human harm (again, either intentional or unintentional). Knowing who has access and preventing unnecessary access are the first step. Levels progress by increasing logging and automation.<\/p>\n<p>The Metadata functional area focuses on managing administrative, technical, and descriptive metadata. The first stage starts with inventories that are safely stored. The levels progress by including additional types of preservation metadata. &#8220;In most systems nearly all of this metadata &#8230; can and should be generated and processed computationally and not manually.&#8221; <a href=\"#ref61\" id=\"note61\">[61]<\/a> Not all the metadata need to be of the same format or in the same location and most of the day-to-day tools in use (beyond commercial or Fedora-based repository systems), do not use common schema or consistent locations. Simple repositories will likely begin using several stand-alone tools that generate or record metadata in their own fashion. For example, format migration tools for creating access copies may not produce any logging metadata, so an archivist will have to record them manually somewhere. Manual, non-standard, or stand-alone metadata files are less desirable because they can increase administrative overhead and make auditing more complex, but they are still acceptable. It is better to have metadata in a non-standard format, in a stand-alone file, or manually recorded than none at all. Ideally a system could manage these metadata holistically, but that is a sign of program maturity, not a requirement.<\/p>\n<p>The File Formats functional area addresses format obsolescence by addressing the degree of programmatic emphasis rather than format endangerment (risk of obsolescence). The first level encourages archivists to advocate when possible for donors to use open formats as this advocacy can reduce the degree of format complexity and diversity the archive will have to work with later. The levels progress as an archives turns their attention to inventorying file formats in their collections, monitoring format endangerment, <a href=\"#ref62\" id=\"note62\">[62]<\/a> and then attending to format migrations or emulation solutions in the last level.  <\/p>\n<p>NDSA plans to update the Levels document as the needs of the digital preservation community change. For example, Shira Peltzman proposed an additional Access Functional Area. <a href=\"#ref63\" id=\"note63\">[63]<\/a> This new functional area would begin by ensuring the materials are secured from harm during access (such as a write-protected media or workstation) and matures as the access mechanism automates and provides richer access environments. This proposal was taken up by the Digital Libraries Federation (DLF) Born-Digital Access Working Group resulting in their own Five-Levels report. <a href=\"#ref64\" id=\"note64\">[64]<\/a><\/p>\n<p>The NDSA report lists several uses for the Levels document including community consensus building and communicating with stakeholders. 
<a href=\"#ref65\" id=\"note65\">[65]<\/a> The fifth use, &#8220;assess compliance with preservation best practices and identify key areas to improve,&#8221; is in line with the other models this paper has discussed. The NDSA Levels&#8217; authors emphasize that this is not intended to give a single score; an archives is free to investigate a single functional area of interest or review all of them, as the archives sees fit, resulting in &#8220;broad areas to improve, identify areas of service excellence and pinpoint specific enhancements&#8230; and to track progress over time.&#8221; <a href=\"#ref66\" id=\"note66\">[66]<\/a> All of this from a single-page maturity model. An archivist could consider this a &#8220;quick start&#8221; digital repository maturity assessment.<\/p>\n<h3>Digital Preservation Coalition Rapid Assessment Model<\/h3>\n<p>The Digital Preservation Coalition\u2019s Rapid Assessment Model (DPC RAM) was designed \u201cto enable a rapid assessment of an organization\u2019s digital preservation capability whilst remaining agnostic to solutions and strategy.\u201d <a href=\"#ref67\" id=\"note67\">[67]<\/a> It was primarily based on the Digital Preservation Maturity Model found in Adrian Brown&#8217;s book \u201cPractical Digital Preservation: A How-to Guide for Organizations of Any Size\u201d and informed by other maturity models, including the NDSA model discussed in the previous section. <a href=\"#ref68\" id=\"note68\">[68]<\/a> It has received two updates since its first publication in 2019 based on community feedback and \u201cevolving digital preservation good practice.\u201d <a href=\"#ref69\" id=\"note69\">[69]<\/a> <\/p>\n<p>DPC RAM uses five levels of maturity which balance the semantic scaling of the McGovern and Kenney model and the purely numerical progressive nature of the NDSA model. Accordingly, it doesn\u2019t scale the same way as either. DPC RAM begins with a zero level, \u201cMinimal Awareness\u201d which essentially translates to \u201cI\u2019ve heard of it\u201d or \u201cI\u2019m just now hearing about this.\u201d At this level the DPC RAM may be the organization\u2019s realization that this capability needs to be addressed. This pre-level leads naturally into level one, \u201cAwareness\u201d which translates to an educated understanding of the basic principles and why it is necessary. This level matches McGovern and Kenney\u2019s first level, Acknowledge, as it represents an understanding that this capability is the organization\u2019s responsibility and the need to act. The following levels are less tightly coupled. DPC RAM focuses on increased capability from two to four, labeled \u201cBasic,\u201d \u201cManaged,\u201d and \u201cOptimized\u201d. The \u201cBasic\u201d and \u201cManaged\u201d levels could loosely be compared to McGovern and Kenney\u2019s \u201cAct\u201d and \u201cConsolidate\u201d stages where the actions taken establish those basic capabilities and the consolidation process creates the capabilities for managing them. 
Beyond that the \u201cOptimized\u201d level takes on aspects of both McGovern and Kenney\u2019s \u201cInstitutionalization\u201d and \u201cExternalization\u201d stages but emphasizes proactive management of the capability: focusing future capabilities and challenges and building in continuous improvement.<\/p>\n<p>While McGovern and Kenney\u2019s model addresses an organization\u2019s digital preservation program as a whole, the DPC RAM (like the other models that influenced it, such as the NDSA model) recognizes that an organization\u2019s maturity is multi-dimensional. Different aspects of the organization\u2019s capabilities will mature at different times and rates depending on the organization\u2019s current priorities and most pressing needs. DPC RAM even builds this on-going evaluation and continuous improvement into their model as one of the assessment areas. These capabilities are divided into two primary sections: Organizational capabilities and Service Capabilities. The Organizational capabilities focus on the ability to manage the digital preservation endeavor, for example policy, legal compliance, and information technology resourcing. These capabilities cover most of the areas in which McGovern and Kenney\u2019s model are concerned: policy, strategy, and community. Service capabilities focus on the processes used, such as acquisition, preservation actions, and metadata management. These capabilities are more closely aligned with the \u201chands-on\u201d digital preservation concerns highlighted by the NDSA model.<\/p>\n<p>As complex or as detailed as the model may sound, the model is laid out so that a relatively quick self-assessment can be performed in an afternoon, depending on how familiar one is with the capabilities and organization being assessed and still result in a rough idea of where a repository stands and where effort will be best spent moving forward. Moving forward is the critical part. One of the driving principles of the DPC RAM is continuous improvement. The instructions explicitly suggest that organizations take the assessment results and consider what \u201clevel they would like to achieve in the future\u201d to \u201cincrease [their] understanding of gaps and priorities for moving forward.\u201d <a href=\"#ref70\" id=\"note70\">[70]<\/a> This is, perhaps, the most important aspect: helping an archives reach informed discussions about their repository&#8217;s maturity, where they need to prioritize development, and making plans to do so. <\/p>\n<h2>Conclusion<\/h2>\n<p>While the labels, emphases, and details vary between all three maturity models, they are all useful in helping an archives consider where their digital repository stands and what can be done next to grow and develop. The Center for Research Libraries\u2019 \u201cTen Principles\u201d note that \u201cfor repositories of all types and sizes preservation activities must be scaled to the needs and means of their respective designated community or communities.\u201d <a href=\"#ref71\" id=\"note71\">[71]<\/a> No two archives are the same. Policies and procedures should be adapted to meet local needs. As Williams and Berilla stated when documenting their repository&#8217;s journey to implement digital archiving:<\/p>\n<blockquote><p>&#8220;practicality often trumps theory, and a middle ground of digital content management must be contemplated. Because of their own idiosyncrasies, institutions must cherry pick among best practices for what works for them. 
In essence, every institution must develop a unique plan.&#8221;<\/p><\/blockquote>\n<p> <a href=\"#ref72\" id=\"note72\">[72]<\/a><\/p>\n<p>When we consider &#8220;young&#8221; analog archives, they may be very simple: a few boxes of materials in a closet with a simple inventory. Most archives have humble beginnings. Over the years, through much advocacy, planning, and resourcefulness, these archives grow in size and capability. Professionals do not (or should not) disparage the work of an archivist following archival principles by saying they do not have a &#8220;real&#8221; archives because their nascent repository is immature. They have a real archives, one that we hope will succeed and become more mature over time through the archivist&#8217;s diligent effort. <\/p>\n<p>Simple digital repositories are still &#8220;real&#8221; even if, perhaps, immature. They can mature if given the proper attention and resources. Chris Prom describes his own digital preservation journey and the steps he took toward repository development in a keynote address for the 2011 Society of Georgia Archivists annual meeting. <a href=\"#ref73\" id=\"note73\">[73]<\/a> The address describes a &#8220;Do-it-Yourself Trusted Digital Repository&#8221; for which &#8220;most repositories already have what they need&#8221; and which his repository used while a more sophisticated repository application was in development. <a href=\"#ref74\" id=\"note74\">[74]<\/a> Prom&#8217;s experience, though not framed within the maturity models this article discussed, does follow the same patterns: accepting responsibility, identifying needs, finding solutions to meet those needs, and making improvements as capability and resources increase. <\/p>\n<p>In 2013 the &#8220;Jump In Initiative&#8221; hosted by the Society of American Archivists&#8217; Manuscripts Repositories Section encouraged archivists to &#8220;Jump In to managing born-digital content,&#8221; <a href=\"#ref75\" id=\"note75\">[75]<\/a>  spurring repositories to shift from the Acknowledge stage of Kenney and McGovern&#8217;s model to the Act stage. The participation rules stated that &#8220;participants must be from an institution without an electronic records program in place&#8221; to target repositories that had not yet reached the Consolidate stage. Twenty-three repositories reported back on their efforts to survey their born-digital content. <a href=\"#ref76\" id=\"note76\">[76]<\/a> None of these repositories began with the technical infrastructure typically associated with digital repositories. Nevertheless, they were digital repositories in the process of maturing.<\/p>\n<p>Archivists want their digital repositories to be mature now, but maturity only comes through dedicated effort, not with the wave of a Blue Fairy&#8217;s wand. Digital repositories can, and should, start with simple infrastructure and simple processes. What matters right now is that repositories make progress. Describing a digital repository&#8217;s status in terms of a maturity model allows archivists to be honest about where they stand while still being respectful of our small and simple, but very real, digital repositories.<\/p>\n<h2>Notes<\/h2>\n<p><a href=\"#note1\" id=\"ref1\">[1]<\/a> Pearce-Moses, Richard. \u201cRepository.\u201d A Glossary of Archival and Records Terminology. Chicago, IL: The Society of American Archivists, 2005.
<a href=\"https:\/\/www2.archivists.org\/glossary\/terms\/r\/repository\">https:\/\/www2.archivists.org\/glossary\/terms\/r\/repository<\/a>.<\/p>\n<p><a href=\"#note2\" id=\"ref2\">[2]<\/a> Ockerbloom, John Mark. \u201cEverybody\u2019s Repositories (First of a Series).\u201d Everybody\u2019s Libraries, May 7, 2008. <a href=\"https:\/\/everybodyslibraries.com\/2008\/05\/07\/everybodys-repositories-first-of-a-series\/\">https:\/\/everybodyslibraries.com\/2008\/05\/07\/everybodys-repositories-first-of-a-series\/<\/a>.<\/p>\n<p><a href=\"#note3\" id=\"ref3\">[3]<\/a> Pearce-Moses, Richard. \u201cArchives.\u201d A Glossary of Archival and Records Terminology. Chicago, IL: The Society of American Archivists, 2005. <a href=\"https:\/\/www2.archivists.org\/glossary\/terms\/a\/archives\">https:\/\/www2.archivists.org\/glossary\/terms\/a\/archives<\/a>.<\/p>\n<p><a href=\"#note4\" id=\"ref4\">[4]<\/a> Adrian Cunningham. &#8220;Digital Curation\/Digital Archiving: A View from the National Archives of Australia.&#8221; The American Archivist, Fall\/Winter 2008, Vol. 71, No. 2, p. 533. <a href=\"https:\/\/doi.org\/10.17723\/aarc.71.2.p0h0t68547385507\">https:\/\/doi.org\/10.17723\/aarc.71.2.p0h0t68547385507<\/a>.<\/p>\n<p><a href=\"#note5\" id=\"ref5\">[5]<\/a> Joanne Kaczmarek et al., \u201cUsing the Audit Checklist for the Certification of a Trusted Digital Repository as a Framework for Evaluating Repository Software Applications: A Progress Report,\u201d D-Lib Magazine 12, no. 12 (December 2006), <a href=\"https:\/\/doi.org\/10.1045\/december2006-kaczmarek\">https:\/\/doi.org\/10.1045\/december2006-kaczmarek<\/a>.<\/p>\n<p><a href=\"#note6\" id=\"ref6\">[6]<\/a> Consultative Committee for Space Data Systems. Reference Model for an Open Archival Information System (OAIS). CCSDS, 650.0-M-3. Washington, DC: CCSDS Secretariat, 2024. <a href=\"https:\/\/ccsds.org\/wp-content\/uploads\/gravity_forms\/5-448e85c647331d9cbaf66c096458bdd5\/2025\/01\/\/650x0m3.pdf\">https:\/\/ccsds.org\/wp-content\/uploads\/gravity_forms\/5-448e85c647331d9cbaf66c096458bdd5\/2025\/01\/\/650x0m3.pdf<\/a>. An earlier version was adopted as ISO Standard 14721:2012.<\/p>\n<p><a href=\"#note7\" id=\"ref7\">[7]<\/a> If, however, the reader is not familiar with the model, there are many existing resources that summarize or explain OAIS including both: Brian Lavoie. \u201cThe Open Archival Information System (OAIS) Reference Model: Introductory Guide (2nd Edition).\u201d Digital Preservation Coalition, October 1, 2014. <a href=\"https:\/\/doi.org\/10.7207\/twr14-02\">https:\/\/doi.org\/10.7207\/twr14-02<\/a>. and Ockerbloom, John Mark. \u201cWhat Repositories Do: The OAIS Model.\u201d Everybody\u2019s Libraries, October 13, 2008. <a href=\"http:\/\/everybodyslibraries.com\/2008\/10\/13\/what-repositories-do-the-oais-model\/\">http:\/\/everybodyslibraries.com\/2008\/10\/13\/what-repositories-do-the-oais-model\/<\/a>.<\/p>\n<p><a href=\"#note8\" id=\"ref8\">[8]<\/a> The author counted over one-hundred articles across several journals, including American Archivist, Archivaria, D-Lib, and Archival Science, that referenced OAIS, with many of them falling into this category. <\/p>\n<p><a href=\"#note9\" id=\"ref9\">[9]<\/a> For example, Ronald Jantz and Michael J. Giarlo, \u201cDigital Preservation: Architecture and Technology for Trusted Digital Repositories,\u201d D-Lib Magazine 11, no. 
06 (June 2005), <a href=\"https:\/\/doi.org\/10.1045\/june2005-jantz\">https:\/\/doi.org\/10.1045\/june2005-jantz<\/a> briefly discusses the OAIS object models with regard to submitting METS-based submission information packages to their Fedora-based repository; Mary Vardigan and Cole Whiteman, &#8220;ICPSR meets OAIS: Applying the OAIS reference model to the social science archive context,&#8221; Archival Science 7 (2007), 73-87, <a href=\"http:\/\/hdl.handle.net\/2027.42\/60440\">http:\/\/hdl.handle.net\/2027.42\/60440<\/a> does a good job describing OAIS in general terms for context and then compares the ICPSR&#8217;s system to OAIS; and Timothy Arnold and Walker Sampson, &#8220;Preserving the Voices of Revolution: Examining the Creation and Preservation of a Subject-Centered Collection of Tweets from the Eighteen Days in Egypt,&#8221; American Archivist vol. 77, no. 2 (Fall\/Winter 2014), pp. 510-533. <a href=\"https:\/\/doi.org\/10.17723\/aarc.77.2.794404552m67024n\">https:\/\/doi.org\/10.17723\/aarc.77.2.794404552m67024n<\/a> describes a Twitter tweet in terms of the OAIS information package.<\/p>\n<p><a href=\"#note10\" id=\"ref10\">[10]<\/a> Courtney C. Mumma, Glenn Dingwall, Sue Bigelow, &#8220;A First Look at the Acquisition and Appraisal of the 2010 Olympic and Paralympic Winter Games Fonds: or, SELECT * FROM VANOC_Records AS Archives WHERE Value=\u201ctrue\u201d;&#8221;, Archivaria 72 (Fall 2011).<\/p>\n<p><a href=\"#note11\" id=\"ref11\">[11]<\/a> OAIS, pg 3-1.<\/p>\n<p><a href=\"#note12\" id=\"ref12\">[12]<\/a> See also: Consultative Committee for Space Data Systems. Producer-Archive Interface Methodology Abstract Standard (PAIMAS). CCSDS, 651.0-M-1. Washington, DC: CCSDS Secretariat, 2004. https:\/\/public.ccsds.org\/Pubs\/651x0m1.pdf. Adopted as ISO Standard 20652:2006.<\/p>\n<p><a href=\"#note13\" id=\"ref13\">[13]<\/a> OAIS, pg 1-3.<\/p>\n<p><a href=\"#note14\" id=\"ref14\">[14]<\/a> AtoM and ArchivesSpace can be found at <a href=\"https:\/\/www.accesstomemory.org\">https:\/\/www.accesstomemory.org<\/a> and <a href=\"http:\/\/archivesspace.org\/\">http:\/\/archivesspace.org\/<\/a>, respectively.<\/p>\n<p><a href=\"#note15\" id=\"ref15\">[15]<\/a> In the case of digital materials, this refers to the bits that make up the digital object; although OAIS allows physical objects to be classified as data objects, too.<\/p>\n<p><a href=\"#note16\" id=\"ref16\">[16]<\/a> DROID <<a href=\"http:\/\/www.nationalarchives.gov.uk\/information-management\/manage-information\/preserving-digital-records\/droid\/\">http:\/\/www.nationalarchives.gov.uk\/information-management\/manage-information\/preserving-digital-records\/droid\/<\/a>> and FITS <<a href=\"http:\/\/fitstool.org\">http:\/\/fitstool.org<\/a>>.<\/p>\n<p><a href=\"#note17\" id=\"ref17\">[17]<\/a> Many of the records archives collect are either unstructured (e.g. images, audio, office documents) or include semantics as part of the format (e.g. email headers, XML-based documents, Twitter Tweet objects) and do not require additional documentation of semantic information. Other formats, such as spreadsheets or databases, structure data but do not inherently carry semantic information with them. For example, it is not always clear what the values in a spreadsheet column are supposed to mean. These should be documented somewhere, e.g. in an associated README file and a DACS (Describing Archives: A Content Standard) Technical Access Note. 
<\/p>\n<p><a href=\"#note18\" id=\"ref18\">[18]<\/a> <a href=\"https:\/\/www.ov2.eu\/programs\/rapidcrc-unicode\">https:\/\/www.ov2.eu\/programs\/rapidcrc-unicode<\/a><\/p>\n<p><a href=\"#note19\" id=\"ref19\">[19]<\/a> <a href=\"https:\/\/github.com\/digitalpowrr\/DataAccessioner\">https:\/\/github.com\/digitalpowrr\/DataAccessioner<\/a><\/p>\n<p><a href=\"#note20\" id=\"ref20\">[20]<\/a> OAIS, Section 4.3.2.4.4 (pg 4-37).<\/p>\n<p><a href=\"#note21\" id=\"ref21\">[21]<\/a> <a href=\"https:\/\/wiki.archivematica.org\/AIP_structure\">https:\/\/wiki.archivematica.org\/AIP_structure<\/a><\/p>\n<p><a href=\"#note22\" id=\"ref22\">[22]<\/a> OAIS, pg 4-4.<\/p>\n<p><a href=\"#note23\" id=\"ref23\">[23]<\/a> OAIS, section 4.2.3.3.<\/p>\n<p><a href=\"#note24\" id=\"ref24\">[24]<\/a> OAIS, section 4.2.3.4.<\/p>\n<p><a href=\"#note25\" id=\"ref25\">[25]<\/a> For good introduction to fixity checking see: Paula De Stefano et al., \u201cChecking Your Digital Content: What Is Fixity, and When Should I Be Checking It?\u201d (National Digital Stewardship Alliance, 2014), <a href=\"http:\/\/hdl.loc.gov\/loc.gdc\/lcpub.2013655117.1\">http:\/\/hdl.loc.gov\/loc.gdc\/lcpub.2013655117.1<\/a>.<\/p>\n<p><a href=\"#note26\" id=\"ref26\">[26]<\/a> OAIS, section 4.2.3.5.<\/p>\n<p><a href=\"#note27\" id=\"ref27\">[27]<\/a> <a href=\"http:\/\/www.archivesspace.org\">http:\/\/www.archivesspace.org<\/a><\/p>\n<p><a href=\"#note28\" id=\"ref28\">[28]<\/a> OAIS, sections 4.2.3.6 and 4.2.3.7.<\/p>\n<p><a href=\"#note29\" id=\"ref29\">[29]<\/a> OAIS, section 4.2.3.8.<\/p>\n<p><a href=\"#note30\" id=\"ref30\">[30]<\/a> Heather MacNeil, \u201cProviding Grounds for Trust: Developing Conceptual Requirements for the Long-Term Preservation of Authentic Electronic Records,\u201d Archivaria 50 (January 1, 2000), <a href=\"http:\/\/archivaria.ca\/index.php\/archivaria\/article\/view\/12765\">http:\/\/archivaria.ca\/index.php\/archivaria\/article\/view\/12765<\/a>, pg. 72. See also MacNeil, Heather. \u201cTrust and Professional Identity: Narratives, Counter-Narratives and Lingering Ambiguities.\u201d Archival Science 11 (October 12, 2011): 175\u201392. <a href=\"https:\/\/doi.org\/10.1007\/s10502-011-9150-5\">https:\/\/doi.org\/10.1007\/s10502-011-9150-5<\/a>. <\/p>\n<p><a href=\"#note31\" id=\"ref31\">[31]<\/a> Kirschenbaum, Matthew G., Richard Ovenden, Gabriela Redwine, and Rachel Donahue. Digital Forensics and Born-Digital Content in Cultural Heritage Collections. CLIR Publication 149. Washington, D.C: Council on Library and Information Resources, 2010. pg 29.<\/p>\n<p><a href=\"#note32\" id=\"ref32\">[32]<\/a> For example, a repository could wrap content in a Bag-It bag and place copies on at least two external hard drives which could periodically be retrieved and validated using Bagger or another command-line tool according to a set schedule and tracked using a spreadsheet. See the discussion of OAIS and fixity earlier in this article.<\/p>\n<p><a href=\"#note33\" id=\"ref33\">[33]<\/a> Research Libraries Group. \u201cTrusted Digital Repositories: Attributes and Responsibilities.\u201d Mountain View, California: RLG, Inc., 2002. <a href=\"http:\/\/www.oclc.org\/content\/dam\/research\/activities\/trustedrep\/repositories.pdf\">http:\/\/www.oclc.org\/content\/dam\/research\/activities\/trustedrep\/repositories.pdf<\/a>. pg 35.<\/p>\n<p><a href=\"#note34\" id=\"ref34\">[34]<\/a> Center for Research Libraries (CRL), Online Computer Library Center (OCLC), and National Archives and Records Administration. 
\u201cTrustworthy Repositories Audit &#038; Certification: Criteria and Checklist.\u201d OCLC and CRL, February 2007. <a href=\"http:\/\/www.crl.edu\/sites\/default\/files\/attachments\/pages\/trac_0.pdf\">http:\/\/www.crl.edu\/sites\/default\/files\/attachments\/pages\/trac_0.pdf<\/a>.<\/p>\n<p><a href=\"#note35\" id=\"ref35\">[35]<\/a> International Organization for Standardization. \u201cSpace Data and Information Transfer Systems &#8211; Audit and Certification of Trustworthy Digital Repositories.\u201d ISO 16363:2012. Geneva, Switzerland: ISO, 2012. <a href=\"https:\/\/www.iso.org\/standard\/56510.html\">https:\/\/www.iso.org\/standard\/56510.html<\/a>. Also available as CCSDS 652.0-M-1 at <a href=\"https:\/\/public.ccsds.org\/Pubs\/652x0m1.pdf\">https:\/\/public.ccsds.org\/Pubs\/652x0m1.pdf<\/a>.<\/p>\n<p><a href=\"#note36\" id=\"ref36\">[36]<\/a> Steve Marks, Becoming a Trusted Digital Repository (Society of American Archivists, 2015), p. 4. <\/p>\n<p><a href=\"#note37\" id=\"ref37\">[37]<\/a> Data Seal of Approval and ICSU World Data System. \u201cCore Trustworthy Data Repositories Requirements.\u201d Data Seal of Approval, November 2016. <a href=\"https:\/\/drive.google.com\/file\/d\/0B4qnUFYMgSc-eDRSTE53bDUwd28\/view\">https:\/\/drive.google.com\/file\/d\/0B4qnUFYMgSc-eDRSTE53bDUwd28\/view<\/a>. See also the Data Seal of Approval website <a href=\"https:\/\/www.datasealofapproval.org\/en\/\">https:\/\/www.datasealofapproval.org\/en\/<\/a>.<\/p>\n<p><a href=\"#note38\" id=\"ref38\">[38]<\/a> Nestor. \u201cAssessment Form for Obtaining the Nestor Seal for Trustworthy Digital Archives.\u201d Nestor, 2017. http:\/\/files.dnb.de\/nestor\/zertifizierung\/Einreichungsformular_EN.docx. See also Nestor. \u201cNestor Seal for Trustworthy Digital Archives.\u201d Nestor, March 8, 2017. <a href=\"http:\/\/www.langzeitarchivierung.de\/Subsites\/nestor\/EN\/Siegel\/siegel_node.html\">http:\/\/www.langzeitarchivierung.de\/Subsites\/nestor\/EN\/Siegel\/siegel_node.html<\/a>. Based on DIN 31644: \u201cCriteria for trustworthy digital archives.\u201d<\/p>\n<p><a href=\"#note39\" id=\"ref39\">[39]<\/a> Center for Research Libraries (CRL). \u201cTen Principles.\u201d Center for Research Libraries Global Resources Network, 2007. <a href=\"https:\/\/www.crl.edu\/archiving-preservation\/digital-archives\/metrics-assessing-and-certifying\/core-re\">https:\/\/www.crl.edu\/archiving-preservation\/digital-archives\/metrics-assessing-and-certifying\/core-re<\/a>.<\/p>\n<p><a href=\"#note40\" id=\"ref40\">[40]<\/a> MacNeil, Heather. \u201cTrust and Professional Identity: Narratives, Counter-Narratives and Lingering Ambiguities.\u201d Archival Science 11 (October 12, 2011): 175\u201392. <a href=\"https:\/\/doi.org\/10.1007\/s10502-011-9150-5\">https:\/\/doi.org\/10.1007\/s10502-011-9150-5<\/a>, pg. 187.<\/p>\n<p><a href=\"#note41\" id=\"ref41\">[41]<\/a> Greg Bak, \u201cTrusted by Whom? TDRs, Standards Culture and the Nature of Trust,\u201d Archival Science 16, no. 4 (December 1, 2016): 373\u2013402, <a href=\"https:\/\/doi.org\/10.1007\/s10502-015-9257-1\">https:\/\/doi.org\/10.1007\/s10502-015-9257-1<\/a>.<\/p>\n<p><a href=\"#note42\" id=\"ref42\">[42]<\/a> Kathleen Fear and Devan Ray Donaldson, \u201cProvenance and Credibility in Scientific Data Repositories,\u201d Archival Science 12, no. 
3 (September 1, 2012): 319\u201339, <a href=\"https:\/\/doi.org\/10.1007\/s10502-012-9172-7\">https:\/\/doi.org\/10.1007\/s10502-012-9172-7<\/a>; Elizabeth Yakel et al., \u201cTrust in Digital Repositories,\u201d International Journal of Digital Curation 8, no. 1 (June 20, 2013): 143\u201356, <a href=\"https:\/\/doi.org\/10.2218\/ijdc.v8i1.251\">https:\/\/doi.org\/10.2218\/ijdc.v8i1.251<\/a>; Ayoung Yoon, \u201cEnd Users\u2019 Trust in Data Repositories: Definition and Influences on Trust Development,\u201d Archival Science 14, no. 1 (March 1, 2014): 17\u201334, <a href=\"https:\/\/doi.org\/10.1007\/s10502-013-9207-8\">https:\/\/doi.org\/10.1007\/s10502-013-9207-8<\/a>.<\/p>\n<p><a href=\"#note43\" id=\"ref43\">[43]<\/a> Adolfo G. Prieto, \u201cFrom Conceptual to Perceptual Reality: Trust in Digital Repositories,\u201d Library Review 58, no. 8 (September 4, 2009): 593\u2013606, <a href=\"https:\/\/doi.org\/10.1108\/00242530910987082\">https:\/\/doi.org\/10.1108\/00242530910987082<\/a>.<\/p>\n<p><a href=\"#note44\" id=\"ref44\">[44]<\/a> Dara M. Price and Johanna J. Smith, \u201cThe Trust Continuum in the Information Age: A Canadian Perspective,\u201d Archival Science 11, no. 3\u20134 (November 1, 2011): 253\u201376, <a href=\"https:\/\/doi.org\/10.1007\/s10502-011-9148-z\">https:\/\/doi.org\/10.1007\/s10502-011-9148-z<\/a>; Sue Childs and Julie McLeod, \u201cA Case Example of Public Trust in Online Records \u2013 The UK Care.Data Programme\u201d (InterPARES Trust Project, February 8, 2015), <a href=\"https:\/\/interparestrust.org\/assets\/public\/dissemination\/EU17_20150802_UKCareDataProgramme_FinalReport_Final.pdf\">https:\/\/interparestrust.org\/assets\/public\/dissemination\/EU17_20150802_UKCareDataProgramme_FinalReport_Final.pdf<\/a>; Michelle Spelay, \u201cTrusted Online Access to Distributed Holdings of Digital Public Records,\u201d Literature Review (InterPARES Trust, September 2016), <a href=\"https:\/\/interparestrust.org\/assets\/public\/dissemination\/AA05_20160915_TrustedOnlineAccessPublicRecords_LiteratureReview_FINAL.pdf\">https:\/\/interparestrust.org\/assets\/public\/dissemination\/AA05_20160915_TrustedOnlineAccessPublicRecords_LiteratureReview_FINAL.pdf<\/a>.<\/p>\n<p><a href=\"#note45\" id=\"ref45\">[45]<\/a> Bethany Anderson, \u201cAppendix A: Case Study: University Archives, University of Illinois at Urbana-Champaign,\u201d in Module 8: Becoming a Trusted Digital Repository, by Steve Marks, Trends in Archival Practice (Chicago, Il: Society of American Archivists, 2015), 60\u201364, <a href=\"https:\/\/www2.archivists.org\/sites\/all\/files\/Module_8_CaseStudy_BethanyAnderson.pdf\">https:\/\/www2.archivists.org\/sites\/all\/files\/Module_8_CaseStudy_BethanyAnderson.pdf<\/a>. See also their earlier article: Joanne Kaczmarek et al., \u201cUsing the Audit Checklist for the Certification of a Trusted Digital Repository as a Framework for Evaluating Repository Software Applications: A Progress Report,\u201d D-Lib Magazine 12, no. 12 (December 2006), <a href=\"https:\/\/doi.org\/10.1045\/december2006-kaczmarek\">https:\/\/doi.org\/10.1045\/december2006-kaczmarek<\/a>.<\/p>\n<p><a href=\"#note46\" id=\"ref46\">[46]<\/a> Kenney, Anne R., and Nancy Y. McGovern. \u201cThe Five Organizational Stages of Digital Preservation.\u201d In Digital Libraries: A Vision for the 21st Century: A Festschrift in Honor of Wendy Lougee on the Occasion of Her Departure from the University of Michigan, edited by Patricia Hodges. Michigan Publishing, University of Michigan Library, 2003. 
<a href=\"https:\/\/doi.org\/10.3998\/spobooks.bbv9812.0001.001\">https:\/\/doi.org\/10.3998\/spobooks.bbv9812.0001.001<\/a>.<\/p>\n<p><a href=\"#note47\" id=\"ref47\">[47]<\/a> Digital Preservation Coalition Rapid Assessment Model (version 3 &#8211; March 2024) <a href=\"http:\/\/doi.org\/10.7207\/dpcram24-03.\">http:\/\/doi.org\/10.7207\/dpcram24-03.<\/a><\/p>\n<p><a href=\"#note48\" id=\"ref48\">[48]<\/a> See Phillips, Megan, Jefferson Bailey, Andrea Goethals, and Trevor Owens. \u201cThe NDSA Levels of Digital Preservation: An Explanation and Uses.\u201d National Digital Stewardship Alliance, 2013. <a href=\"http:\/\/ndsa.org\/documents\/NDSA_Levels_Archiving_2013.pdf\">http:\/\/ndsa.org\/documents\/NDSA_Levels_Archiving_2013.pdf<\/a>. See also <a href=\"http:\/\/ndsa.org\/activities\/levels-of-digital-preservation\/\">http:\/\/ndsa.org\/activities\/levels-of-digital-preservation\/<\/a>.<\/p>\n<p><a href=\"#note49\" id=\"ref49\">[49]<\/a> Davis, Susan. \u201cElectronic Records Planning in \u2018Collecting\u2019 Repositories.\u201d The American Archivist 71, no. 1 (April 1, 2008): 167\u201389. <a href=\"https:\/\/doi.org\/10.17723\/aarc.71.1.024q2020828t7332\">https:\/\/doi.org\/10.17723\/aarc.71.1.024q2020828t7332<\/a>, pg 180.<\/p>\n<p><a href=\"#note50\" id=\"ref50\">[50]<\/a> Richard Pearce-Moses, \u201cJanus in Cyberspace: Archives on the Threshold of the Digital Era,\u201d The American Archivist 70, no. 1 (January 1, 2007): 13\u201322, <a href=\"https:\/\/doi.org\/10.17723\/aarc.70.1.n7121165223j6t83\">https:\/\/doi.org\/10.17723\/aarc.70.1.n7121165223j6t83<\/a>.<\/p>\n<p><a href=\"#note51\" id=\"ref51\">[51]<\/a> Helen Tibbo, \u201cOn the Occasion of SAA\u2019s Diamond Jubilee: A Profession Coming of Age in the Digital Era (with an Introduction by Jane Kenamore),\u201d The American Archivist 75, no. 1 (April 1, 2012): 16\u201334, <a href=\"https:\/\/doi.org\/10.17723\/aarc.75.1.a054u0t82478x41v\">https:\/\/doi.org\/10.17723\/aarc.75.1.a054u0t82478x41v<\/a>.<\/p>\n<p><a href=\"#note52\" id=\"ref52\">[52]<\/a> For example, reaching out to offices of origin to discuss the items listed in the InterPares 2 Project&#8217;s  &#8220;Creator Guidelines&#8221; <a href=\"http:\/\/www.interpares.org\/public_documents\/ip2(pub)creator_guidelines_booklet.pdf\">http:\/\/www.interpares.org\/public_documents\/ip2(pub)creator_guidelines_booklet.pdf<\/a>.<\/p>\n<p><a href=\"#note53\" id=\"ref53\">[53]<\/a> For more on formalizing trust between collaborative partners see F. Berman, A. Kozbial, R. H. McDonald, and B. Schottlaender, \u201cThe Need for Formalized Trust in Digital Repository Collaborative Infrastructure,\u201d Proceedings of the NSF\/JISC Repository Workshop (2007). <a href=\"https:\/\/libraries.ucsd.edu\/chronopolis\/_files\/publications\/berman_schottlaender.pdf\">https:\/\/libraries.ucsd.edu\/chronopolis\/_files\/publications\/berman_schottlaender.pdf<\/a>.<\/p>\n<p><a href=\"#note54\" id=\"ref54\">[54]<\/a> NDSA, pg 2.<\/p>\n<p><a href=\"#note55\" id=\"ref55\">[55]<\/a> NDSA, pg 3-4.<\/p>\n<p><a href=\"#note56\" id=\"ref56\">[56]<\/a> NDSA, pg 1, 2.<\/p>\n<p><a href=\"#note57\" id=\"ref57\">[57]<\/a> &#8220;Levels&#8221; in the NDSA model are equivalent to &#8220;stages&#8221; in the other models.<\/p>\n<p><a href=\"#note58\" id=\"ref58\">[58]<\/a> Each level also includes a label providing a rough characterization of the stage&#8217;s goal. 
For example, the first stage is labeled &#8220;Protect your data&#8221; and includes activities such as ensuring two copies are not collocated, fixity information is gathered or created, inventories are created, and restricting who has access to the copies. The fourth stage is labeled &#8220;Repair your data&#8221; and includes activities such as replacing\/repairing corrupted data and performing format migrations or providing emulated environments. <\/p>\n<p><a href=\"#note59\" id=\"ref59\">[59]<\/a> NDSA, pg 2.<\/p>\n<p><a href=\"#note60\" id=\"ref60\">[60]<\/a> <a href=\"https:\/\/www.avpreserve.com\/products\/fixity\/\">https:\/\/www.avpreserve.com\/products\/fixity\/<\/a><\/p>\n<p><a href=\"#note61\" id=\"ref61\">[61]<\/a> Phillips, Bailey, Goethals, and Owens, pg 5.<\/p>\n<p><a href=\"#note62\" id=\"ref62\">[62]<\/a> For more on format endangerment see Library of Congress, \u201cSustainability of Digital Formats: Planning for Library of Congress Collections,\u201d webpage, March 10, 2017, <a href=\"http:\/\/www.loc.gov\/preservation\/digital\/formats\/index.html\">http:\/\/www.loc.gov\/preservation\/digital\/formats\/index.html<\/a> and Heather Ryan, &#8220;Occam\u2019s Razor and File Format Endangerment Factors,&#8221; Proceedings of the 11th International Conference on Digital Preservation (iPres), October 6-10, 2014: Melbourne, Australia. <a href=\"https:\/\/ipres-conference.org\/ipres14\/sites\/default\/files\/upload\/iPres-Proceedings-final.pdf\">https:\/\/ipres-conference.org\/ipres14\/sites\/default\/files\/upload\/iPres-Proceedings-final.pdf<\/a>.<\/p>\n<p><a href=\"#note63\" id=\"ref63\">[63]<\/a> Shira Peltzman, \u201cExpanding NDSA Levels of Preservation,\u201d The Signal, April 12, 2016, <a href=\"https:\/\/blogs.loc.gov\/thesignal\/2016\/04\/expanding-ndsa-levels-of-preservation\/\">https:\/\/blogs.loc.gov\/thesignal\/2016\/04\/expanding-ndsa-levels-of-preservation\/<\/a>.<\/p>\n<p><a href=\"#note64\" id=\"ref64\">[64]<\/a> \u201cLevels of Born-Digital Access,\u201d Digital Libraries Federation, February 5, 2020. <a href=\"https:\/\/doi.org\/10.17605\/OSF.IO\/R5F78\">https:\/\/doi.org\/10.17605\/OSF.IO\/R5F78<\/a>. <\/p>\n<p><a href=\"#note65\" id=\"ref65\">[65]<\/a> Phillips, Bailey, Goethals, and Owens, pgs 5-6.<\/p>\n<p><a href=\"#note66\" id=\"ref66\">[66]<\/a> Phillips, Bailey, Goethals, and Owens, pg 5.<\/p>\n<p><a href=\"#note67\" id=\"ref67\">[67]<\/a> DPC RAM, pg 2.<\/p>\n<p><a href=\"#note68\" id=\"ref68\">[68]<\/a> Brown, A (2013). Practical Digital Preservation: a how-to guide for organizations of any size, Facet Publishing: London.<\/p>\n<p><a href=\"#note69\" id=\"ref69\">[69]<\/a> DPC RAM, pg 3.<\/p>\n<p><a href=\"#note70\" id=\"ref70\">[70]<\/a> DPC RAM, pg 4.<\/p>\n<p><a href=\"#note71\" id=\"ref71\">[71]<\/a> Center for Research Libraries. \u201cTen Principles,\u201d n.d. <a href=\"https:\/\/www.crl.edu\/archiving-preservation\/digital-archives\/metrics-assessing-and-certifying\/core-re\">https:\/\/www.crl.edu\/archiving-preservation\/digital-archives\/metrics-assessing-and-certifying\/core-re<\/a>.<\/p>\n<p><a href=\"#note72\" id=\"ref72\">[72]<\/a> Joseph A. Williams and Elizabeth M. Berilla. &#8220;Minutes, Migration, and Migraines: Establishing a Digital Archives at a Small Institution.&#8221; The American Archivist, Spring\/Summer 2015, Vol. 78, No. 1, p. 88. <a href=\"https:\/\/doi.org\/10.17723\/0360-9081.78.1.84\">https:\/\/doi.org\/10.17723\/0360-9081.78.1.84<\/a>.<\/p>\n<p><a href=\"#note73\" id=\"ref73\">[73]<\/a> Prom, Christopher. 
\u201cMaking Digital Preservation Practical: A Personal Odyssey.\u201d Provenance, Journal of the Society of Georgia Archivists 29, no. 1 (2011). <a href=\"http:\/\/digitalcommons.kennesaw.edu\/provenance\/vol29\/iss1\/2\/\">http:\/\/digitalcommons.kennesaw.edu\/provenance\/vol29\/iss1\/2\/<\/a>.<\/p>\n<p><a href=\"#note74\" id=\"ref74\">[74]<\/a> Prom, 13.<\/p>\n<p><a href=\"#note75\" id=\"ref75\">[75]<\/a> &#8220;Jump In Initiative.&#8221; Society of American Archivists. <a href=\"https:\/\/www2.archivists.org\/groups\/manuscript-repositories-section\/jump-in-initiative\">https:\/\/www2.archivists.org\/groups\/manuscript-repositories-section\/jump-in-initiative<\/a>.<\/p>\n<p><a href=\"#note76\" id=\"ref76\">[76]<\/a> &#8220;Jump In Initiative 2013 Results.&#8221; Society of American Archivists. <a href=\"https:\/\/www2.archivists.org\/groups\/manuscript-repositories-section\/jump-in-initiative-2013-results\">https:\/\/www2.archivists.org\/groups\/manuscript-repositories-section\/jump-in-initiative-2013-results<\/a>.<\/p>\n<h2>About the authors<\/h2>\n<p>Seth Shaw is the Digital Library Software Engineer for the Arizona State University Library and an Islandora Core Committer. He previously served as a Developer for Special Collections and Archives for the University of Nevada, Las Vegas, an Assistant Professor of Archival Studies at Clayton State University, and the Electronic Records Archivist for the Duke University Archives. He earned his Master of Science in Information, Archives and Records Management from the University of Michigan, School of Information, and his Bachelor of Science in Information Systems at Brigham Young University &#8211; Idaho.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Archivists occasionally describe digital repositories as being &#8220;not real,&#8221; suggesting that their technical digital preservation infrastructure is inadequate to the task of digital preservation. This article discusses the concept of digital repositories, highlighting the distinction between digital repository technical infrastructure and institutions collecting digital materials, and what it means to be a &#8220;real&#8221; digital repository. It argues that the Open Archival Information System Reference Model and notions of Trustworthy Digital Repositories are inadequate for determining the &#8220;realness&#8221; of a digital repository and advocates using maturity models as a framework for discussing repository capability.  
<\/p>\n","protected":false},"author":18,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[132],"tags":[],"class_list":["post-18541","post","type-post","status-publish","format-standard","hentry","category-issue61"],"_links":{"self":[{"href":"https:\/\/journal.code4lib.org\/wp-json\/wp\/v2\/posts\/18541","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/journal.code4lib.org\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/journal.code4lib.org\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/journal.code4lib.org\/wp-json\/wp\/v2\/users\/18"}],"replies":[{"embeddable":true,"href":"https:\/\/journal.code4lib.org\/wp-json\/wp\/v2\/comments?post=18541"}],"version-history":[{"count":0,"href":"https:\/\/journal.code4lib.org\/wp-json\/wp\/v2\/posts\/18541\/revisions"}],"wp:attachment":[{"href":"https:\/\/journal.code4lib.org\/wp-json\/wp\/v2\/media?parent=18541"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/journal.code4lib.org\/wp-json\/wp\/v2\/categories?post=18541"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/journal.code4lib.org\/wp-json\/wp\/v2\/tags?post=18541"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}},{"id":18535,"date":"2025-10-21T15:58:52","date_gmt":"2025-10-21T19:58:52","guid":{"rendered":"https:\/\/journal.code4lib.org\/?p=18535"},"modified":"2025-10-21T16:00:51","modified_gmt":"2025-10-21T20:00:51","slug":"from-notes-to-networks-using-obsidian-to-teach-metadata-and-linked-data","status":"publish","type":"post","link":"https:\/\/journal.code4lib.org\/articles\/18535","title":{"rendered":"From Notes to Networks: Using Obsidian to Teach Metadata and Linked Data"},"content":{"rendered":"<p>by Kara Long and Erin Yunes<\/p>\n<h2>Introduction<\/h2>\n<p>Beginning in fall 2023, librarians in the Data Services unit at Virginia Tech University Libraries introduced Obsidian as a data visualization and metadata tool for working with relational datasets. Obsidian is a lightweight, versatile note-taking and personal knowledge management (PKM) software designed for creating and managing linked text files. <a href=\"#ref1\" id=\"note1\">[1]<\/a> Obsidian is free to download and use, and while it is not an open-source product, its development team has committed to the use of open formats and made the tool extensible and highly customizable through the use of open plug-ins and themes.<\/p>\n<p>In this article we describe the use of Obsidian on two different research project teams with very different methods and objectives in their respective fields: the Rematriation Project, and the Dangerous Harbors project. We will reflect on successes, lessons learned, and areas for further exploration.<\/p>\n<h2>Data Services at Virginia Tech University Libraries<\/h2>\n<p>The Virginia Tech University Libraries (VTUL) serves over 38,000 students from the main campus in Blacksburg, Virginia to satellite campuses and research centers across the state, in Washington D.C., and internationally. Established in 1872 as the Virginia Agricultural and Mechanical College, the university has since grown in its land-grant mission to offer nearly 280 undergraduate and graduate majors.<\/p>\n<p>The Data Services unit of VTUL facilitates access, management, and analysis of data for both internal library operations as well as faculty- and student-led research teams and initiatives. 
Data Services\u2019 responsibilities include library catalog data, the data repository (VTechData), and instruction in data collection, analysis, and documentation. As a team of functional specialists, Data Services supports research projects across a range of disciplines with varied research data practices and needs. We also support the training and supervision of students engaged in gathering, extracting, or preparing research data. The training we offer ranges from one-time workshops to ongoing partnerships with researchers, supervising students, and providing leadership in data collection and processing throughout the research lifecycle.<\/p>\n<p>Kara Long, Assistant Director for Metadata Services, and Dr. Erin Yunes, CLIR Community Data Postdoctoral Fellow, began experimenting with Obsidian in fall 2023 while working on the Rematriation Project team. The Rematriation Project is a partnership between VTUL and the Aqqaluk Trust (AT), a community organization based in Kotzebue, Alaska. In the same semester, Kara introduced Obsidian to the Dangerous Harbors project research team. Inspired by feedback from both of these teams, we sought ways to make data entry, descriptive metadata work, and linked data concepts more engaging and accessible for the undergraduate and graduate students, as well as community members, working on these projects.<\/p>\n<h2>Obsidian<\/h2>\n<p>Obsidian is a lightweight, free-to-download application that runs locally and does not require a network connection. It is often referred to as a note-taking application or a tool for \u201cpersonal knowledge management\u201d (PKM). Its creators, Shida Li and Erica Xu, developed and first released Obsidian in March 2020. As of the writing of this article, the latest Obsidian release is version 1.8.10, released in April 2025. The Obsidian application can run on Windows, Linux, or Mac operating systems, and mobile versions are available for iOS and Android devices. The development team hosts a Discord server and forum available to users. <a href=\"#ref2\" id=\"note2\">[2]<\/a><\/p>\n<p>While it is not a metadata management system in the traditional sense, the use of Markdown syntax, bi-directional linking, graph views, tags, and YAML front matter makes it a powerful tool for building and organizing a customized, structured knowledge base. Obsidian stores all files, or \u201cnotes,\u201d in Markdown on the user\u2019s device. A key feature of Obsidian is the creation and management of these notes, which can be interconnected with links. This has been a useful teaching and demonstration tool, visually illustrating semantic relationships and networks: connections that are not always immediately obvious in linear note-taking systems or spreadsheets.<\/p>\n<p>The tools typically supported by the Data Services unit can require training and a level of technical expertise that not all participants on the research teams possessed or needed. Especially when working with community partners, not all collaborators had the time to invest in learning highly technical systems. Despite this, it was important to find ways to engage all the project stakeholders in conversations about the data we were gathering. We were willing to investigate unconventional tools to do so, as long as those tools aligned with our team\u2019s values and community-driven goals. We wanted to balance our desire to develop our teams\u2019 skills and understanding with our need to create rich, accurate, and reusable datasets. 
Using Obsidian allowed us to scaffold discussions in a way that minimized the need for prior technical knowledge, while making participation more accessible.<\/p>\n<h2>Case Study 1: The Rematriation Project<\/h2>\n<p>The Rematriation Project is co-led by Dr. Cana Uluak Itchuaqiyaq, Assistant Professor of Professional and Technical Writing at Virginia Tech, and the leadership of the Aqqaluk Trust, an Inuit-led and Inuit-serving community organization based in Kotzebue, Alaska. <a href=\"#ref3\" id=\"note3\">[3]<\/a> The Rematriation Project is an Indigenous-led initiative that aims to empower communities through digital archiving to create, preserve, protect, and return access to I&ntilde;upiat cultural, scientific, and community knowledges. This project emphasizes Indigenous data sovereignty (IDsov) and aims to build local capacities for archiving and digital literacy. Throughout this project, the team regularly discussed how and why to implement and practice the tenets outlined in the CARE Principles for Indigenous Data Governance. <a href=\"#ref4\" id=\"note4\">[4]<\/a> These principles have deeply informed how we approached our work and have been especially informative for those team members with a background in scholarship and research. The principles alone, however, do not fully meet the needs of community members in navigating decisions around data governance and building new or reimagined relationships with researchers. Complementary frameworks and tools, based in local knowledge and experience, are needed to bridge the gap between high-level guidance and the day-to-day work of building more equitable networks and systems for information heritage.<\/p>\n<p>The Rematriation Project is structured around four central goals:<\/p>\n<ul>\n<li>Digitize and catalog tribal materials such as papers, photographs, recordings, and notes.<\/li>\n<li>Develop tools and curriculum to build local capacity for digital archiving, including metadata literacy and use of accessible digital platforms.<\/li>\n<li>Create and test a prototype digital archive, designed for Inuit users, that links to and helps organize existing archives.<\/li>\n<li>Establish protocols for Indigenous data and research sovereignty to ensure communities retain authority over their knowledge.<\/li>\n<\/ul>\n<p>The term \u201crematriation\u201d&mdash;used by Unangax&#770; scholar Dr. Eve Tuck&mdash;describes Indigenous-led processes of returning and restoring cultural knowledge. <a href=\"#ref5\" id=\"note5\">[5]<\/a> This framing reflects a decolonial approach to stewardship, placing Indigenous priorities, technologies, and community decision-making at the center. At the heart of the Rematriation Project is a commitment to restoring community autonomy over cultural memory and knowledge production. This project reimagines the digital archive as a living, community-owned system that supports IDsov and culturally grounded stewardship.<\/p>\n<p>To achieve these goals, the archive must first be usable and accessible, especially in contexts like the Arctic, where connectivity and infrastructure can be limited. Obsidian\u2019s lightweight design and extensive customization options made it an ideal choice for developing an engaging, flexible, and offline-accessible tool to prototype our archival work.<\/p>\n<p>Equally essential to the success of this archive is its ability to support reflexive and culturally appropriate metadata practices. 
In early community development meetings held in October 2022, Rematriation Project community partners emphasized the need for the archive to reflect and incorporate localized terms for any materials the system may hold. This was not only a design preference but a foundational requirement: the archive must reflect local knowledge systems and naming conventions, rather than overwrite them or impose westernized standards.<\/p>\n<h3>Metadata as Relational Practice<\/h3>\n<p>Metadata creation and linked annotation play a central role in the Rematriation Project\u2019s goal to recontextualize and recover Inuit knowledge from settler archives. Obsidian\u2019s Markdown-based structure and customizable tagging made it possible to design an archival prototype that foregrounds Indigenous terminologies and relationships. The use of Obsidian allowed us to ensure that digital materials could be linked, interpreted, and retrieved in ways that are meaningful to local users.<\/p>\n<p>The archival prototype we built used a sample from the research of Caleb Pungowiyi, a Siberian Yupik leader whose materials, spanning from the 1980s to the early 2000s, include photos, notes, policy documents, and scientific observations. His work on Arctic ecology and climate change brought Indigenous perspectives to national and international policy decision-making. <a href=\"#ref6\" id=\"note6\">[6]<\/a> We digitized and described over 6,500 files with the support of the VTUL Digital Imaging Lab. The Pungowiyi family loaned the materials to the Rematriation Project and, following a post-custodial, community-first model of archiving, the materials were returned to the family after the digitization process.<\/p>\n<p>Focusing exclusively on a portion of Pungowiyi\u2019s photographs, we built the first version of the Obsidian prototype vault, where each photo became an individual Markdown file or note. We used OpenRefine <a href=\"#ref7\" id=\"note7\">[7]<\/a> to clean and format metadata generated by the digitization lab and the Obsidian plugin JSON\/CSV Importer <a href=\"#ref8\" id=\"note8\">[8]<\/a> to batch create notes for 820 image files. This first vault incorporated YAML front matter to ensure consistent metadata inputs, with tags and backlinks to reflect relationships and recurring themes. With the Map View <a href=\"#ref9\" id=\"note9\">[9]<\/a> plugin, we used geolocation metadata to generate an interactive map, which gave us a visual representation of the places documented in Pungowiyi\u2019s research.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/journal.code4lib.org\/media\/issue61\/long\/figure1.png\" width=\"527\" height=\"662\" alt=\"Figure 1: A screenshot of the YAML frontmatter at the head of a note describing photograph \u201cp0098\u201d of the Caleb Pungowiyi Papers\" \/><\/p>\n<p class=\"caption\">\n  <strong>Figure 1.<\/strong> A screenshot of the YAML frontmatter at the head of a note describing photograph \u201cp0098\u201d of the Caleb Pungowiyi Papers\n<\/p>\n<p>In the second phase of the prototype, Aqqaluk Trust staff member Dylan Paisaq Itchuaqiyaq joined in the development process, testing the system and contributing localized metadata, strengthening the archive\u2019s community relevance. After an initial round of metadata enhancement, Paisaq identified several \u201cquality of life\u201d improvements&mdash;practical features that would make working in Obsidian more efficient and user-friendly. 
These included the need for synonym matching to connect different terms, streamlined bulk tagging, and image gallery metadata creation options. This phase was an important step to ensure that the archive could support consistency and flexibility as it scales for broader community use.<\/p>\n<h3>Bridging Technical and Cultural Needs<\/h3>\n<p>Obsidian was particularly effective in bridging technical and community needs. It supported ethical documentation of metadata terms chosen by I&ntilde;upiat collaborators and enabled non-technical contributors to work directly in a legible, non-proprietary format. Above all, it was essential to create a space for community members to define metadata categories and control the language they chose to use to describe the archival materials in the prototype vault. Because selected materials may be designated by community members as containing Traditional Knowledges (TK) or sensitive family stories, the knowledge that each vault was available only on a user\u2019s machine (and not in a shared spreadsheet or drive) allowed users to capture their thoughts freely.<\/p>\n<h2>Case Study 2: Dangerous Harbors<\/h2>\n<p>The Dangerous Harbors Project aims to aggregate and make accessible narratives of escape from servitude and enslavement in Virginia, Maryland, and North Carolina during the seventeenth century. Dr. Jessica Taylor, Associate Professor of Oral and Public History at Virginia Tech, leads the research team in this ongoing project. With the support of an NEH planning grant for the 2023-2024 academic year, Dr. Taylor and a team of faculty, students, and librarians digitized, transcribed, and generated a dataset from court records collected from three Virginia counties related to escape attempts of enslaved and indentured persons.<\/p>\n<p>While these are public records, seventeenth-century handwriting and rhetoric are challenging to read and comprehend. Additionally, these records are often stored at the county or colony level and siloed from records in neighboring jurisdictions documenting similar or related proceedings. Creating a dataset that crosses county lines allows us to expand our understanding of unfree persons and the social and legal networks of the period. The team seeks to build on the work of projects such as <a href=\"http:\/\/enslaved.org\">Enslaved.org<\/a> and Freedom on the Move, and we have aligned our data model with the <a href=\"http:\/\/enslaved.org\">Enslaved.org<\/a> Ontology <a href=\"#ref10\" id=\"note10\">[10]<\/a> to increase interoperability and reuse with similarly aligned data-gathering projects.<\/p>\n<p>The Dangerous Harbors dataset consists of three record types: persons, events, and sources. Each person-record represents one individual, uniquely identified in the primary source documents. Each event-record represents a single event (such as a legal proceeding or hearing), described in the primary source documents. The primary sources are represented by source-records and citations. As a collection of court orders, the documents are already itemized by event: each entry in the order books describes one proceeding. The persons involved, however, may be recorded several times across multiple events, and multiple individuals are typically recorded in association with each event. The roles of individuals vary across events. For example, an individual acting as a jury member in one proceeding may be providing legal counsel in another. 
As sources were transcribed, the team populated a shared spreadsheet to record persons and events. Each person and event was assigned a unique identifier, allowing us to associate individuals, events, and the documentary sources with each other.<\/p>\n<h3>Metadata Preparation<\/h3>\n<p>Despite using a flat spreadsheet to gather the data from the source material, our conceptual model was based on linked data principles, which we were able to demonstrate and share with students using Obsidian.<\/p>\n<p>In the court records, people with the same or similar names were often recorded with very little to differentiate them. Names were also often recorded with variant spellings or as partial versions. We used contextual evidence from the sources to make these identity determinations, such as the point in time when the proceeding occurred, the locations mentioned, and the presence or absence of other persons previously associated with the person in question. We recorded notes about near-matches, confidence level, and any relevant evidence for ambiguous cases.<\/p>\n<p>We used the Obsidian plug-in JSON\/CSV Importer to create a note in a new Obsidian vault for each person-record and each event-record in our shared spreadsheet. The JSON\/CSV Importer plug-in uses a Handlebars template file, stored in the Obsidian vault, to configure imported data, including creating properties in the YAML frontmatter and adding suffixes to create unique note filenames. <a href=\"#ref11\" id=\"note11\">[11]<\/a> In Obsidian, all notes within the same folder must have a unique filename, so each note has a unique file path. Having assigned a unique identifier to each person-record and each event-record, we used these identifiers as the note filenames. We imported the person-records and event-records to Obsidian folders labeled \u201cPersons\u201d and \u201cEvents,\u201d respectively. We used the aliases property to record each person\u2019s name, when known. This allowed us to reference and retrieve notes for each person either by their name or by their unique identifier.<\/p>\n<p>Using the double bracket syntax (described further below), we linked persons, events, and sources. Obsidian can display these links through different view options. From an open note, a user can select \u201cOpen linked view\u201d from the \u201cMore options\u201d menu. From this menu, users can dynamically create the following displays: a local knowledge graph, showing the open note and all linked notes; a list of notes that link back to the open note (called backlinks); a list of all outgoing links; and an outline view, showing a more hierarchical arrangement of headings and links. In addition to these views, users can also access a knowledge graph of the entire vault from the main navigation menu, called \u201cOpen graph view.\u201d<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/journal.code4lib.org\/media\/issue61\/long\/figure2.png\" width=\"550\" height=\"715\" alt=\"Figure 2: This screenshot shows the More Options menu, and the different views available to show how notes are linked together in Obsidian\" \/><\/p>\n<p class=\"caption\">\n  <strong>Figure 2.<\/strong> This screenshot shows the More Options menu, and the different views available to show how notes are linked together in Obsidian\n<\/p>\n<h3>Staging for Data Publication<\/h3>\n<p>Over the course of the project, we imported approximately one hundred notes describing persons and seventy notes describing events. 
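<\/p>
<p>To make this import concrete, the following is a minimal sketch of what a Handlebars template for the JSON\/CSV Importer might look like for the person-records; the column names (PersonID, Name, SourceID, Notes) are illustrative assumptions rather than the project\u2019s actual spreadsheet headings, and details such as how the note filename is derived depend on how the plug-in is configured:<\/p>
<pre><code>---
aliases:
  - {{Name}}
tags:
  - person
---
Identifier: {{PersonID}}
Source: [[{{SourceID}}]]

{{Notes}}
<\/code><\/pre>
<p>Each row of the spreadsheet would then become one Markdown note, with the person\u2019s recorded name stored as an alias and a link back to the source record.<\/p>
<p>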
We chose this subset of the data because it had been gathered early on in the life of the project, and team members were familiar with the idiosyncrasies of the records used to derive the data&mdash;which provided another opportunity to educate students and new team members about what they might encounter in working with seventeenth-century records. This is a small representation of the complete dataset, which now includes over 340 events and roughly 850 persons. The dataset will grow as work on the project continues. Using a subset of the data was suitable for demonstrating the utility of assigning unique identifiers to each event, person, and source, and defining the relationships between them. This proof-of-concept also prepared the team for conversations about modeling the Dangerous Harbors dataset for a linked open data (LOD) environment. While Wikidata and Wikibase are more user-friendly than creating and publishing RDF or serialized XML as linked data, there is still a significant learning curve for many researchers and students who are unfamiliar with linked data concepts in general. Obsidian was a vital visual aid and easy-to-use tool that could demonstrate in real time the advantages of modeling data this way.<\/p>\n<p>The Dangerous Harbors project has several aims, including the development of lesson plans and educational resources for teachers and community groups, and the publication of the dataset and corresponding records in formats that are accessible and reusable. Working with Obsidian has provided a more inclusive environment to discuss linked data with stakeholders and team members who may be new to these methods of knowledge organization. It also allowed us to explore different ways of representing the relationships in the data before transferring the data to a Wikibase.cloud instance. <a href=\"#ref12\" id=\"note12\">[12]<\/a> The Wikibase software offers more in terms of data publication, querying, and integrating data from external sources; however, we plan to continue to use Obsidian as a staging area and jumping-off point to engage new and returning team members in data modeling discussions. At this stage, the team continues to use a shared spreadsheet to export data to Obsidian and Wikibase; however, viewing the data in Obsidian has revealed errors and inconsistencies that were difficult to detect when working in a spreadsheet alone.<\/p>\n<h3>Technical Infrastructure and Plugins<\/h3>\n<p>An Obsidian \u201cvault\u201d is the top-level folder containing all of the sub-folders and files (or notes), attachments, and configuration files that the user is able to access, edit, and link while using Obsidian. A user can create several vaults or switch between vaults, but links can only be created within a single vault. For both the Rematriation and Dangerous Harbors projects, each research team worked with a discrete dataset, which we kept in a separate vault.<\/p>\n<p>The double bracket syntax ([[text]]) in Obsidian creates a link between one note and another. Inserting the hash character (#tag) before any text creates a tag. Tags allow users to sort notes into user-defined categories. From the navigation pane, a user can view all the tags they have created in Obsidian and navigate amongst notes containing only those tags. Tags can appear in the body of a note or in the front matter as a property of that note. Obsidian note properties are configured and stored as YAML frontmatter. 
<a href=\"#ref13\" id=\"note13\">[13]<\/a> Properties in Obsidian can be added to templates that can then be used to create new notes. For example, if a user creates a new note about an entity in a dataset, they can generate that note from a template with the appropriate properties describing that entity ready for entry. From a template, a property can also be \u201cpre-filled\u201d with values (such as the dates or tags) appropriate for that particular note.<\/p>\n<p>Users can create properties to use across all notes within a vault. When creating a new property, a user provides a unique name for that property and selects a data type. Obsidian supports the following property data types: text, list, number, checkbox, date, and date and time. Properties can only be populated with values of the selected data type. There are three default properties that have fixed data types in Obsidian: cssclasses, tags, and aliases. The cssclass property uses a list data type and allows you to apply css styling to a note or group of notes. The tags property data type is similar to list, but can only be populated with existing tags or newly created tags. The aliases property is also similar to the list data type but stores variations on the name of the note that allows the user to retrieve or reference that note by either its name or any of its aliases. Use of these properties is not required in any given note or project, but could be useful for filtering or searching for notes by tags and alternative text, or applying specific styling. We did not take advantage of these specific properties in our projects.<\/p>\n<p>Users can extend the features and functionality of Obsidian with plugins. A library of core plugins and community-contributed plugins is accessible from the Obsidian options menu. It is necessary to have an active network connection to access an up-to-date list of all of the options available and download them. Throughout our experimentation with Obsidian, we found several community plug-ins that were essential to using the application for exploring research data, including: Dataview, JSON\/CSV Importer, Image Gallery <a href=\"#ref14\" id=\"note14\">[14]<\/a>, and Mapview.<\/p>\n<p>Through the use of templates, folder structures, plug-ins, and properties we created Obsidian vaults that, themselves, acted as templates and instruction manuals for data collection and sharing. <a href=\"#ref15\" id=\"note15\">[15]<\/a> For both research teams, we used data initially gathered in spreadsheets to populate an Obsidian vault with notes, which were shared with project team members. Using copies of a shared vault, team members were able to experiment, enhance, and explore the dataset in their own Obsidian instance and provide feedback on the data that had already been gathered, make changes, and share those changes back with the team. Of course, there is a risk of conflicting data when multiple team members work on their own copies of the dataset, and in practice, the data may not always align perfectly. Obsidian allows team members to document their observations or proposed changes directly in the body of a note for later discussion. 
When conflicts or ambiguities arise, they are resolved collaboratively: for the Dangerous Harbors Project, decisions are made by the project leaders in consultation with subject specialists; for the Rematriation Project, the community partners determine the appropriate resolution.<\/p>\n<p>Obsidian also supports a number of export options and community plug-ins that facilitate export in a variety of file formats, such as PDF. The Dataview <a href=\"#ref16\" id=\"note16\">[16]<\/a> plugin and the Table to CSV Exporter <a href=\"#ref17\" id=\"note17\">[17]<\/a> plugin support the export of notes out of Obsidian as CSV. Extracting data and importing it into another system does require some knowledge of, and comfort with, data transformation.<\/p>\n<h2>Reflections and Recommendations<\/h2>\n<p>Obsidian allows users to create and experiment with a variety of arrangements of their own research data and observations, thus making visible the impacts of the different choices that confront anyone gathering and generating data. Without an understanding of how an information system will act on the given information (in this case, descriptive metadata), the system becomes \u201ccapricious, untrustworthy, and unpredictable.\u201d <a href=\"#ref18\" id=\"note18\">[18]<\/a> For example, it makes visible the choice to model the data based on an event or a person, or when to create a new \u201cnote\u201d (i.e., an entity), or when to allow that entity to simply be represented by a string&mdash;and what affordances that does or does not allow within the data set. These are concepts that can be difficult to explain and difficult to grasp quickly for users or operators of a system but can become more meaningful to a participant and co-creator of that system.<\/p>\n<p>In comparison to asking participants (either community partners or students and faculty) to \u201cfill in\u201d a spreadsheet with values that will later be passed on to a library specialist in data visualization or digital platform management, using Obsidian allows the participants to both plan an action (i.e., create a relationship or link between a person and a place) and execute that action. Even as an exercise, this can help build trust within research project teams, so the library and the way library systems operationalize data seem less like a \u201cblack box.\u201d It also contributes to team members\u2019 information literacy, setting a common environment to engage non-library specialists in discussions about how and when to use controlled vocabularies, when to form links between concepts, and how those relationships should be defined.<\/p>\n<p>Decisions made by the research team and modeled in Obsidian can be translated into library systems, if appropriate for the project. We tested using a Python script <a href=\"#ref19\" id=\"note19\">[19]<\/a> to generate a CSV file of all of the note filenames, properties, and text bodies in our Obsidian vaults as a way to \u201cextract\u201d the data added by team members. Because Obsidian vaults are essentially folders containing structured text files, the Python script relies on the consistent use of properties and tags to extract the file names and return the content of those files as structured data. 
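<\/p>
<p>As a rough illustration of the general approach (not the project\u2019s actual script), a minimal version might walk the vault folder, split the YAML frontmatter from the body of each Markdown file, and write one CSV row per note; the vault path, the use of the PyYAML library, and the column layout below are all assumptions:<\/p>
<pre><code>import csv
import pathlib

import yaml  # PyYAML, assumed here for parsing the frontmatter block

VAULT = pathlib.Path('DangerousHarborsVault')  # top-level vault folder

rows = []
for note in VAULT.rglob('*.md'):
    text = note.read_text(encoding='utf-8')
    properties, body = {}, text
    # Obsidian stores note properties as a YAML block delimited by '---' lines
    if text.startswith('---'):
        _, frontmatter, body = text.split('---', 2)
        properties = yaml.safe_load(frontmatter) or {}
    rows.append({
        'filename': note.stem,  # the unique identifier used as the note name
        'aliases': '; '.join(properties.get('aliases') or []),
        'tags': '; '.join(properties.get('tags') or []),
        'body': body.strip(),
    })

with open('vault_export.csv', 'w', newline='', encoding='utf-8') as out:
    writer = csv.DictWriter(out, fieldnames=['filename', 'aliases', 'tags', 'body'])
    writer.writeheader()
    writer.writerows(rows)
<\/code><\/pre>
<p>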
We could then use the CSV export as part of an ETL pipeline to generate records in a digital collections platform or create entities in open knowledge bases, such as Wikidata or Wikibase.<\/p>\n<h2>Conclusion<\/h2>\n<p>Obsidian can be an effective bridge between narrative and structured data, between scholarly and community perspectives, and between material archives and digital tools. Obsidian proved to be a valuable tool for research teams seeking structured but flexible metadata, easy onboarding for contributors, transparent file storage in open formats, and ways to integrate narrative with metadata capture. As a prototyping tool, Obsidian allowed project team members with varying levels of proficiency with traditional data tools to experiment with high-level knowledge organization concepts.<\/p>\n<p>There are many options and resources available for learning more about Obsidian. Obsidian users and developers offer many good ideas on the official message board. The community plugin pages that are created and maintained by users offer a wide range of documentation and examples. Obsidian users are also easy to find on Reddit, YouTube, and other parts of the internet. There is no single canonical source of information or inspiration for using Obsidian &#8211; unlike an enterprise system or a more scholarly-focused tool. <a href=\"#ref20\" id=\"note20\">[20]<\/a> It is a constantly evolving landscape, which can be equally exciting and challenging. We are looking forward to seeing how these tools continue to change and whether and how PKM tools become more widely adopted in academic projects.<\/p>\n<h2>References<\/h2>\n<p>Bade, David. 2012. IT That Obscure Object of Desire: on French Anthropology, Museum Visitors, Airplane Cockpits, RDA, and the Next Generation Catalog. Cataloging and Classification Quarterly 50:316-334. <a href=\"https:\/\/doi.org\/10.1080\/01639374.2012.657606\">https:\/\/doi.org\/10.1080\/01639374.2012.657606<\/a><\/p>\n<p>Carroll, S. R., Garba, I., Figueroa-Rodr&iacute;guez, O. L., Holbrook, J., Lovett, R., Materechera, S., Parsons, M., Raseroka, K., Rodriguez-Lonebear, D., Rowe, R., Sara, R., Walker, J. D., Anderson, J., &amp; Hudson, M. 2020. The CARE Principles for Indigenous Data Governance. Data Science Journal, 19(1). <a href=\"https:\/\/doi.org\/10.5334\/dsj-2020-043\">https:\/\/doi.org\/10.5334\/dsj-2020-043<\/a><\/p>\n<p>Enslaved.org. Enslaved.org Ontology [Internet]. [cited May 21, 2025]. Available from: <a href=\"https:\/\/enslaved.org\/ontology\">https:\/\/enslaved.org\/ontology<\/a><\/p>\n<p>Tuck, Eve. 2011. Rematriating Curriculum Studies. Journal of Curriculum and Pedagogy 8(1): 34&ndash;37. <a href=\"https:\/\/doi.org\/10.1080\/15505170.2011.572521\">https:\/\/doi.org\/10.1080\/15505170.2011.572521<\/a><\/p>\n<h2>Notes<\/h2>\n<p><a href=\"#note1\" id=\"ref1\">[1]<\/a> <a href=\"https:\/\/obsidian.md\/\">https:\/\/obsidian.md\/<\/a><\/p>\n<p><a href=\"#note2\" id=\"ref2\">[2]<\/a> Join the Community [Internet]. [updated 2025]. Obsidian; [cited May 21, 2025]. Available from: <a href=\"https:\/\/obsidian.md\/community\">https:\/\/obsidian.md\/community<\/a><\/p>\n<p><a href=\"#note3\" id=\"ref3\">[3]<\/a> The Rematriation Project [Internet]. [updated April 2025]. Blacksburg (VA): Virginia Tech University Libraries and the Aqqaluk Trust; [cited May 18, 2025]. 
Available from: <a href=\"https:\/\/rematriate.net\">https:\/\/rematriate.net<\/a><\/p>\n<p><a href=\"#note4\" id=\"ref4\">[4]<\/a> Carroll, et al., \u201cThe Care Principles for Indigenous Data Governance,\u201d 43.<\/p>\n<p><a href=\"#note5\" id=\"ref5\">[5]<\/a> Tuck, \u201cRematriating Curriculum Studies,\u201d 37.<\/p>\n<p><a href=\"#note6\" id=\"ref6\">[6]<\/a> <a href=\"https:\/\/www.calebscholars.org\/about-caleb\/\">https:\/\/www.calebscholars.org\/about-caleb\/<\/a><\/p>\n<p><a href=\"#note7\" id=\"ref7\">[7]<\/a> <a href=\"https:\/\/openrefine.org\/\">https:\/\/openrefine.org\/<\/a><\/p>\n<p><a href=\"#note8\" id=\"ref8\">[8]<\/a> The JSON\/CSV Importer plug-in was created by Obsidian user furling42 and is available on the Obsidian Community Plug-Ins page or on Github: <a href=\"https:\/\/github.com\/farling42\/obsidian-import-json\">https:\/\/github.com\/farling42\/obsidian-import-json<\/a><\/p>\n<p><a href=\"#note9\" id=\"ref9\">[9]<\/a> The Map View plugin was created by Obsidian user esm and is available on the Obsidian Community Plug-Ins page or on Github: <a href=\"https:\/\/github.com\/esm7\/obsidian-map-view\">https:\/\/github.com\/esm7\/obsidian-map-view<\/a><\/p>\n<p><a href=\"#note10\" id=\"ref10\">[10]<\/a> <a href=\"http:\/\/enslaved.org\">Enslaved.org<\/a> Ontology: <a href=\"https:\/\/docs.enslaved.org\/ontology\/\">https:\/\/docs.enslaved.org\/ontology\/<\/a><\/p>\n<p><a href=\"#note11\" id=\"ref11\">[11]<\/a> Handlebars templating language: <a href=\"https:\/\/handlebarsjs.com\/guide\/\">https:\/\/handlebarsjs.com\/guide\/<\/a><\/p>\n<p><a href=\"#note12\" id=\"ref12\">[12]<\/a> Wikibase.cloud instance: <a href=\"https:\/\/dangerousharbors.wikibase.cloud\/wiki\/Main_Page\">https:\/\/dangerousharbors.wikibase.cloud\/wiki\/Main_Page<\/a><\/p>\n<p><a href=\"#note13\" id=\"ref13\">[13]<\/a> Properties [Internet]. [updated August 21, 2020]. Obsidian Help; [cited May 21, 2025]. Available from: <a href=\"https:\/\/help.obsidian.md\/properties\">https:\/\/help.obsidian.md\/properties<\/a><\/p>\n<p><a href=\"#note14\" id=\"ref14\">[14]<\/a> Image Gallery was created by Obsidian user, Luca Orio. It is available on the Obsidian Community Plug-Ins page or on Github: <a href=\"https:\/\/github.com\/lucaorio\/obsidian-image-gallery\">https:\/\/github.com\/lucaorio\/obsidian-image-gallery<\/a><\/p>\n<p><a href=\"#note15\" id=\"ref15\">[15]<\/a> A copy of our Obsidian vault is available on the project team\u2019s Github: <a href=\"https:\/\/github.com\/rematriation\/Obsidian-Vault-documentation\/tree\/main\">https:\/\/github.com\/rematriation\/Obsidian-Vault-documentation<\/a>. The media files and the majority of the item-level metadata have been redacted. The available vault does contain some build notes, properties, templates, and plug-ins used by the Rematriation project team.<\/p>\n<p><a href=\"#note16\" id=\"ref16\">[16]<\/a> The Dataview plug-in was created by Obsidian user, Michael Brenan. It is available on the Obsidian Community Plug-ins page or on Github: <a href=\"https:\/\/github.com\/blacksmithgu\/obsidian-dataview\">https:\/\/github.com\/blacksmithgu\/obsidian-dataview<\/a>. Dataview allows users to use query language for sorting, filtering, and extracting data from their notes using a JavaScript API.<\/p>\n<p><a href=\"#note17\" id=\"ref17\">[17]<\/a> Table to CSV Exporter was created by Obsidian user Stefan Wolfrum. 
It is available on the Obsidian Community Plug-ins page or on Github: <a href=\"https:\/\/github.com\/metawops\/obsidian-table-to-csv-export\">https:\/\/github.com\/metawops\/obsidian-table-to-csv-export<\/a><\/p>\n<p><a href=\"#note18\" id=\"ref18\">[18]<\/a> Bade, \u201cIT That Obscures,\u201d 320.<\/p>\n<p><a href=\"#note19\" id=\"ref19\">[19]<\/a> The authors welcome requests to share the Python script. Please get in touch over email.<\/p>\n<p><a href=\"#note20\" id=\"ref20\">[20]<\/a> Obsidian user Nicole Van der Hoeven\u2019s tutorials on YouTube and on her Obsidian site have been especially helpful in learning how to use the application and in showing use cases for different plugins. You can find her work here: <a href=\"https:\/\/notes.nicolevanderhoeven.com\/Fork%2BMy%2BBrain\">https:\/\/notes.nicolevanderhoeven.com\/Fork+My+Brain<\/a> <\/p>\n<h2>About the authors<\/h2>\n<p>Kara Long is the Assistant Director for Metadata Services at the University Libraries at Virginia Tech. She provides metadata consultations, training, and development for library users and researchers. She brings a background in cataloging and digital collections to her work. Her current research is focused on collaborative, community-based approaches to knowledge preservation and information heritage.<\/p>\n<p>Dr. Erin Yunes is a Professor and Departmental Coordinator in the School of Art at the American College of the Mediterranean (ACM-IAU) in France. A former CLIR Postdoctoral Fellow in Community Data, she specializes in data management, digital cultural heritage, and equitable information practices. Her research focuses on community-led strategies for strengthening access to digital cultural resources.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>In this article, we describe a novel use of the note-taking software Obsidian as a method for users without formal training in metadata creation to develop culturally relevant data literacies across two digital archiving projects. We explain how Obsidian\u2019s built-in use of linked data provides an open-source, flexible, and potentially scalable way for users to creatively interact with digitized materials, navigate and create metadata, and model relationships between digital objects. 
Furthermore, we demonstrate how Obsidian\u2019s local and offline hosting features can be leveraged to include team members with low or unreliable internet access.<\/p>\n","protected":false},"author":198,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[132],"tags":[],"class_list":["post-18535","post","type-post","status-publish","format-standard","hentry","category-issue61"],"_links":{"self":[{"href":"https:\/\/journal.code4lib.org\/wp-json\/wp\/v2\/posts\/18535","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/journal.code4lib.org\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/journal.code4lib.org\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/journal.code4lib.org\/wp-json\/wp\/v2\/users\/198"}],"replies":[{"embeddable":true,"href":"https:\/\/journal.code4lib.org\/wp-json\/wp\/v2\/comments?post=18535"}],"version-history":[{"count":0,"href":"https:\/\/journal.code4lib.org\/wp-json\/wp\/v2\/posts\/18535\/revisions"}],"wp:attachment":[{"href":"https:\/\/journal.code4lib.org\/wp-json\/wp\/v2\/media?parent=18535"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/journal.code4lib.org\/wp-json\/wp\/v2\/categories?post=18535"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/journal.code4lib.org\/wp-json\/wp\/v2\/tags?post=18535"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}},{"id":18408,"date":"2025-04-14T10:11:59","date_gmt":"2025-04-14T14:11:59","guid":{"rendered":"https:\/\/journal.code4lib.org\/?p=18408"},"modified":"2025-04-14T10:12:50","modified_gmt":"2025-04-14T14:12:50","slug":"editorial-7","status":"publish","type":"post","link":"https:\/\/journal.code4lib.org\/articles\/18408","title":{"rendered":"Editorial"},"content":{"rendered":"<p>By Mark Swenson<\/p>\n<p>In the week before finalizing the content for this sixtieth issue of the\u00a0<em>Code4Lib Journal<\/em>, I attended the <em>Computers in Libraries<\/em> conference in Arlington, Virginia.\u00a0 In the midst of dozens of presentations on GenAI and what it means for today&#8217;s libraries one session touched on issues core to this publication and its aims: <em>Empowering Libraries Through Code: Future Ready Digital Leadership<\/em>.\u00a0 The first presentation, from Austin Stroud at Indiana University Indianapolis, articulated the gulf between the widely held perception that learning a high-level programming language would improve career possibilities for MLIS students and the reality that, historically, ALA accredited programs do not provide this training.\u00a0 Following a second presentation, from Scott Hargrove of Fraser Valley Regional Library in British Columbia on how their new strategic plan focuses on engaging the staff and public on programming and AI literacy, the question was posed to those attending the session: why do librarians and library staff need programming skills anyway?<\/p>\n<p>If you have found your way to <em>The Code4Lib Journal<\/em> (dedicated to sharing coding solutions in libraries) you may already have many ideas of the ways that programming knowledge can be useful for library staff.\u00a0 If not, this issue has eight articles that demonstrate how competencies in software development are intimately intertwined with modern library operations:<\/p>\n<ul>\n<li>From Corinne Chatnik and James Gaskell at Union College we have a description on how writing a program in Python was 
able to improve the ability of undergraduate students to accurately enter data.<\/li>\n<li>Karen Coyle describes the OpenWEMI specification of Dublin Core explaining how it can expand the use of FRBR ideas into new non-library environments.<\/li>\n<li>Jennifer D&#8217;Souza (TIB Leibniz Centre for Science and Technology) demonstrates using knowledge graphs to analyze Large Language Models.<\/li>\n<li>Halie Kerns (Binghamton University) and Leah Fitzgerald (Amsterdam Free Library) describe how they made a video game to teach information literacy skills.<\/li>\n<li>Aerith Y. Netzer (Northwestern University) demonstrates a way to use Large Language Models to turn plain-text citations to BibTeX.<\/li>\n<li>Wilhelmina Randtke (Georgia Southern University) relates the journey they took to simplify and restructure a library system database that had become overcomplicated following a system merger and a migration.<\/li>\n<li>Andrew Weymouth (University of Idaho) details the use of Python and Google Apps Script to text mine and tag the University&#8217;s oral history collections.<\/li>\n<li>Olivia Wikle (Iowa State University) and Evan Peter Williamson (University of Idaho Library) show their success in using the CollectionBuilder framework and the static site generator Jekyll for creating easy to maintain digital collections websites.<\/li>\n<\/ul>\n<p>These articles showcase a variety of perspectives and challenges tackled through modern computational methods.\u00a0 The authors are united by their commitment to enhancing workflows and services, their eagerness to engage in innovative collaborations, and their dedication to acquiring new skills.\u00a0 Additionally, these contributions reflect a willingness to share knowledge and support the ongoing understanding of the skills needed in a contemporary library setting.\u00a0 Certainly not every library staffer needs to know how to write computer programs, but having someone on your library staff who understands computer programming and can apply that knowledge to library problems is something any library can use.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Welcome to the 60th issue of Code4Lib Journal.  
We hope that you enjoy the assortment of articles we have assembled for this issue.<\/p>\n","protected":false},"author":202,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[131],"tags":[],"class_list":["post-18408","post","type-post","status-publish","format-standard","hentry","category-issue60"],"_links":{"self":[{"href":"https:\/\/journal.code4lib.org\/wp-json\/wp\/v2\/posts\/18408","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/journal.code4lib.org\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/journal.code4lib.org\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/journal.code4lib.org\/wp-json\/wp\/v2\/users\/202"}],"replies":[{"embeddable":true,"href":"https:\/\/journal.code4lib.org\/wp-json\/wp\/v2\/comments?post=18408"}],"version-history":[{"count":0,"href":"https:\/\/journal.code4lib.org\/wp-json\/wp\/v2\/posts\/18408\/revisions"}],"wp:attachment":[{"href":"https:\/\/journal.code4lib.org\/wp-json\/wp\/v2\/media?parent=18408"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/journal.code4lib.org\/wp-json\/wp\/v2\/categories?post=18408"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/journal.code4lib.org\/wp-json\/wp\/v2\/tags?post=18408"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}},{"id":18340,"date":"2025-04-14T10:11:58","date_gmt":"2025-04-14T14:11:58","guid":{"rendered":"https:\/\/journal.code4lib.org\/?p=18340"},"modified":"2025-04-14T13:12:58","modified_gmt":"2025-04-14T17:12:58","slug":"quality-control-automation-for-student-driven-digitization-workflows","status":"publish","type":"post","link":"https:\/\/journal.code4lib.org\/articles\/18340","title":{"rendered":"Quality Control Automation for Student Driven Digitization Workflows"},"content":{"rendered":"<p>By Corinne Chatnik and James Gaskell<\/p>\n<h2>Introduction<\/h2>\n<p>Digitization of cultural heritage materials in academic libraries is important for increasing the access and visibility of its holdings, providing content and data for digital scholarship, and allowing for increased accessibility of library materials. The output of this work is valuable to researchers and the Union College community, but it is also resource intensive and time-consuming. Union College Schaffer Library experiences these challenges particularly in the context of a predominantly student-staffed digitization operation.<\/p>\n<p>The digitization program at Schaffer Library relies heavily on undergraduate student workers for the capture and quality control of the digital archival records. Union College is a solely undergraduate institution and the student work hours are limited to a few hours per week with varying schedules. Most undergraduates do not have prior experience with digitization and this is a new skill set for them that requires significant training. Successful training is made difficult due to the differences in the students schedules resulting in ad hoc training sessions and sometimes days between each work shift. There is also currently no staff member or librarian dedicated to the digitization lab full time; instead, supervision of students is divided between two librarians and one staff member.<\/p>\n<p>With these limitations, it was recognized that more QC can\u2019t be done but better QC can be achieved with automation. 
To do this, a collaborative project was initiated between the library&#8217;s Digital Collections and Preservation Librarian, Corinne Chatnik and a senior Union College Computer Science student, James Gaskell, who was initially hired and trained to do digitization. Together they planned and developed a quality control automation application. By leveraging Python programming and the Openpyxl library, the aim was to create a tool that could systematically verify metadata consistency and file management accuracy, thereby reducing the burden on manual review processes.<\/p>\n<p>This approach was informed by direct experience with the digitization workflow and an understanding of common error patterns observed over time. The resulting application needed to be both sophisticated enough to catch subtle errors and user-friendly enough to be integrated into the existing workflow, particularly considering the varying technical expertise of student workers and staff.<\/p>\n<p>This paper details the journey in developing this quality control automation solution, from initial concept to implementation. It will explore how the application addresses specific challenges in the digitization workflow and the technical decisions that shaped its development. Moreover, how this project served as an experiential learning opportunity, allowing James to apply classroom knowledge to solve real-world problems while contributing to the library&#8217;s digital initiatives.<\/p>\n<h2>Background<\/h2>\n<p>In 1795, Union College was established in upstate New York as the first non-denominational institution (not affiliated with a religious organization) of higher education in the United States. Today, it is a small undergraduate, liberal arts college committed to the integration of arts and humanities with science and engineering. Union College has a long history evidenced by the collections held by the Schaffer Library\u2019s Special Collections and Archives. These collections are not only valuable to the institution but to a broader audience. With the significance of these collections, Schaffer Library hopes to increase digitization. Automating the quality control aspects of the workflow will help this initiative.<\/p>\n<p>The digitization program at Union College has a relatively short history. Schaffer Library began digitizing small amounts of material from Special Collections in 2008. Then a loss of funding resulted in a pause in digitization efforts. In 2014, library leadership recognized the importance of digitization for access and accessibility and plans were made to restart the program. A staff member with digital projects experience was hired to take over the digitization lab and complete all aspects of the digitization workflow. These projects were small boutique collections chosen based on faculty and librarian interests. The resources allocated for these projects were not conducive to scaling up to larger initiatives.<\/p>\n<p>Taking advantage of various staff turnover, new positions were created with assembling a digital projects team in mind. An increase in technical skills and positions with digital projects responsibilities allowed Schaffer Library to start scaling up digitization efforts. This also came with the recognition that undergraduate student workers were a valuable resource in document imaging, so the library enacted plans to hire and train students for that purpose. 
With increased production, the library identified areas of the workflow that acted as bottlenecks, with quality control checks being one major area. This analysis signaled to Schaffer Library the need to improve quality control for increased efficiency, and it showed that, for some aspects, automation was possible.<\/p>\n<h2>Methodology<\/h2>\n<h3>Quality Control workflow prior to automation<\/h3>\n<p>To consider automating the process, the digitization and quality control workflow was scrutinized. For the existing workflow, the digital projects librarian creates a metadata spreadsheet prior to pulling material for digitization. They then share the spreadsheet and pull boxes or folders of the corresponding records for the students to scan. While scanning, students read the metadata and make sure it matches the physical document. They verify that metadata fields like title, date, and creator match what they can see on the record. After they scan, they enter the extent, or number of pages, of the document into the metadata spreadsheet. Once they have completed the scanning portion, they move on to quality control.<\/p>\n<p>For quality control, students should check work scanned by other students. This is so that they are not desensitized to the records they\u2019ve already gone through. Additionally, if they required more training, their errors would be caught by someone else. When students begin the quality check, they navigate to the folder of completed scans. Each item has its own directory with the scans inside. So, going item by item in the metadata, they search for the item identifier and match it to a directory. They then open the folder to visually confirm that filenames in the parent directory match and that all those filenames match the identifier in the spreadsheet. If there is anything that doesn\u2019t match, they enter \u201cFail\u201d in the QC Pass\/Fail column, which is the last column of the spreadsheet.<\/p>\n<p>Next, if the item is a multipage record, they make sure the page count matches the Extent column in the spreadsheet. If the values are not the same, they fail the check. If the document is fewer than 20 pages, they should look at each page individually. For expediency, if the document is longer than 20 pages, the students are to skim through the images, taking a random sample of approximately 10% of the total page count and verifying the quality of the images. When checking the images, they evaluate the color balance, making sure the color looks approximately correct (not too blue or red, for example). Next they observe the orientation, checking whether any images are misoriented. Overall, students are encouraged to use their best judgment.<\/p>\n<h3>The role of Quality Control in the digitization workflow<\/h3>\n<p>Quality control is vital not just for the accuracy of descriptive metadata; it also ensures that administrative and technical metadata are accurate and complete, making materials findable and usable within digital collections. Additionally, poor quality digitization (like blurry text, missing pages, or incorrect color reproduction) can lead to misinterpretation or make materials unusable for research [<a id=\"ref1\" href=\"#note1\">1<\/a>].<\/p>\n<p>Not only should digitization result in a high quality product, but the data is also vital to the success of the digital repository workflow. If some aspects of the metadata are not exact, it will cause the repository part of the workflow to fail. Each batch of digitized records contains over one hundred images and filenames.
It\u2019s easy to skim hundreds of filenames and miss a transposed identifier or misread a date. Especially when the workflow for student workers ends with quality control and they don\u2019t work with the files in the digital collections repository. In many cases, the data errors were not found until the upload process. Several data points can act as a point of failure during upload, for instance, if the system is unable to find a file through the filepath, no digital object will display with the metadata. If the digital object file is too large, that will also cause display issues. Another is if the date is not in ISO format, the date facet will not work. The upload will also fail if required fields are not filled. These quality control failures will disrupt the workflow and if not caught before upload, undoing the work to repair it is time consuming and resource intensive.<\/p>\n<p>Union College\u2019s digital repository is Archipelago, a flexible, customizable, open source repository created by Metropolitan New York Library Council\u2019s Digital Services Team. The software is built on Drupal with custom modules and indexed with Solr [<a id=\"ref2\" href=\"#note2\">2<\/a>]. For the upload process, the digital files are transferred via FTP from a OneDrive directory to an Amazon S3 server and the metadata spreadsheet is uploaded through the Archipelago interface. The metadata contains the file\u2019s full url path in addition to the other fields. The metadata is processed through a PHP template engine called Twig and calls the files from the S3 path for display [<a id=\"ref3\" href=\"#note3\">3<\/a>]. If the file path is not an exact match, the upload fails. Additionally, there are other variables that, if not correct, will cause the upload or interface functionality to fail.<\/p>\n<p>Part of the responsibilities of the Digital Collections Librarian is uploading the metadata and scans to the digital repository. In doing so, Corinne was encountering these errors during the upload process. After getting several completed batches of digitized records but still encountering some of the same issues, she wrote small scripts to check on those elements. This ad hoc process was sufficient until the Digital Projects team began ramping up digitization. It was then clear that these errors needed to be caught earlier and data checking could be integrated into the workflow.<\/p>\n<p>James was part of a cohort of students hired for digitization. With his background in computer science, he pretty quickly realized that some aspects of the workflow could be automated. Especially in regards to saving time by eliminating some manual processes. With his hands-on experience implementing the workflow, he understood the goal of the application and how it could fit into the workflow going forward. Together, they analyzed the workflow. Corinne\u2019s perspective was from the digital collections repository and what was going wrong during upload and indexing. James\u2019s perspective was rooted in the actual digitization process. Through this process, the following variables were identified that could be systematically evaluated both within the spreadsheet and by comparing the scans to the metadata.<\/p>\n<p>The biggest error encountered during the upload was mismatched or missing filenames. As a result, one of the first checks is determining if the filenames listed in the metadata, have a corresponding file in the scans directory. 
Building on that, for Archipelago, the full file path is required for upload and rendering. Therefore, in addition to the filename, the metadata also has to have the full Amazon Web Services filepath. Both the filename to scan verification and the AWS path check are pattern matching strings which is something scripting is really useful for automating. Similarly, the number of pages in the extent field could be verified automatically. The input guidelines for that field are integer plus format; this means \u201cx pages\u201d with the rich text clarifying the integer represents a page count metric . So by identifying the extent field, if the \u201cpages\u201d string or other format string is stripped out, the integer can be isolated. Then by locating the images, it can determine the actual number of pages that belong to each object. At that point it&#8217;s just a matter of comparing two integers.<\/p>\n<p class=\"caption\">\n<img decoding=\"async\" src=\"\/media\/issue60\/chatnik\/Figure1_C4L.png\"><br \/>\n<strong>Figure 1.<\/strong> An excerpt of a metadata spreadsheet in Excel showing the identifier field, the AWS path to the image, and the extent field with the page numbers.\n<\/p>\n<p>Schaffer Library\u2019s metadata schema has required fields. So for those fields like label, type, ismemberof, and rights_statements it was just a matter of confirming a multi-character string was there.<\/p>\n<p class=\"caption\">\n<img decoding=\"async\" src=\"\/media\/issue60\/chatnik\/Figure2_C4L.png\"><br \/>\n<strong>Figure 2.<\/strong> An excerpt of a metadata spreadsheet in Excel showing the required metadata fields for upload. The ismemberof field holds the identifier of the parent object in Archipelago. The type field triggers Archipelago processing to use a mapped IIIF template. The label field is the same as the title but Archipelago uses it to name the digital object. Finally, rights_statements tells Archipelago which <a href=\"https:\/\/rightsstatements.org\/\">https:\/\/rightsstatements.org\/<\/a> icon to display.\n<\/p>\n<p>For the date_created field the QC check was a bit more complex. There is Apache Solr faceting functionality in archipelago for date faceting so the date needs to be formatted in YYYY-MM-DD (ISO 8601) format. An even more complex match is when the filename is derived from the analog version of the record\u2019s physical location within the collection. For example, an item located in Box 2, Folder 1, will have an identifier represented as ZWU_SCA0319.B02.F01 and those values need to be accurate.<\/p>\n<p class=\"caption\">\n<img decoding=\"async\" src=\"\/media\/issue60\/chatnik\/Figure3_C4L.png\"><br \/>\n<strong>Figure 3.<\/strong> An excerpt of a metadata spreadsheet in Excel showing the more complex fields for the quality control check. Physical location informs the identifier and date_created needs to be in YYYY-MM-DD (ISO 8601) format for Apache solr indexing and faceting.\n<\/p>\n<p>Finally, with hundreds of scans and different people working on different digitization workstations and software, it is possible image file settings will get changed. Archipelago doesn\u2019t handle the display of massively large files well. This needs to be addressed prior to upload so it doesn\u2019t overload the Archipelago system. It was determined that if the size of the image file is over 500 MB it needed to be flagged and its size reduced. 
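<\/p>\n<p>To make the string matching concrete, a small illustrative sketch of isolating the page count from an extent value and confirming an expected AWS path could look like the following; these are not the checks from the application itself (those appear later), and the bucket prefix and file extension shown here are assumptions.<\/p>\n<pre class=\"brush: python; title: ; notranslate\" title=\"\">\r\ndef extent_to_int(extent_value):\r\n    &quot;&quot;&quot;Strip the format string (e.g. &#039;12 pages&#039;) and isolate the page count&quot;&quot;&quot;\r\n    digits = &quot;&quot;.join(ch for ch in str(extent_value) if ch.isdigit())\r\n    return int(digits) if digits else None\r\n\r\ndef aws_path_matches(filename, recorded_path, prefix=&quot;s3:\/\/bucket\/collection\/&quot;):\r\n    &quot;&quot;&quot;Confirm the recorded AWS path is simply the expected prefix plus the filename&quot;&quot;&quot;\r\n    return recorded_path == prefix + filename + &quot;.pdf&quot;\r\n\r\nprint(extent_to_int(&quot;12 pages&quot;))  # 12\r\nprint(aws_path_matches(&quot;ZWU_SCA0319.B02.F01&quot;,\r\n                       &quot;s3:\/\/bucket\/collection\/ZWU_SCA0319.B02.F01.pdf&quot;))  # True\r\n<\/pre>\n<p>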
Considering the metadata and image variables in this way laid the groundwork for automation.<\/p>\n<h2>Application Design<\/h2>\n<h3>Technical Approach<\/h3>\n<p>Python was chosen as the programming language for this application because of shared knowledge, longevity, and Python\u2019s ability to interact with Excel files, which is the format the metadata is stored in. It is also the programming language that James and Corinne have in common. This is important because when James graduates he will no longer be able to support the software. But with Corinne\u2019s knowledge of Python, the program can be maintained and updated. Additionally, Python is the most widely taught programming language at Union College so there\u2019s a greater chance future students can work with the application.<\/p>\n<p>Python is also an attractive choice due to the vast array of packages available for data handling and I\/O operations. Openpyxl is the best choice for reading and writing to Excel files. It allows for more complex formatting than most CSV handlers and it is possible to highlight and change text formatting at a cellular level [<a id=\"ref4\" href=\"#note4\">4<\/a>]. This is important since some discrepancies will be more easily fixed with a manual review and these features will indicate problematic records and fields while maintaining the metadata format. The package also reads into pandas dataframes with column headings rather than Excel column references making it easier to extract and reference data. Since a user interface is provided to make the program more friendly for work study students, tkinter is valuable for providing helpful error messages, and PyQt5 to provide a rich, user-friendly experience for the main functions of the program [<a id=\"ref5\" href=\"#note5\">5<\/a>].<\/p>\n<p>Finally, Python supports object oriented programming. Since records in the metadata spreadsheet refer to physical holdings and digital copies of these holdings with attributes such as pagecount, permanent location and filename, it makes sense to store these records as objects. With a macro lens, sheets of a spreadsheet can be viewed as objects, and perhaps even individual spreadsheets should the scope of the project increase.<\/p>\n<h3>Technical Architecture<\/h3>\n<p>As discussed, it is imperative that the automated processes fit into the existing workflow to minimize disruption to the current digitization processes. Further, the limitations of automation are understood. Recording metadata and reviewing scanned images currently remain manual processes since the technology required to extract data or make human inferences from images is complex and inaccessible at this time. As such, James decided to split the program into three stages. All three of these functions are reliant on the same Excel spreadsheet and file system and will adopt the same object structure and add more fields at each step in the process as they become necessary.<\/p>\n<p>The file structure is as follows:<\/p>\n<ol>\n<li>A spreadsheet contains multiple sheets. These sheets correspond to a physical box which has individualized items (the rows of the spreadsheet).<\/li>\n<li>These items represent both the physical holdings and the files in which the digital copies are stored.<\/li>\n<li>The items have attributes based on their spreadsheet data. 
Attributes include date created, physical location, filename, page extent etc.<\/li>\n<\/ol>\n<h4>Data Validation<\/h4>\n<p>The first function of the program is data validation which should be initiated after the metadata spreadsheet is created. At this stage the file structure is created, the spreadsheet is represented as an object with a list of files, which are also objects with attributes like page count etc. To automate work with Microsoft Excel spreadsheets, James utilized Openpyxl. Openpyxl is a Python library that can read and write to Microsoft Excel files. The program reads the spreadsheet into a dataframe using the pandas Python package. Pandas organizes each row of the excel spreadsheet into a table and has search functions that make the imported data easier to navigate [<a id=\"ref6\" href=\"#note6\">6<\/a>]. The program goes through each record in the pandas dataframe and creates a file object then adds it to the spreadsheet wrapper. This allows a comparison of the physical location data entered for each object to the filename generated from that data which follows the name convention of \u201cBox.x.Folder.y.Item.z\u201d. This function also checks if the date is in ISO format, the international standard for representing dates, and checks the identifier and filepath columns for duplicate filenames.<\/p>\n<p>Some errors and their solutions can be anticipated. Most date errors are resolved by understanding alternate date formats like Month DD, YYYY or MM\/DD\/YYYY. Other variables, such as naming conventions and duplicate filenames, are more complicated and will require manual intervention. Still, this will reduce the time needed to resolve the errors should they be discovered after the physical document has been scanned.<\/p>\n<h4>File Location and Instant Fails<\/h4>\n<p>At the second step, assuming the initial discrepancies have been resolved and the scanning has been completed, the program will search for the scanned documents in OneDrive. This process is one of the most time intensive and, until now, a manual process of copying the filename from the spreadsheet into the search bar of OneDrive. At this stage, the application checks if the file exists, determines if the size of the file is below the 300mb threshold, and if the number of pages in the file matches the extent recorded in the metadata spreadsheet. This all occurs with a single click of a button. Using the glob Python package it recursively scans a user-selected folder with a much lower margin for error and much more quickly than the current manual method. Glob takes a root directory, provided by the user through the UI, and the list of file objects from the sheet, then checks every subfolder for the file path [<a id=\"ref7\" href=\"#note7\">7<\/a>]. If there are missing files, the item automatically fails the check, and a useful message is output to the spreadsheet to indicate to the user that the file could not be located.<\/p>\n<h4>Image Quality Checks<\/h4>\n<p>After searching the file structure, the scanned document\u2019s existence is confirmed or denied. If the file does exist, and thus can continue with the QC process, it also stores a file path that can be used to open the document from within the Python application. This is much quicker than manually searching the file structure, opening the file and following the steps of the QC Guidelines. 
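<\/p>\n<p>On the Windows workstations used for quality control, opening a located scan for visual review can be as simple as handing its stored path to the operating system. The sketch below is illustrative only; the mechanism used inside the application may differ.<\/p>\n<pre class=\"brush: python; title: ; notranslate\" title=\"\">\r\nimport os\r\n\r\ndef open_for_review(file_path):\r\n    &quot;&quot;&quot;Open a located scan in the default Windows viewer for visual QC&quot;&quot;&quot;\r\n    if file_path and os.path.exists(file_path):\r\n        os.startfile(file_path)  # Windows-only; opens the PDF or JPEG in its default application\r\n    else:\r\n        print(&quot;File not found:&quot;, file_path)\r\n<\/pre>\n<p>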
Further, since the Python program has the ability to edit the metadata spreadsheet the application can mark the record as pass or fail, limiting human interactions with the metadata record which could introduce sources of error.<\/p>\n<h3>User Interface Considerations and Design principles<\/h3>\n<p>Command line scripting can seem overwhelming to those unfamiliar, and since this application needs to be integrated into a workflow primarily done by students, an interface was necessary. James designed a graphical user interface (GUI) to make the program more usable to everyone who is part of the digitization workflow. The Windows operating system is the primary OS for Schaffer Library student computers and digitization workstations. With Windows as an application parameter, tkinter will suffice for less detailed error messages. tkinter is a standard Python interface package for Tk GUI. It is used to create simple interactive applications for Python scripts [<a id=\"ref8\" href=\"#note8\">8<\/a>]. The application also utilizes PyQt5, because it is more feature rich than Tk and the drag and drop methods for designing the application in QtDesigner are more accessible to less experienced UI designers [<a id=\"ref9\" href=\"#note9\">9<\/a>]. PyQt5 also benefits from CSS support allowing for aesthetic improvements and hover-over button information which should make the program even more user friendly. Some effort has been made in this regard but it is still in its infancy, should there be a need for further aesthetic improvements to the GUI, this could be done without much complexity by adding to, or altering, the current CSS.<\/p>\n<p>The GUI is important for making sure the application is useful to as many people as possible and that it stays a sustainable part of the workflow even if the developers move on. Where possible the program can utilize existing Python GUI architectures such as EasyGUI for basic tasks like file selection. With a GUI, the program has both a front-end and back-end that are, for the most part, independent of one another. As such, the singleton design pattern can be adopted to link the two, allowing for future independent changes to be made to both whilst ensuring a clean project structure for future development. The singleton, which is considered the main program, is a class guaranteed to only have one instance [<a id=\"ref10\" href=\"#note10\">10<\/a>] and bridges the front-end and back-end operations. More importantly, there is a plan to implement new features and that can be done easily in this framework. When those features are written on the backend, modifying the interface is adding another button to the GUI to trigger the new behavior. For example; initially it may appear as though a spreadsheet can also be a singleton with just one instance of the class, however, this design pattern guarantees expandability by allowing the tracking of multiple spreadsheets should the need arise due to changes in the workflow or increased system requirements. A change like this would require a change to the front end to allow the user to select a spreadsheet from a given list.<\/p>\n<h3>Integration with the current workflow<\/h3>\n<p>Implementing this application with the existing workflow will not create too large of a shift. Though the data is read into the program and transformed for analysis, the program will convert the objects back into dataframes and output them back to the original spreadsheets. 
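<\/p>\n<p>A stripped-down sketch of that round trip is shown below, with a simplified stand-in for the file objects described under Implementation; the class name, attribute names, and column headings are illustrative rather than those of the actual program.<\/p>\n<pre class=\"brush: python; title: ; notranslate\" title=\"\">\r\nimport openpyxl\r\nimport pandas as pd\r\n\r\nclass ScanItem:\r\n    &quot;&quot;&quot;Minimal stand-in for one spreadsheet row, i.e. one digitized item&quot;&quot;&quot;\r\n    def __init__(self, row):\r\n        self.fileName = row.get(&quot;Filename&quot;)\r\n        self.location = row.get(&quot;Physical Location&quot;)\r\n        self.extent = row.get(&quot;Extent&quot;)\r\n        self.date = row.get(&quot;date_created&quot;)\r\n        self.errors = {&quot;Date&quot;: False, &quot;Filename&quot;: False, &quot;DupFilename&quot;: False}\r\n\r\n# Read the sheet into a dataframe, then wrap each row as an object\r\nxl_path = &quot;metadata.xlsx&quot;  # illustrative path\r\nframe = pd.read_excel(xl_path, sheet_name=0)\r\nitems = &#x5B;ScanItem(row) for row in frame.to_dict(orient=&quot;records&quot;)]\r\n\r\n# ... the checks run here and flip flags in the errors dictionaries ...\r\n\r\n# Write results back without disturbing the original layout\r\nwb = openpyxl.load_workbook(xl_path)\r\nws = wb.worksheets&#x5B;0]\r\nfor index, item in enumerate(items, start=2):  # row 1 holds the column headings\r\n    if any(item.errors.values()):\r\n        ws.cell(row=index, column=frame.shape&#x5B;1], value=&quot;Fail&quot;)  # QC Pass\/Fail is the last column\r\nwb.save(xl_path)\r\n<\/pre>\n<p>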
To the user, the spreadsheet seemingly remains unchanged after each check except for color coded flags to indicate errors. Automatic failures due to extent and file size issues alongside files that cannot be located in OneDrive will automatically be indicated in the pass\/fail column and a useful error message produced. It is important to maintain the original data format and retain Excel as the medium for data review. Though a lot of tasks are automated, some variables require human intervention and this method changes that portion of the workflow very little. For the work-study students this application requires very little additional training to adapt to the new process. Moreover, the user interface should be intuitive to use and the overall design is meant to reduce each process to a few mouse clicks.<\/p>\n<h2>Implementation<\/h2>\n<p>The first step in creating the program is implementing the Openpyxl read\/write functionality. At both the read and write stages the file data is ported into rows of a pandas dataframe, so the object file structure described acts as an intermediary medium for processing. Openpyxl handily uses the column headings for dataframe fields. Furthermore, iteration over the rows of the dataframe discards unnecessary data and initializes abstract file objects for each row. At this stage the file objects have attributes; permanent location, filename, extent and date created, and the list of files is stored in the parent spreadsheet. This is also an object with a \u201csheetName\u201d class variable to identify it in the full spreadsheet. Importantly, each file also has an \u201cerrors\u201d and a \u201cfailures\u201d dictionary which can be used at each step to identify issues with the record. The dictionaries contain date, filename, duplicate filename, extent, filesize and existence flags which are all instantiated to False. Updating these boolean flags indicates if an error has been found for the respective file for output to the spreadsheet. In order to translate these objects back into the data frame for writing, the filename can be considered a de-facto unique identifier.<\/p>\n<h3>Duplicates<\/h3>\n<p>Duplicate checks are the least complex but require the most human intervention to remediate. The duplicate check is implemented with a linear search. Since the number of records on each sheet is never more than 200 this method suffices but could be scaled up if the metadata sets grow or for other libraries with higher digitization output. If the function finds a duplicate filename within the list of files, the \u2018DupFilename\u2019 flag is set to True &#8211; this is done for each instance of the duplicate.<\/p>\n<pre class=\"brush: python; title: ; notranslate\" title=\"\">\r\n&quot;&quot;&quot;Checks for duplicate filenames in the list of files\r\nArgs:\r\n    sheet: excel sheet containing a list of files\r\n&quot;&quot;&quot;\r\ndef check_duplicate_filenames(sheet):\r\n    for file in sheet.fileList:\r\n        for comp_file in sheet.fileList:\r\n            if file.fileName == comp_file.fileName and file != comp_file:\r\n                file.errors&#x5B;&#039;DupFilename&#039;] = True\r\n                sheet.errors += 1\r\n<\/pre>\n<p class=\"caption\">\n<strong>Figure 4.<\/strong> Checking the spreadsheet for duplicate values in the filenames field.\n<\/p>\n<h3>Physical Location<\/h3>\n<p>Each filename may not match its physical location. This is the first major check conducted, primarily because at this stage, it is the cause of most failures. 
The file location format in the spreadsheet is plaintext which is traditionally difficult to work with. Location descriptors in the filename, such as \u201cBox\u201d and \u201cFolder\u201d are reduced to \u201cB\u201d and \u201cF\u201d respectively. Increasing the complexity further, the location names also have item descriptors \u201cBulletin\u201d, \u201cSheet\u201d and \u201cItem\u201d &#8211; when translating to filenames, bulletins are reduced to \u201cBull\u201d, sheets remain \u201cSheet\u201d and the item identifier is dropped entirely. Since the aim is to maintain the existing workflow, the file naming conventions cannot change and these identifiers must be accounted for when matching locations with filenames. To effectively compare the location with the filename there are two options: one is to attempt to recover the filename from the location or vice versa. The former was chosen.<\/p>\n<p>The first part of the filename is computed as the most common prefix by taking the first string appearing before the period in the list of filenames in the metadata spreadsheet. Importantly, this ensures the program can be used with different collections. The William Stanley Jr. collection which was used to inform the development of this program has the prefix \u201cZWU_SCA0319\u201d. Files that do not conform to this are marked as errors for manual review accounting for a rare edge case in which an item from a different collection may be grouped incorrectly. The algorithm below is an averager, using a dictionary to track the number of examples of each prefix and returning the mode prefix to the filename predictor subroutine.<\/p>\n<pre class=\"brush: python; title: ; notranslate\" title=\"\">\r\n&quot;&quot;&quot;Determines the correct prefix for the files by finding the most common from the file\r\n    Means the script can be used for different collections\r\nArgs:\r\n    List&#x5B;ScanFile]: list of file objects from excel sheet\r\nReturns:\r\n    String: most likely prefix given all the entries\r\n&quot;&quot;&quot;\r\ndef find_file_prefix(fileList):\r\n    filenameDict = {} #Dictionary containing the prefixes in the document and their count\r\n    for file in fileList:\r\n        try: #ignore any funky filenames\r\n            prefix = file.fileName.split(&#039;.&#039;)&#x5B;0]\r\n            if filenameDict.get(prefix) == None:\r\n                filenameDict&#x5B;prefix] = 1\r\n            else:\r\n                filenameDict&#x5B;prefix] += 1\r\n            return(max(filenameDict)) #Returns the most common prefix - assumes this is correct\r\n        except:\r\n            pass\r\n<\/pre>\n<p class=\"caption\">\n<strong>Figure 5.<\/strong> Checking the filenames for the collection prefix.\n<\/p>\n<p>Splitting the physical location field by commas, the program is able to discard filler words \u201cBox\u201d and \u201cFolder\u201d from physical locations thus converting the data into more usable, numeric form. The \u201cBulletin\u201d and \u201cSheet\u201d identifiers are retained by checking the full string for these substrings meaning item type identifiers are propagated through to the final filename prediction in format \u201cBxx.Fxx.Bullxx\u201d<\/p>\n<pre class=\"brush: python; title: ; notranslate\" title=\"\">\r\n&quot;&quot;&quot;Checks that the filename matches the permanent location for each item in a given sheet\r\n    Reconstructs an expected filename from the location then compares it to what is recorded\r\n    Follows a precise naming convention. 
Box xx, Folder xx, Item type xx\r\n    Records the error in the file&#039;s errors dictionary if there is a discrepancy\r\nArgs:\r\n    sheet: excel sheet containing a list of files\r\n&quot;&quot;&quot;\r\ndef check_location_filename(sheet):\r\n    prefix = find_file_prefix(sheet.fileList)\r\n    for file in sheet.fileList:\r\n        pred_filename = prefix\r\n        if not file.location == None:\r\n\r\n            Location = list(filter(None, file.location.translate(str.maketrans(&#039;&#039;, &#039;&#039;, string.punctuation)).split(&quot; &quot;)))\r\n\r\n            #ignore any funky filenames\r\n            try:\r\n                file.fileName = file.fileName.replace(&quot; &quot;, &quot;&quot;) #Removes any spaces that shouldn&#039;t be in the filename\r\n            except:\r\n                pass\r\n\r\n            if &quot;Box&quot; in Location:\r\n                pred_filename += &quot;.B&quot; + Location&#x5B;1].zfill(2)\r\n\r\n            if &quot;Folder&quot; in Location:\r\n                pred_filename += &quot;.F&quot; + Location&#x5B;3].zfill(2)\r\n            elif len(Location) == 4:\r\n                pred_filename += &quot;.&quot; + Location&#x5B;3].zfill(2) #Accounts for items not in folders\r\n\r\n            if &quot;Bulletin&quot; in Location: #Assumes .Bull. for bulletins\r\n                pred_filename += &quot;.Bull.&quot; + Location&#x5B;5].zfill(2)\r\n            elif &quot;Sheet&quot; in Location: #Assumes .Sheet. for sheets\r\n                pred_filename += &quot;.Sheet.&quot; + Location&#x5B;5].zfill(2)   \r\n            elif len(Location) &gt;= 6: #Assumes no identifier for Items\r\n                pred_filename += &quot;.&quot; + Location&#x5B;5].zfill(2)\r\n\r\n            if pred_filename != file.fileName:\r\n                file.errors&#x5B;&#039;Filename&#039;] = True\r\n                sheet.errors += 1\r\n<\/pre>\n<p class=\"caption\">\n<strong>Figure 6.<\/strong> Checks that the filename assigned matches its permanent location value for each item.\n<\/p>\n<p>After comparing the prediction to the filename recorded in the file object, the program can deduce if there is a mistake and, if so, flips the boolean flag for \u201cFilename\u201d in the error dictionary to True.<\/p>\n<h3>Date Format<\/h3>\n<p>The date formatter is designed to systematically attempt to convert from multiple expected date formats into ISO. The most common issues with formatting are years without months and days, and the use of commas rather than periods or backslashes as separators. 
Of course, some dates are already in the correct format so this is also accounted for.<\/p>\n<pre class=\"brush: python; title: ; notranslate\" title=\"\">\r\n&quot;&quot;&quot;Checks the date format and attempts to format the date if in unexpected form\r\nArgs:\r\n    sheet: excel sheet containing a list of files\r\nReturns:\r\n    Boolean: True to verify the process was executed successfully\r\n&quot;&quot;&quot;\r\ndef check_date_format(sheet):\r\n\r\n    spell = Speller(lang=&quot;en&quot;)\r\n    for file in sheet.fileList:\r\n        success = False\r\n        if file.date != None:\r\n            if not type(file.date) is datetime.datetime:\r\n                try:\r\n                    date = (parse(file.date.rstrip()))\r\n                    success = True\r\n                except:\r\n                    if type(file.date) is str and not success:\r\n                        success, date = attempt_format((file.date))\r\n                    elif type(file.date) is int and not success:\r\n                        success, date = year_to_date((file.date))\r\n                    if not success: #Last ditch effort, successful if incorrect spelling in date\r\n                        try:\r\n                            date = (parse(spell(file.date.rstrip())))\r\n                            date = date.strftime(&quot;%Y-%m-%d&quot;)\r\n                            file.date = date\r\n                            success = True\r\n                        except:\r\n                            file.errors&#x5B;&#039;Date&#039;] = True\r\n                            sheet.errors += 1\r\n                    else:\r\n                        file.date = date.strftime(&quot;%Y-%m-%d&quot;)\r\n            else:\r\n                file.date = file.date.strftime(&quot;%Y-%m-%d&quot;)\r\n    return True\r\n<\/pre>\n<p class=\"caption\">\n<strong>Figure 7.<\/strong> Checks the date format and attempts to format the date if in unexpected form.\n<\/p>\n<p>The main formatter method uses helper functions such as the one shown below with a success flag indicating whether the returned date could be successfully converted. Since the methods use type casting and type specific operations, it is important to encapsulate them within exception handlers (in Python try, except), without this the program would crash due to the variability of formats. Should all the methods fail \u201cDate\u201d is added to the errors list so the record can be manually reviewed. Interestingly, a common date issue was the misspelling of written dates, again highlighting the abundance of human error. This was tackled by using the autocorrect Speller package after which we are able to convert into ISO using regular type casting.<\/p>\n<pre class=\"brush: python; title: ; notranslate\" title=\"\">\r\n&quot;&quot;&quot;Converts dates from year to year and day. E.g. 1980 to 1980-01-01\r\nArgs:\r\n    date: date in year form\r\nReturns:\r\n    &#x5B;Boolean, date]: success flag and date in ISO format\r\n&quot;&quot;&quot;\r\ndef year_to_date(date):\r\n    try:\r\n        date_new = datetime.datetime(date, 1, 1)\r\n        return &#x5B;True, date_new]\r\n    except:\r\n        return &#x5B;False, date]\r\n<\/pre>\n<p class=\"caption\">\n<strong>Figure 8.<\/strong> Converts dates written as a date range to ISO format.\n<\/p>\n<h3>File Existance, Size and Extent<\/h3>\n<p>The largest source of failure came from missing files within the OneDrive file structure. 
Since individual scans are saved as jpeg images and multi-page scans are saved as PDFs the program searches for both cases using rglob, a recursive search package that can find files within a larger file structure. The parent directory, selected by the user, is stored in the program singleton meaning the file structure can be searched by glob and the full file paths can be stored into the file object. The method uses easyGUI which in turn uses Windows File Explorer to again ensure a familiar user interface for the quality control student.<\/p>\n<pre class=\"brush: python; title: ; notranslate\" title=\"\">\r\n&quot;&quot;&quot;Opens an EasyGUI window to allow the user to select the file they want to parse\r\nReturns:\r\n    String: filepath of the selected file\r\n&quot;&quot;&quot;\r\ndef get_file():\r\n        path = easygui.diropenbox()\r\n        return path\r\n\r\ndef find_file(folder):\r\n        for path in Path(folder).rglob(&#039;*.pyc&#039;):\r\n                print(path.name)\r\n        return True\r\n<\/pre>\n<p class=\"caption\">\n<strong>Figure 9.<\/strong> Opens an EasyGUI window to allow the user to select the file they want to parse.\n<\/p>\n<p>Since the file structure is complicated and some files are at greater depths within subfolders, there is a moderate time requirement for this step. Helpful print messages display, showing the process is in fact still running for the approximate 1 minute it takes to look for every file. If the file is found, the \u201cexists\u201d flag is set to true. While the program verifies the existence of each file, it also checks the file size. The size of the file in megabytes, is read into the file_size variable within the file object. If the extent falls above the preset size threshold of 300MB, the too_large variable is set to true, otherwise it is set to false.<\/p>\n<pre class=\"brush: python; title: ; notranslate\" title=\"\">\r\n&quot;&quot;&quot;Conducts the preliminary QC checks\r\n    Checks if the file exists in the file structure, if the extent is correct \r\n    and if the file size is less than 300mb\r\n    Adds the respective failures to the count and to the file&#039;s failure dictionary\r\nArgs:\r\n    sheet: excel sheet containing a list of files\r\n    parent_directory: the OneDrive parent folder to search through\r\n&quot;&quot;&quot;\r\ndef check_files(sheet, parent_directory):\r\n\r\n    failures = 0\r\n\r\n    for file in sheet.fileList:\r\n        try:\r\n            for path in Path(parent_directory).rglob(file.fileName + &#039;.&#x5B;pdf jpg]*&#039;):\r\n\r\n                file.filePath = (parent_directory + &quot;\\\\&quot; + file.fileName + &quot;.pdf&quot;) # Can I do this using path above?\r\n                # We can&#039;t assume this will be a pdf as single pages stored as jpg\r\n\r\n                file.exists = True\r\n\r\n                # We perhaps can&#039;t assume this is correct since some don&#039;t have parent folders\r\n                if file.extent == len(os.listdir(path.parent.absolute())) - 1 or file.extent == None:\r\n                    file.failures&#x5B;&#039;Extent&#039;] = False\r\n                else:\r\n                    file.failures&#x5B;&#039;Extent&#039;] = True\r\n                    failures += 1\r\n\r\n                if not (os.path.getsize(file.filePath) &gt;&gt; 20) &lt; 300:\r\n                    file.failures&#x5B;&#039;Filesize&#039;] = True\r\n                    failures += 1\r\n\r\n        except:\r\n            pass\r\n        \r\n        if not file.exists:\r\n     
       file.failures&#x5B;&#039;Existence&#039;] = True\r\n            failures += 1\r\n\r\n    sheet.failures = failures # Add number of failures to the sheet object\r\n<\/pre>\n<p class=\"caption\">\n<strong>Figure 10.<\/strong> Checks if the file listed exists and the file size.\n<\/p>\n<p>Checking page count proved to be more complex than initially anticipated. Since the files are hosted using OneDrive, to use the built-in page enumeration tools provided in Python\u2019s PDF handling packages, the PDFs would have to be downloaded. This is not possible given the size and number of files. Fortunately the scanning software used at Union College maintains a copy of each page in jpeg form inside each subfolder. Utilizing this structure, the program is able to count the number of jpegs and compare this with the recorded value, but this process is highly tailored to the digitization process at the college and may need adapting elsewhere.<\/p>\n<p>For every type of failure a helpful error message is added to the data frame along with a \u201cfail\u201d in the necessary column. The program tracks the failure rates for each of the three categories which can be used to evaluate the accuracy of different parts of the digitization process.<\/p>\n<h3>Indicating Errors<\/h3>\n<p>As described, there are many errors that are too complex to be remedied by the program and must be highlighted for human review. With Openpyxl we can do this literally by selecting a color, highlighting the problematic row, and overwriting the metadata spreadsheet. The program currently uses orange (Hex #EFBE7D) to indicate file naming issues, blue (Hex #8BD3E6) to indicate duplicates, and yellow (Hex #E9EC6B) to indicate date errors. These color codes are assigned but a new method of selecting colors is also being considered.<\/p>\n<p>The program will likely be run more than once at each stage, allowing for errors to be manually rectified and the application run again. As a result, the first step is resetting the color of rows within the spreadsheet. The automation program is only one source of highlighting since the Digital Projects and Metadata Librarian also use various colors to convey information about records. So those messages aren\u2019t affected; the spreadsheet only resets the colors with the precise hex values used by the program. 
<pre class=\"brush: python; title: ; notranslate\" title=\"\">\r\nimport openpyxl\r\nimport pandas as pd\r\n\r\n\r\ndef reset_colors(ExcelFile, wb, colors_to_remove):\r\n    &quot;&quot;&quot;Removes specific highlight colors from the spreadsheet so the program can be run\r\n    repeatedly after each error is rectified.\r\n\r\n    Args:\r\n        ExcelFile: Excel file object to remove colors from - contains the error colors dictionary\r\n        wb: workbook opened with Openpyxl\r\n        colors_to_remove: allows specification of individual colors so running spreadsheetChecks alone\r\n            (for example) doesn&#039;t remove failure highlighting\r\n    &quot;&quot;&quot;\r\n    fill_reset = openpyxl.styles.PatternFill(fill_type=None)\r\n    for sheet in ExcelFile.sheetList:\r\n        ws = wb&#x5B;sheet.sheetName]\r\n        for row in ws.iter_rows():\r\n            for cell in row:\r\n                if cell.fill.start_color.index in colors_to_remove.values():\r\n                    cell.fill = fill_reset\r\n\r\n\r\ndef highlight_errors(ExcelFile):\r\n    &quot;&quot;&quot;Highlights rows with errors and\/or failures in their corresponding hex color.\r\n\r\n    Since the function pulls from both the error dictionary and the failure dictionary,\r\n    it can be used in both parts of the program.\r\n\r\n    Args:\r\n        ExcelFile: contains the failure\/error information and the highlighting colors\r\n    Returns:\r\n        Boolean: True if the save is successful, False if not\r\n    &quot;&quot;&quot;\r\n    xl_file = pd.ExcelFile(ExcelFile.filePath)\r\n    wb = openpyxl.load_workbook(ExcelFile.filePath)\r\n\r\n    reset_colors(ExcelFile, wb, ExcelFile.errorColors)\r\n\r\n    for sheet in ExcelFile.sheetList:\r\n        dt = pd.read_excel(xl_file, sheet.sheetName)\r\n        ws = wb&#x5B;sheet.sheetName]\r\n\r\n        # Merge the error and failure dictionaries, and their color lookups\r\n        errors = sheet.getSheetErrorDict()\r\n        errors.update(sheet.getSheetFailureDict())\r\n\r\n        colors = ExcelFile.errorColors\r\n        colors.update(ExcelFile.failColors)\r\n\r\n        for file in errors:\r\n            error_color = colors&#x5B;errors&#x5B;file]]\r\n\r\n            if error_color is not None:\r\n                fill = openpyxl.styles.PatternFill(start_color=error_color, end_color=error_color, fill_type=&quot;solid&quot;)\r\n                for index, row in dt.iterrows():\r\n                    try:\r\n                        # Fill every cell in the row whose Filename matches the failing record\r\n                        if file == dt&#x5B;&#039;Filename&#039;]&#x5B;index]:\r\n                            for y in range(1, ws.max_column + 1):\r\n                                ws.cell(row=index + 2, column=y).fill = fill\r\n                    except Exception:\r\n                        pass\r\n\r\n    try:\r\n        wb.save(ExcelFile.filePath)\r\n        wb.close()\r\n        xl_file.close()\r\n        return True\r\n    except Exception:\r\n        # The save fails if the spreadsheet is open in Excel, which locks the file\r\n        wb.close()\r\n        xl_file.close()\r\n        return False\r\n<\/pre>\n<p class=\"caption\">\n<strong>Figure 11.<\/strong> Highlights rows with errors and\/or failures in their corresponding hex color.\n<\/p>\n<p>Since the failure and error dictionaries share the same keys as the color dictionaries, they can be referenced alongside one another to determine which color to use for which record. The highlight_errors function loops over each error and failure, selects the color, locates the record in the workbook, and fills the corresponding row. Again, exception handling is of paramount importance: if the Excel file is currently open, Openpyxl does not have the permission required to edit it. Throughout the program, including here, tkinter windows are used to inform the user of such errors, since Python\u2019s error output is difficult to read for the untrained eye.<\/p>
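<p>As one illustration of how such a message might be raised, the sketch below wraps a failed save in a simple tkinter message box. The show_save_error function and its wording are hypothetical rather than taken from the production application, and assume only the standard library; in the program, a call like this would sit on the branch where highlight_errors returns False.<\/p>\n<pre class=\"brush: python; title: ; notranslate\" title=\"\">\r\nimport tkinter as tk\r\nfrom tkinter import messagebox\r\n\r\n\r\ndef show_save_error(file_path):\r\n    &quot;&quot;&quot;Display a plain-language dialog instead of a raw Python traceback.&quot;&quot;&quot;\r\n    root = tk.Tk()\r\n    root.withdraw()  # no main window is needed, only the dialog\r\n    messagebox.showerror(\r\n        &quot;Quality Control Automation&quot;,\r\n        &quot;Could not save &quot; + file_path + &quot;. Close the spreadsheet in Excel and run the check again.&quot;\r\n    )\r\n    root.destroy()\r\n<\/pre>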
<p class=\"caption\">\n<img decoding=\"async\" src=\"\/media\/issue60\/chatnik\/Figure12_C4L.png\"><br \/>\n<strong>Figure 12.<\/strong> An unfortunate shift causing mismatched filenames has been caught and highlighted by the QC program.\n<\/p>\n<p>In some cases no user input is required; this is mainly true for automatically fixed dates, which are not indicated with color on the spreadsheet, and for automatic fails. To stay consistent with the current workflow, the program uses the \u201cQC Results\u201d, \u201cQC Initials\u201d and \u201cQC Comments\u201d columns to mark records that have failed, writing descriptive output messages and an AUTO flag to indicate that the fail was automated. Similar output messages are used when the file size is too large or the page count is incorrect. The whole row for an auto fail is highlighted red for immediate visibility when the spreadsheet is opened; a sketch of how these columns might be written appears after Figure 13.<\/p>\n<p class=\"caption\">\n<img decoding=\"async\" src=\"\/media\/issue60\/chatnik\/Figure13_C4L.png\"><br \/>\n<strong>Figure 13.<\/strong> Two files with auto fails: one because the file cannot be located, the other because of an incorrect page count.\n<\/p>
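<p>A minimal, hypothetical version of that marking step is sketched below. The column positions, the spreadsheet filename, and the mark_auto_fail helper are assumptions for illustration; the production code drives the same idea through its file objects and the QC columns named above.<\/p>\n<pre class=\"brush: python; title: ; notranslate\" title=\"\">\r\nimport openpyxl\r\nfrom openpyxl.styles import PatternFill\r\n\r\nRED = &quot;FFFF0000&quot;  # assumed aRGB value for the auto-fail highlight\r\n\r\n\r\ndef mark_auto_fail(ws, row, message, results_col=10, initials_col=11, comments_col=12):\r\n    &quot;&quot;&quot;Write an automated fail into the QC columns and highlight the whole row red.&quot;&quot;&quot;\r\n    ws.cell(row=row, column=results_col).value = &quot;Fail&quot;\r\n    ws.cell(row=row, column=initials_col).value = &quot;AUTO&quot;\r\n    ws.cell(row=row, column=comments_col).value = message\r\n    fill = PatternFill(start_color=RED, end_color=RED, fill_type=&quot;solid&quot;)\r\n    for col in range(1, ws.max_column + 1):\r\n        ws.cell(row=row, column=col).fill = fill\r\n\r\n\r\n# Hypothetical usage against a copy of the metadata spreadsheet\r\nwb = openpyxl.load_workbook(&quot;metadata.xlsx&quot;)\r\nmark_auto_fail(wb.active, row=5, message=&quot;File could not be located in OneDrive&quot;)\r\nwb.save(&quot;metadata.xlsx&quot;)\r\n<\/pre>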
<h2>Results and Evaluation<\/h2>\n<p>To test the Quality Control Automation program\u2019s first step, the application was run on a spreadsheet whose items had already been scanned. This immediately highlighted an issue where filenames were duplicated across different records. While this meant additional labor to remediate the scanned batch, the problem might never have been caught at upload, because technically each filename was correct and had a scanned image; the records would simply have been inaccurate for users of the repository. Further, an unfortunate shift in cells, not obvious to the human eye, meant item n was saved with the filename for item n+1, which would cause issues when uploading items to Archipelago. If the data type were wrong, such as a date in a field expecting a string, the upload would have failed. If the data type were acceptable, the upload would have carried on and the metadata would display in the wrong field or not at all, leading to inaccuracies and poor presentation.<\/p>\n<p>Dates can be tricky in Excel: even if entered correctly, when the format settings are off, Excel will automatically change the date pattern. Different people also record dates in different ways, which becomes an issue when metadata passes through many hands. The application was able to resolve 100% of dates that were in an incorrect format. While developing the program, James discovered that dates were usually in a predictable and readable format, hence the success rate, but not one that is useful for Archipelago\u2019s SOLR index. The application reformats such dates to the ISO 8601 standard automatically, reducing the manual labor required for post-processing.<\/p>
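<p>A compact sketch of this kind of normalization is shown below. It is a simplified stand-in for the program\u2019s own date handling, and the list of input patterns is an assumption rather than the exact set encountered at Union College.<\/p>\n<pre class=\"brush: python; title: ; notranslate\" title=\"\">\r\nfrom datetime import datetime\r\n\r\n# Common spreadsheet date patterns to try (assumed, not exhaustive)\r\nPATTERNS = &#x5B;&quot;%m\/%d\/%Y&quot;, &quot;%m-%d-%Y&quot;, &quot;%B %d, %Y&quot;, &quot;%d %B %Y&quot;, &quot;%Y\/%m\/%d&quot;]\r\n\r\n\r\ndef to_iso_8601(value):\r\n    &quot;&quot;&quot;Return the date as YYYY-MM-DD, or None if no known pattern matches.&quot;&quot;&quot;\r\n    for pattern in PATTERNS:\r\n        try:\r\n            return datetime.strptime(value.strip(), pattern).strftime(&quot;%Y-%m-%d&quot;)\r\n        except ValueError:\r\n            continue\r\n    return None\r\n\r\n\r\nprint(to_iso_8601(&quot;March 4, 1957&quot;))  # 1957-03-04\r\nprint(to_iso_8601(&quot;3\/4\/1957&quot;))       # 1957-03-04\r\n<\/pre>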
<p>The second stage of the automation workflow highlighted a major issue: in the spreadsheet, 32% of items were marked as scanned but could not be located in the OneDrive folder. Evidently something went wrong in the digitization workflow, which could mean scans were stored in a different directory and, as a result, items were scanned multiple times unnecessarily. Even though the solution to this issue requires manual intervention, the report still represents a massive time savings. Previously, a student would take approximately two minutes per file to attempt to find it, enter a fail, and leave a note in the spreadsheet, and a supervising librarian would then attempt the same check, for 32% of the files. In the test file this equates to 131 failures, or roughly four hours of student time per spreadsheet (131 files at about two minutes each) before the librarian\u2019s duplicate check, saving time and providing invaluable insight into how the QC process can be improved outside of this automation task.<\/p>\n<p>The program also identifies files that are too large to be uploaded during this step. This is again an error that requires manual intervention once discovered, but flagging it early eliminates the frustration of finding out during the upload process. The program likewise detects discrepancies between the spreadsheet\u2019s Extent column and the scan\u2019s actual number of pages. In one recorded case, this zeroed in on an item where a page had not uploaded correctly, so the item needed rescanning. The check may seem superfluous, but when the record is a PDF, remembering to count its pages while the document is open, amongst all the other checks, is easy to overlook.<\/p>\n<div class=\"caption\" style=\"width: 100%\">\n<p><strong>Table 1.<\/strong> Preliminary error rates, representing 23 record failures across the spreadsheet.<\/p>\n<table style=\"margin: 0 auto;\">\n<tr>\n<td><strong>Box<\/strong><\/td>\n<td><strong>Failure Rate<\/strong><\/td>\n<\/tr>\n<tr>\n<td>3 &#038; 4<\/td>\n<td>3.8%<\/td>\n<\/tr>\n<tr>\n<td>5<\/td>\n<td>8.3%<\/td>\n<\/tr>\n<tr>\n<td>6<\/td>\n<td>6.3%<\/td>\n<\/tr>\n<tr>\n<td>7<\/td>\n<td>2.8%<\/td>\n<\/tr>\n<\/table>\n<\/div>\n<p>The failure rate for the second step on the test data is 36.84%, which accounts for approximately 144 automatic fails and suggests there is a problem somewhere else in the workflow. The majority of these failures are due to missing files, but a handful are due to oversized files and incorrect page counts.<\/p>\n<h2>Conclusion<\/h2>\n<p>The collaboration between the Digital Collections and Preservation Librarian and a Computer Science major resulted in an application and outcomes that exceeded expectations. Not only was a solid product developed to improve and expedite parts of the digitization workflow, but the process was a valuable experiential learning opportunity for the computer science student. The librarian brought the high-level workflow needs, analysis, and experience with post-digitization issues, while the student provided more sophisticated programming skills and hands-on digitization experience. The analysis makes the amount of human error in manual data entry evident. Of course, many aspects are better handled by people, and removing the human element from libraries by automating everything carries risk, but using automation as a tool to check over manual work is highly valuable. It was also noted that it is much easier to write software to do a job after having been trained on the manual workflow. Often, computer programmers do not have extensive knowledge of the system requirements and therefore require a consultancy period. This method of development is an interesting model and can be considered in other university libraries with CS students. This integration of the programmer into the original workflow highlighted the importance of fitting any new tool into the existing workflow rather than trying to overhaul the entire process.<\/p>\n<p>The application is able to find things that are invisible to the human eye and to provide helpful statistics, such as the 2-8% failure rate for filenames and duplicates and the 36% upload failure rate. It helps identify where errors are introduced, with statistics to support it. The process ensures consistency and creates a common quality control standard for every collection that passes through it. Without it, one wrong character could grind the process to a halt.<\/p>\n<p>The implementation of this application leaves room for further development. Other areas of quality control are being scrutinized for automation; the team is especially interested in visual aspects of the checks such as color balance, skewed images, and image cropping. The application itself is relatively new to the workflow, but over time Schaffer Library hopes to determine the actual time savings from producing a system like this. It is important to determine whether the time taken to build the program was worth the quality control hours saved before more resources are put into expanding the application.<\/p>\n<h2>References<\/h2>\n<p>[<a id=\"note1\" href=\"#ref1\">1<\/a>] Federal Agencies Digital Guidelines Initiative. 2023 May. Technical Guidelines for Digitizing Cultural Heritage Materials: Third Edition. U.S. National Archives and Records Administration. <a href=\"https:\/\/www.digitizationguidelines.gov\">https:\/\/www.digitizationguidelines.gov<\/a><\/p>\n<p>[<a id=\"note10\" href=\"#ref10\">10<\/a>] Gamma E, Helm R, Johnson R, Vlissides J. 1994. Design patterns: Elements of reusable object-oriented software. 40th printed ed. Boston: Addison-Wesley Professional.<\/p>\n<p>[<a id=\"note4\" href=\"#ref4\">4<\/a>][<a id=\"note6\" href=\"#ref6\">6<\/a>] Gazoni E, Clark C. 2024. openpyxl &#8211; A Python library to read\/write Excel 2010 xlsx\/xlsm files; [accessed 2025 Jan 9]. Available from: <a href=\"https:\/\/openpyxl.readthedocs.io\/en\/stable\/\">https:\/\/openpyxl.readthedocs.io\/en\/stable\/<\/a>.<\/p>\n<p>[<a id=\"note7\" href=\"#ref7\">7<\/a>] Python Software Foundation. 2025. glob &#8211; Unix style pathname pattern expansion; [accessed 2025 Jan 9]. Available from: <a href=\"https:\/\/docs.Python.org\/3\/library\/glob.html\">https:\/\/docs.Python.org\/3\/library\/glob.html<\/a>.<\/p>\n<p>[<a id=\"note5\" href=\"#ref5\">5<\/a>][<a id=\"note8\" href=\"#ref8\">8<\/a>] Python Software Foundation. 2025. Graphical User Interfaces with Tk; [accessed 2025 Jan 9]. Available from: <a href=\"https:\/\/docs.Python.org\/3\/library\/tk.html\">https:\/\/docs.Python.org\/3\/library\/tk.html<\/a>.<\/p>\n<p>[<a id=\"note2\" href=\"#ref2\">2<\/a>] Pino D. 2024. Archipelago Commons Intro; [accessed 2025 Jan 9]. Available from: <a href=\"https:\/\/docs.archipelago.nyc\/1.4.0\/\">https:\/\/docs.archipelago.nyc\/1.4.0\/<\/a>.<\/p>\n<p>[<a id=\"note9\" href=\"#ref9\">9<\/a>] The Qt Company. 2023. PyQt5 Reference Guide; [accessed 2025 Jan 9]. Available from: <a href=\"https:\/\/www.riverbankcomputing.com\/static\/Docs\/PyQt5\/introduction.html\">https:\/\/www.riverbankcomputing.com\/static\/Docs\/PyQt5\/introduction.html<\/a>.<\/p>\n<p>[<a id=\"note3\" href=\"#ref3\">3<\/a>] Twig Team. 2025. 
Twig | The flexible, fast, and secure template engine for PHP; [accessed 2025 Jan 9] Available from: <a href=\"https:\/\/twig.symfony.com\/\">https:\/\/twig.symfony.com\/<\/a>.<\/p>\n<h2>Notes<\/h2>\n<p>The working code referenced in this article is available at: <a href=\"https:\/\/github.com\/schaffer-library\/QualityControlAutomation\">https:\/\/github.com\/schaffer-library\/QualityControlAutomation<\/a><\/p>\n<h2>About the authors<\/h2>\n<p><em>Corinne Chatnik<\/em> is the Digital Collections and Preservation Librarian at Union College in Schenectady, NY. She earned her MLIS from the University of Alabama. She was previously a professional archivist specializing in digital archiving at the New York State Archives.<\/p>\n<p>Author email: chatnikc@union.edu<br \/>\nAuthor URL: <a href=\"https:\/\/orcid.org\/0009-0004-7229-5431\">https:\/\/orcid.org\/0009-0004-7229-5431<\/a><\/p>\n<p><em>James Gaskell<\/em> is a Senior at Union College, majoring in Computer Science and minoring in Electrical Engineering. His main areas of study are evolutionary algorithms and software verification. He is also currently a work-study student at Union College\u2019s Schaffer Library.<\/p>\n<p>Author email: gaskellj@union.edu<br \/>\nAuthor URL: <a href=\"https:\/\/orcid.org\/0009-0002-9361-6172\">https:\/\/orcid.org\/0009-0002-9361-6172<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>At Union College Schaffer Library, the digitization lab is mostly staffed by undergraduates who only work a handful of hours a week. While they do a great job, the infrequency of their work hours and lack of experience results in errors in digitization and metadata. Many of these errors are difficult to catch during quality control checks because they are so minute, such as a missed counted page number here, or a transposed character in a filename there. So, a Computer Science student and a librarian collaborated to create a quality control automation application for the digitization workflow. The application is written in Python and relies heavily on using Openpyxl libraries to check the metadata spreadsheet and compare metadata with the digitized files. This article discusses the purpose and theory behind the Quality Control application, how hands-on experience with the digitization workflow informs automation, the methodology, and the user interface decisions. The goal of this application is to make it usable by other students and staff and to build it into the workflow in the future. This collaboration resulted in an experiential learning opportunity that has benefited the student&#8217;s ability to apply what they have learned in class to a real-world problem. 
<\/p>\n","protected":false},"author":508,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[131],"tags":[],"class_list":["post-18340","post","type-post","status-publish","format-standard","hentry","category-issue60"],"_links":{"self":[{"href":"https:\/\/journal.code4lib.org\/wp-json\/wp\/v2\/posts\/18340","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/journal.code4lib.org\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/journal.code4lib.org\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/journal.code4lib.org\/wp-json\/wp\/v2\/users\/508"}],"replies":[{"embeddable":true,"href":"https:\/\/journal.code4lib.org\/wp-json\/wp\/v2\/comments?post=18340"}],"version-history":[{"count":0,"href":"https:\/\/journal.code4lib.org\/wp-json\/wp\/v2\/posts\/18340\/revisions"}],"wp:attachment":[{"href":"https:\/\/journal.code4lib.org\/wp-json\/wp\/v2\/media?parent=18340"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/journal.code4lib.org\/wp-json\/wp\/v2\/categories?post=18340"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/journal.code4lib.org\/wp-json\/wp\/v2\/tags?post=18340"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}]