Search Results

Showing 26 articles matching "hydra"

Building and Maintaining Metadata Aggregation Workflows Using Apache Airflow

Issue 52 | 2021-09-22

Leanne Finnigan and Emily Toner

PA Digital is a Pennsylvania network that serves as the state’s service hub for the Digital Public Library of America (DPLA). In 2014, the group developed a homegrown aggregation system used to harvest digital collection records from contributing institutions, validate and transform their metadata, and deliver aggregated records to the DPLA. Since its initial launch, PA Digital has expanded significantly, harvesting from an increasing number of contributors with a variety of repository systems. With each new system, the highly customized aggregator software became more complex and difficult to maintain, and by 2018, PA Digital staff had determined that a new solution was needed. From 2019 to 2021, a cross-functional team implemented a more flexible and scalable approach to metadata aggregation for PA Digital, using Apache Airflow for workflow management and Solr/Blacklight for internal metadata review. In this article, we outline how we use these applications and the new workflows we adopted, which afford our metadata specialists more autonomy to contribute directly to the aggregator’s ongoing development. We also discuss how this work fits into our broader sustainability planning as a network and how the team leveraged shared expertise to build a more stable approach to maintenance.
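The pipeline described (harvest, validate, transform, deliver) maps naturally onto an Airflow DAG. Below is a minimal sketch in Airflow 2.x; the four task callables are hypothetical stand-ins, not PA Digital's published code:

    # Minimal Airflow 2.x sketch of a harvest -> validate -> transform -> publish
    # aggregation pipeline. All callables are hypothetical placeholders.
    from datetime import datetime
    from airflow import DAG
    from airflow.operators.python import PythonOperator

    def harvest_oai():        # fetch records from a contributor's OAI-PMH endpoint
        ...

    def validate_records():   # check harvested records against the profile
        ...

    def transform_records():  # map contributor metadata to the DPLA profile
        ...

    def publish_to_dpla():    # deliver the aggregated records
        ...

    with DAG(
        dag_id="aggregation_sketch",
        start_date=datetime(2021, 1, 1),
        schedule_interval="@weekly",
        catchup=False,
    ) as dag:
        harvest = PythonOperator(task_id="harvest", python_callable=harvest_oai)
        validate = PythonOperator(task_id="validate", python_callable=validate_records)
        transform = PythonOperator(task_id="transform", python_callable=transform_records)
        publish = PythonOperator(task_id="publish", python_callable=publish_to_dpla)

        harvest >> validate >> transform >> publish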

Always Be Migrating

Issue 50 | 2021-02-10

Elizabeth McAulay

At the University of California, Los Angeles, the Digital Library Program is in the midst of a large, multi-faceted migration project. This article presents a narrative of migration and a new mindset for technology and library staff in their ever-changing infrastructure and systems. This article posits that migration from system to system should be integrated into normal activities so that it is not a singular event or major project, but so that it is a process built into the core activities of a unit.

Extending and Adapting Metadata Audit Tools for Mountain West Digital Library Members

Issue 41 | 2018-08-09

Teresa K. Hebron

As a DPLA regional service hub, Mountain West Digital Library harvests metadata from 16 member repositories representing over 70 partners throughout the Western US and hosts over 950,000 records in its portal. The collections harvested range in size from a handful of records to many thousands, presenting both quality control and efficiency challenges. To help members audit records against the MWDL Metadata Application Profile before harvesting, MWDL hosts a metadata auditing tool adapted from the North Carolina Digital Heritage Center’s DPLA OAI Aggregation Tools project, available on GitHub. The tool runs XSL tests against a repository’s OAI-PMH stream to check incoming data for conformance with the MWDL Metadata Application Profile. The tool enables student workers and non-professionals to perform large-scale metadata auditing even without prior knowledge of application profiles or metadata auditing workflows.

In the spring of 2018, we further adapted and extended this tool to audit collections coming from a new member, Oregon Digital. The OAI-PMH provision from Oregon Digital’s Samvera repository is configured differently from that of the CONTENTdm repositories used by existing MWDL members, which required adapting the tool. We also extended the tool by adding the Dublin Core Facet Viewer, which allows viewing and analyzing the values used in both required and recommended fields by frequency.

Use of this tool enhances metadata completeness, correctness, and consistency. This article discusses the technical challenges of the project, offers code samples, and suggests ideas for further updates.
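The audit idea itself is compact: walk the OAI-PMH stream and flag records missing fields the application profile requires. The production tool does this in XSLT; the following is a rough Python re-implementation of the same check, with the endpoint URL and required-field list as illustrative assumptions rather than MWDL's actual configuration:

    # Rough Python analogue of the XSLT audit: list oai_dc records in an
    # OAI-PMH stream that are missing required fields. The endpoint and the
    # field list are placeholder assumptions.
    import requests
    from lxml import etree

    NS = {"oai": "http://www.openarchives.org/OAI/2.0/",
          "dc": "http://purl.org/dc/elements/1.1/"}
    REQUIRED = ["title", "date", "rights", "identifier"]  # placeholder profile

    resp = requests.get("https://example.org/oai",  # hypothetical endpoint
                        params={"verb": "ListRecords", "metadataPrefix": "oai_dc"})
    root = etree.fromstring(resp.content)

    for record in root.iterfind(".//oai:record", NS):
        header_id = record.findtext(".//oai:identifier", namespaces=NS)
        missing = [f for f in REQUIRED if record.find(f".//dc:{f}", NS) is None]
        if missing:
            print(f"{header_id}: missing {', '.join(missing)}")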

Recount: Revisiting the 42nd Canadian Federal Election to Evaluate the Efficacy of Retroactive Tweet Collection

Issue 37 | 2017-07-18

Anthony T. Pinter and Ben Goldman

In this paper, we report the development and testing of a methodology for collecting tweets from periods beyond the Twitter API’s seven-to-nine-day limit. To accomplish this, we used Twitter’s advanced search feature to find tweets older than that limit, then used JavaScript to automatically scan the resulting webpage for tweet IDs. These IDs were then rehydrated (their full tweet metadata retrieved) using twarc. To examine the efficacy of this method for retrospective collection, we revisited the case study of the 42nd Canadian Federal Election. Comparing the two datasets, we found that our methodology does not produce results as robust as real-time streaming, but that it might be useful as a starting point for researchers or collectors. We close by discussing the implications of these findings.
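Rehydration with twarc (1.x) is a short loop. A minimal sketch, assuming placeholder credentials and a placeholder ids.txt file of one tweet ID per line:

    # Rehydrate tweet IDs into full tweet JSON with twarc 1.x.
    # Credentials and file names are placeholders.
    import json
    from twarc import Twarc

    t = Twarc(consumer_key="...", consumer_secret="...",
              access_token="...", access_token_secret="...")

    with open("ids.txt") as ids, open("tweets.jsonl", "w") as out:
        # hydrate() takes an iterable of tweet IDs and yields full tweet dicts
        for tweet in t.hydrate(ids):
            out.write(json.dumps(tweet) + "\n")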

The Semantics of Metadata: Avalon Media System and the Move to RDF

Issue 37 | 2017-07-18

Juliet L. Hardesty and Jennifer B. Young

The Avalon Media System (Avalon) provides access and management for digital audio and video collections in libraries and archives. The open source project is led by the libraries of Indiana University Bloomington and Northwestern University and is funded in part by grants from The Andrew W. Mellon Foundation and Institute of Museum and Library Services.

Avalon is based on the Samvera Community (formerly Hydra Project) software stack and uses Fedora as the digital repository back end. The Avalon project team is in the process of migrating digital repositories from Fedora 3 to Fedora 4 and incorporating metadata statements using the Resource Description Framework (RDF) instead of XML files accompanying the digital objects in the repository. The Avalon team has worked on the migration path for technical metadata and is now working on the migration paths for structural metadata (PCDM) and descriptive metadata (from MODS XML to RDF). This paper covers the decisions made to begin using RDF for software development and offers a window into how Semantic Web technology functions in the real world.
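As a small illustration of the descriptive-metadata direction, here is how a single MODS title statement might be restated as an RDF triple with rdflib. The dcterms:title mapping and the object URI are assumed examples, not Avalon's published mapping:

    # Toy illustration of moving one descriptive statement from MODS XML to RDF.
    # The dcterms:title predicate and the URI are assumptions for the example.
    from rdflib import Graph, Literal, URIRef
    from rdflib.namespace import DCTERMS

    g = Graph()
    item = URIRef("https://example.edu/media/42")  # hypothetical object URI

    # MODS: <titleInfo><title>Lecture, 1964</title></titleInfo>
    # becomes a statement about the object itself:
    g.add((item, DCTERMS.title, Literal("Lecture, 1964")))

    print(g.serialize(format="turtle"))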

Editorial: Reflecting on the success and risks to the Code4Lib Journal

Issue 36 | 2017-04-20

Peter E. Murray

At the Code4Lib 2017 conference, I gave a short lightning talk about the Code4Lib Journal and in the process realized that we will soon be closing out the 10th year of "foster[ing] community and share[ing] information among those interested in the intersection of libraries, technology, and the future." That quote comes from the Code4Lib Journal […]

Autoload: a pipeline for expanding the holdings of an Institutional Repository enabled by ResourceSync

Issue 36 | 2017-04-20

James Powell, Martin Klein and Herbert Van de Sompel

Providing local access to locally produced content is a primary goal of the Institutional Repository (IR). Guidelines, requirements, and workflows are among the ways in which institutions attempt to ensure this content is deposited and preserved, but some content is always missed. At Los Alamos National Laboratory, the library implemented a service called LANL Research Online (LARO) to provide public access to a collection of publicly shareable LANL researcher publications authored between 2006 and 2016. LARO exposed the fact that we have full text for only about 10% of eligible publications for this time period, despite a review and release requirement that ought to have resulted in a much higher deposition rate. This finding motivated a new effort to locate and add more full-text content to LARO. Autoload attempts to locate and harvest items that were not deposited locally, but for which archivable copies exist. Here we describe the Autoload pipeline prototype and how it aggregates and utilizes Web services including Crossref, SHERPA/RoMEO, and oaDOI as it attempts to retrieve archivable copies of resources. Autoload employs a bootstrapping mechanism based on ResourceSync, a NISO standard for resource replication and synchronization. We implemented support for ResourceSync atop the LARO Solr index, which exposes metadata contained in the local IR. This allowed us to utilize ResourceSync without modifying our IR. We close with a brief discussion of other uses we envision for our ResourceSync-Solr implementation, and describe how a new effort called Signposting can replace cumbersome screen scraping with a robust autodiscovery path to content which leverages Web protocols.
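The heart of such a pipeline, asking whether an open, archivable copy exists for a DOI, can be sketched as a single call to the oaDOI v2 API. The endpoint and response fields below follow oaDOI's documented v2 format at the time; treat both as assumptions rather than a tested client:

    # Sketch: ask oaDOI for an open, archivable copy of a DOI.
    # Endpoint and response fields are assumptions based on the v2 docs.
    import requests

    def find_oa_copy(doi, email="you@example.org"):
        resp = requests.get(f"https://api.oadoi.org/v2/{doi}",
                            params={"email": email})
        resp.raise_for_status()
        best = resp.json().get("best_oa_location") or {}
        return best.get("url_for_pdf") or best.get("url")

    print(find_oa_copy("10.1000/example.doi"))  # hypothetical DOI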

Outside The Box: Building a Digital Asset Management Ecosystem for Preservation and Access

Issue 36 | 2017-04-20

Andrew Weidner, Sean Watkins, Bethany Scott, Drew Krewer, Anne Washington, Matthew Richardson

The University of Houston (UH) Libraries made an institutional commitment in late 2015 to migrate the data for its digitized cultural heritage collections to open source systems for preservation and access: Hydra-in-a-Box, Archivematica, and ArchivesSpace. This article describes the work that the UH Libraries implementation team has completed to date, including open source tools for streamlining digital curation workflows, minting and resolving identifiers, and managing SKOS vocabularies. These systems, workflows, and tools, collectively known as the Bayou City Digital Asset Management System (BCDAMS), represent a novel effort to solve common issues in the digital curation lifecycle and may serve as a model for other institutions seeking to implement flexible and comprehensive systems for digital preservation and access.

An Interactive Map for Showcasing Repository Impacts

Issue 36 | 2017-04-20

Hui Zhang and Camden Lopez

Digital repository managers rely on usage metrics such as the number of downloads to demonstrate the research visibility and impact of their repositories. Increasingly, they find that current tools such as spreadsheets and charts are ineffective for revealing important elements of usage, including reader locations, and for reaching target audiences. This article describes the design and development of a readership map that provides an interactive, near-real-time visualization of actual visits to an institutional repository using data from Google Analytics. The readership map exhibits the global impacts of a repository by displaying the city of every view or download together with the title of the scholarship being read and a hyperlink to its page in the repository. We will discuss project motivation and development issues such as authentication with the Google API, metadata integration, performance tuning, and data privacy.
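A map like this hinges on polling Google Analytics for recent activity by city. A minimal sketch against the Real Time Reporting API (v3, the API generation current when the article appeared); the view ID and the service-account key file are placeholders:

    # Poll the Google Analytics Real Time Reporting API (v3) for pageviews by
    # city. The view ID and key file are placeholders.
    from google.oauth2 import service_account
    from googleapiclient.discovery import build

    creds = service_account.Credentials.from_service_account_file(
        "ga-key.json",
        scopes=["https://www.googleapis.com/auth/analytics.readonly"])
    analytics = build("analytics", "v3", credentials=creds)

    result = analytics.data().realtime().get(
        ids="ga:12345678",                # placeholder view ID
        metrics="rt:pageviews",
        dimensions="rt:city,rt:pagePath",
    ).execute()

    # Each row lists the dimensions followed by the metric value.
    for city, path, views in result.get("rows", []):
        print(city, path, views)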

Overly Honest Data Repository Development

Issue 34 | 2016-10-25

Colleen Fallaw, Elise Dunham, Elizabeth Wickes, Dena Strong, Ayla Stein, Qian Zhang, Kyle Rimkus, Bill Ingram, Heidi J. Imker

After a year of development, the library at the University of Illinois at Urbana-Champaign has launched a repository, called the Illinois Data Bank (https://databank.illinois.edu/), to provide Illinois researchers with a free, self-serve publishing platform that centralizes, preserves, and provides persistent and reliable access to Illinois research data. This article presents a holistic view of development by discussing our overarching technical, policy, and interface strategies. By openly presenting our design decisions, the rationales behind those decisions, and associated challenges this paper aims to contribute to the library community’s work to develop repository services that meet growing data preservation and sharing needs.

OSS4EVA: Using Open-Source Tools to Fulfill Digital Preservation Requirements

Issue 34 | 2016-10-25

Janet Carleton, Heidi Dowding, Marty Gengenbach, Blake Graham, Sam Meister, Jessica Moran, Shira Peltzman, Julie Seifert, and Dorothy Waugh

This paper builds on the findings of a workshop held at the 2015 International Conference on Digital Preservation (iPRES) entitled “Using Open-Source Tools to Fulfill Digital Preservation Requirements” (OSS4PRES hereafter). This day-long workshop brought together participants from across the library and archives community, including practitioners, proprietary vendors, and representatives from open-source projects. The resulting conversations were surprisingly revealing: while OSS’s significance within the preservation landscape was made clear, participants noted that there are a number of roadblocks that discourage or altogether prevent its use in many organizations. Overcoming these challenges will be necessary to further widespread, sustainable OSS adoption within the digital preservation community. This article will mine the rich discussions that took place at OSS4PRES to (1) summarize the workshop’s key themes and major points of debate, (2) provide a comprehensive analysis of the opportunities, gaps, and challenges that using OSS entails at a philosophical, institutional, and individual level, and (3) offer a tangible set of recommendations for future work designed to broaden community engagement and enhance the sustainability of open source initiatives, drawing on both participants’ experience as well as additional research.

Metadata Analytics, Visualization, and Optimization: Experiments in statistical analysis of the Digital Public Library of America (DPLA)

Issue 33 | 2016-07-19

Corey A. Harper

This paper presents the concepts of metadata assessment and “quantification” and describes preliminary research results applying these concepts to metadata from the Digital Public Library of America (DPLA). The introductory sections provide a technical outline of data pre-processing, and propose visualization techniques that can help us understand metadata characteristics in a given context. Example visualizations are shown and discussed, leading up to the use of “metadata fingerprints” — D3 Star Plots — to summarize metadata characteristics across multiple fields for arbitrary groupings of resources. Fingerprints are shown comparing metadata characteristics for different DPLA “Hubs” and also for used versus not used resources based on Google Analytics “pageview” counts. The closing sections introduce the concept of metadata optimization and explore the use of machine learning techniques to optimize metadata in the context of large-scale metadata aggregators like DPLA. Various statistical models are used to predict whether a particular DPLA item is used based only on its metadata. The article concludes with a discussion of the broad potential for machine learning and data science in libraries, academic institutions, and cultural heritage.
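The optimization experiments reduce to a familiar supervised-learning setup: featurize each item's metadata, label it used or not used from pageview counts, and fit a classifier. A generic sketch with scikit-learn; the feature set and data are invented placeholders, not the paper's models:

    # Generic sketch of predicting item usage from metadata characteristics.
    # Features and data are invented placeholders.
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import classification_report

    # One row per item, e.g. [title length, n subjects, has description, n fields]
    X = [[42, 5, 1, 11], [8, 0, 0, 4], [27, 3, 1, 9], [5, 1, 0, 5]] * 25
    y = [1, 0, 1, 0] * 25  # 1 = item had pageviews, 0 = it did not

    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
    model = LogisticRegression().fit(X_train, y_train)
    print(classification_report(y_test, model.predict(X_test)))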

An Open-Source Strategy for Documenting Events: The Case Study of the 42nd Canadian Federal Election on Twitter

Issue 32 | 2016-04-25

Nick Ruest and Ian Milligan

This article examines the tools, approaches, collaboration, and findings of the Web Archives for Historical Research Group around the capture and analysis of about 4 million tweets during the 2015 Canadian Federal Election. We hope that national libraries and other heritage institutions will find our model useful as they consider how to capture, preserve, and analyze ongoing events using Twitter.

While Twitter is not a representative sample of broader society – Pew Research Center’s study of US users shows that it skews young, college-educated, and affluent (above $50,000 household income) – Twitter still represents an exponential increase in the amount of information generated, retained, and preserved from 'everyday' people. Therefore, when historians study the 2015 federal election, Twitter will be a prime source.

On August 3, 2015, the team initiated both a Search API and Stream API collection with twarc, a tool developed by Ed Summers, using the hashtag #elxn42. The hashtag referred to the election being Canada's 42nd general federal election (hence 'election 42' or elxn42). Data collection ceased on November 5, 2015, the day after Justin Trudeau was sworn in as the 42nd Prime Minister of Canada. We collected for a total of 102 days, 13 hours and 50 minutes.

To analyze the data set, we took advantage of a number of command line tools, utilities that are available within twarc, twarc-report, and jq. In accordance with the Twitter Developer Agreement & Policy, and after ethical deliberations discussed below, we made the tweet IDs and other derivative data available in a data repository. This allows other people to use our dataset, cite our dataset, and enhance their own research projects by drawing on #elxn42 tweets.

Our analytics included:

  • breaking tweet text down by day to track change over time;
  • client analysis, allowing us to see how the prevalence of mobile devices shaped interactions with the medium;
  • URL analysis, comparing both to Archive-It collections and the Wayback Availability API to add to our understanding of crawl completeness;
  • and image analysis, using an archive of extracted images.

Our article introduces our collecting work, ethical considerations, the analysis we have done, and provides a framework for other collecting institutions to do similar work with our off-the-shelf open-source tools. We conclude by ruminating about connecting Twitter archiving with a broader web archiving strategy.
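The first two analyses listed above (tweets per day and client breakdown) were done with jq-style command-line tools; the same counts in plain Python over line-oriented tweet JSON look like this, with the file name as a placeholder:

    # Count tweets per day and per client from line-oriented tweet JSON,
    # mirroring the jq/twarc-report analyses described above.
    import json
    import re
    from collections import Counter
    from datetime import datetime

    per_day, per_client = Counter(), Counter()

    with open("elxn42-tweets.jsonl") as f:  # placeholder file name
        for line in f:
            tweet = json.loads(line)
            # Twitter v1.1 created_at format: "Tue Oct 19 22:01:13 +0000 2015"
            day = datetime.strptime(tweet["created_at"],
                                    "%a %b %d %H:%M:%S %z %Y").date()
            per_day[day] += 1
            # "source" is an HTML anchor; strip the tags to get the client name
            per_client[re.sub(r"<[^>]+>", "", tweet["source"])] += 1

    for client, n in per_client.most_common(10):
        print(n, client)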

Editorial Introduction: New Year Resolutions

Issue 31 | 2016-01-28

Terry Reese

While New Year’s Day came and went with very little fanfare at my house (well, if you don’t count our Star Wars marathon), I think I’d be remiss if I didn’t take the time to mark the passing of the new year with a look ahead to the future. And I think it is fitting, then, […]

Data Munging Tools in Preparation for RDF: Catmandu and LODRefine

Issue 30 | 2015-10-15

Christina Harlow

Data munging, or the work of remediating, enhancing and transforming library datasets for new or improved uses, has become more important and staff-inclusive in many library technology discussions and projects. Many times we know how we want our data to look, as well as how we want our data to act in discovery interfaces or when exposed, but we are uncertain how to make the data we have into the data we want. This article introduces and compares two library data munging tools that can help: LODRefine (OpenRefine with the DERI RDF Extension) and Catmandu.

The strengths and best practices of each tool are discussed in the context of metadata munging use cases for an institution’s metadata migration workflow. There is a focus on Linked Open Data modeling and transformation applications of each tool, in particular how metadataists, catalogers, and programmers can create metadata quality reports, enhance existing data with LOD sets, and transform that data to an RDF model. Integration of these tools with other systems and projects, the use of domain-specific transformation languages, and the expansion of vocabulary reconciliation services are also mentioned.

Barriers to Initiation of Open Source Software Projects in Libraries

Issue 29 | 2015-07-15

Curtis Thacker and Charles Knutson

Libraries share a number of core values with the Open Source Software (OSS) movement, suggesting there should be a natural tendency toward library participation in OSS projects. However, Dale Askey’s 2008 Code4Lib column entitled “We Love Open Source Software. No, You Can’t Have Our Code,” claims that while libraries are strong proponents of OSS, they are unlikely to actually contribute to OSS projects. He identifies, but does not empirically substantiate, six barriers that he believes contribute to this apparent inconsistency. In this study we empirically investigate not only Askey’s central claim but also the six barriers he proposes. In contrast to Askey’s assertion, we find that initiation of and contribution to OSS projects are, in fact, common practices in libraries. However, we also find that these practices are far from ubiquitous; as Askey suggests, many libraries do have opportunities to initiate OSS projects, but choose not to do so. Further, we find support for only four of Askey’s six OSS barriers. Thus, our results confirm many, but not all, of Askey’s assertions.

Feminism and the Future of Library Discovery

Issue 28 | 2015-04-15

Bess Sadler and Chris Bourg

This paper discusses the various ways in which the practices of libraries and librarians influence the diversity (or lack thereof) of scholarship and information access. We examine some of the cultural biases inherent in both library classification systems and newer forms of information access like Google search algorithms, and propose ways of recognizing bias and applying feminist principles in the design of information services for scholars, particularly as libraries re-invent themselves to grapple with digital collections.

Training the Next Generation of Open Source Developers: A Case Study of OSU Libraries & Press’ Technology Training Program

Issue 27 | 2015-01-21

Evviva Weinraub Lajoie, Trey Terrell and Mike Eaton

The Emerging Technologies & Services department at Oregon State University Libraries & Press has implemented a training program that teaches its technology student employees how and why they should engage in Open Source community development. This article outlines what the department has done to implement this program, discusses the benefits seen as a result of these changes, and describes what the department viewed as necessary to build and promote a culture of engagement in open communities.

Opening the Door: A First Look at the OCLC WorldCat Metadata API

Issue 25 | 2014-07-21

Terry Reese

Libraries have long relied on OCLC’s WorldCat database as a way to cooperatively share bibliographic data and declare library holdings to support interlibrary loan services. As curator, OCLC has traditionally mediated all interactions with the WorldCat database through their various cataloging clients to control access to the information. As more and more libraries look for new ways to interact with their data and streamline metadata operations and workflows, these clients have become bottlenecks and an inhibitor of library innovation. To address some of these concerns, in early 2013 OCLC announced the release of a set of application programming interfaces (APIs) supporting read and write access to the WorldCat database. These APIs offer libraries their first opportunity to develop new services and workflows that directly interact with the WorldCat database, and provide opportunities for catalogers to begin redefining how they work with OCLC and their data.
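For a flavor of what read access looks like, a single-record fetch can be sketched as an authorized GET against the bib read endpoint. The URL pattern follows OCLC's early Metadata API documentation, and the required WSKey HMAC signing step is stubbed out, so treat the whole request shape as an assumption rather than a working client:

    # Sketch of a read request to the WorldCat Metadata API for one bib record.
    # URL pattern per OCLC's early docs; the WSKey HMAC signature is stubbed.
    import requests

    def build_auth_header(url):
        # Placeholder: the real API requires an HMAC-signed WSKey
        # Authorization header; the signing algorithm is elided here.
        return "http://www.worldcat.org/wskey/v2/hmac/ PLACEHOLDER"

    oclc_number = "318852137"  # placeholder record number
    url = f"https://worldcat.org/bib/data/{oclc_number}"
    headers = {
        "Accept": 'application/atom+xml;content="application/vnd.oclc.marc21+xml"',
        "Authorization": build_auth_header(url),
    }
    resp = requests.get(url, headers=headers)
    print(resp.status_code)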

A Metadata Schema for Geospatial Resource Discovery Use Cases

Issue 25 | 2014-07-21

Darren Hardy and Kim Durante

We introduce a metadata schema that focuses on GIS discovery use cases for patrons in a research library setting. Text search, faceted refinement, and spatial search and relevancy are among GeoBlacklight’s primary use cases for federated geospatial holdings. The schema supports a variety of GIS data types and enables contextual, collection-oriented discovery applications as well as traditional portal applications. One key limitation of GIS resource discovery is the general lack of normative metadata practices, which has led to a proliferation of metadata schemas and duplicate records. The ISO 19115/19139 and FGDC standards specify metadata formats, but are intricate, lengthy, and not focused on discovery. Moreover, they require sophisticated authoring environments and cataloging expertise. Geographic metadata standards target preservation and quality measure use cases, but they do not provide for simple inter-institutional sharing of metadata for discovery use cases. To this end, our schema reuses elements from Dublin Core and GeoRSS to leverage their normative semantics, community best practices, open-source software implementations, and extensive examples already deployed in discovery contexts such as web search and mapping. Finally, we discuss a Solr implementation of the schema using a “geo” extension to MODS.
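Concretely, a discovery record in this style is a flat Solr document mixing Dublin Core fields with a geometry. The field names below follow published GeoBlacklight examples, and the values and core URL are placeholders:

    # Index one GeoBlacklight-style discovery record into Solr.
    # Field names follow published GeoBlacklight examples; values and the
    # core URL are placeholders.
    import requests

    doc = {
        "uuid": "urn:example:sf-boundaries-2010",
        "dc_identifier_s": "urn:example:sf-boundaries-2010",
        "dc_title_s": "San Francisco Neighborhood Boundaries (2010)",
        "dc_rights_s": "Public",
        "dct_provenance_s": "Example University",
        "dc_format_s": "Shapefile",
        "layer_geom_type_s": "Polygon",
        "solr_geom": "ENVELOPE(-122.51, -122.36, 37.81, 37.71)",  # W, E, N, S
    }

    resp = requests.post(
        "http://localhost:8983/solr/geoblacklight/update?commit=true",
        json=[doc])
    resp.raise_for_status()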

Using Open Source Tools to Create a Mobile Optimized, Crowdsourced Translation Tool

Issue 24 | 2014-04-16

Evviva Weinraub Lajoie, Trey Terrell, Susan McEvoy, Eva Kaplan, Ariel Schwartz, and Esther Ajambo

In late 2012, OSU Libraries and Press partnered with Maria’s Libraries, an NGO in Rural Kenya, to provide users the ability to crowdsource translations of folk tales and existing children’s books into a variety of African languages, sub-languages, and dialects. Together, these two organizations have been creating a mobile optimized platform using open source libraries such as Wink Toolkit (a library which provides mobile-friendly interaction from a website) and Globalize3 to allow for multiple translations of database entries in a Ruby on Rails application. Research regarding successes of similar tools has been utilized in providing a consistent user interface. The OSU Libraries & Press team delivered a proof-of-concept tool that has the opportunity to promote technology exploration, improve early childhood literacy, change the way we approach foreign language learning, and to provide opportunities for cost-effective, multi-language publishing.

Breaking Up With CONTENTdm: Why and How One Institution Took the Leap to Open Source

Issue 20 | 2013-04-17

Heather Gilbert and Tyler Mobley

In 2011, the College of Charleston found itself at a digital asset management crossroads. The Lowcountry Digital Library (LCDL), a multi-institution cooperative founded less than three years prior, was rapidly approaching its CONTENTdm license limit of 50,000 items. Understaffed and without a programmer, the College assessed its options and ultimately began construction on a Fedora Commons repository with a Blacklight discovery layer, an installation of Rutgers’ OpenWMS for Fedora ingestion, and a Drupal front end as a replacement for its existing digital library. The system has been built and over 20,000 items have been migrated. The project was a success, but many hard lessons were learned.

Building a Library App Portfolio with Redis and Django

Issue 19 | 2013-01-15

Jeremy Nelson

The Tutt Library at Colorado College is developing a portfolio of library applications for use by patrons and library staff. Developed under an iterative and incremental agile model, these single-use HTML5 applications target multiple devices while using Bootstrap and Django to deliver fast and responsive interfaces to underlying FRBR datastores running on Redis, an advanced NoSQL database server. Two types are delineated: applications for access and discovery, which are available to everyone; and productivity applications, which are primarily for library staff to administer and manage the FRBR-RDA records. The access portfolio includes Book Search, Article Search, Call Number, and Library Hours applications. The productivity side includes an Orders App and a MARC Batch application for ingesting MARC records as FRBR entities using RDA Core attributes. When a critical threshold is reached, the Tutt Library intends to replace its legacy ILS with this library application portfolio.
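Of the access applications listed, Library Hours is the simplest to picture: a Django view reading a hash out of Redis. A sketch under assumed key and field names, not Tutt Library's actual schema:

    # Sketch of a "Library Hours" style endpoint: a Django view that reads
    # today's hours from a Redis hash. Key and field names are assumptions.
    import datetime

    import redis
    from django.http import JsonResponse

    r = redis.StrictRedis(host="localhost", port=6379, decode_responses=True)

    def library_hours(request):
        today = datetime.date.today().strftime("%A").lower()  # e.g. "monday"
        hours = r.hget("library:hours", today) or "closed"
        return JsonResponse({"day": today, "hours": hours})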

The Martha Berry Digital Archive Project: A Case Study in Experimental pEDagogy

Issue 17 | 2012-06-01

Stephanie A. Schlitz and Garrick S. Bodine

Using the Martha Berry Digital Archive Project as an exploratory case study, this article discusses experimental methods in digital archive development, describing how and why a small project team is leveraging undergraduate student support, a participatory (crowdsourced) editing model, and free and open source software to digitize and disseminate a large documentary collection.

Conference Reports: Code4Lib 2011

Issue 13 | 2011-04-11

Bohyun Kim and Elias Tzoc

Conference reports from the 6th Code4Lib Conference, held in Bloomington, IN, from February 7 to 10, 2011. The Code4Lib conference is a collective volunteer effort of the Code4Lib community of library technologists. Included are two brief reports on the conference from some recipients of conference scholarships.

Conference Report: Code4Lib 2010

Issue 9 | 2010-03-22

Birong Ho, Banurekha Lakshminarayanan, and Vanessa Meireles

Conference reports from the 5th Code4Lib Conference, held in Asheville, NC, from February 22 to 25, 2010. The Code4Lib conference is a collective volunteer effort of the Code4Lib community of library technologists. Included are three brief reports on the conference from the recipients of conference scholarships.

ISSN 1940-5758