Issue 47, 2020-02-17

Editorial

Péter Király

on diversity and mentoring

Scraping BePress: Downloading Dissertations for Preservation

Stephen Zweibel

This article describes our process for developing a script to automate the downloading of documents and secondary materials from our library’s BePress repository. Our objective was to collect the full archive of dissertations and associated files from our repository onto a local disk for potential future applications and to build out a preservation system.

Unlike at some institutions, our students submit directly into BePress, so we did not have a separate repository of the files; and the backup of BePress content that we had access to was not in an ideal format (for example, it included “withdrawn” items and did not effectively isolate electronic theses and dissertations). Perhaps more importantly, the fact that BePress was not SWORD-enabled and lacked a robust API or batch export option meant that we needed to develop a data-scraping approach that would allow us to both extract files and have metadata fields populated. Using a CSV of all of our records provided by BePress, we wrote a script to loop through those records and download their documents, placing them in directories according to a local schema. We dealt with over 3,000 records and about three times that many items, and now have an established process for retrieving our files from BePress. Details of our experience and code are included.
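The loop over the CSV export described above can be sketched roughly as follows. This is only an illustration, not the author's script: the column names (`title`, `author`, `publication_date`, `fulltext_url`, `state`), the directory schema (year, then author and title), and the file name `dissertation.pdf` are all assumptions made for the example, and the real BePress export may look quite different.

```python
import csv
import io
import re
import urllib.request
from pathlib import Path


def slugify(text):
    """Reduce a string to a filesystem-safe directory name."""
    return re.sub(r"[^A-Za-z0-9]+", "_", text).strip("_")


def plan_downloads(csv_text, root):
    """Return (url, destination) pairs for every record that is not withdrawn.

    Column names and the year/author_title layout are hypothetical.
    """
    plan = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        if row["state"].strip().lower() == "withdrawn":
            continue  # skip withdrawn items, as the abstract suggests
        year = row["publication_date"][:4]
        folder = Path(root) / year / slugify(f"{row['author']}_{row['title']}")
        plan.append((row["fulltext_url"], folder / "dissertation.pdf"))
    return plan


def download_all(plan):
    """Fetch each planned file, creating directories as needed."""
    for url, destination in plan:
        destination.parent.mkdir(parents=True, exist_ok=True)
        urllib.request.urlretrieve(url, destination)


# Made-up sample rows standing in for the CSV supplied by BePress.
SAMPLE = """title,author,publication_date,fulltext_url,state
A Study of Things,Jane Doe,2016-06-01,https://example.org/1.pdf,published
Old Draft,John Roe,2015-02-01,https://example.org/2.pdf,withdrawn
"""

plan = plan_downloads(SAMPLE, "etds")
for url, destination in plan:
    print(url, "->", destination)
```

Separating the planning step from `download_all` makes it possible to inspect the target paths before any network request is made.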

Persistent identifiers for heritage objects

Lukas Koster

Persistent identifiers (PIDs) are essential for accessing and referring to library, archive and museum (LAM) collection objects in a sustainable and unambiguous way, both internally and externally. Heritage institutions need a universal policy for the use of PIDs in order to have an efficient digital infrastructure at their disposal and to achieve optimal interoperability, leading to open data, open collections and efficient resource management.

Here the discussion is limited to PIDs that institutions can assign to objects they own or administer themselves. PIDs for people, subjects etc. can be used by heritage institutions, but are generally managed by other parties.

The first part of this article consists of a general theoretical description of persistent identifiers. First of all, I discuss the questions of what persistent identifiers are and what they are not, and what is needed to administer and use them. The most commonly used existing PID systems are briefly characterized. Then I discuss the types of objects PIDs can be assigned to. This section concludes with an overview of the requirements that apply if PIDs are also to be used for linked data.

The second part examines current infrastructural practices, and existing PID systems and their advantages and shortcomings. Based on these practical issues and the pros and cons of existing PID systems, a list of requirements for PID systems is presented, which is then used to address a number of practical considerations. This section concludes with a number of recommendations.

Dimensions & VOSViewer Bibliometrics in the Reference Interview

Brett Williams

The VOSviewer software provides easy access to bibliometric mapping using data from Dimensions, Scopus and Web of Science. The properly formatted and structured citation data, and the ease with which it can be exported, open up new avenues for use during citation searches and reference interviews. This paper details specific techniques for using advanced searches in Dimensions, exporting the citation data, and drawing insights from the maps produced in VOSviewer. These search techniques and data export practices are fast and accurate enough to build into reference interviews for graduate students, faculty, and post-PhD researchers. The search results derived from them are accurate and allow a more comprehensive view of citation networks embedded in ordinary complex boolean searches.

Automating Authority Control Processes

Stacey Wolf

Authority control is an important part of cataloging since it helps provide consistent access to names, titles, subjects, and genre/forms. There are a variety of methods for providing authority control, ranging from manual, time-consuming processes to automated processes. However, automated processes often seem out of reach for small libraries, since they can mean paying a pricey vendor or relying on an expert cataloger. This paper introduces ideas on how to handle authority control using a variety of tools, both paid and free. The author describes how their library handles authority control; compares vendors and programs that can be used to provide varying levels of authority control; and demonstrates authority control using MarcEdit.

Managing Electronic Resources Without Buying into the Library Vendor Singularity

James Fournie

Over the past decade, the library automation market has faced continuing consolidation. Many vendors in this space have pushed towards monolithic and expensive Library Services Platforms. Other vendors have taken “walled garden” approaches which force vendor lock-in due to lack of interoperability. For these reasons and others, many libraries have turned to open-source Integrated Library Systems (ILSes) such as Koha and Evergreen. These systems offer more flexibility and interoperability options, but tend to be developed with a focus on public libraries and legacy print resource functionality. They lack tools important to academic libraries such as knowledge bases, link resolvers, and electronic resource management systems (ERMs). Several open-source ERM options exist, including CORAL and FOLIO. This article analyzes the current state of these and other options for libraries considering supplementing their open-source ILS either alone, hosted or in a consortial environment.

Shiny Fabric: A Lightweight, Open-source Tool for Visualizing and Reporting Library Relationships

Atalay Kutlay, Cal Murgu

This article details the development and functionalities of an open-source application called Fabric. Fabric is a simple-to-use application that renders library data in the form of network graphs (sociograms). Fabric is built in R using the Shiny package and is meant to offer an easy-to-use alternative to other software, such as Gephi and UCInet. In addition to being user-friendly, Fabric can run locally as well as on a hosted server. This article discusses the development process and functionality of Fabric, use cases at the New College of Florida’s Jane Bancroft Cook Library, as well as plans for future development.

Analyzing and Normalizing Type Metadata for a Large Aggregated Digital Library

Joshua D. Lynch, Jessica Gibson, and Myung-Ja Han

The Illinois Digital Heritage Hub (IDHH) gathers and enhances metadata from contributing institutions around the state of Illinois and provides this metadata to the Digital Public Library of America (DPLA) for greater access. The IDHH helps contributors shape their metadata to the standards recommended and required by the DPLA in part by analyzing and enhancing aggregated metadata. In late 2018, the IDHH undertook a project to address a particularly problematic field, Type metadata. This paper walks through the project, detailing the process of gathering and analyzing metadata using the DPLA API and OpenRefine, data remediation through XSL transformations in conjunction with local improvements by contributing institutions, and the DPLA ingestion system’s quality controls.
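The kind of Type remediation described above can be illustrated with a small mapping from free-form values to a controlled list. The project itself used XSL transformations; the Python below is a swapped-in sketch of the same idea, and both the mapping table and the target terms are assumptions chosen for the example rather than the IDHH's actual rules.

```python
# Hypothetical mapping from raw Type strings (lowercased) to controlled terms.
TYPE_MAP = {
    "image": "image",
    "still image": "image",
    "stillimage": "image",
    "photograph": "image",
    "text": "text",
    "moving image": "moving image",
    "movingimage": "moving image",
    "video": "moving image",
    "sound": "sound",
    "audio": "sound",
    "physical object": "physical object",
}


def normalize_record(values):
    """Split, clean, and map a record's Type values.

    Returns (normalized, unmatched): the controlled terms found, in order
    and without duplicates, plus the raw values that need manual review.
    """
    normalized, unmatched = [], []
    for value in values:
        for part in value.split(";"):  # some records pack several types in one field
            key = " ".join(part.strip().lower().split())
            if not key:
                continue
            term = TYPE_MAP.get(key)
            if term is None:
                unmatched.append(part.strip())
            elif term not in normalized:
                normalized.append(term)
    return normalized, unmatched


normalized, unmatched = normalize_record(["StillImage; Photograph", "Text", "postcard"])
print(normalized, unmatched)
```

Values that fall through the mapping are collected rather than dropped, so that they can be sent back to the contributing institution for local improvement.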

Scaling IIIF Image Tiling in the Cloud

Yinlin Chen, Soumik Ghosh, Tingting Jiang, James Tuttle

The International Archive of Women in Architecture, established at Virginia Tech in 1985, collects books, biographical information, and published materials from nearly 40 countries, organized into around 450 collections. In order to provide public access to these collections, we built an application using the IIIF APIs to pre-generate image tiles and manifests which are statically served in the AWS cloud. We established an automatic image processing pipeline using a suite of AWS services to implement microservices in Lambda and Docker. By doing so, we reduced the processing time for terabytes of images from weeks to days.

In this article, we describe our serverless architecture design and implementations, elaborate on the technical solution for integrating multiple AWS services with other techniques into the application, and describe our streamlined and scalable approach to handling extremely large image datasets. Finally, we show the significantly improved performance compared to traditional processing architectures along with a cost evaluation.
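To make the idea of pre-generating tiles concrete, the sketch below lists the IIIF Image API request paths (region/size/rotation/quality.format) that a static tile set would need to cover for one image. It is not the authors' pipeline: the function name, the default tile size, and the scale factors are assumptions, sizes are written in the `w,h` form, and the sketch does not substitute the canonical `full` region keyword for tiles that span the whole image.

```python
def tile_paths(width, height, tile_size=512, scale_factors=(1, 2, 4)):
    """List region/size/rotation/quality paths for every tile of an image.

    width and height are the full image dimensions in pixels.
    """
    paths = []
    for scale in scale_factors:
        step = tile_size * scale  # full-resolution pixels covered by one tile
        for x in range(0, width, step):
            for y in range(0, height, step):
                # Clip edge tiles to the image boundary.
                region_w = min(step, width - x)
                region_h = min(step, height - y)
                # Output size is the region scaled down, rounded up.
                out_w = -(-region_w // scale)
                out_h = -(-region_h // scale)
                paths.append(
                    f"{x},{y},{region_w},{region_h}/{out_w},{out_h}/0/default.jpg"
                )
    return paths


paths = tile_paths(1000, 600, tile_size=512, scale_factors=(1, 2))
for path in paths:
    print(path)
```

Because every path is known in advance, each tile can be rendered once and stored as a static file under the image's identifier, which is what allows the tiles to be served without a running image server.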

Where Do We Go From Here: A Review of Technology Solutions for Providing Access to Digital Collections

Kelli Babcock, Sunny Lee, Jana Rajakumar, Andy Wagner

The University of Toronto Libraries is currently reviewing technology to support its Collections U of T service. Collections U of T provides search and browse access to 375 digital collections (and over 203,000 digital objects) at the University of Toronto Libraries. Digital objects typically include special collections material from the university as well as faculty digital collections, all with unique metadata requirements. The service is currently supported by IIIF-enabled Islandora, with one Fedora back end and multiple Drupal sites per parent collection (see attached image). Like many institutions making use of Islandora, UTL is now confronted with Drupal 7 end of life and has begun to investigate a migration path forward. This article will summarise the Collections U of T functional requirements and lessons learned from our current technology stack. It will go on to outline our research to date into alternative solutions. The article will review both emerging micro-service solutions and out-of-the-box platforms, to provide an overview of the digital collection technology landscape in 2019. Note that our research is focused on reviewing technology solutions for providing access to digital collections, as preservation services are offered through other services at the University of Toronto Libraries.