Converting the Bliss Bibliographic Classification to SKOS RDF using Python RDFLib

Harry Bartholomew

This article discusses the project undertaken by the library of Queens’ College, Cambridge, to migrate its classification system to RDF using Python, applying the SKOS data model. Queens’ is one of 19 UK libraries that use the Bliss Bibliographic Classification, most of them small libraries of the colleges at the Universities of Oxford and Cambridge. Though a flexible and universal faceted classification system, Bliss faces challenges due to its unfinished state, which has led many Bliss libraries to evolve divergent, in-house adaptations of the system to fill in its gaps. For most of the official, published parts of Bliss, a uniquely formatted source code used to generate a typeset version is available online. This project focused on converting this source code into a SKOS RDF linked-data format using Python: first by parsing the source code, then by using RDFLib to write the concepts, notation, relationships, and notes in RDF. This article suggests that the RDF version has the potential to prevent further divergence and unify the various Bliss adaptations, and reflects on the limitations of SKOS when applied to complex, faceted systems.
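
To give a flavour of the RDFLib step, the following minimal sketch emits one classmark as a skos:Concept; the namespace, notation, labels, and note are placeholder values, not drawn from the published Bliss schedules or from the project itself.

```python
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF, SKOS

BLISS = Namespace("https://example.org/bliss/")  # placeholder namespace, not the project's

g = Graph()
g.bind("skos", SKOS)
g.bind("bliss", BLISS)

# One entry parsed from the Bliss source code (hypothetical values)
concept = BLISS["AY"]
g.add((concept, RDF.type, SKOS.Concept))
g.add((concept, SKOS.notation, Literal("AY")))
g.add((concept, SKOS.prefLabel, Literal("Science and technology", lang="en")))
g.add((concept, SKOS.broader, BLISS["A"]))
g.add((concept, SKOS.scopeNote, Literal("Class here general works.", lang="en")))

print(g.serialize(format="turtle"))
```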

Simplifying Subject Indexing: A Python-Powered Approach in KBR, the National Library of Belgium

Hannes Lowagie and Julie Van Woensel

This paper details the National Library of Belgium’s (KBR) exploration of automating the subject indexing process for its extensive collection using Python scripts. The initial exploration involved creating a reference dataset and automating the classification process using MARCXML files. The focus is on demonstrating the practicality, adaptability, and user-friendliness of the Python-based solution. The authors introduce their approach, which emphasizes the role of semantically significant words in subject determination. The paper outlines the Python workflow, from creating the reference dataset to generating enriched bibliographic records. Criteria for an optimal workflow, including ease of creation and maintenance of the dataset, transparency, and correctness of suggestions, are discussed. The paper highlights the promising results of the Python-powered approach, showcasing two specific scripts: one that creates the reference dataset and one that automates subject indexing. The flexibility and user-friendliness of the Python solution are emphasized, making it a compelling choice for libraries seeking efficient and maintainable solutions for subject indexing projects.
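
As a rough, much-simplified sketch of this kind of workflow (the field choices, stopword list, and file name below are illustrative assumptions, not KBR’s actual scripts), a reference dataset can map semantically significant title words to existing subject headings and then suggest headings for new records:

```python
from collections import defaultdict
from pymarc import parse_xml_to_array

STOPWORDS = {"the", "a", "of", "and", "de", "het", "la", "le"}  # illustrative only

# Build a reference dataset: semantically significant title words -> subject headings
reference = defaultdict(set)
for record in parse_xml_to_array("catalogued_records.xml"):  # hypothetical file name
    title = record.title or ""          # property in pymarc 5.x; a method in pymarc 4.x
    subjects = [field.value() for field in record.get_fields("650")]
    for word in title.lower().split():
        if word not in STOPWORDS and len(word) > 3:
            reference[word].update(subjects)

# Suggest headings for a new, unindexed title by counting matches in the reference dataset
def suggest(title, limit=3):
    counts = defaultdict(int)
    for word in title.lower().split():
        for heading in reference.get(word, ()):
            counts[heading] += 1
    return sorted(counts, key=counts.get, reverse=True)[:limit]

print(suggest("Geschiedenis van de Belgische spoorwegen"))
```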

Extra Editorial: On the Release of Patron Data in Issue 58 of Code4Lib Journal

Code4Lib Editorial Board

We, the editors of the Code4Lib Journal, sincerely apologize for the recent incident in which Personally Identifiable Information (PII) was released through the publication of an article in issue 58.

Enhancing Serials Holdings Data: A Pymarc-Powered Clean-Up Project

Minyoung Chung and Phani Chaitanya Pendyala

Following the recent transition from Inmagic to Ex Libris Alma, the Technical Services department at the University of Southern California (USC) in Los Angeles undertook a post-migration cleanup initiative. This article introduces methodologies aimed at improving irregular summary holdings data within serials records using Pymarc, regular expressions, and the Alma API in MarcEdit. The challenge identified was the confinement of serials’ holdings information exclusively to the 866 MARC tag for textual holdings.

To address this challenge, Pymarc and regular expressions were leveraged to parse and identify various patterns within the holdings data, offering a nuanced understanding of the intricacies embedded in the 866 field. Subsequently, the script generated a new 853 field for captions and patterns, along with multiple instances of the 863 field for coded enumeration and chronology data, derived from the existing data in the 866 field.

The final step involved utilizing the Alma API via MarcEdit, streamlining the restructuring of holdings data and updating nearly 5,000 records for serials. This article illustrates the application of Pymarc for both data analysis and creation, emphasizing its utility in generating data in the MARC format. Furthermore, it posits the potential application of Pymarc to enhance data within library and archive contexts.
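
As a minimal sketch of the kind of 866-to-853/863 transformation described above (assuming pymarc 5.x; the regular expression, indicators, and caption values are deliberately simplified illustrations, not the project’s actual rules):

```python
import re
from pymarc import Field, Subfield  # Subfield is available in pymarc 5.x

# Deliberately simplified pattern for textual holdings such as "v.12 (2003) - v.20 (2011)"
PATTERN = re.compile(r"v\.(?P<vol>\d+)\s*\((?P<year>\d{4})\)")

def expand_holdings(record):
    """Derive illustrative 853/863 fields from an existing 866 textual holdings field."""
    for f866 in record.get_fields("866"):
        text = f866.value()
        # One caption/pattern field: volumes as enumeration, years as chronology
        record.add_ordered_field(Field(
            tag="853", indicators=["0", "0"],
            subfields=[Subfield("8", "1"), Subfield("a", "v."), Subfield("i", "(year)")]))
        # One 863 field per volume/year pair found in the 866 text
        for seq, match in enumerate(PATTERN.finditer(text), start=1):
            record.add_ordered_field(Field(
                tag="863", indicators=["4", "0"],
                subfields=[Subfield("8", f"1.{seq}"),
                           Subfield("a", match.group("vol")),
                           Subfield("i", match.group("year"))]))
    return record
```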

Developing a Multi-Portal Digital Library System: A Case Study of the new University of Florida Digital Collections

Todd Digby, Cliff Richmond, Dustin Durden, and Julio Munoz

The University of Florida (UF) launched the UF Digital Collections in 2006. Since then, the system has grown to over 18 million pages of content. The locally developed digital library system consisted of an integrated public frontend interface and a production backend. As with other monoliths, adapting and changing the system became increasingly difficult as time went on and the size of the collections grew. As production processes changed, the system was modified to make improvements on the backend, but the public interface became dated and increasingly less mobile-responsive. A decision was made to develop a new system, starting with decoupling the public interface from the production system. This article examines our experience in rearchitecting our digital library system and deploying our new multi-portal, public-facing system. After an environmental scan of digital library technologies, we decided not to use a current open-source digital library system. A relatively new programming team, unfamiliar with the library ecosystem, allowed us to rethink many of our existing assumptions and provided new insights and development opportunities. Using technologies that include Python, APIs, ElasticSearch, ReactJS, PostgreSQL, and more has allowed us to build a flexible and adaptable system, and to hire developers in the future who may not have experience building digital library systems.

Using Event Notifications, Solid and Orchestration for Decentralizing and Decoupling Scholarly Communication

Patrick Hochstenbach, Ruben Verborgh and Herbert Van de Sompel

The paper presents the case for a decentralized and decoupled architecture for scholarly communication. It introduces the Event Notifications protocol as applied in projects such as the international COAR Notify Initiative and the NDE-Usable program of memory institutions in the Netherlands. The paper provides an implementation of Event Notifications using a Solid server. The processing of notifications can be automated using an orchestration service called Koreografeye. Koreografeye is applied to a citation extraction and relay experiment to show how all these tools fit together.
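
As an illustration of the underlying mechanics (the inbox URL and payload below are hypothetical; actual notification patterns are defined by the Event Notifications specification and COAR Notify profiles), sending a notification amounts to an HTTP POST of an Activity Streams 2.0 document to a Linked Data Notifications inbox, which a Solid server can expose:

```python
import uuid
import requests

INBOX = "https://repository.example.org/inbox/"  # hypothetical LDN inbox URL

# A hypothetical Activity Streams 2.0 notification announcing an extracted citation
notification = {
    "@context": "https://www.w3.org/ns/activitystreams",
    "id": f"urn:uuid:{uuid.uuid4()}",
    "type": "Announce",
    "actor": {"id": "https://service.example.org/", "type": "Service"},
    "object": {"id": "https://service.example.org/citations/123", "type": "Document"},
    "target": {"id": "https://repository.example.org/", "type": "Service"},
}

response = requests.post(
    INBOX,
    json=notification,
    headers={"Content-Type": "application/ld+json"},
)
response.raise_for_status()
print("Inbox accepted the notification:", response.status_code)
```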

Editorial: Big code, little code, open code, old code

Péter Király

Paraphrasing the title of Christine L. Borgman’s inaugural lecture in Göttingen some years ago, “Big data, little data, open data,” I could say that the current issue of Code4Lib is about big code, little code, open code, old code. The good side of coding is that effective contributions can be made with different levels and types of background knowledge. This issue proves to us that even small modifications, or sharing knowledge about the command-line usage of a tool, can be very useful to the user community. Let’s see what we have!

ChronoNLP: Exploration and Analysis of Chronological Textual Corpora

Erin Wolfe

This article introduces ChronoNLP, a free and open-source web application designed to enable the application of Natural Language Processing (NLP) techniques to textual datasets with a time-based component. This interactive Python platform allows users to filter, search, explore, and visualize this data, letting the temporal aspect play a central role in the analysis. ChronoNLP makes use of several powerful NLP libraries to facilitate various text analysis techniques, including topic modeling, term/TF-IDF frequency evaluation, automated keyword extraction, named entity recognition, and other tasks, through a graphical interface without the need for coding or technical knowledge. By highlighting the temporal aspect of specific types of corpora, ChronoNLP provides access to methods of parsing and visualizing the data in a user-friendly format to help uncover patterns and trends in text-based materials.

An introduction to using metrics to assess the health and sustainability of library open source software projects

Jenn Colt

In the LYRASIS 2021 Open Source Software Report: Understanding the Landscape of Open Source Software Support in American Libraries (Rosen & Grogg, 2021), responding libraries indicated the sustainability of open source software (OSS) projects to be an important concern when making decisions about adoption. However, the methods libraries might use to gather information about sustainability are not discussed. Metrics defined by the Linux Foundation’s CHAOSS project (https://chaoss.community/) are designed to measure the health and sustainability of OSS communities and may be useful for libraries that are making decisions about adopting particular OSS applications. I demonstrate the use of cauldron.io as one method to gather and visualize the data for these metrics, and discuss the benefits and limitations of using them for decision-making.

Strategies for Digital Library Migration

Justin Littman, Mike Giarlo, Peter Mangiafico, Laura Wrubel, Naomi Dushay, Aaron Collier, Arcadia Falcone

A migration of the datastore and data model for Stanford Digital Repository’s digital object metadata was recently completed. This paper describes the motivations for this work and some of the strategies used to accomplish the migration. Strategies include: adopting a validatable data model, abstracting the datastore behind an API, separating concerns, testing metadata mappings against real digital objects, using reports to understand the data, templating unit tests, performing a rolling migration, and incorporating the migration into ongoing project work. These strategies may be useful to other repository or digital library application migrations.

Building CyprusArk: A Web Content Management System for Small Museums’ Collections Online

Avgoustinos Avgousti, Georgios Papaioannou, and Feliz Ribeiro Gouveia

This article introduces CyprusArk, a work-in-progress solution to the problems that small museums in Cyprus have in providing online access to their collections. CyprusArk is an open-source web content management system for small museums’ online collections. It was developed as part of Avgousti’s Ph.D. thesis and is based on qualitative data collected from six small museums in Cyprus.

Supporting open access, integrating distributed research platforms, and building a research information management platform

Daniel M. Coughlin, Cynthia Hudson Vitale

Academic libraries are often called upon by their university communities to collect, manage, and curate information about the research activity produced at their campuses. Proper research information management (RIM) can be leveraged for multiple institutional contexts, including networking, reporting activities, building faculty profiles, and supporting the reputation management of the institution.

In the last ten to fifteen years, the adoption and implementation of RIM infrastructure has become widespread throughout the academic world. Approaches to developing and implementing this infrastructure have varied, from commercial and open-source options to locally developed instances. Each piece of infrastructure has its own functionality, features, and metadata sources. No single application or data source meets all the needs of these varying pieces of research information; instead, many of these systems together create an ecosystem that provides for the diverse set of needs and contexts.

This paper examines the systems at Pennsylvania State University that contribute to our RIM ecosystem: how and why we developed another piece of supporting infrastructure for our Open Access policy, and the successes and challenges of this work.

How We Built a Spatial Subject Classification Based on Wikidata

Adrian Pohl

From the fall of 2017 to the beginning of 2020, a project was carried out to upgrade spatial subject indexing in the North Rhine-Westphalian Bibliography (NWBib) from uncontrolled strings to controlled values. For this purpose, a spatial classification with around 4,500 entries was created from Wikidata and published as a SKOS (Simple Knowledge Organization System) vocabulary. The article gives an overview of the initial problem and outlines the different implementation steps.
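
As a simplified illustration of harvesting spatial entities from Wikidata (the query below is an assumption for demonstration purposes, not the project’s actual query), the public SPARQL endpoint can be queried with SPARQLWrapper:

```python
from SPARQLWrapper import SPARQLWrapper, JSON

# Simplified example: list some municipalities located in North Rhine-Westphalia (Q1198).
# The real NWBib classification was built from considerably more elaborate queries.
endpoint = SPARQLWrapper("https://query.wikidata.org/sparql")
endpoint.setQuery("""
SELECT ?place ?placeLabel WHERE {
  ?place wdt:P31 wd:Q262166 ;      # instance of: municipality of Germany
         wdt:P131+ wd:Q1198 .      # located in the administrative entity: NRW
  SERVICE wikibase:label { bd:serviceParam wikibase:language "de,en". }
}
LIMIT 20
""")
endpoint.setReturnFormat(JSON)

for row in endpoint.query().convert()["results"]["bindings"]:
    print(row["place"]["value"], row["placeLabel"]["value"])
```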

Advancing ARKs in the Historical Ontology Space

Mat Kelly, Christopher B. Rauch, Jane Greenberg, Sam Grabus, Joan Boone, John Kunze and Peter M. Logan

This paper presents the application of Archival Resource Keys (ARKs) for persistent identification and resolution of concepts in historical ontologies. Our use case is the 1910 Library of Congress Subject Headings (LCSH), which we have converted to the Simple Knowledge Organization System (SKOS) format and will use for representing a corpus of historical Encyclopedia Britannica articles. We report on the steps taken to assign ARKs in support of the Nineteenth-Century Knowledge Project, where we are using the HIVE vocabulary tool to automatically assign subject metadata from both the 1910 LCSH and the contemporary LCSH faceted, topical vocabulary to enable the study of the evolution of knowledge.

Open Source Tools for Scaling Data Curation at QDR

Nicholas Weber, Sebastian Karcher, and James Myers

This paper describes the development of services and tools for scaling data curation services at the Qualitative Data Repository (QDR). Through a set of open-source tools, semi-automated workflows, and extensions to the Dataverse platform, our team has built services for curators to efficiently and effectively publish collections of qualitatively derived data. The contributions we seek to make in this paper are as follows:

1. We describe ‘human-in-the-loop’ curation and the tools that facilitate this model at QDR;

2. We provide an in-depth discussion of the design and implementation of these tools, including applications specific to the Dataverse software repository, as well as standalone archiving tools written in R; and

3. We highlight the role of providing a service layer for data discovery and accessibility of qualitative data.

Keywords: Data curation; open-source; qualitative data

Data reuse in linked data projects: a comparison of Alma and Share-VDE BIBFRAME networks

Jim Hahn

This article presents an analysis of the enrichment, transformation, and clustering used by the vendors Casalini Libri/@CULT and Ex Libris for their respective conversions of MARC data to BIBFRAME. The analysis considers the source MARC21 data used by Alma and then the enrichment and transformation of MARC21 data from Share-VDE partner libraries. The clustering of linked data into a BIBFRAME network is a key outcome of data reuse in linked data projects and is fundamental to improving the discovery of library collections on the web and within search systems.

Experimenting with a Machine Generated Annotations Pipeline

Joshua Gomez, Kristian Allen, Mark Matney, Tinuola Awopetu, and Sharon Shafer

The UCLA Library reorganized its software developers into focused subteams, with one, the Labs Team, dedicated to conducting experiments. In this article we describe our first software development experiment, in which we attempted to improve our digital library’s search results with metadata from cloud-based image tagging services. We explore the findings and discuss the lessons learned from running this first experiment.

IIIF by the Numbers

Joshua Gomez, Kevin S. Clarke, Anthony Vuong

The UCLA Library began work on building a suite of services to support IIIF for their digital collections. The services perform image transformations and delivery as well as manifest generation and delivery. The team was unsure about whether they should use local or cloud-based infrastructure for these services, so they conducted some experiments on multiple infrastructure configurations and tested them in scenarios with varying dimensions.

Editorial

Péter Király

on diversity and mentoring

Persistent identifiers for heritage objects

Lukas Koster

Persistent identifiers (PIDs) are essential for accessing and referring to library, archive and museum (LAM) collection objects in a sustainable and unambiguous way, both internally and externally. Heritage institutions need a universal policy for the use of PIDs in order to have an efficient digital infrastructure at their disposal and to achieve optimal interoperability, leading to open data, open collections and efficient resource management.

Here the discussion is limited to PIDs that institutions can assign to objects they own or administer themselves. PIDs for people, subjects, etc. can be used by heritage institutions, but are generally managed by other parties.

The first part of this article consists of a general theoretical description of persistent identifiers. First of all, I discuss what persistent identifiers are and what they are not, and what is needed to administer and use them. The most commonly used existing PID systems are briefly characterized. Then I discuss the types of objects PIDs can be assigned to. This section concludes with an overview of the requirements that apply if PIDs are also to be used for linked data.

The second part examines current infrastructural practices and existing PID systems, along with their advantages and shortcomings. Based on these practical issues and the pros and cons of existing PID systems, a list of requirements for PID systems is presented, which is used to address a number of practical considerations. This section concludes with a number of recommendations.

Factor Analysis For Librarians in R

Michael Carlozzi

This paper offers a primer in the programming language R for library staff members who wish to perform factor analysis. It presents a brief overview of factor analysis and walks users through the process, from downloading the software (RStudio) to performing the actual analysis. It also discusses limitations and cautions against improper use.

Design reusable SHACL shapes and implement a linked data validation pipeline

Emidio Stani

In July 2017, the W3C published SHACL as the standard for validating RDF. Since then, data modellers have been able to provide validation services based on SHACL shapes together with their models; however, there are considerations to be taken into account when creating them. This paper aims to list such considerations and shows an example of a validation pipeline that addresses them.
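
A minimal validation step in such a pipeline might look like the following sketch, using the pySHACL library; the file names are placeholders.

```python
from pyshacl import validate
from rdflib import Graph

data_graph = Graph().parse("dataset.ttl", format="turtle")    # data to validate (placeholder)
shapes_graph = Graph().parse("shapes.ttl", format="turtle")   # reusable SHACL shapes (placeholder)

conforms, results_graph, results_text = validate(
    data_graph,
    shacl_graph=shapes_graph,
    inference="rdfs",   # expand the data with RDFS entailment before validating
)

print("Conforms:", conforms)
if not conforms:
    print(results_text)  # human-readable validation report
```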

Visualizing Fedora-managed TEI and MEI documents within Islandora

Raffaele Viglianti, Marcus Emmanuel Barnes, Natkeeran Ledchumykanthan, Kirsta Stapelfeldt

The Early Modern Songscapes (EMS) project [1] represents a development partnership between the University of Toronto Scarborough’s Digital Scholarship Unit (DSU), the University of Maryland, and the University of South Carolina. Developers, librarians and faculty from these institutions have collaborated on an intermedia online platform designed to support the scholarly investigation of early modern English song. The first iteration of the platform, launched at the Early Modern Songscapes Conference, held February 8-9, 2019 at the University of Toronto’s Centre for Reformation and Renaissance Studies, serves Fedora-held Text Encoding Initiative (TEI) and Music Encoding Initiative (MEI) documents through a JavaScript viewer capable of being embedded within the Islandora digital asset management framework. The viewer presents versions of a song’s musical notation and textual underlay, followed by the entire song text.

This article reviews the status of this technology, and the process of developing an XML framework for TEI and MEI editions that would serve the requirements of all stakeholder technologies. Beyond the applicability of this technology in other digital scholarship contexts, the approach may serve others seeking methods for integrating technologies into Islandora or working across institutional development environments.

Improving the discoverability and web impact of open repositories: techniques and evaluation

George Macgregor

In this contribution we experiment with a suite of repository adjustments and improvements performed on Strathprints, the institutional repository of the University of Strathclyde, Glasgow, powered by EPrints 3.3.13. These adjustments were designed to support improved repository web visibility and user engagement, thereby improving usage. Although the experiments were performed on EPrints, it is thought that most of the adopted improvements are equally applicable to any other repository platform. Following preliminary results reported elsewhere, and using Strathprints as a case study, this paper outlines the approaches implemented, reports on comparative search traffic data and usage metrics, and delivers conclusions on the efficacy of the techniques implemented. The evaluation provides persuasive evidence that specific enhancements to technical aspects of a repository can result in significant improvements to repository visibility, resulting in a greater web impact and consequent increases in content usage. COUNTER usage grew by 33%, and traffic to Strathprints from Google and Google Scholar was found to increase by 63% and 99% respectively. Other insights from the evaluation are also explored. The results are likely to positively inform the work of repository practitioners and open scientists.

Automated Playlist Continuation with Apache PredictionIO

Jim Hahn

The Minrva project team, a software development research group based at the University of Illinois Library, developed a data-focused recommender system to participate in the creative track of the 2018 ACM RecSys Challenge, which focused on music recommendation. We describe here the large-scale data processing the Minrva team researched and developed for foundational reconciliation of the Million Playlist Dataset using external authority data on the web (e.g. VIAF, Wikidata). The secondary focus of the research was evaluating and adapting the processing tools that support data reconciliation. This paper reports on the playlist enrichment process, indexing, and the subsequent recommendation model developed for the music recommendation challenge.

Analyzing EZproxy SPU Logs Using Python Data Analysis Tools

Brighid M. Gonzales

Even with the assortment of free and ready-made tools for analyzing EZproxy log files, it can be difficult to get useful, meaningful data from them. Using the Python programming language with its collection of modules created specifically for data analysis can help with this task, and ultimately result in better and more useful data customized to the needs of the library using it. This article describes how Our Lady of the Lake University used Python to analyze its EZproxy log files to get more meaningful data, including a walk-through of the code needed to accomplish this task.
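
As a hedged sketch of the general approach (assuming an NCSA-style EZproxy SPU LogFormat and a placeholder file name, both of which will differ per installation), pandas can turn the log into a DataFrame for analysis:

```python
import re
import pandas as pd

# Assumes an NCSA-style LogFormat (%h %l %u %t "%r" %s %b); adjust the regular
# expression to match your own EZproxy SPU configuration.
LINE = re.compile(
    r'(?P<ip>\S+) \S+ (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<bytes>\S+)')

rows = []
with open("spu.log", encoding="utf-8") as handle:  # placeholder file name
    for line in handle:
        match = LINE.match(line)
        if match:
            rows.append(match.groupdict())

df = pd.DataFrame(rows)
df["time"] = pd.to_datetime(df["time"], format="%d/%b/%Y:%H:%M:%S %z")
df["url"] = df["request"].str.split().str[1]

# Example questions: requests per hour and the most requested URLs
print(df.set_index("time").resample("h").size().head())
print(df["url"].value_counts().head())
```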

WMS, APIs and LibGuides: Building a Better Database A-Z List

Veronica Ramshaw, Véronique Lecat and Thomas Hodge

At the American University of Sharjah, our Databases by title and by subject pages are the 3rd and 4th most visited pages on our website. When we changed our ILS from Millennium to OCLC’s WorldShare Management Services (WMS), our previous automations, which kept our Databases A-Z pages up to date, were no longer usable and needed to be replaced. Using APIs, a Perl script, and LibGuides’ database management interface, we developed a workflow that pulls database metadata from WMS Collection Manager into a clean, public-facing A-Z list. This article discusses the details of how this process works, the advantages it provides, and the continuing issues we are facing.

Using R and the Tidyverse to Generate Library Usage Reports

Andy Meyer

Gathering, analyzing, and communicating library usage data provides a foundation for thoughtful assessment. However, the amount of time and expertise required creates a barrier to actually using this data. By using the statistical programming language R and the tools and approach of the Tidyverse, the process of gathering, analyzing, and communicating data can be automated in ways that reduce the amount of time and energy required. At the same time, this approach increases staff capacity for other data science projects and creates a shareable model and framework for other libraries. This article focuses on electronic resource usage reports – especially COUNTER DB1 reports – but this approach could be extended to other data sources and needs.

The FachRef-Assistant: Personalised, subject specific, and transparent stock management

Eike T. Spielberg, Frank Lützenkirchen

We present in this paper a personalized web application for the weeding of printed resources: the FachRef-Assistant. It offers an extensive range of tools for evidence-based stock management, based on the thorough analysis of usage statistics. Special attention is paid to the criteria of individualization, transparency of the parameters used, and generic functions. Currently, it is designed to work with the Aleph system from Ex Libris, but efforts were made to keep the application as generic as possible. For example, all procedures specific to the local library system have been collected in one Java package. The inclusion of library-specific properties such as collections and systematics has also been designed to be highly generic, by mapping the individual entries onto an in-memory database. Hence, simple adaptation of the package and the mappings would render the FachRef-Assistant compatible with other library systems.

The personalization of the application allows for the inclusion of subject-specific usage properties as well as variations between different collections within one subject area. The parameter sets used to analyse the stock and to prepare weeding and purchase proposal lists are included in the output XML files to facilitate a high degree of transparency, objectivity, and reproducibility.

OpeNumisma: A Software Platform Managing Numismatic Collections with A Particular Focus On Reflectance Transformation Imaging

Avgoustinos Avgousti, Andriana Nikolaidou, Ropertos Georgiou

This paper describes OpeNumisma, a reusable web-based platform focused on digital numismatic collections. The platform provides an innovative merger of digital imaging and data management systems that offers great new opportunities for research and the dissemination of numismatic knowledge online. A unique feature of the platform is the application of Reflectance Transformation Imaging (RTI), a computational photographic method that offers tremendous possibilities for image analysis and numismatic research. This technique allows the user to observe minor details in the browser, unseen with the naked eye, simply by moving the computer mouse rather than handling the actual object. The first successful implementation of OpeNumisma has been the creation of a digital library for the medieval coins from the collection of the Bank of Cyprus Cultural Foundation.

ISSN 1940-5758