Issue 36, 2017-04-20

Editorial: Reflecting on the success and risks to the Code4Lib Journal

Peter E. Murray

At the Code4Lib 2017 conference, I gave a short lightning talk about the Code4Lib Journal and in the process realized that we will soon be closing out the 10th year of "foster[ing] community and share[ing] information among those interested in the intersection of libraries, technology, and the future." That quote comes from the Code4Lib Journal […]

Linked Data is People: Building a Knowledge Graph to Reshape the Library Staff Directory

Jason A. Clark and Scott W. H. Young

One of our greatest library resources is people. Most libraries have staff directory information published on the web, yet most of this data is trapped in local silos, PDFs, or unstructured HTML markup. With this in mind, the library informatics team at Montana State University (MSU) Library set a goal of remaking our people pages by connecting the local staff database to the Linked Open Data (LOD) cloud. In pursuing linked data integration for library staff profiles, we have realized two primary use cases: improving the search engine optimization (SEO) for people pages and creating network graph visualizations. In this article, we will focus on the code to build this library graph model as well as the linked data workflows and ontology expressions developed to support it. Existing linked data work has largely centered around machine-actionable data and improvements for bots or intelligent software agents. Our work demonstrates that connecting your staff directory to the LOD cloud can reveal relationships among people in dynamic ways, thereby raising staff visibility and bringing an increased level of understanding and collaboration potential for one of our primary assets: the people that make the library happen.

Recommendations for the application of Schema.org to aggregated Cultural Heritage metadata to increase relevance and visibility to search engines: the case of Europeana

Richard Wallis, Antoine Isaac, Valentine Charles, and Hugo Manguinhas

Europeana provides access to more than 54 million cultural heritage objects through its portal Europeana Collections. It is crucial for Europeana to be recognized by search engines as a trusted authoritative repository of cultural heritage objects. Indeed, even though its portal is the main entry point, most Europeana users come to it via search engines.

Europeana Collections is fuelled by metadata describing cultural objects, represented in the Europeana Data Model (EDM). This paper presents the research and consequent recommendations for publishing Europeana metadata using the Schema.org vocabulary and best practices. Schema.org html embedded metadata to be consumed by search engines to power rich services (such as Google Knowledge Graph). Schema.org is an open and widely adopted initiative (used by over 12 million domains) backed by Google, Bing, Yahoo!, and Yandex, for sharing metadata across the web It underpins the emergence of new web techniques, such as so called Semantic SEO.

Our research addressed the representation of the embedded metadata as part of the Europeana HTML pages and sitemaps so that the re-use of this data can be optimized.

The practical objective of our work is to produce a Schema.org representation of Europeana resources described in EDM, being the richest as possible and tailored to Europeana’s realities and user needs as well the search engines and their users.

Autoload: a pipeline for expanding the holdings of an Institutional Repository enabled by ResourceSync

James Powell, Martin Klein and Herbert Van de Sompel

Providing local access to locally produced content is a primary goal of the Institutional Repository (IR). Guidelines, requirements, and workflows are among the ways in which institutions attempt to ensure this content is deposited and preserved, but some content is always missed. At Los Alamos National Laboratory, the library implemented a service called LANL Research Online (LARO), to provide public access to a collection of publicly shareable LANL researcher publications authored between 2006 and 2016. LARO exposed the fact that we have full text for only about 10% of eligible publications for this time period, despite a review and release requirement that ought to have resulted in a much higher deposition rate. This discovery motivated a new effort to discover and add more full text content to LARO. Autoload attempts to locate and harvest items that were not deposited locally, but for which archivable copies exist. Here we describe the Autoload pipeline prototype and how it aggregates and utilizes Web services including Crossref, SHERPA/RoMEO, and oaDOI as it attempts to retrieve archivable copies of resources. Autoload employs a bootstrapping mechanism based on the ResourceSync standard, a NISO standard for resource replication and synchronization. We implemented support for ResourceSync atop the LARO Solr index, which exposes metadata contained in the local IR. This allowed us to utilize ResourceSync without modifying our IR. We close with a brief discussion of other uses we envision for our ResourceSync-Solr implementation, and describe how a new effort called Signposting can replace cumbersome screen scraping with a robust autodiscovery path to content which leverages Web protocols.

Outside The Box: Building a Digital Asset Management Ecosystem for Preservation and Access

Andrew Weidner, Sean Watkins, Bethany Scott, Drew Krewer, Anne Washington, Matthew Richardson

The University of Houston (UH) Libraries made an institutional commitment in late 2015 to migrate the data for its digitized cultural heritage collections to open source systems for preservation and access: Hydra-in-a-Box, Archivematica, and ArchivesSpace. This article describes the work that the UH Libraries implementation team has completed to date, including open source tools for streamlining digital curation workflows, minting and resolving identifiers, and managing SKOS vocabularies. These systems, workflows, and tools, collectively known as the Bayou City Digital Asset Management System (BCDAMS), represent a novel effort to solve common issues in the digital curation lifecycle and may serve as a model for other institutions seeking to implement flexible and comprehensive systems for digital preservation and access.

Medici 2: A Scalable Content Management System for Cultural Heritage Datasets

Constantinos Sophocleous, Luigi Marini, Ropertos Georgiou, Mohammed Elfarargy, Kenton McHenry

Digitizing large collections of Cultural Heritage (CH) resources and providing tools for their management, analysis and visualization is critical to CH research. A key element in achieving the above goal is to provide user-friendly software offering an abstract interface for interaction with a variety of digital content types. To address these needs, the Medici content management system is being developed in a collaborative effort between the National Center for Supercomputing Applications (NCSA) at the University of Illinois at Urbana-Champaign, Bibliotheca Alexandrina (BA) in Egypt, and the Cyprus Institute (CyI). The project is pursued in the framework of European Project “Linking Scientific Computing in Europe and Eastern Mediterranean 2” (LinkSCEEM2) and supported by work funded through the U.S. National Science Foundation (NSF), the U.S. National Archives and Records Administration (NARA), the U.S. National Institutes of Health (NIH), the U.S. National Endowment for the Humanities (NEH), the U.S. Office of Naval Research (ONR), the U.S. Environmental Protection Agency (EPA) as well as other private sector efforts.

Medici is a Web 2.0 environment integrating analysis tools for the auto-curation of un-curated digital data, allowing automatic processing of input (CH) datasets, and visualization of both data and collections. It offers a simple user interface for dataset preprocessing, previewing, automatic metadata extraction, user input of metadata and provenance support, storage, archiving and management, representation and reproduction. Building on previous experience (Medici 1), NCSA, and CyI are working towards the improvement of the technical, performance and functionality aspects of the system. The current version of Medici (Medici 2) is the result of these efforts. It is a scalable, flexible, robust distributed framework with wide data format support (including 3D models and Reflectance Transformation Imaging-RTI) and metadata functionality. We provide an overview of Medici 2’s current features supported by representative use cases as well as a discussion of future development directions

An Interactive Map for Showcasing Repository Impacts

Hui Zhang and Camden Lopez

Digital repository managers rely on usage metrics such as the number of downloads to demonstrate research visibility and impacts of the repositories. Increasingly, they find that current tools such as spreadsheets and charts are ineffective for revealing important elements of usage, including reader locations, and for attracting the targeted audiences. This article describes the design and development of a readership map that provides an interactive, near-real-time visualization of actual visits to an institutional repository using data from Google Analytics. The readership map exhibits the global impacts of a repository by displaying the city of every view or download together with the title of the scholarship being read and a hyperlink to its page in the repository. We will discuss project motivation and development issues such as authentication with Google API, metadata integration, performance tuning, and data privacy.