Issue 25, 2014-07-21

Editorial introduction: On libraries, code, support, inspiration, and collaboration

Dan Scott

Reflections on the occasion of the 25th issue of the Code4Lib Journal: sustaining a community for support, inspiration, and collaboration at the intersection of libraries and information technology.

Getting What We Paid for: a Script to Verify Full Access to E-Resources

Kristina M. Spurgin

Libraries regularly pay for packages of e-resources containing hundreds to thousands of individual titles. Ideally, library patrons could access the full content of all titles in such packages. In reality, library staff and patrons inevitably stumble across inaccessible titles, but no library has the resources to manually verify full access to every title, and basic URL checkers cannot determine whether full text is actually accessible. This article describes the E-Resource Access Checker—a script that automates the verification of full access. With the Access Checker, library staff can identify all inaccessible titles in a package and bring these problems to content providers’ attention to ensure we get what we pay for.
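The core idea—distinguishing "the URL resolves" from "the full text is accessible"—can be sketched in a few lines of Python. The denial phrases below are hypothetical examples, not the Access Checker's actual rules, which match provider-specific markers:

```python
import urllib.request

# Hypothetical markers suggesting a title is NOT fully accessible; a
# real checker matches provider-specific phrases in the returned HTML.
DENIAL_MARKERS = [
    "access denied",
    "purchase this article",
    "institutional login required",
]

def classify_page(html, markers=DENIAL_MARKERS):
    """Classify a fetched landing page as "ok" or "restricted"."""
    text = html.lower()
    return "restricted" if any(m in text for m in markers) else "ok"

def check_access(url, markers=DENIAL_MARKERS):
    """Fetch a title's page and flag likely access problems.

    Returns "ok", "restricted", or "error" when the page cannot be
    fetched at all (which a plain URL checker would also catch).
    """
    try:
        with urllib.request.urlopen(url, timeout=30) as resp:
            page = resp.read().decode("utf-8", errors="replace")
    except Exception:
        return "error"
    return classify_page(page, markers)
```

Run over a spreadsheet of package title URLs, even this naive triage separates broken links from paywalled content—the distinction a simple link checker cannot make.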

Opening the Door: A First Look at the OCLC WorldCat Metadata API

Terry Reese

Libraries have long relied on OCLC’s WorldCat database as a way to cooperatively share bibliographic data and declare library holdings to support interlibrary loan services. As curator, OCLC has traditionally mediated all interactions with the WorldCat database through their various cataloging clients to control access to the information. As more and more libraries look for new ways to interact with their data and streamline metadata operations and workflows, these clients have become bottlenecks that inhibit library innovation. To address some of these concerns, in early 2013 OCLC announced the release of a set of application programming interfaces (APIs) supporting read and write access to the WorldCat database. These APIs offer libraries their first opportunity to develop new services and workflows that directly interact with the WorldCat database, and provide opportunities for catalogers to begin redefining how they work with OCLC and their data.
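A read operation against the Metadata API is conceptually just an authenticated HTTP GET for one bib record. The sketch below only assembles the request; the endpoint path and Accept header are simplified from OCLC's documentation, and the WSKey request signing the real API requires is omitted entirely:

```python
# Simplified sketch of a WorldCat Metadata API bib read request.
# Assumption: the real service requires HMAC-signed WSKey
# authentication, which is deliberately left out here.

def bib_read_request(oclc_number, base="https://worldcat.org/bib/data"):
    """Return the URL and headers for fetching one bib record as MARCXML."""
    url = "{}/{}".format(base, oclc_number)
    headers = {
        # Approximate content negotiation for a MARC21 XML response.
        "Accept": 'application/atom+xml;content="application/vnd.oclc.marc21+xml"',
    }
    return url, headers
```

Write access follows the same pattern with PUT/POST and a MARCXML body, which is what makes direct, scriptable cataloging workflows possible for the first time.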

Docker: a Software as a Service, Operating System-Level Virtualization Framework

John Fink

Docker is a relatively new method of virtualization available natively for 64-bit Linux. Compared to more traditional virtualization techniques, Docker is lighter on system resources, offers a git-like system of commits and tags, and can be scaled from your laptop to the cloud.

A Metadata Schema for Geospatial Resource Discovery Use Cases

Darren Hardy and Kim Durante

We introduce a metadata schema that focuses on GIS discovery use cases for patrons in a research library setting. Text search, faceted refinement, and spatial search and relevancy are among GeoBlacklight’s primary use cases for federated geospatial holdings. The schema supports a variety of GIS data types and enables contextual, collection-oriented discovery applications as well as traditional portal applications. One key limitation of GIS resource discovery is the general lack of normative metadata practices, which has led to a proliferation of metadata schemas and duplicate records. The ISO 19115/19139 and FGDC standards specify metadata formats, but are intricate, lengthy, and not focused on discovery. Moreover, they require sophisticated authoring environments and cataloging expertise. Geographic metadata standards target preservation and quality measure use cases, but they do not provide for simple inter-institutional sharing of metadata for discovery use cases. To this end, our schema reuses elements from Dublin Core and GeoRSS to leverage their normative semantics, community best practices, open-source software implementations, and extensive examples already deployed in discovery contexts such as web search and mapping. Finally, we discuss a Solr implementation of the schema using a “geo” extension to MODS.
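The flavor of the schema—Dublin Core and GeoRSS element names reused as flat, Solr-friendly fields—can be shown with a small illustrative record. The field names and parsing below are assumptions in the spirit of the article, not the published schema:

```python
# An illustrative discovery record: Dublin Core / GeoRSS-derived
# elements flattened into Solr-style dynamic fields. Field names here
# are hypothetical approximations, not the schema's actual definitions.
record = {
    "dc_title_s": "Hydrography, San Francisco Bay Area, 2010",
    "dc_format_s": "Shapefile",
    "dct_spatial_sm": ["San Francisco Bay Area (Calif.)"],
    "georss_box_s": "37.2 -123.0 38.3 -121.7",  # south west north east
    "layer_geom_type_s": "Line",
}

def bbox(record):
    """Parse the GeoRSS-style bounding box into floats (S, W, N, E),
    the form a spatial-search backend needs for relevancy ranking."""
    return tuple(float(v) for v in record["georss_box_s"].split())
```

Because every field is a simple string or list, records like this can be shared between institutions and indexed directly, sidestepping the authoring overhead of ISO 19139 or FGDC.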

Ebooks without Vendors: Using Open Source Software to Create and Share Meaningful Ebook Collections

Matt Weaver

The Community Cookbook project began with a question: how could the library take the local cookbooks in its collection and create a recipe database? The final website is both a recipe website and a collection of ebook versions of local cookbooks. This article discusses the use of open source software at every stage of the project, demonstrating that an open source publishing model is possible for any library.

Within Limits: mass-digitization from scratch

Pieter De Praetere

The provincial library of West-Vlaanderen (Belgium) is digitizing a large part of its iconographic collection. For various technical and financial reasons, no specialist software was used. FastScan is a set of VBScript scripts developed by the author using off-the-shelf software that was either included in MS Windows (XP & 7) or already installed (ImageMagick, IrfanView, Little CMS, exiv2). This scripting package has greatly accelerated the digitization effort. The article shows which software was used, the problems that occurred, and how the tools were scripted together.

A Web Service for File-Level Access to Disk Images

Sunitha Misra, Christopher A. Lee and Kam Woods

Digital forensics tools have many potential applications in the curation of digital materials in libraries, archives and museums (LAMs). Open source digital forensics tools can help LAM professionals to extract digital contents from born-digital media and make more informed preservation decisions. Many of these tools have ways to display the metadata of the digital media, but few provide file-level access without having to mount the device or use complex command-line utilities. This paper describes a project to develop software that supports access to the contents of digital media without having to mount or download the entire image. The work examines two approaches to creating this tool: first, a graphical user interface running on a local machine; second, a web-based application running in a web browser. The project incorporates existing open source forensics tools and libraries, including The Sleuth Kit and libewf, along with the Flask web application framework and custom Python scripts to generate web pages supporting disk image browsing.
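The file-level interface amounts to two operations: browse a directory inside the image, and fetch one file's metadata. In the sketch below, a plain dict stands in for a disk image so the idea is visible without The Sleuth Kit or Flask; all names are hypothetical:

```python
# Stdlib-only sketch of file-level access to a disk image. The real
# project walks the image's filesystem with The Sleuth Kit / libewf and
# serves results through Flask; this fake file table just illustrates
# the interface a browse page would be built on.
FAKE_IMAGE = {
    "documents/report.doc": {"size": 24576, "deleted": False},
    "documents/draft.doc": {"size": 10240, "deleted": True},
}

def list_files(image, prefix=""):
    """Return paths under a directory prefix, as a browse view would."""
    return sorted(p for p in image if p.startswith(prefix))

def file_info(image, path):
    """Return metadata for one file, or None if it is not in the image."""
    return image.get(path)
```

Note the `deleted` flag: exposing recoverable deleted files is exactly the kind of information forensic tooling surfaces that a mounted filesystem would hide.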

Processing Government Data: ZIP Codes, Python, and OpenRefine

Frank Donnelly

While there is a vast amount of useful US government data on the web, some of it is in a raw state that is not readily accessible to the average user. Data librarians can improve accessibility and usability for their patrons by processing data to create subsets of local interest and by appending geographic identifiers to help users select and aggregate data. This case study illustrates how census geography crosswalks, Python, and OpenRefine were used to create spreadsheets of non-profit organizations in New York City from the IRS Tax-Exempt Organization Master File. This paper illustrates the utility of Python for data librarians and should be particularly useful for those who work with address-based data.
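The crosswalk step reduces to a join: look up each record's ZIP code and append the matching geography. The tiny mapping below is a made-up stand-in for a real crosswalk table, which must also handle ZIP codes that span multiple areas:

```python
# Hypothetical ZIP-to-geography crosswalk; a real one is derived from
# census geography files and is far larger.
ZIP_TO_BOROUGH = {"10001": "Manhattan", "11201": "Brooklyn"}

def append_geography(rows, crosswalk):
    """Append a geographic identifier to each record by ZIP code,
    flagging codes the crosswalk does not cover for manual review."""
    for row in rows:
        row["borough"] = crosswalk.get(row["zip"], "unknown")
        yield row
```

Once the identifier is attached, users can filter and aggregate the federal file by local geography—the step that turns a raw national extract into a locally useful dataset.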

Indexing Bibliographic Database Content Using MariaDB and Sphinx Search Server

Arie Nugraha

Fast retrieval of digital content has become mandatory for library and archive information systems. Many software applications have emerged to handle the indexing of digital content, from low-level ones such as Apache Lucene, to more RESTful and web-services-ready ones such as Apache Solr and Elasticsearch. Solr’s popularity among library software developers makes it the de facto standard for indexing digital content. For content (full-text content or bibliographic descriptions) already stored inside a relational DBMS such as MariaDB (a fork of MySQL) or PostgreSQL, Sphinx Search Server (Sphinx) is a suitable alternative. This article covers an introduction to using Sphinx with MariaDB databases to index database content, as well as some examples of Sphinx API usage.
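Sphinx's appeal for database-resident content is that indexing is declared directly against SQL. A minimal sphinx.conf sketch might look like the fragment below; the host, credentials, table layout, and paths are placeholders, not the article's actual configuration:

```
# Illustrative sphinx.conf fragment: a SQL data source pulls
# bibliographic rows out of MariaDB, and an index is built over them.
source bibliographic
{
    type      = mysql
    sql_host  = localhost
    sql_user  = library
    sql_pass  = secret
    sql_db    = catalog
    sql_query = SELECT id, title, abstract FROM biblio
}

index bibliographic_idx
{
    source = bibliographic
    path   = /var/lib/sphinx/bibliographic_idx
}
```

Because the indexer consumes the `sql_query` result set directly, no export or intermediate document format is needed—unlike feeding the same records into Solr or Elasticsearch.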

Solving Advanced Encoding Problems with FFMPEG

Josh Romphf

Previous articles in the Code4Lib Journal have covered the capabilities of FFMPEG in great detail. Building on those excellent introductions, this article tackles some of the common problems users might face, dissecting more complicated commands and suggesting possible uses.
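One perennial stumbling block is extracting a clip without re-encoding. The helper below only assembles an ffmpeg argument list (file names are placeholders, and this is one common recipe rather than anything from the article): placing `-ss` before `-i` seeks on the input, and `-c copy` copies the streams instead of transcoding them.

```python
# Build (but do not run) an ffmpeg command for lossless clip
# extraction via stream copy. Pass the result to subprocess.run().
def clip_command(src, start, duration, dest):
    return [
        "ffmpeg",
        "-ss", start,     # seek position (before -i: fast input seek)
        "-i", src,        # input file
        "-t", duration,   # length of the clip
        "-c", "copy",     # copy streams instead of re-encoding
        dest,
    ]
```

The trade-off to know: stream copy can only cut at keyframes, so the clip boundaries may shift slightly compared to a re-encode.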

HathiTrust Ingest of Locally Managed Content: A Case Study from the University of Illinois at Urbana-Champaign

Kyle R. Rimkus & Kirk M. Hess

In March 2013, the University of Illinois at Urbana-Champaign Library adopted a policy to more closely integrate the HathiTrust Digital Library into its own infrastructure for digital collections. Specifically, the Library decided that the HathiTrust Digital Library would serve as a trusted repository for many of the library’s digitized book collections, a strategy that favors relying on HathiTrust over locally managed access solutions whenever this is feasible. This article details the thinking behind this policy, as well as the challenges of its implementation, focusing primarily on technical solutions for “remediating” hundreds of thousands of image files to bring them in line with HathiTrust’s strict specifications for deposit. This involved implementing HTFeed, a Perl 5 application developed at the University of Michigan for packaging content for ingest into HathiTrust, and its many helper applications (JHOVE and ExifTool to detect metadata problems and repair missing image metadata, and Kakadu to create JPEG 2000 files), as well as a file format conversion process using ImageMagick. Today, Illinois has over 1,600 locally managed volumes queued for ingest, and has submitted over 2,300 publicly available titles to the HathiTrust Digital Library.
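The remediation triage described above—compare each image's technical metadata against the deposit spec and report what needs fixing—can be sketched in a few lines. The required values below are illustrative, not the actual HathiTrust specification:

```python
# Toy version of remediation triage: check image metadata (as JHOVE or
# ExifTool might report it) against target deposit requirements.
# The REQUIRED values are hypothetical, not HathiTrust's real spec.
REQUIRED = {"resolution_dpi": 400, "colorspace": "sRGB", "format": "TIFF"}

def remediation_needed(image_meta, required=REQUIRED):
    """Return the list of fields that do not meet the target spec."""
    return [k for k, v in required.items() if image_meta.get(k) != v]
```

Run across hundreds of thousands of files, a report like this tells staff which images can be packaged as-is and which must pass through the ImageMagick conversion step first.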