Issue 33, 2016-07-19

Editorial Introduction – Summer Reading List

Ron Peterson

New additions for your summer reading list!

Emflix – Gone Baby Gone

Netanel Ganin

Enthusiasm is no replacement for experience. This article describes a tool developed at the Emerson College Library by an eager but overzealous cataloger. Attempting to enhance media-discovery in a familiar and intuitive way, he created a browseable and searchable Netflix-style interface. Though it may have been an interesting idea, many of the crucial steps that are involved in this kind of high-concept work were neglected. This article will explore and explain why the tool ultimately has not been maintained or updated, and what should have been done differently to ensure its legacy and continued use.

Introduction to Text Mining with R for Information Professionals

Monica Maceli

The ‘tm: Text Mining Package’ in the open source statistical software R has made text analysis techniques easily accessible to both novice and expert practitioners, providing useful ways of analyzing and understanding large, unstructured datasets. Such an approach can yield many benefits to information professionals, particularly those involved in text-heavy research projects. This article will discuss the functionality and possibilities of text mining, as well as the basic setup necessary for novice R users to employ the RStudio integrated development environment (IDE). Common use cases, such as analyzing a corpus of text documents or spreadsheet text data, will be covered, as well as the text mining tools for calculating term frequency, term correlations, clustering, creating wordclouds, and plotting.
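
As a rough illustration of the term-frequency calculation mentioned in this abstract, here is a minimal sketch written in Python rather than R. It does not use the tm package or reproduce the article's code; the sample documents, the short stopword list, and the function name `term_frequencies` are all hypothetical.

```python
import re
from collections import Counter

# Hypothetical stopword list; a real analysis would use a fuller one.
STOPWORDS = {"the", "a", "of", "and", "to", "in", "is"}


def term_frequencies(documents):
    """Count how often each non-stopword term occurs across all documents."""
    counts = Counter()
    for text in documents:
        # Lowercase the text and keep runs of ASCII letters as tokens.
        tokens = re.findall(r"[a-z]+", text.lower())
        counts.update(t for t in tokens if t not in STOPWORDS)
    return counts


# Hypothetical two-document corpus.
docs = [
    "The library catalog is a discovery tool.",
    "Text mining of the catalog and the archive.",
]
freq = term_frequencies(docs)
print(freq.most_common(3))
```

A frequency table of this kind is the usual starting point for the other techniques the abstract lists, such as term correlations, clustering, and wordclouds.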

Data for Decision Making: Tracking Your Library’s Needs With TrackRef

Michael Carlozzi

Library services must adapt to changing patron needs. These adaptations should be data-driven. This paper reports on the use of TrackRef, an open source and free web program for managing reference statistics.

Are games a viable solution to crowdsourcing improvements to faulty OCR? – The Purposeful Gaming and BHL experience

Max J. Seidman; Dr. Mary Flanagan; Trish Rose-Sandler; Mike Lichtenberg

The Missouri Botanical Garden and partners from Dartmouth, Harvard, the New York Botanical Garden, and Cornell recently wrapped up a project funded by IMLS called Purposeful Gaming and BHL: engaging the public in improving and enhancing access to digital texts (http://biodivlib.wikispaces.com/Purposeful+Gaming). The goal of the project was to significantly improve access to digital texts by applying purposeful gaming to the data enhancement tasks needed for content in the Biodiversity Heritage Library (BHL). This article shares our approach to game design and to the use of algorithms for verifying the quality of player inputs, as well as the challenges we faced with transcription and marketing. We conclude by answering the question of whether games are a successful tool for analyzing and improving digital outputs from OCR, and whether we recommend their uptake by libraries and other cultural heritage institutions.

From Digital Commons to OCLC: A Tailored Approach for Harvesting and Transforming ETD Metadata into High-Quality Records

Marielle Veve

The library literature contains many examples of automated and semi-automated approaches to harvesting electronic theses and dissertations (ETD) metadata from institutional repositories (IR) into the Online Computer Library Center (OCLC). However, most of these approaches could not be implemented with the institutional repository software Digital Commons for various reasons, including proprietary schema incompatibilities and high-level programming expertise requirements that our institution did not want to pursue. Only one semi-automated approach found in the library literature met our requirements for implementation, and even though it catered to the particular needs of the DSpace IR, it could be adapted to other IR software with further customization.

This paper presents an extension of the semi-automated approach originally created by Deng and Reese, customized and adapted to address the particular needs of the Digital Commons community and updated to integrate the latest Resource Description & Access (RDA) content standards for ETDs. The advantages and disadvantages of this workflow are also discussed.

Checking the identity of entities by machine algorithms: the next step to the Hungarian National Namespace

Zsolt Bánki, Tibor Mészáros, Márton Németh, András Simon

The redundancy of entities coming from different sources caused problems during the building of the personal name authorities for the Petőfi Museum of Literature. It was a top priority to cleanse and unite classificatory records that have different data content but pertain to the same person, without losing any data. As a first step, in 2013, we found identities in approximately 80,000 name records and merged the data content of these records. In the second phase, a much more complicated algorithm had to be applied to reveal these identities. We cleansed the database by uniting approximately 36,000 records. The workflow for automatic detection of authority data tries to follow human intelligence. The database scripts normalize and examine about 20 kinds of data elements according to information about dates, localities, occupation, and name variations. Pairing the database authority records as potentially redundant elements produced a graph, which the curators of the museum manually condensed to a tree. With this, the limit of technological identification was reached. Further data cleansing requires human intelligence, which can be assisted by regular computerized monitoring based on the developed algorithm. As a result, the service containing about 620,000 authority name records will be an indispensable foundation for the establishment of the National Name Authorities. This article shows the work process of unification.
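
The normalize, pair, and group steps described in this abstract can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' algorithm: it compares only two data elements (a normalized name and a birth year) rather than the roughly 20 the abstract mentions, and the record layout, sample data, and function names are hypothetical. Candidate pairs are treated as graph edges, and connected components (found with a simple union-find) stand in for the groups a curator would then review.

```python
import unicodedata
from collections import defaultdict
from itertools import combinations


def normalize_name(name):
    """Lowercase, strip diacritics and punctuation, and sort the name parts."""
    decomposed = unicodedata.normalize("NFKD", name)
    no_marks = "".join(c for c in decomposed if not unicodedata.combining(c))
    cleaned = "".join(c if c.isalnum() else " " for c in no_marks.lower())
    return " ".join(sorted(cleaned.split()))


def candidate_pairs(records):
    """Pair records that share a normalized name and a birth year."""
    buckets = defaultdict(list)
    for rec_id, rec in records.items():
        buckets[(normalize_name(rec["name"]), rec["born"])].append(rec_id)
    pairs = []
    for ids in buckets.values():
        pairs.extend(combinations(sorted(ids), 2))
    return pairs


def group_duplicates(records):
    """Treat pairs as graph edges; return connected components of size > 1."""
    parent = {rec_id: rec_id for rec_id in records}

    def find(x):
        # Follow parent links to the root, shortening the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in candidate_pairs(records):
        parent[find(a)] = find(b)

    groups = defaultdict(set)
    for rec_id in records:
        groups[find(rec_id)].add(rec_id)
    return [g for g in groups.values() if len(g) > 1]


# Hypothetical sample records with name variants for the same person.
records = {
    1: {"name": "Petőfi Sándor", "born": 1823},
    2: {"name": "Sándor Petőfi", "born": 1823},
    3: {"name": "Petofi, Sandor", "born": 1823},
    4: {"name": "Arany János", "born": 1817},
}
print(group_duplicates(records))
```

In this sketch, automatic matching only proposes groups; deciding which record in a group is authoritative is left to human review, as the abstract describes.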

Metadata Analytics, Visualization, and Optimization: Experiments in statistical analysis of the Digital Public Library of America (DPLA)

Corey A. Harper

This paper presents the concepts of metadata assessment and “quantification” and describes preliminary research results applying these concepts to metadata from the Digital Public Library of America (DPLA). The introductory sections provide a technical outline of data pre-processing, and propose visualization techniques that can help us understand metadata characteristics in a given context. Example visualizations are shown and discussed, leading up to the use of “metadata fingerprints” — D3 Star Plots — to summarize metadata characteristics across multiple fields for arbitrary groupings of resources. Fingerprints are shown comparing metadata characteristics for different DPLA “Hubs” and also for used versus not used resources based on Google Analytics “pageview” counts. The closing sections introduce the concept of metadata optimization and explore the use of machine learning techniques to optimize metadata in the context of large-scale metadata aggregators like DPLA. Various statistical models are used to predict whether a particular DPLA item is used based only on its metadata. The article concludes with a discussion of the broad potential for machine learning and data science in libraries, academic institutions, and cultural heritage.