Issue 52, 2021-09-22

Core Concepts and Techniques for Library Metadata Analysis

Metadata analysis is a growing need in libraries of all types and sizes, as demonstrated in many recent job postings. Data migration, transformation, enhancement, and remediation all require strong metadata analysis skills. But there is no well-defined body of knowledge or competencies list for library metadata analysis, leaving library staff with analysis-related responsibilities largely on their own to learn how to do the work effectively. In this paper, two experienced metadata analysts will share what they see as core knowledge areas and problem solving techniques for successful library metadata analysis. The paper will also discuss suggested tools, though the emphasis is intentionally not to prescribe specific tools, software, or programming languages, but rather to help readers recognize tools that will meet their analysis needs. The goal of the paper is to help library staff and their managers develop a shared understanding of the skill sets required to meet their library’s metadata analysis needs. It will also be useful to individuals interested in pursuing a career in library metadata analysis and wondering how to enhance their existing knowledge and skills for success in analysis work.

by Stacie Traill and Martin Patrick


Metadata analysis, or the application of data manipulation and analysis tools and skills to library metadata, has become an increasingly important task in libraries. Library job postings and job descriptions show a growing need for metadata analysis skills. Metadata analysis, even though it is rarely labeled as such, is a necessary skill for staff in libraries of all types and sizes, and across a variety of technical functions, including cataloging and metadata, acquisitions, electronic resources management, collection management, and discovery, as well as traditionally information technology-heavy functions such as library systems management (Binici 2021; Gonzales 2019; NASIG 2019; Ratledge and Sproles 2017; Hall-Ellis 2015; Mathews and Pardue 2009). Metadata analysis can also be an important skillset for library staff in more recently emerging specialties, such as research data services and digital scholarship (Hannah et al. 2020; Thielen and Neeser 2020; Skene 2018; Xia and Wang 2014).

Despite the need for metadata analysis skills, few jobs use the phrase “metadata analyst” in their titles. But many library staff are doing this work, no matter what their official job titles might be. This is abundantly clear from even a cursory sampling of conference presentations and publications, including many published in this journal. But perhaps because metadata analysis work is distributed across so many library functions and job titles, there is little clear guidance for learning these skills, or even understanding what they are. As people who came to the work of metadata analysis because of some combination of experience, aptitude, and interest, but without the base of skills and knowledge that we have today, the authors know very well how challenging it can be to piece together both the technical and library-specific knowledge required to do metadata analysis well. To put it another way, there are a huge, almost overwhelming, number of case studies and project descriptions from which the aspiring analyst can glean useful information about what to learn and (sometimes) how to learn it, but there are few resources that synthesize and generalize that information into something digestible and actionable. In this article, we will describe metadata analysis and its typical tasks, outline the skills and knowledge required to execute those tasks successfully, and discuss several key analysis and problem solving approaches that a good metadata analyst relies upon.

Metadata Analysis Scope and Tasks

Before we can meaningfully explore the skills and knowledge required for good metadata analysis, we will attempt to clarify what we mean by metadata analysis in the library context. Data analysis broadly defined includes processes such as data cleaning, exploration, transformation, and statistical analysis. Library metadata analysis may include all of these activities, embedded in the understanding of metadata in context as a central, living component of an ecosystem enabling many core library functions. Here, “ecosystem” means a library’s larger environment of internal and external repositories, services, and tools, in which planning, analysis, design, deployment, and maintenance of metadata management processes are frequent and ongoing needs. A large number of library business processes — in areas ranging from discovery to access to collection management — depend on accurate, high-quality metadata, meaning that metadata analysis is essential work in most libraries.

Working from that definition, we can generalize about the tasks involved in completing typical metadata analysis projects as a helpful first step toward identifying the specific skills and knowledge areas an analyst needs. First, the analyst needs to know (or be able to find out) what data is available, and the sources from which it is available. Second, the analyst needs to know how the data can be accessed programmatically or in batch with automation tools. Third, the analyst needs to know how the data are used, and how they should be used, if that differs from current usage. Fourth, the analyst needs to determine what data changes, merges, or enhancements are needed to produce the desired outcomes. Finally, if the same process needs to be repeated, the analyst must be able to determine how to do it efficiently, sustainably, and at whatever scale the library requires. With these tasks as a framework, we can start to see what the analyst needs to know in order to work through the tasks successfully.

Core Knowledge Areas

Based on the generalized tasks, we developed a list of core knowledge areas for metadata analysis. The list of seven core knowledge areas presented here is extensive, but not comprehensive; it would be difficult to include every knowledge area that library metadata analysis jobs might require, in part because library staff frequently have multiple intersecting areas of responsibility. The authors have chosen to focus on these seven areas based on their own experience and informal observation of the field. A future project to survey library staff with metadata analysis responsibilities to learn what knowledge areas they find essential would be valuable.

Seven core knowledge areas for metadata analysis

  1. Library Metadata Standards, Applications, and Systems
  2. Data Cleaning and Text Normalization
  3. Data Serializations
  4. Interoperability
  5. Database Design and Querying
  6. Web Technologies and Services
  7. Workflow Analysis and Documentation Practices

Because there are many different educational and experience backgrounds that might make someone interested in metadata analysis work, it should be expected that an analyst will come to the job with more expertise in some areas than others. The authors have relatively similar backgrounds: both had deep experience in library technical services, especially cataloging and electronic resources management, along with technical aptitude and interest, but little formal education in programming, computer science, or business analysis. As a result, some of these knowledge areas were more challenging than others for us to learn, and we both continue to learn every day on the job. Most metadata analysts will not need deep expertise in all of these areas, and some may need skills in areas not among these seven, such as data visualization and project management. The key is to know enough to get the job done, and to know how and where to learn more when the need arises. In the next section of the article, we will go into some detail about each area of knowledge.

Library metadata standards, applications, and systems

The first core knowledge area represents the domain-specific knowledge that is a requirement for almost all metadata analysis: library metadata standards, how those standards are applied, and library systems. The analyst should have a broad familiarity with library metadata standards, including content standards, and encoding and transmission standards. Additionally, the analyst should have expertise in widely-used standards, which may vary from library to library. Some fluency with very widely-used standards, such as MARC, is likely to be necessary for most library metadata analysts. The analyst needs a good understanding of the library management system(s), discovery system(s), and other repositories in use at their library, comprising both a high-level understanding of the systems’ capabilities and limitations, and a detailed understanding of their metadata management processes. Because metadata is the foundation for many core library business processes, the analyst should be able to identify and articulate the cross-functional impacts of data changes, and how internal data changes may affect external services and tools that rely on metadata. Finally, the analyst needs to be competent in using whatever built-in reporting and analysis capabilities are offered by the library’s systems.

Key domain-specific knowledge that is sometimes overlooked includes legacy standards and practices, and local practices and schema. It is not always easy to gather such knowledge, because documentation of past decisions and practices may be incomplete or inaccessible. Depending on how thoroughly a library recorded such information in the past, documenting past practice can be an important part of an analyst’s work.

Data cleaning and text normalization

The second core knowledge area is data cleaning and text normalization, also known as data wrangling. This is the practice of evaluating and transforming data to ensure its validity, accuracy, completeness, and consistency. Along with library metadata standards and systems, this is the most important knowledge area for metadata analysis work. It is critical for data remediation and enhancement projects, but also for devising mappings and crosswalks between metadata standards, and for creating any kind of analysis that combines data from varying sources, standards, or application profiles.

Data wrangling encompasses many specific skills. Most important for metadata analysis work are programmatic file and text manipulation, recognizing and expressing patterns (often through tools such as regular expressions), using conditional statements and actions, and isolating elements for evaluation, modification, or both. To successfully wrangle textual metadata, the analyst also needs to understand how to use and convert between character encodings, know where to find documentation of any standardized values, and be able to identify where the library’s metadata departs from those standards. Without dwelling on specific tools, the ability to perform programmatic manipulations is undeniably important. Although learning a scripting or programming language is a likely path for any metadata analyst who does not come to the job with that proficiency, many tools that require no programming ability can achieve the same goals, and can also serve as a gentle introduction to programming concepts and ways of thinking. Specific tool choices are less important than conceptual understanding.
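To make these skills concrete, here is a minimal Python sketch (the values, field rules, and function name are invented for illustration) combining several of them: Unicode normalization, regular-expression pattern matching, and a conditional validity check, applied to ISBN-like strings.

```python
import re
import unicodedata

def normalize_isbn(raw):
    """Normalize an ISBN string: strip qualifiers, hyphens, and stray
    punctuation; return None if no valid-length ISBN remains."""
    # Normalize Unicode (e.g., full-width digits) to a canonical form
    text = unicodedata.normalize("NFKC", raw)
    # Drop parenthetical qualifiers such as "(pbk.)"
    text = re.sub(r"\(.*?\)", "", text)
    # Keep only digits and a possible check character 'X'
    candidate = re.sub(r"[^0-9Xx]", "", text).upper()
    # Conditional check: ISBNs are 10 or 13 characters long
    return candidate if len(candidate) in (10, 13) else None

raw_values = ["978-0-306-40615-7 (pbk.)", "0306406152", "not an isbn"]
cleaned = [normalize_isbn(v) for v in raw_values]
print(cleaned)  # ['9780306406157', '0306406152', None]
```

The same pattern — normalize, isolate, test — generalizes to most wrangling tasks, whether implemented in a script or in a no-code tool.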

Data serializations

The third core knowledge area is data serializations. Data serializations are ways to package data structures or data objects for storage or transmission. The most common serializations for library metadata (like much other data available on the internet) are XML, JSON, and CSV, but there are many others. A metadata analyst should be well acquainted with those three serializations, and also able to work with others as needed. The keys to working with any serialization are to understand whether and how the data can be processed with familiar toolsets, and to be able to output data, once processed, in other serializations. When working with less common serializations, it may be necessary to explore new or unfamiliar tools to assist in transforming the data to a more familiar format.
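As a small illustration of moving among serializations (the record data here is invented), Python’s standard library alone can read JSON and re-serialize the same records as CSV and simple XML:

```python
import csv
import io
import json
import xml.etree.ElementTree as ET

# A small record set, as it might arrive from a JSON API (invented data)
json_data = '[{"title": "Example Title", "isbn": "9780306406157"}]'
records = json.loads(json_data)

# Re-serialize the same records as CSV
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["title", "isbn"])
writer.writeheader()
writer.writerows(records)
print(buffer.getvalue())

# And as simple XML
root = ET.Element("records")
for rec in records:
    node = ET.SubElement(root, "record")
    for key, value in rec.items():
        ET.SubElement(node, key).text = value
print(ET.tostring(root, encoding="unicode"))
```

Once data is parsed into a common in-memory structure, the choice of output serialization becomes a detail rather than an obstacle.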

Interoperability

The fourth core knowledge area is interoperability. There are two types of interoperability with which a metadata analyst should be familiar: metadata interoperability and systems interoperability. The Dublin Core Metadata Initiative glossary describes metadata interoperability as “the ability of … systems or components to exchange descriptive data about things, and to interpret the data … in a way that is consistent with the interpretation of the creator of the data.”[1] Metadata interoperability is about semantic compatibility: do these two elements really have the same meaning? Do these two records really represent the same resource? The ability to formulate and answer questions like these is critical to the work of a metadata analyst when it involves mapping across metadata schemas, incorporating enrichments from one metadata source into another, or creating data mashups to address core business needs, such as collection analysis. Another key concept of metadata interoperability is identifiers, which are essential to almost all automated metadata processing. Reliable identifiers are important for the matching, reconciliation, and enhancement tasks that comprise much of the work of metadata analysis. Reliable identifiers are also critical for functional linked data. A metadata analyst should understand what identifiers are available in the metadata they work with, which entities and levels of granularity each type of identifier signifies, and how the identifiers are maintained.
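The identifier-based matching described above can be sketched briefly in Python. The record sets, field names, and normalization rules below are invented for the example; real OCLC number handling has more edge cases.

```python
# Hypothetical record sets from two sources, keyed by different fields
catalog = [
    {"id": "rec1", "oclc": "ocm00012345", "title": "Example One"},
    {"id": "rec2", "oclc": "12345678", "title": "Example Two"},
]
vendor = [
    {"sku": "A-1", "oclc_number": "12345", "title": "EXAMPLE ONE"},
    {"sku": "A-2", "oclc_number": "99999999", "title": "Example Three"},
]

def normalize_oclc(value):
    """Strip legacy prefixes (ocm/ocn/on) and leading zeros so that
    differently formatted OCLC numbers can be compared."""
    value = value.lower()
    for prefix in ("ocm", "ocn", "on"):
        value = value.removeprefix(prefix)
    return value.lstrip("0")

# Index one source by normalized identifier, then match the other against it
catalog_index = {normalize_oclc(r["oclc"]): r for r in catalog}
matches = [
    (v["sku"], catalog_index[normalize_oclc(v["oclc_number"])]["id"])
    for v in vendor
    if normalize_oclc(v["oclc_number"]) in catalog_index
]
print(matches)  # [('A-1', 'rec1')]
```

Note that the match succeeds despite the differing title capitalization precisely because a reliable, normalized identifier is available; without one, the analyst would be reduced to fuzzy string matching.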

Systems interoperability can be understood as the context surrounding metadata: how is it stored, accessed, communicated between systems? How can the analyst fetch the data when needed, or use it to power external systems or tools? A metadata analyst may not need deep technical expertise in this area, but rather needs to have a clear understanding of how metadata will be used by and communicated between systems, especially if and when full automation of a workflow or integration is the goal. Even if an analyst is not writing the code for their own Extract-Transform-Load (ETL) pipelines, they will likely need to be able to conceptualize the process at a high level and devise requirements to communicate with a developer.

Database design and querying

The fifth core knowledge area is database design and querying. Again, deep knowledge here may not be necessary, but a basic understanding of how relational databases are structured is crucial. An analyst should have at least a high-level understanding of table relationships, logical structure, and data modeling. Knowing specific query languages such as SQL, XQuery, XPath, or SPARQL may be important depending on the systems and data an analyst works with. But more than fluency in any particular query language, it is important for an analyst to understand the core concepts of relational algebra that those languages implement, such as set operations, unions, differences, joins, and so on. Those concepts underlie many of the fundamental tasks in data analysis beyond database querying.
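Those relational concepts are easy to experiment with using Python’s built-in sqlite3 module. The table names and data below are invented, but the join and set-difference queries are the kinds of operations an analyst performs constantly, whatever the system.

```python
import sqlite3

# In-memory database with two toy tables (invented schema and data)
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE bibs (bib_id INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE holdings (holding_id INTEGER PRIMARY KEY, bib_id INTEGER, location TEXT);
INSERT INTO bibs VALUES (1, 'Title A'), (2, 'Title B'), (3, 'Title C');
INSERT INTO holdings VALUES (10, 1, 'Main'), (11, 2, 'Annex');
""")

# A join: titles together with their holding locations
rows = con.execute("""
    SELECT b.title, h.location
    FROM bibs b JOIN holdings h ON b.bib_id = h.bib_id
    ORDER BY b.title
""").fetchall()
print(rows)  # [('Title A', 'Main'), ('Title B', 'Annex')]

# A set difference: bibs with no holdings attached
orphans = con.execute("""
    SELECT title FROM bibs
    EXCEPT
    SELECT b.title FROM bibs b JOIN holdings h ON b.bib_id = h.bib_id
""").fetchall()
print(orphans)  # [('Title C',)]
```

An analyst who understands what the join and the EXCEPT are doing conceptually can translate the same logic into XQuery, SPARQL, or a spreadsheet lookup as circumstances require.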

Web technologies and services

Web technologies and services comprise the sixth core knowledge area. Much of what a metadata analyst needs to know in this area overlaps with interoperability, but beyond that, an analyst needs to understand some specifics of how to work with data on the internet. A high-level understanding of how web applications are created and managed is useful, as is familiarity with web technology stacks, HTML, CSS, and JavaScript. In addition, an analyst may need a deeper understanding of web APIs, the request-response cycle, and library-specific internet standards and protocols such as OpenURL, OAI-PMH, and Z39.50/SRU. A working knowledge of semantic web and linked data concepts, standards, and technologies such as RDF, BIBFRAME, JSON-LD, and SPARQL is also essential; even if these are not currently required for a metadata analyst’s work, they may well be in the near future.
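A brief sketch of the request-response pattern as it appears in OAI-PMH harvesting: the endpoint URL below is hypothetical, and rather than making a live HTTP call, the example parses a truncated, invented response to show the shape of the work.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Building an OAI-PMH ListRecords request URL (endpoint is hypothetical)
base_url = "https://example.org/oai"
params = {"verb": "ListRecords", "metadataPrefix": "oai_dc", "set": "ebooks"}
request_url = f"{base_url}?{urlencode(params)}"
print(request_url)

# Parsing a truncated, invented OAI-PMH response
sample_response = """<?xml version="1.0"?>
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record><header><identifier>oai:example.org:1</identifier></header></record>
    <record><header><identifier>oai:example.org:2</identifier></header></record>
  </ListRecords>
</OAI-PMH>"""
ns = {"oai": "http://www.openarchives.org/OAI/2.0/"}
root = ET.fromstring(sample_response)
identifiers = [e.text for e in root.findall(".//oai:identifier", ns)]
print(identifiers)  # ['oai:example.org:1', 'oai:example.org:2']
```

In a real harvest the analyst would fetch `request_url` over HTTP and loop until the server stops returning a resumption token, but the core skills — constructing parameterized requests and parsing namespaced XML responses — are the ones shown here.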

Workflow analysis and documentation practices

The seventh and final core knowledge area is workflow analysis and documentation practices. Important skills in this area include business and requirements analysis, process improvement analysis, and flowcharting. Understanding how to devise and communicate workflows and dataflows is especially important for an analyst’s work to be shareable and replicable, and also important when a metadata analyst collaborates with a developer. Even when, as is common, an analyst acts as both analyst and developer on small projects, these skills help an analyst understand exactly what they are implementing in code, and how to communicate that to others. Finally, because documentation is often a major part of an analyst’s work, they must also understand documentation best practices, such as plain language and documentation accessibility.

Practical Analysis Techniques and Problem Solving Approaches

Libraries create, acquire, maintain, and share significant quantities of metadata, in a wide variety of types and formats, supporting functions that include bibliographic description, authority control, acquisitions, administration, holdings, and others. In most libraries the amount of metadata constantly increases, and each year brings new metadata elements that are not present in older records, or new standards that produce records substantially different from older ones. Increasing reliance on externally sourced metadata, and growing demand to share metadata with external partners, mean that the library must manage metadata of varying quality, created under varying approaches. Both human- and computer-generated metadata are subject to errors that need to be addressed. It is within this context that the need for metadata analysis, with the goal of solving metadata problems, arises.

We recognize at least six areas where metadata analysis is frequently needed: workflow and automation, configuration, batch creation and updates, import and export, troubleshooting, and documentation creation. These areas are not necessarily independent of each other, and often overlap or form a sequence within a larger project. The knowledge areas and skills described above, implemented through tools and techniques with which the analyst is comfortable, inform the final shape of the approach taken for any particular analysis.

Information gathering techniques

In order to address a metadata problem, the metadata analyst must begin by gathering information about the problem. We will first discuss the basic techniques a metadata analyst can use, and then we will turn to an overview of the kinds of information that are useful for completing a metadata analysis. While not exhaustive, common techniques include observation, reading documentation, advanced queries and searches, conversation, and process mapping and workflow diagrams. Sometimes one of these techniques will be enough to resolve the problem; other times the analyst may need to rely on all of them, along with others.

As an example, Martin was tasked with updating cataloging macros both for a migration from Millennium to Sierra and for RDA, a workflow and automation problem. He chose observation to understand how catalogers were using the macros, and to see where Sierra differed enough that the macros broke. He then turned to the documentation for AutoIt, a scripting language for automating Windows tasks, to understand what was possible. Finally, he used a form of process mapping to chart the logic the macros relied on in order to get them working again.

Approaching a metadata problem

One high-level way to organize an approach to a metadata problem is to think through four aspects of the problem: context, extent, urgency, and importance. In the cataloging macro example above, the context and extent were determined through observation, while the urgency and importance stemmed from both the move to a new ILS and the recently released RDA cataloging standard. Context includes such things as local practices and policies, national and international standards, and the constraints and possibilities of local systems and the tools the analyst is comfortable with. Context also includes the analyst’s other pending job duties and projects. Another important element of context is whether the library has direct, local control over the metadata at issue. For instance, as customers of Ex Libris, we rely on Ex Libris’s Central Discovery Index for a significant amount of discovery metadata, which means that article-level metadata, for example, is not something we can directly edit.

Before an analyst can formulate a plan to address a metadata problem, it is important to understand whether, for example, there is a need to deal with one subfield in one record, or with multiple subfields across 25,000 records. This is the extent of the problem. In the modern library, one record with an issue likely means there are more with the same issue. Assessing extent involves using the preliminary analysis tools at the analyst’s immediate disposal, such as a global search in the library catalog, or queries through other means such as SQL or another tool that interacts with the catalog. If the problem was raised by someone else, the analyst can follow up with that person to find out whether they have useful knowledge of the problem’s extent. Understanding the extent helps the analyst determine whether the problem would best be addressed programmatically, for example through a scripting language such as Python. Depending on the extent of a problem, a one-time solution with many manual steps may be sufficient; larger or ongoing problems may require partially or fully automated solutions.

Another important aspect to evaluate is the urgency of the problem. Urgency can be determined by a number of factors: library administrators, whether access to content is broken, project deadlines, and more. This is subjective, but there can be a vast difference between needing to fix a typo in note fields and needing to fix one in title fields. One record with a typo in the title could be fixed immediately, or it could be considered low urgency and deferred. This is why it is so important to weigh context, extent, and urgency together with importance; only then does it become clear when, and how, the analyst needs to address the issue.

Importance is also a subjective determination. That typo in the notes field might seem urgent to a doctoral student who could not find a source while working on their dissertation, but it may not seem urgent to the metadata librarian who already has a backlog of projects. This is where the question of importance comes in. In a large research university, a doctoral student missing out on a key source is pretty important, so while the analyst may disagree on the urgency, there may be agreement on the importance.

Practically speaking, once all of these aspects are taken into consideration, it becomes clear that there is simply not enough time to tackle every problem. This weighing is an important step in the process, because none of these four aspects is a binary choice, and they all influence one another. It is also important to remember that the person who reports a metadata issue may believe the problem is important, urgent, and extensive; it is up to the analyst to see the problem in the context of other pending analysis work. The authors have yet to encounter a life-or-death situation involving metadata.

Tools

Solving a metadata problem may sometimes be as simple as adjusting a journal coverage date in the knowledge base, but more often it will involve a software tool of some kind. The three tools external to library systems most familiar to librarians are MarcEdit, OpenRefine, and Python, especially with the pymarc and pandas libraries. All can be run on Mac, Windows, and Linux systems, and Python additionally benefits from its widespread use in other domains. Advanced metadata analyses and transformations can be performed in all of them, limited often only by the analyst’s creativity.

MarcEdit is perhaps the best-known tool for metadata transformations. It is a powerful tool with an easy-to-use interface, and can handle simple find-and-replace tasks as well as extremely complicated transformations. MarcEdit’s developer, Terry Reese, continues to actively expand the tool’s functionality, which now includes working with linked data, library catalog APIs, OCLC APIs, authorizing bibliographic headings, and processing and outputting metadata in various serializations, among many other helpful functions. Users can also develop “tasks” that work like scripts: reusable, multi-step transformations of MARC fields and subfields that can include regular expressions. There are many examples of projects completed with MarcEdit; one excellent example of using MarcEdit with XSLT to perform advanced transformations is described in Marielle Veve’s article “From Digital Commons to OCLC: A Tailored Approach for Harvesting and Transforming ETD Metadata into High-Quality Records” (Veve 2016).

OpenRefine presents metadata in an interface reminiscent of a spreadsheet, and offers many useful functions for metadata analysis and transformation. One of its most useful out-of-the-box functions for the metadata analyst is clustering, which is especially useful when “columns that have a reasonable amount of overlap but sometimes suffer from transcription errors, such as names” are at issue in a metadata analysis project.[2]
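The idea behind clustering is worth understanding even apart from OpenRefine itself. The sketch below is a simplified Python approximation of OpenRefine’s “fingerprint” keying method (lowercase, strip accents and punctuation, sort unique tokens); the name data is invented.

```python
import re
import unicodedata
from collections import defaultdict

def fingerprint(value):
    """A simplified version of OpenRefine's 'fingerprint' keying method:
    lowercase, strip accents and punctuation, then sort unique tokens."""
    text = unicodedata.normalize("NFKD", value).encode("ascii", "ignore").decode()
    text = re.sub(r"[^\w\s]", "", text.lower())
    return " ".join(sorted(set(text.split())))

# Variant name strings cluster together under the same key
names = ["Smith, John", "John Smith", "SMITH JOHN", "Jane Doe"]
clusters = defaultdict(list)
for name in names:
    clusters[fingerprint(name)].append(name)
print(dict(clusters))
```

All three “John Smith” variants share the key `john smith`, so they land in one cluster that the analyst can then review and merge.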

Python excels at efficiency, and at handling different kinds of data at the same time. With enough experience, an analyst using Python can replicate functions from OpenRefine and MarcEdit while also drawing on additional functionality from Python’s extensive, modular ecosystem. For instance, an analyst can work with a spreadsheet and MARC records in the same script, and can extend that efficiency further if their library systems support APIs, using those APIs to fetch and update batches of records. Another frequently used Python tool is the Jupyter Notebook: an interactive environment for writing and running Python code that can also record explanatory notes, instructions, and commentary, as well as include visualizations. Jupyter Notebooks are extremely useful when learning Python, and also when the analyst wants to see the results of each piece of code and either review it or act on it before moving on. The authors have several regularly used processes that are executed through Jupyter Notebooks rather than as standalone scripts, including processes to produce holdings reports for ingest by HathiTrust, and to derive lists of print and electronic library holdings from vendor offer lists.[3][4]
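The essential shape of a holdings-versus-offer-list comparison can be shown with nothing but the standard library. This sketch uses invented CSV data and a plain set of ISBNs standing in for library holdings; a real version would read files and normalize identifiers first.

```python
import csv
import io

# Invented vendor offer list (as CSV text) and library holdings (as a set)
offer_csv = "isbn,title\n9780306406157,Title A\n9781234567897,Title B\n"
holdings_isbns = {"9780306406157", "9999999999999"}

# Partition the offers by whether the library already holds each title
offers = list(csv.DictReader(io.StringIO(offer_csv)))
already_held = [o for o in offers if o["isbn"] in holdings_isbns]
not_held = [o for o in offers if o["isbn"] not in holdings_isbns]
print([o["title"] for o in already_held])  # ['Title A']
print([o["title"] for o in not_held])      # ['Title B']
```

Run step by step in a notebook, each intermediate list can be inspected before the next operation, which is exactly the review-then-act workflow described above.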

Solving the problem

When a metadata problem comes up, regardless of its nature, it is important to gather information to determine its context, extent, urgency, and importance using the techniques discussed above. When the analyst combines that information with knowledge of relevant tools, the solution often becomes clear. The important thing is to understand and use the tools at hand while building knowledge and skill with others. A helpful analogy: there are multiple ways to dig a tunnel. You can dig one with a boring machine or with a spoon; one option may take longer than the other, but you can usually get to the same place eventually. While we are capable of using Python and other sophisticated tools to analyze and solve problems, if a tool like MarcEdit can already do what needs to be done, there is no need to write new code.

Being unable to write code will prevent an analyst from successfully solving a metadata problem only when thousands of unique changes are needed across thousands of records, or when the task at hand is labor intensive and must be repeated. In those situations, it is important to eventually find a programmatic solution. But using familiar tools effectively matters more than trying to learn every tool, even if it means running a long sequence of find-and-replace commands in MarcEdit; that will likely still be faster than learning Python first.

Conclusion

The knowledge areas, techniques, and tools of metadata analysis discussed in this paper will no doubt sound familiar to many of the readers of this journal. Besides providing a high-level overview of the knowledge, skills, and practices of metadata analysis for those who are already engaged in it, our hope is that this exploration will benefit those newer to the field who seek to learn metadata analysis skills. There is much overlap between metadata analysis and more discrete library functional areas of cataloging, e-resources management, and library systems management, among others. Because the need for metadata analysis is not limited to one specialty within libraries, it has been difficult to pin down precisely what a metadata analyst does, and consequently for the library community to provide a concrete action plan for learning to do this kind of work. Our own experiences confirm this. Through our discussion of metadata analysis and its common tasks, approaches, and tools, we have sought to articulate a baseline of knowledge that can be used by library staff at all levels of the organization to build coherent development plans, write accurate job descriptions, and develop a common understanding of metadata analysis work.


[1] Dublin Core Metadata Initiative. “Metadata Interoperability,”
[2] Phillips M, Tarver H, Frakes S. 2014. “Implementing a collaborative workflow for metadata analysis, quality improvement, and mapping” The Code4lib Journal (23).
[3] UMNLibraries/hathitrust-inventories:
[4] UMNLibraries/holdings-from-isbns:

References

Binici K. 2021. What are the information technology skills needed in information institutions? The case of “code4lib” job listings. The Journal of Academic Librarianship, 47(3):102360. doi:10.1016/j.acalib.2021.102360

Gonzales BM. 2019. Computer programming for librarians: a study of job postings for library technologists. Journal of Web Librarianship, 13(1), 20-36. doi:10.1080/19322909.2018.1534635

Hall-Ellis SD. 2015. Metadata competencies for entry-level positions: what employers expect as reflected in position descriptions, 2000-2013. Journal of Library Metadata, 15(2):102-134. doi:10.1080/19386389.2015.1050317

Hannah M, Heyns EP, Mulligan R. 2020. Inclusive infrastructure: digital scholarship centers and the academic library liaison. Portal 20(4):693-714. doi:10.1353/pla.2020.0033

Mathews JM, Pardue H. 2009. The presence of IT skill sets in librarian position announcements. College & Research Libraries, 70(3):250-257. doi:10.5860/0700250

NASIG. 2019. NASIG core competencies for electronic resources librarians.

Ratledge D, Sproles C. 2017. An analysis of the changing role of systems librarians. Library Hi Tech, 35(2):303-311. doi:10.1108/LHT-08-2016-0092

Skene E. 2018. Shooting for the moon: an analysis of digital initiatives librarian job advertisements. Digital Library Perspectives, 34(2):84-90. doi:10.1108/DLP-06-2017-0019

Thielen J, Neeser A. 2020. Making job postings more equitable: evidence based recommendations from an analysis of data professionals job postings between 2013-2018. Evidence Based Library and Information Practice, 15(3):103-156. doi:10.18438/eblip29674

Veve M. 2016. From Digital Commons to OCLC: a tailored approach for harvesting and transforming ETD metadata into high-quality records. The Code4lib Journal (33).

Xia J, Wang M. 2014. Competencies and responsibilities of social science data librarians: an analysis of job descriptions. College & Research Libraries, 75(3):362-388. doi:10.5860/crl13-435

About the Author

Stacie Traill is Metadata and Discovery Analyst at the University of Minnesota Libraries.
Martin Patrick is Metadata Analyst at the University of Minnesota Libraries.


ISSN 1940-5758