Issue 19, 2013-01-15

SPRUCE Mashup London

SPRUCE digital preservation mashups are a series of events being organized in the United Kingdom to bring together digital preservation practitioners and developers to work on real-world digital preservation challenges. During each three-day event, developers create practical solutions to the real-world digital preservation challenges the practitioners bring with them, while the practitioners work to create compelling business cases for digital preservation at their institutions. This article describes the SPRUCE Mashup London event held in September 2012.

by Edward M. Corrado

Introduction

The SPRUCE Project is a JISC-funded partnership led by Leeds University Library; other partners include The British Library, Digital Preservation Coalition, London School of Economics, and Open Planets Foundation. The partnership is designed to “collaborate to develop a strategy for engaging the academic community around digital preservation and will also contribute technical expertise and solutions.” [1] Besides organizing digital preservation mashups and other events, SPRUCE has a number of other initiatives. Two other major initiatives are providing funding awards of up to £5,000 for developing “practical digital preservation outcomes and/or development of digital preservation business cases” [2] (available to UK institutions only) and a community project known as Crowd sourced Representation Information for Supporting Preservation (CRISP), which aims to address the challenges of Representation Information (RI) for digital preservation using crowd-sourcing techniques and the collective wisdom and knowledge of digital preservation experts. [3]

SPRUCE digital preservation mashups are a series of events organized in the United Kingdom that are intended to bring digital preservation practitioners and developers “together to discuss, test, code […], plan, and share challenges related to the new types of content entrusted to libraries, archives, and museums to preserve and manage.” [4] Practitioners bring real-world digital preservation challenges and are then paired with developers, who work in a sprint-like software development fashion to come up with working solutions before the end of the three-day event. It is hoped that each solution will not only help the practitioner with their specific problem, but also be reusable by other practitioners who experience similar issues in the future. The solutions and other outcomes of the event are documented on the SPRUCE Project Wiki.

While the developers work to create practical solutions based on real-world problems, the practitioners work to create compelling business cases for their digital preservation work, with the goal of helping them request and receive funding for their work and digital collection management duties. The SPRUCE Mashups are rather unique in how they are structured and in how they bring developers and practitioners together at the event. Therefore, the structure of SPRUCE Mashup London is described in some detail below.

SPRUCE Mashup London

The SPRUCE Mashup London was held at the Hubworking Centre in London from September 18-20, 2012. The event was free to attend, and attendees were provided food and lodging throughout at no charge.

The event started with an introduction to the SPRUCE Digital Preservation Mashup and why the organizers felt such events are necessary. They said they felt that, in some cases, there was a mismatch between digital preservation problems and the software solutions available, and that better coordination between practitioners and developers was needed. They believed that some open source solutions were possibly already available to practitioners, but that practitioners might not be aware of them or might not have enough software knowledge to implement them. The organizers also identified a need for more education about digital preservation sustainability and about how to create a business case that could lead to additional institutional support and funding for digital preservation projects.

There were two main goals for SPRUCE Mashup London to help address these problems. The first goal was to solve some concrete digital preservation challenges. This was to be done by “capturing” some real-world preservation challenges and then solving the problems using an agile software development process.

The second goal was to prepare a generic business case for digital preservation. Digital preservation is not often seen as “exciting” by upper-level administrators. Therefore, practitioners need to build a business case to present to administration that outlines the reasons why it should support digital preservation. The purpose of the generic business cases that the practitioners created was to help identify the benefits and stakeholders of digital preservation services at each practitioner’s institution. During the mashup, practitioners would also work on an organizational digital preservation skills gap analysis and on a digital preservation elevator speech that they could present to administrators at their organization.

After the organizers described the purpose, goals, and agenda of the mashup, the participants introduced themselves and discussed their experiences with digital preservation. The practitioners also described the particular problems they brought with them to the event in hopes of having a developer provide a working, practical solution. The issues presented ranged from organizational questions about where to start to highly technical problems. For example, Rachel MacGregor from the Birmingham Archives and Heritage Service stood up and said “Help – I’ve got digital content and I don’t know how to manage it” [5] while Maurice de Rooij of the National Archives of the Netherlands (NANETH) had an issue where TIFF files would not render correctly (or at all) even though JHOVE marked them as valid and well-formed. [6] Other issues presented at the London Mashup can be viewed on the Open Planets Knowledge Base Wiki. Indeed, the wiki [7] was used throughout the event to document what was being discussed and created during the mashup.

After the practitioners gave their introductions, the developers followed with their own introductions and responded to some of the practitioners’ issues. In a few cases attendees were both developers and practitioners, but for the purposes of the mashup they chose one hat or the other (normally developer, but not always). In total there were approximately twenty attendees, not counting the organizers. Most of the participants were from the United Kingdom, although some came from other European countries and one (the author of this article) came from the United States.

Once introductions were complete, the organizers broke the participants into small groups that each contained a couple of practitioners and developers. In these small groups the practitioners and developers discussed the issues presented earlier so that the developers could get an understanding of what type of solution was needed. A summary was then presented to everyone, and after a short break the developers went off to begin working on the solutions while the practitioners got together and described their issues on the mashup wiki, based on their conversations with the developers.

While the developers were developing solutions to the practitioners’ problems, the practitioners began to build their business cases for digital preservation. The first step of building the business case was to identify the benefits of digital preservation. Some of the benefits identified included:

  1. Increased access to digital objects (allowing multiple people to access documents, fulfilling legal and other institutional obligations, and making objects more discoverable through enhanced metadata);
  2. Developing expertise and knowledge that increases the status and credibility of the organization and enables it to provide digital preservation services and accept digital collections;
  3. Financial benefits that could be gained from providing digital preservation services to other organizations, and the possibility of receiving grants;
  4. Automation of preservation tasks (thus saving valuable staff time);
  5. Fulfilling the institutional mission, which in many libraries and archives cannot be done without being able to handle digital material; and
  6. Preserving institutional memory and local/regional history created using digital media.

The second day began with a brief meeting between the developers and the practitioners they were creating tools for. This was followed by a brief update to all of the participants from the developers on where they stood with the solutions they were working on. After this, the developers once again broke off and started developing while the practitioners went back to working independently on their business plans – this time focusing on stakeholder analysis. The day continued with developers working on projects while practitioners worked on their business cases, with time provided in between the sessions for the developers to consult with the practitioners. During the day, practitioners also worked on a digital preservation skills gap analysis for their organization.

During the third and final day of the mashup, the main agenda item was to finish what everyone had started, commit code, and finish the descriptions on the wiki of the developers’ solutions as well as the business cases that were created by the practitioners. The practitioners also created, and presented to everyone in attendance, short elevator pitches that they could give to administrators at their organizations.

Some of the solutions developed include:

  1. Peter May wrote a Java program that makes “use of a custom Apache Tika wrapper to extract file format identification and metadata from a directory of files and present aggregated data for identifying which files have full descriptive metadata and which don’t.” [8] It was agreed that this program worked well at summarizing information about various files within a directory. This can be used for identifying possible duplicate files and could also be used to determine the amount of descriptive metadata that is embedded in a set of files.
  2. Rob Talbot and Peter Cliff developed a solution for archivist practitioners Thom Carter and Rebecca Webster that built on previous work from Peter May and Carl Wilson which extracted metadata from files utilizing Python and Apache Tika to produce output in JSON and a simple XML format. Talbot and Cliff felt that “Tika was also a good choice as the archivists indicated that these collections were predominantly text-based documents in relatively recent formats – mostly MS Office and PDF – both formats that Tika handles very well for both metadata and text extraction.” [9] Cliff also created three further scripts that used this output to create n-gram word clouds. A related project at the mashup utilized Perl scripts “that used the metadata […] extracted using Apache Tika to help locate duplicates and different versions of the same document.” [10] A minimal Python sketch of this style of Tika-based extraction appears after this list. (See Correction)
  3. Dominic Ivaldi worked on a solution that used FFMPEG as a video transcoder. The practitioner had numerous extremely large video files that needed to be stored and preserved. For example, he had a 28GB file that contained an 18-minute promotional movie created for the Ford Motor Company in the 1950s. This appears to be excessive “considering entire movies are sold commercially on 9.4GB DVDs.” [11] A transcoding sketch appears after this list.
  4. Maurice de Rooij worked on a PHP script that reads EXIF data from an image file using ExifTool and then normalizes the output into a Dublin Core compatible XML file. It also adds specific metadata, which is contained in an .ini file, to the Dublin Core XML. [12] A sketch of this mapping approach appears after this list.
  5. Maurice de Rooij also took the lead on solving an issue about how to move records from Microsoft Sharepoint to Eprints for digital preservation. It was not feasible to come up with a direct solution during the mashup, but they were able to create a new Sharepoint view that contained all of the fields and could be exported as a Microsoft Excel file. Another issue identified was that future content should be properly formatted, which means “users need to be educated and made aware” [13] that if future documents are not properly formatted they may not be able to be preserved.
  6. Bram Lohman, Dirk von Suchodoletz, and Tom Woolley investigated how to preserve video games and how to provide public access to the preserved games. Instead of developing code, they created a list of issues involved and available emulators.[14]
  7. Maurice de Rooij added NeXus file recognition to FIDO. [15] FIDO (Format Identification for Digital Objects) is a simple, command line tool written in Python that can be used “to identify the file formats of digital objects. It is designed for simple integration into automated work-flows.” [16]
  8. A team of developers worked on a problem the National Archives of the Netherlands had with corrupt TIFF images. The TIFF images were unusable even though tools such as JHOVE marked them as well-formed and valid. The team discovered that the problem was that the images claimed to be 16-bit greyscale files when they were really 8-bit greyscale files. The team used exiftool to detect and correct the problem files. This situation led to important questions and takeaways, including what it truly means for a file to be valid and well-formed. [17] An illustrative detection check appears after this list.
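
The exact code produced at the mashup is linked from the wiki; the following is only a minimal sketch of the kind of Tika-based extraction used in solutions 1 and 2. It assumes the tika-python bindings (which launch a local Tika server) rather than the wrappers the developers actually wrote, and the descriptive metadata fields it checks for are illustrative choices.

    # A minimal sketch of Tika-based metadata extraction over a directory.
    # Assumes the tika-python bindings (pip install tika), which start a local
    # Tika server on first use; the mashup solutions used their own Java and
    # Python wrappers, so treat this as illustrative rather than their code.
    import os
    import sys

    from tika import parser

    # Descriptive fields to look for; purely illustrative choices.
    DESCRIPTIVE_KEYS = ("dc:title", "dc:creator", "dc:subject")

    def summarize_directory(root):
        """Report which files carry descriptive metadata and which do not."""
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in filenames:
                path = os.path.join(dirpath, name)
                parsed = parser.from_file(path)  # {'metadata': ..., 'content': ...}
                metadata = parsed.get("metadata") or {}
                present = [key for key in DESCRIPTIVE_KEYS if metadata.get(key)]
                status = "has descriptive metadata" if present else "no descriptive metadata"
                print(f"{path}: {metadata.get('Content-Type', 'unknown type')} - {status}")

    if __name__ == "__main__":
        summarize_directory(sys.argv[1] if len(sys.argv) > 1 else ".")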
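
For the video transcoding solution, the sketch below simply shows how a large master file might be re-encoded with FFmpeg from Python. The codec and quality settings (H.264 at CRF 18, AAC audio) are assumptions made for illustration, not the settings chosen at the mashup.

    # A sketch of re-encoding a large master video with FFmpeg from Python.
    # The codec and quality settings are assumptions made for illustration.
    import subprocess
    import sys
    from pathlib import Path

    def transcode(source: Path) -> Path:
        """Create a much smaller H.264/AAC access copy next to the source file."""
        destination = source.with_name(source.stem + "_access.mp4")
        subprocess.run(
            [
                "ffmpeg",
                "-i", str(source),   # input master file
                "-c:v", "libx264",   # H.264 video
                "-crf", "18",        # high-quality constant rate factor
                "-preset", "slow",   # better compression at the cost of time
                "-c:a", "aac",       # AAC audio
                str(destination),
            ],
            check=True,
        )
        return destination

    if __name__ == "__main__":
        print(transcode(Path(sys.argv[1])))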
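
The ExifTool-to-Dublin-Core solution was written in PHP; the Python sketch below only illustrates the general idea. The EXIF-to-Dublin-Core field mapping and the .ini section and key names are hypothetical, and exiftool's -json output stands in for however the original script invoked ExifTool.

    # A Python sketch of the ExifTool-to-Dublin-Core idea (the original solution
    # was a PHP script). The EXIF-to-DC mapping and the .ini section and key
    # names are hypothetical; exiftool's -json output supplies the EXIF data.
    import configparser
    import json
    import subprocess
    import sys
    import xml.etree.ElementTree as ET

    DC_NS = "http://purl.org/dc/elements/1.1/"
    ET.register_namespace("dc", DC_NS)

    # Illustrative mapping from ExifTool tag names to Dublin Core elements.
    EXIF_TO_DC = {
        "Artist": "creator",
        "ImageDescription": "description",
        "CreateDate": "date",
        "MIMEType": "format",
    }

    def exif_to_dublin_core(image_path, ini_path):
        exif = json.loads(subprocess.check_output(["exiftool", "-json", image_path]))[0]
        record = ET.Element("record")
        for tag, element in EXIF_TO_DC.items():
            if exif.get(tag):
                ET.SubElement(record, f"{{{DC_NS}}}{element}").text = str(exif[tag])
        # Merge extra metadata kept outside the script in an .ini file.
        config = configparser.ConfigParser()
        config.read(ini_path)
        for element, value in config["dublin_core"].items():  # hypothetical section
            ET.SubElement(record, f"{{{DC_NS}}}{element}").text = value
        return ET.tostring(record, encoding="unicode")

    if __name__ == "__main__":
        print(exif_to_dublin_core(sys.argv[1], sys.argv[2]))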
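
Finally, for the TIFF problem, one way to see why the files were suspicious is to compare the amount of image data a file should contain for its claimed bit depth with what it actually records. The check below is an assumed heuristic for uncompressed greyscale TIFFs; it is not the team's actual detection or repair code, which is documented on the wiki.

    # An illustrative check for the TIFF problem above: compare the amount of
    # pixel data a greyscale TIFF should contain for its claimed bit depth with
    # the strip byte counts it actually records. This is an assumed heuristic
    # for uncompressed files, not the team's actual detection or repair code.
    import json
    import subprocess
    import sys

    def claimed_depth_is_plausible(tiff_path):
        tags = json.loads(subprocess.check_output(
            ["exiftool", "-json", "-ImageWidth", "-ImageHeight",
             "-BitsPerSample", "-StripByteCounts", tiff_path]))[0]
        width, height = tags["ImageWidth"], tags["ImageHeight"]
        bits = int(str(tags["BitsPerSample"]).split()[0])        # first sample
        actual = sum(int(n) for n in str(tags["StripByteCounts"]).split())
        expected = width * height * bits // 8                    # uncompressed estimate
        return actual >= expected, bits, actual, expected

    if __name__ == "__main__":
        ok, bits, actual, expected = claimed_depth_is_plausible(sys.argv[1])
        if not ok:
            print(f"Claims {bits}-bit data but holds {actual} of {expected} expected "
                  f"bytes; the image data may really be {bits // 2}-bit.")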

Mashup participants voted on awards for the best developer and the best practitioner during the event. The winner of the best developer award was Maurice de Rooij from the National Archives of the Netherlands while Rachel MacGregor won the award for best practitioner.

All of the materials from the event, including the business plans, the digital preservation issues the practitioners had, and the solutions the developers came up with are available on the event wiki.

Impressions

Although I occasionally write shell scripts and other small computer programs, I participated in this event as a practitioner. Overall I found this to be an excellent event, and I would encourage practitioners who have real-world digital preservation problems to solve, as well as developers who can solve these types of problems, to participate in future SPRUCE Digital Preservation mashups or similar events. As Rachel MacGregor reported in a blog post, “I suspect that any of the developers could have offered solutions for my datasets” and at the end of the event “I already had an answer to my question!” [18] Another practitioner mentioned how useful the mashup was to them, and I imagine most of the practitioners would agree. There were multiple benefits to attending the SPRUCE Mashup London as a practitioner. The first, and probably the most obvious, benefit was that I came away with working code created by an expert developer that solved a real problem I was having with creating and maintaining a list of metadata mappings for a large photograph collection. I was also able to work on a digital preservation business plan with the help and feedback of other people involved in digital preservation. Lastly, the networking opportunities were tremendous. I met a number of digital preservation developers and practitioners whom I can consult with and ask questions of.

While I wasn’t participating as a developer, I imagine that the developers benefited equally. Not only did they get to solve real-world problems, they got to collaborate with other highly skilled developers. In some cases, they also brought digital preservation issues with them and were able to get other developers to help them figure out solutions to difficulties they were having. In many cases, developers working on digital preservation projects do not have other developers at their workplace they can go to when they are trying to discover the best way to solve a problem. By getting together at events like this mashup, they can get other sets of eyes to look at their problems and will also know whom they can consult when they have additional questions in the future. I believe this cooperation and the networking that occurred during the event will help build a strong UK community of developers and practitioners interested in digital preservation.

The funding for this event was covered by JISC. I believe this was a key factor in its success, as the only expenses to participants and their employers were related to traveling to the mashup. In many organizations, especially for developers, it may have been difficult to get funding to attend if the organization had to cover food and lodging expenses as well. For this type of event to be replicated elsewhere, I think receiving some level of outside funding will be crucial to achieving a critical mass of attendees. In the United States, because of the country’s geographic size, travel costs may be a hindrance to attendance, although that could be mitigated somewhat by holding regional mashups and choosing locations that are easy and inexpensive for enough people to reach.

Event Calendar

  • April 2012: SPRUCE Mashup (Glasgow, Scotland)
  • September 2012: SPRUCE Mashup (London, UK)
  • April 30-May 2, 2013: SPRUCE Mashup (Leeds, UK)
  • July 2013: SPRUCE Mashup (London, UK)

Endnotes

[1] http://www.dpconline.org/advocacy/spruce/786-spruce
[2] http://wiki.opf-labs.org/display/SPR/SPRUCE+Awards+-+funding+opportunity+for+digital+preservation
[3] http://wiki.opf-labs.org/display/SPR/Crowd+sourced+Representation+Information+for+Supporting+Preservation+%28CRISP%29
[4] http://www.dpconline.org/advocacy/spruce/878-2nd-spruce-mashup-london-tuesday-18-20-september-2012-?format=pdf
[5] http://openplanetsfoundation.org/blogs/2012-10-01-spruce-mashup-london-2012-practitioners-tale
[6] http://wiki.opf-labs.org/display/SPR/Valid+and+well-formed+TIFF%27s+with+scanline+corruption
[7] The main page for the SPRUCE Mashup London event is available at http://wiki.opf-labs.org/display/SPR/SPRUCE+Mashup+London. Readers of this article are encouraged to review the wiki for more details about the event.
[8] http://wiki.opf-labs.org/display/SPR/Distinguishing+Files+with+Descriptive+Metadata
[9] http://wiki.opf-labs.org/display/SPR/Extracting+and+aggregating+metadata+with+Apache+Tika
[10] http://wiki.opf-labs.org/display/SPR/Using+Perl+to+write+scripts+for+reporting+on+the+content+of+the+collection
[11] http://wiki.opf-labs.org/display/SPR/FFMPEG+as+Video+Transcoder
[12] http://wiki.opf-labs.org/display/SPR/Maintain+a+list+of+metadata+mappings+outside+of+the+script
[13] http://wiki.opf-labs.org/display/SPR/Moving+records+from+Sharepoint+to+Eprints+for+preservation+solution
[14] http://wiki.opf-labs.org/pages/viewpage.action?pageId=16713667
[15] http://wiki.opf-labs.org/display/SPR/NeXus+Data+Collection+ISIS+-+STFC+-+solution
[16] http://www.openplanetsfoundation.org/software/fido
[17] http://wiki.opf-labs.org/display/SPR/Solving+TIFF+malformation+using+exiftool
[18] http://openplanetsfoundation.org/blogs/2012-10-01-spruce-mashup-london-2012-practitioners-tale

Correction

16 January 2013: The original version of this article misidentified some of the people and/or their roles on Solution #2. The author apologizes for this error.

About the Author

Edward M. Corrado is Director of Library Technology at Binghamton University located in Binghamton, NY (USA). At Binghamton, he provides leadership for information technologies and digital initiatives and overall direction, administration, and management of computer resources, systems, and networking in the Libraries. Corrado also supervises the Systems Department; oversees the Libraries’ technology infrastructure, web services, and other information access and production technologies; is responsible for the Libraries’ ILS (Ex Libris ALEPH); works with the Library faculty and staff to research and develop new and innovative technologies and services; recommends policies; plans upgrades; maintains current awareness of digital library technologies; works with University Information Technology Services; and represents the Libraries’ information technology interests within the University and in SUNY-wide initiatives.


ISSN 1940-5758