Issue 40, 2018-05-04

Centralized Accessioning Support for Born Digital Archives

Archives often receive obsolete digital storage media alongside paper acquisitions: CDs and DVDs mixed in with folders of correspondence, Zip disks, and floppy disks set aside by the donor with the intention to review the content later. Archives must not only have the expertise to work with digital media, but also the hardware and software to capture the content without the risk of altering the files merely by viewing them. This article will describe how Yale University Libraries and Museums addressed accessioning of born-digital archival content on physical media through a centralized digital accessioning support service. Centralizing the hardware and expertise required for working with physical media made it possible to accession media more quickly and return the files to the originating archives for arrangement and description.

by Alice Sara Prael

Archives often receive obsolete digital storage media alongside paper acquisitions: CDs and DVDs mixed in with folders of correspondence, Zip disks, and floppy disks set aside by the donor with the intention to review the content later. However, when “later” arrives, the obsolete drive necessary to view the content is more difficult to find and so these media find their way to the archive, often with the note “I’m not sure what’s on this.”

The longer the files remain on the media, the more difficult and potentially expensive it is to capture the content. Archives must not only have the expertise to work with digital media, but also the hardware and software to capture the content without the risk of altering the files merely by viewing them. Although there are vendors specializing in capturing content from obsolete media, these services come at a cost that increases for every additional disk.

This article will describe how Yale University Libraries and Museums addressed accessioning of born-digital archival content on physical media through a centralized digital accessioning support service. Centralizing the hardware and expertise required for working with physical media made it possible to accession media more quickly and return the files to the originating archives for arrangement and description.

Special Collections at Yale

There are many special collection units at Yale University, each with its own collecting focus. Eight special collection units agreed to participate in the pilot digital accessioning service – the Irving S. Gilmore Music Library, Robert B. Haas Family Arts Library, Divinity Special Collections, Beinecke Rare Book and Manuscript Unit, Yale Center for British Art Institutional Archives, Manuscripts and Archives, Medical Historical Library, and Peabody Museum Archives.

Prior to the creation of the service, each unit worked with physical media independently and with varying levels of success. Some units were able to programmatically capture content from disks, some worked with born-digital content on physical media only when access was requested by a researcher, and some units did not have the capacity to work with digital content and avoided collecting physical media.

In 2014, Yale University Library adopted ArchivesSpace as the system of record for archival descriptions. In 2015, YUL selected the Preservica Digital Preservation System (DPS) for managing long-term preservation of born-digital and digitized content. However, when the centralized service to support digital accessioning launched in 2016, special collection units were at varying levels of implementation of both systems. While some units created and edited finding aids entirely in ArchivesSpace, some were still working to import existing finding aids from legacy systems. At the time of this article’s publication, some units are ingesting born-digital content into the Digital Preservation System while others are not participating in the DPS, relying on other preservation solutions instead. These differing levels of implementation and use of both systems affected the readiness of each unit to participate in the service. Yale’s implementation of ArchivesSpace is integrated with the DPS, which provides the opportunity for synced description of digital files and the potential for automated access. However, working with integrated systems is also challenging since mistakes made in one system may be replicated in the other.

Identifying the Problem

Initial estimates provided by special collection units identified a backlog of at least 4,000 individual pieces of media located in special collections at Yale. Since a significant portion of that content has not been accessed since before its acquisition, the exact volume of this at-risk data was unknown. The estimates provided were based on known acquisitions and have since proven inaccurate. At the time of this article’s publication, the service has received nearly 6,000 pieces of media.

In January 2015, the Born Digital Archives Working Group (BDAWG) was formed to provide leadership in the area of born digital archives [1]. The group’s vision is to provide the same level of stewardship for born-digital archival holdings as is devoted to our physical collections. One of its first priorities was to address the growing backlog of born digital archives and determine how Yale libraries and museums can pool resources and expertise to find a path forward. In 2016, BDAWG recommended the creation of a centralized service and the hiring of a two-year Digital Accessioning Archivist as the best way forward for capturing born-digital content on physical media.

In 2015, an off-site facility was renovated and repurposed to address the need for additional space (both for archival work and other departmental needs within Yale). This facility included the new Di Bonaventura Family Digital Archaeology and Preservation Lab. The lab space is shared by the YUL Digital Preservation Unit and the Beinecke Rare Book and Manuscript Unit. The lab was equipped with two digital forensic machines, custom built by the Digital Preservation Manager, Euan Cochrane, and other machines and equipment moved from a previous shared lab space on central campus. In 2016, a new position, Digital Accessioning Archivist for Yale Special Collections, was created to develop and manage this centralized service. The service is staffed primarily by myself as Digital Accessioning Archivist with guidance from Gabby Redwine, Beinecke’s Digital Archivist, and in collaboration with BDAWG. The service also employs student assistants to help with data entry and capturing content from media. The service was launched in July 2016 as a two-year pilot project to eliminate the backlog of physical media in archival collections.
Figure 1. The lab shared by the YUL Digital Preservation Unit and the Beinecke Rare Book and Manuscript Unit

Building a Service

The first step in creating the digital accessioning service was to meet with the stakeholders to identify their needs. We identified the stakeholders as members of BDAWG and each of the participating special collection units. In order to simplify communication, each unit selected a staff member to serve as the digital liaison. This position is responsible for determining the order of media sent for accessioning, communicating updates between the service and the special collection unit, and providing the service staff with access to the necessary systems (shared storage location, ArchivesSpace instance, etc.). There is some overlap in stakeholder positions as three out of six members of BDAWG also served as the digital liaison for their unit.

To guide conversations with stakeholders, I created a rough sketch of how a centralized service for born digital accessioning support might run. This included a strategy for shared documentation and ongoing communication, description of the equipment already available for use in the lab, and a list of tasks and materials that were already identified as out-of-scope, such as data tapes and audio cassettes.

During interviews stakeholders identified the following needs:

  • Capture content from physical media (removing the need for specialized hardware/expertise within the unit)
  • Scan content for personally identifying information (PII)
  • Create manifest of all files captured
  • Create item level descriptions for physical media
  • Update existing item level descriptions for physical media
  • Prepare content for long term preservation
  • Accommodate privacy concerns during processing
  • Reformat non-born-digital audio-visual material

Although we had anticipated many of these needs, many of which prompted the service’s creation, there were a few surprises. We expected the need to scan content for PII, but we did not anticipate privacy concerns about staff working with sensitive material in a shared space. The lab is in a secure facility, but a large window allows material to be viewed by staff outside the service; this need for additional privacy could be met with a policy on using the window blind while working with such material. Some needs could not be addressed: for instance, creating digital derivatives of non-born-digital A/V material was already defined as outside the scope of the service. Another surprise was the need to update existing item level descriptions in ArchivesSpace. Some special collection units had already created item level descriptions for much of their physical media, and these descriptions are often more thorough than the minimal description required by the service. For instance, the service uses the label written on the physical disk as the item title, but the record creator’s label may be less relevant than a title created by the processing archivist. The need for two description options was considered during the workflow creation process.

To contain the scope of the service we identified a list of media types that will be accepted for accessioning. Accepted media types were limited to ensure that all incoming material could be accessioned in a timely manner with the equipment available in the lab. Since the extent field is a controlled value in ArchivesSpace, the system required a controlled list of media types for the item level description. The service accepts the following media types: 3.5 inch floppy disks, 5.25 inch floppy disks, JAZ disks, ZIP disks, Optical Media, Hard Drives, Flash Drives, and Laptops. Desktop computers are not accepted because of the limited space in the lab and the potential for damage during transportation, but the service offers consultations for units that need help removing the hard drive.

Creating Workflows

After discussing the needs of our stakeholders, we began defining the workflow. For the sake of clarity, this is presented as a linear process, but in reality some stakeholder needs did not emerge until the workflow was fully implemented. The workflow presented here required tweaks to address concerns as they surfaced. This section offers a high-level description of the work done within the service, the reason for completing each step, and the inherent organizational concerns.

Preparing the Submission

Prior to submitting material to the digital accessioning service, the digital liaison must provide the Digital Accessioning Archivist with access to the special collection unit’s ArchivesSpace repository. Access is required to create or update item level descriptions for the physical media. The liaison must also provide access to a network storage share with sufficient space for processing media. The service saves the born-digital content on this network share prior to ingest to the DPS. The service recommends 3-lock security network storage, particularly for units collecting personal papers or institutional records with additional security concerns [2].

For each submission to the service, the digital liaison must complete a metadata spreadsheet (whose columns are mapped to ArchivesSpace fields) describing each disk in the shipment. This spreadsheet is submitted to the service via an online form, which also requires the liaison’s contact information and method of transportation for the shipment. Each row in the spreadsheet later becomes an item level description in ArchivesSpace. The spreadsheet fills a second function of providing confirmation that the service staff is working with the correct media.

It is worth noting that not all special collection units use ArchivesSpace as their system of record. In these cases, there is no ArchivesSpace repository to provide the service with access to and the URL column in the metadata spreadsheet may be left blank. These spreadsheets will not be imported to ArchivesSpace but instead saved to the network share.

There must be a resource record for the collection in ArchivesSpace in order for service staff to create new item level descriptions. The spreadsheet requires the URL for the parent record, i.e. the archival object one level up in description hierarchy. If the item level description already exists and is only being updated by the service then the spreadsheet should contain the URL for the item description requiring an update.

Each piece of media must be assigned a unique identifier, or media number, and the identifier must be written on the physical media. The identifier facilitates identification and management of information about the item. The service provides guidance on how to write the identifier on media without damaging the carrier or obscuring the original label, but the method is ultimately up to the digital liaison. The identifier is often a derivative of the accession number, but some units do not rely on accession numbers and instead use call numbers. The only requirement is that the identifier be unique within the special collection unit. The identifier is included in the spreadsheet, used as the file name for all service output associated with that disk, and entered as the Component Unique Identifier (CUID) in the item’s description in ArchivesSpace [3].
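The identifier requirements described above can be sketched in a few lines of Python. This is an illustrative sketch only: the service states no rules beyond uniqueness within the unit, so the filesystem-safety check and the example naming helper are assumptions.

```python
import re

def validate_cuid(cuid, existing_cuids):
    """Check that a proposed media identifier is non-empty,
    filesystem-safe (it becomes a file name), and not already in use.
    Only uniqueness is an actual service requirement; the character
    check is a hypothetical extra safeguard."""
    if not cuid or re.search(r"[^A-Za-z0-9._-]", cuid):
        return False
    return cuid not in existing_cuids

def output_filename(cuid, extension):
    """All service output associated with a disk is named after its CUID."""
    return f"{cuid}.{extension}"
```

For example, `output_filename("2016-M-034.d001", "jpg")` would name the disk’s photograph after its CUID.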

In the Service

Once the metadata spreadsheet and survey form have been submitted and the shipment of media has been received in the lab, service staff confirms that the correct boxes have arrived and the content was not damaged in transit. If the materials arrive in the containers listed in the spreadsheet, service staff confirms that the container numbers are correct. Often, however, the media has been removed from hybrid collections and placed together in an unnumbered temporary container; in these cases, the service only confirms that the media has not been damaged. The service does not confirm each piece of media upon arrival, as each piece is verified against the metadata spreadsheet during imaging and capture. Each special collection unit has been assigned a shelf in the lab where its media is stored while in queue. Shipments of media are processed in the order they arrive; however, special collection units can request expedited accessioning.

Each disk is photographed by the service – these photographs serve as an access surrogate of the physical disk. Physical media is often restricted from use in the reading room due to the fragility and technical complications of access. If a researcher is interested in viewing the handwritten label on a disk, this photograph can fill that need. Photographing media also aids in workflow management. The photograph is helpful when the service needs to communicate with the special collection unit about a discrepancy between the physical media and the spreadsheet description.

When media is accessed on a machine it must be connected via a writeblocker, which ensures that staff members are not able to alter the content or metadata on the disk. Without a writeblocker, it is possible to change the metadata or reformat an entire disk volume just by connecting to the media. Once the media is safely connected to a machine it is scanned for viruses and malware on a non-networked machine. If viruses are discovered the service will contact the originating unit to determine the appropriate path forward. All media except for floppy disks are virus scanned prior to capture. It was determined that viruses or malware found on floppy disks are low risk to modern technology, but the fragility of this media type means that there may be only one chance to connect to the media. All content captured from media is virus scanned upon ingest to the DPS, which serves as the final security check for viruses and malware.

Once the media is deemed safe for use, content is captured in the form of a disk image. A disk image is an exact copy of the disk volume containing the folder structure, orphaned and deleted (but not overwritten) files, and all associated metadata. All media received by the service is imaged unless a different capture method has been requested by the originating unit. A special collection unit may request that the files be transferred without attempting to create a disk image for reasons including donor concerns with the capture of deleted files, or the disk capacity is large but contains a small amount of desired data.

There is one exception to the disk-image-first rule: Compact Disc-Digital Audio (CD-DA). This format has a higher error rate than a standard CD-ROM, and a disk image would have an equally high error rate. For this reason, if a CD is identified as a CD-DA the service will not attempt to create a disk image but will instead use a direct transfer method that reads the disk multiple times to capture the best possible copy.

If the first attempt at imaging fails, the Digital Accessioning Archivist will try to determine the reason for the failure and attempt a second time. This often includes returning to the photograph of the media to determine if there are any hints as to the formatting of the original disk. If disk imaging fails a second time, that imaging process is logged as a failure in the metadata spreadsheet. If imaging fails, the service then attempts to transfer the folder structure and files without a disk image. The transfer capture process is also attempted twice before it is logged as a failure. The service is limited to two attempts for imaging and transfer capture due to the significant backlog of material and the time constraints of a two-year project. Any error message or reason for the failure is logged with the item level description to aid staff who may return to the media in the future for another attempt at capturing the content.
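The two-attempt policy with its fallback from imaging to transfer can be sketched as follows. This is an illustrative reconstruction, not the service’s actual code; `capture_fn`, `image_fn`, and `transfer_fn` stand in for whatever imaging or transfer routine is invoked, and are assumed to raise an exception on failure.

```python
import logging

def capture_with_retries(capture_fn, media_id, method, attempts=2):
    """Run a capture routine up to `attempts` times, recording any
    error message so it can be logged with the item-level description."""
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            capture_fn()
            return ("Pass", None)
        except Exception as exc:
            last_error = str(exc)
            logging.warning("%s %s attempt %d failed: %s",
                            media_id, method, attempt, exc)
    return ("Fail", last_error)

def accession_capture(media_id, image_fn, transfer_fn):
    """Try disk imaging first; if both attempts fail, fall back to a
    file transfer, which also gets two attempts."""
    status, error = capture_with_retries(image_fn, media_id, "Image Capture")
    if status == "Fail":
        status, error = capture_with_retries(transfer_fn, media_id,
                                             "Transfer Capture")
    return status, error
```

The Pass/Fail values mirror the event-record outcomes recorded in the metadata spreadsheet.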

The service conducts quality assurance on content captured by mounting the disk image (if created) and viewing the folder structure. If the disk image can be mounted and a folder structure exists, that disk image is considered a success. If the files were captured via transfer, the quality assurance process is to ensure that the captured folders exist and contain data. The service is not able to conduct file level quality assurance due to the limited time and scope of the project.
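The quality-assurance check for transferred files amounts to confirming that the captured folders exist and contain data. A minimal sketch of that check, under the assumption that a non-empty file anywhere in the capture directory counts as success:

```python
from pathlib import Path

def qa_transfer(capture_dir):
    """Minimal quality-assurance check for a transfer capture: the
    captured folders must exist and contain at least one non-empty
    file. This is not file-level QA, which is out of scope."""
    root = Path(capture_dir)
    if not root.is_dir():
        return False
    return any(p.is_file() and p.stat().st_size > 0 for p in root.rglob("*"))
```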

If content was successfully captured from the media, the disk image or transferred files are then scanned for Personally Identifiable Information (PII). Three types of PII are identified by the service: social security numbers, credit card numbers and the phrase “bank account”. Unfortunately, bank account numbers are too variable to search for without returning every integer found on a disk. However, many record creators label their bank account numbers with the phrase “bank account”. When researching how other tools scan for private financial data we discovered that the common method is to search for financial terms such as ‘bank account’ or ‘account holder’. Due to the time constraints of the project the service is not able to filter out false hits from this scan. Instead the output is saved as a CSV file for later review by the originating unit.
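The three checks described above can be approximated with regular expressions. The patterns below are illustrative assumptions (a real scan would need to handle SSNs without dashes, varied card lengths, and so on); the point is the shape of the output, one CSV row per hit for later review.

```python
import csv
import re

# Illustrative patterns for the three checks; not the service's
# actual search terms or tooling.
PII_PATTERNS = {
    "social_security_number": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "credit_card_number": re.compile(r"\b(?:\d[ -]?){15}\d\b"),
    "bank_account_phrase": re.compile(r"bank account", re.IGNORECASE),
}

def scan_text(text, source):
    """Return one row per hit: the source file, PII type, and match."""
    rows = []
    for pii_type, pattern in PII_PATTERNS.items():
        for match in pattern.finditer(text):
            rows.append({"file": source, "type": pii_type,
                         "hit": match.group()})
    return rows

def write_report(rows, out_path):
    """Save hits as a CSV file for later review by the originating unit."""
    with open(out_path, "w", newline="") as fh:
        writer = csv.DictWriter(fh, fieldnames=["file", "type", "hit"])
        writer.writeheader()
        writer.writerows(rows)
```

As the text notes, hits on the phrase “bank account” include false positives; the CSV is deliberately unfiltered, leaving review to the originating unit.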

Once the above steps have been completed, the physical media is returned to the originating unit while the service workflow proceeds. Service staff imports the spreadsheet to ArchivesSpace, where each row becomes a new item level description or updates an existing one. The final step in the workflow is to package the captured content along with any associated metadata files.

At this point the workflow forks to provide special collection units with two options: save the package to a ‘hot folder’ for automatic ingest to the DPS, or save the package to network storage for review before ingest. In the first option, the ‘hot folder’ must be created by the digital preservation team and the digital liaison must provide the service staff with access. This folder is configured by the digital preservation team to automatically pull content into an ingest workflow for the DPS. Once the package is saved to this folder the service workflow ends; any issues that arise during ingest are directed to the originating unit.

The second option saves packages to the unit’s network storage share. This is the same network storage used for processing, so the service creates a “Completed” folder and notifies the unit via email when packaging is complete for a shipment of material. This option is useful for units that plan to deaccession content once they are able to review the files. Many of the disks accessioned by the service have not been connected to a machine since before their arrival at the archive. In this case staff may review the files and make any deaccessioning decisions, at which point they are responsible for ingesting the material to the DPS.

Researching and Developing Tools

Once stakeholder needs and the scope of the service were defined, I began researching the tools necessary to complete the work. This section will address the more technical aspects of the workflow as well as the research and development required to implement the service.

Photography

Physical media is photographed using an overhead camera on a copy stand. Staff from the Digital Services Unit (DSU) at the Beinecke Library were very helpful in setting up and providing access to the photography equipment. Service staff had no experience with photography and relied heavily on DSU staff to configure and test the equipment. Optical media is photographed using a flatbed scanner because the lighting equipment used on the copy stand creates a glare, often making the creator’s label illegible in the photographs. After consulting with external groups, we determined that a flatbed scanner is the best method to photograph optical media. The scanner still causes some glare when the top of the disk is reflective, but the glare is diminished significantly compared to the copy stand.

Writeblocking

The lab is equipped with an UltraKit from Digital Intelligence, which contains writeblockers for USB, SCSI, SATA and IDE connectors along with various cables and adapters. This writeblocker kit is sufficient for the vast majority of material; however, the version originally purchased for the lab did not include a FireWire writeblocker. In 2018 the service purchased an UltraKit v4.1 to fill this gap and provide backup equipment for the lab. The new version includes a FireWire writeblocker in addition to the writeblockers for USB, SCSI, SATA and IDE connectors. Purchasing backup writeblockers provides more flexibility for use of equipment in the lab. Since the lab is a shared space utilized by the digital accessioning service, the digital preservation team, and the Beinecke Rare Book and Manuscript Library, it is important that the service has dedicated access to the necessary equipment. Testing determined that hardware writeblockers are not reliable for protecting USB-connected Zip disks; for that reason the service relies on software writeblockers in the BitCurator environment when working with Zip disks.

Virus Scanning

Virus scanning must occur on a non-networked quarantine machine. A machine from the previous iteration of the digital forensics lab was repurposed to serve as our quarantine machine. This machine has a CD-ROM drive and a Tableau T3458is forensic bridge installed internally; the forensic bridge serves as a writeblocker with USB, SATA, SCSI and IDE connectors. The quarantine machine is not networked, but must be re-connected to the network monthly to update the virus scanner. Avast Antivirus software is installed on the machine and provides the following options:

  • Auto: attempts to repair the file. If unsuccessful, moves the file to the Virus Chest or deletes the file if neither action is successful.
  • Delete: permanently removes the file from your PC.
  • Repair: removes malicious code if the file is only partially infected. This action is not possible if the entire code is malware.
  • Chest: sends the file to the Virus Chest where the file cannot harm your system.
  • Nothing: makes no changes to the contents or location of the file (not recommended).

If a virus is discovered these options are communicated to the originating unit and they must decide the appropriate path forward.

Capturing Content

The digital forensic machines are equipped with both 3.5 and 5.25 inch floppy disk drives, installed internally and connected via a KryoFlux board. KryoFlux is both a software tool and a hardware controller board used to connect to a floppy disk and create a forensic disk image. The KryoFlux is a common tool for archives because it is fairly inexpensive, works with a variety of floppy disk formats, and includes write-blocking capabilities. The KryoFlux is also able to capture a low-level copy of the magnetic flux on a disk, saved in a proprietary stream file format. Although stream files cannot be mounted to view files like a disk image, they can be re-interpreted into a disk image using the KryoFlux software if the archivist can identify the correct disk image format.

Since Zip disks require the use of software writeblockers available in the BitCurator environment, Zip disks are imaged using the Guymager tool which is available in BitCurator. Both digital forensic machines have two partitions, one in Windows and one in Linux BitCurator with an additional hard drive that is accessible from either partition. This allows service staff to access content from either partition without requiring network access.

For all other media types, the service uses FTK Imager, a free software tool that is lightweight and easy to use. Imaging optical media produces an ISO image; all other media types produce a raw disk image. Although we researched Guymager for this purpose as well, we ultimately decided to move forward with FTK Imager for two reasons. First, FTK Imager can automatically create a manifest of files while imaging a disk. Second, a large portion of optical media imaging is done by student workers who log success/failure in an Excel spreadsheet, which is more easily done in a Windows environment.

For transferring content from physical media, service staff researched the Preservica Submission Information Package (SIP) Creator tool. This tool can be pointed at a directory to package the folder structure and files and create preservation metadata. The issue with using the SIP Creator to capture original files is that it is used later in the workflow to package all service output together for ingest to the DPS. This would result in double packaging: a first package containing the transferred files, and a second containing the first package along with the photograph, PII scan, and other metadata files created by the service. This double packaging is not ideal and may cause issues with ingest to the DPS. The other drawback of the tool is its inability to de-select files or folders or to skip a file that cannot be copied; if a file is corrupted or cannot be copied, the SIP Creator will fail to create a package.

For this reason, I decided to move forward with a different tool, FastCopy [4], which was already in use by the digital preservation team for transferring files. The tool verifies files before and after transfer using MD5 checksum validation. If a single file is corrupted or cannot be copied, the tool skips that file and provides an error message with the file path. When the tool is unable to copy a file, the file path and associated error message are copied into a log file and the transfer capture is logged as a partial pass.
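The verified-transfer behavior FastCopy provides can be sketched in Python. This is a stand-in for illustration, not FastCopy itself: copy each file, compare MD5 checksums, and skip-and-log rather than abort when a file fails.

```python
import hashlib
import shutil
from pathlib import Path

def md5sum(path):
    """MD5 of a file, read in chunks so large files fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verified_copy(src_dir, dest_dir):
    """Copy a directory tree, verifying each file by MD5 checksum.
    A file that fails to copy or verify is skipped and logged so one
    bad file does not abort the transfer; a non-empty error list
    corresponds to a 'partial pass'."""
    errors = []
    for src in Path(src_dir).rglob("*"):
        if not src.is_file():
            continue
        rel = src.relative_to(src_dir)
        dest = Path(dest_dir) / rel
        try:
            dest.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(src, dest)
            if md5sum(src) != md5sum(dest):
                raise OSError("checksum mismatch")
        except OSError as exc:
            errors.append((str(rel), str(exc)))
    return errors  # an empty list means a full pass
```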

The final tool used to capture content is Exact Audio Copy (EAC), which was created specifically to work with CD-DAs, whose higher error rate interferes with disk imaging. When run in secure mode [5], EAC reads each sector a minimum of two times, and up to 16 times if an error is detected in the first two reads. By reading each sector multiple times, EAC ensures that the service acquires the best possible copy of the audio files. EAC is configured to create a single WAV file from the audio disc, which is the best practice for preservation and follows our practices for preservation copies of digitized audio. The audio file may later be split into track files for access using track information in the metadata [6].

Scanning for Personally Identifying Information

One machine in the lab is dedicated to running the full version of Forensic Toolkit (FTK). This software is much more robust than the free FTK Imager used for disk imaging. Service staff creates a case in FTK and adds disk images as evidence items. For floppy disks and optical media the disk images are small enough that FTK can load as many as 20 disk images or folders of transferred files in a single case, which streamlines the scanning process. Only disk images or transferred files are scanned for PII; stream files captured from floppy disks cannot be scanned. For hard drives and larger storage media the service loads one disk at a time. Once the captured content is added to a case the service uses the Live Search function to identify files containing social security numbers, credit card numbers and the phrase ‘bank account’. Information acquired by the scan is exported as a CSV file. If multiple disk images were searched in a batch the exported data must be separated to reflect the PII on each disk.
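When several disks are searched in one batch, the exported data must be separated per disk, as described above. A sketch of that separation step, assuming the export is a CSV with one column identifying the source disk (the column name here is an assumption, not FTK’s actual export schema):

```python
import csv
from collections import defaultdict
from pathlib import Path

def split_by_disk(batch_csv, out_dir, disk_column="Evidence Item"):
    """Split a batched search-hit export into one CSV per disk.
    Adjust `disk_column` to match the actual export's identifier column."""
    groups = defaultdict(list)
    with open(batch_csv, newline="") as fh:
        reader = csv.DictReader(fh)
        fields = reader.fieldnames
        for row in reader:
            groups[row[disk_column]].append(row)
    # write one CSV per disk, preserving the original columns
    for disk_id, rows in groups.items():
        with open(Path(out_dir) / f"{disk_id}.csv", "w", newline="") as fh:
            writer = csv.DictWriter(fh, fieldnames=fields)
            writer.writeheader()
            writer.writerows(rows)
```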

Updating Description and Documenting Actions

When the metadata spreadsheet is received by the service additional columns are added to describe the work happening within the service. These columns become event records in ArchivesSpace and are associated with the media’s description. There are four event record types in use by the service: PII scan, Image Capture, Transfer Capture, and Stream Capture (which is only applicable to floppy disks). Each event record includes a date and can be recorded as a Pass, Fail or Partial Pass. Service staff may also add an Outcome Note to describe any issues that arise during that event, including error messages that arise during the capture process.

Once the metadata provided by the originating unit has been confirmed and the event records are complete, the metadata spreadsheet is converted from Excel to a CSV file before being imported to ArchivesSpace using a tool developed locally by the Beinecke Metadata Coordinator and an Archivist from Manuscripts and Archives. This tool was originally developed as a Python script intended for use only by the service; however, it quickly became clear that such a tool would be useful to staff outside the service as well. The script interacts with the ArchivesSpace API to create an item level description and the associated event records from the data in the spreadsheet. The script also produces an outfile spreadsheet containing all the data from the original metadata spreadsheet, with additional URLs pointing to the newly created archival object and event records.
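A minimal sketch of what such a script might do, using only the standard library. The ArchivesSpace backend does expose `POST /repositories/:repo_id/archival_objects` and authenticates requests via an `X-ArchivesSpace-Session` header, but the spreadsheet column names (`title`, `cuid`, `extent_type`) and the exact record fields shown here are illustrative, not the tool’s actual mapping:

```python
import json
import urllib.request

def build_item_record(row, resource_uri, parent_uri):
    """Build the JSON for one item-level archival object from a
    spreadsheet row. The row keys are illustrative assumptions."""
    return {
        "jsonmodel_type": "archival_object",
        "title": row["title"],
        "component_id": row["cuid"],  # the media's unique identifier
        "level": "item",
        "resource": {"ref": resource_uri},
        "parent": {"ref": parent_uri},
        "extents": [{
            "jsonmodel_type": "extent",
            "portion": "whole",
            "number": "1",
            "extent_type": row["extent_type"],  # controlled media type
        }],
    }

def post_record(base_url, session_token, repo_id, record):
    """POST a record to the ArchivesSpace backend and return the new
    record's URI, which the outfile spreadsheet stores alongside the
    original row."""
    req = urllib.request.Request(
        f"{base_url}/repositories/{repo_id}/archival_objects",
        data=json.dumps(record).encode(),
        headers={"X-ArchivesSpace-Session": session_token},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["uri"]
```

Separating record construction from the HTTP call keeps the mapping testable without a running ArchivesSpace instance.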

The first iteration of the script was limited to creating new descriptions and was unable to update existing descriptions. Since many disks already had item level descriptions in ArchivesSpace, creating a new description would lead to additional metadata clean-up work for the special collection unit.

In the second year of the service the script was further developed to improve error reporting, include a Graphical User Interface (GUI), and add the option to update existing records. This development allowed units with existing item descriptions to send material to the service without the need to condense two descriptions into one for each disk accessioned. Providing a GUI also makes the tool easier to implement for Yale units outside the service.

Preparing for Ingest into Digital Preservation System

The final step of the workflow is to prepare the service output by packaging it for ingest into the DPS. These packages are created using the Preservica SIP Creator tool. Decisions made during the packaging step are vital because they affect access restrictions on the digital object in the DPS and the appearance of digital objects in ArchivesSpace. The tool has an easy-to-use GUI; however, packaging the service output one disk at a time would be prohibitively time-consuming. To save time, the Digital Preservation Manager developed a batch packaging tool built on top of the SIP Creator code. The batch packaging tool can process a directory of captured content from many disks and use the metadata spreadsheet to make the connection to ArchivesSpace.

There are three components that must be included during the packaging process. The first is the collection code, which provides the link between the digital object in the DPS and the item description in ArchivesSpace. The package must include the URL of the item description in ArchivesSpace as the collection code in order for the digital object in the DPS to synchronize with the description in ArchivesSpace. The batch packaging tool reads this URL from the outfile spreadsheet created by the ArchivesSpace import tool.

The second component is choosing the correct deliverable unit option in the packager. The tool provides three deliverable unit options:

  • Single Deliverable Unit: the DPS treats the entire directory of content and metadata associated with the disk as a single unit. This appears in ArchivesSpace as a single digital object with no components.
  • Folders as Deliverable Units: the DPS treats each folder as a subcomponent. The SIP as a whole appears as a digital object in ArchivesSpace and each folder appears as a digital object component.
  • Files and Folders as Deliverable Units: the DPS treats each file and folder as a subcomponent. The SIP as a whole appears as a digital object in ArchivesSpace and each folder and file appears as a digital object component.

After discussions with stakeholders, the service staff decided to package service output with folders as deliverable units. It is important that the components of a SIP appear in ArchivesSpace, but file names may contain restricted information and should not be visible to researchers.

The third component is the security tag. Each repository must work with the digital preservation team to create security tags in the DPS in order to manage access and restrictions. It was originally thought that security tags could be applied by the ingest workflow and would be the responsibility of the unit; however, it was determined that if no security tag is included in the packaging step, the package is automatically assigned the “Open” tag. This tag means that anyone with access to that DPS repository would have access to the service output, so the service had to reconsider the packaging process.

Each repository has an associated set of security tags in the DPS, e.g. RepositoryName_Open, RepositoryName_Restricted, etc. Since different materials require different levels of security, a single tag should not be applied to the entire corpus of material imaged or captured for a repository. Because security tags must be applied for each disk, it makes sense to include them as a column in the metadata spreadsheet. Adding a column to the spreadsheet is simple enough, but it also required updates to both the ArchivesSpace import tool and the batch packaging tool. The import tool was updated to ignore the new column, since it does not affect the item description. The batch packaging tool was updated to pull the security tag from the outfile spreadsheet and apply it to the package.

Decisions and processes for managing deliverable units and security tags were developed in conjunction with BDAWG and the digital preservation team. Service staff also met with digital liaisons from special collection units to apprise them of newly developed processes and provide the opportunity to ask questions and voice concerns.

Feedback and Documentation

It is vital that the service consult with stakeholders to ensure their digital accessioning support needs for born-digital material are still being met. After the initial round of testing for tools and workflows, the service hosted a workshop and invited the digital liaisons for each special collection unit. The workshops focused on walking through the workflow from end to end, with special attention paid to the pre-service preparation of material, the metadata spreadsheet, and the service output in both ArchivesSpace and the DPS. These workshops also provided the opportunity to view the lab space and highlight the security of the facility: both the room and the building require ID access. Drafted workflow documentation was provided ahead of time, and the meetings served as an additional opportunity to provide feedback. The workshop was offered three times to accommodate schedules, with a preference for having more than one special collection unit represented in each meeting. Bringing together liaisons with differing levels of experience with digital material provided the opportunity to share questions and experiences, building a stronger community around born-digital archives at Yale.

Since implementation of the workflow led to new developments, specifically around the ArchivesSpace import and DPS ingest, ongoing communication with stakeholders was necessary. Additional updates were provided to digital liaisons via email as the workflow was changed and new tools were developed. The service also has a designated email account to help manage communication and ensure that all inquiries are answered. Feedback from units often led to changes in the workflow and services available. The option to review content prior to ingest and new documentation to guide units in reviewing service output are both examples of changes made in response to feedback.

Since the digital accessioning service works with many special collection units, it is important that the documentation be easily accessible. The Born Digital Archives Working Group dedicated a portion of their LibGuide to providing information about the service and its use[7]. This LibGuide page includes an overview of the services provided, contact information, workflow documentation, manuals used for capturing content, and a link to the separate LibGuide for transporting special collections material. One section of the service LibGuide is devoted to submission documentation, including the metadata spreadsheet template, the submission form, and a PDF of the label that should be attached to a shipment of physical media. The LibGuide serves as a central reference point for units participating in the service and provides thorough documentation for external parties interested in how born-digital accessioning is completed at Yale.

Although the lab is shared between the service, the digital preservation team, and the Beinecke Rare Book and Manuscript Library, other archival units need access to the specialized equipment to appraise content and review media prior to submission to the service. To fill this need, the lab is available to staff members on Friday afternoons. Because of building security, staff members are required to contact the service ahead of time to arrange access. Service staff are also available for consultation during open lab hours. Information about the equipment available in the lab and the requirements for using lab hours is documented in the LibGuide.

Challenges

Building a centralized service presented many organizational and technical challenges. The first logistical challenge was how to move fragile physical media to an off-site location. The lab is located just over a mile from central campus, but special collection material cannot, for security and preservation reasons, be transported on foot or in a staff member’s vehicle. Since the Stephen F. Gates ’68 Library Conservation Lab is located in the same building as the Digital Archaeology and Preservation Lab, guidelines for transporting special collection material were already underway. The Born Digital Archives Working Group advocated for the development of the Special Collections Transport Protocol and ensured that the unique needs of transporting physical media were addressed. Recommendations for physical media included labeling shipments as containing magnetic media, packing optical media vertically rather than flat to minimize the surface area exposed in case of impact, and using foam spacers to ensure that hard drives and other media cannot shift inside the shipping container.

The second organizational challenge was setting boundaries and articulating the limitations of the service. The service was planned as a two-year pilot with the goal of eliminating the accessioning backlog of physical media. In order to meet that goal there were many services that could not be provided including file level quality assurance and experimentation to acquire content from unusual digital formats. Through clear documentation and ongoing communication with special collection units, the service maintained its intended scope.

One major challenge that was not anticipated by the service was that many units lack the time to prepare physical media for submission. The work of completing the spreadsheet, numbering disks, and sometimes creating a new finding aid in ArchivesSpace is time-consuming and difficult to manage with a limited staff. After discovering this gap in services, the Born Digital Archives Working Group requested a second student worker to help units prepare material for submission. The student worker still requires staff time for management, but the new position lessens the burden on special collection units.

The final challenge was as much an opportunity as a drawback: the ArchivesSpace-DPS system integration. This integration provides the opportunity to sync description and content for the same archival materials in both systems. The synchronized description shows researchers what is available to access for physical media in a collection. It also provides access services with an easy-to-follow link to download content for access in the reading room. The DPS can even extract files from certain disk image formats and sync those file names back to the ArchivesSpace record of the digital object. This extraction must be configured in the DPS workflow or done manually during full processing.

This system integration provides many great opportunities for improved description and access, but it also raised a number of challenges. For example, system updates occur regularly for both ArchivesSpace and the DPS and their respective test and production instances. These updates created data synchronization challenges because the data in the test ArchivesSpace instance is overwritten with data from the production instance after updates. This often slowed down the ability to test ingest and import tools and required resetting test data in the DPS. Although these updates and resets were announced ahead of time and service staff often created screenshots to save examples of test structures, this was still very disruptive to the implementation of the full service workflow. Although content was captured from disks starting in July 2016 the first spreadsheet was not imported to ArchivesSpace until March 2017. Given the two-year timeline there was no choice but to continue to capture content and complete the rest of service workflow while the system import and integration questions were not yet answered.

Ingest to the DPS was even further delayed as the deliverable unit options were tested and decisions were made within the units about creating and applying security tags. The deliverable unit and security tag decisions required ongoing communication between the stakeholders and the service. Once these decisions were made the batch packaging tool required further development to implement the changes. Adding security tags as a drop-down menu in the metadata spreadsheet meant that the template spreadsheet could no longer be publicly available on the LibGuide. In an effort to continue easy access for the digital liaisons, a link was provided from the LibGuide to a folder in Box which is available to all Yale staff.

Next Steps

The digital accessioning service pilot phase will officially end in June 2018. At the time of this article’s publication, the service has captured content from 5,910 pieces of media and created or updated 2,695 item level descriptions. Although the backlog of physical media has been diminished significantly there is still a need for centralized accessioning support for born-digital material. In addition to the existing backlog, special collection units continue to acquire born-digital content. Units have expressed an interest in acquiring more born-digital content now that the service exists to ensure responsible stewardship of the materials.

Starting in July 2018 the service will be a permanent function for special collection units at Yale University. The organization of the service will shift to reflect existing needs and available resources. Since the Beinecke Rare Book and Manuscript Library has the largest backlog of born-digital content, the service will focus 75% of its resources to imaging and capturing material from the Beinecke with 10% reserved for accessioning material from the other units. The last 15% of service resources will be dedicated to developing infrastructure and documentation to guide born-digital accessioning across Yale’s special collections. This allocation of resources is subject to change if a special collection unit has additional needs or requires priority accessioning services.

Moving forward, the service will be renamed Digital Accessioning Support Services (DASS). The new name reflects the reality that accessioning is not limited to the work done by the service and accessioning must start before material arrives at the lab. In addition to the functions currently performed by the service, the DASS will expand service offerings to include technical accessioning support for born-digital files received via direct network transfers, enriching captured content with metadata that will facilitate preservation and access, providing software training for staff working with born-digital transfers and DASS output, and growing the existing consulting role to include advising on the technical aspects of acquisitions. Expanding the services offered will require ongoing research in collaboration with BDAWG and members of the digital preservation team.

Conclusion

The pilot digital accessioning service proved the value of centralizing the highly technical aspects of working with born-digital archival materials. By removing the need for specialized hardware and expertise within each unit, special collections staff can focus on the other aspects of archiving born-digital materials. The pilot made it possible for staff to view and process digital files from 5,910 pieces of media. New and updated descriptions in ArchivesSpace provide staff and researchers with a better understanding of the digital materials available in a collection. The service also removed a barrier for special collection units that shied away from collecting born-digital materials due to lack of technology and time required for stewardship and preservation. Moving forward, the DASS will build on the successes of the pilot project and focus on future technical needs for acquiring and accessioning born-digital archival material.

Glossary

BDAWG: Born Digital Archives Working Group

DPS: Digital Preservation System, a local implementation of Preservica

YUL: Yale University Libraries

FTK: Forensic Toolkit

Disk Image: an exact copy of the disk containing the folder structure, orphaned and deleted (but not overwritten) files, and all associated metadata

Notes

[1] Yale Born Digital Archives Working Group. (2016). “Born Digital @ Yale: Home.” Yale University Library Research Guides. Retrieved from: https://guides.library.yale.edu/borndigital

[2] Yale Information Technology Services. (2018). “Storage Options: Yale ITS.” Retrieved from: https://its.yale.edu/services/storage-and-servers/server-hosting-and-administration/server-management/storage-options

[3] More information about disk numbering procedures and testing is available at http://campuspress.yale.edu/borndigital/2016/10/06/a-rose-by-any-other-naming-convention/

[4] Ipmsg.org. (2018). “FastCopy.” Retrieved from: https://ipmsg.org/tools/fastcopy.html.en

[5] Exactaudiocopy.de. (2018). “Extraction Technology » Exact Audio Copy.” Retrieved from: http://www.exactaudiocopy.de/en/index.php/overview/basic-technology/extraction-technology/

[6] More information about the testing process for CD-DAs is available at http://campuspress.yale.edu/borndigital/2016/12/20/to-image-or-copy-the-compact-disc-digital-audio-dilemma/

[7] Yale Born Digital Archives Working Group. (2015). “Born Digital @ Yale: Digital Accessioning Service.” Yale University Library Research Guides. Retrieved from: https://guides.library.yale.edu/c.php?g=300384&p=3593184

About the author

Alice Sara Prael is the Digital Accessioning Archivist for Yale Special Collections at the Beinecke Rare Book and Manuscript Library.  Previously, Alice was a National Digital Stewardship Resident at the John F. Kennedy Presidential Library, where she assessed digital preservation solutions.
