Issue 25, 2014-07-21

Within Limits: mass-digitization from scratch

The provincial library of West-Vlaanderen (Belgium) is digitizing a large part of its iconographic collection. For various technical and financial reasons, no specialist software was used. FastScan is a set of VBS scripts developed by the author using off-the-shelf software that was either included in MS Windows (XP & 7) or already installed (ImageMagick, IrfanView, LittleCMS, exiv2). This scripting package has sped up the digitization effort immensely. The article shows what software was used, the problems that occurred and how the pieces were scripted together.

By Pieter De Praetere

Introduction

FastScan is a set of Visual Basic Scripts written to ease the digitization efforts of the Provincial Library Tolhuis (Province of West-Vlaanderen, Belgium). Its main goal was, and to an extent still is, to increase scanning throughput, reduce manual intervention, and safeguard the quality of the resulting digital files. As a side effect, it was also hoped that it would increase the morale of the scanning operators by allowing them to do more in less time, making the work less tedious. It was impossible to create an entire scanning application that could do both the scanning and the processing of the items, so we decided to use available applications wherever possible and use a scripting language to bolt them all together. As there is no budget, all applications needed to be freely available. Furthermore, IT policy forbids the installation of additional software, so they had to be already installed or usable without installation. The first version of the script was written in MS Batch; it lessened manual intervention and increased throughput by automatically numbering and scanning items, while facilitating cropping. This script, however, proved very unwieldy and difficult to maintain. To address that problem it was rewritten in Visual Basic Script, a scripting language installed by default on Microsoft Windows computers, which offered additional features. Afterwards, more features (such as auto-cropping and colour profiles) were added to the script, as well as support for more scanners and types of material. While generally successful, the script still has some shortcomings. This article, written by the author of FastScan, attempts to describe the application and why and how it was developed, as well as show some remaining issues, with possible solutions.

Digitization process

The Provincial Library “Tolhuis” is a relatively small governmental library located in Bruges. This institution houses three major collections: a welfare section, a local history section (by far the largest) and an iconographic collection consisting of local (West-Vlaanderen) items. The iconographic collection mostly dates from before the 1950s, with a large number of items pre-dating WWI. The types of material in this collection are diverse: postcards, photographs, glass negatives, porcelain cards, etc. While fairly well preserved, the collection has not been catalogued or described. A couple of years ago, the decision was made to start digitizing the entire collection, both to reduce handling of the items in the interest of preservation and to increase public exposure of the collection. The digitized items were to be properly described and put into an online collection management system (Memorix Maior, by Picturae)[1], with a public component at http://www.beeldbankwest-vlaanderen.be. At the same time, the items are cleaned (duplicates and items that should not be in the collection are removed), catalogued and numbered. So far (2014), only a small part of the entire collection (mostly postcards) has been treated in this manner, but the goal is to digitize the entire collection in this way.

The initial project was a fairly low-tech effort, using consumer scanners (one Canon CanoScan 3200F and one HP PhotoSmart 8200) and GIMP for scanning. Every step in the process was manual. Recently (2013), a new, more heritage-worthy scanner, a Microtek ScanMaker 9800XL, supporting more options, was added to the mix. While the other scanners are consumer devices, they do support the basics of heritage scanning, e.g. optical resolutions of 300 ppi and above and exporting to TIFF. Before FastScan was created, about 10,000 images were digitized in various projects, though not all of them were cleaned, numbered or described. Most of the items that have been digitized are postcards. The rationale for this is relatively straightforward: a large part of the collection consists of items concerning the First World War, and digitizing this part allowed the library to participate in World War One remembrance projects. Nevertheless, other items (photographs) are routinely digitized, in much the same manner.

Digitization is carried out by regular staff in addition to their other duties. There is no full-time digitization expert or scanner operator, and those who scan did not receive formal training in digitizing heritage materials. Some items are sent to specialized firms for digitization, mostly large items the scanners cannot accommodate. Thanks to the addition of a new scanner last year, transparent source material (negatives, including glass negatives) is now digitized in house. The quality of the digitization is adequate, with sufficient ppi and colour depth. Items are stored in TIFF format on a central server, with viewing copies in JPG uploaded to the online collection management system.

Issues

The initial project did not comply with international digitization standards [2]. It did, however, meet several of the criteria, such as a minimum sampling rate of 300 ppi, colour scanning (notwithstanding the fact that the source material is mostly black and white), a bit depth of 8 bits per channel and the use of TIFF (uncompressed baseline TIFF v6.0) as the storage format, but total compliance was not achieved. Scanning equipment was uncalibrated, as were most computers, and colour profiles were not used. While this produced (at first glance) no visible errors in the resulting digital images, the potential for corruption of the colour information is quite large.

Another major issue was the speed at which digitization proceeded. All work was done manually, using image editing software, mostly GIMP or the proprietary software that came with the scanner. Any alterations (especially cropping and rotating) were done afterwards with GIMP. To store the image, the operator had to copy the number written on the item into the “Save As” box, taking care not to confuse numbers or to miss any of the zeroes with which the base number was padded. Apart from being slow, this also caused repeated errors in the file names, which had to be rectified afterwards.

The slow digitization speed and the large number of errors resulted in a huge backlog, with only a limited amount of material digitized. Furthermore, staff members executing the digitization became demotivated by the large amount of work involved, resulting in a fairly small number of items (20–25) digitized each day. The looming commemoration of the First World War added to the pressure, as the goal was to digitize the WWI postcard collection before the start of the remembrance year (summer 2014).

Priorities for improvement

Before improvements could be made, priorities had to be established. Based on “best practices” [3], compliance with international standards should have been the preferred way forward. However, the library chose to make an increase in scanning speed the number one priority. This choice was based on several arguments. Firstly, while we did not comply with the complete package, some minimal requirements (sample rate, file format, bit depth) were met. Additionally, there had been no reports of errors due to the lack of calibration or colour profiles. This made adherence to the entire standard a much less pressing issue, although compliance certainly remains a goal, especially since one of the aims of digitizing the collection is to present the digital item as a surrogate of the physical one wherever possible. Adherence to international standards is a must for this type of digitization project. But on the whole, the situation was not as dire as we first thought.

Improvements in scanning speed were more attractive for a number of reasons. Going faster, with automation, would reduce the time staff had to spend on tedious manual tasks, improving morale and limiting the possibility of human error. An increase in speed would also allow us to scan our entire WWI postcard collection before the summer of 2014, so we could jump on the remembrance train. If we were going to automate the process, it also seemed superfluous to put a lot of effort into calibrating systems and enforcing colour profiles under the original process, as we would probably have to do it all over again once the new system was in place. With a solid automated framework in place, adding the elements that would increase compliance was expected to be easier. For those reasons, we chose to address scanning throughput first and compliance second.

Possible solutions

Existing alternatives were evaluated before choosing to create our own system. However, several constraints severely limited the available options. Firstly, for policy reasons the system could not require (complicated) installation procedures. The computers to which the scanners are attached are used by different users, and installing and configuring the software for all of them would be prohibitively time-consuming. Furthermore, installation could not require administrative privileges, as those would require another layer of bureaucratic involvement. The variety in operating systems (Windows XP and Windows 7) and scanners had to be taken into account as well. Further constraints were added by the fact that there is no budget for these kinds of applications, so prospective software would have to be free (gratis).

However, the most important requirements for the system were (and are) ease of use and ease of customisation. Most of the staff operating the system are not “digital natives” and shouldn’t need to be. The system thus had to be easily usable by people that are less technically inclined. The ease of customisation was equally important. No two digitization projects are the same. While most of the basics (e.g., storage file format, optical resolution) will indeed be shared by different projects, other elements (file name structure, folder structure, scanning systems) will usually not be the same; the system had to allow for this level of customisation. To summarise, a prospective system needed to be easily configurable (and preferably not need administrator privileges to install), allow for a diversity in systems and scanners, be easy to use and be easy to customise to the situation of the Provincial Library.

Finding a system answering all those needs is extremely difficult. Some projects advocate the use of Adobe Photoshop [4], but there are some major problems with it: Photoshop supports TWAIN technology [5], but has no support for batch scanning. Another issue with Adobe was cost, as there was no budget for extra applications. GIMP, a free (GPL) image editor with features comparable to those of Photoshop, had the same issues as the Adobe application: like Photoshop, it allows for TWAIN-based scanning, but not for batch scanning. Research also indicated that scripting GIMP to allow for batch scanning was not feasible.

Another, more basic image editor, IrfanView, is pre-installed on all systems in use in the library. This program allows for scanning using TWAIN and has a batch scan mode [6]. This batch scan mode is very difficult to use together with a script (one that would, for example, take care of giving the correct name to scans), so in effect only the single-scan mode could be used. The GUI version was also not considered adequate as an image editor, especially when compared to GIMP or Photoshop.

While other, more comprehensive options were certainly available, budget and time constraints forced the use of several free or pre-installed applications, tied together with a set of scripts that work around several limitations in the original software. This is not an ideal option, as it requires knowledge of a scripting language and takes a lot of time, but it results in a script created with all of the requirements of the Library in mind, something that would be very hard to achieve with off-the-shelf software. Developing internally was also a requirement, as there was no budget for outside contractors.

FastScan

FastScan has three major components: IrfanView, ImageMagick, and a catalogue number generator. Only the last part is home-grown; the first two are off-the-shelf systems. IrfanView in command-line mode is used to acquire images from the scanner using a TWAIN interface. TWAIN is an API that forms a generic interface between scanner drivers and applications [7]. While IrfanView can be used as an interface between TWAIN and another application (such as this script), it is not meant to do this. It would probably be leaner to use an application that functions only as a scriptable TWAIN front-end rather than as a complete, if basic, image editor. Using IrfanView has its advantages, however. It is installed on every system in the library, so it requires no additional installation. Furthermore, due to its wide installation base, users can be expected to be familiar with the program, which is important when things go wrong. IrfanView also fulfills all our needs (i.e., a scriptable TWAIN system), so there is no pressing need to change it. In short, while not perfect (it is also not free software), IrfanView is more than adequate for use in the system.

IrfanView is called in batch mode with the /scanhidden, /dpi=(300,300) and /convert=filename options. The /convert option is used to store a scan of the entire surface of the scanner as an uncompressed TIFF file. This is considered the “raw” file, which is later cropped to the size of the object. The /dpi=(300,300) option sets the ppi of the scan. This value is for now hard-coded, but in a future version it will be specific to the type of object being scanned, so it can be set higher for smaller objects. /scanhidden hides the TWAIN GUI, according to the IrfanView manual [8], which is essentially the GUI of the software bundled with the scanner. This is hidden for two reasons. Firstly, it opens up the potential for user-induced errors, as options can be changed, thus potentially corrupting the scan (lowering ppi, scanning only part of the image, etc.). We had to configure every driver before using FastScan, setting the scanned area and ppi to their maximum, because otherwise IrfanView would scan only what the bundled software allowed (e.g. only half of the surface at 150 ppi). The second reason is to lessen confusion. FastScan is a CLI application with no GUI elements. To have this one dialog pop up suddenly, requiring a manual click of the “scan” button, would introduce an unnecessary step. In order to lessen the potential for errors and to reduce the number of steps, we chose to hide this dialog.
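Pieced together from the options listed above, the IrfanView call might look like the following sketch (the executable path and file locations are assumptions for illustration, not FastScan's literal values):

```vbscript
' Acquire a raw, full-surface scan through IrfanView's command line.
' The path to i_view32.exe and the output location are assumed.
Dim sh, rawFile, cmd
Set sh = CreateObject("WScript.Shell")
rawFile = "C:\scans\raw\current.tif"
cmd = """C:\Program Files\IrfanView\i_view32.exe""" & _
      " /scanhidden /dpi=(300,300) /convert=" & rawFile
' Window style 0 keeps it hidden; True waits for the scan to finish
sh.Run cmd, 0, True
```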

The second component is ImageMagick, or IM for short. IM is a free image editing and viewing suite, released under the Apache 2.0 licence [9]. IM is used in two ways in FastScan: as an image cropper and as an extractor of metadata. The first function is by far the most important, as it removes the need for manual intervention while scanning. IrfanView creates an image of the entire surface of the scanner, after which ImageMagick cuts the object out. This allows for “blind” scanning, i.e. without intervention from the operator, as long as the background colour contrasts enough with the colour of the object (e.g. a black background with a light scan or vice versa).

To do this, IM guesses (using blurring techniques and fuzziness) the contours of the object and outputs its location. The fuzz factor is the most important parameter here, as it determines how IM distinguishes item from background. A large fuzz factor means a lot of detail is blurred (and thus ignored), potentially causing parts of the object to be considered background. A small fuzz factor, on the contrary, takes more detail into account, resulting in a larger share of the background being considered part of the object. With a high-contrast background (e.g. black on white), a factor of 15% was found to be adequate, producing only a small number of images that are too large or too small. Another call to ImageMagick then crops the entire scan down to the object and stores it as a new file. We keep the original full-surface scan around until the end of the program in case something goes wrong. Note that because this is an XP system, the ImageMagick convert.exe program has been renamed to im_convert.exe so as not to inadvertently call the Windows application convert.exe [10].
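The cropping step can be sketched with ImageMagick's -fuzz and -trim options (the exact arguments FastScan uses are not shown in the article, so this is an approximation; the paths and output name are assumptions):

```vbscript
' Crop the raw scan down to the object. -trim removes the background,
' -fuzz 15% is the factor discussed above, and +repage resets the
' canvas offset left behind by the trim.
Dim sh, cmd
Set sh = CreateObject("WScript.Shell")
cmd = "im_convert.exe C:\scans\raw\current.tif" & _
      " -fuzz 15% -trim +repage" & _
      " C:\scans\cropped\PKT000001.tif"
sh.Run cmd, 0, True
```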

The error rate for this component is thus fairly low, although it could potentially generate a lot of errors, especially when the contrast between the background and the item is not very large. To date, however, only a small number of images have been rejected because IM cropped too much. This is mostly because the items the library is scanning now have a white or black border around them (postcards, photographs, in memoriam cards), making it easy to select a high-contrast background (fully white or black). The fuzz factor used is quite conservative as well, which resulted in a larger (but still fairly small) group of scans that had to be re-cropped because IM cut out too large an area, mostly due to dust or scratches on the scanner surface. This is considered less of a problem, since it can be rectified easily.

Using IM to crop the image has been a huge time-saver. Before the implementation of IM, a lot of the time spent on scanning was devoted to cropping the images. Now this can be done as part of the process with an error margin in some cases lower than when done by hand. It did require some adaptations from the operators, such as using (and switching) a high-contrast background (in the Library we used sheets of thick black paper slightly larger than the surface of the scanner with great success) and making sure the object was straight when laid on the scanner surface the first time around (in order to save time later and lessen the probability of IM errors). Previously, they could adapt and change the rotation of the image while scanning, as the resulting image appeared on their screen before saving. Because implementing IM decreased the scanning time, these small adaptations were not seen as a problem by the staff.

The third component is a custom catalogue number generator. Every object has a number, consisting of two elements. The first element is a prefix, denoting the type of material and consisting of three letters. The second element is a six-character number, with zero-padding. If both sides of an object (front and back) are to be digitized, a letter is added to the number. The letter A denotes the front side, the letter B the back side. The implementation of these catalogue rules created some difficulties in the program.

The first problem is the generation of the six-character number. In normal cases, items are digitized in order (e.g. 2 follows 1, which is followed by 3, 4, 5, etc.), so it is possible and even recommended (in order to save time) that the system automatically determines the catalogue number. This is, however, not as easy as it sounds. During the lifetime of the application, the number is stored in memory and auto-incremented, but this would mean that every time the script is started, the user has to enter the number of the first item. This was considered not very user-friendly, so the system stores the value in a text file. Because of this, it is vital that the application writes the last issued number to the file after every scan. This also prevents faults in the numbering when the application unexpectedly quits after a bug, as long as the fault occurs after scanning. This text file is tied to the user and the computer executing the script, and not centrally stored (e.g. on the shared drive). This was done for two reasons. Firstly, it allows more than one user to use the program at the same time (on different computers) without the need for complicated locking procedures. Secondly, if more than one user is using FastScan, they can continue scanning from where they left off without readjusting the number every time, as the file only remembers “their” number. Nevertheless, this component is not without faults, so before every scan the user is asked whether the generated number matches the one on the item, and to change it accordingly. At present there is no automated way to prevent duplication between users. However, this is not a major problem, as all objects are numbered before they are scanned. The only thing operators have to take care of is indicating on the items whether they have been scanned or not. In most cases, they simply take a batch of objects out of storage and return them when finished, taking care to indicate that they have been scanned.
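The counter logic described above could be sketched as follows (the file name lastnumber.txt and the prompt text are illustrative assumptions):

```vbscript
' Persist the last issued number in a per-user text file so the
' counter survives restarts and crashes.
Const ForReading = 1, ForWriting = 2
Dim fso, f, last
Set fso = CreateObject("Scripting.FileSystemObject")

If fso.FileExists("lastnumber.txt") Then
    Set f = fso.OpenTextFile("lastnumber.txt", ForReading)
    last = CInt(f.ReadLine)
    f.Close
Else
    ' First run on this machine: ask the operator once
    last = CInt(InputBox("Number of the first item?"))
End If

' ... the scan happens here ...

' Write the number back immediately after every scan
Set f = fso.OpenTextFile("lastnumber.txt", ForWriting, True)
f.WriteLine last
f.Close
```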

The second problem is prefix generation. Having a prefix denoting the type of material is less than ideal, as it invariably leads to drawn-out discussions and confusion about which item belongs in which category. Nevertheless, the system is used, so the script must know how to work with it. This proved to be quite easy, but the easy solution had some caveats. At first, we created a VBScript Dictionary with all the types and, based on CLI input (cscript fastscan-3.0.vbs type_of_material), selected the right prefix and continued. This approach has one huge drawback: to add a new type, someone has to code it in. This is not a tenable solution, as there are currently no other programmers in the library. To rectify this, the dictionary was moved to an XML file, which contains not only the prefix, but also the length of the number after the prefix (zero-padding is still the default, no matter the length) and the increment (default is 1). Together with some redefined code, this makes it easy to add material types to the application.

<list>
	<material name="postkaart">
		<name>postkaart</name>
		<key>
			<prefix>PKT</prefix>
			<length>6</length>
			<step>1</step>
		</key>
	</material>
</list>
' Prefix determination
prefix = mXML.selectSingleNode ("/list/material[@name='" & Wscript.Arguments(0) & "']/key/prefix").Text
' Key length
k_length = mXML.selectSingleNode ("/list/material[@name='" & Wscript.Arguments(0) & "']/key/length").Text
' Key step
k_step = mXML.selectSingleNode ("/list/material[@name='" & Wscript.Arguments(0) & "']/key/step").Text

Tying the CLI option type_of_material to the correct entry in the XML file is done with XPath. This proved to be easier than expected and has thus far not resulted in major issues.

While using an XML file solves some issues, others remain unsolved. Some types have customisations in the code (e.g., different scanning or cropping parameters). Adding these to the script still requires some programming. At the moment, there are no plans to move those to a configuration file as well, as this would entail a lot of extra work and overhead for relatively small gains.

The third (and final) problem faced is the scanning of both sides of an object. This does not happen all the time (roughly 20% of objects have an interesting enough backside to be selected for scanning), but still must be supported. The established procedure is to add A or B to the inventory number to differentiate between the front and the back of an object. This is supported by the application, but only after manual user intervention (there is no way it can guess whether or not the item will have a back side), except for certain types of material (this is one of the customisations hard coded in the application).

Supporting this required some modification to the flow of the application. Normally, every time the application loop restarts (the loop continues to run until the user quits), the inventory number is incremented by k_step. If the user answers yes to the prompt “Does this item have a backside?”, this incrementing should not happen the next time the loop runs, otherwise the inventory number would be wrong (front and backside of the item have the same inventory number). To prevent this, after answering yes the flag backside is set to 1. The next time the loop runs, backside will evaluate as 1, so the system knows not to increment the inventory number.

if backside <> 1 then
	' Not a backside
	' Reset brun
	brun = 0
	' New number
	number = last + cInt (k_step)
	...
else
	' Is a backside
	brun = 1 ' So we know when to reset backside
	number = last
end if
...
if backside = 1 and brun = 0 then
	' A side
	filename = prefix & pad (number, k_length) & "A.tif"
elseif backside = 1 and brun = 1 then
	' B side
	filename = prefix & pad (number, k_length) & "B.tif"
else
	' Normal case
	filename = prefix & pad (number, k_length) & ".tif"
end if
...
' Reset counters
if brun = 1 then
	backside = 0
end if

In order to determine whether the letter should be A or B, and when to reset backside to 0, another flag is used: brun. This flag is 0 during the run in which the question “Does this item have a backside?” is answered with yes. This signals the application that this is the first run, so the letter A must be used. The next loop, when backside = 1 evaluates to true, brun is set to 1. This means we are now in the second run, so the letter must be B. Furthermore, at the end of the loop, the flag brun is checked again. If it equals 1, backside is reset to 0, because both sides have now been scanned. Afterwards, the application continues normally and will again prompt the user whether the next item has a backside (this test is skipped for obvious reasons when backside equals 1). While seemingly complicated, this system has not caused any problems.
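The pad helper called in the listing above is not part of the excerpt; a minimal implementation of the zero-padding it performs could look like this:

```vbscript
' Zero-pad a number to the key length read from material.xml,
' e.g. pad(42, 6) gives "000042".
Function pad(num, length)
    pad = Right(String(length, "0") & CStr(num), length)
End Function
```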

The components are tied together with Visual Basic Script, a scripting language created by Microsoft and installed by default on Windows XP and Windows 7 systems, the two operating systems used at the library. An earlier version used Microsoft batch files, but this was very unwieldy and hard to maintain, so the entire script was rewritten in VBScript. All things considered, the language is adequate, in some areas even excellent, especially its XML bindings. Given the choice, more features could probably be added more easily using Python or Perl, such as the ability to use the TWAIN API from within the script rather than through other software [11], but due to the ban on installing software, this is not possible. VBScript is also not very difficult to learn (if harder to master), so other people could, with a small amount of training, take over maintenance, which is important with regard to the future of the system within the library.

FastScan has been designed from the ground up to reduce the workload for the operators. Most steps are automatic, and for those that aren’t, a wizard-like interface is used. In the ideal case, the operator starts the application and answers the first set of questions to determine the number of the first object. The item is then placed on the scanner and is scanned, cropped and saved. No intervention is necessary. After this step, the cycle repeats. When the application has been used before, or a new cycle is started, the number of the first item (and all following numbers) is determined automatically by incrementing the previous number by 1. When the situation is not ideal, e.g. when the number is wrong, it can be corrected using the wizard. In short, user intervention is limited to pressing the Enter key a few times when everything goes according to plan, and to entering some numbers when there is a problem.

Added value

FastScan as described above is more than adequate for digitizing heritage material. However, it still lacks elements that would allow for more compliance with established heritage standards. The first element is support for embedded colour profiles. Colour (ICC) profiles are the preferred way of sharing colour information between systems. In the base system, this is not supported. Supporting it would entail calibrating the scanning equipment, installing colour profiles and finding a way to embed them in the images. As this was not possible with the described tools, new ones had to be found. The second element is metadata support. Metadata is here to be understood as technical metadata, not descriptive metadata. The TIFF file format supports embedded metadata, thus creating a durable link between data and metadata. As with colour profiles, embedding required extra software, although IM could be used to extract most of it. Both elements were researched and provisionally implemented, but only the module that adds colour profiles is currently used. Adding metadata is possible, but slows down scanning immensely. I will only briefly describe how we added colour profiles; the script for adding metadata will not be described, but is available on GitHub.

Adding an ICC profile is supported in IrfanView through plug-ins. However, due to the set-up at the Library, it is impossible to install those plug-ins, so we had to find another solution. We stumbled upon the free (GPL) colour management system Argyll CMS. It is attractive to us for two reasons. Firstly, it can be used to calibrate scanners (note that a calibration target is still required) and create a valid ICC colour profile. Secondly, the tifficc utility (from the LittleCMS package, which was already installed) allows embedding colour profiles in TIFF files. Furthermore, this software is easily scriptable, so it could be integrated in FastScan. To use it within FastScan, we first had to create ICC colour profiles for the scanners and store them in a central location. Using an XML file containing the name of each scanner and the computer it is attached to (to customise certain TWAIN options), we linked them together and created a script that calls tifficc to embed the profile in the image. The resulting images are stored in the edited images sub-folder of FastScan. Embedding is done after cropping, to prevent the profile from being stripped by ImageMagick. While supporting colour profiles had no immediate impact on the work flow or the (perceivable) quality of the scans, it does mean increased compliance with international standards and safeguards the quality of the items.
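A tifficc call for the embedding step might look like the following sketch (the flags shown embed the given destination profile; the profile location and file names are assumptions):

```vbscript
' Embed the scanner's ICC profile in the cropped TIFF.
' tifficc writes a new file, so embedding must happen after
' ImageMagick is done with the image.
Dim sh, cmd
Set sh = CreateObject("WScript.Shell")
cmd = "tifficc.exe -e -o \\server\profiles\scanmaker9800xl.icc" & _
      " C:\scans\cropped\PKT000001.tif" & _
      " C:\scans\edited\PKT000001.tif"
sh.Run cmd, 0, True
```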

FastScan Issues

FastScan as an application is not without issues. Three of the most important problems are the dependence on external software and not software libraries; the lack of portability due to the hard coding of some variables and options; and the possible cessation of maintenance and support when the contract of the main author ends. Two of those issues are related to the way FastScan works, while the third is a policy problem, for which other solutions may be needed.

Depending on external software is a necessity because of the programming language. While some libraries for VBScript exist, they are internal, and there is no library repository for VBScript such as CPAN for Perl. External dependence is also necessary because of the ban on installing software, which forces the use of either pre-installed applications or programs that require no installation. FastScan makes liberal use of both. To reduce this dependence, a major rewrite in a different programming language, either a scripting language such as Perl or Python or a compiled high-level language like C++, seems to be the only solution. This is, however, not likely for two reasons. Firstly, it would take huge amounts of time for (in some cases) only marginal benefits. Secondly, while a Perl or Python rewrite would be easier, it would entail installing interpreters on all systems where FastScan is used. It is unlikely that this would be permitted within the library, and it would add a large amount of extra work. Nevertheless, for future use, a rewrite (in a scripting language) remains a distinct possibility.

The second problem is the reliance on hard-coded variables. This is quite an old problem; in the first versions almost everything was hard coded. In version 3.0, reliance on these variables has been reduced and as much as possible has been moved to either a configuration file (for paths to external applications) or XML files containing "local" information specific to the institution and to the systems where FastScan is used. The file scanners.xml stores the name, type, colour profile and host computer of each scanner; FastScan uses this information to set certain options and embed the right profile. The second file, material.xml, sets options per scanned item type, such as the prefix, key length and number increment. While this is already an improvement over the previous situation, some extra settings should also be stored in these files, such as the last used number for every item type and the cropping settings (especially the fuzz factor). Doing so would make it possible to add new types without changing the source code, which is important for future maintainability.
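To make the division of settings concrete, a hypothetical sketch of what the two files might look like follows. The element and attribute names are illustrative, not FastScan's actual schema:

```xml
<!-- scanners.xml: one entry per scanner, keyed on the host computer -->
<scanners>
  <scanner name="EPSON GT-20000" type="flatbed"
           computer="SCAN-PC-01"
           profile="C:\FastScan\profiles\gt20000.icc"/>
</scanners>

<!-- material.xml: numbering options per item type -->
<materials>
  <material type="postcard" prefix="PK" keylength="6" increment="1"/>
  <material type="poster"   prefix="AF" keylength="5" increment="1"/>
</materials>
```

Storing the last used number and the cropping fuzz factor as extra attributes in entries like these would let new item types be added by editing the file alone.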

Another consequence of the reliance on hard-coded variables and, more generally, of the script being written specifically for the library, is the difficulty of porting it to other environments. It is certainly possible in theory, but in practice much of the logic would have to be changed. The settings for the external applications can probably remain the same, but the number generator, including the front-side/back-side switch, would have to be rewritten for the new environment. This is possible, as the source code is freely available, but certainly not easy; it might make more sense for institutions to simply create their own script, with their own quirks and settings. Other parts of the script present fewer problems. Reliance on external applications requires installing them, which is usually unproblematic, but certain components (IrfanView) are not Free Software, so their continued availability is not guaranteed. To sum up, while porting is possible, it will usually be easier to write your own script, perhaps based on the source code of FastScan.
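The number generator mentioned above is, in essence, a zero-padded counter with a two-sided toggle. A minimal sketch, in Python for brevity (the real script is VBScript, and the `_r`/`_v` recto/verso suffixes are assumptions, not FastScan's actual naming):

```python
def numbered_filenames(prefix, start, keylength, two_sided=True):
    """Yield object numbers: prefix plus a zero-padded counter.

    For two-sided items (e.g. postcards) the same number is yielded twice,
    once for the front (hypothetical suffix _r) and once for the back (_v);
    this is the "front side/backside switch" that would need adapting when
    porting the script to another institution's numbering scheme.
    """
    n = start
    while True:
        base = prefix + str(n).zfill(keylength)
        if two_sided:
            yield base + "_r"
            yield base + "_v"
        else:
            yield base
        n += 1
```

Because the scheme (prefix, padding, suffixes) is institution-specific, a port would mostly mean replacing this one routine rather than the whole script.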

The third issue is the uncertain status of future support and maintenance. While the current author is employed by the library, his contract will soon end and at the moment there is no replacement. Using the application does not require specialist knowledge, nor does changing configuration values; fixing bugs and adding new options, however, requires knowledge of VBScript and of the options and settings of the various external programs. Some of these concerns are mitigated by the fact that FastScan is Open Source (GPL 3), so the source code is available for all to see. This might result in another maintainer taking over, or in the original author continuing development, but this is by no means certain. Furthermore, technical documentation of the software is still lacking (a user manual is available, a developer manual is not), although work on this is ongoing. It remains possible to scan without FastScan, but this would mean returning to the previous manual system. While not impossible, that would certainly not be an ideal situation.

Concluding remarks

All things considered, FastScan has been a success. Essentially, the script does nothing more than tie together several open source (or freeware) components to support the workflow in the Library. It is not inconceivable that other, more expensive and specialised scanning software would be capable of doing the same, but due to the limited funds available and the difficulties with installing software, such packages cannot easily be implemented library-wide. FastScan is thus a “compromise solution”, but one that is tailored to the library, supporting the entire workflow and all its quirks. Such a level of customisation would be very difficult to achieve with off-the-shelf software. This is both its major advantage and its major weakness. As long as everything stays the same, FastScan is a great piece of software; when things change (as they inevitably will), adapting the script will require someone with knowledge of VBScript and of the applications used. A major rewrite is under way to address some of these issues by moving more settings into configuration files. Nevertheless, it is not impossible that FastScan will one day no longer be used because it cannot be adapted. To assist future maintainers and to prevent a possible “black hole”, the script is released under the GPL v. 3.0 licence and the source code (of the script, not the additional software) is available on GitHub:
https://github.com/pieterdp/sobki/tree/master/FastScan.

While it is in use, its advantages are clear. Items are scanned faster and, arguably, better than before. The most popular feature is the auto-number generator, as numbering is the most labour-intensive and tedious part of the job. Auto-cropping and auto-scanning further reduce the workload, increasing throughput. Other additions, such as metadata and colour profiles, do not increase the scanning rate but improve the quality of the scans. This was not the original goal, but that does not diminish its importance: with future reuse and durable storage in mind, compliance with international standards is a must. Together with the increased throughput, this makes FastScan an important part of the digitization process at the library.


About the Author

Pieter De Praetere is an employee of the Provincial Library “Tolhuis” (West-Vlaanderen), where his main occupation is the maintenance and technical support of the provincial Image Library (http://www.beeldbankwest-vlaanderen.be). He is also responsible for the management of the digitization project. He holds a Master’s degree in Archival Sciences and one in History, and has a great interest in programming and scripting. E-mail: pieter.de.praetere@helptux.be URL: http://erfgoeddb.helptux.be
