By Ilana Kingsley and Mark Morlino
The Rasmuson Library DVD Movie Browser was developed because our library patrons were unhappy with the search function of our SIRSI/Dynix library catalog. Patrons told us they wanted to be able to browse movie covers and titles, and the library catalog did not meet their needs.
The DVD Browser is a simple application that screen scrapes the Rasmuson Library catalog for DVD movies and dumps the data into a Drupal MySQL database. Drupal is a content management system (CMS) that runs on PHP and MySQL. The DVD Browser application could have been built without Drupal, but because our Library Web site runs on Drupal and Drupal has many useful features, such as RSS for new items, tagging, and user comments, we decided to store the data within the Drupal CMS.
Our first version of the DVD Browser ran on Drupal 4.7 and used PHP to screen scrape the catalog. The jump from 4.7 to 6.5 was big, and Drupal changed significantly in between; in particular, custom content templates are now built with the CCK module instead of the Flexinode module. We found that the Flexinode to CCK converter did not work well, so the easiest way to port the DVD Browser to Drupal 6.5 was to re-create the entire system. The script was rewritten in Perl.
This paper describes the process of setting up the DVD Browser. The screen scraping script provided may be repurposed for your needs. If your library uses a different library catalog vendor, the concepts presented below should work with minor modifications to the code.
What the Application Does
- A Perl script screen scrapes the library catalog for new DVDs.
- The script temporarily stores the data in a text file and gathers additional information about the movie, such as the movie cover and genre, from other Web sites.
- The data is dumped into a Drupal MySQL database.
- The end user is able to browse and search for DVDs.
Requirements
- Drupal 6.X
- Drupal Modules: CCK, FileField, ImageField, ImageAPI, Link
- PHP 5.2 (required for ImageAPI)
- GD (configured to support JPG) or ImageMagick
The Rasmuson Library DVD Browser uses Drupal 6.5 for its backend. Besides the core Drupal modules, which are automatically added when installing Drupal, we installed the following modules.
- CCK (Content Construction Kit) module and its child modules:
The DVD Browser requires configuration of five areas within Drupal. They are content types, taxonomy, views, blocks, and theme templates. Each section is described below.
Content Type Creation
Drupal Content Types are used to represent a specific type of content and provide a content-input template so that content providers can add, edit, and delete data without knowing how to program or code. Drupal automatically installs two basic Content Types, Page and Story. In order to modify these existing Content Types or create your own, the CCK module must be installed.
We created a Content Type named “Movie,” which contains the following settings and fields:
|Name: Movie||The human-readable name for this content type. The name is required and must be unique.|
|Type: movie||The machine-readable name for this content type. The name is required and must be unique.|
|Description: A DVD for the DVD browser.||The description is not required. It is used to help content providers select an appropriate template.|
|Submission form settings|
|Title field label: Title||The default name for this required field is Title.|
|Body field label:||The default name for this field is Body; however, when creating a custom Content Type, it’s often best to omit this field by leaving it blank.|
|The default options for Published and Promoted to Front Page are checked. We want to keep the Published option checked, so that when a new movie is added to the database it’s automatically published. We’ve deselected the Promoted to Front Page option; however, since we’re using the Views module to present data, checking or unchecking this option has no visible effect to the end-user.|
|Taxonomy||Taxonomy module form|
|Menu settings||Menu module form|
|Title||Node module form||The main title of the DVD. Note that the Title field is required and is not stored in Drupal’s Content Type table (e.g., content_type_movie). It is stored in the node table.|
|Alternative Title||field_dvdb_alt_title||Text||If the title is not in English, then the English title of the movie, if known.|
|Sort Title||field_dvdb_sort_title||Text||The title of the movie without words like “A”, “The”, “Le”, “La.” This field is used for sorting the movies.|
|Record ID||field_dvdb_record_id||Text||The unique ID number associated with the item in the library catalog.|
|Call Number||field_dvdb_call_number||Text||The call number of the DVD.|
|Short Summary||field_dvdb_short_summary||Text||Information about the movie, taken from the screen scrape.|
|Long Summary||field_dvdb_long_summary||Text||Information about the movie, taken from the screen scrape.|
|IMDB Code||field_dvdb_imdb_code||Text||If applicable, the IMDB number for the movie.|
|RT Code||field_dvdb_rotten_tomatoes_code||Text||If applicable, the Rotten Tomatoes number for the movie.|
|Cover||field_dvdb_cover||Image||A thumbnail of the DVD cover.|
Note: The Taxonomy, Menu settings, and Title items were created automatically by Drupal. The remaining fields are ones that we created.
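The Sort Title field described in the table above can be derived from the Title by dropping a leading article. The following is a minimal sketch in Python, not the authors' Perl code; the function name `make_sort_title` is hypothetical, and the article list beyond “A”, “The”, “Le”, and “La” is an assumption.

```python
import re

# Leading articles to drop. The text names "A", "The", "Le", and "La";
# the other entries here are assumptions added for illustration.
LEADING_ARTICLES = ("a", "an", "the", "le", "la", "les")

# Match one article at the start of the title, followed by whitespace,
# so that words such as "Theodore" or "Labyrinth" are left alone.
_ARTICLE_RE = re.compile(
    r"^(?:%s)\s+" % "|".join(LEADING_ARTICLES), re.IGNORECASE
)


def make_sort_title(title):
    """Return the title without a single leading article, for sorting."""
    stripped = title.strip()
    candidate = _ARTICLE_RE.sub("", stripped, count=1)
    # Fall back to the original if nothing would be left.
    return candidate if candidate else stripped
```

Requiring whitespace after the article is the main design choice: it keeps a one-word title such as “A” intact and avoids trimming words that merely begin with an article.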
The DVD Browser’s Perl script grabs the genre classification of a movie from the Internet Movie Database (IMDB) and dumps this information into the Drupal database. In order for the script to input data into the database, we needed to create a vocabulary using Drupal’s Taxonomy module.
When setting up a Drupal Vocabulary, you can associate Content Types with a specific Vocabulary. We created a vocabulary named Genre and associated it with the Movie Content Type.
View and Block Creation
The Views module allows you to pull data from the Drupal database and output it in different ways. The DVD Browser uses Views to output movies to two sidebar Blocks (the Genre block and the Most Recent block) and to the #-Z header block.
Much documentation has been written about theming Drupal sites. Whether you use an out-of-the-box theme (e.g., Garland), a third-party theme (e.g., Newsflash), or create your own theme, you’ll probably need to customize the node.tpl.php template.
An out-of-the-box node.tpl.php template usually outputs the content of a node with the following statement:
<div class="content"> <?php print $content; ?> </div>
The statement doesn’t include <div /> tags that can be used for formatting content with CSS (Cascading Style Sheets). For example, the following content is displayed using the Zen out-of-the-box node template, node.tpl.php.
For the DVD Browser, we used the Zen starter theme as the basis of our own theme. We modified node.tpl.php to use <div /> tags for most of the Drupal database fields. Below is a screen shot of the same movie, but using the modified file.
The Screen Scrape
The DVD Browser gathers the Call Number, Title, Short Summary, and Long Summary by screen scraping the Rasmuson Library catalog. It then gathers additional information (the Long Summary, Genre, and DVD Cover) from the Internet Movie Database (IMDB), Rotten Tomatoes, and FreeCovers.net. This section describes what the Perl script is looking for when gathering data from the catalog and other Web pages.
Initial population of the database takes several days because the library owns a large number of videos and our catalog displays only ten items on a page. The Perl script calls the catalog URL, processes ten items, and begins again, starting where it left off. Each time the script is called, it scrapes the HTML source code of the catalog for information about newly added DVDs and stores the information in a .txt file on the server, for example DVD-22.txt.
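The "process ten items, then start again where it left off" pattern can be sketched as follows. This is an illustrative Python sketch rather than the authors' Perl script: the state file, the `next_offset` key, and the `fetch_page` and `handle_item` callables are all hypothetical, and the real catalog request is deliberately left out.

```python
import json
import os

PAGE_SIZE = 10  # the catalog displays ten items per page


def load_offset(state_path):
    """Return the saved starting position, or 0 on the first run."""
    if not os.path.exists(state_path):
        return 0
    with open(state_path, "r", encoding="utf-8") as fh:
        return json.load(fh)["next_offset"]


def save_offset(state_path, offset):
    """Remember where the next run should start."""
    with open(state_path, "w", encoding="utf-8") as fh:
        json.dump({"next_offset": offset}, fh)


def run_once(fetch_page, handle_item, state_path):
    """Process one page of results and record the resume point.

    fetch_page(offset) is a hypothetical callable that returns a list of
    up to PAGE_SIZE items starting at the given offset; handle_item(item)
    does whatever per-item work is needed. Returns the number of items
    processed, which is 0 once the end of the results is reached.
    """
    offset = load_offset(state_path)
    items = fetch_page(offset)
    for item in items:
        handle_item(item)
    save_offset(state_path, offset + len(items))
    return len(items)
```

Saving the offset only after a page has been handled means an interrupted run repeats at most one page instead of skipping items.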
The script looks at the source code for the case sensitive string Details.
<tr> <td class="itemlisting2" rowspan="2" width="10%"> <input value="Details" name="VIEW^1" id="VIEW1" class="itemdetails" type="submit"> </td> <td class="itemlisting2"> <a href="/uhtbin/cgisirsi.exe/RTbEt1KHPc/UAFRAS/204850051/20/DVD-22%20VIDEODISC/1/X1002522166/"> <!-- current hit; bold it --> <strong> DVD-22 VIDEODISC <!-- current hit; unbold it --> </strong> </a> <br>Billy's Hollywood screen kiss [videorecording] / Trimark Pictures presents a Revolutionary Eye production ; co-producers, Meredith Scott Lynn and Irene Turner ; produced by David Moseley ; written and directed by Tommy O'Haver. <br>&nbsp;O'Haver, Tommy. </td> <td class="defaultstyle" rowspan="2" width="100" align="left"> <img src="/WebCat_Images/English/Special/Link/SPACER.gif" alt="" width="100" border="0" height="1"> </td> </tr>
It then looks for the call number, which occurs three lines after the word Details. In the listing above, the call number is DVD-22. The script searches for the Title, which occurs just before the case sensitive string [videorecording]. The text after [videorecording] / (in this example, “Trimark Pictures presents….”) will be stored in the Drupal database in the Short Summary field. To obtain the Long Summary, the script calls the specific movie (e.g., http://goldmine.uaf.edu/uhtbin/cgisirsi.exe/x/UAFRAS/x/20/DVD-1/1/X1002522166) and looks for the case sensitive string Summary:.
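The extraction rules just described can be approximated with a single regular expression. The sketch below is in Python and is only an illustration under stated assumptions: it locates the call number by the `<strong>` element rather than by counting lines as the Perl script does, and the names `ITEM_RE`, `parse_item`, and `SAMPLE` are hypothetical. `SAMPLE` is an abbreviated copy of the catalog markup shown above.

```python
import re

# Removes HTML comments such as <!-- current hit; bold it -->.
COMMENT_RE = re.compile(r"<!--.*?-->", re.DOTALL)

# Matching is case sensitive by default, as the text requires.
ITEM_RE = re.compile(
    r'value="Details"'                     # marker for one result row
    r'.*?<strong>(?P<call>.*?)</strong>'   # call number sits inside <strong>
    r'.*?<br>\s*(?P<title>.*?)\s*'         # title precedes [videorecording]
    r'\[videorecording\]\s*/?\s*'
    r'(?P<summary>.*?)\s*<br>',            # text after the slash
    re.DOTALL,
)


def parse_item(html):
    """Return call number, title, and short summary, or None if no match."""
    match = ITEM_RE.search(html)
    if match is None:
        return None
    # "DVD-22 VIDEODISC" -> "DVD-22": keep the first token only.
    call_tokens = COMMENT_RE.sub("", match.group("call")).split()
    return {
        "call_number": call_tokens[0] if call_tokens else None,
        "title": match.group("title").strip(),
        "short_summary": match.group("summary").strip(),
    }


# Abbreviated copy of the catalog markup, used only for demonstration.
SAMPLE = (
    '<input value="Details" name="VIEW^1" type="submit"> </td> '
    '<td class="itemlisting2"> <a href="#"> <!-- current hit; bold it --> '
    '<strong> DVD-22 VIDEODISC <!-- current hit; unbold it --> </strong> </a> '
    "<br>Billy's Hollywood screen kiss [videorecording] / Trimark Pictures "
    "presents a Revolutionary Eye production. <br>&nbsp;O'Haver, Tommy. </td>"
)

record = parse_item(SAMPLE)
```

A pattern like this is tightly coupled to one vendor's markup; a catalog from a different vendor would need its own markers in place of Details and [videorecording].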
After getting the Title and Summary from the library catalog, the script attempts to gather the Genre, Long Summary, and Cover from the Internet Movie Database (IMDB), the Summary and Cover from Rotten Tomatoes, and the Cover from FreeCovers.net.
When the program gathers multiple summaries or multiple images for a single DVD, we apply an order of preference and insert only one summary and one image per movie into Drupal. The extras are saved to files in case we want to use them later.
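One way to express such an order of preference is a small selection function. This Python sketch is illustrative only: the article does not state the actual order, so `SUMMARY_PREFERENCE` and `COVER_PREFERENCE` are assumptions, as are the source keys and the function name `choose_preferred`.

```python
# Preference orders are assumptions; the article does not state them.
SUMMARY_PREFERENCE = ("catalog", "imdb", "rotten_tomatoes")
COVER_PREFERENCE = ("freecovers", "imdb", "rotten_tomatoes")


def choose_preferred(candidates, preference):
    """Pick one value by source preference and keep the rest as extras.

    candidates maps a source name to the value found there, or to None
    (or an empty string) when that source had nothing.
    Returns (chosen_source, chosen_value, extras), where extras maps the
    remaining sources to their values so they can be saved to files.
    """
    found = {src: val for src, val in candidates.items() if val}
    for source in preference:
        if source in found:
            extras = {s: v for s, v in found.items() if s != source}
            return source, found[source], extras
    # Nothing usable from any preferred source.
    return None, None, found
```

When nothing is chosen for the cover, the caller can fall back to a placeholder picture.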
Our library catalog runs on an Oracle database; however, due to vendor restrictions, instead of easily accessing and repurposing data stored in the Oracle database, we needed to find a work-around to create a useful DVD browsing tool for our patrons.
Feedback about the DVD Browser has been positive. It is a heavily used tool, and our patrons find it much easier to use than the library catalog.
Future plans for the DVD Browser include turning on comments and ratings, so that our patrons can add user-generated content.
Two files are available. The dvd_browser_screen_scraper.pl file remains under active development at UAF; future versions will use subroutines and have more refined error checking.
- node.tpl.php – Our modified template file used for formatting the display of nodes in the DVD Browser.
- dvd_browser_screen_scraper.pl – The Perl script used for screen scraping the library catalog and for gathering data from IMDB, Rotten Tomatoes, and freecovers.net.
A Word on DVD Covers
The DVD Browser stores thumbnail images of DVD covers on a locally hosted server. The Perl script gathers images from FreeCovers.net, the Internet Movie Database, or Rotten Tomatoes. If an image cannot be found for a movie, a “No Image is Available” picture is displayed.
We believe that the use of thumbnail images is covered under the doctrine of fair use. Besek (2003) gives a layman’s overview of fair use. She discusses the four factors of fair use as outlined in the Copyright Law of the United States of America, Section 107. Limitations on exclusive rights: Fair use. These factors are:
- “purpose and character of the use;”
- “nature of the copyrighted work;”
- “amount and substantiality of the portion used in relation to the copyrighted work as a whole;” and
- “effect of the use upon the potential market for or value of the copyrighted work.”
Applying these factors to the DVD Browser, we believe that:
- The “purpose and character of the use” is educational. We are an academic library that has built an Internet search application that helps our students search DVD movies owned by the library.
- The “nature of the copyrighted work” is creative. Creative works, compared with factual works, generally support a stronger claim of copyright infringement. However, as in the analysis of the Kelly v. Arriba Soft Corp. case by Donohue (2002), “works that have been previously published, lend themselves more readily to their fair use.” Movie artwork is published in a variety of formats such as posters, DVD covers, animations, and commercials.
- The “amount and substantiality of the portion used in relation to the copyrighted work as a whole” is insignificant. We are providing a low resolution copy of an image.
- The “effect of the use upon the potential market for or value of the copyrighted work” is insignificant. As in the Kelly v. Arriba Soft case and the Perfect 10 v. Google case, the images are transformative in nature. The images are being used as a tool to help students select and search for information in an academic setting.
SIRSI/Dynix has a command line tool, as well as a reporting tool, that could have allowed us to export the MARC records from our catalog; however, for administrative reasons, this option was not available to us.
Ayazi, Sara. “Search Engines Score Another Perfect 10: The Continued Misuse of Copyrighted Images on the Internet.” North Carolina Journal of Law and Technology 7.2 (2006): 367. 2 Oct. 2008 <http://jolt.unc.edu/abstracts/volume-7/ncjltech/p367>.
Besek, June M. “Copyright: What Makes a Use ‘Fair’?” Educause Review 38.6 (2003): 12-13. 2 Oct. 2008 <http://connect.educause.edu/Library/EDUCAUSE+Review/CopyrightWhatMakesaUseFai/40446>.
Donohue, Kelly. “Court Gives Thumbs-up for Use of Thumbnail Pictures Online.” Duke Law and Technology Review 0006 (2002). 2 Oct. 2008 <http://www.law.duke.edu/journals/dltr/articles/2002dltr0006.html>.
U.S. Copyright Office. Copyright Law of the United States of America and Related Laws Contained in Title 17 of the United States Code. 2 Oct. 2008 <http://www.copyright.gov/title17/92chap1.html#107>.
We’d like to thank Eugeniy Kalin for his programming efforts in creating the first version of the DVD Browser using Drupal 4.7. Thanks to Jim Hassel and David Basham for coming up with the idea for the DVD Browser and for the prototype coding.
About the Authors
Ilana Kingsley, Web Librarian
University of Alaska Fairbanks Rasmuson & BioSciences Libraries
Mark Morlino, Systems Administrator/Programmer
University of Alaska Fairbanks Rasmuson & BioSciences Libraries