Issue 5, 2008-12-15

Rasmuson Library DVD Browser: Fun with Screen Scraping and Drupal

The DVD Browser is a simple application that lets library patrons browse movie covers, titles, and reviews. It works by screen scraping the Rasmuson Library catalog for DVD movies and dumping the data into a Drupal MySQL database. This paper describes the process of setting up the DVD Browser.

By Ilana Kingsley and Mark Morlino

Introduction

The Rasmuson Library DVD Movie Browser was developed because our library patrons were unhappy with the search function of our SIRSI/Dynix library catalog. Patrons told us they wanted to be able to browse movie covers and titles, and the library catalog did not meet their needs.

The DVD Browser is a simple application that screen scrapes the Rasmuson Library catalog for DVD movies and dumps the data into a Drupal MySQL database. [1] Drupal is a content management system (CMS) which runs on MySQL/PHP. The DVD Browser application could have been built without Drupal, but since our Library Web site runs on Drupal and because Drupal has many useful features, such as RSS for new items, tagging, and user comments, we decided to store the data within the Drupal CMS.

Our first version of the DVD Browser ran on Drupal 4.7 and used PHP to screen scrape the catalog. The jump from Drupal 4.7 to 6.5 was large and brought some significant changes; in particular, custom content types are now built with the CCK module instead of the Flexinode module. The Flexinode to CCK converter did not work well for us, so we found that the easiest way to port the DVD Browser to Drupal 6.5 was to re-create the entire system. The script was rewritten in Perl.

This paper describes the process of setting up the DVD Browser. The screen scraping script provided may be repurposed for your needs. If your library uses a different library catalog vendor, the concepts presented below should still apply, with minor modifications to the code.

What the Application Does

  1. A Perl script screen scrapes the library catalog for new DVDs.
  2. The script temporarily stores the data in a text file and gathers additional information about the movie, such as the movie cover and genre, from other Web sites.
  3. The data is dumped into a Drupal MySQL database.
  4. The end user is able to browse and search for DVDs.

Requirements

  • Perl
  • MySQL
  • Drupal 6.X
  • Drupal Modules: CCK, FileField, ImageField, ImageAPI, Link, Views
  • PHP 5.2 (required for ImageAPI)
  • GD (configured to support JPG) or ImageMagick

Drupal Configuration

The Rasmuson Library DVD Browser uses Drupal 6.5 for its backend. Besides the core Drupal modules, which are automatically added when installing Drupal, we installed the following modules.

  1. CCK (Content Construction Kit) module and its child modules:
    1. FileField
    2. ImageField
    3. Link
    4. ImageAPI
  2. Views

The DVD Browser requires configuration of five areas within Drupal. They are content types, taxonomy, views, blocks, and theme templates. Each section is described below.

Content Type Creation

Drupal Content Types are used to represent a specific type of content and provide a content-input template so that content providers can add, edit, and delete data without knowing how to program or code. Drupal automatically installs two basic Content Types, Page and Story. In order to modify these existing Content Types or create your own, the CCK module must be installed.

We created a Content Type named “Movie,” which contains the following settings and fields:

Initial Settings

Identification
  • Name: Movie. The human-readable name for this content type. The name is required and must be unique.
  • Type: movie. The machine-readable name for this content type. The name is required and must be unique.
  • Description: A DVD for the DVD browser. The description is not required. It is used to help content providers select an appropriate template.

Submission form settings
  • Title field label: Title. The default name for this required field is Title.
  • Body field label: (left blank). The default name for this field is Body; however, when creating a custom Content Type, it’s often best to omit this field by leaving it blank.

Workflow settings
  • Default options: Published (checked), Promoted to Front Page (unchecked)

By default, both Published and Promoted to Front Page are checked. We kept Published checked so that a new movie is automatically published when it is added to the database. We deselected Promoted to Front Page; however, because we use the Views module to present data, checking or unchecking this option has no visible effect for the end user.
Fields

Label | Field Name | Type | Description
Taxonomy | (none) | Taxonomy module form | (none)
Menu settings | (none) | Menu module form | (none)
Title | (none) | Node module form | The main title of the DVD. Note that the Title field is required and is not stored in Drupal’s Content Type table (e.g., content_type_movie). It is stored in the node table.
Alternative Title | field_dvdb_alt_title | Text | If the title is not in English, then the English title of the movie, if known.
Sort Title | field_dvdb_sort_title | Text | The title of the movie without words like “A”, “The”, “Le”, “La.” This field is used for sorting the movies.
Record ID | field_dvdb_record_id | Text | The unique ID number associated with the item in the library catalog.
Call Number | field_dvdb_call_number | Text | The call number of the DVD.
Short Summary | field_dvdb_short_summary | Text | Information about the movie, taken from the screen scrape.
Long Summary | field_dvdb_long_summary | Text | Information about the movie, taken from the screen scrape.
IMDB Code | field_dvdb_imdb_code | Text | If applicable, the IMDB number for the movie.
RT Code | field_dvdb_rotten_tomatoes_code | Text | If applicable, the Rotten Tomatoes number for the movie.
Cover | field_dvdb_cover | Image | A thumbnail of the DVD cover.

Note: The Taxonomy, Menu settings, and Title items were created automatically by Drupal. The remaining fields are ones that we created.
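
The Sort Title field described above holds the title with its leading article removed. As a rough illustration only, the idea could be sketched as follows. The sketch is in Python rather than the Perl used by our script, the function name make_sort_title is hypothetical, and the article list is an assumption that extends the examples given in the field description (“An” is added here).

```python
import re

# Leading articles to drop. This list is an assumption: "A", "The", "Le",
# and "La" come from the field description; "an" is added for illustration.
LEADING_ARTICLES = ("a", "an", "the", "le", "la")

# Match one article at the start of the title, followed by whitespace.
_ARTICLE_RE = re.compile(
    r"^(?:" + "|".join(LEADING_ARTICLES) + r")\s+", re.IGNORECASE
)


def make_sort_title(title: str) -> str:
    """Return the title with a single leading article removed."""
    return _ARTICLE_RE.sub("", title.strip(), count=1)
```

Because the pattern requires whitespace after the article, a title such as “Latitude” or “Amelie” is left unchanged. A real collection would likely need a longer, language-aware article list.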

Taxonomy Creation

The DVD Browser’s Perl script grabs the genre classification of a movie from the Internet Movie Database (IMDB) and dumps this information into the Drupal database. In order for the script to input data into the database, we needed to create a vocabulary using Drupal’s Taxonomy module.

When setting up a Drupal Vocabulary, you can associate Content Types with a specific Vocabulary. We created a vocabulary named Genre and associated it with the Movie Content Type.

View and Block Creation

The Views module allows you to pull data from the Drupal database and output it in different ways. The DVD Browser uses Views to output movies to two sidebar Blocks (the Genre block and the Most Recent block) and to the #-Z header block.

Theme Templates

Much documentation has been written about theming Drupal sites. Whether you use an out-of-the-box theme (e.g., Garland), a third-party theme (e.g., Newsflash), or create your own theme, you’ll probably need to customize the node.tpl.php template.

An out-of-the-box node.tpl.php template usually outputs the content of a node with the following statement:

<div class="content">
     <?php print $content; ?>
</div>

The statement doesn’t include <div /> tags that can be used for formatting individual pieces of content with CSS (Cascading Style Sheets). For example, the following content is displayed using the Zen out-of-the-box node template, node.tpl.php.

Figure 1: DVD listing: Zen Template

For the DVD Browser, we used the Zen starter theme as the basis of our own theme. We modified node.tpl.php to use <div /> tags for most of the Drupal database fields. Below is a screen shot of the same movie, but using the modified file.

Figure 2: DVD listing: UAF Modified Zen theme

The Screen Scrape

The DVD Browser gathers the Call Number, Title, Short Summary, and Long Summary by screen scraping the Rasmuson Library catalog. It then gathers additional information (the Long Summary, Genre, and DVD Cover) from the Internet Movie Database (IMDB), Rotten Tomatoes, and FreeCovers.net. This section describes what the Perl script is looking for when gathering data from the catalog and other Web pages.

Initial population of the database takes several days because the library owns a large number of videos and our catalog displays only ten items on a page. The Perl script calls the catalog URL, processes ten items, and begins again, starting where it left off. Each time the script is called, it scrapes the HTML source code of the catalog for information about newly added DVDs and stores the information in a .txt file on the server, for example DVD-22.txt.
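
The page-by-page processing described above can be sketched roughly as follows. This is an illustration in Python, not the Perl script itself. The names harvest and fetch_page are hypothetical, the record layout (a dictionary with call_number and title keys) is assumed, and stopping at the first call number that does not begin with “DVD” is an assumption about how the end of the DVD range is detected.

```python
from pathlib import Path
from typing import Callable, Dict, List

PAGE_SIZE = 10  # the catalog displays ten items per page

Record = Dict[str, str]


def harvest(fetch_page: Callable[[int], List[Record]],
            start: int, out_dir: Path) -> int:
    """Process catalog pages beginning at `start`.

    `fetch_page(position)` is a hypothetical callable that returns up to
    PAGE_SIZE records beginning at `position`. Each record is written to
    `<call number>.txt`. The return value is the position to resume from
    on the next run.
    """
    position = start
    while True:
        page = fetch_page(position)
        for record in page:
            call_number = record["call_number"]
            # Assumption: anything that is not a DVD ends the run.
            if not call_number.startswith("DVD"):
                return position
            path = out_dir / f"{call_number}.txt"
            path.write_text(record["title"] + "\n", encoding="utf-8")
            position += 1
        # A short (or empty) page means the result list is exhausted.
        if len(page) < PAGE_SIZE:
            return position
```

Returning the resume position keeps each run independent of the previous one: a scheduled job can store the number and pass it back in as start the next time it runs.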

The script searches the source code for the case-sensitive string Details.

<tr>
    <td class="itemlisting2" rowspan="2" width="10%">
       <input value="Details" name="VIEW^1" id="VIEW1" class="itemdetails" type="submit">
    </td>
    <td class="itemlisting2">
        <a href="/uhtbin/cgisirsi.exe/RTbEt1KHPc/UAFRAS/204850051/20/DVD-22%20VIDEODISC/1/X1002522166/">
        <!-- current hit; bold it -->
	<strong>
	DVD-22 VIDEODISC
        <!-- current hit; unbold it -->
        </strong>
      </a>
      <br>Billy's Hollywood screen kiss [videorecording] / Trimark Pictures presents a Revolutionary Eye production ; co-producers, Meredith Scott Lynn and Irene Turner ; produced by David Moseley ; written and directed by Tommy O'Haver.
      <br>&nbsp;O'Haver, Tommy.
    </td>
    <td class="defaultstyle" rowspan="2" width="100" align="left">
        <img src="/WebCat_Images/English/Special/Link/SPACER.gif" alt="" width="100" border="0" height="1">
    </td>
   </tr>

It then looks for the call number, which occurs three lines after the word Details. In the source excerpt above, the call number is DVD-22. Next, the script searches for the Title, which occurs just before the case-sensitive string [videorecording]. The text after [videorecording] / (in this example, “Trimark Pictures presents….”) will be stored in the Drupal database in the Short Summary field. To obtain the Long Summary, the script calls the record for the specific movie (e.g., http://goldmine.uaf.edu/uhtbin/cgisirsi.exe/x/UAFRAS/x/20/DVD-1/1/X1002522166) and looks for the case-sensitive string Summary:.
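
The matching rules just described can be sketched as follows. This is a Python illustration rather than the Perl script, and it covers only the listing page, not the Long Summary lookup. The function name parse_listing, the returned dictionary keys, and the call number pattern DVD-<digits> are assumptions made for the example; call numbers for multi-disc sets may need a broader pattern.

```python
import re
from typing import Dict, List

# Assumed call number shape, e.g. "DVD-22".
CALL_NUMBER_RE = re.compile(r"DVD-\d+")

# Title precedes "[videorecording]"; the short summary follows the slash.
TITLE_RE = re.compile(
    r"<br>\s*(?P<title>.*?)\s*\[videorecording\]\s*/\s*(?P<summary>.*)"
)


def parse_listing(html: str) -> List[Dict[str, str]]:
    """Extract call number, title, and short summary for each listed item."""
    lines = html.splitlines()
    records = []
    for i, line in enumerate(lines):
        # Case-sensitive marker that starts each catalog item.
        if "Details" not in line or i + 3 >= len(lines):
            continue
        # The call number appears three lines after the marker.
        call = CALL_NUMBER_RE.search(lines[i + 3])
        if call is None:
            continue
        record = {"call_number": call.group(0), "title": "", "short_summary": ""}
        for later in lines[i + 4:]:
            if "Details" in later:  # reached the next item
                break
            found = TITLE_RE.search(later)
            if found:
                record["title"] = found.group("title")
                record["short_summary"] = found.group("summary").strip()
                break
        records.append(record)
    return records
```

Line-based matching like this is brittle by nature: any change to the catalog’s HTML layout, such as an extra line between the Details button and the link, would require adjusting the offsets and patterns.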

After getting the Title and Summary from the library catalog, the script attempts to gather the Genre, Long Summary, and Cover from the Internet Movie Database (IMDB), the Summary and Cover from Rotten Tomatoes, and the Cover from FreeCovers.net.

When the program is able to gather multiple summaries or multiple images for a single DVD, we apply an established order of preference and insert only one summary and one image into Drupal per movie. The extra ones are saved to files in case we want to use them later.
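
A minimal sketch of this kind of preference-based selection is shown below, in Python rather than Perl. The function name choose and the source labels are hypothetical, and the two preference orders are placeholders chosen for the example; they are not the orders our script uses.

```python
from typing import Dict, Optional, Sequence, Tuple

# Placeholder preference orders (assumptions, for illustration only).
SUMMARY_PREFERENCE = ("imdb", "rotten_tomatoes", "catalog")
COVER_PREFERENCE = ("freecovers", "imdb", "rotten_tomatoes")


def choose(candidates: Dict[str, str],
           preference: Sequence[str]) -> Tuple[Optional[str], Dict[str, str]]:
    """Pick one value by source preference.

    Returns the chosen value (or None if no preferred source supplied one)
    and a dictionary of the remaining non-empty candidates, which the
    caller can save to files for later use.
    """
    available = {source: value for source, value in candidates.items() if value}
    for source in preference:
        if source in available:
            chosen = available.pop(source)
            return chosen, available
    return None, available
```

When choose returns None for a cover, the caller would fall back to a placeholder image.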

Conclusion

Our library catalog runs on an Oracle database; however, due to vendor restrictions, instead of easily accessing and repurposing data stored in the Oracle database, we needed to find a work-around to create a useful DVD browsing tool for our patrons.

Feedback about the DVD Browser has been positive. It is a heavily used tool, and our patrons find it much easier to use than the library catalog.

Future plans for the DVD Browser include turning on comments and ratings, so that our patrons can add user-generated content.

Code

There are two files available. The dvd_browser_screen_scraper.pl file continues to be under active development at UAF; future versions will use subroutines and error checking will be refined.

  • node.tpl.php – Our modified template file used for formatting the display of nodes in the DVD Browser.
  • dvd_browser_screen_scraper.pl – The Perl script used for screen scraping the library catalog and for gathering data from IMDB, Rotten Tomatoes, and freecovers.net.

A Word on DVD Covers

The DVD Browser stores thumbnail images of DVD covers on a locally hosted server. The Perl script gathers images from FreeCovers.net, the Internet Movie Database, or Rotten Tomatoes. If an image cannot be found for a movie, a “No Image is Available” picture is displayed.

We believe that the use of thumbnail images is covered under the doctrine of fair use. Besek (2003) gives a layman’s overview of fair use. She discusses the four factors of fair use as outlined in the Copyright Law of the United States of America, Section 107, “Limitations on exclusive rights: Fair use.” These factors are:

  1. “purpose and character of the use;”
  2. “nature of the copyrighted work;”
  3. “amount and substantiality of the portion used in relation to the copyrighted work as a whole;” and
  4. “effect of the use upon the potential market for or value of the copyrighted work.”

Applying these factors to the DVD Browser, we believe that:

  1. The “purpose and character of the use” is educational. We are an academic library that has built an Internet search application that helps our students search DVD movies owned by the library.
  2. The “nature of the copyrighted work” is creative. Creative works, compared with factual works, generally present a stronger case for copyright infringement. However, similar to the Kelly v. Arriba Soft Corp. case analysis by Donohue (2002), “works that have been previously published, lend themselves more readily to their fair use.” Movie art work is published in a variety of formats such as posters, DVD covers, animations, and commercials.
  3. The “amount and substantiality of the portion used in relation to the copyrighted work as a whole” is insignificant. We are providing a low-resolution copy of an image.
  4. The “effect of the use upon the potential market for or value of the copyrighted work” is insignificant. As in the Kelly v. Arriba Soft case and the Perfect 10 v. Google case, the images are transformative in nature. The images are being used as a tool to help students select and search for information in an academic setting.

Note

[1] SIRSI/Dynix has a command line tool, as well as a reporting tool, that could have allowed us to export the MARC records from our catalog. However, for administrative reasons, this option was not available to us.

References

Ayazi, Sara. “Search Engines Score Another Perfect 10: The Continued Misuse of Copyrighted Images on the Internet.” North Carolina Journal of Law and Technology 7.2 (2006): 367. 2 Oct. 2008 <http://jolt.unc.edu/abstracts/volume-7/ncjltech/p367>.

Besek, June M. “Copyright: What Makes a Use ‘Fair’?” Educause Review 38.6 (2003): 12-13. 2 Oct. 2008 <http://connect.educause.edu/Library/EDUCAUSE+Review/CopyrightWhatMakesaUseFai/40446>.

Donohue, Kelly. “Court Gives Thumbs-up for Use of Thumbnail Pictures Online.” Duke Law and Technology Review 0006 (2002). 2 Oct. 2008 <http://www.law.duke.edu/journals/dltr/articles/2002dltr0006.html>.

U.S. Copyright Office. Copyright Law of the United States of America and Related Laws Contained in Title 17 of the United States Code. 2 Oct. 2008 <http://www.copyright.gov/title17/92chap1.html#107>.

Acknowledgements

We’d like to thank Eugeniy Kalin for his programming efforts in creating the first version of the DVD Browser using Drupal 4.7. Thanks to Jim Hassel and David Basham for coming up with the idea for the DVD Browser and for the prototype coding.

About the Authors

Ilana Kingsley, Web Librarian
University of Alaska Fairbanks Rasmuson & BioSciences Libraries
ilana.kingsley@uaf.edu

Mark Morlino, Systems Administrator/Programmer
University of Alaska Fairbanks Rasmuson & BioSciences Libraries
mrmorlino@alaska.edu

4 Responses to "Rasmuson Library DVD Browser: Fun with Screen Scraping and Drupal"


  1. Jonathan Rochkind,

    I would love more information about how you retrieve the information from imdb, rottentomatoes, and FreeCovers.net. You are just matching on title? Do you have much trouble with false positives or false negatives matching on title keyword? Do these three services offer any kind of an API, or are you screen-scraping to find content?

  2. Jonathan Rochkind,

    Oh, it also occurs to me that I’m not sure how to identify _which_ bib records in my catalog represent movies. How are you determining that, by call number including ‘DVD’ at your library?

  3. Mark Morlino,

    Hi Jonathan,

    Thanks for your interest in our article.

    We are just matching by title. IMDB and Rotten Tomatoes are both usually pretty good. If there are multiple matches, the first one is very often the correct one. The development and testing was time consuming because when I encountered false positives or false negatives I modified the code to prevent them, and that would require dumping all of the data and starting from the first DVD to make sure the change to the screen scraping code did not affect the processing of any of the other DVDs.

    Freecovers.net is the only one that offers a public API. Basically, the API offers the same searching functionality as the web page but returns an XML document rather than an HTML document. So we can parse the XML into a data structure easily and consistently to look for matches. It is considerably easier to code and less error prone than the screen scraping. Unfortunately, the Perl module we are using to parse the XML stores the list of possible title matches as a hash, which means we do not maintain the order in which they were sent, and Freecovers appends various information to the titles (to indicate a release, a language, a region, or sometimes a video game by the same name). As a result, there are many DVDs that have cover images available from Freecovers that the program doesn’t use, because it cannot determine which one to pick.

    The DVDs in our catalog all have call numbers that begin with “DVD”; the rest of the call number is based on the order in which the DVD was cataloged and whether or not it is part of a multiple-DVD set. So it is fairly easy for us to search for them in the catalog by number. The browser can just search for the most recent DVD that it knows about and keep processing results until it finds a call number that does not begin with DVD.

    I hope this helps.

    -Mark

  4. mark411,

    Hi! Is it possible to retrieve information from covers.hearsay24.com ?
