Using Static Site Generators for Scholarly Publications and Open Educational Resources

Chris Diaz

Issue 42, 2018-11-08

Using Static Site Generators for Scholarly Publications and Open Educational Resources

Libraries that publish scholarly journals, conference proceedings, or open educational resources can use static site generators in their digital publishing workflows. Northwestern University Libraries is using Jekyll and Bookdown, two open source static site generators, for its digital publishing service. This article discusses motivations for experimenting with static site generators and walks through the process for using these technologies for two publications.

by Chris Diaz

Introduction

Static site generators build websites from plain-text files. Most are free to use and are available under an open source license [1]. They are often described in comparison to content management system (CMS) software, like WordPress or Drupal. CMS websites use database processes on a web server to dynamically create HTML on demand. Static site generators, however, perform all of the plain-text-to-HTML processing before the files are deployed online. This preprocessing workflow removes the need for high-touch system administration, database installations, server-side processing, and security patching, reducing the need for full-time developers and system administrators for digital publishing services. These advantages make static site hosting, maintenance, and preservation more affordable and sustainable for small teams.

Northwestern University Libraries began using static site generators for our digital publishing service in 2018. We initially licensed the Digital Commons platform from Bepress to support our open access publishing services, but the Elsevier acquisition made us question our reliance on proprietary software and motivated us to consider open source alternatives (Schonfeld 2018). At the same time, interest in open source software for library publishing was growing (Library Publishing Coalition 2018). This article reflects on our use of two open source static site generators for library publishing, including an overview and evaluation of the technologies while focusing on two popular use cases: scholarly publications and open educational resources.

Why build static sites?

Static sites are gaining popularity in the digital library and cultural heritage community [2]. The Getty Museum is currently developing Quire, a new monograph publishing tool that uses the Hugo static site generator (The J. Paul Getty Trust 2018). Newson’s excellent “Tools and Workflows for Collaborating on Static Website Projects” covers how static website generators work, the benefits they offer, and a case study for a digital collections project (Newson 2017). For digital publishing, the benefits of static sites include affordability, sustainability, and preservation.

Static sites are cheap to host and require very few computing resources. In addition to domain name services, dynamic sites require an operating system to manage the web server, the database, and all server-side scripting, often with significant software dependencies that require regular monitoring and maintenance. Modern web browsers are reducing the need for server-side processing for rendering dynamic web content. Static sites simply require storage and a content delivery network (CDN) from hosting providers. The requirements for specific operating systems, databases, software dependencies, or server-side scripting for are minimal or nonexistent for hosting static sites. In some cases, static sites can be hosted for free or very cheaply (e.g. $6.00 US – $30.00 US per year) by providers like GitHub Pages, Netlify, or Amazon Web Services (AWS).

Static sites are generally smaller than dynamic sites. For one of our conference publications with over 30 presentations, the entire working directory is 230 MB (includes the full Git version history, Markdown, configuration, and Sass files) and the output directory is 61 MB (which includes only the HTML/ CSS files and PDF full-text content). For comparison, some hosting providers suggest 1 GB of space needed to run one WordPress site, more than 10 times the needed disk space to host a static website (Jackson 2018).

Static websites are easy to maintain because they are composed of asset, HTML, CSS, and JavaScript files. Whereas dynamic websites are processed and built on-demand or cached on a server, static site generators process and build the site before the files are deployed online. The website doesn’t change until there’s a new deployment of the site. Websites powered by a CMS facilitate the exchange of information between the client and the server, such as user login credentials, which is stored in a web accessible database. Both the database and the CMS require system administrators to install and configure the software on the server, which require monitoring to ensure constant uptime, maintain software dependencies, handle regular security patches, and schedule server migrations. This is not the case for static sites.

Scholarly Publications with Jekyll

Jekyll [3] is an open source, command-line tool that transforms plain-text files into static sites. Jekyll is very popular among the open source community. It is often the tool of choice for blogs, project marketing websites, and technical documentation due to its integration with GitHub, which offers GitHub users free static site hosting with its GitHub Pages service [4]. GitLab [5] and Bitbucket [6] also provide free static site hosting.

Jekyll’s built-in toolset is well-suited to support scholarly publications, like journals and conference proceedings. Both sx archipelagos [7] and The Programming Historian [8] use Jekyll and GitHub Pages for hosting and publishing their peer-reviewed publications. We use Jekyll to build conference proceedings websites for content we’re storing in our institutional repository. The static site serves as an lightweight presentation layer for a collection we manage in our institutional repository. This provides us a low-cost, highly customizable, simple website for our institutional stakeholders (Diaz 2018).

Figure 1: How static websites enhance institutional repository collections

Content

3The full-text and metadata are formatted as Markdown [9] and YAML [10] respectively. Both Markdown and YAML are stored in Markdown files (.md), which can be opened by any plain-text editor. Because static sites do not use content management systems, users interact with the plain-text files and folders directly. If you were to open a basic Jekyll project for a scholarly journal, for example, the folder and file structure might look something like this:

Working directory for a scholarly journal built with Jekyll

When you run the build command, the process takes all of the files in the scholarly journal file directory and outputs website files in subdirectory called _site. The _site directory contains everything you would need to deploy the site to the web.

Output Directory File Contents for a Complete Website Built with Jekyll

Jekyll has built-in support for Collections, which enables you to define content types, such as pages, articles, and posters (Jekyll 2018). Each collection will likely have its own designated HTML layout and default metadata values. For a journal project, you could create a subdirectory called _articles and keep all of the journal articles as Markdown files.

The Markdown files include metadata and full-text content. The metadata appears at the top of the file as a key-value store represented as YAML, a field-driven markup language suitable for metadata. The full-text of the article is formatted as Markdown.

Markdown File containing YAML metadata and Markdown content

Both YAML and Markdown are human-readable, non-proprietary plain-text formats. From a preservation perspective, one could compress the Markdown files into a .zip file at the time of publication and store them in a preservation or repository system.

The _config.yml file contains all site-level configurations, such as the publication’s title, editor contact information, Google Analytics tracking code, and the website’s theme.

Basic settings a journal might store in a Jekyll _config.yml file

At Northwestern, we used Jekyll and our institutional repository to create a conference proceedings publication for Computational Research Day [11], an annual conference on data science and high-performance computing topics. We have made the source code [12] available for reference, however the style sheets are restricted to Northwestern University websites and are not openly-licensed.

Layout

Layouts for Jekyll websites are often set by a theme. There are hundreds of Jekyll themes available on GitHub [13] or RubyGems.org [14] for free. Themes allow users to ignore difficult design decisions requiring writing and editing HTML, CSS, and JavaScript. Most themes are designed for users to plug into their Jekyll projects without needing to edit any HTML, CSS, or JavaScript files, allowing content creators to focus on the full-text and metadata of their sites. However, because Jekyll was built with blogs in mind, there may be instances when library publishers would want to edit the layouts of their publications to enhance the display metadata and match the user expectations of scholarly audiences.

Jekyll Includes [15] is a method of modularizing HTML templates into component parts, sometimes referred to as partials, and can be useful additions to existing themes. These are small HTML files that are often limited to a specific HTML element in a page layout, like a search bar or navigation menu. These modules help keep the code better organized. For scholarly publications, Jekyll Includes are helpful when needing to call HTML elements on a conditional basis. For example, a scholarly publication might offer two or three Creative Commons license options for authors. With Jekyll Includes, you can create a template for displaying each of these licenses with their respective logo and govern their use with conditional logic based on the article’s YAML metadata.

Jekyll Includes example of an HTML element for an article license wrapped in conditional logic

For scholarly publications specific to Northwestern University, we use the Sass [16] files and HTML pattern library created by the university’s marketing department to defer the layout and styling decisions. For non-Northwestern specific scholarly publications, we experiment with open source CSS frameworks or themes that contain responsive layout elements and usability features that are important to stakeholders.

Metadata

Jekyll’s use of the Liquid templating language makes reusing YAML data easy. Liquid is a system of tags, objects, and filters that Jekyll uses during the build process to load YAML and Markdown content into specified HTML elements (Shopify 2018). For articles, you can reuse the YAML front-matter for each journal article to create machine-readable metadata in the HTML header of each article’s web page to assist web search indexing and citation management systems. You could do this by creating an HTML file called metadata.html and assigning metadata values to designated HTML <meta> tags according to the scheme’s guidelines. Here is an example of a few lines from a metadata.html file following Google Scholar recommendations [17] that pulls from YAML front-matter for each journal article and places it into the <head> element of the rendered HTML:

Google Scholar Metadata Example

The metadata.html could include as many additional metadata schemas as you would like, including DublinCore and Open Graph. Each schema has its own <meta> tag attribute but uses the same YAML front-matter from the articles.

Publishing Workflow

We partnered with Northwestern Information Technology Research Computing department to publish the proceedings from Computational Research Day, a conference they organize. Research Computing announced a call for posters, presentations, and visualizations and marketed them to various departments on campus. The also managed the review and editorial timeline for the publication.

We used SmartSheets [18] to create a web form and serve as a submission management system for receiving documents, capturing submission metadata, assigning reviewers, and tracking the editorial timeline. The web form including an option for submitters to participate in the proceedings publishing; publication was not required.

To build out the website, I made HTML layouts based on Northwestern’s HTML pattern library and Sass files developed by the marketing department. I also set up a collection in our institutional repository to store the final versions of the poster and visualization submissions and mint DOIs. I used the metadata from the submission form to deposit the files into the repository and to manually create Markdown files for each submission. We reused the DOIs created by the repository for each submission to display on the website in order to facilitate.

To deploy the site, I used the s3_website gem [19] to deliver the static site files to a provisioned AWS using S3 bucket [20]. While the files are stored in S3, they are delivered using CloudFront [21]. I requested a Northwestern subdomain and launched the conference publication [11].

Figure 2: Screenshot of conference website, available at: http://crd.northwestern.edu/

We expect to maintain the website for the conference for up to three years after the conference has been discontinued. We chose to reuse the DOIs from the repository under the assumption that the active maintenance of the repository will outlast the availability of the static site.

Open Educational Resources

Bookdown

Bookdown [22] is an open source R [23] package that transforms plain-text into HTML, PDF, and EPUB. Bookdown works with text formatted with either Markdown or R Markdown [24], a version of Markdown that allows users to embed executable R code into Markdown files. Similar to Jekyll, Bookdown is a static site generator that renders content as online and downloadable e-books. It is popular within the R community of scholars and practitioners working in fields that use statistical computing.

We worked with a faculty member in the Statistics department to publish a set of modified teaching labs for an introductory statistics course that uses R. We chose Bookdown for this project because it was a specifically designed for the content he was creating and the computing environment he was introducing to his students.