Issue 30, 2015-10-15

Collecting and Describing University-Generated Patents in an Institutional Repository: A Case Study from Rice University

Providing an easy method of browsing a university’s patent output can free up valuable research time for faculty, students, and external researchers. This is especially true for Rice University’s Fondren Library, a USPTO-designated Patent and Trademark Resource Center that serves an academic community widely recognized for cutting-edge science and engineering research. To make Rice-generated patents easier to find within the university community, a team of technical and public services librarians from Fondren Library devised a method to identify, download, and upload patents to the university’s institutional repository, starting with a backlog of over 300. This article discusses the rationale behind the project, its potential benefits, and the challenges that arise as new Rice-generated patents are added to the repository on a monthly basis.

by Scott Carlson and Linda Spiro


A concrete measure of a research university’s success is the value of its patents. Every year, the National Academy of Inventors and the Intellectual Property Owners Association publish a list of the top 100 worldwide universities that were granted United States utility patents — that is, patents based on a new and useful process, machine, manufacture, or composition of matter (or a new and useful improvement).[1] Using data from the U.S. Patent and Trademark Office (USPTO), they select patents that list a university as the first assignee on the U.S. utility patent and post the information online (“Top 100 Worldwide Universities Granted U.S. Utility Patents in 2014”, 2014).

The commercial value of patents to universities worldwide is a topic that has been explored by researchers for many years. In 1997, Wallmark looked at how universities can encourage an increased output of inventions and patents. Using patent data from Chalmers University of Technology in Gothenburg, Sweden, he estimated the economic value of patents on the basis of employment in spin-off companies from university patents (Wallmark, 1997). In 2009, Striukova from University College London discussed the impact of patenting not just on the university, but also on society and the economy. She stressed that “university patents are not only about creating financial market value,” but also “play an important role in creating knowledge spillovers, building networks with other academics and venture capitalists and catalyzing university-industry recognition”; however, she advised patenting judiciously so as not to waste resources on “unworthy patents” (Striukova, 2009, p. 388). Likewise, in 2008, Nicol from the University of Tasmania underscored the value of knowledge dissemination as a key component of the university’s mission. She believed that universities are increasingly recognizing the commercial value of university-created knowledge alongside its social, academic, and practical value, although for some research she favored releasing raw results into the public domain over patenting (Nicol, 2008). In 2014, while discussing how Technology Transfer offices might best manage university patents, Cummings from Ohio State and the University of Utah emphasized the importance of universities’ patented inventions to the U.S. economy: “More and more, both our federal and state governments rely on top-tier research universities to improve our economy by providing the next generation of inventors and entrepreneurs who create groundbreaking inventions, high-growth startups, thousands of new jobs, and, ultimately be commercialized to drive a cycle of innovation, thereby securing a global leadership position for the U.S. economy” (Cummings, 2014, p. 1027).

Aware of the abundance of research detailing the contributions of patents to the economy as well as the prestige of originating universities, Rice’s Patent and Trademark Resource Center staff discussed various ways of promoting Rice patents. When Jan Comfort, patent librarian at Clemson University, shared news of a project to add Clemson-generated patents to their institutional repository, we realized this was the solution we were seeking.

Collecting Rice patents in our own Digital Scholarship Archive would have numerous potential impacts. First, the initiative would be a convenience, especially for novice patent searchers. Using the popular Google Patent Search to find patents assigned to Rice University can be an exercise in frustration. Even when enclosing Rice University in quotation marks or using the “with the exact phrase” advanced search, Google retrieves over 100,000 patents. Meanwhile, the United States Patent and Trademark Office’s Patent Assignment Database is only searchable from 1980 onward, and depends on assignment data being free of errors, which we found was not the case for some Rice patents. Second, spotlighting Rice patents in the Digital Scholarship Archive could raise Rice’s standing locally, nationally, and internationally. The repository would connect patent inventors with their other scholarly research in the archive, providing a more complete view of their respective bodies of work. Increased visibility could also facilitate future collaborations between Rice University and the technology and health sectors, patent areas in which Rice is strong. Finally, collecting Rice patents would allow us to perform long-term tracking, such as tracing citations to determine whether particular patents had lasting impacts on future scholarship.

With these benefits in mind, a team of technical and public services librarians from Rice University’s Fondren Library met to explore the feasibility of adding Rice-generated patents to the university’s repository. Members of the group included: Linda Spiro and Siu Min Yu, Government Information Librarians; Scott Carlson, Metadata Coordinator; Monica Rivero, Digital Curation Coordinator; Kathy Weimer, Head, Kelley Center for Government Information; Lisa Spiro, Executive Director of Digital Scholarship Services; and Shannon Kipphut-Smith, Scholarly Communications Liaison.

Identifying & Describing Patents

The project began with Fondren’s government information librarians identifying an initial list of patents generated by the Rice community. Using the Public Web-Based Examiner Search Tool (PubWEST), a research staple of Patent and Trademark Resource Centers, the librarians identified 200 patents that listed variants of “Rice University” or “William Marsh Rice University” as the assignee (a person or business receiving an assignment of ownership interest). This first search showed that a sizeable body of Rice scholarship existed but was not represented in the repository, helping to justify the project.

Before pressing on with the project, a handful of concerns about the initial dataset needed to be addressed. The first issue was identifying as many of the existing Rice-generated patents as possible. Keyword searching in PubWEST using variations of the terms “Rice” and “University” in the assignee field increased our search total to just over 400 results. A little more than 50 of these results were dismissed as false hits, usually patents relating to the production of rice crops. After meeting with Rice’s Office of Technology Transfer, which maintains a private list of Rice-related patents dating back to 1978, we were able to identify several patents that were missed in the PubWEST searches due to misspellings in the original text. (Some of the assignee typos included “Marshurice University”, “Rich University”, “William Rice Marsh Rice University” and “William Marsh University.”) By the time the entire backlog was uploaded to the repository in late spring of 2015, the total number had topped out at 365 patents.
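The misspelled assignees above were found by comparing search results against the Office of Technology Transfer’s list. As a rough illustration only, and not the method the team used, a fuzzy string comparison is one way such variants might be flagged for human review. The sketch below uses Python’s standard difflib module; the function names, the 0.8 threshold, and the “Rice Growers Association” false hit are all hypothetical choices made for the example.

```python
from difflib import SequenceMatcher

# Canonical assignee forms named in the article, lowercased for comparison.
CANONICAL = ("rice university", "william marsh rice university")


def assignee_similarity(name):
    """Return the best similarity ratio (0.0 to 1.0) against the canonical forms."""
    # Lowercase and collapse whitespace so formatting differences do not matter.
    candidate = " ".join(name.lower().split())
    return max(SequenceMatcher(None, candidate, c).ratio() for c in CANONICAL)


def looks_like_rice(name, threshold=0.8):
    """Flag an assignee string as a probable Rice variant.

    The threshold is an assumed value for illustration; a real workflow
    would tune it and still route flagged records to a person.
    """
    return assignee_similarity(name) >= threshold


if __name__ == "__main__":
    samples = [
        "Marshurice University",
        "Rich University",
        "William Rice Marsh Rice University",
        "William Marsh University",
        "Rice Growers Association",  # hypothetical rice-crop false hit
    ]
    for sample in samples:
        print(sample, looks_like_rice(sample))
```

A check like this only narrows the list; it would not replace the comparison against an authoritative record of university patents.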

The other major concern was the exportable metadata from PubWEST. Our application profile for the project defined 16 discrete pieces of metadata for each repository record, with almost half of the metadata directly referencing information about each specific patent: patent number, title, assignee(s), inventor(s), filing date, publishing date, and abstract. However, metadata exports from our PubWEST searches did not provide all of the necessary information: only the first assignee and inventor names were included for each patent, while abstracts were missing entirely. We briefly considered data-mining downloaded copies of the patents to fill in the missing metadata, but this option was rendered moot when we came upon Free Patents Online (FPO), a website that allowed us to perform complex searches and export metadata from the resulting searches, including the full assignee, inventor and abstract information. (This discovery also freed us from having to rely on the limited number of computer terminals able to access PubWEST in Fondren.)

Once acquired, the metadata from both PubWEST and FPO were combined in a single spreadsheet and uploaded to Google Refine (now OpenRefine) for normalization — specifically, assignee and inventor metadata. Both pieces of information appear in patents with appended geographic information; this geographic metadata caused us some degree of difficulty, as we soon found more than 50 instances of matching inventor names with inconsistent geographic information in our backlog. To counter these inconsistencies, as well as ward off future issues, we decided to expunge geographic information from all inventor and assignee metadata in the backlog and preclude it from any future submissions.
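The team did this normalization in OpenRefine. For readers who prefer a script, the following is a minimal sketch of the same idea in Python. It assumes, purely for illustration, that each name carries its location in a trailing parenthetical and that multiple names are separated by semicolons; the sample names and both function names are hypothetical, and the real export format may differ.

```python
import re

# Matches one trailing parenthetical, e.g. " (Houston, TX, US)".
# The parenthetical layout is an assumption about the export format.
_TRAILING_LOCATION = re.compile(r"\s*\([^()]*\)\s*$")


def strip_location(name):
    """Remove a trailing "(City, State, Country)" group from one name."""
    return _TRAILING_LOCATION.sub("", name).strip()


def normalize_names(field, separator=";"):
    """Split a multi-name field, strip locations, and drop duplicates.

    Order of first appearance is preserved so the first-listed inventor
    or assignee stays first.
    """
    seen = set()
    result = []
    for part in field.split(separator):
        cleaned = strip_location(part)
        if cleaned and cleaned not in seen:
            seen.add(cleaned)
            result.append(cleaned)
    return result


if __name__ == "__main__":
    raw = "Doe, Jane (Houston, TX, US); Doe, Jane (Austin, TX, US); Roe, Richard (Bellaire, TX, US)"
    print(normalize_names(raw))
```

With the locations removed, the two “Doe, Jane” entries collapse into one, which is the kind of inconsistency described above.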

Acquiring Patents

An analysis of the metadata for Rice patents showed that the Rice community averaged 2 to 3 patents issued per month between January 2014 and March 2015. This suggested that the regular ongoing acquisition and submission of new patents into the repository could easily be a manual task; but with a starting backlog of more than 350 patents, it was obvious that we would need a batched process.

The USPTO website allows for the searching of individual patents, and has made bulk patent information available to the public, free of charge, through Reed Technology Information Services. While Reed Tech does offer full text versions of patents in multiple formats (such as TIFF images and markup languages), downloads are only available in bundles of the entire weekly issuance of the USPTO for each given year; in other words, acquiring a single patent would necessitate downloading all of the other patents issued that week. Full text markup language bundles ranged between one and two gigabytes, while multi-page TIFF archives could be anywhere between seven and 20 gigabytes each. Due to time, bandwidth and hard drive space concerns, we decided to investigate automatically pulling PDF copies of patents directly from the USPTO’s Patent Full-Text and Image Database (PatFT).

Downloading full-text patent PDFs (“Full Documents”) is not a direct process on PatFT. Searching for patents leads users to an HTML representation (full-text is available for patents issued from 1976 onward). Clicking the “Images” button at the top of each patent page brings the user to another section of the USPTO website that contains embedded PDF images representing single pages of the patent. The PDF section is navigated with a toolbar on the left side of the page, which moves the user through the patent page by page; however, another button on the toolbar (“Full Pages”) will embed a PDF of the entire patent.

After inspecting the source code on these embedded pages, we discovered patterns in the methods the USPTO chose to store and label the PDF copies. Because the navigation of the patents is by individual pages, the USPTO stores each page as a separate PDF file in a directory specific to every patent. Each file is named after its page number; however, the full PDF copy (from the “Full Pages” button) is named “0.pdf.”

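The directory path for each patent follows its own pattern: the seven-digit numerical portion of the patent identification number is split into segments of two, three, and two digits, the segments are placed in reverse order, and a padding digit is added to the beginning of the last segment. For example, in the case of patent US8,880,707B2: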

US 8880707 B2 = 07/807/088

This means that the PDF containing the entire text of this particular patent would be at a URL ending in 07/807/088/0.pdf.

Knowing this, we looked for a way to generate full-text PDF URLs for all of our patents, using only the patent identifiers. To do this, we created a spreadsheet with several formulas that broke the patent identifiers into the file path sectors and combined all of the pieces into a URL using a CONCATENATE formula. The columns, with sample values for US8880707B2 in parentheses, were:

  1. Column A: patent identifier, minus any spaces or punctuation (US8880707B2)
  2. Column B: isolated identifier [=MID(A1,3,7)] (8880707)
  3. Column C: first URL sector [=RIGHT(B1,2)] (07)
  4. Column D: second URL sector [=MID(B1,3,3)] (807)
  5. Column E: padding digit for third URL sector [0] (boilerplate until the USPTO moves into triple digits)
  6. Column F: final URL sector [=LEFT(B1,2)] (88)
  7. Column G: PDF filename [0.pdf] (boilerplate)
  8. Column H: generated directory/filename [=CONCATENATE(C1,"/",D1,"/",E1,F1,"/",G1)] (07/807/088/0.pdf)
  9. Column I: final constructed URL [=CONCATENATE("",H1)]
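For readers more comfortable with a script than a spreadsheet, the column formulas above can be collapsed into one small function. This is a sketch under the same assumptions as the spreadsheet: a two-letter country prefix followed by a seven-digit patent number. The function name is hypothetical, and the base address of the server is deliberately left out because it is not given here.

```python
def patent_pdf_path(identifier):
    """Build the directory/filename portion for a patent's full-text PDF.

    Mirrors the spreadsheet columns for seven-digit patent numbers only.
    """
    # Column A: drop spaces and punctuation, e.g. "US 8,880,707 B2".
    cleaned = "".join(ch for ch in identifier if ch.isalnum())
    # Column B: the seven digits after the two-letter prefix, =MID(A1,3,7).
    digits = cleaned[2:9]
    if len(digits) != 7 or not digits.isdigit():
        raise ValueError("expected a seven-digit patent number in %r" % identifier)
    first = digits[-2:]       # Column C, =RIGHT(B1,2)
    second = digits[2:5]      # Column D, =MID(B1,3,3)
    third = "0" + digits[:2]  # Columns E and F, padding digit plus =LEFT(B1,2)
    # Columns G and H: append the fixed filename and join the sectors.
    return "/".join([first, second, third, "0.pdf"])


if __name__ == "__main__":
    print(patent_pdf_path("US8880707B2"))
```

Raising an error on anything other than seven digits is a design choice for the sketch: it avoids silently producing a wrong path for identifiers that do not fit the pattern the spreadsheet was built around.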

Once we generated a list of constructed PDF URLs, we then wanted to automatically download them while renaming the PDFs from “0.pdf” to their unique patent identifiers. The easiest solution was to use GNU Wget, a command line program. Wget’s “-O” download option, normally used to concatenate multiple files into one written output file, can also be used as a simple renaming command when downloading one specific file. We set up another column in our URL spreadsheet that transformed the patent identifiers (column A) and final constructed URLs (column I) into Wget commands:

=CONCATENATE("wget -O ",A1,".pdf ",I1)

The final column of Wget commands was then copied to a text file and saved as an executable .BAT file, which successfully downloaded and renamed our entire list of patents. A sample of this spreadsheet is available for download.
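The same command list could also be generated outside the spreadsheet. The sketch below builds one Wget line per patent from an identifier and its directory/filename path, matching the shape of the CONCATENATE formula. The base address is a placeholder that will not resolve, since the real one is not reproduced here, and the function names are hypothetical.

```python
# Placeholder only: substitute the real base address before use.
BASE_URL = "https://patents.example.invalid/"


def wget_command(identifier, pdf_path, base_url=BASE_URL):
    """Return one Wget line that saves the PDF under the patent identifier.

    The -O option names the output file, which renames 0.pdf on download.
    """
    return "wget -O %s.pdf %s%s" % (identifier, base_url, pdf_path)


def build_download_script(rows, base_url=BASE_URL):
    """Join the commands for many (identifier, pdf_path) pairs, one per line."""
    lines = [wget_command(ident, path, base_url) for ident, path in rows]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    rows = [("US8880707B2", "07/807/088/0.pdf")]
    print(build_download_script(rows), end="")
```

The resulting text can be saved as a batch file, as the team did, or as a shell script on other systems.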

Accessing Patents

After the patents were downloaded, the team began discussing strategies to make them searchable within Rice’s Digital Scholarship Archive. Our Digital Scholarship Archive — a DSpace instance — utilizes the software’s PDF Text Extractor, which extracts recognized text from PDFs and stores it as a searchable text bitstream. Unfortunately, the PDFs batch-downloaded from the USPTO web site turned out to be image-only scans, with no optical character recognition (OCR) performed. This left us with the option to use Adobe Acrobat’s “Recognize Text” feature to manually OCR the files, but the potential textual errors left us wondering if this was our only option. We briefly flirted with an alternate plan: uploading the PDFs to the scholarship archive as image scans, but accompanying them with text-only versions, acquired by downloading HTML copies of the patents (using a modified form of the Wget spreadsheet) and converting them to text-only. However, we assessed that this option would introduce complications into the future workflow for depositing new patents, and the error rate of the Acrobat OCR method was deemed an acceptable risk.

Indeed, throughout the initial upload, our focus swiftly began to shift beyond the task at hand to dealing with monthly patent searches and deposits. Once the payload for the initial batch of patents was described and ready for ingest, we turned our attention to devising a feasible, ongoing plan to add all new Rice patents. The team eventually agreed that moving forward, the government information librarians would conduct a monthly search for all new Rice-generated patents; then, following a set of instructions, they would acquire the PDFs, OCR them, and manually add them to the repository. Because the FPO website allowed the librarians to not only conduct their monthly searches, but store the search results, export metadata, and download PDFs, much of the workflow was written with the website in mind, though perhaps not on a long-term basis. (See the following section for a more detailed discussion.)

Results and Further Work

The initial batch of 365 patents was added to the Rice Digital Scholarship Archive in early May of 2015. As predicted from the number of Rice patents issued monthly between 2014 and 2015, the ensuing months brought an average of two to three new Rice-related patents each. Besides matching our workload estimates, this pace gave our government information librarians enough practice at depositing material into the Digital Scholarship Archive (a relatively new experience for them) without being overwhelming.

There were a handful of unresolved issues requiring further work. One such issue concerned the rights statement we applied to each patent deposited into the repository. Patent descriptions for utility and plant patents are published into the public domain 18 months after the patent is filed, while design patent descriptions are made available when the patent is granted; accordingly, we assumed our rights description for all patent metadata would be “public domain.” But what about patents that contained copyrighted content? Many patents cite proprietary journal articles which contain information relevant to the patent, but the articles are not included in a download of the patent, so that was not a concern. However, a patent author might elect to include copyrighted images to illustrate his or her point.

We consulted sections 601.08 d and e of the Manual of Patent Examining Procedure (MPEP) containing the rules and regulations governing patents (MPEP 2014). Section 601.08-d states that a “copyright or mask work notice may be placed in a design or utility patent application adjacent to copyright and mask work [2] material contained therein.” Section 601.08-e then dictates that the notice must indicate that the owner “has no objection to the facsimile reproduction by anyone of the patent document or the patent disclosure, as it appears in the Patent and Trademark Office patent file or records, but otherwise reserves all (copyright or mask work) rights whatsoever.” Because this statement is already included in a patent, there was no need to post a specific copyright disclaimer; however, rather than classifying each repository entry as public domain, we dispensed with a standard rights statement and added text to the patent collection page encouraging patrons and researchers with questions to contact Fondren Library’s Patent and Trademark Resource Center.

The other major issue concerned the source of our searches, metadata, and PDF downloads in the ongoing workflow. Though quite useful, the FPO website is, ultimately, a non-governmental, for-profit website; while neither of those descriptions is intended to disparage the site or condemn its contents, we as librarians are constantly mindful of becoming dependent on impermanent sources of information. In the future, we would like to pursue a coding project that would not only download new patents, but mine the text for metadata extraction, making the project completely in-house and independent of sources outside of the USPTO. But for now, we will continue to make use of our workflow as-is.

We will also use various methods to assess the success of the project. First, we will monitor usage statistics to determine future directions. The Digital Scholarship Archive conveniently gathers statistics such as top 10 visited items, total visits per month, top country views, and top city views. If usage statistics are low, do they improve after a publicity campaign? If the national (or international) impact appears to be more than the local, to what groups do we need to advertise more? What departments on campus do we need to share usage statistics with to raise the profile of the university? Finally, we will make use of our internal usage statistics, while also soliciting feedback from our users via a comments/suggestions request on the patent collection home page.

With an eye to future improvements and data from the past and present, the Rice patent collection has the potential to become and remain a vital part of the Rice Digital Scholarship Archive and a window to the world revealing the research conducted at Rice University.


The authors wish to acknowledge the efforts of the entire Fondren Library team that contributed to the success of this project: Siu Min Yu, Monica Rivero, Shannon Kipphut-Smith, Kathy Weimer, and Lisa Spiro.

End Notes

[1] The United States Patent and Trademark Office issues three kinds of patents: utility, design, and plant. Design patents are granted to new, original, and ornamental designs for articles of manufacture; plant patents are awarded to researchers for the invention or discovery (and asexual reproduction) of distinct and new plant varieties.

[2] Under United States Code Title 17 § 901(a), a mask work is broadly defined as the topographic (layout) creation embodied in the design of an integrated circuit. Title 17, sections 901 to 914 form the Semiconductor Chip Protection Act of 1984, which protected the intellectual property rights associated with circuit design.


References

17 U.S.C. § 901(a)(2) (2015). Available from:

Cummings, Brian. 2014. “The changing landscape of intellectual property management as a revenue-generating asset for U.S. research universities.” George Mason Law Review 21(4): 1027-1047.

Manual of Patent Examining Procedure. Section 601.08: Detailed description and specification of the invention [R-11.2013] (9th ed. Rev. March 2014) Available from:

Nicol, Dianne. 2008. “Strategies for dissemination of university knowledge.” Health Law Journal Annual 2008: 207-234.

Striukova, Ludmila. 2009. “Value of university patents as a determinant of technology transfer.” International Journal of Technology Transfer and Commercialisation 8(4): 379-391.

“Top 100 Worldwide Universities Granted U.S. Utility Patents in 2014.” National Academy of Inventors and the Intellectual Property Owners Association. Available from:

Wallmark, J. Torkel. 1997. “Inventions and patents at universities: the case of Chalmers University of Technology.” Technovation 17(3): 127-139. DOI: 10.1016/S0166-4972(97)00094-1

About the Authors

Scott Carlson is the Metadata Coordinator at Fondren Library, Rice University. He received his MLIS from Dominican University in River Forest, Illinois, and an Archives Certificate in Digital Stewardship from Simmons College. Scott is also the co-founder of Indie Preserves, a website that provides practical preservation advice to independent music labels and bands. His Twitter handle is @scottythered.

Linda Spiro is a Government Information Librarian at Fondren Library, Rice University. She has served on the mentoring, strategic plan, and bylaws committees of the Patent and Trademark Resource Center Association (PTRCA). For the Government Documents Roundtable (GODORT) of the American Library Association (ALA), she has served as Secretary and chair of many committees. She has master’s degrees from the University of Texas at Dallas (special education) and the University of North Texas (library science).


ISSN 1940-5758