Issue 52, 2021-09-22

Editorial : The Cost of Knowing Our Users

Some musings on the difficulty of wanting to know our users’ secrets and simultaneously wanting to not know them.

by Mark Swenson

For the second time in the past year we have an article that provokes concerns with patron privacy. Despite some reasonable objections to printing this article because of the implied endorsement of a cavalier attitude toward patron data, the editorial committee has decided that the overall quality of the article merits its publication. We have added a note to the top of the article which acknowledges that there are considerable problems with elements of patron privacy. In this editorial I will mention some of the problems that come with using patron data in this way, as well as express some doubt that there are technical solutions to handling this data that could make it truly anonymized.

As agencies with limited resources and a base of users who need specific information services, libraries are caught in a hard place when it comes to collecting data on user behavior to improve services. We have always collected data about our collections and users. Many of the articles that run in this journal focus on how to dissect, clean, parse, and use that data. However, we have entered a pact with our users which states that what they individually do will never be divulged to anyone else. Consequently, library user data tends to be very broad and generic. For example, we know that a given book has been checked out 100 times, but we have no idea whether the book was checked out by 100 different people, by 50 people twice, or by one person 100 times.

We know that our computer systems have the granular information that can answer this kind of question. So it’s possible to mine that and create new lists linking users and resources and get the kind of detailed use information that would be great for evaluating a resource. This is what this issue’s controversial article describes. In doing this, however, a library creates a record that theoretically makes it much easier for someone to misuse patron data and turn the question around: asking not what resources are being used by individuals but which individuals are using specific resources. When we create data like this it becomes difficult to control its future. We may have the best intentions in doing it, but it’s hard to be sure that everyone who can obtain it will respect the privacy of users equally.

This is incredibly frustrating, so we keep delving into this data mine to figure out a way that maybe we can get that kind of information about users while somehow keeping their information anonymous. If we could do that, then maybe we could know if it would be worthwhile using limited resources to buy the next book in the series since 100 people will probably check it out too, or if it is more likely that we will just be buying it for the one person.

As a thought experiment I want to briefly imagine exactly what we’d need to do to successfully have truly anonymized individualized user data, and why ultimately this probably isn’t possible. My goal here isn’t to sketch an actual viable plan, but rather to just show how hard this really is.

Because it’s a good example of a scenario where this is at least imaginable, I want to use the case of database access logs. From the logs we are taking only the following information: a user’s barcode, the name of the resource, and the date and time at which the resource was accessed.

It is important to note here that one of the many problems libraries have in trying to maintain user privacy is that the logs in question are often not under our direct control. In such cases, vendors may be collecting this information without giving libraries any say over what data is being collected, how long that data is kept, and how frequently it is deleted.

Without control over that data, even a hypothetically perfect anonymization of our own logs is an empty promise: the data could be obtained directly from the vendor, or the anonymization undone by comparing our records against the vendor’s. In most cases with vendor-collected data, I think that it should be a high priority for libraries to establish policies with vendors that value patron privacy first and foremost.

For the purpose of this thought experiment we will imagine data which the library has complete control over. In reality, such a limitation would make doing the rest of the work to obtain the holy grail of anonymized data largely pointless, but it’s the only good starting point.

Let us also assume that we are working with a large population of over ten thousand users (I work in a suburban public library and a user base of this size is typical here) and a modest collection of resources. This scenario avoids problems one would be likely to encounter at a small academic institution, for example, where it might be possible to identify a user because the number of persons interested in a topic is small and there might be one resource that is almost exclusively used by them.

First, I need to at least obfuscate the barcodes, the single piece of data that can be tied to an individual. If I am storing this in a database, and the database is hacked or requested as the result of a legal order, I don’t want to lose control of a list of barcodes that could then be used to figure out the identities of individual users. With modern technology, the best way to obfuscate data like this is a cryptographic hash, which turns my 14-digit barcodes into much longer (maybe 32-72 character) strings of letters and numbers that look random.
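As a minimal sketch of that step (in Python, assuming SHA-256 as the hash; the barcode value is invented for illustration):

```python
import hashlib

barcode = "21234000123456"  # hypothetical 14-digit patron barcode

# SHA-256 produces a fixed-length, 64-character hex digest that reveals
# nothing about the barcode on its own.
hashed_barcode = hashlib.sha256(barcode.encode("utf-8")).hexdigest()
print(hashed_barcode)
```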

At this point things are looking good, but what if the person who gets the data wants to know whether a specific person, whose barcode they already have, is listed in this database? It would be trivial for them to figure out which hashed value matches that barcode just by trying the handful of hash algorithms in common use. So to protect against that I’m going to need to add a salt (some extra random data) to the barcodes to make them harder to guess.
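In code, the salt is simply extra bytes mixed into the input before hashing; a sketch of the salted version (the 16-byte salt length is an arbitrary choice for illustration):

```python
import hashlib
import secrets

barcode = "21234000123456"      # hypothetical barcode
salt = secrets.token_bytes(16)  # 16 random bytes

# The same barcode now produces a different digest for every salt value,
# so an attacker can no longer test a known barcode against the stored
# hashes without also knowing the salt.
salted_hash = hashlib.sha256(salt + barcode.encode("utf-8")).hexdigest()
```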

It is here that the difficulty facing the person who wants to both protect and use this information becomes apparent. If I’m collecting these barcodes over a period of time and want to be able to run a report later that determines which resources are being used by multiple users and which are being used by only one, I need to always use the same salt. But if I use the same salt all of the time, that salt needs to be stored in a place where it could be seen by the same person who obtains a copy of the database.

I can generate a unique salt for every single record in the database and encrypt each salt with a public key. That should thoroughly obscure the information, and it should be impossible for anyone to figure it out unless they obtained my private key. In theory, I can collect this data and only remove the encryption in a safe way to create a nice anonymized report.
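A rough sketch of what that might look like, using Python’s cryptography package with RSA-OAEP (the key file name and the choice of RSA are assumptions made for the example; nothing in this scenario mandates a particular scheme):

```python
import hashlib
import os

from cryptography.hazmat.primitives import hashes, serialization
from cryptography.hazmat.primitives.asymmetric import padding

# Load the library's reporting public key; the private key stays offline.
with open("library_reporting_public.pem", "rb") as key_file:
    public_key = serialization.load_pem_public_key(key_file.read())

def anonymize(barcode: str):
    """Return a salted hash of the barcode plus the encrypted salt."""
    salt = os.urandom(16)  # unique salt for this record
    digest = hashlib.sha256(salt + barcode.encode("utf-8")).hexdigest()
    # Only the holder of the private key can recover the salt later
    # in order to rebuild the hashes for a report.
    encrypted_salt = public_key.encrypt(
        salt,
        padding.OAEP(
            mgf=padding.MGF1(algorithm=hashes.SHA256()),
            algorithm=hashes.SHA256(),
            label=None,
        ),
    )
    return digest, encrypted_salt
```

To actually run a report, whoever holds the private key would decrypt each stored salt and recompute or compare the hashes, which is exactly where the key-management questions below come in.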

But who else has access to this private key? How can I keep it safe without putting my library in a position where the ability to run reports on this set of data requires my secret knowledge? Also, when I run a report and start to observe patterns in the data, even with a large population, certain users may be much easier to identify than I would expect. It might be obvious to a worker at a service desk that an anonymous user who used two dissimilar resources in a short timeframe would likely be a specific person.

With that problem in mind, it becomes necessary to make the user ID hashes unique for each resource. To do this, the barcode, the database name, and the salt all have to go into the hashing process, and then the salt needs to be encrypted and stored for later retrieval. Other problems with the private key and data ownership remain, but we’ve maybe, finally, gotten to a point where, if the stars align correctly, and there are no flaws in the encryption, and the code is all written perfectly, this information is reasonably well anonymized.
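A sketch of that final variation (the resource name is an invented example; under the per-record-salt scheme above, the salt would still be encrypted and stored alongside the hash):

```python
import hashlib
import os

def resource_scoped_hash(barcode: str, resource_name: str, salt: bytes) -> str:
    # Folding the resource name into the hash gives the same patron a
    # different pseudonym for each resource, so usage patterns across
    # resources cannot easily be linked back to one person.
    data = salt + resource_name.encode("utf-8") + barcode.encode("utf-8")
    return hashlib.sha256(data).hexdigest()

salt = os.urandom(16)  # would be encrypted with the public key as above
print(resource_scoped_hash("21234000123456", "Hypothetical Journal Index", salt))
```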

That’s a really hard place to get to, and with the remaining problems around the private key and its ownership, a hard place to stay. On top of the already-mentioned difficulty of balancing restricted access to the key against keeping it safe, the key would still be vulnerable to warrant requests from law enforcement or to unauthorized access on a compromised network. As much as we would really, really like this data, is it worth it to go here? I’m not so sure.

3 Responses to "Editorial : The Cost of Knowing Our Users"


  1. Kristin Briney,

    This editorial fails to address the actual issue with the article in question: the violation of user privacy (namely, the privacy to access resources without surveillance). Instead, the editorial lays out a technical solution to an ethical problem, except that anonymization isn’t actually a solution. It has been shown by researchers such as Narayanan that anonymization simply doesn’t work. The larger issue here is that code4lib published an article that violates patron privacy and justified this violation with a theoretical technical exercise in how we can better protect data that should never have been collected and analyzed in the first place. Patron data is not the new black or the new oil. As Becky Yoose says, data is glitter: it gets everywhere, it’s hard to clean up, and it’s best to never even let it into your house. Under this analogy, code4lib just got glitterbombed and needs to clean up a mess.

  2. Melissa Belvadi,

    I strongly disagree with this statement: “it should be a high priority for libraries to establish policies with vendors that value patron privacy first and foremost”. Rather I think that first and foremost we should establish such policies that value informed patron *choice*. Let our patrons decide where each wishes to make the tradeoff between features and privacy. To adopt your value, we should be negotiating with every publisher to disable on their platforms the features that allow patrons to optionally create “My Researcher” accounts, which usually involve giving their email address, which is a fairly unique identifier easily linked to their human existence. The generations that have widely adopted the use of Facebook and the like have their own values about privacy and we have no business imposing our own more restrictive ones at their expense.

  3. Becky Yoose,

    “Despite some reasonable objections to printing this article because of the implied endorsement of a cavalier attitude toward patron data, the editorial committee has decided that the overall quality of the article merits its publication.”

    Any library worker making such a statement must reexamine their commitment to protecting a patron’s right to privacy at the library. The decision to sum up these critical privacy issues as an “implied endorsement of a cavalier attitude toward patron data” indicates that the editorial board places little to no value on privacy. Instead, they chose to champion code over all else, patrons and professional ethics be damned.

    As the editorial states, this is the *second* article to have significant patron privacy issues. After the publication of the first article, the editorial committee made a statement that they would “pull in outside experts to comment on articles.” [1] When I was called to review the article with a very short turnaround (the email stated that the article was slated to be published later that week), I thought there was some progress in incorporating privacy checks in the review process. Instead, the article was published as-is with one little note on the top of the article stating that there were concerns. In addition, it was implied later on that I would be more than welcome to clean up my feedback to turn it into a more formal, publishable article.

    I would say that the editorial process is broken, but perhaps it’s working by design.

    The more I learn about the editorial process at the journal, the more concerned I am about the editorial committee’s commitment to creating meaningful, substantial change in the process. It is standard for many library journals to ask authors to revise and resubmit articles. I recommended that the article be heavily edited to address the multitude of privacy issues, if the article were to be published at all. [2] However, the current editorial review structure seems not to accommodate revise and resubmit.

    The implied expectation that I would contribute a rebuttal of my own in the form of an article submission or other publication completely misses the point of why this article shouldn’t have been published as-is in the first place. *It describes in detail how to violate professional ethics and patron privacy*. A debate in the scholarly literature doesn’t undo the harm of such an article in the scholarly record. Even if the original article was framed in the context of privacy and ethics, any debate would not and must not be a substitute for an ethics review during the review process.

    Disregarding a patron’s right to privacy seems to be becoming a pattern here, particularly when you brought someone in to point out the privacy red flags only to ignore them. I have been aware that the editorial committee is revising guidelines for guest editors; however, I am not optimistic that this revision for “guests” would solve the root issues present in the overall editorial process. I ask that the editorial committee make the following changes:

    1. Create a code of ethics or ethical guidelines for submission authors – technology isn’t neutral. Librarianship has a code of ethics. Patrons have rights in the library. Anyone working in library technology has to recognize all of this and reflect this in their scholarly output.

    2. Create a rubric or other mechanisms to evaluate submissions on their adherence to professional ethics, including the patron’s right to privacy – library technology journals are not neutral, either. What is and is not published here shapes the library technology discourse. People in the profession consider this journal as the prestige library technology journal. If this journal doesn’t reflect the profession’s ethics or protect the rights patrons have in the library, it codifies this disregard of ethics and patron rights in the scholarly record.

    3. Make more use of “revise and resubmit” during the review process, or make more decisions not to publish if the ethical issues in the original submission ultimately cannot be resolved or adequately addressed.

    4. Revise the process for bringing in external reviewers – please, for the love of everything good and holy, do not bring in external reviewers four days before scheduled publication, only to ignore their input. Ideally, the ethics review would trigger a need for external review early in the editorial process.

    Some people have called for the editorial committee to apologize to me for the treatment I received during this process. I do not want an apology. I want fundamental changes to the submission and editorial processes. If the editorial committee is committed to meaningful change, an excellent first step would be a full retraction of the article in question. Only then can any progress on ethical guidelines for submissions and reviews start in earnest.

    [1] https://journal.code4lib.org/articles/15340#comment-2745195
    [2] https://journal.code4lib.org/articles/16087#comment-2745444


ISSN 1940-5758