Parsing and Matching Dates in VIAF

Jenny A. Toves; Thomas B. Hickey

Issue 26, 2014-10-21

Parsing and Matching Dates in VIAF

The Virtual International Authority File (OCLC Online Computer Library Center 2013) http://viaf.org is built from dozens of authority files with tens of millions of names in more than 150 million authority and bibliographic records expressed in multiple languages, scripts and formats. One of the main tasks in VIAF is to bring together personal names which may have various dates associated with them, such as birth, death or when they were active. These dates can be quite complicated with ranges, approximations, BCE dates, different scripts, and even different calendars. Analysis of the nearly 400,000 unique date strings in VIAF led us to a parsing technique that relies on only a few basic patterns for them. Our goal is to correctly interpret at least 99% of all the dates we find in each of VIAF’s authority files and to use the dates to facilitate matches between authority records.

Python source code for the process described here is available at https://github.com/OCLC-Developer-Network/viaf-dates.

by Jenny A. Toves and Thomas B. Hickey

Background

VIAF merges records from various national and trans-national authority files into clusters. These authority files provide identifiers and often standard names for people, titles, series and concepts. There are many challenges in matching across the files, including recognizing and parsing dates associated with personal names. Bibliographic dates can be surprisingly complicated and over the years libraries have come up with a mélange of approaches to capture the aspects of dates that they are interested in.[1][2] Dates can appear in a number of different fields, including name headings, separate date fields, and within notes. VIAF needs to recognize the presence of dates and convert them into a uniform format that it can use when matching and comparing records about people.

How VIAF Uses Dates

While contemporary files such as ORCID (http://orcid.org/) place great reliance on email addresses and institutional associations, library files have in general eschewed such information and rely heavily on birth and death dates to disambiguate people. VIAF tries to match records that describe the same entity in multiple files, and dates associated with people are important pieces of information when matching names across authority files. Each VIAF cluster that represents a person, maintains a date range (described below in Storing and Comparing Dates). If VIAF can identify that two names are associated in some way (for example the forms are similar or have matching cross references) and share both a birth and death year, this is considered a very strong indication of a match, second only to more specific information such as matching unique identifiers. While on the face of it matching dates would seem straight forward, there are some subtleties even after the dates have been parsed.

Most of the dates VIAF needs are specifically coded as dates, so recognizing them is not difficult. However VIAF also needs to pick dates out of unstructured text, such as notes in authority records. For this we use some fairly simple regular expressions [3] to recognize and parse these dates. Appendix 3 has examples of these.

Here are examples of some of the main variants we see:

Ranges. Typically used to indicate birth and death
- 1903-1993
Flourished dates (when the person was active). May or may not be a range.
- 1130 fl.
Circa dates. Approximate dates
- ca. 1507-1584
Other approximations
- 4./5. Stol.
- 10..-11..
Showing month and day
- 1942 June 24-
- 1943 ún. 23.-?
Different scripts
- ٢٧٣م
Different calendars
- 604-672
Of course, these can be combined in various ways
- 14..-1472
- 14??-
- –ca 1475

Here are some ways we see December 3, 1949 entered:

1949 (December 3)-
1949 Dec. 3-
1949 3 déc.-
1949 December 3-
1949 (Dec. 3)-
03.12.1949-

We find 27,000 ways to say what we interpret as flourished around 1850. Here are the patterns that occur more than a thousand times (the first character is the MARC subfield code):

f18..-19..
d18??-
f18..-18..
d18..-19..
f18..-18..?
f17..-18..
f18..-18..
d18..-18..,
d18..-18..
d18..-….
fca 18–
d17..-18..
f18..-19..?
f17..-18..?
f18..-19..
f17..-18..

Although our previous system was able to share much of the code used to parse dates from various authority files we became aware that we had no way to make global changes in how dates were handled. One change we wanted to make was how we handled flourished dates, and realized we needed to touch the dozens of routines that handle the details of each authority file.

An interesting observation is that while the domain, that is how dates are represented, is almost limitless, the range of actual individual dates representing a single day is fairly small. Given two thousand years of publication there are only about 730,000 different dates of interest (2000 years x 365 days/year). Since most dates we encounter are after 1700, only about 115,000 dates will be encountered with any regularity and the range of the transformation from text to our stored dates is fairly small. Another observation is that while we characterized the domain as ‘almost limitless’, in fact, within date subfields we find 384,000 actual unique date strings in all the VIAF authority source authority records, even before normalization (this is larger than the 115,000 individual dates both because of variant representations and because many of these strings represent a date range). A table [4] of all these date strings, ranked by frequency is available on GitHub.

Identifying dates in authority and bibliographic records

VIAF is interested in various dates in both authority and bibliographic data. We tend to divide dates into four categories

Publication dates
Personal dates in a field with a specified date format
Personal dates in free text notes
Personal dates in a heading

The first two of these are fairly straight forward. While publication dates can be complicated, this data is considered of low relevance for matching names so we simplify it by just using the equivalent of the MARC-21 Date1 field.[5] This is a coded field, but even that can have a number of variant forms, such as 198U to indicate an approximate date in the 1980’s. For the purpose of VIAF we avoid publication dates that are indicated as questionable or are date ranges and discard dates that are not all digits, along with a few patterns, such as ‘0000’ and ‘9999’, which do not look like valid dates.

When a date is in a field such as the MARC-21 046 [6], parsing is also quite straight-forward since the ISO 8601 standard on the Representation of Dates and times [7] is specified.

Picking up dates from free text notes fields can be a challenge, but prior to 2009, MARC-21 authority records did not have a structured way of entering date, month and days associated with people, so this information is often embedded in notes and needs to be found. With the many languages and scripts in VIAF source authority records, identifying the type of date in free text can be difficult, e.g. whether it is a publication date, birth, death, or some other type of date. Again, we simplify the process by looking for key words and then picking up only the most straightforward of the dates.

While all these dates are important we have put the most effort into parsing the dates contained in the personal headings themselves. A typical heading looks like:

‡a  Shakespeare, William, ‡d  1564-1616

‡a  Shakespeare ‡b  William ‡f  1564-1616

Where the ‡d (MARC-21) and ‡f (UNIMARC) indicate the date subfield with a hyphen indicating a birth-death range. Our approach to parsing these dates is the topic of the rest of the paper.

Storing and Comparing Dates

Each cluster in VIAF representing a person has a date range stored as two dates (min and max) consisting of a year, month and day. Associated with the date range is an indication of what they represent. We have three values:

Lived – The dates are birth and/or death dates
Flourished – The dates show when the person was active
Circa – The dates are approximate

Each section of the date (Year-Month-Day) is set to 0 if not available. The dates themselves are kept in the standard Western calendar (often called the Gregorian calendar), using negative dates for dates before 1 A.D./C.E., turning 1 B.C.E. into -1. While best practice when using negative numbers for B.C.E. dates is to use astronomical conventions [8] [1] which use a year 0, we do not, maintaining consistency with the common practice of not having a year 0 in B.C.E dates and reserving 0 to indicate no data.

Our most basic test for comparing date ranges is to see whether two sets of date ranges are compatible:

If the both the min and max years are zero in both dates, there is no conflict
For min dates after 1400:
- One max date cannot be less than the other min date
- Max dates must be within 120 years of the other min date when present
If either pair is marked Flourished
- Min and max dates must be within 100 years of each other if present
If either pair is marked Circa
- Min and max dates must be within 10 years of each other if present
Otherwise both are Lived
- Min and max dates must be within 3 years of each other if present
- If both have months specified, they must match
- If both have days specified, they must match

The 3 year range was arrived at pragmatically (we started at 0 and worked up). For various reasons dates associated with people tend to vary by a few years, and allowing some variation in them helps bring records together.

When comparing dates within a single authority file (e.g. to look for duplicates) the fully parsed dates essentially have to match exactly to not conflict.

When only one of the min/max dates is present for comparing we get a single-date match which is much weaker than the earlier mentioned double-date match.

Mapping Dates to Patterns

Our original approach grew from parsing dates in two authority files to eventually three dozen. Although our processing for each file shared code, each had its own dedicated parsing routines. Driven by necessity, we evolved the current system that relies on mapping dates to patterns shared across all authority files that are then used to guide the parsing.

Here are some of the pieces we need to identify:

Type of date: flourished, circa, birth/death dates
Era, e.g. A.D., B.C., B.C.E for Gregorian dates
Reign and era dates for Japanese names
Century indications
Months
Alternate calendars, e.g. Hijri (Arabic) dates
Wild card characters
Separating the min and max dates (e.g. birth and death)

Parsing a date subfield is accomplished by morphing the input data into a pattern that guides parsing. There are four basic steps:

Normalization of the input data
Splitting the input data into min (e.g. birth) date and max (e.g. death) date sections
- If we have a Han (Japanese) date then resolve it via lookup table and we are done[9]
- If all parts of the date are century dates then resolve them and we are done
Resolving any wildcards
Parsing the min/max date sections

As an example, “1949 December 3-“ is converted to the pattern “NNNN month NN”. This pattern is then matched to this regular expression:

NNNNmonthNN = re.compile('(\d\d\d\d) +([^0-9. ]*)\.? ?(\d\d?)')

Which is then used to pick out the year, month and day from the input string. Appendix 4 shows the 32 recognized patterns, and the code for all the patterns can be found in dateFld.py.[10]

Normalization starts with lowercasing the data and decomposing Unicode characters. Next some non-Latin digits and punctuation characters such as hyphens and question marks are converted to Latin equivalents. Digits are then converted to “N” and anything that looks like a month name is converted to “month”. We have built these tables up from examples encountered in dates, augmented by month names and abbreviations for various locales known to the Python date libraries. Here are some examples of the transformation:

1999 -> NNNN
1999 January 10 -> NNNN month NN
1947-1999 -> NNNN-NNNN
30 B.C. -> NN bc
Circa 1920 -> circa NNNN

Splitting tries to find a single hyphen and use that point to split the date into min/max sections. If no hyphen exists, then look for evidence of this being a death date (ie. “d 1946”). If there is no hyphen and no evidence of a death date then it is assumed to be a birth date. If there are multiple hyphens, then we have a manually constructed table of 75 common patterns[11] that can be split. An example is “NNNN-NNNN-“. The second hyphen is probably a typo so we ignore it and split the date at the first hyphen. Here are some examples of the transformation:

NNNN-NNNN -> (NNNN, NNNN)
died NNNN -> (None, NNNN)
NNth century-NNNN -> (NNth century, NNNN)
NNNN -> (NNNN, None)

Resolving wildcards means finding things like “NNN?”, “NNN.” and “NNNN or N” and modifying the pattern, the input string and the datetype. An example is the string “NNN?” in a birth pattern with a date string of “197?”. The input data is changed to ‘1979’. The pattern is changed to ‘NNNN’ and the date type is set to ‘circa’. ‘circa’ allows ±10 years to match so this birth date will match any other birth date from 1969 to 1989. There are some patterns that we are unable to parse, usually because they are ambiguous. Appendix 1 has a list of these patterns.

Post parsing we apply Hijri and BC adjustments. Arabic dates are converted to their Latin form. If both min and max dates are BC (or if only the max date is BC), multiply min and max by -1.

The sample VIAF code in GitHub has a stub for the Hijri code. Since it has its own licenses it needs to be obtained separately.[12]

During this transformation, the dates undergo some ‘sanity’ checks, the most important being:

Max >= min date
Max date – min date <= 110 years
Month values must be between 01 and 12 inclusive
Day values must be appropriate for the given month

Testing

When we receive a new authority file to add to VIAF, one of our first tasks is to identify the dates in the records and decide how to parse them. To do this we extract all the dates found in the structured fields, count them and rank them. These strings can then be put through our existing date parsing routines and then visually compared to the input. Since we have the dates ranked by frequency we concentrate on the dates we will encounter the most, making sure that the interpretation looks correct and that we are handling at least 99% of the dates correctly. The date patterns tend to be clustered, so this task is not as difficult as it might seem.

Since we now have a uniform approach to dates across the various authority files that contribute to VIAF, we can also consolidate all the dates we have found across all of them. This lets us spot patterns that, while fairly rare in individual files, across all of VIAF are large enough to merit attention. Appendix 2 shows the most common patterns along with an indication of how often they are encountered.

Limitations

We recently added Japanese ‘reign’ dates to VIAF. While there are many other forms of dates in the world [13] [14], we are comfortable that we are covering the vast majority of dates we encounter.

Works Cited

[1] Library of Congress. “Extended Date/Time Format (EDTF) 1.0 Submission.” Library of Congress. January 13, 2012. http://www.loc.gov/standards/datetime/pre-submission.html (accessed 11 11, 2013).

[2] Dublin Core Metadata Initiative. DCMI Date Working Group. http://dublincore.org/groups/date (accessed 04 24, 2013).

[3] Wikimedia. “Regular expression.” English Wikipedia. http://en.wikipedia.org/wiki/Regular_expression (accessed 11 11, 2013).

[4] https://github.com/OCLC-Developer-Network/viaf-dates/blob/master/datefields.txt

[5] Network Development and MARC Standards. “MARC 21 Format for Bibliographic Data: 008:.” Network Development and MARC Standards. 09 01, 2011. http://www.loc.gov/marc/bibliographic/bd008a.html (accessed 04 26, 2013).

[6] Network Development and MARC Standards Office, Library of Congress. “MARC 21 Format for Authority Data: 046: Special Coded Dates.” 04 05, 2011. http://www.loc.gov/marc/changes-rda-046.html (accessed 05 25, 2013).

[7] WikiMedia. “ISO_8601.” Wikipedia. 04 24, 2013. http://en.wikipedia.org/wiki/ISO_8601 (accessed 04 26, 2013).

[8] Espenak, Fred. “Year Dating Conventions.” NASA Eclipse Web Site. February 25, 2008. http://eclipse.gsfc.nasa.gov/SEhelp/dates.html (accessed 02 10, 2014).

[9] https://github.com/OCLC-Developer-Network/viaf-dates/blob/master/handates.py

[10] https://github.com/OCLC-Developer-Network/viaf-dates/blob/master/dateFld.py

[11] https://github.com/OCLC-Developer-Network/viaf-dates/blob/master/overrides.py

[12] Alsadi<alsadi@gmail.com>, Muayyad Saleh. “hijra/hijra.py.” http://github.com/ojuba-org. 2006-2008. https://github.com/ojuba-org/hijra/blob/master/hijra.py (accessed 08 26, 2014).

[13] Reingold, Edward M., and Nachum Dershowitz. Calendrical calculations. Cambridge, UK: Cambridge University Press, 2001.

[14] Blinn, Peter. “Today’s date in over 400 more-or-less obscure foreign languages.” Curious Notions. 2014. http://www.curiousnotions.com/todays-date.asp (accessed 03 18, 2014).

About the Authors

Jenny Toves is a Software Architect in OCLC Research. In addition to VIAF, her current interests include the large scale FRBR clustering of bibliographic records, which is central to OCLC’s linked data program. One aspect of this work is the algorithmic creation of authority records by mining the bibliographic data. She has an undergraduate degree in Computer Information Systems and a M.S. in Computer Science.

Thomas Hickey is Chief Scientist at OCLC where he helped found OCLC Research. Current interests include metadata creation and editing systems, authority control, parallel systems for bibliographic processing, and information retrieval and display. In addition to implementing VIAF, his group looks into exploring Web access to metadata, identification of FRBR works and expressions in WorldCat, the algorithmic creation of authorities, and the characterization of collections. He has an undergraduate degree in Physics and a Ph.D. in Library and Information Science.

Appendix 1

Unhandled date patterns with multiple hyphens.

These patterns are not handled because the meaning is ambiguous or because we haven’t seen it before. This data is monitored to look for new overrides to add to overrides.py (https://github.com/OCLC-Developer-Network/viaf-dates/blob/master/overrides.py).

UNHANDLED	ca NN- –	2330
UNHANDLED	ca NN – –	110
UNHANDLED	NN- –	69
UNHANDLED	NN-..-	38
UNHANDLED	N–	36
UNHANDLED	ca NN —	34
UNHANDLED	NN-ה/NN-המאה ה	30
UNHANDLED	-caNNN of -caNNN	17
UNHANDLED	NN – –	17
UNHANDLED	-NNN or -NNN	16

Appendix 2

This table summarizes the 40 most common date patterns covering 99.4% of the dates VIAF encounters. See https://github.com/OCLC-Developer-Network/viaf-dates/blob/master/appendices.py for code to generate this table.

Pattern	Occurences	Example	Parsed date	Date type	% of total
NNNN	19563491	d1947-	[1947, ”, ”]	lived	94.19%
.	372270	d(1947- ).	[0, ”, ”]	lived	95.99%
NNN	179488	d900-talet	[900, ”, ”]	lived	96.85%
NNNN?	145978	f1950?-….	[1950, ”, ”]	circa	97.55%
NNth century	126436	d20th century	[1900, ”, ”]	flourished	98.16%
ca. NN. jh.	47126	dca. 20. Jh.	[1900, ”, ”]	flourished	98.39%
…	43903	d1977-…	[0, ”, ”]	lived	98.60%
ca. NN./NN. jh.	17848	dca. 20./21. Jh.	[1900, ”, ”]	flourished	98.69%
NNth cent.	16349	d17th cent.	[1600, ”, ”]	flourished	98.76%
NN	15008	d19-….	[19, ”, ”]	lived	98.84%
ca.	11056	dca. Gegenwart	[0, ”, ”]	circa	98.89%
NNNN month NN	10839	d1921 October 30-	[1921, ’10’, ’30’]	lived	98.94%
?	10562	d?-….	[0, ”, ”]	lived	98.99%
NN. jh.	8852	d20. Jh.	[1900, ”, ”]	flourished	99.04%
NN./NN. jh.	6758	d20./21. Jh.	[1900, ”, ”]	flourished	99.07%
NNth cent	6620	d19th cent	[1800, ”, ”]	flourished	99.10%
N…	6342	d18..-1…	[1850, ”, ”]	flourished	99.13%
ca. N. jh.	5796	dca. 6. Jh.	[500, ”, ”]	flourished	99.16%
NNe e.	5136	d18e E.	[1700, ”, ”]	flourished	99.18%
NNNN month N	4886	d1956 November 7-	[1956, ’11’, ’07’]	lived	99.21%
ca. N. h. NN. jh.	4783	dca. 2. H. 20. Jh.	[1900, ”, ”]	flourished	99.23%
ca. NN.jh.	4016	dca. 20.Jh.	[1900, ”, ”]	flourished	99.25%
NN. stol.	3964	d19. stol.	[1800, ”, ”]	flourished	99.27%
Nth century	3787	dactive 9th century	[800, ”, ”]	flourished	99.29%
NN.NN.NNNN	3437	d09.06.1703-	[1703, ’06’, ’09’]	lived	99.30%
N. jh. v. chr.	3213	d3. Jh. v. Chr.	[-300, ”, ”]	flourished	99.32%
?.	2836	d(1892-?).	[0, ”, ”]	lived	99.33%
NNe eeuw	2722	d18e eeuw	[1700, ”, ”]	flourished	99.34%
N. jh. n. chr.	2563	d5. Jh. n. Chr.	[400, ”, ”]	flourished	99.36%
ca NN- –	2330	fca 18- –	[0, ”, ”]	circa	99.37%
ca. N. hälfte NN. jh.	2250	dca. 2. Hälfte 17. Jh.	[1600, ”, ”]	flourished	99.38%
sec. xvi	2225	dsec. XVI	[1500, ”, ”]	flourished	99.39%
ca. ende NN. jh./anfang NN. jh.	2105	dca. Ende 20. Jh./Anfang 21. Jh.	[1900, ”, ”]	flourished	99.40%
NNth/NNth cent.	2041	d17th/18th cent.	[1600, ”, ”]	flourished	99.41%
sec. xvii	1904	dsec. XVII	[1600, ”, ”]	flourished	99.42%
N	1813	d1-1	[0, ”, ”]	lived	99.43%
NNth/NNth cent	1797	d17th/18th cent	[1600, ”, ”]	flourished	99.44%
ca. N./N. jh.	1729	dca. 5./6. Jh.	[400, ”, ”]	flourished	99.44%
NNNN ?	1636	f1577 ?-1650	[1577, ”, ”]	circa	99.45%
ca. NN./NN.jh.	1415	dca. 20./21.Jh.	[1900, ”, ”]	flourished	99.46%

Appendix 3

Regular expressions used to find birth and death dates in free text note fields in MARC authority records.

## DD. MM. YYYY.

# ('', 'narozen', '31', '01', '1962')

birth3 = re.compile('(^| )(narozen) (\d{1,2}) ?(\d{1,2}) ?(\d{4})')

death3 = re.compile('(^| )(zemrel) (\d{1,2}) ?(\d{1,2}) ?(\d{4})')

# dec 21, 1903

# ('', 'b.', 'jan 31 1962', 'jan ', '31 ', '1962')

birth1a = re.compile('(^| |\()(b\.|n\.) ((\D{3,9} )?([0-9]{1,2},? )?([0-9]{4}))')

death1a = re.compile('(^| |\()(m\.|d\.) ((\D{3,9} )?([0-9]{1,2},? )?([0-9]{4}))')

birth1b = re.compile('(^| )(born|ne|nee|birth|dob) (([a-z]{3,9} )?([0-9]{1,2},? )?([0-9]{4}))')

death1b = re.compile('(^| )(died) (([a-z]{3,9} )?([0-9]{1,2},? )?([0-9]{4}))')

# 21 dec, 1903

# ('', 'b.', '31 jan 1962', '31', 'jan', '1962')

birth2a = re.compile('(^| |\()(b\.|n\.) (([0-9]{1,2}) (\D{3,9}) ([0-9]{4}))')

death2a = re.compile('(^| |\()(m\.|d\.) (([0-9]{1,2}) (\D{3,9}) ([0-9]{4}))')

birth2b = re.compile('(^| )(born|ne|nee|birth|dob) (([0-9]{1,2}) ([a-z]{3,9}) ([0-9]{4}))')

death2b = re.compile('(^| )(died) (([0-9]{1,2}) ([a-z]{3,9}) ([0-9]{4}))')

Appendix 4

There are 32 known patterns which map to 10 regular expressions that are used to parse the dates. The month table (monthLookup in https://github.com/OCLC-Developer-Network/viaf-dates/blob/master/dateFld.py) is used to convert month names to 01-12.

knownDatePatterns = {

'NNNN month NN' : parseNNNNmonthNN,

'NNNN monthNN' : parseNNNNmonthNN,

'NNNN month N' : parseNNNNmonthNN,

'NNNN monthN' : parseNNNNmonthNN,

'NNNN month' : parseNNNNmonth,

'NN month NNNN' : parseNNmonthNNNN,

'NN month NNN' : parseNNmonthNNNN,

'N month NNNN' : parseNNmonthNNNN,

'month NN NNNN' : parsemonthNN_NNNN,

'month N NNNN' : parsemonthNN_NNNN,

'NNNN NN month' : parseNNNN_NNmonth,

'NNNN N month' : parseNNNN_NNmonth,

'NNNN NN NN' : parseNNNN_NN_NN,

'NNNN-NN-NN' : parseNNNN_NN_NN,

'NNNN/NN/NN' : parseNNNN_NN_NN,

'NNNN-NN-NN-' : parseNNNN_NN_NN,

'NNNN N N' : parseNNNN_NN_NN,

'NNNN/N/N' : parseNNNN_NN_NN,

'NNNN/N/NN' : parseNNNN_NN_NN,

'NNNN/NN/N' : parseNNNN_NN_NN,

'NN.NN.NNNN' : parseNN_NN_NNNN,

'NN-NN-NNNN' : parseNN_NN_NNNN,

'NN/NN/NNNN' : parseNN_NN_NNNN,

'NN.N.NNNN' : parseNN_NN_NNNN,

'N.NN.NNNN' : parseNN_NN_NNNN,

'N.N.NNNN' : parseNN_NN_NNNN,

'NN.NNNN' : parseNN_NNNN,

'NNNNNNNN' : parseNNNNNNNN,

'NNNN' : parseNNNN,

'NNN' : parseNNNN,

'NN' : parseNNNN,

'N' : parseNNNN,

}

Subscribe to comments: For this article | For all articles

Parsing and Matching Dates in VIAF

Background

How VIAF Uses Dates

Identifying dates in authority and bibliographic records

Storing and Comparing Dates

Mapping Dates to Patterns

Testing

Limitations

Works Cited

About the Authors

Appendix 1

Appendix 2

Appendix 3

Appendix 4

Leave a Reply

Current Issue

Previous Issues

For Authors