Comparing Lists of Works Cited with Regular Expressions

XKCD 208: a programmer swings in on a rope to save the day with regular expressions
Obligatory XKCD (#208, released under a Creative Commons Attribution-NonCommercial 2.5 License)

As part of a recent project, I had to compare my list and a co-author’s list of works cited in a chapter. Since we started five and a half years ago, one co-author dropped out and versions of the files got confused between different people in the project, some forthcoming publications had appeared or changed venues, and the first entries were written long before we had a style guide. It was very important to make sure that every work cited appeared in the bibliography, and to reduce the length of the chapter as much as possible by removing references to works which are not cited. Since the lists of works cited contained 150 and 180 entries, many of which fill several lines in print, comparing the two lists was going to be a tedious task. And so I turned to the powerful arts of a dead tongue which I had not invoked since I learned it from German and Indian adepts in a distant land: the language of shell scripting and regular expressions.

My goal was to reduce each list to the minimum information that lets me compare it with the other: the author or authors, date of publication, and a short version of the title. Once I had a list of works cited, I could check the other details of the bibliography in the actual file.

The first step was re-familiarizing myself with the sed and diff commands and the regular expression syntax which they use. Teaching examples tend to assume that [A-Za-z] represents ‘all letters.’ Back when I learned this, the assumption that text was in ASCII did not seem so hilarious.

  • 5.5 Character Classes and Bracket Expressions (for dealing with real language, which is stored in Unicode, not toy language in ASCII)
  • 5.7 Back-References and Subexpressions (how sed names the parts of a string found by the regular expression)
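To make those two sections of the manual concrete, here is a minimal sketch rather than anything from the project itself. The function names year_first and letters_only are made up for illustration. The first shows how \( ... \) captures part of a match and \1, \2 refer back to it; the second uses a POSIX character class, which in a UTF-8 locale should cover accented letters that [A-Za-z] is not guaranteed to.

```shell
#!/bin/sh
# Move the year in front of the author: "Smith, John (1901) ..." -> "1901 Smith, John".
# In basic regular expressions a bare ( is a literal bracket, while \( ... \)
# marks a subexpression; \1 and \2 refer back to the first and second one.
year_first() {
    sed 's/^\(.*\) (\([0-9]\{4\}\)).*$/\2 \1/'
}

# Delete every character that the current locale does not count as a letter.
# [[:alpha:]] follows the locale, so in a UTF-8 locale it should also keep
# letters such as ä or é; [A-Za-z] only promises the unaccented ones.
letters_only() {
    sed 's/[^[:alpha:]]//g'
}

printf '%s\n' 'Smith, John (1901) Silly Book title. Utopia: Unpressed.' | year_first
printf '%s\n' 'Heller, A. (2010).' | letters_only
```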

The next step was creating a small set of entries to test the script on. The final iteration looked like this:

Babelon, Ernest (1893). Catalogue des monnaies grecques de la Bibliothèque Nationale: Les Perses achéménides.  Paris: C. Rollin and Feuardent. ark:/12148/bpt6k399051b
Smith, John (1901) Silly Book title. Utopia: Unpressed.
Bichler, R. (forthc.) Herodotus and the perception of the Persian Empire. Some observations from a historical and methodological perspective. In: P. Mack and J. North (eds.), The Afterlife of Herodotus and Thucydides, forthcoming.
Rollinger, R. (2014a). Das teispidisch-achaimenidische Großreich. Ein ‘Imperium’ avant la lettere? In M. Gehler and R. Rollinger (eds.), Imperien in der Weltgeschichte. Epochenübergreifende und globalhistorische Vergleiche 1. Wiesbaden: Harrasowitz, pp. 149–192.
Heller, A. (2010). Das Babylonien der Spätzeit (7.-4. Jh.) in den klassischen und keilschriftlichen Quellen (Oikumene 7). Berlin: Verlag Antike.
Briant, Pierre (1997b) Bulletin d'Histoire Achéménide (I). Topoi Supplément 1. Recherches récentes sur l'Empire achéménide (Lyon: De Boccard) pp. 5-127
Sancisci-Weerdenburg, H. and Kuhrt, A. (eds.) (1990). Centre and Periphery: Proceedings of the Groningen 1986 Achaemenid History Workshop (Achaemenid History 4). Leiden: Nederlands Instituut voor het Nabije Oosten.

The third step was building up the script piece by piece until it caught the authors’ names, the date in brackets, and the book title. The version that I ended up using looks like this:

sean@g-plinius-quartus:~$ sed "s/^\(.*(.\{4,8\})[\. ][[:alpha:]|[:blank:]|-|']\+\).*$/\1/" <~/Documents/sampleBibEntry.txt
Babelon, Ernest (1893). Catalogue des monnaies grecques de la Bibliothèque Nationale
Smith, John (1901) Silly Book title
Bichler, R. (forthc.) Herodotus and the perception of the Persian Empire
Rollinger, R. (2014a). Das teispidisch
Heller, A. (2010). Das Babylonien der Spätzeit
Briant, Pierre (1997b) Bulletin d'Histoire Achéménide
Sancisci-Weerdenburg, H. and Kuhrt, A. (eds.) (1990). Centre and Periphery
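One detail of the bracket expression explains why the Rollinger entry stops at “Das teispidisch”. Inside [ ... ] the | character is an ordinary character, not “or”, and |-| is read as a range running from | to |, so a literal hyphen never matches. The following is a sketch of a variant, not the command shown above: it assumes hyphenated titles should be kept, puts the hyphen last in the brackets where it is taken literally, and writes \{1,\} in place of \+, which is a GNU extension. The name short_entry is hypothetical.

```shell
#!/bin/sh
# Keep "Author (date)" plus the first run of letters, blanks, apostrophes
# and hyphens after it, and drop the rest of the entry.
# A hyphen placed last inside [ ... ] is a literal hyphen, not a range.
short_entry() {
    sed "s/^\(.*(.\{4,8\})[. ][[:alpha:][:blank:]'-]\{1,\}\).*$/\1/"
}

printf '%s\n' 'Rollinger, R. (2014a). Das teispidisch-achaimenidische Reich. Ein Imperium?' | short_entry
```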


This script is not perfect. If we expand it to allow the date “(forthcoming)” (11 characters between the brackets), it will interpret substrings like “(Oikumene 7)” as the date of publication, thanks to greedy matching. And if one version of the file says “(1234).” and the other says “(1234)”, or one version says “A, B, and C (1234)” and the other says “A, B and C (1234)”, the script leaves those differences in place. That made it difficult to compare the results with the diff command.
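One possible way around both problems, offered as a sketch under assumptions and not as the script I ran, is to spell out what a date may look like and to smooth over the punctuation before comparing. The -E option selects extended regular expressions and is accepted by GNU sed and by current BSD sed. The name normalize_entries and the list of accepted date forms (four digits with an optional letter, “forthc.”, “forthcoming”) are illustrative choices.

```shell
#!/bin/sh
# Normalize entries so that small punctuation differences disappear.
#  1. Find the last bracket that really looks like a date and replace
#     ")." or ")" plus any blanks after it with ") ".
#  2. Turn ", and " into " and " so the serial comma no longer matters.
normalize_entries() {
    sed -E \
        -e 's/^(.*\(([0-9]{4}[a-z]?|forthc\.|forthcoming)\))\.?[[:blank:]]*/\1 /' \
        -e 's/,? and / and /g'
}

printf '%s\n' \
    'Smith, J., Jones, K., and Brown, L. (1234). Some title (Series 12345). Place.' \
    'Smith, J., Jones, K. and Brown, L. (1234) Some title (Series 12345). Place.' |
    normalize_entries
```

Because the date has to be digits or one of the listed words, a bracket such as “(Series 12345)” is no longer mistaken for the year, and the two sample lines come out identical.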

But it let me take the files I had, process them so they had just enough information to compare, and pipe the results into two new files.
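That last step can be sketched roughly as follows. The file names mine.txt and theirs.txt are placeholders, shorten and compare_lists are hypothetical helper names, and the sed expression is a variant of the one above with the hyphen made literal and trailing blanks stripped. Sorting first is my own assumption: it keeps a difference in the order of entries from drowning out the real differences.

```shell
#!/bin/sh
# Reduce one bibliography file to "Author (date) short title" lines,
# sort them, and write the result to a new file next to the original.
shorten() {
    sed -e "s/^\(.*(.\{4,8\})[. ][[:alpha:][:blank:]'-]\{1,\}\).*$/\1/" \
        -e 's/[[:blank:]]*$//' "$1" | sort -u > "$1.short"
}

# Shorten both files and show the lines that appear in only one of them.
# diff exits with status 1 when the files differ, which is expected here.
compare_lists() {
    shorten "$1"
    shorten "$2"
    diff "$1.short" "$2.short"
}

# Example call (placeholder file names):
# compare_lists mine.txt theirs.txt
```

Lines marked < appear only in the first list and lines marked > only in the second.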
