Comparing Lists of Works Cited with Regular Expressions
My goal was to reduce each list to the minimum information that lets me compare it with the other: the author or authors, date of publication, and a short version of the title. Once I had a list of works cited, I could check the other details of the bibliography in the actual file.
The first step was re-familiarizing myself with the sed and diff commands and the regular expression syntax which they use. Teaching examples tend to assume that [A-Za-z] represents ‘all letters.’ Back when I learned this, the assumption that text was in ASCII did not seem so hilarious.
- 5.5 Character Classes and Bracket Expressions https://www.gnu.org/software/sed/manual/html_node/Character-Classes-and-Bracket-Expressions.html (for dealing with real language, which requires enough characters that it is stored in Unicode, not toy language in ASCII)
- 5.7 Back-References and Subexpressions https://www.gnu.org/software/sed/manual/html_node/Back_002dreferences-and-Subexpressions.html#Back_002dreferences-and-Subexpressions (how sed names the parts of a string found by the regular expression)
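To make those two manual sections concrete, here is a minimal sketch of my own. It assumes GNU sed and a UTF-8 terminal, and the variable names are made up for the illustration. The first command forces the C locale so that the ASCII-only behaviour of [A-Za-z] is reproducible. The second uses whatever locale the shell is already running in. The third shows how \1 and \2 name the parts of a line captured by \( ... \).

```shell
# A word from the test entries that contains non-ASCII letters.
word='Achéménide'

# In the C locale, [A-Za-z] covers only the 52 ASCII letters,
# so deleting every match leaves the accented letters behind.
ascii_leftover=$(printf '%s\n' "$word" | LC_ALL=C sed 's/[A-Za-z]//g')
echo "left over after [A-Za-z]:    $ascii_leftover"

# [[:alpha:]] means "whatever the current locale calls a letter".
# In a UTF-8 locale this should leave nothing behind.
class_leftover=$(printf '%s\n' "$word" | sed 's/[[:alpha:]]//g')
echo "left over after [[:alpha:]]: $class_leftover"

# Back-references: \1 is the text before the bracketed date,
# \2 is the text inside the brackets.
swapped=$(printf '%s\n' 'Smith, John (1901) Silly Book title. Utopia: Unpressed.' |
    sed 's/^\([^(]*\) (\([^)]*\)).*$/\2: \1/')
echo "$swapped"   # 1901: Smith, John
```

If the second command also leaves the accented letters behind, the shell is probably not running in a UTF-8 locale.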
The next step was creating a small set of entries to test the script on. The final iteration looked like this:
Babelon, Ernest (1893). Catalogue des monnaies grecques de la Bibliothèque Nationale: Les Perses achéménides. Paris: C. Rollin and Feuardent. ark:/12148/bpt6k399051b
Smith, John (1901) Silly Book title. Utopia: Unpressed.
Bichler, R. (forthc.) Herodotus and the perception of the Persian Empire. Some observations from a historical and methodological perspective. In: P. Mack and J. North (eds.), The Afterlife of Herodotus and Thucydides, forthcoming.
Rollinger, R. (2014a). Das teispidisch-achaimenidische Großreich. Ein ‘Imperium’ avant la lettere? In M. Gehler and R. Rollinger (eds.), Imperien in der Weltgeschichte. Epochenübergreifende und globalhistorische Vergleiche 1. Wiesbaden: Harrasowitz, pp. 149–192.
Heller, A. (2010). Das Babylonien der Spätzeit (7.-4. Jh.) in den klassischen und keilschriftlichen Quellen (Oikumene 7). Berlin: Verlag Antike.
Briant, Pierre (1997b) Bulletin d'Histoire Achéménide (I). Topoi Supplément 1. Recherches récentes sur l'Empire achéménide (Lyon: De Boccard) pp. 5-127 http://www.achemenet.com/dotAsset/e897d215-7e49-440a-8c51-634b0d88f03e.pdf
Sancisci-Weerdenburg, H. and Kuhrt, A. (eds.) (1990). Centre and Periphery: Proceedings of the Groningen 1986 Achaemenid History Workshop (Achaemenid History 4). Leiden: Nederlands Instituut voor het Nabije Oosten.
The third step was building up the script step by step until it caught the authors’ names, the date in brackets, and the book title. The version that I ended up using looks like this:
sean@g-plinius-quartus:~$ sed "s/^\(.*(.\{4,8\})[\. ][[:alpha:]|[:blank:]|-|']\+\).*$/\1/" <~/Documents/sampleBibEntry.txt
Babelon, Ernest (1893). Catalogue des monnaies grecques de la Bibliothèque Nationale
Smith, John (1901) Silly Book title
Bichler, R. (forthc.) Herodotus and the perception of the Persian Empire
Rollinger, R. (2014a). Das teispidisch
Heller, A. (2010). Das Babylonien der Spätzeit
Briant, Pierre (1997b) Bulletin d'Histoire Achéménide
Sancisci-Weerdenburg, H. and Kuhrt, A. (eds.) (1990). Centre and Periphery
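One detail of the output is worth a note: the Rollinger title stops at its hyphen. Inside a bracket expression a | is an ordinary character, not "or", so |-| is read as the range from | to |, and the hyphen itself is never allowed. The sketch below compares the expression above with a variant of my own that drops the | characters and puts the hyphen last, where it is taken literally. It assumes GNU sed (for \+) and uses a shortened copy of the Rollinger entry.

```shell
# Shortened copy of the Rollinger test entry.
entry='Rollinger, R. (2014a). Das teispidisch-achaimenidische Großreich. Ein Imperium avant la lettere?'

# The expression as used in the post: "|-|" is a one-character range,
# so a hyphen ends the match.
original=$(printf '%s\n' "$entry" |
    sed "s/^\(.*(.\{4,8\})[\. ][[:alpha:]|[:blank:]|-|']\+\).*$/\1/")
echo "$original"

# Variant (my assumption about what was intended): no "|" inside the
# brackets, and the hyphen placed last so that it is a literal hyphen.
variant=$(printf '%s\n' "$entry" |
    sed "s/^\(.*(.\{4,8\})[. ][[:alpha:][:blank:]'-]\+\).*$/\1/")
echo "$variant"
```

In a UTF-8 locale the variant should keep the whole hyphenated word and stop at the full stop after Großreich. The outputs listed above come from the original expression, not from this variant.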
This script is not perfect: if we expand it to allow date (forthcoming) (11 characters between the brackets) then it will interpret substrings like (Oikumene 7) as the date of publication thanks to greedy matching. And if one version of the file says “(1234).” and the other says “(1234)”, or one version says “A, B, and C (1234)” and the other says “A, B and C (1234)” this script will not change that. That made it difficult to compare the results with the diff command.
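One way to reduce that noise would be a second sed pass that irons out both differences before diff sees the lines. This is a sketch of an idea, not something I ran on the real files: the function name normalize and the two sample lines are invented for the illustration.

```shell
# Hypothetical helper: smooth over punctuation differences that do not
# matter when comparing two lists of works cited.
normalize() {
    # 1. drop the first full stop that directly follows a closing bracket
    # 2. drop the comma before "and" (serial comma)
    sed -e 's/)\./)/' -e 's/, and / and /g'
}

# Two invented lines that differ only in those two details.
first=$(printf '%s\n' 'A, B, and C (1234). Some Title' | normalize)
second=$(printf '%s\n' 'A, B and C (1234) Some Title' | normalize)

echo "$first"
echo "$second"
```

The second rule also rewrites any ", and " inside a title, which is harmless as long as both lists go through the same pass.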
But it let me take the files I had, process them so they had just enough information to compare, and pipe the results into two new files.
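The whole workflow can be sketched roughly as follows, assuming GNU sed and diff. The file names, the temporary directory, and the reduce function are placeholders for the illustration, and the two sample lists are built from the ASCII-only test entries so that the result does not depend on the locale.

```shell
# Work in a throwaway directory; all file names here are placeholders.
tmp=$(mktemp -d)

# Two small lists of works cited. The second lacks one entry and
# words the publisher details of the other differently.
cat > "$tmp/listA.txt" <<'EOF'
Smith, John (1901) Silly Book title. Utopia: Unpressed.
Sancisci-Weerdenburg, H. and Kuhrt, A. (eds.) (1990). Centre and Periphery: Proceedings of the Groningen 1986 Achaemenid History Workshop (Achaemenid History 4). Leiden: Nederlands Instituut voor het Nabije Oosten.
EOF
cat > "$tmp/listB.txt" <<'EOF'
Smith, John (1901) Silly Book title. Utopia: Unpressed, 2nd ed.
EOF

# Reduce each entry to author, date, and short title,
# using the expression from the post unchanged.
reduce() {
    sed "s/^\(.*(.\{4,8\})[\. ][[:alpha:]|[:blank:]|-|']\+\).*$/\1/" "$1"
}

reduce "$tmp/listA.txt" > "$tmp/listA.short"
reduce "$tmp/listB.txt" > "$tmp/listB.short"

# diff exits with status 1 when the files differ; that is expected here.
differences=$(diff "$tmp/listA.short" "$tmp/listB.short")
short_b=$(cat "$tmp/listB.short")
echo "$differences"

rm -rf "$tmp"
```

Here diff should report only the entry missing from the second list, because the differing publisher details are cut away before the comparison.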
Edit 2022-08-08: fixed links broken when WordPress introduced the block editor