Although in the study of manuscript culture one of the characteristic activities is to align parallel parts of a work (a common definition of collation), I speak of collation in the narrower sense of identifying differences between texts, i.e., after passages are already “aligned”. There are three ways to collate texts: 1) read two texts side by side and note the differences; 2) compare printed page images, allowing your eyes to merge two page images, often with a device made especially for that purpose; 3) transcribe both texts and compare the transcriptions with the aid of a computer.
Last semester I taught an introductory programming course to non-Computer Science graduate students at URI. My curriculum focused mostly on the Python data science toolset of Jupyter, pandas, and numpy, and using these tools to analyze the students' own datasets.
One student, an English Ph.D. candidate, asked for an alternative involving natural language processing tasks: performing a collation of two editions of a novel released about a decade apart. This kind of work was new to me, but seemed like a simple enough task for a beginning programmer to handle.
My student had both editions of the book as PDFs (scanned from physical books, with embedded OCRed text). We explored two modules for extracting the text:
PyPDF2 was our first try. Its extractText() method didn't include whitespace in its output, giving each page's text as a single long word, probably as a result of the PDF's internal formatting, as suggested by Ned Batchelder on StackOverflow. I suspect it would have been simple enough to read each word and paste them back together as needed, but it was easier to find another solution for PDF reading.
PyMuPDF just worked, at least well enough for this use. It added unnecessary newlines, which would have been a problem if we were interested in paragraph breaks but wasn't an issue here. It also failed with one file's dropcaps, which was probably an OCR or encoding issue rather than a problem with PyMuPDF itself. Here's an example of its use (output on the right; the file 01.01_Why_Model.pdf is one of the readings for Scott Page's Model Thinking course on Coursera):
import fitz  # PyMuPDF
pdf = fitz.open('01.01_Why_Model.pdf')
text = ''
for page in pdf:
    text += page.getText()
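Since we didn't need paragraph breaks, the stray newlines could also have been flattened with a simple whitespace-normalizing join. A minimal sketch, using a made-up string rather than actual extractor output:

```python
# Made-up sample with the kind of extra newlines PyMuPDF inserts
raw = "It was the best\nof times,\nit was the worst\nof times."

# split() breaks on any whitespace (including newlines); joining with
# single spaces collapses the line breaks away
clean = ' '.join(raw.split())
# → "It was the best of times, it was the worst of times."
```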
Text comparison with difflib
It took me an embarrassingly long time to realize the tool we needed here was diff. Python's difflib was the ideal solution. It has a few basic options that easily produce machine-readable output (like the command-line app) or an HTML table, but can also produce more complex output with a little effort. Its HtmlDiff tool worked perfectly for this.
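To illustrate the machine-readable option, here's a minimal sketch using difflib.unified_diff on two invented lines (not text from the student's novel):

```python
import difflib

# Two tiny "editions", one line per unit of comparison
ed1 = ["It was a dark and stormy night.", "The rain fell in torrents."]
ed2 = ["It was a dark and stormy night.", "The rain fell in sheets."]

# unified_diff yields command-line-style output: headers, then lines
# prefixed with ' ' (unchanged), '-' (removed), or '+' (added)
diff = list(difflib.unified_diff(ed1, ed2, lineterm=''))
```

Iterating over diff and printing each line reproduces what the diff command-line tool would show.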
The image to the right shows difflib's output from this code in a Jupyter window:
import difflib
from nltk import word_tokenize
from IPython.display import display, HTML

str1 = "This is my short string"
str2 = "This is another not long string"
words1 = word_tokenize(str1)
words2 = word_tokenize(str2)
hd = difflib.HtmlDiff()
display(HTML(hd.make_table(words1, words2)))
HtmlDiff's options to display only differences in context, and to limit the number of lines of context, were ideal for this case; we don't need to show the entire book just to print a relative handful of differences. For example, the following will show only changes with three words of context around each:
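A sketch of that call, using plain .split() in place of nltk's word_tokenize so the example is self-contained:

```python
import difflib

# One word per "line" of the comparison, as in the earlier example
words1 = "This is my short string".split()
words2 = "This is another not long string".split()

hd = difflib.HtmlDiff()
# context=True hides unchanged regions; numlines=3 keeps three lines
# (here, three words) of context around each change
table = hd.make_table(words1, words2, context=True, numlines=3)
```

In Jupyter, wrapping the result in display(HTML(table)) renders it as before.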
Once difflib's HTML was output, the rest of the student's work on this project was reading through the table, identifying individual changes as 'substantive' or 'accidental', and tabulating them. But there's more we could do with Python to simplify this or enrich the final output, for example:
- Identify changes where a single punctuation mark was changed to another; many of these were probably either typos or OCR errors.
- Do part-of-speech tagging on the books' text and include this data in the output: did the author systematically remove adjectives and adverbs, or is there some other trend?
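The first idea could start from a small helper like the hypothetical one below, which flags diffs where one single punctuation mark was swapped for another:

```python
import string

def is_punct_swap(old, new):
    """Hypothetical helper: True if a diff replaced one single
    punctuation mark with a different one (likely typo or OCR error)."""
    return (len(old) == 1 and len(new) == 1
            and old != new
            and old in string.punctuation
            and new in string.punctuation)
```

For example, is_punct_swap(';', ',') is True, while is_punct_swap('dark', 'stormy') is False; running each changed token pair through it would pre-sort many of the 'accidental' changes automatically.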