Simple Text Collation with Python

Although in the study of manuscript culture one of the characteristic activities is to align parallel parts of a work–and this is a common definition of collation–I speak of collation in the narrower sense of identifying differences between texts (i.e., after passages are already “aligned”). There are three ways to collate texts: 1) read two texts side by side and note the differences; 2) compare printed page images (by allowing your eyes to merge two page images, often with a device made especially for that purpose); 3) transcribe the texts and compare the transcriptions with the aid of a computer.

Last semester I taught an introductory programming course to non-Computer Science graduate students at URI. My curriculum focused mostly on the Python data science toolset of Jupyter, pandas, and numpy, and using these tools to analyze the students' own datasets.

One student, an English Ph.D. candidate, asked for an alternative involving natural language processing tasks: performing a collation of two editions of a novel released about a decade apart. This kind of work was new to me, but seemed like a simple enough task for a beginning programmer to handle.

Reading PDFs


My student had both editions of the book as PDFs (scanned from physical books, with embedded OCRed text). We explored two modules for extracting the text:

PyPDF2 was our first try. Its extractText() method didn't include whitespace in its output, giving each page's text as a single long word, probably as a result of the PDF's internal formatting, as suggested by Ned Batchelder on StackOverflow. I suspect it would have been simple enough to read each word and paste them together as needed, but it was easier to find another solution for PDF reading.

PyMuPDF just worked, at least well enough for this use. It added unnecessary newlines, which would have been a problem if we were interested in paragraph breaks but wasn't an issue here. It also failed with one file's dropcaps, which was probably more of an OCR/encoding issue. Here's an example of use (output on the right; the file 01.01_Why_Model.pdf is one of the readings for Scott Page's Model Thinking course on Coursera):

import fitz  # PyMuPDF imports under the name "fitz"

pdf = fitz.open('01.01_Why_Model.pdf')
text = ''
for page in pdf:
    text += page.getText()

Text comparison with difflib

difflib HTML Table

difflib.HtmlDiff's output from comparing two simple strings.

It took me an embarrassing amount of time before I realized the tool we needed here was diff. Python's difflib was the ideal solution. It has a few basic options that easily produce machine-readable output (like the command-line app's) or an HTML table, but can also produce more complex output with a little effort. Its HtmlDiff tool worked perfectly for this.
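As a quick sketch of the machine-readable side, difflib.unified_diff yields the familiar command-line diff format (the strings here are stand-ins, not the student's texts):

```python
import difflib

# Two one-line-per-sentence "editions" of the same passage
old = ["the quick brown fox\n", "jumps over the dog\n"]
new = ["the quick brown fox\n", "jumps over the lazy dog\n"]

# unified_diff yields lines in the same format as the command-line tool
diff = list(difflib.unified_diff(old, new, fromfile='edition1', tofile='edition2'))
print(''.join(diff))
```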

The image to the right shows difflib's output from this code in a Jupyter window:

import difflib
from nltk import word_tokenize
from IPython.display import display, HTML

str1 = "This is my short string"
str2 = "This is another not long string"

words1 = word_tokenize(str1)
words2 = word_tokenize(str2)

hd = difflib.HtmlDiff()

HTML(hd.make_table(words1, words2))

Two other HtmlDiff options (to display only differences in context, and to limit the number of lines of context) were ideal for this case--we don't need to show the entire book just to print a relative handful of differences. For example, the following will show only the changes, with three words of context around each (each token occupies one line here):

HTML(hd.make_table(words1, words2, context=True, numlines=3))


Once difflib's HTML was generated, the rest of the student's work on this project was reading through the table, identifying individual changes as 'substantive' or 'accidental', and tabulating them. But there's more we could do with Python to simplify this or enrich the final output, for example:

  • Identify changes where a single punctuation mark was changed to another--many of these were probably either typos or OCR errors.
  • Do part-of-speech tagging on the books' text and include this data in the output--did the author systematically remove adjectives and adverbs, or is there some other trend?
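The first idea could be sketched with difflib.SequenceMatcher, which reports 'replace' opcodes for swapped tokens (a sketch only--punctuation_only_changes is my own helper, and split() stands in for word_tokenize):

```python
import difflib
import string

def punctuation_only_changes(words1, words2):
    """Return (old, new) pairs where one punctuation token was swapped
    for another--candidates for typos or OCR errors."""
    sm = difflib.SequenceMatcher(None, words1, words2)
    changes = []
    for tag, i1, i2, j1, j2 in sm.get_opcodes():
        if tag != 'replace':
            continue
        old, new = words1[i1:i2], words2[j1:j2]
        # Keep only one-token-for-one-token punctuation swaps
        if (len(old) == len(new) == 1
                and old[0] in string.punctuation
                and new[0] in string.punctuation):
            changes.append((old[0], new[0]))
    return changes

words1 = "Well , he said .".split()
words2 = "Well ; he said .".split()
print(punctuation_only_changes(words1, words2))  # → [(',', ';')]
```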

Pay the library staff well

Thus, money rules the world. It determines the status of men as well as the value of the services rendered by them. Unfortunately, people are prepared to benefit by a service only in proportion to the value set on it by money. Thus, a famished staff will render the efforts of the First Law as futile as paucity of books or paucity of readers. In the trinity of the library–books, staff, and readers–the richness of the staff in worldly goods appears to be as necessary as the richness of the other two in number and variety, if the law ‘Books Are for Use’ is to be translated into practice. It will have to be so, so long as men’s status is left to the capricious and arbitrary rule of Mammon. 'Therefore, pay the library staff well,’ says the First Law.

Thinking in Systems

Hunger, poverty, environmental degradation, economic instability, unemployment, chronic disease, drug addiction, and war, for example, persist in spite of the analytical ability and technical brilliance that have been directed toward eradicating them. No one deliberately creates those problems, no one wants them to persist, but they persist nonetheless. That is because they are intrinsically systems problems–undesirable behaviors characteristic of the system structures that produce them. They will yield only as we reclaim our intuition, stop casting blame, see the system as the source of its own problems, and find the courage and wisdom to restructure it.

Because otherwise I'll forget how to do this before I have to again.

Pandoc with the --self-contained option will convert your images into Base64 and embed them as data: URIs. For example:

pandoc -o blogpost.html --self-contained --metadata pagetitle="My Blog Post" .\

Note that this generates a complete web page instead of a fragment. If you don’t want that, you can use a custom template. A file with nothing in it but $body$ should work fine.

(NB: Use this sparingly–it’s an inefficient way of encoding binary data.)
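For the curious, what pandoc does under the hood is straightforward to sketch in Python--the file name here is hypothetical, and the data_uri helper is mine, not pandoc's:

```python
import base64
import mimetypes

def data_uri(path):
    """Build a data: URI like the ones pandoc --self-contained embeds."""
    mime = mimetypes.guess_type(path)[0] or 'application/octet-stream'
    with open(path, 'rb') as f:
        encoded = base64.b64encode(f.read()).decode('ascii')
    return f'data:{mime};base64,{encoded}'

# Hypothetical image file; the result can replace an <img> src attribute
# print(data_uri('figure.png'))
```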

Raised by Wolves

Older article (2006) that uses an analogy I love:

Academic libraries now hire an increasing number of individuals to fill professional librarian positions who do not have the master’s degree in library science….

Historically, the shared graduate educational experience has provided a standard preparation and socialization into the library profession. The new professional groups have been ‘raised’ in other environments and bring to the academic library a ‘feral’ set of values, outlooks, styles, and expectations.

The Brown University Library hires a fair number of “feral professionals”–it was interesting to find out this isn’t a relatively new issue.

Stanley Wilder, writing in The Chronicle of Higher Education, also elaborated on this, mentioning the relative youth and high salaries of the “ferals”:

For example, people in nontraditional positions accounted for 23 percent of the professionals at research libraries in 2005, compared to just 7 percent in 1985.

But the most compelling aspect of the nontraditional population is its youth: 39 percent of library professionals under 35 work in such nontraditional jobs, compared with only 21 percent of those 35 and older….

Within the under-35 population, 24 percent of nontraditional library employees earn $54,000 or more, compared to just 7 percent of those in traditional positions. Our profession has no precedent for the existence of so large a cohort of young employees who begin their careers at salaries approaching those of established middle managers.

Wilder’s closing describes the issue I’m currently looking at:

The libraries that thrive in the coming years will be those that apply the full range of nontraditional expertise in the service of those timeless values, and not the other way around.

A chart

A chart describing book properties

"A chart showing the percentage of excellence in the physical properties of books published since 1910."

Does Economics Make Politicians Corrupt?

All these findings correspond with a substantial body of research in the economic literature, which, with the help of surveys, laboratory experiments, as well as field experiments showed that those who learn about markets (economists) or act in markets (businessmen) are lacking in … ‘pro-social behavior’ … I also use corruption as a proxy to show whether there are any differences in pro-social behavior between economists and non-economists, but unlike them, I observe behavior outside the artificial situation of a laboratory. By analyzing real world data of the U.S. Congress, I found that politicians holding a degree in economics are significantly more prone to engage in corrupt practices.

About Me

Developer at Brown University Library specializing in instructional design and technology, Python-based data science, and XML-driven web development.