Another Pickwick Discard

This visualization was the first draft of the last chart in this post. I added extra spacing between the bars to provide space for annotations; eventually I decided this was unnecessary.

Pickwick Graph: New Vocabulary by Chapter
#y-positions for the next graph. Add empty spaces (label='' and width=0) under each bar.
ypos = np.arange(chapterdata.shape[0]*2 - 1, 0, -1)
widths = list(itertools.chain.from_iterable((x, 0) for x in chapterdata['new_vocabulary']))[:-1]
labels = list(itertools.chain.from_iterable((str(x), '') for x in chapterdata.index))[:-1]

#Don't need extra room after the first two chapters, so drop their spacer
#entries. (After the first del, the second spacer has shifted down to index 2.)
ypos = np.delete(ypos, (0, 1))
del widths[1]
del widths[2]
del labels[1]
del labels[2]
MOST_COMMON_COUNT = 5

fig, ax = plt.subplots(figsize=(14, 40))
#plt.barh('new_vocabulary', chapterdata.index, data=chapterdata,orient='h', color=mycolor)

ax.barh(ypos, widths, .8, tick_label=labels, color=mycolor)
plt.ylim([0, ypos.max()+1])
plt.grid(axis='x')
xax = ax.get_xaxis()
xax.set_label_position('top')
plt.xlabel("Count of New Vocabulary")
ax.xaxis.tick_top()

for chapnum in range(1, 58):
    chaplbl = str(chapnum)
    #print(str(widths[labels.index(chaplbl)]))
    plt.text(10, ypos[labels.index(chaplbl)], str(widths[labels.index(chaplbl)]),\
             weight='bold', color='white', verticalalignment='center')
    if chapnum >= 3:
        newwords = ', '.join(['"%s" (%d)'%word for word in newvocab[chapnum].most_common(MOST_COMMON_COUNT)]) 
        #print(newwords)
        plt.text(10, ypos[labels.index(chaplbl)]-1.1, 'Major new vocabulary:', weight='bold',\
                 bbox=dict(facecolor='#ffffff99'))
        plt.text(500, ypos[labels.index(chaplbl)]-1.1, newwords,\
                 bbox=dict(facecolor='#ffffff99'))

plt.show()

Pickwick Discard Graph

Pickwick Graph: Count of Unique Words by Chapter

The Pickwick Papers: Count of unique words by chapter.

Here's one of the also-ran graphs I mentioned in my last post about Pickwick. It was part of my exploration of the data and didn't seem interesting enough to include there.

It shows the count of unique vocabulary per chapter, which isn't all that interesting without any context. It might work better as a stacked bar showing unique vocabulary against the total word count for each chapter (a rough sketch of that idea follows the code below).

#First make a DataFrame with chapter #s and word counts.
clengths = pd.DataFrame((
    (x, len(chaptervocab[x-1])) for x in range(1,58)), 
    columns=['Chapter', 'Count of Unique Words'])
#Bar color for the rest of the plots.
mycolor="#3E000C"
#Set the size of the plot.
plt.figure(figsize=(12, 20))
#plt.xlim(0,10000)
#Choose Seaborn display settings. 
sns.set(style="whitegrid")
#Make a horizontal (orient='h') barplot.
sns.barplot('Count of Unique Words', 'Chapter', 
            data=clengths, orient='h', color=mycolor)

plt.show()
#I don't need this any more.
del(clengths)
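
As a rough sketch of that stacked-bar idea: since each chapter's unique words are a subset of its total words, we can draw the total-word bars first and overlay the unique-word bars on top. The chaptertokens list below (one list of tokens per chapter) is hypothetical; only chaptervocab exists in this notebook.

#Sketch of the stacked-bar idea. 'chaptertokens' (a per-chapter list of
#tokens) is hypothetical here.
chapcounts = pd.DataFrame(
    ((x, len(chaptervocab[x-1]), len(chaptertokens[x-1])) for x in range(1, 58)),
    columns=['Chapter', 'Unique Words', 'Total Words'])

plt.figure(figsize=(12, 20))
sns.set(style="whitegrid")
#Draw the longer (total) bars first, then the unique-word bars over them.
sns.barplot('Total Words', 'Chapter', data=chapcounts, orient='h', color='#C5B4BA')
sns.barplot('Unique Words', 'Chapter', data=chapcounts, orient='h', color=mycolor)
plt.xlabel('Words per Chapter')
plt.show()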

Python, newspaper, and NLTK

I've been continuing my experimentation with NLTK, using the Python newspaper module. While newspaper seems mainly intended for making personalized Google News-like sites, it does some language processing to support this. Below, I continue where newspaper leaves off using NLTK and seaborn to explore an online text.

Getting Ready

In [1]:
import re # For some minor data cleaning

from IPython.core.display import display, HTML
import newspaper
import nltk
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

%matplotlib inline

I have several test sites in the cell below--sites can be testy about allowing access from non-browser clients, and I've found I need to alternate between them while experimenting. Most of these sites work well, but newspaper often gives weird results from the Bing News search.

In [2]:
#paper = newspaper.build('https://www.providencejournal.com/')
#paper = newspaper.build('https://www.bing.com/news/search?q=providence')
#paper = newspaper.build('https://www.guardian.co.uk')
paper = newspaper.build('https://www.vulture.com/')
len(paper.articles)
Out[2]:
69

newspaper basics

Here I choose an article and download() and parse() it--which makes its data available. Then we can pull some basic details like the article's title, address, images, and text:

In [3]:
#We *can* do this, but I want to make sure I have a relatively-long article to use.
a = paper.articles[0]

#Via aldaily.com. Longer articles make more interesting graphs.
a = newspaper.Article('https://themillions.com/2020/03/on-pandemic-and-literature.html')
a.download()
a.parse()


display(HTML("<h3>%s</h3>" % a.title))
print(a.url)

#display(HTML('<img src="%s"/>'%a.top_image))
print(a.top_image) # also a.images, a.movies
print(a.text[:500])

On Pandemic and Literature

https://themillions.com/2020/03/on-pandemic-and-literature.html
https://themillions.com/wp-content/uploads/2020/03/1-2-870x1024.jpg
Less than a century after the Black Death descended into Europe and killed 75 million people—as much as 60 percent of the population (90% in some places) dead in the five years after 1347—an anonymous Alsatian engraver with the fantastic appellation of “Master of the Playing Cards” saw fit to depict St. Sebastian: the patron saint of plague victims. Making his name, literally, from the series of playing cards he produced at the moment when the pastime first became popular in Germany, the engrave

Our Article object also gets a list of the text's authors. I've found this tends to be the least accurate piece of newspaper's processing.

In [4]:
a.authors
Out[4]:
['Ed Simon',
 'Madeleine Monson-Rosen',
 'Ken Hines',
 'Kirsty Logan',
 'Patrick Brown',
 'Emily St. John Mandel',
 'Diksha Basu',
 'Sonya Chung',
 'Andrew Saikali']

NLP with newspaper

The .nlp() method gives us access to a summary of the text and a list of keywords. I haven't looked at the source closely enough to figure out how it's determining these, though the keywords are approximately the most common non-stopwords in the article.

In [5]:
a.nlp()

print(a.summary)
display(HTML('<hr/>'))
print(a.keywords)
There has always been literature of pandemic because there have always been pandemics.
What marks the literature of plague, pestilence, and pandemic is a commitment to try and forge if not some sense of explanation, than at least a sense of meaning out of the raw experience of panic, horror, and despair.
Narrative is an attempt to stave off meaninglessness, and in the void of the pandemic, literature serves the purpose of trying, however desperately, to stop the bleeding.
Pandemic literature exists not just to analyze the reasons for the pestilence—that may not even be its primary purpose.
The necessity of literature in the aftermath of pandemic is movingly illustrated in Emily St. John Mandel’s novel Station Eleven.

['pandemic', 'disease', 'narrative', 'sense', 'black', 'writes', 'plague', 'death', 'literature', 'world']

NLP with NLTK

That's what Newspaper can do for us. But since I had nltk installed already (and Newspaper requires it even if I hadn't), I can take this article's text and do some basic processing with it.

First I need to tokenize the text, breaking it into individual words and punctuation marks.

In [6]:
a.tokens = nltk.word_tokenize(a.text)
print(a.tokens[:50])
['Less', 'than', 'a', 'century', 'after', 'the', 'Black', 'Death', 'descended', 'into', 'Europe', 'and', 'killed', '75', 'million', 'people—as', 'much', 'as', '60', 'percent', 'of', 'the', 'population', '(', '90', '%', 'in', 'some', 'places', ')', 'dead', 'in', 'the', 'five', 'years', 'after', '1347—an', 'anonymous', 'Alsatian', 'engraver', 'with', 'the', 'fantastic', 'appellation', 'of', '“', 'Master', 'of', 'the', 'Playing']

Next, guess each token's part of speech, using NLTK's "off-the-shelf" English tagger. This returns a list of 2-tuples (token, tag), with tags drawn from the Penn Treebank tagset.

In [7]:
a.pos_tags = nltk.pos_tag(a.tokens)
a.pos_tags[:15]
Out[7]:
[('Less', 'JJR'),
 ('than', 'IN'),
 ('a', 'DT'),
 ('century', 'NN'),
 ('after', 'IN'),
 ('the', 'DT'),
 ('Black', 'NNP'),
 ('Death', 'NNP'),
 ('descended', 'VBD'),
 ('into', 'IN'),
 ('Europe', 'NNP'),
 ('and', 'CC'),
 ('killed', 'VBD'),
 ('75', 'CD'),
 ('million', 'CD')]

The Treebank tagset isn't particularly intuitive, especially if your last contact with English grammar was in middle school. Here's the help text for a few of the less-obvious tags above.

In [8]:
for pos in ['NNS', 'NNP', 'IN', 'DT', 'JJ']:
    nltk.help.upenn_tagset(pos)
NNS: noun, common, plural
    undergraduates scotches bric-a-brac products bodyguards facets coasts
    divestitures storehouses designs clubs fragrances averages
    subjectivists apprehensions muses factory-jobs ...
NNP: noun, proper, singular
    Motown Venneboerger Czestochwa Ranzer Conchita Trumplane Christos
    Oceanside Escobar Kreisler Sawyer Cougar Yvette Ervin ODI Darryl CTCA
    Shannon A.K.C. Meltex Liverpool ...
IN: preposition or conjunction, subordinating
    astride among uppon whether out inside pro despite on by throughout
    below within for towards near behind atop around if like until below
    next into if beside ...
DT: determiner
    all an another any both del each either every half la many much nary
    neither no some such that the them these this those
JJ: adjective or numeral, ordinal
    third ill-mannered pre-war regrettable oiled calamitous first separable
    ectoplasmic battery-powered participatory fourth still-to-be-named
    multilingual multi-disciplinary ...

I'll also tag the article using the "Universal" tagset--it has fewer tags, which makes for a simpler visualization later on.

In [9]:
a.upos_tags = nltk.pos_tag(a.tokens, tagset="universal")
a.upos_tags[:15]
Out[9]:
[('Less', 'ADJ'),
 ('than', 'ADP'),
 ('a', 'DET'),
 ('century', 'NOUN'),
 ('after', 'ADP'),
 ('the', 'DET'),
 ('Black', 'NOUN'),
 ('Death', 'NOUN'),
 ('descended', 'VERB'),
 ('into', 'ADP'),
 ('Europe', 'NOUN'),
 ('and', 'CONJ'),
 ('killed', 'VERB'),
 ('75', 'NUM'),
 ('million', 'NUM')]

We can also have NLTK calculate a frequency distribution of the words in our article--here I'll use it to show the 10 most common tokens, most of which you probably could have guessed:

In [10]:
a.word_freqs = nltk.FreqDist(word.lower() for word in a.tokens)
a.word_freqs.most_common(10)
Out[10]:
[('the', 310),
 (',', 262),
 ('of', 209),
 ('and', 112),
 ('.', 112),
 ('a', 88),
 ('to', 85),
 ('that', 80),
 ('’', 70),
 ('in', 61)]

Visualization

NLTK's FreqDist can also generate plots. Not great plots. Here's an example.

In [11]:
plt.figure(figsize=(12, 8))
a.word_freqs.plot(25)
plt.show()

Line graphs usually make me think "time series". This should probably be a bar plot, and we can do that. Start by translating our FreqDist object's data to a pandas DataFrame:

In [12]:
wfdf = pd.DataFrame(a.word_freqs.items(), columns=['token', 'frequency'])
wfdf.head()
Out[12]:
     token  frequency
0     less          2
1     than         13
2        a         88
3  century          4
4    after          9

We can now generate a Seaborn barplot of the token frequency data, which is largely unsurprising.

In [13]:
mycolor="#3E000C" #"#9A1C42"
plt.figure(figsize=(12, 8))
sns.set(style="whitegrid")
sns.barplot('frequency', 'token', data=wfdf.sort_values(by='frequency', ascending=False)[:25], color=mycolor)
plt.show()

We can make the result (arguably) more interesting by removing stopwords--very common words that carry little meaning on their own--from the frequency list. Here we get the stopwords for English.

In [14]:
sw = nltk.corpus.stopwords.words('english')
print(sw)
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn', "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't"]

And next, remove stopwords (and punctuation) from our word frequency list, and create a new DataFrame and a new graph.

In [15]:
# '\W' matches one "non-word character", i.e., not a letter, number, or underscore.
wf2 = [word for word in a.word_freqs.items() if word[0] not in sw and not re.match(r'\W', word[0])]
wf2df = pd.DataFrame(wf2, columns=['token', 'frequency'])

plt.figure(figsize=(12, 8))
sns.barplot('frequency', 'token', data=wf2df.sort_values(by='frequency', ascending=False)[:25], color=mycolor)
plt.show()

This tells us more about the article. But what about the part-of-speech tags I generated earlier? Here's a function that will take those lists and generate a graph from them:

In [16]:
def posFreqGraph(tags):
    posfreqs = nltk.FreqDist(word[1] for word in tags)
    posfdf = pd.DataFrame(posfreqs.items(), columns=['pos', 'frequency'])
    plt.figure(figsize=(12, 8))
    sns.barplot('frequency', 'pos', data=posfdf.sort_values(by='frequency', ascending=False), color=mycolor)
    plt.show()

First, the graph of Penn Treebank tags. It's crowded--which shows us the richness of this tagset--but still readable. (Here's a complete list of these tags with meanings.)

In [17]:
posFreqGraph(a.pos_tags)

Here's the same visual built from the universal tagset data.

In [18]:
posFreqGraph(a.upos_tags)

Closing

I think that's a good start. Newspaper makes it easy to load web pages and get just the important text, which we can feed into NLTK for some analysis. Here I did only the very basics of an analysis with NLTK; I plan to experiment more over the next few weeks.

Simple Text Collation with Python

Although in the study of manuscript culture one of the characteristic activities is to align parallel parts of a work–and this is a common definition of collation–I speak of collation in the more narrow sense of identifying differences between texts (i.e., after passages are already “aligned”). There are three methods to collate texts: 1) read two texts side by side and note the differences, 2) compare printed page images (by allowing your eyes to merge two page images, often with a device especially for that purpose); 3) transcribe and compare transcriptions with aid of a computer.

Last semester I taught an introductory programming course to non-Computer Science graduate students at URI. My curriculum focused mostly on the Python data science toolset of Jupyter, pandas, and numpy, and using these tools to analyze the students' own datasets.

One student, an English Ph.D. candidate, asked for an alternative involving natural language processing tasks: performing a collation of two editions of a novel released about a decade apart. This kind of work was new to me, but seemed like a simple enough task for a beginning programmer to handle.

Reading PDFs

PyMuPDF example: code and output for 01.01_Why_Model.pdf.

My student had both editions of the book as PDFs (scanned from physical books, with embedded OCRed text). We explored two modules for extracting the text:

PyPDF2 was our first try. Its extractText() method didn't include whitespace in its output, giving each page's text as a single long word, probably as a result of the PDF's internal formatting, as suggested by Ned Batchelder on StackOverflow. I suspect it would have been simple enough to read each word and paste them back together as needed, but it was easier to find another solution for PDF reading.
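
For reference, the PyPDF2 attempt looked roughly like this--a minimal sketch, assuming the pre-3.0 PyPDF2 API and reusing the same example file as the PyMuPDF snippet below:

import PyPDF2

#Pre-3.0 PyPDF2 API: PdfFileReader / getNumPages / getPage / extractText.
reader = PyPDF2.PdfFileReader('01.01_Why_Model.pdf')
text = ''
for pagenum in range(reader.getNumPages()):
    #extractText() returned each page's text with essentially no whitespace
    #for these particular scans.
    text += reader.getPage(pagenum).extractText()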

PyMuPDF just worked, at least well enough for this use. It added unnecessary newlines, which would have been a problem if we were interested in paragraph breaks but wasn't an issue here. It also failed on one file's drop caps, which was probably more of an OCR/encoding issue. Here's an example of use (output on the right; the file 01.01_Why_Model.pdf is one of the readings for Scott Page's Model Thinking course on Coursera):

import fitz
pdf = fitz.open('01.01_Why_Model.pdf')
text = ''
for page in pdf:
    text += page.getText()

Text comparison with difflib

difflib HTML Table

difflib.HtmlDiff's output from comparing two simple strings.

It took me an embarrassing amount of time before I realized the tool we needed here was diff. Python's difflib was the ideal solution. It has a few basic options that easily produce machine-readable output (like the command-line diff) or an HTML table, but it can also produce more complex output with a little effort. Its HtmlDiff tool worked perfectly for this.

The image to the right shows difflib's output from this code in a Jupyter window:

import difflib
from nltk import word_tokenize
from IPython.display import display, HTML

str1="This is my short string"
str2="This is another not long string"

words1 = word_tokenize(str1)
words2 = word_tokenize(str2)

hd = difflib.HtmlDiff()

HTML(hd.make_table(words1, words2))

Two other HtmlDiff options (to display only differences in context, and to limit the number of lines of context) were ideal for this case--we don't need to show the entire book just to print a relative handful of differences. For example, the following will show only changes with three words of context around each:

hd.make_table(words1, words2, context=True, numlines=3)

Closing

Once difflib's HTML output was generated, the rest of the student's work on this project was reading through the table, identifying individual changes as 'substantive' or 'accidental', and tabulating them. But there's more we could do with Python to simplify this or enrich the final output, for example (the first of these is sketched after the list):

  • Identify changes where a single punctuation mark was changed to another--many of these were probably either typos or OCR errors.
  • Do part-of-speech tagging on the books' text and include this data in the output--did the author systematically remove adjectives and adverbs, or is there some other trend?
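
A minimal sketch of the first idea, reusing difflib's SequenceMatcher on the two tokenized editions (the function name and the words1/words2 token lists are placeholders, following the earlier example):

import difflib
import string

def punctuation_only_changes(words1, words2):
    #Walk difflib's opcodes and keep 'replace' operations where both sides
    #are a single punctuation token--likely typos or OCR errors.
    changes = []
    matcher = difflib.SequenceMatcher(None, words1, words2)
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op == 'replace' and i2 - i1 == 1 and j2 - j1 == 1:
            old, new = words1[i1], words2[j1]
            if len(old) == 1 and len(new) == 1 and \
               old in string.punctuation and new in string.punctuation:
                changes.append((i1, old, new))
    return changes

Running this over the two editions' token lists would give a list of (position, old, new) triples that could be tabulated separately from the substantive changes.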
