Another Pickwick Discard

This visualization was the first draft of the last chart in this post. I added extra spacing between the bars to provide space for annotations; eventually I decided this was unnecessary.

Pickwick Graph: New Vocabulary by Chapter
#y-positions for the next graph. Add empty spaces (label='' and width=0) under each bar.
ypos = np.arange(chapterdata.shape[0]*2 - 1, 0, -1)
widths = list(itertools.chain.from_iterable((x, 0) for x in chapterdata['new_vocabulary']))[:-1]
labels = list(itertools.chain.from_iterable((str(x), '') for x in chapterdata.index))[:-1]

#Don't need extra room after the first two chapters, so drop their spacer entries.
ypos = np.delete(ypos, (0,1))
#The second del is index 2 because the list shifts after the first del.
del(widths[1])
del(widths[2])
del(labels[1])
del(labels[2])
MOST_COMMON_COUNT = 5

fig, ax = plt.subplots(figsize=(14, 40))
#plt.barh('new_vocabulary', chapterdata.index, data=chapterdata,orient='h', color=mycolor)

ax.barh(ypos, widths, .8, tick_label=labels, color=mycolor)
plt.ylim([0, ypos.max()+1])
plt.grid(axis='x')
xax = ax.get_xaxis()
xax.set_label_position('top')
plt.xlabel("Count of New Vocabulary")
ax.xaxis.tick_top()

for chapnum in range(1, 58):
    chaplbl = str(chapnum)
    #print(str(widths[labels.index(chaplbl)]))
    plt.text(10, ypos[labels.index(chaplbl)], str(widths[labels.index(chaplbl)]),\
             weight='bold', color='white', verticalalignment='center')
    if chapnum >= 3:
        newwords = ', '.join(['"%s" (%d)'%word for word in newvocab[chapnum].most_common(MOST_COMMON_COUNT)]) 
        #print(newwords)
        plt.text(10, ypos[labels.index(chaplbl)]-1.1, 'Major new vocabulary:', weight='bold',\
                 bbox=dict(facecolor='#ffffff99'))
        plt.text(500, ypos[labels.index(chaplbl)]-1.1, newwords,\
                 bbox=dict(facecolor='#ffffff99'))

plt.show()

Pickwick Discard Graph

Pickwick Graph: Count of Unique Words by Chapter

The Pickwick Papers: Count of unique words by chapter.

Here's one of the also-ran graphs I mentioned in my last post about Pickwick. They were part of my exploration of the data but didn't seem interesting enough to include there.

It shows the count of unique vocabulary per chapter, which isn't all that interesting without any context. It might work better as a stacked bar with unique vocabulary and total word count for each chapter; a sketch of that idea follows the code below.

#First make a DataFrame with chapter #s and word counts.
clengths = pd.DataFrame((
    (x, len(chaptervocab[x-1])) for x in range(1,58)), 
    columns=['Chapter', 'Count of Unique Words'])
#Bar color for the rest of the plots.
mycolor="#3E000C"
#Set the size of the plot.
plt.figure(figsize=(12, 20))
#plt.xlim(0,10000)
#Choose Seaborn display settings. 
sns.set(style="whitegrid")
#Make a horizontal (orient='h') barplot.
sns.barplot('Count of Unique Words', 'Chapter', 
            data=clengths, orient='h', color=mycolor)

plt.show()
#I don't need this any more.
del(clengths)
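
As an aside, here's a minimal sketch of that stacked-bar idea. It assumes the chapterdata DataFrame built in the notebook below (with its word_count and unique_words columns); the bars are overlaid rather than literally stacked, but since unique_words is always a subset of word_count the effect is the same.

#Sketch of the stacked-bar alternative, assuming chapterdata (defined below) exists.
fig, ax = plt.subplots(figsize=(12, 20))
#Draw the total word count first, then overlay the unique-word count.
ax.barh(chapterdata.index, chapterdata['word_count'], color='#cccccc', label='Total words')
ax.barh(chapterdata.index, chapterdata['unique_words'], color=mycolor, label='Unique words')
ax.invert_yaxis()   #Chapter 1 at the top, like the other charts.
ax.set_xlabel('Word Count')
ax.set_ylabel('Chapter')
ax.legend()
plt.show()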

More Natural Language Processing

This week's NLP experimentation involves Project Gutenberg's plain-text edition of The Pickwick Papers. I parsed the text into individual chapters and calculated some summary statistics about each, then built a visualization of each chapter's new vocabulary.

Setup

A few imports to start with. Most of these are just NLTK, Seaborn, and the standard Python data science toolkit; OrderedDict will hold the chapter texts and requests will fetch the book from Project Gutenberg.

In [1]:
from collections import OrderedDict
import re

import nltk
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from requests import get
import seaborn as sns

%matplotlib inline

First I grab Project Gutenberg's plain-text edition of The Pickwick Papers, then strip out (using split() and list indexing) Gutenberg's front- and backmatter and the table of contents to get just the book's text.

Also, Gutenberg's plain-text version uses underscores to indicate italics. I don't need those so I remove them here.

In [2]:
rq = get('https://www.gutenberg.org/files/580/580-0.txt')

#Not ascii, which requests assumes
rq.encoding='utf8'

pickwick = rq.text
pickwick = pickwick.split('THE POSTHUMOUS PAPERS OF THE PICKWICK CLUB')[2]
pickwick = pickwick.split('End of the Project Gutenberg EBook of The Pickwick Papers')[0]
pickwick = pickwick.replace('_', '')

First, I break the text into paragraphs, then use enumerate() and filter() to get indexes of the chapter headings (conveniently, they're the only "paragraphs" that start with the string "CHAPTER "). This gives me a list like this:

[(2, 'CHAPTER I. THE PICKWICKIANS'), 
(26, 'CHAPTER II. THE FIRST DAY’S JOURNEY, AND THE FIRST EVENING’S ADVENTURES;\r\nWITH THEIR CONSEQUENCES'),
...]

I then cycle through this list to locate the text of each chapter. The chapter texts are stored in an OrderedDict, with the chapter titles used as keys.

Then I delete my _chapterheads list, since I no longer need it and want to limit the number of copies of this book I have stored in memory.

In [3]:
paragraphs = pickwick.split('\r\n\r\n')

_chapterheads = list(filter(lambda x: x[1].startswith('CHAPTER '), enumerate(paragraphs)))

chapters = OrderedDict()
for i in range(len(_chapterheads)-1):
    ch = _chapterheads[i]
    nextch = _chapterheads[i+1]
    chapters[ch[1]] = list(filter(lambda x: x.strip() != '', paragraphs[ch[0]+1:nextch[0]]))
    
#The last chapter runs from the final heading to the end of the text.
ch = _chapterheads[-1]
chapters[ch[1]] = list(filter(lambda x: x.strip() != '', paragraphs[ch[0]+1:]))

del(_chapterheads)

Next, I need lists of tokens:

  • First, generate lists of entities, including both words and punctuation marks, one for each chapter.
  • Then, a copy of that first list, but lowercased and with punctuation removed.
  • Finally a third list with each chapter separated into sentences.
In [4]:
_chaptertexts = ['\n'.join(x) for x in chapters.values()]
_booktext = '\n\n'.join(_chaptertexts)

chaptertokens = [nltk.word_tokenize(chap) for chap in _chaptertexts]
chapterwords = [[w.lower() for w in chap if not re.match('\W', w)] for chap in chaptertokens]
chaptersentences = [nltk.sent_tokenize(chap) for chap in _chaptertexts]

del(_chaptertexts)
del(_booktext)

Next, I copy the lists from chapterwords into sets, giving me a list of the unique words in each chapter--I'll use that soon. Then I have a copy of this with stopwords removed, and a frequency distribution of the words in each chapter.

In [5]:
sw = nltk.corpus.stopwords.words('english')

chaptervocabsw = [set(chap) for chap in chapterwords]
chaptervocab = [set([word for word in chap if word not in sw]) for chap in chaptervocabsw]

#Remember we're not counting punctuation here.
chapterfreq = [nltk.FreqDist(chap) for chap in chapterwords]
In [6]:
#For each chapter, find the words that don't appear in any earlier chapter and
# keep a frequency distribution of just those new words.
newvocab = []
for c in range(57):
    newvoc = chaptervocabsw[c].difference(*chaptervocabsw[:c])
    newvocab.append(nltk.FreqDist(dict([x for x in chapterfreq[c].items() if x[0] in newvoc])))

#Make newvocab 1-indexed.
newvocab = dict(enumerate(newvocab, start=1))

Now I use Pandas to build a DataFrame of potentially-interesting statistics. This is done with a complex-looking list comprehension that generates an 8-tuple describing each chapter. The DataFrame constructor interprets this as a table.

In [7]:
chapterdata = pd.DataFrame([
                (
                    len(chapterwords[x-1]),
                    np.mean([len(tok) for tok in chaptertokens[x-1] if not re.match('\W', tok)]),
                    len(chaptersentences[x-1]),
                    len(chapterwords[x-1]) / len(chaptersentences[x-1]),
                    len(chaptervocabsw[x-1]),
                    len(chapterwords[x-1])/len(chaptervocabsw[x-1]),
                    len(newvocab[x]),
                    len(newvocab[x])/len(chaptervocabsw[x-1])*100,
                )
            for x in range(1,58)], 
            index=pd.Index(range(1,58), name='chapter'),
            columns=['word_count', 'avg_word_length', 'sentence_count', 'avg_sentence_length',\
                     'unique_words', 'lexical_diversity', 'new_vocabulary', 'pct_new_vocab'])

chapterdata.to_csv('~/data/nlp/pickwick_details.csv')
chapterdata
Out[7]:
word_count avg_word_length sentence_count avg_sentence_length unique_words lexical_diversity new_vocabulary pct_new_vocab
chapter
1 1774 4.929538 79 22.455696 705 2.516312 705 100.000000
2 9888 4.644114 391 25.289003 2441 4.050799 2087 85.497747
3 4650 4.440215 167 27.844311 1421 3.272343 682 47.994370
4 4657 4.518145 165 28.224242 1402 3.321683 555 39.586305
5 3719 4.536166 139 26.755396 1227 3.030970 458 37.326813
6 5969 4.388005 211 28.289100 1639 3.641855 616 37.583893
7 5322 4.558812 210 25.342857 1632 3.261029 536 32.843137
8 4678 4.478623 213 21.962441 1407 3.324805 401 28.500355
9 3305 4.384266 155 21.322581 1059 3.120869 271 25.590179
10 5407 4.240244 215 25.148837 1516 3.566623 441 29.089710
11 7350 4.334830 335 21.940299 1953 3.763441 501 25.652842
12 2205 4.503855 87 25.344828 786 2.805344 163 20.737913
13 7048 4.539728 232 30.379310 1841 3.828354 492 26.724606
14 6893 4.233425 256 26.925781 1652 4.172518 363 21.973366
15 5123 4.523131 203 25.236453 1487 3.445192 348 23.402824
16 7265 4.276944 299 24.297659 1791 4.056393 404 22.557231
17 3551 4.391157 83 42.783133 1038 3.421002 179 17.244701
18 3857 4.361162 162 23.808642 1159 3.327869 183 15.789474
19 5325 4.328263 223 23.878924 1471 3.619986 265 18.014956
20 6411 4.262674 222 28.878378 1591 4.029541 305 19.170333
21 7341 4.301594 259 28.343629 1890 3.884127 324 17.142857
22 6213 4.353774 252 24.654762 1595 3.895298 255 15.987461
23 3301 4.182672 129 25.589147 1022 3.229941 138 13.502935
24 5787 4.592189 213 27.169014 1564 3.700128 258 16.496164
25 7100 4.404930 303 23.432343 1707 4.159344 259 15.172818
26 2460 4.357317 93 26.451613 787 3.125794 83 10.546379
27 3754 4.329249 134 28.014925 1162 3.230637 170 14.629948
28 8926 4.336993 247 36.137652 2175 4.103908 370 17.011494
29 4167 4.404848 121 34.438017 1241 3.357776 162 13.053989
30 4309 4.444419 175 24.622857 1283 3.358535 156 12.159002
31 6131 4.377100 216 28.384259 1643 3.731589 239 14.546561
32 5501 4.392111 193 28.502591 1514 3.633421 189 12.483487
33 6364 4.395663 201 31.661692 1817 3.502477 326 17.941662
34 9501 4.498579 300 31.670000 2005 4.738653 290 14.463840
35 5980 4.515050 277 21.588448 1704 3.509390 255 14.964789
36 4599 4.404218 170 27.052941 1419 3.241015 190 13.389711
37 5099 4.269857 186 27.413978 1358 3.754786 168 12.371134
38 5395 4.413160 223 24.192825 1562 3.453905 197 12.612036
39 6010 4.363894 219 27.442922 1555 3.864952 174 11.189711
40 5046 4.373365 189 26.698413 1396 3.614613 169 12.106017
41 5237 4.344090 172 30.447674 1519 3.447663 182 11.981567
42 5609 4.433945 228 24.600877 1628 3.445332 177 10.872236
43 5086 4.311050 217 23.437788 1478 3.441137 203 13.734777
44 5415 4.163250 212 25.542453 1441 3.757807 141 9.784872
45 6474 4.343682 234 27.666667 1785 3.626891 226 12.661064
46 3810 4.359580 171 22.280702 1068 3.567416 92 8.614232
47 4644 4.394488 167 27.808383 1333 3.483871 100 7.501875
48 5029 4.320143 175 28.737143 1376 3.654797 108 7.848837
49 7360 4.235462 260 28.307692 1725 4.266667 208 12.057971
50 5757 4.496265 194 29.675258 1585 3.632177 149 9.400631
51 5530 4.520615 192 28.802083 1681 3.289709 207 12.314099
52 4648 4.225904 147 31.619048 1359 3.420162 143 10.522443
53 4773 4.432642 192 24.859375 1353 3.527716 109 8.056171
54 5806 4.313469 223 26.035874 1443 4.023562 103 7.137907
55 4820 4.363900 179 26.927374 1414 3.408769 153 10.820368
56 4578 4.262342 196 23.357143 1199 3.818182 66 5.504587
57 2774 4.664023 81 34.246914 1000 2.774000 77 7.700000

Skimming the data, it looks like there's not much variation in word length per chapter, but quite a bit more in sentence length and lexical diversity (the ratio of total word count to unique words in the chapter). We can quickly verify this with a simple calculation on the dataframe, dividing each column's standard deviation by its mean to get comparable measures of spread.

In [8]:
chapterdata.std()/chapterdata.mean()
Out[8]:
word_count             0.303518
avg_word_length        0.029765
sentence_count         0.309125
avg_sentence_length    0.142167
unique_words           0.221977
lexical_diversity      0.108498
new_vocabulary         0.975031
pct_new_vocab          0.846217
dtype: float64

I wonder if there happens to be any correlation between sentence length and lexical diversity:

In [9]:
print(np.corrcoef(chapterdata.avg_sentence_length, chapterdata.lexical_diversity))
[[1.         0.09856164]
 [0.09856164 1.        ]]

No, there isn't. At least not a significant one. We can see this in a quick scatterplot:

In [10]:
sns.set(style="whitegrid")
sns.scatterplot('avg_sentence_length', 'lexical_diversity', data=chapterdata)
plt.ylabel('Lexical Diversity')
plt.xlabel('Average Sentence Length')
plt.show()

Finally, a serious graph. I went through several versions of this; this one's my favorite. I'm graphing the percent of each chapter's vocabulary that is new to that chapter. Next, I add labels to each bar, then annotate each with a list of the most common new words in that chapter.

In [11]:
MOST_COMMON_COUNT = 5
mycolor="#3E000C"


fig, ax = plt.subplots(figsize=(14, 25))
sns.set_style("whitegrid", {'axes.grid' : False})
sns.barplot('pct_new_vocab', chapterdata.index, data=chapterdata, orient='h', color=mycolor)

plt.grid(axis='x')

xax = ax.get_xaxis()
xax.set_label_position('top')
plt.xlabel("Percent New Vocabulary")
plt.ylabel('Chapter')
ax.xaxis.tick_top()

for chapnum in range(1, 58):
    chaplbl = str(chapnum)
    #barlbl = '%2.0f%% (%d)' % (chapterdata.pct_new_vocab[chapnum], chapterdata.new_vocabulary[chapnum])
    barlbl = '%2.1f%%' % chapterdata.pct_new_vocab[chapnum]
    plt.text(1, chapnum-1, barlbl,\
             weight='bold', color='white', verticalalignment='center')
    if chapnum >= 3:
        newwords = ', '.join(['"%s" (%d)'%word for word in newvocab[chapnum].most_common(MOST_COMMON_COUNT)])
        #the bbox param here gives the annotation a semitransparent white background to partially hide
        # the gridlines behind it.
        plt.text(chapterdata.pct_new_vocab[chapnum]+1, chapnum-1, newwords,\
                 verticalalignment='center', bbox=dict(facecolor='#ffffff99'))

plt.show()

Python, newspaper, and NLTK

I've been continuing my experimentation with NLTK, using the Python newspaper module. While newspaper seems mainly intended for making personalized Google News-like sites, it does some language processing to support this. Below, I continue where newspaper leaves off using NLTK and seaborn to explore an online text.

Getting Ready

In [1]:
import re # For some minor data cleaning

from IPython.core.display import display, HTML
import newspaper
import nltk
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

%matplotlib inline

I have several test sites in the cell below--sites are testy about allowing non-browser clients access and I've found I need to alternate while experimenting. Most of these sites work well, but newspaper often gives weird results from the Bing News search.

In [2]:
#paper = newspaper.build('https://www.providencejournal.com/')
#paper = newspaper.build('https://www.bing.com/news/search?q=providence')
#paper = newspaper.build('https://www.guardian.co.uk')
paper = newspaper.build('https://www.vulture.com/')
len(paper.articles)
Out[2]:
69

newspaper basics

Here I choose an article and download() and parse() it--which makes its data available. Then we can pull some basic details like the article's title, address, images, and text:

In [3]:
#We *can* do this, but I want to make sure I have a relatively-long article to use.
a = paper.articles[0]

#Via aldaily.com. Longer articles make more interesting graphs.
a = newspaper.Article('https://themillions.com/2020/03/on-pandemic-and-literature.html')
a.download()
a.parse()


display(HTML("<h3>%s</h3>" % a.title))
print(a.url)

#display(HTML('<img src="%s"/>'%a.top_image))
print(a.top_image) # also a.images, a.movies
print(a.text[:500])

On Pandemic and Literature

https://themillions.com/2020/03/on-pandemic-and-literature.html
https://themillions.com/wp-content/uploads/2020/03/1-2-870x1024.jpg
Less than a century after the Black Death descended into Europe and killed 75 million people—as much as 60 percent of the population (90% in some places) dead in the five years after 1347—an anonymous Alsatian engraver with the fantastic appellation of “Master of the Playing Cards” saw fit to depict St. Sebastian: the patron saint of plague victims. Making his name, literally, from the series of playing cards he produced at the moment when the pastime first became popular in Germany, the engrave

Our Article also gets a list of the authors of the text. I've found this tends to be the least accurate piece of newspaper's processing.

In [4]:
a.authors
Out[4]:
['Ed Simon',
 'Madeleine Monson-Rosen',
 'Ken Hines',
 'Kirsty Logan',
 'Patrick Brown',
 'Emily St. John Mandel',
 'Diksha Basu',
 'Sonya Chung',
 'Andrew Saikali']

NLP with newspaper

The .nlp() method gives us access to a summary of the text and a list of keywords. I haven't looked at the source closely enough to figure out how it's determining these, though the keywords are approximately the most common non-stopwords in the article.

In [5]:
a.nlp()

print(a.summary)
display(HTML('<hr/>'))
print(a.keywords)
There has always been literature of pandemic because there have always been pandemics.
What marks the literature of plague, pestilence, and pandemic is a commitment to try and forge if not some sense of explanation, than at least a sense of meaning out of the raw experience of panic, horror, and despair.
Narrative is an attempt to stave off meaninglessness, and in the void of the pandemic, literature serves the purpose of trying, however desperately, to stop the bleeding.
Pandemic literature exists not just to analyze the reasons for the pestilence—that may not even be its primary purpose.
The necessity of literature in the aftermath of pandemic is movingly illustrated in Emily St. John Mandel’s novel Station Eleven.

['pandemic', 'disease', 'narrative', 'sense', 'black', 'writes', 'plague', 'death', 'literature', 'world']

NLP with NLTK

That's what Newspaper can do for us. But since I had nltk installed already (and Newspaper requires it even if I hadn't), I can take this article's text and do some basic processing with it.

First I need to tokenize the text, breaking it into individual words and punctuation marks.

In [6]:
a.tokens = nltk.word_tokenize(a.text)
print(a.tokens[:50])
['Less', 'than', 'a', 'century', 'after', 'the', 'Black', 'Death', 'descended', 'into', 'Europe', 'and', 'killed', '75', 'million', 'people—as', 'much', 'as', '60', 'percent', 'of', 'the', 'population', '(', '90', '%', 'in', 'some', 'places', ')', 'dead', 'in', 'the', 'five', 'years', 'after', '1347—an', 'anonymous', 'Alsatian', 'engraver', 'with', 'the', 'fantastic', 'appellation', 'of', '“', 'Master', 'of', 'the', 'Playing']

Next, guess each token's part of speech, using NLTK's "off-the-shelf" English tagger. This returns a list of 2-tuples (token, tag), with tags drawn from the Penn Treebank tagset.

In [7]:
a.pos_tags = nltk.pos_tag(a.tokens)
a.pos_tags[:15]
Out[7]:
[('Less', 'JJR'),
 ('than', 'IN'),
 ('a', 'DT'),
 ('century', 'NN'),
 ('after', 'IN'),
 ('the', 'DT'),
 ('Black', 'NNP'),
 ('Death', 'NNP'),
 ('descended', 'VBD'),
 ('into', 'IN'),
 ('Europe', 'NNP'),
 ('and', 'CC'),
 ('killed', 'VBD'),
 ('75', 'CD'),
 ('million', 'CD')]

The Treebank tagset isn't particularly intuitive, especially if your last contact with English grammar was in middle school. Here's the help text for a few of the less-obvious tags above.

In [8]:
for pos in ['NNS', 'NNP', 'IN', 'DT', 'JJ']:
    nltk.help.upenn_tagset(pos)
NNS: noun, common, plural
    undergraduates scotches bric-a-brac products bodyguards facets coasts
    divestitures storehouses designs clubs fragrances averages
    subjectivists apprehensions muses factory-jobs ...
NNP: noun, proper, singular
    Motown Venneboerger Czestochwa Ranzer Conchita Trumplane Christos
    Oceanside Escobar Kreisler Sawyer Cougar Yvette Ervin ODI Darryl CTCA
    Shannon A.K.C. Meltex Liverpool ...
IN: preposition or conjunction, subordinating
    astride among uppon whether out inside pro despite on by throughout
    below within for towards near behind atop around if like until below
    next into if beside ...
DT: determiner
    all an another any both del each either every half la many much nary
    neither no some such that the them these this those
JJ: adjective or numeral, ordinal
    third ill-mannered pre-war regrettable oiled calamitous first separable
    ectoplasmic battery-powered participatory fourth still-to-be-named
    multilingual multi-disciplinary ...

I'll also tag the article using the "Universal" tagset--it has fewer tags, which makes for a simpler visualization later on.

In [9]:
a.upos_tags = nltk.pos_tag(a.tokens, tagset="universal")
a.upos_tags[:15]
Out[9]:
[('Less', 'ADJ'),
 ('than', 'ADP'),
 ('a', 'DET'),
 ('century', 'NOUN'),
 ('after', 'ADP'),
 ('the', 'DET'),
 ('Black', 'NOUN'),
 ('Death', 'NOUN'),
 ('descended', 'VERB'),
 ('into', 'ADP'),
 ('Europe', 'NOUN'),
 ('and', 'CONJ'),
 ('killed', 'VERB'),
 ('75', 'NUM'),
 ('million', 'NUM')]

We can also have NLTK calculate a frequency distribution of the words in our article--here I'll use it to show the most common 10 tokens, most of which you probably could have guessed:

In [10]:
a.word_freqs = nltk.FreqDist(word.lower() for word in a.tokens)
a.word_freqs.most_common(10)
Out[10]:
[('the', 310),
 (',', 262),
 ('of', 209),
 ('and', 112),
 ('.', 112),
 ('a', 88),
 ('to', 85),
 ('that', 80),
 ('’', 70),
 ('in', 61)]

Visualization

NLTK's FreqDist can also generate plots. Not great plots. Here's an example.

In [11]:
plt.figure(figsize=(12, 8))
a.word_freqs.plot(25)
plt.show()

Line graphs usually make me think "time series". This should probably be a bar plot, and we can do that. Start by translating our FreqDist object's data to a pandas DataFrame:

In [12]:
wfdf = pd.DataFrame(a.word_freqs.items(), columns=['token', 'frequency'])
wfdf.head()
Out[12]:
token frequency
0 less 2
1 than 13
2 a 88
3 century 4
4 after 9

We can now generate a Seaborn barplot of the token frequency data, which is largely unsurprising.

In [13]:
mycolor="#3E000C" #"#9A1C42"
plt.figure(figsize=(12, 8))
sns.set(style="whitegrid")
sns.barplot('frequency', 'token', data=wfdf.sort_values(by='frequency', ascending=False)[:25], color=mycolor)
plt.show()

We can make the result (arguably) more interesting by removing stopwords--very common words that don't affect the meaning of the text--from the frequency list. Here we get the stopwords for English.

In [14]:
sw = nltk.corpus.stopwords.words('english')
print(sw)
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn', "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't"]

And next, remove stopwords (and punctuation) from our word frequency list, and create a new DataFrame and a new graph.

In [15]:
# '\W' matches one "non-word character", i.e., not a letter, number or underscore.
wf2 = [word for word in a.word_freqs.items() if word[0] not in sw and not re.match('\W', word[0])]
wf2df = pd.DataFrame(wf2, columns=['token', 'frequency'])

plt.figure(figsize=(12, 8))
sns.barplot('frequency', 'token', data=wf2df.sort_values(by='frequency', ascending=False)[:25], color=mycolor)
plt.show()

This tells us more about the article. But what about the part-of-speech tags I generated earlier? Here's a function that will take those lists and generate a graph from them:

In [16]:
def posFreqGraph(tags):
    posfreqs = nltk.FreqDist(word[1] for word in tags)
    posfdf = pd.DataFrame(posfreqs.items(), columns=['pos', 'frequency'])
    plt.figure(figsize=(12, 8))
    sns.barplot('frequency', 'pos', data=posfdf.sort_values(by='frequency', ascending=False), color=mycolor)
    plt.show()

First, the graph of Penn Treebank tags. It's crowded--which shows us the richness of this tagset--but still readable. (Here's a complete list of these tags with meanings.)

In [17]:
posFreqGraph(a.pos_tags)

Here's the same visual built from the universal tagset data.

In [18]:
posFreqGraph(a.upos_tags)

Closing

I think that's a good start. Newspaper makes it easy to load web pages and get just the important text, which we can feed into NLTK for some analysis. Here I did only the very basics of an analysis with NLTK; I plan to experiment more over the next few weeks.

Simple Text Collation with Python

Although in the study of manuscript culture one of the characteristic activities is to align parallel parts of a work–and this is a common definition of collation–I speak of collation in the more narrow sense of identifying differences between texts (i.e., after passages are already “aligned”). There are three methods to collate texts: 1) read two texts side by side and note the differences, 2) compare printed page images (by allowing your eyes to merge two page images, often with a device especially for that purpose); 3) transcribe and compare transcriptions with aid of a computer.

Last semester I taught an introductory programming course to non-Computer Science graduate students at URI. My curriculum focused mostly on the Python data science toolset of Jupyter, pandas, and numpy, and using these tools to analyze the students' own datasets.

One student, an English Ph.D. candidate, asked for an alternative involving natural language processing tasks: performing a collation of two editions of a novel released about a decade apart. This kind of work was new to me, but seemed like a simple enough task for a beginning programmer to handle.

Reading PDFs

whymodel.png

My student had both editions of the book as PDFs (scanned from physical books, with embedded OCRed text). We explored two modules for extracting the text:

PyPDF2 was our first try. Its text extraction (getting a page with getPage() and calling extractText() on it) didn't include whitespace in its output, giving each page's text as a single long word, probably as a result of the PDF's internal formatting, as suggested by Ned Batchelder on StackOverflow. I suspect it would be simple enough to read each word and paste them together as needed, but it was easier to find another solution for PDF reading.
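
For reference, that first attempt looked roughly like this (a sketch using the older PyPDF2 API with PdfFileReader and extractText(); the filename is made up):

import PyPDF2

#Pull the embedded OCR text page by page (hypothetical filename).
with open('edition1.pdf', 'rb') as f:
    reader = PyPDF2.PdfFileReader(f)
    text = ''
    for pagenum in range(reader.getNumPages()):
        #For these scans the extracted text came back with almost no whitespace.
        text += reader.getPage(pagenum).extractText()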

PyMuPDF just worked, at least well enough for this use. It added unnecessary newlines, which would have been a problem if we were interested in paragraph breaks but wasn't an issue here. It also failed with one file's dropcaps, which was probably more an OCR/encoding issue. Here's an example of use (output on the right; the file 01.01_Why_Model.pdf is one of the readings for Scott Page's Model Thinking course on Coursera):

import fitz
pdf = fitz.open('01.01_Why_Model.pdf')
text = ''
for page in pdf:
    text += page.getText()

Text comparison with difflib

difflib HTML Table

difflib.HtmlDiff's output from comparing two simple strings.

It took me an embarrassing amount of time before I realized the tool we needed here was diff. Python's difflib was the ideal solution. It has a few basic options that easily produce machine-readable output (like the command-line app's) or an HTML table, but it can also produce more complex output with a little effort. Its HtmlDiff class worked perfectly for this.

The image to the right shows difflib's output from this code in a Jupyter window:

import difflib
from nltk import word_tokenize
from IPython.display import display, HTML

str1="This is my short string"
str2="This is another not long string"

words1 = word_tokenize(str1)
words2 = word_tokenize(str2)

hd = difflib.HtmlDiff()

HTML(hd.make_table(words1, words2))

Two other HtmlDiff options (displaying only the differences in context, and limiting the number of lines of context) were ideal for this case--we don't need to show the entire book just to print a relative handful of differences. For example, the following shows only the changes, with three words of context around each (our "lines" here are individual word tokens):

HTML(hd.make_table(words1, words2, context=True, numlines=3))

Closing

Once difflib's HTML output was generated, the rest of the student's work on this project was reading through the table, identifying individual changes as 'substantive' or 'accidental', and tabulating them. But there's more we could do with Python to simplify this or enrich the final output, for example:

  • Identify changes where a single punctuation mark was changed to another--many of these were probably either typos or OCR errors. (A sketch of this idea follows the list.)
  • Do part-of-speech tagging on the books' text and include this data in the output--did the author systematically remove adjectives and adverbs, or is there some other trend?
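
For example, the first idea could start from difflib.SequenceMatcher's opcodes rather than the HTML output. Here's a minimal sketch, assuming words1 and words2 hold the word-tokenized texts of the two editions (tokenized as in the short example above):

import difflib
import string

def punctuation_only_changes(words1, words2):
    #Yield (position, old, new) for 'replace' opcodes where both sides are a
    # single punctuation token--likely candidates for typos or OCR errors.
    sm = difflib.SequenceMatcher(None, words1, words2)
    for tag, i1, i2, j1, j2 in sm.get_opcodes():
        if tag != 'replace':
            continue
        old, new = words1[i1:i2], words2[j1:j2]
        if (len(old) == 1 and len(new) == 1
                and old[0] in string.punctuation
                and new[0] in string.punctuation):
            yield i1, old[0], new[0]

for pos, old, new in punctuation_only_changes(words1, words2):
    print('token %d: %r -> %r' % (pos, old, new))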

Pay the library staff well

Thus, money rules the world. It determines the status of men as well as the value of the services rendered by them. Unfortunately, people are prepared to benefit by a service only in proportion to the value set on it by money. Thus, a famished staff will render the efforts of the First Law as futile as paucity of books or paucity of readers. In the trinity of the library–books, staff, and readers–the richness of the staff in worldly goods appears to be as necessary as the richness of the other two in number and variety, if the law ‘Books Are for Use’ is to be translated into practice. It will have to be so, so long as men’s status is left to the capricious and arbitrary rule of Mammon. 'Therefore, pay the library staff well,’ says the First Law.

Thinking in Systems

Hunger, poverty, environmental degradation, economic instability, unemployment, chronic disease, drug addiction, and war, for example, persist in spite of the analytical ability and technical brilliance that have been directed toward eradicating them. No one deliberately creates those problems, no one wants them to persist, but they persist nonetheless. That is because they are intrinsically systems problems–undesirable behaviors characteristic of the system structures that produce them. They will yield only as we reclaim our intuition, stop casting blame, see the system as the source of its own problems, and find the courage and wisdom to restructure it.

Because otherwise I'll forget how to do this before I have to do it again.

Pandoc with the --self-contained option will convert your images into Base64 and embed them as data: URIs. For example:

pandoc -o blogpost.html --self-contained --metadata pagetitle="My Blog Post" .\blogpost.md

Note that this generates a complete web page instead of a fragment. If you don’t want that, you can use a custom template. A file with nothing in it but $body$ should work fine.
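
For example, something like this should work (body-only.html being a hypothetical template file containing only the line $body$):

pandoc -o blogpost.html --self-contained --template=body-only.html .\blogpost.md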

(NB: Use this sparingly–it’s an inefficient way of encoding binary data.)

Raised by Wolves

Older article (2006) that uses an analogy I love:

Academic libraries now hire an increasing number of individuals to fill professional librarian positions who do not have the master’s degree in library science….

Historically, the shared graduate educational experience has provided a standard preparation and socialization into the library profession. The new professional groups have been ‘raised’ in other environments and bring to the academic library a ‘feral’ set of values, outlooks, styles, and expectations.

The Brown University Library hires a fair number of “feral professionals”–it was interesting to find out this isn’t a relatively-new issue.

Stanley Wilder, writing in The Chronicle of Higher Education also elaborated on this, mentioning the relative youth and high salaries of the “ferals”:

For example, people in nontraditional positions accounted for 23 percent of the professionals at research libraries in 2005, compared to just 7 percent in 1985.

But the most compelling aspect of the nontraditional population is its youth: 39 percent of library professionals under 35 work in such nontraditional jobs, compared with only 21 percent of those 35 and older….

Within the under-35 population, 24 percent of nontraditional library employees earn $54,000 or more, compared to just 7 percent of those in traditional positions. Our profession has no precedent for the existence of so large a cohort of young employees who begin their careers at salaries approaching those of established middle managers.

Wilder’s closing describes the issue I’m currently looking at:

The libraries that thrive in the coming years will be those that apply the full range of nontraditional expertise in the service of those timeless values, and not the other way around.

A chart

A chart describing book properties

"A chart showing the percentage of excellence in the physical properties of books published since 1910."

Does Economics Make Politicians Corrupt?

All these findings correspond with a substantial body of research in the economic literature, which, with the help of surveys, laboratory experiments, as well as field experiments showed that those who learn about markets (economists) or act in markets (businessmen) are lacking in … ‘pro-social behavior’ … I also use corruption as a proxy to show whether there are any differences in pro-social behavior between economists and non-economists, but unlike them, I observe behavior outside the artificial situation of a laboratory. By analyzing real world data of the U.S. Congress, I found that politicians holding a degree in economics are significantly more prone to engage in corrupt practices.

About Me

Developer at Brown University Library specializing in instructional design and technology, Python-based data science, and XML-driven web development.
