More Natural Language Processing

This week's NLP experimentation involves Project Gutenberg's plain-text edition of The Pickwick Papers. I parsed the text into individual chapters, calculated some summary statistics about each, and then built a visualization of each chapter's new vocabulary.

Setup

A few imports to start with: NLTK, Seaborn, Requests, and the standard Python data science toolkit.

In [1]:
from collections import OrderedDict
import re

import nltk
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from requests import get
import seaborn as sns

%matplotlib inline
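
The notebook also relies on a couple of NLTK data packages (the Punkt sentence tokenizer and the English stopword list). If they aren't already installed, a one-time download along these lines takes care of it:

#One-time downloads for nltk.word_tokenize, nltk.sent_tokenize, and nltk.corpus.stopwords.
nltk.download('punkt')
nltk.download('stopwords')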

First I grab Project Gutenberg's plain-text edition of The Pickwick Papers, then strip out (using split() and list indexing) Gutenberg's front- and backmatter and the table of contents to get just the book's text.

Also, Gutenberg's plain-text version uses underscores to indicate italics. I don't need those so I remove them here.

In [2]:
rq = get('https://www.gutenberg.org/files/580/580-0.txt')

#The text is UTF-8, not the encoding requests assumes
rq.encoding = 'utf8'

pickwick = rq.text
pickwick = pickwick.split('THE POSTHUMOUS PAPERS OF THE PICKWICK CLUB')[2]
pickwick = pickwick.split('End of the Project Gutenberg EBook of The Pickwick Papers')[0]
pickwick = pickwick.replace('_', '')
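
A quick, purely optional sanity check confirms the slicing worked: Gutenberg's header and the table of contents should be gone, and the Chapter I heading should show up near the top of what's left.

#The first chapter heading should appear within the first few paragraphs.
print(pickwick[:500])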

Next, I break the text into paragraphs, then use enumerate() and filter() to get the indexes of the chapter headings (conveniently, they're the only "paragraphs" that start with the string "CHAPTER "). This gives me a list like this:

[(2, 'CHAPTER I. THE PICKWICKIANS'), 
(26, 'CHAPTER II. THE FIRST DAY’S JOURNEY, AND THE FIRST EVENING’S ADVENTURES;\r\nWITH THEIR CONSEQUENCES'),
...]

I then cycle through this list to locate the text of each chapter. The chapter texts are stored in an OrderedDict, with the chapter titles used as keys.

Then I delete my _chapterheads list, since I no longer need it and want to limit the number of copies of this book I have stored in memory.

In [3]:
paragraphs = pickwick.split('\r\n\r\n')

_chapterheads = list(filter(lambda x: x[1].startswith('CHAPTER '), enumerate(paragraphs)))

chapters = OrderedDict()
for i in range(len(_chapterheads)-1):
    ch = _chapterheads[i]
    nextch = _chapterheads[i+1]
    chapters[ch[1]] = list(filter(lambda x: x.strip() != '', paragraphs[ch[0]+1:nextch[0]]))
    
#The loop above stops one short; the last chapter runs from its heading to the end of the book.
ch = _chapterheads[i+1]
chapters[ch[1]] = list(filter(lambda x: x.strip() != '', paragraphs[ch[0]+1:]))

del(_chapterheads)
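
The Pickwick Papers has 57 chapters, so a quick check (assuming the cells above have run) confirms the split found all of them:

#Should print 57, plus the first chapter's title.
print(len(chapters))
print(list(chapters.keys())[0])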

Next, I need lists of tokens:

  • First, a list of tokens, both words and punctuation marks, for each chapter.
  • Then, a copy of that first list, lowercased and with punctuation removed.
  • Finally, a third list with each chapter split into sentences.

In [4]:
_chaptertexts = ['\n'.join(x) for x in chapters.values()]
_booktext = '\n\n'.join(_chaptertexts)

chaptertokens = [nltk.word_tokenize(chap) for chap in _chaptertexts]
chapterwords = [[w.lower() for w in chap if not re.match(r'\W', w)] for chap in chaptertokens]
chaptersentences = [nltk.sent_tokenize(chap) for chap in _chaptertexts]

del(_chaptertexts)
del(_booktext)

Next, I copy the lists from chapterwords into sets, giving me the unique words in each chapter; I'll use that soon. Then I make a copy of this with stopwords removed, and a frequency distribution of the words in each chapter.

In [5]:
sw = nltk.corpus.stopwords.words('english')

chaptervocabsw = [set(chap) for chap in chapterwords]
chaptervocab = [set([word for word in chap if word not in sw]) for chap in chaptervocabsw]

#Remember we're not counting punctuation here.
chapterfreq = [nltk.FreqDist(chap) for chap in chapterwords]

Now I find each chapter's new vocabulary: the words that appear in that chapter but in none of the chapters before it. For each chapter, I take the set difference between its vocabulary and the union of every earlier chapter's vocabulary, then keep the frequency counts for just those words.

In [6]:
newvocab = []
for c in range(57):
    #Words in this chapter that appear in no earlier chapter.
    newvoc = chaptervocabsw[c].difference(*chaptervocabsw[:c])
    #Keep the frequency counts for just those new words.
    newvocab.append(nltk.FreqDist(dict([x for x in chapterfreq[c].items() if x[0] in newvoc])))

#Make newvocab 1-indexed, so newvocab[1] is Chapter I.
newvocab = dict(enumerate(newvocab, start=1))
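
As a spot check, every word in the first chapter should count as new, since nothing precedes it, and most_common() gives a quick preview of a later chapter's new words:

#Chapter I's new vocabulary is its entire vocabulary.
print(len(newvocab[1]) == len(chaptervocabsw[0]))
#A few of the most frequent words that first appear in Chapter II.
print(newvocab[2].most_common(5))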

Now I use pandas to build a DataFrame of potentially interesting statistics. This is done with a complex-looking list comprehension that generates an 8-tuple describing each chapter; the DataFrame constructor interprets the list of tuples as a table.

In [7]:
chapterdata = pd.DataFrame([
                (
                    len(chapterwords[x-1]),
                    np.mean([len(tok) for tok in chaptertokens[x-1] if not re.match(r'\W', tok)]),
                    len(chaptersentences[x-1]),
                    len(chapterwords[x-1]) / len(chaptersentences[x-1]),
                    len(chaptervocabsw[x-1]),
                    len(chapterwords[x-1])/len(chaptervocabsw[x-1]),
                    len(newvocab[x]),
                    len(newvocab[x])/len(chaptervocabsw[x-1])*100,
                )
            for x in range(1,58)], 
            index=pd.Index(range(1,58), name='chapter'),
            columns=['word_count', 'avg_word_length', 'sentence_count', 'avg_sentence_length',\
                     'unique_words', 'lexical_diversity', 'new_vocabulary', 'pct_new_vocab'])

chapterdata.to_csv('~/data/nlp/pickwick_details.csv')
chapterdata
Out[7]:
word_count avg_word_length sentence_count avg_sentence_length unique_words lexical_diversity new_vocabulary pct_new_vocab
chapter
1 1774 4.929538 79 22.455696 705 2.516312 705 100.000000
2 9888 4.644114 391 25.289003 2441 4.050799 2087 85.497747
3 4650 4.440215 167 27.844311 1421 3.272343 682 47.994370
4 4657 4.518145 165 28.224242 1402 3.321683 555 39.586305
5 3719 4.536166 139 26.755396 1227 3.030970 458 37.326813
6 5969 4.388005 211 28.289100 1639 3.641855 616 37.583893
7 5322 4.558812 210 25.342857 1632 3.261029 536 32.843137
8 4678 4.478623 213 21.962441 1407 3.324805 401 28.500355
9 3305 4.384266 155 21.322581 1059 3.120869 271 25.590179
10 5407 4.240244 215 25.148837 1516 3.566623 441 29.089710
11 7350 4.334830 335 21.940299 1953 3.763441 501 25.652842
12 2205 4.503855 87 25.344828 786 2.805344 163 20.737913
13 7048 4.539728 232 30.379310 1841 3.828354 492 26.724606
14 6893 4.233425 256 26.925781 1652 4.172518 363 21.973366
15 5123 4.523131 203 25.236453 1487 3.445192 348 23.402824
16 7265 4.276944 299 24.297659 1791 4.056393 404 22.557231
17 3551 4.391157 83 42.783133 1038 3.421002 179 17.244701
18 3857 4.361162 162 23.808642 1159 3.327869 183 15.789474
19 5325 4.328263 223 23.878924 1471 3.619986 265 18.014956
20 6411 4.262674 222 28.878378 1591 4.029541 305 19.170333
21 7341 4.301594 259 28.343629 1890 3.884127 324 17.142857
22 6213 4.353774 252 24.654762 1595 3.895298 255 15.987461
23 3301 4.182672 129 25.589147 1022 3.229941 138 13.502935
24 5787 4.592189 213 27.169014 1564 3.700128 258 16.496164
25 7100 4.404930 303 23.432343 1707 4.159344 259 15.172818
26 2460 4.357317 93 26.451613 787 3.125794 83 10.546379
27 3754 4.329249 134 28.014925 1162 3.230637 170 14.629948
28 8926 4.336993 247 36.137652 2175 4.103908 370 17.011494
29 4167 4.404848 121 34.438017 1241 3.357776 162 13.053989
30 4309 4.444419 175 24.622857 1283 3.358535 156 12.159002
31 6131 4.377100 216 28.384259 1643 3.731589 239 14.546561
32 5501 4.392111 193 28.502591 1514 3.633421 189 12.483487
33 6364 4.395663 201 31.661692 1817 3.502477 326 17.941662
34 9501 4.498579 300 31.670000 2005 4.738653 290 14.463840
35 5980 4.515050 277 21.588448 1704 3.509390 255 14.964789
36 4599 4.404218 170 27.052941 1419 3.241015 190 13.389711
37 5099 4.269857 186 27.413978 1358 3.754786 168 12.371134
38 5395 4.413160 223 24.192825 1562 3.453905 197 12.612036
39 6010 4.363894 219 27.442922 1555 3.864952 174 11.189711
40 5046 4.373365 189 26.698413 1396 3.614613 169 12.106017
41 5237 4.344090 172 30.447674 1519 3.447663 182 11.981567
42 5609 4.433945 228 24.600877 1628 3.445332 177 10.872236
43 5086 4.311050 217 23.437788 1478 3.441137 203 13.734777
44 5415 4.163250 212 25.542453 1441 3.757807 141 9.784872
45 6474 4.343682 234 27.666667 1785 3.626891 226 12.661064
46 3810 4.359580 171 22.280702 1068 3.567416 92 8.614232
47 4644 4.394488 167 27.808383 1333 3.483871 100 7.501875
48 5029 4.320143 175 28.737143 1376 3.654797 108 7.848837
49 7360 4.235462 260 28.307692 1725 4.266667 208 12.057971
50 5757 4.496265 194 29.675258 1585 3.632177 149 9.400631
51 5530 4.520615 192 28.802083 1681 3.289709 207 12.314099
52 4648 4.225904 147 31.619048 1359 3.420162 143 10.522443
53 4773 4.432642 192 24.859375 1353 3.527716 109 8.056171
54 5806 4.313469 223 26.035874 1443 4.023562 103 7.137907
55 4820 4.363900 179 26.927374 1414 3.408769 153 10.820368
56 4578 4.262342 196 23.357143 1199 3.818182 66 5.504587
57 2774 4.664023 81 34.246914 1000 2.774000 77 7.700000

Skimming the data, it looks like there's not much variation in word length per chapter, but quite a bit more in sentence length and lexical diversity (here, the ratio of a chapter's total word count to its number of unique words). We can quickly verify this with a simple calculation on the DataFrame, dividing each column's standard deviation by its mean.

In [8]:
chapterdata.std()/chapterdata.mean()
Out[8]:
word_count             0.303518
avg_word_length        0.029765
sentence_count         0.309125
avg_sentence_length    0.142167
unique_words           0.221977
lexical_diversity      0.108498
new_vocabulary         0.975031
pct_new_vocab          0.846217
dtype: float64

I wonder if there happens to be any correlation between sentence length and lexical diversity:

In [9]:
print(np.corrcoef(chapterdata.avg_sentence_length, chapterdata.lexical_diversity))
[[1.         0.09856164]
 [0.09856164 1.        ]]

No, there isn't; a correlation coefficient of about 0.1 is negligible. We can see this in a quick scatterplot:

In [10]:
sns.set(style="whitegrid")
sns.scatterplot(x='avg_sentence_length', y='lexical_diversity', data=chapterdata)
plt.ylabel('Lexical Diversity')
plt.xlabel('Average Sentence Length')
plt.show()
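
For a slightly more formal check, a significance test can back this up. Here's an optional sketch using SciPy's pearsonr (SciPy isn't imported above, so it's an extra dependency):

from scipy.stats import pearsonr

#Pearson's r and its two-sided p-value for the same two columns.
r, p = pearsonr(chapterdata.avg_sentence_length, chapterdata.lexical_diversity)
print(r, p)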

Finally, a serious graph. I went through several versions of this; this one's my favorite. I'm graphing the percentage of each chapter's vocabulary that is new to that chapter, labeling each bar with that percentage, and annotating it with the most common new words introduced in that chapter.

In [11]:
MOST_COMMON_COUNT = 5
mycolor="#3E000C"


fig, ax = plt.subplots(figsize=(14, 25))
sns.set_style("whitegrid", {'axes.grid' : False})
sns.barplot(x='pct_new_vocab', y=chapterdata.index, data=chapterdata, orient='h', color=mycolor)

plt.grid(axis='x')

xax = ax.get_xaxis()
xax.set_label_position('top')
plt.xlabel("Percent New Vocabulary")
plt.ylabel('Chapter')
ax.xaxis.tick_top()

for chapnum in range(1, 58):
    chaplbl = str(chapnum)
    #barlbl = '%2.0f%% (%d)' % (chapterdata.pct_new_vocab[chapnum], chapterdata.new_vocabulary[chapnum])
    barlbl = '%2.1f%%' % chapterdata.pct_new_vocab[chapnum]
    plt.text(1, chapnum-1, barlbl,\
             weight='bold', color='white', verticalalignment='center')
    #Skip the first two chapters, whose long bars leave no room for an annotation beside them.
    if chapnum >= 3:
        newwords = ', '.join(['"%s" (%d)'%word for word in newvocab[chapnum].most_common(MOST_COMMON_COUNT)])
        #the bbox param here gives the annotation a semitransparent white background to partially hide
        # the gridlines behind it.
        plt.text(chapterdata.pct_new_vocab[chapnum]+1, chapnum-1, newwords,\
                 verticalalignment='center', bbox=dict(facecolor='#ffffff99'))

plt.show()
