Python, newspaper, and NLTK

I've been continuing my experimentation with NLTK, using the Python newspaper module. While newspaper seems mainly intended for making personalized Google News-like sites, it does some language processing to support this. Below, I continue where newspaper leaves off using NLTK and seaborn to explore an online text.

Getting Ready

In [1]:
import re # For some minor data cleaning

from IPython.core.display import display, HTML
import newspaper
import nltk
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

%matplotlib inline

I have several test sites in the cell below--sites can be testy about allowing non-browser clients access, and I've found I need to alternate between them while experimenting. Most of these sites work well, but newspaper often gives weird results from the Bing News search.
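One common workaround for sites that block non-browser clients is to send a browser-like User-Agent header; newspaper supports this via its Config object's browser_user_agent attribute, which you can pass to newspaper.build(url, config=config). As a minimal standard-library sketch of the underlying idea (the function name and UA string here are illustrative, not anything newspaper uses):

```python
from urllib.request import Request

def browserish_request(url):
    """Build a Request that presents a browser-like User-Agent,
    which some sites require before they will serve article HTML."""
    return Request(url, headers={
        "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36"
    })

req = browserish_request("https://example.com/article")
print(req.get_header("User-agent"))
```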

In [2]:
#paper = newspaper.build('')
#paper = newspaper.build('')
#paper = newspaper.build('')
paper = newspaper.build('')  # site URLs elided

newspaper basics

Here I choose an article and download() and parse() it--which makes its data available. Then we can pull some basic details like the article's title, address, images, and text:

In [3]:
# We *can* do this, but I want to make sure I have a relatively long article to use.
a = paper.articles[0]

# Via a link (URL elided). Longer articles make more interesting graphs.
a = newspaper.Article('')
a.download()
a.parse()

display(HTML("<h3>%s</h3>" % a.title))

# display(HTML('<img src="%s"/>' % a.top_image))
print(a.top_image)  # also a.images, a.movies

On Pandemic and Literature
Less than a century after the Black Death descended into Europe and killed 75 million people—as much as 60 percent of the population (90% in some places) dead in the five years after 1347—an anonymous Alsatian engraver with the fantastic appellation of “Master of the Playing Cards” saw fit to depict St. Sebastian: the patron saint of plague victims. Making his name, literally, from the series of playing cards he produced at the moment when the pastime first became popular in Germany, the engrave

Our Article object also carries a list of the text's authors. I've found this tends to be the least accurate piece of newspaper's processing.

In [4]:
a.authors
['Ed Simon',
 'Madeleine Monson-Rosen',
 'Ken Hines',
 'Kirsty Logan',
 'Patrick Brown',
 'Emily St. John Mandel',
 'Diksha Basu',
 'Sonya Chung',
 'Andrew Saikali']

NLP with newspaper

The .nlp() method gives us access to a summary of the text and a list of keywords. I haven't looked at the source closely enough to figure out how it's determining these, though the keywords are approximately the most common non-stopwords in the article.

In [5]:
a.nlp()
print(a.summary)
a.keywords
There has always been literature of pandemic because there have always been pandemics.
What marks the literature of plague, pestilence, and pandemic is a commitment to try and forge if not some sense of explanation, than at least a sense of meaning out of the raw experience of panic, horror, and despair.
Narrative is an attempt to stave off meaninglessness, and in the void of the pandemic, literature serves the purpose of trying, however desperately, to stop the bleeding.
Pandemic literature exists not just to analyze the reasons for the pestilence—that may not even be its primary purpose.
The necessity of literature in the aftermath of pandemic is movingly illustrated in Emily St. John Mandel’s novel Station Eleven.

['pandemic', 'disease', 'narrative', 'sense', 'black', 'writes', 'plague', 'death', 'literature', 'world']
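I haven't verified this against newspaper's source, but the keyword list above looks close to what a plain frequency count of non-stopword tokens would produce. Here's a rough standard-library sketch of that idea--the stopword list and the rough_keywords function are my own illustrative stand-ins, not newspaper's actual algorithm:

```python
import re
from collections import Counter

# A tiny illustrative stopword list -- NLTK's full English list is much longer.
STOPWORDS = {"the", "of", "and", "a", "to", "that", "in", "is", "there",
             "been", "always", "because", "have", "has", "not", "its"}

def rough_keywords(text, n=10):
    """Approximate keywords as the most frequent non-stopword word tokens.
    This is only a guess at newspaper's approach, not its real implementation."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(w for w in words if w not in STOPWORDS)
    return [w for w, _ in counts.most_common(n)]

sample = ("There has always been literature of pandemic "
          "because there have always been pandemics.")
print(rough_keywords(sample, 3))  # -> ['literature', 'pandemic', 'pandemics']
```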


That's what newspaper can do for us. But since I had NLTK installed already (and newspaper requires it anyway), I can take this article's text and do some basic processing with it.

First I need to tokenize the text, breaking it into individual words and punctuation marks.

In [6]:
a.tokens = nltk.word_tokenize(a.text)
a.tokens[:50]
['Less', 'than', 'a', 'century', 'after', 'the', 'Black', 'Death', 'descended', 'into', 'Europe', 'and', 'killed', '75', 'million', 'people—as', 'much', 'as', '60', 'percent', 'of', 'the', 'population', '(', '90', '%', 'in', 'some', 'places', ')', 'dead', 'in', 'the', 'five', 'years', 'after', '1347—an', 'anonymous', 'Alsatian', 'engraver', 'with', 'the', 'fantastic', 'appellation', 'of', '“', 'Master', 'of', 'the', 'Playing']

Next, guess each token's part of speech, using NLTK's "off-the-shelf" English tagger. This returns a list of 2-tuples (token, tag), with tags drawn from the Penn Treebank tagset.

In [7]:
a.pos_tags = nltk.pos_tag(a.tokens)
a.pos_tags[:15]
[('Less', 'JJR'),
 ('than', 'IN'),
 ('a', 'DT'),
 ('century', 'NN'),
 ('after', 'IN'),
 ('the', 'DT'),
 ('Black', 'NNP'),
 ('Death', 'NNP'),
 ('descended', 'VBD'),
 ('into', 'IN'),
 ('Europe', 'NNP'),
 ('and', 'CC'),
 ('killed', 'VBD'),
 ('75', 'CD'),
 ('million', 'CD')]

The Treebank tagset isn't particularly intuitive, especially if your last contact with English grammar was in middle school. Here's the help text for a few of the less-obvious tags above.

In [8]:
for pos in ['NNS', 'NNP', 'IN', 'DT', 'JJ']:
    nltk.help.upenn_tagset(pos)
NNS: noun, common, plural
    undergraduates scotches bric-a-brac products bodyguards facets coasts
    divestitures storehouses designs clubs fragrances averages
    subjectivists apprehensions muses factory-jobs ...
NNP: noun, proper, singular
    Motown Venneboerger Czestochwa Ranzer Conchita Trumplane Christos
    Oceanside Escobar Kreisler Sawyer Cougar Yvette Ervin ODI Darryl CTCA
    Shannon A.K.C. Meltex Liverpool ...
IN: preposition or conjunction, subordinating
    astride among uppon whether out inside pro despite on by throughout
    below within for towards near behind atop around if like until below
    next into if beside ...
DT: determiner
    all an another any both del each either every half la many much nary
    neither no some such that the them these this those
JJ: adjective or numeral, ordinal
    third ill-mannered pre-war regrettable oiled calamitous first separable
    ectoplasmic battery-powered participatory fourth still-to-be-named
    multilingual multi-disciplinary ...

I'll also tag the article using the "Universal" tagset--it has fewer tags, which makes for a simpler visualization later on.

In [9]:
a.upos_tags = nltk.pos_tag(a.tokens, tagset="universal")
a.upos_tags[:15]
[('Less', 'ADJ'),
 ('than', 'ADP'),
 ('a', 'DET'),
 ('century', 'NOUN'),
 ('after', 'ADP'),
 ('the', 'DET'),
 ('Black', 'NOUN'),
 ('Death', 'NOUN'),
 ('descended', 'VERB'),
 ('into', 'ADP'),
 ('Europe', 'NOUN'),
 ('and', 'CONJ'),
 ('killed', 'VERB'),
 ('75', 'NUM'),
 ('million', 'NUM')]

We can also have NLTK calculate a frequency distribution of the words in our article--here I'll use it to show the 10 most common tokens, most of which you probably could have guessed:

In [10]:
a.word_freqs = nltk.FreqDist(word.lower() for word in a.tokens)
a.word_freqs.most_common(10)
[('the', 310),
 (',', 262),
 ('of', 209),
 ('and', 112),
 ('.', 112),
 ('a', 88),
 ('to', 85),
 ('that', 80),
 ('’', 70),
 ('in', 61)]


NLTK's FreqDist can also generate plots. Not great plots. Here's an example.

In [11]:
plt.figure(figsize=(12, 8))
a.word_freqs.plot(25)  # line plot of the most common tokens

Line graphs usually make me think "time series". This should probably be a bar plot, and we can do that. Start by translating our FreqDist object's data to a pandas DataFrame:

In [12]:
wfdf = pd.DataFrame(a.word_freqs.items(), columns=['token', 'frequency'])
wfdf.head()
     token  frequency
0     less          2
1     than         13
2        a         88
3  century          4
4    after          9

We can now generate a Seaborn barplot of the token frequency data, which is largely unsurprising.

In [13]:
mycolor="#3E000C" #"#9A1C42"
plt.figure(figsize=(12, 8))
sns.barplot(x='frequency', y='token', data=wfdf.sort_values(by='frequency', ascending=False)[:25], color=mycolor)

We can make the result (arguably) more interesting by removing stopwords--very common words that carry little meaning on their own--from the frequency list. Here we get the stopwords for English.

In [14]:
sw = nltk.corpus.stopwords.words('english')
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn', "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't"]

And next, remove stopwords (and punctuation) from our word frequency list, and create a new DataFrame and a new graph.

In [15]:
# r'\W' matches one "non-word character", i.e., not a letter, number, or underscore.
wf2 = [word for word in a.word_freqs.items() if word[0] not in sw and not re.match(r'\W', word[0])]
wf2df = pd.DataFrame(wf2, columns=['token', 'frequency'])

plt.figure(figsize=(12, 8))
sns.barplot(x='frequency', y='token', data=wf2df.sort_values(by='frequency', ascending=False)[:25], color=mycolor)

This tells us more about the article. But what about the part-of-speech tags I generated earlier? Here's a function that will take those lists and generate a graph from them:

In [16]:
def posFreqGraph(tags):
    posfreqs = nltk.FreqDist(word[1] for word in tags)
    posfdf = pd.DataFrame(posfreqs.items(), columns=['pos', 'frequency'])
    plt.figure(figsize=(12, 8))
    sns.barplot(x='frequency', y='pos', data=posfdf.sort_values(by='frequency', ascending=False), color=mycolor)

First, the graph of Penn Treebank tags. It's crowded--which shows us the richness of this tagset--but still readable. (Here's a complete list of these tags with meanings.)

In [17]:
posFreqGraph(a.pos_tags)

Here's the same visual built from the universal tagset data.

In [18]:
posFreqGraph(a.upos_tags)


I think that's a good start. newspaper makes it easy to load web pages and extract just the important text, which we can feed into NLTK for some analysis. Here I did only the very basics of an analysis with NLTK; I plan to experiment more over the next few weeks.

About Me

Developer at Brown University Library specializing in instructional design and technology, Python-based data science, and XML-driven web development.