I've been continuing my experimentation with NLTK, using the Python newspaper module. While newspaper seems mainly intended for making personalized Google News-like sites, it does some language processing to support this. Below, I continue where newspaper leaves off using NLTK and seaborn to explore an online text.
import re  # For some minor data cleaning
from IPython.core.display import display, HTML
import newspaper
import nltk
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
%matplotlib inline
I have several test sites in the cell below. Sites are testy about allowing non-browser clients access, and I've found I need to alternate between them while experimenting. Most of these sites work well, but newspaper often gives weird results for the Bing News search.
#paper = newspaper.build('https://www.providencejournal.com/')
#paper = newspaper.build('https://www.bing.com/news/search?q=providence')
#paper = newspaper.build('https://www.guardian.co.uk')
paper = newspaper.build('https://www.vulture.com/')
len(paper.articles)
Here I choose an article, download() it, and parse() it--which makes its data available. Then we can pull some basic details like the article's title, address, images, and text:
#We *can* do this, but I want to make sure I have a relatively-long article to use.
#a = paper.articles[0]

#Via aldaily.com. Longer articles make more interesting graphs.
a = newspaper.Article('https://themillions.com/2020/03/on-pandemic-and-literature.html')
a.download()
a.parse()

display(HTML("<h3>%s</h3>" % a.title))
print(a.url)
#display(HTML('<img src="%s"/>' % a.top_image))
print(a.top_image)  # also a.images, a.movies
print(a.text[:500])
On Pandemic and Literature
https://themillions.com/2020/03/on-pandemic-and-literature.html https://themillions.com/wp-content/uploads/2020/03/1-2-870x1024.jpg Less than a century after the Black Death descended into Europe and killed 75 million people—as much as 60 percent of the population (90% in some places) dead in the five years after 1347—an anonymous Alsatian engraver with the fantastic appellation of “Master of the Playing Cards” saw fit to depict St. Sebastian: the patron saint of plague victims. Making his name, literally, from the series of playing cards he produced at the moment when the pastime first became popular in Germany, the engrave
Article also gets a list of the authors of the text. I've found this tends to be the least accurate piece of data newspaper extracts:

a.authors
['Ed Simon', 'Madeleine Monson-Rosen', 'Ken Hines', 'Kirsty Logan', 'Patrick Brown', 'Emily St. John Mandel', 'Diksha Basu', 'Sonya Chung', 'Andrew Saikali']
The .nlp() method gives us access to a summary of the text and a list of keywords. I haven't looked at the source closely enough to figure out how it determines these, though the keywords are approximately the most common non-stopwords in the article.
a.nlp()
print(a.summary)
display(HTML('<hr/>'))
print(a.keywords)
There has always been literature of pandemic because there have always been pandemics. What marks the literature of plague, pestilence, and pandemic is a commitment to try and forge if not some sense of explanation, than at least a sense of meaning out of the raw experience of panic, horror, and despair. Narrative is an attempt to stave off meaninglessness, and in the void of the pandemic, literature serves the purpose of trying, however desperately, to stop the bleeding. Pandemic literature exists not just to analyze the reasons for the pestilence—that may not even be its primary purpose. The necessity of literature in the aftermath of pandemic is movingly illustrated in Emily St. John Mandel’s novel Station Eleven.
['pandemic', 'disease', 'narrative', 'sense', 'black', 'writes', 'plague', 'death', 'literature', 'world']
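Newspaper doesn't document how it picks keywords, but the list above behaves much like a tally of the most frequent non-stopword tokens. Here's a rough sketch of that idea in plain Python (the mini stopword list and function name are mine for illustration, not Newspaper's actual algorithm):

```python
import re
from collections import Counter

def rough_keywords(text, stopwords, n=5):
    """Approximate keyword extraction: the most common non-stopword tokens."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(t for t in tokens if t not in stopwords)
    return [word for word, _ in counts.most_common(n)]

stop = {"the", "of", "and", "a", "to", "in", "there", "is", "been", "has",
        "always", "because", "have"}
sample = ("There has always been literature of pandemic because there "
          "have always been pandemics.")
print(rough_keywords(sample, stop))
```

Newspaper's real implementation is more involved, but frequency of content words gets you most of the way to a list like the one above.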
That's what Newspaper can do for us. But since I had nltk installed already (and Newspaper requires it even if I hadn't), I can take this article's text and do some basic processing with it.
First I need to tokenize the text, breaking it into individual words and punctuation marks.
a.tokens = nltk.word_tokenize(a.text)
print(a.tokens[:50])
['Less', 'than', 'a', 'century', 'after', 'the', 'Black', 'Death', 'descended', 'into', 'Europe', 'and', 'killed', '75', 'million', 'people—as', 'much', 'as', '60', 'percent', 'of', 'the', 'population', '(', '90', '%', 'in', 'some', 'places', ')', 'dead', 'in', 'the', 'five', 'years', 'after', '1347—an', 'anonymous', 'Alsatian', 'engraver', 'with', 'the', 'fantastic', 'appellation', 'of', '“', 'Master', 'of', 'the', 'Playing']
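Notice that word_tokenize is smarter than a plain whitespace split: punctuation marks like '(' and '%' become their own tokens. A crude regex approximation (not NLTK's actual Treebank tokenizer, just a sketch of the difference) makes the contrast visible:

```python
import re

sentence = "Less than a century after the Black Death (1347), plague killed 75 million."

# A whitespace split keeps punctuation glued to the neighboring words.
print(sentence.split())

# A naive tokenizer: runs of word characters, or any single punctuation symbol.
tokens = re.findall(r"\w+|[^\w\s]", sentence)
print(tokens)
```

The first list contains tokens like '(1347),' while the second separates '(' , '1347', ')', and ',' -- closer to what NLTK produced above.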
Next, guess each token's part of speech, using NLTK's "off-the-shelf" English tagger. This returns a list of 2-tuples (token, tag), with tags from the Penn Treebank tagset.
a.pos_tags = nltk.pos_tag(a.tokens)
a.pos_tags[:15]
[('Less', 'JJR'), ('than', 'IN'), ('a', 'DT'), ('century', 'NN'), ('after', 'IN'), ('the', 'DT'), ('Black', 'NNP'), ('Death', 'NNP'), ('descended', 'VBD'), ('into', 'IN'), ('Europe', 'NNP'), ('and', 'CC'), ('killed', 'VBD'), ('75', 'CD'), ('million', 'CD')]
The Treebank tagset isn't particularly intuitive, especially if your last contact with English grammar was in middle school. Here's the help text for a few of the less-obvious tags above.
for pos in ['NNS', 'NNP', 'IN', 'DT', 'JJ']:
    nltk.help.upenn_tagset(pos)
NNS: noun, common, plural
    undergraduates scotches bric-a-brac products bodyguards facets coasts
    divestitures storehouses designs clubs fragrances averages
    subjectivists apprehensions muses factory-jobs ...
NNP: noun, proper, singular
    Motown Venneboerger Czestochwa Ranzer Conchita Trumplane Christos
    Oceanside Escobar Kreisler Sawyer Cougar Yvette Ervin ODI Darryl CTCA
    Shannon A.K.C. Meltex Liverpool ...
IN: preposition or conjunction, subordinating
    astride among uppon whether out inside pro despite on by throughout
    below within for towards near behind atop around if like until below
    next into if beside ...
DT: determiner
    all an another any both del each either every half la many much nary
    neither no some such that the them these this those
JJ: adjective or numeral, ordinal
    third ill-mannered pre-war regrettable oiled calamitous first separable
    ectoplasmic battery-powered participatory fourth still-to-be-named
    multilingual multi-disciplinary ...
I'll also tag the article using the "Universal" tagset--it has fewer tags, which makes for a simpler visualization later on.
a.upos_tags = nltk.pos_tag(a.tokens, tagset="universal")
a.upos_tags[:15]
[('Less', 'ADJ'), ('than', 'ADP'), ('a', 'DET'), ('century', 'NOUN'), ('after', 'ADP'), ('the', 'DET'), ('Black', 'NOUN'), ('Death', 'NOUN'), ('descended', 'VERB'), ('into', 'ADP'), ('Europe', 'NOUN'), ('and', 'CONJ'), ('killed', 'VERB'), ('75', 'NUM'), ('million', 'NUM')]
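The universal tagset collapses the many Treebank tags into about a dozen coarse categories. Pairing the two outputs above tag by tag gives a sense of the mapping; here's a hand-built excerpt reconstructed from those lists (not NLTK's full mapping table):

```python
# Penn Treebank -> Universal, reconstructed from the two tag lists above.
ptb_to_universal = {
    "JJR": "ADJ",   # comparative adjective ("Less")
    "IN":  "ADP",   # preposition ("than", "after", "into")
    "DT":  "DET",   # determiner ("a", "the")
    "NN":  "NOUN",  # common noun ("century")
    "NNP": "NOUN",  # proper noun ("Europe") -- universal drops the distinction
    "VBD": "VERB",  # past-tense verb ("descended", "killed")
    "CC":  "CONJ",  # coordinating conjunction ("and")
    "CD":  "NUM",   # cardinal number ("75", "million")
}

penn = [("Less", "JJR"), ("than", "IN"), ("a", "DT"), ("century", "NN")]
print([(tok, ptb_to_universal[tag]) for tok, tag in penn])
```

Note how several Treebank tags (NN, NNP) fold into a single universal NOUN -- that's exactly why the universal graph later on is so much less crowded.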
We can also have NLTK calculate a frequency distribution of the words in our article. Here I'll use it to show the 10 most common tokens, most of which you probably could have guessed:
a.word_freqs = nltk.FreqDist(word.lower() for word in a.tokens)
a.word_freqs.most_common(10)
[('the', 310), (',', 262), ('of', 209), ('and', 112), ('.', 112), ('a', 88), ('to', 85), ('that', 80), ('’', 70), ('in', 61)]
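FreqDist is essentially a collections.Counter with extra reporting and plotting conveniences, so the same lowercased tally can be reproduced with nothing but the standard library (the toy token list here is mine, standing in for a.tokens):

```python
from collections import Counter

tokens = ["The", "plague", "the", "PLAGUE", "spread", ",", "and", "the",
          "plague", "spread", "."]

# Lowercase before counting, as with a.word_freqs above.
freqs = Counter(tok.lower() for tok in tokens)
print(freqs.most_common(3))
```

Counter's most_common is the same method FreqDist exposes, which is why punctuation like ',' and '.' shows up in the counts above: tokenization treats them as tokens like any other.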
plt.figure(figsize=(12, 8))
a.word_freqs.plot(25)
plt.show()
Line graphs usually make me think "time series". This should probably be a bar plot, and we can do that. Start by translating our FreqDist object's data to a pandas DataFrame:
wfdf = pd.DataFrame(a.word_freqs.items(), columns=['token', 'frequency'])
wfdf.head()
mycolor = "#3E000C"  # "#9A1C42"
plt.figure(figsize=(12, 8))
sns.set(style="whitegrid")
sns.barplot('frequency', 'token',
            data=wfdf.sort_values(by='frequency', ascending=False)[:25],
            color=mycolor)
plt.show()
We can make the result (arguably) more interesting by removing stopwords--very common words that contribute little to the meaning of the text--from the frequency list. Here we get NLTK's stopword list for English.
sw = nltk.corpus.stopwords.words('english')
print(sw)
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn', "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't"]
And next, remove stopwords (and punctuation) from our word frequency list, and create a new DataFrame and a new graph.
# '\W' matches one "non-word character", i.e., not a letter, number, or underscore.
wf2 = [(word, freq) for word, freq in a.word_freqs.items()
       if word not in sw and not re.match(r'\W', word)]
wf2df = pd.DataFrame(wf2, columns=['token', 'frequency'])
plt.figure(figsize=(12, 8))
sns.barplot('frequency', 'token',
            data=wf2df.sort_values(by='frequency', ascending=False)[:25],
            color=mycolor)
plt.show()
This tells us more about the article. But what about the part-of-speech tags I generated earlier? Here's a function that will take those lists and generate a graph from them:
def posFreqGraph(tags):
    # Count the tags themselves, discarding the tokens.
    posfreqs = nltk.FreqDist(tag for word, tag in tags)
    posfdf = pd.DataFrame(posfreqs.items(), columns=['pos', 'frequency'])
    plt.figure(figsize=(12, 8))
    sns.barplot('frequency', 'pos',
                data=posfdf.sort_values(by='frequency', ascending=False),
                color=mycolor)
    plt.show()
First, the graph of Penn Treebank tags. It's crowded--which shows us the richness of this tagset--but still readable. (Here's a complete list of these tags with meanings.)

posFreqGraph(a.pos_tags)
Here's the same visual built from the universal tagset data.

posFreqGraph(a.upos_tags)
I think that's a good start. Newspaper makes it easy to load web pages and get just the important text, which we can feed into NLTK for some analysis. Here I did only the very basics of an analysis with NLTK; I plan to experiment more over the next few weeks.