Friday Links

We might be seeing a significant political change, with the left reclaiming freedom and anti-statism from the right.

I'm prompted to say this by the Black Lives Matter slogan, "defund the police" which invites us to see the state as an oppressor.

Many of you might think the slogan "defund the police" goes too far. No matter: we don't know what's right unless we know what's too much. And what is right - as Elinor Ostrom showed - is that the police should be small and locally accountable.

Is this improvement in the circumstances of the lower ranks of the people to be regarded as an advantage, or as an inconveniency, to the society? The answer seems at first abundantly plain. Servants, labourers, and workmen of different kinds, make up the far greater part of every great political society. But what improves the circumstances of the greater part, can never be regarded as any inconveniency to the whole. No society can surely be flourishing and happy, of which the far greater part of the members are poor & miserable.

It is but equity, besides, that they who feed, clothe, & lodge the whole body of the people, should have such a share of the produce of their own labour as to be themselves tolerably well fed, clothed, and lodged...

Friday Links

Libraries are still just about the only place in America anyone can go and sit and use a computer and the internet without buying anything. All over the country, library closures during the pandemic have highlighted just how many people have no dependable source of internet on their own.  

According to a Pew survey published last year, less than two-thirds of Americans in rural areas have a broadband internet connection at home. Among Americans with household incomes below $30,000, four out of 10 don’t have a computer, and three out of 10 don’t have a smartphone.

As I’ve written in the recent past, I believe that the current political uprising has a chance at being an enormously positive development. I worry though that it will be limited by the power of political Calvinism.

I know I’ve heard the term used before, though I can’t remember from who. By political Calvinism I mean the tendency within the left to see the structural injustices of the world as inherent and immutable, so baked into the cake of the current context, history, the United States of America, etc., that they will always exist. The stain of injustice can never be rubbed out. This is most obvious when discussing racial dynamics. White people are inherently in possession of white privilege, as many will tell you – most insistently white liberals, in my experience. Well, yes, today all white people enjoy white privilege, though the valence of that advantage varies with other factors in their lives. But the degree and intensity and in fact existence of white privilege is mutable; if we had a real racial awakening and all people worked to end white privilege, it would end. And not only do I not think this is a crazy thing to believe, I think believing it is a necessary precondition to being an agent of positive change!

Eternally unprofitable and burning cash because the name of the game is now consolidation or bust, Uber has made an offer to buy rival food delivery company Postmates as the mobility war intensifies.

As it stands, ride-hailing alone isn’t going to cut it, and investors still seem not to be shying away from a company that has itself admitted it may never be profitable.

During the four-way race for President in 1860, the six-year-old Republican Party found support from a new generation of voters who helped push Abraham Lincoln to victory. They were known as the “Wide Awakes,” because of their youth, enthusiasm, and torch-lit nighttime marches. “Now the old men are folding their arms and going to sleep,” said William H. Seward while campaigning for Lincoln, “and the young men are Wide Awake.”

The only polling in 1860 was the actual voting. Turnout was 81.2% of those eligible to vote (a cohort that consisted only of white men). That’s the second highest turnout in our history, after the election of 1876. The highest turnout of the twenty-first century so far has been 58.2%, in 2008.

And, at least until recently, SROs were hugely popular. In a 2018 poll, Starr writes, 80 percent of parents favored posting armed police officers at their child’s school. That made superintendents like him cautious about airing “even the mildest criticism” of SROs, he writes. “Privately, though, many district leaders will tell you that if they had a choice, they’d rather not have armed officers in the schools at all.”

Friday Links

Cut to a couple of days ago, when I come across this article in Nature – the first deep dive attempting to answer the question of just how protective those coronavirus antibodies are.

And, at first blush at least, the news isn’t great.

On July 28, 1932, President Herbert Hoover dispatched federal troops and tanks to disperse the “Bonus Army,” tens of thousands of jobless World War I veterans and their families who’d been protesting in the nation’s capital. The troops used tear gas. Two men and two infants were reported dead. As one of the first major protests in which the American government used tear gas—which is considered a weapon of war—on its own citizens, the Bonus Army incident created public outrage, ruining any chance of Hoover’s reelection. For the chemical companies trying to sell tear gas to law enforcement, however, the Bonus Army was a successful demonstration of their product.

It’s one of the more important things in game theory that a signal has to be a costly signal.... A reputation in deterrence theory is something that is worth having, but not worth getting.

ZEIT: The former Swedish state epidemiologist Johan Giesecke has said: “The difference between Sweden and Germany is that Germany is ruining its economy.”

Angner: Johan Giesecke is in the media so much because he is very self-confident and simply blurts things out. A clear case of overconfidence.

Before looking at the impact on New York City and other dense urban centers in the US, the fact that work from home has worked at all calls into question many heretofore unquestioned norms of office life, like the true utility of having workers on premises. How much of formal and informal information-sharing is actually productive, as opposed to gossip and political jockeying? For instance, the pre-Carly-Fiorina Hewlett Packard was cognizant of the potential for meetings to be time-wasters, and required that they be held with all participants standing up.

Caveats: data is really limited.  Studies are sort of trickling out in pre-print form and in various esoteric journals. But I’ll point out a couple that hold water for me. The first, a pre-print out of China, looked at just over 2000 COVID-positive individuals and reported that there was a higher infection rate in people with Type A blood.

Friday Links

Transmission of viruses was lower with physical distancing of 1 m or more, compared with a distance of less than 1 m (n=10 736, pooled adjusted odds ratio [aOR] 0·18, 95% CI 0·09 to 0·38; risk difference [RD] −10·2%, 95% CI −11·5 to −7·5; moderate certainty); protection was increased as distance was lengthened (change in relative risk [RR] 2·02 per m; pinteraction=0·041; moderate certainty). Face mask use could result in a large reduction in risk of infection (n=2647; aOR 0·15, 95% CI 0·07 to 0·34, RD −14·3%, −15·9 to −10·7; low certainty), with stronger associations with N95 or similar respirators compared with disposable surgical masks or similar (eg, reusable 12–16-layer cotton masks; pinteraction=0·090; posterior probability >95%, low certainty).

There’s a joke: What do you call 1000 good cops and ten bad cops? 1010 bad cops....

It seems paradoxical at first. While people do escape to cheery landscapes, like the farm of Nintendo’s Animal Crossing, they also spend their money on games engineered to inspire terror, fear, and anxiety. Doom Eternal, Nioh, and Resident Evil all saw high download numbers in the last few months.

What’s the appeal? In the journal Preternature, authors Robert M. Geraci, Nat Recine, and Samantha Fox make a compelling case that video games like these have a meaningful psychological role, especially today. “Faced with physical and psychological dangers, human beings imagine them as monsters and seek to master them,” they explain.

“The horrific experience of videogames, and hence their cathartic appeal, emerges when a game produces a constant level of anxiety in players while allowing the players to act on it,” the authors explain.

Gloves, masks, and other personal protective equipment (PPE) are key to keeping us safe, especially as we begin to ease the lockdown rules. Yet environmental watchdogs worry that all that PPE will flow into the ocean. “If they’re thrown on the streets, when it rains the gloves and masks will eventually end up in the sea,” biologist Anastasia Miliou at the Archipelagos Institute of Marine Conservation in Greece told Deutsche Welle.

Heroism by the many or the repeated heroism by occupant after occupant of a given role indicates the failure of the surrounding system. Adequate supplies, for example, or prompt pre-emptive action would have changed the effort required of many health care providers from heroic to merely demanding. In that sense, the accolades, deeply deserved as they are, can serve to divert our attention from the less glamorous, indeed the mundane work of repairing the systems so that heroes need not show up en masse to hold together a wheezing and crippled health care system.

Vavilov hypothesized that farmers never intentionally tried to domesticate the rye plant. Ancient weeding methods were based on visual cues—if something looked like a weed, it was plucked. Farmers spent generations unintentionally selecting for rye plants that looked like useful wheat, not weeds. Eventually, rye mimicked wheat so successfully, the two became almost indistinguishable.

The pants allegedly disappeared in 2005. Whenever the business offered to settle, Pearson moved the goalposts, saying he remained unsatisfied. His demands continued to escalate, and he also offered a ridiculous theory of the damages he sought under the D.C. consumer-protection statute. By April 2007, he was demanding over $65 million to settle a claim for a pair of missing pants. The case went to trial, and he lost. He appealed and lost. He sought en banc review and lost. That was, at least, the end of the litigation against the dry cleaners. But Pearson wasn’t done. After losing his ALJ job—some believed the litigation showed poor judgment, you see—he sued for wrongful termination, still insisting he had been in the right. He lost. He appealed, and lost.

Turns out that if you persist in making the same frivolous arguments for a sufficient number of years, the bar association may take notice. In 2015, the D.C. Office of Disciplinary Counsel filed ethics charges against Pearson, which, of course, he furiously contested on the grounds that he had been right all along. He lost. And, of course, he appealed....

Friday Links

US cities vary widely in the number of cops they have relative to their population, as the graph below shows (drawn from data assembled by Governing magazine). Among big cities, DC, Chicago, New York, Baltimore, and Philadelphia top the list, with over 40 officers per 10,000 people. These are well above the national average of just under 28 per 10,000. Cities toward the bottom of the list have 20 or fewer.

But one question has been sitting in the back of my mind since the talk of curve-flattening started.  Does everyone get the coronavirus eventually? It’s an important question. This is a novel virus for which none of us are likely to have any existing immunity. We are ripe for infection and the rate of spread (without social distancing) is rapid. The presence of asymptomatic spread makes the situation even worse.

For COVID-19, we probably have to have 65-70% of the population immune before the thing dies out. I’ll just point out we are nowhere close to that. Even in New York City, the American epicenter of the disease, seroprevalence studies suggest only about 25% of the population is immune.

“Our whole reason for lobbying for looser gun laws and amassing huge personal arsenals of weapons these past years was so that we could ensure the security of a free state and protect the people from an oppressive government. And then it actually happened, and the whole rising up against a tyrannical government thing just totally slipped our minds, which is a little embarrassing,” a sheepish NRA CEO Wayne LaPierre said.

When property damage and theft happens as a side-effect of real mass protest, authorities in a democracy cannot baton, tear gas, or shoot their way to legitimacy. People want social order, but this isn’t like quelling a riot after a sports game. The key issue—as the Governor of Minnesota put it the other day—is that “there are more of them than us”. All the tactical gear in the world isn’t worth a damn, ultimately, if enough of the population ends up in open revolt against civil authority. There are just too many people.

But if the bulk of a city’s population really is directly engaged in mass protest or indirectly supportive of it, and these protests are met with force by the authorities, then violent disorder will start to look less like pockets of disruption disapproved of by all and more like the loss of legitimacy.

In the absence of mass mobilization for protest, imposing “Law and Order” by force is usually a politically successful tactic, at least in the short-run. The demand for order is the most basic demand of political life. But attempting to impose order by force when people are protesting in the streets en masse is much riskier, both for the leader wanting to “dominate” and for political institutions generally.

Friday Links

With a background in indexing, I like to compare the index of a book with the taxonomy-enhanced search capabilities of a website, whereas the table of contents of a book is like the navigation scheme. A table of contents or navigation scheme is a higher-level, pre-defined structure of content that guides users to the general organization of content and tasks. It helps users understand the scope of the content available, provides guidance on where and what content to find, and aids in exploration. An index or search feature, including faceted search, on the other hand, enables the user to find specific information or content items of interest. A taxonomy, regardless of its display type, serves the function of an index, not the table of contents.

The people we see are, by definition, those who are outdoors and thus who are disproportionately likely to be breaching the lockdown. What Mr Hannan isn’t seeing are the countless thousands of us staying indoors and observing the lockdown.

The social sciences are, as Jon Elster said, fundamentally a collection of mechanisms. But many of these are unseen. It is the role of social science to expose these mechanisms, and to show us that what we see is not all there is. As Marx said: "If there were no difference between essence and appearance, there would be no need for science."

On June 20, 1917, Lucy Burns, co-founder of the National Woman’s Party (NWP), and Dora Lewis gathered with other suffragists in front of the White House. They held a banner criticizing President Wilson’s opposition to women’s suffrage: “We, the Women of America, tell you that America is not a democracy…. President Wilson is the chief opponent of their national enfranchisement.”

Friday Links

With several universities now coming to grips with the fact that they will still be online in the Summer (and most likely the Fall), many are turning to the question of how to quickly train their entire faculty in online teaching.

Out of all the different ways to approach learning theory, I like to focus on power dynamics first when it comes to designing a course. So think about the overall power dynamic you want to see happening in your course. This can change from week to week, but most courses stick to one for the most part. The question is: who determines what learners will learn in your course, and who directs how it is learned?

This is such a strange and necessary time to talk about education technology, to take a class about education technology, to get a degree in education technology because what, in the past, was so often framed as optional or aspirational is now compulsory — and compulsory under some of the worst possible circumstances.

One of the reasons that I am less than sanguine about most education technology is because I don't consider it this autonomous, context-free entity. Ed-tech is not a tool that exists only in the service of improving teaching and learning, although that's very much how it gets talked about. There's much more to think about than the pedagogy too, than whether ed-tech makes that better or worse or about the same just more expensive. Pedagogy doesn't occur in a vacuum. It has an institutional history; pedagogies have politics. Tools have politics. They have histories. They're developed and funded and adopted and rejected for a variety of reasons other than "what works." Even the notion of "what works" should prompt us to ask all sorts of questions about "for whom," "in what way," and "why."

Surveillance is not prevalent simply because that's the technology that's being sold to schools. Rather, in many ways, surveillance reflects the values we have prioritized: control, compulsion, efficiency.

Independent learning is a skill, and like most skills, you need to start slowly and carefully. Suddenly being thrown into ten courses online is not the best way to go. Many will sink, although some will certainly swim. However, experience tells us that graduate, older and lifelong learners all do much better in online learning than undergraduates. Blended learning – a mix of face-to-face and online – is, though, a very good way to ease gently into online learning. Introducing online or digital learning gradually in first year, supported by face-to-face classes, is a much better strategy.

As the author of a book on opportunity cost, I might be expected to be enthusiastic about the idea that trade-offs are always important in economic and policy choices. This idea is summed up in the acronymic slogan TANSTAAFL (There Ain’t No Such Thing As A Free Lunch). In fact, however, a crucial section of Economics in Two Lessons is devoted to showing that There Is Such A Thing As A Free Lunch. It is only when all free lunches have been taken off the table that we reach a position described, in the standard jargon, as Pareto-optimal.

To me, this is an example (and there are many right now) of the extent to which the fairness of the legal system may turn less on the words we use in a law than on the discretion of those who have the power to enforce it.

This evolved into an entire subcultural practice, called Grangerization. Hobbyists used printed books as the basis for a multidimensional media project. They pasted prints, as well as pages of text from other books, into the original volume, making connections between related topics.

In some cases, the resulting work smacked of obsessive fandom. One collector expanded a copy of an 1828 biography of Lord Byron from two volumes to five, rebinding the pages to accommodate 184 illustrations and 14 letters and autographs. Another turned a three-volume 1872 biography of Charles Dickens into nine oversized books packed with broadsides for performances, actor portraits, letters, and images taken from illustrated editions of the author’s books.

Grangerization reached its height of popularity in the first half of the nineteenth century. But not everyone saw it as an innovative, creative hobby. The idea of removing pages from one book to create something new infuriated some critics. One called Grangerization a “monstrous practice” of “hungry and rapacious book-collectors.” Another diagnosed its practitioners with “a vehement passion, a furious perturbation to be closely observed and radically treated wherever it appears, for it is a contagious and delirious mania.”

One advantage of today’s digital media is that we can freely copy material without tearing up precious original work. Of course, today’s Grangerizers have their own ethical questions, like plagiarism, to consider.

"The variation being meant as an evident one, accordingly as presenting in pure intuition the possibilities themselves as possibilities, its correlate is an intuitive and apodictic consciousness of something universal. The eidos itself is a beheld or beholdable universal, one that is pure, 'unconditioned,' that is to say according to its own intuition sense, a universal not conditioned by any fact."

A little-known Democratic senator from Missouri rides the public anger, consequently emerging as a national leader. “Their greed knows no limit,” said Harry Truman in February 1942 in talking about military contractors accused of gouging the government at such a critical time.

The public agreed. A Gallup Poll noted that 69 percent of Americans wanted the government to exert controls on the profits earned by contractors during the war.

Private sector partnership in the face of community need is nothing new, and has long been integral in national response and rebuilding. Take, for example, the case of the Waffle House Index.

Waffle Houses are what they sound like: homey diners that dominate the southern part of the United States, serving up staple favorites like pies and iced tea. With that in mind, the index sounds like a whimsical measurement, but it actually refers to a serious, though informal, measurement of a crisis’s severity. The Federal Emergency Management Agency (FEMA) uses the restaurant chain to gauge how badly an area is affected. As a former FEMA official told NPR, “If the Waffle House is open, everything’s good.”

Renewable Resources

In many economies based on real renewable resources, the very small surviving population retains the potential to build its numbers back up again, once the capital driving the harvest is gone. The whole pattern is repeated, decades later. Very long-term renewable-resource cycles like these have been observed, for example, in the logging industry in New England, now in its third cycle of growth, overcutting, collapse, and eventual regeneration of the resource. But this is not true for all resource populations. More and more, increases in technology and harvest efficiency have the ability to drive resource populations to extinction.

Feedback Systems

One of the central insights of systems theory, as central as the observation that systems largely cause their own behavior, is that systems with similar feedback structures produce similar dynamic behaviors, even if the outward appearance of these systems is completely dissimilar.

Feedback

The information delivered by a feedback loop can only affect future behavior; it can't deliver the information, and so can't have an impact, fast enough to correct behavior that drove the current feedback....

Why is that important? Because it means there will always be delays in responding.

Another Pickwick Discard

This visualization was the first draft of the last chart in this post. I added extra spacing between the bars to provide space for annotations; eventually I decided this was unnecessary.

Pickwick Graph: New Vocabulary by Chapter
#Assumes chapterdata, newvocab, and mycolor from the main Pickwick post are still in memory,
#along with the numpy and matplotlib imports.
import itertools

#y-positions for the next graph. Add empty spaces (label='' and width=0) under each bar.
ypos = np.arange(chapterdata.shape[0]*2 - 1, 0, -1)
widths = list(itertools.chain.from_iterable((x, 0) for x in chapterdata['new_vocabulary']))[:-1]
labels = list(itertools.chain.from_iterable((str(x), '') for x in chapterdata.index))[:-1]

#Don't need extra room after the first two chapters.
ypos = np.delete(ypos, (0,1))
del(widths[1])
del(widths[2])
del(labels[1])
del(labels[2])
MOST_COMMON_COUNT = 5

fig, ax = plt.subplots(figsize=(14, 40))
#plt.barh('new_vocabulary', chapterdata.index, data=chapterdata,orient='h', color=mycolor)

ax.barh(ypos, widths, .8, tick_label=labels, color=mycolor)
plt.ylim([0, ypos.max()+1])
plt.grid(axis='x')
xax = ax.get_xaxis()
xax.set_label_position('top')
plt.xlabel("Count of New Vocabulary")
ax.xaxis.tick_top()

for chapnum in range(1, 58):
    chaplbl = str(chapnum)
    #print(str(widths[labels.index(chaplbl)]))
    plt.text(10, ypos[labels.index(chaplbl)], str(widths[labels.index(chaplbl)]),\
             weight='bold', color='white', verticalalignment='center')
    if chapnum >= 3:
        newwords = ', '.join(['"%s" (%d)'%word for word in newvocab[chapnum].most_common(MOST_COMMON_COUNT)]) 
        #print(newwords)
        plt.text(10, ypos[labels.index(chaplbl)]-1.1, 'Major new vocabulary:', weight='bold',\
                 bbox=dict(facecolor='#ffffff99'))
        plt.text(500, ypos[labels.index(chaplbl)]-1.1, newwords,\
                 bbox=dict(facecolor='#ffffff99'))

plt.show()

Pickwick Discard Graph

Pickwick Graph: Count of Unique Words by Chapter

The Pickwick Papers: Count of unique words by chapter.

Here's one of the also-ran graphs I mentioned in my last post about Pickwick. It was part of my exploration of the data and didn't seem interesting enough to include there.

It shows the count of unique vocabulary per chapter, which isn't all that interesting without any context. It might work better as a stacked bar with unique vocabulary and total word count for each chapter; a rough sketch of that idea follows the code below.

#Assumes chaptervocab and the pandas/matplotlib/seaborn imports from the main Pickwick post.
#First make a DataFrame with chapter #s and word counts.
clengths = pd.DataFrame((
    (x, len(chaptervocab[x-1])) for x in range(1,58)), 
    columns=['Chapter', 'Count of Unique Words'])
#Bar color for the rest of the plots.
mycolor="#3E000C"
#Set the size of the plot.
plt.figure(figsize=(12, 20))
#plt.xlim(0,10000)
#Choose Seaborn display settings. 
sns.set(style="whitegrid")
#Make a horizontal (orient='h') barplot.
sns.barplot('Count of Unique Words', 'Chapter', 
            data=clengths, orient='h', color=mycolor)

plt.show()
#I don't need this any more.
del(clengths)
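
As for that stacked-bar idea mentioned above: here's a minimal sketch, assuming the chapterdata DataFrame from the main Pickwick post (with its word_count and unique_words columns) is still in memory. It overlays the unique-word bar on the total-word bar rather than stacking them, so the lighter portion reads as repeated words.

#Overlay unique words on total word count per chapter (assumes chapterdata from the main post).
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(12, 20))
ax.barh(chapterdata.index, chapterdata['word_count'], color='#cccccc', label='Total words')
ax.barh(chapterdata.index, chapterdata['unique_words'], color='#3E000C', label='Unique words')
ax.invert_yaxis()  #Chapter 1 at the top, matching the other charts.
ax.set_xlabel('Words per Chapter')
ax.set_ylabel('Chapter')
ax.legend()
plt.show()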

More Natural Language Processing

This week's NLP experimentation involves Project Gutenberg's plain-text edition of The Pickwick Papers. I parsed the text into individual chapters and calculated some summary statistics about each, then built a visualization of each chapter's new vocabulary.

Setup

A few imports to start with. Besides OrderedDict and re from the standard library, these are just NLTK, Seaborn, and the rest of the standard Python data science toolkit.

In [1]:
from collections import OrderedDict
import re

import nltk
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from requests import get
import seaborn as sns

%matplotlib inline

First I grab Project Gutenberg's plain-text edition of The Pickwick Papers, then strip out (using split() and list indexing) Gutenberg's front- and backmatter and the table of contents to get just the book's text.

Also, Gutenberg's plain-text version uses underscores to indicate italics. I don't need those so I remove them here.

In [2]:
rq = get('https://www.gutenberg.org/files/580/580-0.txt')

#Not ascii, which requests assumes
rq.encoding='utf8'

pickwick = rq.text
pickwick = pickwick.split('THE POSTHUMOUS PAPERS OF THE PICKWICK CLUB')[2]
pickwick = pickwick.split('End of the Project Gutenberg EBook of The Pickwick Papers')[0]
pickwick = pickwick.replace('_', '')

First, I break the text into paragraphs, then use enumerate() and filter() to get indexes of the chapter headings (conveniently, they're the only "paragraphs" that start with the string "CHAPTER "). This gives me a list like this:

[(2, 'CHAPTER I. THE PICKWICKIANS'), 
(26, 'CHAPTER II. THE FIRST DAY’S JOURNEY, AND THE FIRST EVENING’S ADVENTURES;\r\nWITH THEIR CONSEQUENCES'),
...]

I then cycle through this list to locate the text of each chapter. The chapter texts are stored in an OrderedDict, with the chapter titles used as keys.

Then I delete my _chapterheads list, since I no longer need it and want to limit the number of copies of this book I have stored in memory.

In [3]:
paragraphs = pickwick.split('\r\n\r\n')

_chapterheads = list(filter(lambda x: x[1].startswith('CHAPTER '), enumerate(paragraphs)))

chapters = OrderedDict()
for i in range(len(_chapterheads)-1):
    ch = _chapterheads[i]
    nextch = _chapterheads[i+1]
    chapters[ch[1]] = list(filter(lambda x: x.strip() != '', paragraphs[ch[0]+1:nextch[0]]))
    
#The loop above stops one heading early; handle the final chapter, which runs to the end of the book.
ch = _chapterheads[i+1]
chapters[ch[1]] = list(filter(lambda x: x.strip() != '', paragraphs[ch[0]+1:]))

del(_chapterheads)

Next, I need lists of tokens:

  • First, generate lists of entities, including both words and punctuation marks, one for each chapter.
  • Then, a copy of that first list, but lowercased and with punctuation removed.
  • Finally a third list with each chapter separated into sentences.
In [4]:
_chaptertexts = ['\n'.join(x) for x in chapters.values()]
_booktext = '\n\n'.join(_chaptertexts)

chaptertokens = [nltk.word_tokenize(chap) for chap in _chaptertexts]
chapterwords = [[w.lower() for w in chap if not re.match('\W', w)] for chap in chaptertokens]
chaptersentences = [nltk.sent_tokenize(chap) for chap in _chaptertexts]

del(_chaptertexts)
del(_booktext)

Next, I copy the lists from chapterwords into sets, giving me a list of the unique words in each chapter--I'll use that soon. Then I have a copy of this with stopwords removed, and a frequency distribution of the words in each chapter.

In [5]:
sw = nltk.corpus.stopwords.words('english')

chaptervocabsw = [set(chap) for chap in chapterwords]
chaptervocab = [set([word for word in chap if word not in sw]) for chap in chaptervocabsw]

#Remember we're not counting punctuation here.
chapterfreq = [nltk.FreqDist(chap) for chap in chapterwords]

Next, I work out which words are new to each chapter: for each chapter, take the set difference of its vocabulary against all earlier chapters' vocabularies, and keep the frequency counts for just those new words.

In [6]:
newvocab = []
for c in range(57):
    #Words in this chapter that appear in no earlier chapter.
    newvoc = chaptervocabsw[c].difference(*chaptervocabsw[:c])
    #Keep those new words, with their in-chapter frequency counts.
    newvocab.append(nltk.FreqDist(dict([x for x in chapterfreq[c].items() if x[0] in newvoc])))

#Make newvocab 1-indexed.
newvocab = dict(enumerate(newvocab, start=1))

Now I use Pandas to build a DataFrame of potentially-interesting statistics. This is done with a complex-looking list comprehension that generates an 8-tuple describing each chapter. The DataFrame constructor interprets this as a table.

In [7]:
chapterdata = pd.DataFrame([
                (
                    len(chapterwords[x-1]),
                    np.mean([len(tok) for tok in chaptertokens[x-1] if not re.match('\W', tok)]),
                    len(chaptersentences[x-1]),
                    len(chapterwords[x-1]) / len(chaptersentences[x-1]),
                    len(chaptervocabsw[x-1]),
                    len(chapterwords[x-1])/len(chaptervocabsw[x-1]),
                    len(newvocab[x]),
                    len(newvocab[x])/len(chaptervocabsw[x-1])*100,
                )
            for x in range(1,58)], 
            index=pd.Index(range(1,58), name='chapter'),
            columns=['word_count', 'avg_word_length', 'sentence_count', 'avg_sentence_length',\
                     'unique_words', 'lexical_diversity', 'new_vocabulary', 'pct_new_vocab'])

chapterdata.to_csv('~/data/nlp/pickwick_details.csv')
chapterdata
Out[7]:
word_count avg_word_length sentence_count avg_sentence_length unique_words lexical_diversity new_vocabulary pct_new_vocab
chapter
1 1774 4.929538 79 22.455696 705 2.516312 705 100.000000
2 9888 4.644114 391 25.289003 2441 4.050799 2087 85.497747
3 4650 4.440215 167 27.844311 1421 3.272343 682 47.994370
4 4657 4.518145 165 28.224242 1402 3.321683 555 39.586305
5 3719 4.536166 139 26.755396 1227 3.030970 458 37.326813
6 5969 4.388005 211 28.289100 1639 3.641855 616 37.583893
7 5322 4.558812 210 25.342857 1632 3.261029 536 32.843137
8 4678 4.478623 213 21.962441 1407 3.324805 401 28.500355
9 3305 4.384266 155 21.322581 1059 3.120869 271 25.590179
10 5407 4.240244 215 25.148837 1516 3.566623 441 29.089710
11 7350 4.334830 335 21.940299 1953 3.763441 501 25.652842
12 2205 4.503855 87 25.344828 786 2.805344 163 20.737913
13 7048 4.539728 232 30.379310 1841 3.828354 492 26.724606
14 6893 4.233425 256 26.925781 1652 4.172518 363 21.973366
15 5123 4.523131 203 25.236453 1487 3.445192 348 23.402824
16 7265 4.276944 299 24.297659 1791 4.056393 404 22.557231
17 3551 4.391157 83 42.783133 1038 3.421002 179 17.244701
18 3857 4.361162 162 23.808642 1159 3.327869 183 15.789474
19 5325 4.328263 223 23.878924 1471 3.619986 265 18.014956
20 6411 4.262674 222 28.878378 1591 4.029541 305 19.170333
21 7341 4.301594 259 28.343629 1890 3.884127 324 17.142857
22 6213 4.353774 252 24.654762 1595 3.895298 255 15.987461
23 3301 4.182672 129 25.589147 1022 3.229941 138 13.502935
24 5787 4.592189 213 27.169014 1564 3.700128 258 16.496164
25 7100 4.404930 303 23.432343 1707 4.159344 259 15.172818
26 2460 4.357317 93 26.451613 787 3.125794 83 10.546379
27 3754 4.329249 134 28.014925 1162 3.230637 170 14.629948
28 8926 4.336993 247 36.137652 2175 4.103908 370 17.011494
29 4167 4.404848 121 34.438017 1241 3.357776 162 13.053989
30 4309 4.444419 175 24.622857 1283 3.358535 156 12.159002
31 6131 4.377100 216 28.384259 1643 3.731589 239 14.546561
32 5501 4.392111 193 28.502591 1514 3.633421 189 12.483487
33 6364 4.395663 201 31.661692 1817 3.502477 326 17.941662
34 9501 4.498579 300 31.670000 2005 4.738653 290 14.463840
35 5980 4.515050 277 21.588448 1704 3.509390 255 14.964789
36 4599 4.404218 170 27.052941 1419 3.241015 190 13.389711
37 5099 4.269857 186 27.413978 1358 3.754786 168 12.371134
38 5395 4.413160 223 24.192825 1562 3.453905 197 12.612036
39 6010 4.363894 219 27.442922 1555 3.864952 174 11.189711
40 5046 4.373365 189 26.698413 1396 3.614613 169 12.106017
41 5237 4.344090 172 30.447674 1519 3.447663 182 11.981567
42 5609 4.433945 228 24.600877 1628 3.445332 177 10.872236
43 5086 4.311050 217 23.437788 1478 3.441137 203 13.734777
44 5415 4.163250 212 25.542453 1441 3.757807 141 9.784872
45 6474 4.343682 234 27.666667 1785 3.626891 226 12.661064
46 3810 4.359580 171 22.280702 1068 3.567416 92 8.614232
47 4644 4.394488 167 27.808383 1333 3.483871 100 7.501875
48 5029 4.320143 175 28.737143 1376 3.654797 108 7.848837
49 7360 4.235462 260 28.307692 1725 4.266667 208 12.057971
50 5757 4.496265 194 29.675258 1585 3.632177 149 9.400631
51 5530 4.520615 192 28.802083 1681 3.289709 207 12.314099
52 4648 4.225904 147 31.619048 1359 3.420162 143 10.522443
53 4773 4.432642 192 24.859375 1353 3.527716 109 8.056171
54 5806 4.313469 223 26.035874 1443 4.023562 103 7.137907
55 4820 4.363900 179 26.927374 1414 3.408769 153 10.820368
56 4578 4.262342 196 23.357143 1199 3.818182 66 5.504587
57 2774 4.664023 81 34.246914 1000 2.774000 77 7.700000

Skimming the data, it looks like there's not much variation in word length per chapter, but quite a bit more in sentence length and lexical diversity (the ratio of total word count to unique words in the chapter). We can quickly verify this with a simple calculation on the dataframe, dividing each column's standard deviation by its mean.

In [8]:
chapterdata.std()/chapterdata.mean()
Out[8]:
word_count             0.303518
avg_word_length        0.029765
sentence_count         0.309125
avg_sentence_length    0.142167
unique_words           0.221977
lexical_diversity      0.108498
new_vocabulary         0.975031
pct_new_vocab          0.846217
dtype: float64

I wonder if there happens to be any correlation between sentence length and lexical diversity:

In [9]:
print(np.corrcoef(chapterdata.avg_sentence_length, chapterdata.lexical_diversity))
[[1.         0.09856164]
 [0.09856164 1.        ]]

No, there isn't. At least not a significant one. We can see this in a quick scatterplot:

In [10]:
sns.set(style="whitegrid")
sns.scatterplot('avg_sentence_length', 'lexical_diversity', data=chapterdata)
plt.ylabel('Lexical Diversity')
plt.xlabel('Average Sentence Length')
plt.show()

Finally, a serious graph. I went through several versions of this; this one's my favorite. I'm graphing the percent of each chapter's vocabulary that is new to that chapter. Next, I add labels to each bar, then annotate each with a list of the most common new words in that chapter.

In [11]:
MOST_COMMON_COUNT = 5
mycolor="#3E000C"


fig, ax = plt.subplots(figsize=(14, 25))
sns.set_style("whitegrid", {'axes.grid' : False})
sns.barplot('pct_new_vocab', chapterdata.index, data=chapterdata, orient='h', color=mycolor)

plt.grid(axis='x')

xax = ax.get_xaxis()
xax.set_label_position('top')
plt.xlabel("Percent New Vocabulary")
plt.ylabel('Chapter')
ax.xaxis.tick_top()

for chapnum in range(1, 58):
    chaplbl = str(chapnum)
    #barlbl = '%2.0f%% (%d)' % (chapterdata.pct_new_vocab[chapnum], chapterdata.new_vocabulary[chapnum])
    barlbl = '%2.1f%%' % chapterdata.pct_new_vocab[chapnum]
    plt.text(1, chapnum-1, barlbl,\
             weight='bold', color='white', verticalalignment='center')
    if chapnum >= 3:
        newwords = ', '.join(['"%s" (%d)'%word for word in newvocab[chapnum].most_common(MOST_COMMON_COUNT)])
        #the bbox param here gives the annotation a semitransparent white background to partially hide
        # the gridlines behind it.
        plt.text(chapterdata.pct_new_vocab[chapnum]+1, chapnum-1, newwords,\
                 verticalalignment='center', bbox=dict(facecolor='#ffffff99'))

plt.show()

Python, newspaper, and NLTK

I've been continuing my experimentation with NLTK, using the Python newspaper module. While newspaper seems mainly intended for making personalized Google News-like sites, it does some language processing to support this. Below, I continue where newspaper leaves off using NLTK and seaborn to explore an online text.

Getting Ready

In [1]:
import re # For some minor data cleaning

from IPython.core.display import display, HTML
import newspaper
import nltk
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

%matplotlib inline

I have several test sites in the cell below--sites are testy about allowing non-browser clients access and I've found I need to alternate while experimenting. Most of these sites work well, but newspaper often gives weird results from the Bing News search.

In [2]:
#paper = newspaper.build('https://www.providencejournal.com/')
#paper = newspaper.build('https://www.bing.com/news/search?q=providence')
#paper = newspaper.build('https://www.guardian.co.uk')
paper = newspaper.build('https://www.vulture.com/')
len(paper.articles)
Out[2]:
69

newspaper basics

Here I choose an article and download() and parse() it--which makes its data available. Then we can pull some basic details like the article's title, address, images, and text:

In [3]:
#We *can* do this, but I want to make sure I have a relatively-long article to use.
a = paper.articles[0]

#Via aldaily.com. Longer articles make more interesting graphs.
a = newspaper.Article('https://themillions.com/2020/03/on-pandemic-and-literature.html')
a.download()
a.parse()


display(HTML("<h3>%s</h3>" % a.title))
print(a.url)

#display(HTML('<img src="%s"/>'%a.top_image))
print(a.top_image) # also a.images, a.movies
print(a.text[:500])

On Pandemic and Literature

https://themillions.com/2020/03/on-pandemic-and-literature.html
https://themillions.com/wp-content/uploads/2020/03/1-2-870x1024.jpg
Less than a century after the Black Death descended into Europe and killed 75 million people—as much as 60 percent of the population (90% in some places) dead in the five years after 1347—an anonymous Alsatian engraver with the fantastic appellation of “Master of the Playing Cards” saw fit to depict St. Sebastian: the patron saint of plague victims. Making his name, literally, from the series of playing cards he produced at the moment when the pastime first became popular in Germany, the engrave

Our Article also gets a list of the authors of the text. I've found this tends to be the least accurate piece of newspaper's processing.

In [4]:
a.authors
Out[4]:
['Ed Simon',
 'Madeleine Monson-Rosen',
 'Ken Hines',
 'Kirsty Logan',
 'Patrick Brown',
 'Emily St. John Mandel',
 'Diksha Basu',
 'Sonya Chung',
 'Andrew Saikali']

NLP with newspaper

The .nlp() method gives us access to a summary of the text and a list of keywords. I haven't looked at the source closely enough to figure out how it's determining these, though the keywords are approximately the most common non-stopwords in the article.

In [5]:
a.nlp()

print(a.summary)
display(HTML('<hr/>'))
print(a.keywords)
There has always been literature of pandemic because there have always been pandemics.
What marks the literature of plague, pestilence, and pandemic is a commitment to try and forge if not some sense of explanation, than at least a sense of meaning out of the raw experience of panic, horror, and despair.
Narrative is an attempt to stave off meaninglessness, and in the void of the pandemic, literature serves the purpose of trying, however desperately, to stop the bleeding.
Pandemic literature exists not just to analyze the reasons for the pestilence—that may not even be its primary purpose.
The necessity of literature in the aftermath of pandemic is movingly illustrated in Emily St. John Mandel’s novel Station Eleven.

['pandemic', 'disease', 'narrative', 'sense', 'black', 'writes', 'plague', 'death', 'literature', 'world']
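
As a rough check on the keywords claim above, here's my own sketch (not newspaper's actual algorithm): compare the keyword list with the most common non-stopword tokens in the article. The overlap won't necessarily be exact.

#Compare newspaper's keywords with the article's most common non-stopword tokens.
sw = nltk.corpus.stopwords.words('english')
tokens = [w.lower() for w in nltk.word_tokenize(a.text) if w.isalpha() and w.lower() not in sw]
top_tokens = [w for w, count in nltk.FreqDist(tokens).most_common(10)]
print(set(a.keywords) & set(top_tokens))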

NLP with NLTK

That's what Newspaper can do for us. But since I had nltk installed already (and Newspaper requires it even if I hadn't), I can take this article's text and do some basic processing with it.

First I need to tokenize the text, breaking it into individual words and punctuation marks.

In [6]:
a.tokens = nltk.word_tokenize(a.text)
print(a.tokens[:50])
['Less', 'than', 'a', 'century', 'after', 'the', 'Black', 'Death', 'descended', 'into', 'Europe', 'and', 'killed', '75', 'million', 'people—as', 'much', 'as', '60', 'percent', 'of', 'the', 'population', '(', '90', '%', 'in', 'some', 'places', ')', 'dead', 'in', 'the', 'five', 'years', 'after', '1347—an', 'anonymous', 'Alsatian', 'engraver', 'with', 'the', 'fantastic', 'appellation', 'of', '“', 'Master', 'of', 'the', 'Playing']

Next, guess each token's part of speech, using NLTK's "off-the-shelf" English tagger. This returns a list of 2-tuples (token, tag), with the tags drawn from the Penn Treebank tagset.

In [7]:
a.pos_tags = nltk.pos_tag(a.tokens)
a.pos_tags[:15]
Out[7]:
[('Less', 'JJR'),
 ('than', 'IN'),
 ('a', 'DT'),
 ('century', 'NN'),
 ('after', 'IN'),
 ('the', 'DT'),
 ('Black', 'NNP'),
 ('Death', 'NNP'),
 ('descended', 'VBD'),
 ('into', 'IN'),
 ('Europe', 'NNP'),
 ('and', 'CC'),
 ('killed', 'VBD'),
 ('75', 'CD'),
 ('million', 'CD')]

The Treebank tagset isn't particularly intuitive, especially if your last contact with English grammar was in middle school. Here's the help text for a few of the less-obvious tags above.

In [8]:
for pos in ['NNS', 'NNP', 'IN', 'DT', 'JJ']:
    nltk.help.upenn_tagset(pos)
NNS: noun, common, plural
    undergraduates scotches bric-a-brac products bodyguards facets coasts
    divestitures storehouses designs clubs fragrances averages
    subjectivists apprehensions muses factory-jobs ...
NNP: noun, proper, singular
    Motown Venneboerger Czestochwa Ranzer Conchita Trumplane Christos
    Oceanside Escobar Kreisler Sawyer Cougar Yvette Ervin ODI Darryl CTCA
    Shannon A.K.C. Meltex Liverpool ...
IN: preposition or conjunction, subordinating
    astride among uppon whether out inside pro despite on by throughout
    below within for towards near behind atop around if like until below
    next into if beside ...
DT: determiner
    all an another any both del each either every half la many much nary
    neither no some such that the them these this those
JJ: adjective or numeral, ordinal
    third ill-mannered pre-war regrettable oiled calamitous first separable
    ectoplasmic battery-powered participatory fourth still-to-be-named
    multilingual multi-disciplinary ...

I'll also tag the article using the "Universal" tagset--it has fewer tags, which makes for a simpler visualization later on.

In [9]:
a.upos_tags = nltk.pos_tag(a.tokens, tagset="universal")
a.upos_tags[:15]
Out[9]:
[('Less', 'ADJ'),
 ('than', 'ADP'),
 ('a', 'DET'),
 ('century', 'NOUN'),
 ('after', 'ADP'),
 ('the', 'DET'),
 ('Black', 'NOUN'),
 ('Death', 'NOUN'),
 ('descended', 'VERB'),
 ('into', 'ADP'),
 ('Europe', 'NOUN'),
 ('and', 'CONJ'),
 ('killed', 'VERB'),
 ('75', 'NUM'),
 ('million', 'NUM')]

We can also have NLTK calculate a frequency distribution of the words in our article--here I'll use it to show the most common 10 tokens, most of which you probably could have guessed:

In [10]:
a.word_freqs = nltk.FreqDist(word.lower() for word in a.tokens)
a.word_freqs.most_common(10)
Out[10]:
[('the', 310),
 (',', 262),
 ('of', 209),
 ('and', 112),
 ('.', 112),
 ('a', 88),
 ('to', 85),
 ('that', 80),
 ('’', 70),
 ('in', 61)]

Visualization

NLTK's FreqDist can also generate plots. Not great plots. Here's an example.

In [11]:
plt.figure(figsize=(12, 8))
a.word_freqs.plot(25)
plt.show()

Line graphs usually make me think "time series". This should probably be a bar plot, and we can do that. Start by translating our FreqDist object's data to a pandas DataFrame:

In [12]:
wfdf = pd.DataFrame(a.word_freqs.items(), columns=['token', 'frequency'])
wfdf.head()
Out[12]:
token frequency
0 less 2
1 than 13
2 a 88
3 century 4
4 after 9

We can now generate a Seaborn barplot of the token frequency data, which is largely unsurprising.

In [13]:
mycolor="#3E000C" #"#9A1C42"
plt.figure(figsize=(12, 8))
sns.set(style="whitegrid")
sns.barplot('frequency', 'token', data=wfdf.sort_values(by='frequency', ascending=False)[:25], color=mycolor)
plt.show()

We can make the result (arguably) more interesting by removing stopwords--very common words that don't affect the meaning of the text--from the frequency list. Here we get the stopwords for English.

In [14]:
sw = nltk.corpus.stopwords.words('english')
print(sw)
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn', "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't"]

And next, remove stopwords (and punctuation) from our word frequency list, and create a new DataFrame and a new graph.

In [15]:
# '\W' matches one "non-word character", i.e., not a letter, number or underscore.
wf2 = [word for word in a.word_freqs.items() if word[0] not in sw and not re.match('\W', word[0])]
wf2df = pd.DataFrame(wf2, columns=['token', 'frequency'])

plt.figure(figsize=(12, 8))
sns.barplot('frequency', 'token', data=wf2df.sort_values(by='frequency', ascending=False)[:25], color=mycolor)
plt.show()

This tells us more about the article. But what about the part-of-speech tags I generated earlier? Here's a function that will take those lists and generate a graph from them:

In [16]:
def posFreqGraph(tags):
    posfreqs = nltk.FreqDist(word[1] for word in tags)
    posfdf = pd.DataFrame(posfreqs.items(), columns=['pos', 'frequency'])
    plt.figure(figsize=(12, 8))
    sns.barplot('frequency', 'pos', data=posfdf.sort_values(by='frequency', ascending=False), color=mycolor)
    plt.show()

First, the graph of Penn Treebank tags. It's crowded--which shows us the richness of this tagset--but still readable. (Here's a complete list of these tags with meanings.)

In [17]:
posFreqGraph(a.pos_tags)

Here's the same visual built from the universal tagset data.

In [18]:
posFreqGraph(a.upos_tags)

Closing

I think that's a good start. Newspaper makes it easy to load web pages and get just the important text, which we can feed into NLTK for some analysis. Here I did only the very basics of an analysis with NLTK; I plan to experiment more over the next few weeks.

Simple Text Collation with Python

Although in the study of manuscript culture one of the characteristic activities is to align parallel parts of a work–and this is a common definition of collation–I speak of collation in the more narrow sense of identifying differences between texts (i.e., after passages are already “aligned”). There are three methods to collate texts: 1) read two texts side by side and note the differences, 2) compare printed page images (by allowing your eyes to merge two page images, often with a device especially for that purpose); 3) transcribe and compare transcriptions with aid of a computer.

Last semester I taught an introductory programming course to non-Computer Science graduate students at URI. My curriculum focused mostly on the Python data science toolset of Jupyter, pandas, and numpy, and using these tools to analyze the students' own datasets.

One student, an English Ph.D. candidate, asked for an alternative involving natural language processing tasks: performing a collation of two editions of a novel released about a decade apart. This kind of work was new to me, but seemed like a simple enough task for a beginning programmer to handle.

Reading PDFs

[Image: whymodel.png (PyMuPDF's extracted text from 01.01_Why_Model.pdf)]

My student had both editions of the book as PDFs (scanned from physical books, with embedded OCRed text). We explored two modules for extracting the text:

PyPDF2 was our first try. Its getPage() method didn't include whitespace in its output, giving each page's text as a single long word, probably as a result of the PDF's internal formatting, as suggested by Ned Batchelder on StackOverflow. I suspect it would be simple enough to read each word and paste them together as needed, but it was easier to find another solution for PDF reading.

PyMuPDF just worked, at least well enough for this use. It added unnecessary newlines, which would have been a problem if we were interested in paragraph breaks but wasn't an issue here. It also failed with one file's dropcaps, which was probably more an OCR/encoding issue. Here's an example of use (output on the right; the file 01.01_Why_Model.pdf is one of the readings for Scott Page's Model Thinking course on Coursera):

import fitz
pdf = fitz.open('01.01_Why_Model.pdf')
text = ''
for page in pdf:
    text += page.getText()

Text comparison with difflib

[Image: difflib HTML table: difflib.HtmlDiff's output from comparing two simple strings.]

It took me an embarrassing amount of time before I realized the tool we needed here was diff. Python's difflib was the ideal solution. It has a few basic options that easily produce machine-readable output (like the command-line diff tool's) or an HTML table, but it can also produce more complex output with a little effort. Its HtmlDiff tool worked perfectly for this.
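
For a sense of those plainer output modes, here's a quick toy example of my own (not from the student project): difflib.unified_diff produces the familiar command-line-style listing, with each tokenized word treated as a "line" of the diff.

import difflib

old = "This is my short string".split()
new = "This is another not long string".split()

#Prints ---/+++ headers, an @@ hunk marker, and then ' ', '-', and '+' prefixed tokens.
print('\n'.join(difflib.unified_diff(old, new, lineterm='')))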

The image to the right shows difflib's output from this code in a Jupyter window:

import difflib
from nltk import word_tokenize
from IPython.display import display, HTML

str1="This is my short string"
str2="This is another not long string"

words1 = word_tokenize(str1)
words2 = word_tokenize(str2)

hd = difflib.HtmlDiff()

HTML(hd.make_table(words1, words2))

Two other HtmlDiff options (to display only differences in context, and to limit the number of lines of context) were ideal for this case--we don't need to show the entire book just to print a relative handful of differences. For example, the following will only show changes with three words of context around each:

hd.make_table(words1, words2, context=True, numlines=3)

Closing

Once difflib's HTML was output, the rest of the student's work on this project was reading through the table, identifying individual changes as 'substantive' or 'accidental', and tabulating them. But there's more we could do with Python to simplify this or enrich the final output, for example (a rough sketch of the first idea follows the list):

  • Identify changes where a single punctuation mark was changed to another--many of these were probably either typos or OCR errors.
  • Do part-of-speech tagging on the books' text and include this data in the output--did the author systematically remove adjectives and adverbs, or is there some other trend?
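
Here's a minimal sketch of that first idea, assuming the two editions have already been tokenized into word lists (words1 and words2 are hypothetical names standing in for the real chapter-by-chapter token lists):

import difflib
import string

def punctuation_only_changes(words1, words2):
    """Return (position, old, new) for replacements that swap one punctuation token for another."""
    changes = []
    sm = difflib.SequenceMatcher(None, words1, words2)
    for tag, i1, i2, j1, j2 in sm.get_opcodes():
        #Only consider one-token-for-one-token replacements.
        if tag == 'replace' and i2 - i1 == 1 and j2 - j1 == 1:
            old, new = words1[i1], words2[j1]
            #string.punctuation only covers ASCII marks; curly quotes would need extra handling.
            if all(c in string.punctuation for c in old) and all(c in string.punctuation for c in new):
                changes.append((i1, old, new))
    return changes

#For example: punctuation_only_changes(word_tokenize(edition1_text), word_tokenize(edition2_text))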

About Me

Developer at Brown University Library specializing in instructional design and technology, Python-based data science, and XML-driven web development.
