Adam T. Bradley

Mapping Colleges with R

Two things are going on here. First, I scrape some websites (an article on About.com listing top colleges in New England, plus the Wikipedia articles for those schools) using YQL. I’m just grabbing geographical information here, though I could grab more details (enrollment, year of founding, etc.) if I wanted them.

library(rjson)
library(RCurl)
library(xtable)
library(reshape)
queryYQL <- function(query) {
  url <- paste("http://query.yahooapis.com/v1/public/yql?q=",
               curlEscape(query), "&format=json&callback=", sep='')
  resp <- fromJSON(file=url)
  resp$query$results
}
getLoc <- function(name) {
  #remove the "(RISD)" after RISD's full name.
  url <- sub('\\s\\(.+\\)', '', name)
  url <- gsub(' ', '_', url)
  url <- paste('http://en.wikipedia.org/wiki/', url, sep='')
  #the double quotes inside the xpath have to be escaped in the R string.
  qry <- paste("SELECT * FROM html WHERE url='",
               url, "' AND xpath='//*[@class=\"geo\"]'",
               sep='')
  loc <- queryYQL(qry)
  #no geo span: return "". Several spans: take the first. One span: use it.
  loc <- ifelse(
    is.null(loc), "",
    ifelse(
      is.null(names(loc$span)),
      loc$span[[1]]$content,
      loc$span$content))
  loc
}
colLists = c('http://collegeapps.about.com/od/collegerankings/tp/Top-New-England-Colleges-And-Universities.htm',
             'http://collegeapps.about.com/od/collegerankings/tp/Top-New-England-Colleges-And-Universities.01.htm',
             'http://collegeapps.about.com/od/collegerankings/tp/Top-New-England-Colleges-And-Universities.02.htm')
names = c()
#I always feel like I've failed when I use a for loop in R.
for ( x in colLists ) {
  qry <- paste("SELECT * FROM html WHERE url='", x,
               "' AND xpath='//h3[@class=\"dsc\"]/a'", sep='')
  newcols <- queryYQL(qry)
  #keep only the link text (the college names) and drop the element names.
  newcols <- unlist(newcols$a)
  newcols <- newcols[names(newcols)=='content']
  names(newcols) <- NULL
  #Never do this. Except this once when it only happens three times.
  names <- c(names, newcols)
}
locs <- sapply(names, getLoc)
#drop the schools for which no coordinates came back.
names = names[locs!='']
locs = locs[locs!='']
colleges <- data.frame(names, colsplit(locs, '; ', c('lat', 'long')))
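The last step above turns strings of the form "lat; long" into two numeric columns. A minimal base-R sketch of that same split, for anyone without the reshape package: the two input strings here are illustrative (the rounded coordinates from the table in this post, not the raw scraped values), and `coords` is just a name chosen for the example.

```r
# Illustrative input in the "lat; long" form that the geo spans yield.
# These are rounded values from the table in this post, not raw scraped output.
locs <- c("Amherst College" = "42.37; -72.52",
          "Yale University" = "41.31; -72.93")

# Split each string on the literal "; " and stack the pieces into a matrix,
# one row per college, one column per coordinate.
parts <- do.call(rbind, strsplit(locs, "; ", fixed = TRUE))

# Convert the character pieces to numbers and attach the college names.
coords <- data.frame(
  names = names(locs),
  lat   = as.numeric(parts[, 1]),
  long  = as.numeric(parts[, 2]),
  stringsAsFactors = FALSE
)
coords
```

This only covers the string handling; it makes no network requests.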

Now I have a data frame of top New England colleges with geographic coordinates. Here’s the data. We lose Trinity College, since (unsurprisingly) http://en.wikipedia.org/wiki/Trinity_college is a disambiguation page rather than the article for the one in Connecticut.
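One way to get Trinity back would be a small table of title overrides that is consulted before building the Wikipedia URL. This is only a sketch: `titleOverrides` and `wikiUrl` are hypothetical names, and it assumes the list calls the school "Trinity College" and that its Wikipedia article is titled "Trinity College (Connecticut)".

```r
# Hypothetical override table: name as listed -> Wikipedia article title.
# The Trinity entry is an assumption about both the list name and the title.
titleOverrides <- c("Trinity College" = "Trinity College (Connecticut)")

# Hypothetical helper: build the article URL for a college name.
wikiUrl <- function(name) {
  if (name %in% names(titleOverrides)) {
    # Use the hand-picked title when one exists.
    title <- titleOverrides[[name]]
  } else {
    # Same cleanup as getLoc: drop a trailing parenthetical such as "(RISD)".
    title <- sub("\\s\\(.+\\)", "", name)
  }
  paste0("http://en.wikipedia.org/wiki/", gsub(" ", "_", title))
}

wikiUrl("Trinity College")
wikiUrl("Rhode Island School of Design (RISD)")
```

The override has to be checked first, because the parenthetical cleanup would otherwise strip "(Connecticut)" from the title.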

print(xtable(colleges), type="html", include.rownames=FALSE)
names lat long
Amherst College 42.37  -72.52 
Babson College 42.30  -71.26 
Bates College 44.11  -70.20 
Bentley University 42.39  -71.22 
Boston College 42.34  -71.17 
Bowdoin College 43.91  -69.96 
Brandeis University 42.37  -71.26 
Brown University 41.83  -71.40 
Coast Guard Academy 41.37  -72.10 
Colby College 44.56  -69.66 
Connecticut College 41.38  -72.10 
Dartmouth College 43.70  -72.29 
Harvard University 42.37  -71.12 
Holy Cross, College of the 42.24  -71.81 
Massachusetts Institute of Technology 42.36  -71.09 
Middlebury College 44.01  -73.18 
Olin College of Engineering 42.29  -71.26 
Rhode Island School of Design (RISD) 41.83  -71.41 
Smith College 42.32  -72.64 
Tufts University 42.41  -71.12 
Wellesley College 42.30  -71.31 
Wesleyan University 41.56  -72.66 
Williams College 42.71  -73.20 
Yale University 41.31  -72.93 

Finally, I plot the data, using code heavily borrowed from here. Positioning of the names could use some work, but this is good enough for now.

library(maps)
library(ggplot2)
all_states <- map_data("state")
new_england <- subset(all_states, region %in%
                        c('connecticut', 'rhode island',
                          'massachusetts', 'vermont',
                          'new hampshire', 'maine'))
map_theme <- theme(
  line = element_blank(),
  rect = element_blank(),
  strip.text = element_blank(),
  axis.text = element_blank(),
  axis.title = element_blank(),
  legend.position="none")
plt <- ggplot() +
  map_theme +
  ggtitle("Some colleges in New England") +
  geom_polygon(data=new_england,
               aes(long,lat,group=group)) +
  geom_path(data=new_england,
            aes(long,lat,group=group),
            color="white")+
  geom_point(data=colleges,
             aes(long, lat, color=names)) +
  geom_text(data=colleges,
            hjust=-0.07, vjust=0.4,
            aes(x=long, y=lat, label=names,
                colour=names),
            size=3)
plt
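On the label positioning: one cheap option, sketched here under the assumption of a ggplot2 version recent enough to have it (2.0.0 or later), is the `check_overlap` argument of `geom_text`. The data frame `pts` is an illustrative three-row subset of the table above, not the full `colleges` data.

```r
library(ggplot2)

# Illustrative subset of the colleges table (rounded coordinates).
pts <- data.frame(
  names = c("Harvard University", "Tufts University", "Yale University"),
  lat   = c(42.37, 42.41, 41.31),
  long  = c(-71.12, -71.12, -72.93)
)

p <- ggplot(pts, aes(long, lat, label = names)) +
  geom_point() +
  # check_overlap = TRUE skips any label that would overlap one already drawn.
  geom_text(hjust = -0.07, vjust = 0.4, size = 3, check_overlap = TRUE)
p
```

The trade-off is that overlapping labels are dropped rather than moved, so crowded areas like Boston lose some names. The ggrepel package's `geom_text_repel` moves labels apart instead, at the cost of an extra dependency.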

[Figure: plot of chunk college-display, the map of New England colleges produced by the code above]