Exercise 7 - Case Study Challenge

We’ve considered several types of data now, from keystrokes to eye tracking to stock data – and yesterday, Till gave you an advanced survey of text-based methods.

Today let’s try something quite different – historical language data – and use it as a “data challenge.” Let’s analyze Google Ngram data for English (I mention other languages below, too). We will explore how the frequency of a subset of words changes from 1900 to 2000. Google digitized millions of books, and researchers (like us) can get the data and see how words become more or less common over a long period of time (here, a century). You can read about the project here, and even get some of your own data from other languages, if you wish: Google Ngram.

The Google files are huge, so I’ve filtered some of it for you. It’s still lots of data, and you can download a zip of it here.

english_data = read.table('allWords1900on_english.txt',stringsAsFactors=FALSE)

Question 1

Use your new R knowledge to answer this simple question: How many rows does english_data have? Does it seem like a lot of data? (Use the dim() function, from before.)
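In case dim() is rusty, here is a minimal sketch on a tiny made-up data frame (toy_data and its values are invented for illustration and are not part of the Ngram files). dim() returns the number of rows first and the number of columns second; nrow() returns the row count on its own.

```r
# toy_data is a made-up example, not the real Ngram data
toy_data = data.frame(word = c('a', 'b', 'c'), count = c(10, 20, 30))

toy_dims = dim(toy_data)   # rows first, then columns
toy_rows = nrow(toy_data)  # just the number of rows
print(toy_dims)
print(toy_rows)
```

The same two calls work unchanged on english_data.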

(Note: I focus on English, but we have several other filtered data sets too, including Spanish, Russian, Hebrew, French, Italian, and more. Let me know if you are interested in these data sets and I can share them with you later.)

Let’s see what’s in the English data set:

english_data[1:10,]

##     V1   V2   V3   V4   V5           V6
## 1  ALL 1901 4603 4148 1724 3.826543e-06
## 2  ALL 1902 4647 4249 2025 3.833809e-06
## 3  ALL 1903 4555 4168 1953 3.792565e-06
## 4  ALL 1904 5131 4695 2117 3.894088e-06
## 5  ALL 1905 4753 4332 1994 3.667261e-06
## 6  ALL 1906 4584 4137 1870 3.567009e-06
## 7  ALL 1907 4806 4323 2054 3.650363e-06
## 8  ALL 1908 5756 5045 2053 4.377061e-06
## 9  ALL 1909 7825 6969 2895 6.724548e-06
## 10 ALL 1910 8445 7197 2932 7.040869e-06

We appear to have a few columns this time, rather than just 2. Obviously words appear in the leftmost column but… what’s with the column names? When an irritating collaborator (me) sends you data without column names, you sometimes have to set them yourself. R gives them default names (V1, V2, and so on). Maryam briefly showed the colnames() function on Tuesday. Let’s use it here:

colnames(english_data) = c('word','year','occurrences','pages','books','percentage')

These data are easy to understand. Each row contains a word, a year, the number of times the word occurred that year, the number of pages and books it occurred in, and finally the percentage it represents that year.
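After setting column names, a quick check that each column has the type you expect can save confusion later. The sketch below uses a made-up two-row data frame in the same six-column layout (all values are invented); str() prints the column names and types at a glance.

```r
# Made-up rows in the same six-column layout (values invented for illustration)
toy_data = data.frame(V1 = c('yes', 'yes'), V2 = c(1901, 1902),
                      V3 = c(10, 12), V4 = c(9, 11), V5 = c(5, 6),
                      V6 = c(1e-06, 2e-06), stringsAsFactors = FALSE)
colnames(toy_data) = c('word', 'year', 'occurrences', 'pages', 'books', 'percentage')

str(toy_data)                    # column names and types at a glance
toy_names = colnames(toy_data)   # the names we just set
```

If a column you expect to be numeric shows up as character, that usually points to a problem in how the file was read.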

We can see how the percentage of occurrences of the word ‘yes’ has changed across the century (from 1900 to about 2000). Notice any patterns? (Note: see how I’m subsetting here!)

one_word_data = english_data[english_data$word=='yes',]
plot(one_word_data$year,one_word_data$percentage,type='b',xlab='Year',ylab='Percentage',lwd=2)

Note that I’m subsetting our huge data set into a smaller set of data, now contained in the variable “one_word_data.”
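To unpack what the subsetting line does, here is a small sketch on made-up data (toy_data and its numbers are invented). The comparison produces one TRUE or FALSE per row, and putting that vector before the comma keeps only the TRUE rows; leaving the space after the comma empty keeps all columns.

```r
# toy_data is invented for illustration; it is not the Ngram data
toy_data = data.frame(word = c('yes', 'yes', 'no', 'no'),
                      year = c(1901, 1902, 1901, 1902),
                      percentage = c(0.1, 0.2, 0.3, 0.4),
                      stringsAsFactors = FALSE)

keep = toy_data$word == 'yes'   # one TRUE/FALSE per row
n_matching = sum(keep)          # TRUE counts as 1, so this counts matching rows
toy_yes = toy_data[keep, ]      # rows where keep is TRUE, all columns
print(toy_yes)
```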

Question 2

Using your subsetting ability now, which of the following words tends to be the most frequent: yes, no, maybe, or okay?

Question 3

By plotting the change over time for these four words, do you notice changes in these common English words?

Google’s researchers assembled these data so that we could explore cultural change using data. Let’s try a word that would seem to be relevant to, say, the 1960s, right? ‘peace’!

one_word_data = english_data[english_data$word=='peace',]
plot(one_word_data$year,one_word_data$percentage,type='b',xlab='Year',ylab='Percentage',lwd=2)

Whoa… what happened here? Why do you think the data look this way, given your knowledge of US/World history?

Let’s plot another word on top of this one to see if it helps us understand. Notice that to superimpose plots (put one plot on top of another), we use points(), like this:

one_word_data = english_data[english_data$word=='peace',]
plot(one_word_data$year,one_word_data$percentage,type='b',xlab='Year',ylab='Percentage',lwd=2)
one_word_data = english_data[english_data$word=='war',]
points(one_word_data$year,one_word_data$percentage,type='b',lwd=2)

Question 4

If the plot doesn’t look quite right, use your plotting skills with xlim and ylim to get ‘war’ and ‘peace’ to show up completely on the same plot. I showed this in a prior exercise. Color one of them green (peace, of course) and the other red (war).

It would make sense that these two words are correlated. Let me show you how to use R to calculate a correlation. It’s easy. First, let’s subset our data to get two new variables: war, and peace. Then, we simply correlate them with the cor() function, which gives us a number from -1 (negative) to 1 (positive) expressing the correlation.

war = english_data[english_data$word=='war',]
peace = english_data[english_data$word=='peace',]
cor(war$percentage,peace$percentage)

## [1] 0.9412848

Notice you can’t do cor(war,peace) because, of course, these are data frames and not vectors of numbers. The numbers are located in the columns, and we use the dollar sign ($) to tell cor() which two columns to correlate.
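One thing to keep in mind: cor() pairs values by position, so the call above assumes both subsets cover the same years in the same order. If a word were missing a year, a cautious approach is to line the two subsets up by year first. Here is a minimal sketch with made-up numbers (toy_war and toy_peace are invented, and whether the real data has any gaps is not something I have checked here):

```r
# Invented data: toy_peace has no row for 1901
toy_war = data.frame(year = c(1901, 1902, 1903, 1904), percentage = c(1, 2, 3, 4))
toy_peace = data.frame(year = c(1902, 1903, 1904), percentage = c(4, 6, 8))

# merge() keeps only the years present in both data frames
both = merge(toy_war, toy_peace, by = 'year', suffixes = c('_war', '_peace'))
toy_cor = cor(both$percentage_war, both$percentage_peace)
print(both)
print(toy_cor)
```

Calling cor() directly on the two toy columns would fail, because they have different lengths.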

This is a high correlation! 0.94 is almost a perfect correlation.

We can also plot these two percentages against each other, like this. Let’s put war on the x-axis and peace on the y-axis:

plot(war$percentage,peace$percentage,xlab='Percentage \'war\'',ylab='Percentage \'peace\'')

This plot tells us that we can expect war and peace to be words that occur more or less in similar ways. When ‘war’ is used a lot, we can predict that ‘peace’ will be used a lot, too.

Question 5

The data don’t contain all words from English, but a subset – the most common words (so that we can easily load them into RStudio!). Think of some other pairs of words that you think might be correlated, and compute some correlation as above (in other words: think of pairs, subset your data, then use the cor() or plot() function to inspect).

Here’s one I thought of just now. Singular and plural will work for almost all words. Let’s try ‘dog’ and ‘dogs’.

dog = english_data[english_data$word=='dog',]
dogs = english_data[english_data$word=='dogs',]
cor(dog$percentage,dogs$percentage)

## [1] 0.979075

Question 6

Awesome! That worked. Almost perfect correlation, as expected. But this is trivial. Can you think of other non-trivial ones, maybe? What pairs of words are going to be historically correlated in their use?
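If you end up trying many pairs, a small helper function saves typing. The sketch below is one possible way to write it: word_cor is a hypothetical name of my own, not something from the course materials, and the toy data are invented. It assumes the column names word, year, and percentage that we set earlier.

```r
# word_cor is a hypothetical helper, not part of the course materials
word_cor = function(data, word1, word2) {
  first = data[data$word == word1, c('year', 'percentage')]
  second = data[data$word == word2, c('year', 'percentage')]
  both = merge(first, second, by = 'year')   # keep only years present for both words
  cor(both$percentage.x, both$percentage.y)  # .x and .y are merge()'s default suffixes
}

# Invented data to try the helper on
toy_data = data.frame(word = rep(c('cat', 'cats'), each = 4),
                      year = rep(1901:1904, times = 2),
                      percentage = c(1, 2, 3, 4, 2, 4, 6, 8),
                      stringsAsFactors = FALSE)
toy_result = word_cor(toy_data, 'cat', 'cats')
print(toy_result)
```

With the real data, the call would look like word_cor(english_data, 'dog', 'dogs').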

Final Challenge Question!

Can you think of words that will be negatively correlated? First person who finds two words correlated at less than -0.5, and has a good explanation why, wins a prize!
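As a reminder of what a negative correlation looks like, here is a tiny sketch with made-up numbers (rising and falling are invented vectors, not words from the data). One sequence goes up while the other goes down in a perfectly straight line, so cor() gives the most negative value possible.

```r
# Invented numbers: one sequence rises while the other falls
rising = c(1, 2, 3, 4, 5)
falling = c(10, 8, 6, 4, 2)

toy_negative = cor(rising, falling)  # perfectly linear and opposite, so -1
print(toy_negative)
```

Real word pairs will be noisier than this, which is why -0.5 is already a strong result.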