Manipulating Some Big Data

Goal: Explore the relationship among some keyterms by transforming variables in Excel and replotting in various ways. This will give you a refresher regarding standardizing variables, the meaning of scatterplots, and spurious relationships (a tricky problem in Big Data research).


Step 1: Download and open these data in Microsoft Excel

Google Ngram lets you download its data, but its packages are very very large and the process takes much time. So, I've done some work for you to get some initial data. Here [download this] are the top 250 words used in English Fiction from 1900 to 2000. Lots of data! Browse the data and make sure you understand it.

Step 2: Choose a subset of data to extract

After meandering the dataset a bit, look for three words of interest to you that radically differ in frequency. In other words, pick a word that is very high in frequency, and another word that is medium frequency, and another one that appears to be very low frequency. This may take you a few minutes, but should be easy enough (the columns are ordered by frequency, so that helps).

Step 3: Extract these 3 words (columns) into a new spreadsheet

Create a new spreadwsheet in Excel and copy and paste the words that are of interest to you.

Step 4: Plot these words over time

Plot these three words using their frequency over time (year = x-axis, frequency = y-axis). Of course, you should see a problem you've seen before: The words vary too greatly in frequency to judge their relative patterns.

We need to transform the frequencies so that they are in the same scale. The most common way of doing this is to standardize your variables. Do you remember doing this in PSYC 10? It's quite easy. In a fourth column, type the following formula in the first cell containing frequency data:

=(A2-average($A$2:$A$101))/stdev($A$2:$A$101)

See it in action here:



Notice the "$" signs in the formula in Excel. This tells Excel to calculate average and standard deviation using *precisely* rows 2 to 101 (our 100 years). The dollar signs let Excel know that this and only this range of cells should be included in that part of the formula.

Recognize this formula? This is the formula for a z-score, and it will transform your variable into a scale that is in terms of the number of standard deviations a given frequency is for that year (for a word). That's it! Now, use your mouse to click on the bottom-right corner of this cell, and drag it all the way down the same column to the last year. Zowy! The standardization is accomplished across all years. Now do this for each word in columns E and F. Now plot the standard variables. Your variables are now in the same scale -- standard deviations! Now we are able to compare the patterns of their frequency change over time directly. Here are some screenshots from my own work to give you clues:





Step 5: Plot a scatterplot of your variables

You chose 3 words and so could produce several scatterplots to represent the potential relationship between pairs of them. Just choose a single pair of words, and use Excel's graphing to plot the relationship between one word's frequency in one year and another word across its years. It's easy to do this. Choose both columns, then select the scatterplot option in Excel's graphing menus. What do you see? Does there appear to be a correlation? What could be generating this pattern?

Some of you may get some curious looking patterns, but we should interpret these data carefully. With so much data, it is possible that subtle relationships may exist in our data set that may not reflect causal forces or true relationships in the world. You may simply report no relationship in your writeup; however if you do see a relationship, write briefly a careful reasoned analysis about whether you believe these relationships might represent something meaningful. There is no right answer.

Step 6: Brief description of your findings

Write 100-200 words as in last week, and include the first time plot (unstandardized), the standardized plot, and the scatterplot. Submit your writeup and your figure on CROPS.