Goal: Explore the relationship among some keyterms by transforming variables in Excel and replotting in various ways. This will give you a refresher regarding standardizing variables, the meaning of scatterplots, and spurious relationships (a tricky problem in Big Data research).
Google Ngram lets you download its data, but its packages are very very large and the process takes much time. So, I've done some work for you to get some initial data. Here [download this] are the top 250 words used in English Fiction from 1900 to 2000. Lots of data! Browse the data and make sure you understand it.
After meandering the dataset a bit, look for three words of interest to you that radically differ in frequency. In other words, pick a word that is very high in frequency, and another word that is medium frequency, and another one that appears to be very low frequency. This may take you a few minutes, but should be easy enough (the columns are ordered by frequency, so that helps).
Create a new spreadwsheet in Excel and copy and paste the words that are of interest to you.
Plot these three words using their frequency over time (year = x-axis, frequency = y-axis). Of course, you should see a problem you've seen before: The words vary too greatly in frequency to judge their relative patterns.
We need to transform the frequencies so that they are in the same scale. The most common way of doing this is to standardize your variables. Do you remember doing this in PSYC 10? It's quite easy. In a fourth column, type the following formula in the first cell containing frequency data:
=(A2-average($A$2:$A$101))/stdev($A$2:$A$101)
See it in action here:
Notice the "$" signs in the formula in Excel. This tells Excel to calculate average and standard deviation using *precisely* rows 2 to 101 (our 100 years). The dollar signs let Excel know that this and only this range of cells should be included in that part of the formula.
Recognize this formula? This is the formula for a z-score, and it will transform your variable into a scale that is in terms of the number of standard deviations a given frequency is for that year (for a word). That's it! Now, use your mouse to click on the bottom-right corner of this cell, and drag it all the way down the same column to the last year. Zowy! The standardization is accomplished across all years. Now do this for each word in columns E and F. Now plot the standard variables. Your variables are now in the same scale -- standard deviations! Now we are able to compare the patterns of their frequency change over time directly. Here are some screenshots from my own work to give you clues:
Write 100-200 words as in last week, and include the first time plot (unstandardized), the standardized plot, and the scatterplot. Submit your writeup and your figure on CROPS.