Exercise 4 - Quantifying Time Series

You have now seen how to get RStudio set up, how to move around your computer to access your data files (get to your working directory with setwd), load your data files, and plot and format some time series.

Today we are going to focus on how we can measure aspects of a time series, and thereby learn something about the process that generated that time series. For example, our original keystroke data tells us how fast someone can produce language at the keyboard. However, the dynamic pattern as people type may also be interesting, and may reveal how they type!

Let’s start by exploring some new data. I have created a new archive of data files that contains 10 participants who did the typing task. When you download this ZIP archive, extract it, and look inside, you’ll see each subject is stored as file 1.csv, 2.csv, etc. You can load one of these in the familiar way. (Note: This assumes you have downloaded these data, and navigated your RStudio to that folder using setwd, as yesterday!)

subject1 = read.table('1.csv',header=T)
plot(subject1$RT)

This should look familiar to you – it’s the same subject with 270 keystrokes from yesterday! Let’s try another subject, for… fun.

subject2 = read.table('2.csv',header=T)
plot(subject2$RT,family='Courier') # note we can change the FONT! possible values: Times, Arial, Courier, etc.

Notice that the data, when imported, has a column named “RT”. If you look at the data file, you’ll notice it actually has two columns this time! I have shared a bit more data with you, for fun.

subject2[1:10,] # take first 10 rows, all columns

##    Char       RT
## 1     h   23.385
## 2     i  168.650
## 3     s   63.720
## 4        151.940
## 5     i  239.980
## 6     s   96.065
## 7         96.030
## 8     a  103.980
## 9         96.145
## 10    n 3768.080

Remember the dollar sign is used to refer to the column (these are sometimes called variables in the data table). Our subjects have two variables in their data table: character typed, and reaction time in milliseconds. You can also read through each character and guess which movie each participant is describing. :-)
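To make the dollar-sign and bracket notation concrete, here is a small sketch on a made-up data table. The name `toy` and all of its values are invented for illustration; they do not come from any of the subject files.

```r
# A made-up data table with the same two columns as the subject files.
# The values here are invented for illustration only.
toy = data.frame(Char = c('h', 'i', ' ', 'a'),
                 RT   = c(120, 95, 210, 101))

toy$RT        # the RT column on its own
toy$Char      # the Char column on its own
toy[1:2, ]    # first 2 rows, all columns
toy[, 'RT']   # all rows, only the RT column (same values as toy$RT)
nrow(toy)     # number of rows (keystrokes) in the table
```

The same patterns should work on subject1 and subject2 once they are loaded.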

Let’s now take the average keystroke time for each participant.

mean(subject1$RT)

## [1] 413.2844

mean(subject2$RT)

## [1] 654.8031

Subject 2 has a higher average keystroke time than subject 1, so subject 2 is the slower typist on average. How about standard deviation, a measure of variability?

sd(subject1$RT)

## [1] 2930.427

sd(subject2$RT)

## [1] 4718.488

Which subject has the higher variability?

Simple exercise: Plot subject 1 and subject 2 and take a look at their time series visually; do the mean and standard deviation, as we just calculated here, seem obvious from the data?

Hint: remember to use ylim, like yesterday, to get a better look at the reaction times.
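One thing to watch for while you look: a few very long pauses (like the 3768 ms keystroke in row 10 of subject 2 above) can have a big effect on both the mean and the standard deviation. The sketch below shows this on a short invented vector; the name `rt` and its numbers are made up for illustration and are not taken from any subject.

```r
# Invented keystroke times in milliseconds: mostly fast, with one long pause.
# These numbers are illustrative, not taken from any subject file.
rt = c(100, 120, 90, 110, 105, 95, 4000)

mean(rt)     # pulled well above the typical keystroke time by the one pause
median(rt)   # stays close to the typical keystroke time
sd(rt)       # also dominated by the one pause

# The same statistics with the long pause left out, for comparison
mean(rt[rt < 1000])
sd(rt[rt < 1000])

# Zooming the y-axis with ylim makes the fast keystrokes visible
plot(rt, ylim = c(0, 500))
```

If the real data behave like this toy vector, the mean and standard deviation may say more about the pauses than about the ordinary keystrokes.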

For this exercise, we also need the “entropy” library. This will be your first experience installing an R library (also called a package). It’s quite easy. R and RStudio have a HUGE number of completely free tools that can extend their capabilities for you. The entropy library has lots of useful components, chief among them, of course, the “entropy(x)” function. Copy and paste the following instruction into your RStudio console:

install.packages('entropy')

It should install for you. If not, let me know and I’ll swing by. Now that we have this installed, we have to “load” the library in the following way:

library(entropy)

This tells RStudio we want to do some entropy calculations.

We should now be good to go! Let’s see which time series has higher entropy. From the plots you viewed above, can you guess?

entropy(discretize(subject1$RT,numBins=10)) # entropy of subject 1

## [1] 0.0488497

entropy(discretize(subject2$RT,numBins=10)) # entropy of subject 2

## [1] 0.06305718

Given what we described in the brief presentation by Rick, what does this mean? Which participant has more “disorder”?

(Note: In the entropy lines above, I am using a new “discretize” function; ignore this for now. Use this same line of code when you’re computing entropy for the other subjects too; I’ll explain later.)
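For the curious, here is a rough sketch of the idea behind that line, written in base R so it runs without the entropy library. It assumes that discretize with numBins=10 counts how many values fall into 10 equal-width bins across the range of the data, and that entropy, with its default settings, computes the Shannon entropy of those counts using the natural log. The vector `rt` is invented, and this is an approximation of the idea, not the library's own code.

```r
# Invented keystroke times in milliseconds (illustrative only)
rt = c(100, 120, 90, 110, 105, 95, 4000, 130, 85, 115)

# Step 1: cut the range of the data into 10 equal-width bins
# and count how many values land in each bin.
breaks = seq(min(rt), max(rt), length.out = 11)   # 11 edges make 10 bins
counts = table(cut(rt, breaks = breaks, include.lowest = TRUE))

# Step 2: turn counts into proportions, then apply H = -sum(p * log(p)).
# Empty bins are dropped because they contribute nothing to the sum.
p = counts / sum(counts)
p = p[p > 0]
H = -sum(p * log(p))   # natural log, so the unit is "nats"
H
```

In this toy vector, the single long pause stretches the bins so far that nine of the ten keystrokes share the first bin, which keeps the entropy low. Keep that in mind when you look at the plots for the exercises below.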

Now, for fun, let’s do the following:

Simple exercises

1. Create a new .R script file, and load up all 10 subjects. Calculate the mean for each. Which subject, out of the 10, is the slowest typist? Which the fastest?
2. In that script file, also compute the entropy for each of the 10 subjects. Which has the highest entropy? Which the lowest?
3. Take the participants with the highest entropy and lowest entropy and plot their keystroke times. What do you see?
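If typing the same lines ten times feels tedious, a loop is one way to do it. The sketch below is only a starting point under some assumptions: it first writes three small invented files into a temporary folder so that it runs anywhere, and those fake files contain only an RT column. With the real data you would skip that first step, read 1.csv through 10.csv from your working directory, and set `ids` to 1:10. The names `folder`, `fake`, `ids`, and `means` are made up for this example.

```r
# Hypothetical setup so the sketch runs anywhere: write three small,
# invented data files into a temporary folder. In the exercise you would
# skip this step and use the real files in your working directory.
folder = tempdir()
set.seed(1)
for (i in 1:3) {
  fake = data.frame(RT = round(runif(20, min = 50, max = 300) * i, 3))
  write.table(fake, file.path(folder, paste0(i, '.csv')), row.names = FALSE)
}

# Loop over subject numbers, load each file, and store the mean RT.
# paste0(i, '.csv') builds the file name, e.g. '1.csv'.
ids = 1:3                      # would be 1:10 for the real data
means = numeric(length(ids))   # empty vector to hold the results
for (i in ids) {
  subject = read.table(file.path(folder, paste0(i, '.csv')), header = TRUE)
  means[i] = mean(subject$RT)
}

means              # one mean per subject
which.max(means)   # position of the largest mean RT
which.min(means)   # position of the smallest mean RT
```

The same loop can hold an entropy value per subject by adding a second results vector and filling it with the entropy line from above.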