Before you analyze any data, time series included, it is critical to plot the data and check for problems that need to be dealt with first. We have many examples of this in the keystroke data we have been using. Let's again load up subject 1 and take a glance at his or her keystrokes.
subject1 = read.table('1.csv',header=T)
plot(subject1$RT)

How do you feel about this long reaction time? You can find out which keystroke it is by using a function called which.max. This function tells you at which entry in the keystroke times the maximum value is located:
which.max(subject1$RT)
## [1] 11
It’s in entry #11! We can check this easily enough by referencing that precise entry, in the following way:
subject1$RT[11]
## [1] 47841.68
And there it is! What a monstrous keystroke time. What do you think the participant was doing? Checking his or her cell phone? Watching a flock of birds gracefully fly across a window? Yelling at a roommate that she or he would also like to have some popcorn? Rather than a finger used for typing, was it being used for picking, perhaps? Who knows. But as a cognitive scientist analyzing your data, you may simply want to remove entries like this from the data, because they are overly long and do not properly capture the unfolding dynamics of natural typing. Removing it is quite simple, actually. We just have to learn a bit about subsetting the data. Consider the following:
subject1$RT<1000
## [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE
## [12] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [23] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [34] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [45] TRUE TRUE TRUE TRUE TRUE FALSE TRUE TRUE TRUE TRUE TRUE
## [56] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [67] FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE
## [78] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [89] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [100] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [111] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [122] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [133] TRUE TRUE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [144] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [155] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [166] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE
## [177] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [188] TRUE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [199] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [210] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [221] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [232] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [243] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [254] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE TRUE TRUE
## [265] TRUE TRUE TRUE TRUE TRUE TRUE
Whoa! This is a vector of TRUE and FALSE values, based on whether an entry in subject 1's RT is less than 1,000 milliseconds (TRUE) or greater than or equal to 1,000 milliseconds (FALSE). You can actually save this vector of trues and falses to a variable, and use it later. Consider this:
entriesWeWant = subject1$RT<1000
cleanedReactionTimes = subject1$RT[entriesWeWant]
Now, in your console, look inside cleanedReactionTimes (either by using View(x), or just by typing it in the console and hitting enter). Do you see the monstrous keystroke time at entry 11? You shouldn't, because that entry (entry 11) had a FALSE value associated with it in the "entriesWeWant" variable. Let's show this in a different way, by finding the new max, using which.max again:
which.max(cleanedReactionTimes)
## [1] 200
Whoa! The new maximum is at entry 200 – but why? This is because when we used the "entriesWeWant" variable – which contains a sequence of TRUE and FALSE values – we subsetted subject 1's RT values to only those that are less than 1 second (1,000 ms). Not only did we get rid of the maximum value, but we also got rid of every other slow keystroke time. This is an example of "cleaning" our data set – a devilish detail – because we have to make tough decisions about a cutoff (1 second? 2 seconds? 500 ms?), and decide whether to remove such entries at all. You might also call this process removing outliers from our keystroke reaction times. In our case, to keep it simple, let's plot the cleanedReactionTimes data (containing the filtered, or subsetted, RT column of subject 1).
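Before committing to a cutoff, it can help to count how many keystrokes a given threshold would actually remove. A quick sketch (assuming subject1 is loaded as above): sum() on a logical vector counts its TRUE values, so we can count the entries at or above each candidate cutoff.

```r
# sum() on a logical vector counts its TRUE values,
# so this counts how many keystrokes each cutoff would remove
sum(subject1$RT >= 1000)   # entries removed at a 1-second cutoff
sum(subject1$RT >= 500)    # a stricter cutoff removes more entries
```

If a threshold would throw away a large fraction of your keystrokes, that is a sign it may be too aggressive.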
plot(cleanedReactionTimes)

So much clearer! Now if you want to do all this cleaning really efficiently, in just a couple of lines of R code, watch this. Let’s do exactly this filtering, but on a new subject (say, subject 4):
subject4 = read.table('4.csv',header=T)
plot(subject4$RT[subject4$RT<1000])

Look over the plot line there, and think about what's happening. We are plotting subject 4's reaction times, but not all of them. Inside the square brackets, we are subsetting the reaction times we want by introducing a condition – "less than 1,000 ms" – which produces a sequence of TRUE and FALSE values. Only the TRUE entries get selected, and the resulting plot is much cleaner. Easy!
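If you want to convince yourself that this one-liner does exactly what the earlier two-step version did, you can compare the two with identical(). A sketch (the variable name entriesWeWant4 is just an illustrative choice):

```r
entriesWeWant4 = subject4$RT < 1000   # save the TRUE/FALSE vector first
# the one-liner's subset and the two-step subset are the same values
identical(subject4$RT[subject4$RT < 1000], subject4$RT[entriesWeWant4])
# this prints TRUE
```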
Let’s do subject 5 just to show you how this works again.
subject5 = read.table('5.csv',header=T)
plot(subject5$RT[subject5$RT<1000])

Wicked. So now you should have a sense of how to subset your data by creating a condition. Let's change the condition a bit to see what happens. Let's make a more restrictive condition, so that we only plot the keystroke times that are less than 500 ms.
plot(subject5$RT[subject5$RT<500])

We can ask simple questions about the data, such as: what effect do outliers have on our means and standard deviations? To figure this out, we can just take the mean and standard deviation of the RT data before and after subsetting. See the difference:
mean(subject5$RT)
## [1] 459.4336
mean(subject5$RT[subject5$RT<1000])
## [1] 227.6392
Notice a big difference? Of course! When you keep (or subset) only reaction times that are under a second, the average RT on keystrokes drops precipitously – down to just over 200 ms.
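As an aside, if you want a summary statistic that is less sensitive to outliers in the first place, the median barely moves when you filter – one reason it is often reported alongside the mean. A quick sketch:

```r
# the median is robust to extreme values,
# so filtering changes it far less than it changes the mean
median(subject5$RT)
median(subject5$RT[subject5$RT < 1000])
```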
What about standard deviation?
sd(subject5$RT)
## [1] 1809.623
sd(subject5$RT[subject5$RT<1000])
## [1] 168.5709
This is also a massive drop, because removing outliers removes EXTREME variability.
Why do I call this a devilish detail? Because subsetting ("cleaning") your data requires you to make particular choices. Should you take out keystrokes that are way too slow? Sure. But at what threshold or cutoff? 500 ms? 1,000 ms? What do you think? There is no single correct answer. The RStudio part is easy; the hard part is thinking it over and justifying your choice.
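One way to think the cutoff over is to see how the mean changes across a range of candidate thresholds. This sketch (assuming subject5 is loaded) uses sapply() to compute the cleaned mean at each cutoff in one line:

```r
cutoffs = c(250, 500, 1000, 2000)   # candidate thresholds in ms
# for each cutoff, compute the mean of the keystrokes that survive it
sapply(cutoffs, function(cut) mean(subject5$RT[subject5$RT < cut]))
```

If the mean stabilizes past a certain threshold, cutoffs beyond that point matter little; if it keeps shifting, your choice is doing real work and deserves justification.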
Main exercise: Try other subjects
As an exercise, try this process of subsetting, using whatever cutoff value you want, on (say) one or two of the other subjects we downloaded yesterday. Just get a sense of how this process of subsetting and filtering your data works for you.
Now a little gift for you. Something fun and extra… you can also plot text with R/RStudio. Check this out:
plot(subject5$RT,col='white')

Okay, boring so far. We have plotted white markers (col='white'), so the plot looks blank (since the background is white). But I did this because it generates our plot and makes room for letters – using the text() function, we can plot the characters at the RT locations!
plot(subject5$RT,col='white')
text(1:length(subject5$RT),subject5$RT,subject5$Char)

Notice we are using a new function called length(). It tells you how many reaction times there are (it is similar to dim(), but it gives the length of a single sequence rather than the dimensions of a table).
Also notice that the text() function takes three arguments. It needs to know the x-axis positions (from 1 to the length of the sequence), the y-axis values (which we take to be the reaction times), and then the particular text to write (we finally get to use the Char variable in our data table!).
Try this on a few of the other subjects. It allows you to read their typing in a fun way, seeing the dynamics as the keys unfold character to character. Do any particular keys/characters seem slower than others?
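You can also combine text() with the subsetting we learned above – for instance, labeling only the slow keystrokes to see which characters caused them. A sketch, using an arbitrary 500 ms threshold:

```r
plot(subject5$RT, col='grey')    # plot all keystrokes in grey
slow = subject5$RT > 500         # logical vector marking the slow ones
# label only the slow keystrokes with their characters
text(which(slow), subject5$RT[slow], subject5$Char[slow])
```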
If you can't quite see the letters well enough because they're all scrunched together along the x-axis, there are two options. One is to again subset our data to look only at the first (say) 50 keystrokes and plot just those. The other is to shrink the size of the text in the plot so we can see more of it. Here's how you do both:
# plot only 1 to 50; notice how we are subsetting with 1:50
rangeOfKeys = 1:50
plot(subject5$RT[rangeOfKeys],col='white')
text(1:length(subject5$RT[rangeOfKeys]),subject5$RT[rangeOfKeys],subject5$Char[rangeOfKeys])

This looks more complicated than it is. We are creating a variable called "rangeOfKeys" so that you can subset the data using a range. We then subset each column we request (the dollar-sign requests for RT and Char) by putting that range of keystrokes (from 1 to 50) in square brackets. The code is convenient because all you have to do to change the range you plot is change the rangeOfKeys variable.
(NB: It should be clear now that you can subset using either sequences of TRUE/FALSE values or the specific entries that you want, such as 1:50, as above.)
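The two styles of subsetting are connected: the which() function converts a TRUE/FALSE vector into the entry numbers that are TRUE, so these lines select exactly the same keystrokes:

```r
fastOnes = subject5$RT < 1000    # TRUE/FALSE subsetting
subject5$RT[fastOnes]
subject5$RT[which(fastOnes)]     # which() gives the entry numbers; same result
```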
# plot only 1 to 100; notice how we just changed rangeOfKeys
rangeOfKeys = 1:100
plot(subject5$RT[rangeOfKeys],col='white')
text(1:length(subject5$RT[rangeOfKeys]),subject5$RT[rangeOfKeys],subject5$Char[rangeOfKeys])

Now, here's how we would shrink the text size. See the "cex" option:
plot(subject5$RT,col='white')
text(1:length(subject5$RT),subject5$RT,subject5$Char,cex=.5)

Simple exercise
Load up a few subjects of your choice, and try out these procedures on them. Clean ’em. Plot the text. Etc.
A bonus exercise if you have time
Now that you have a sense of subsetting, we can actually subset by character, and explore whether some keys are slower than others. Let's take one of our subjects again (say, subject 5) and subset by choosing only entries that have the letter 'e' – and another for the letter 'p'. Which letter is more common in English? Given that information, which do you think will have the faster (that is, smaller) mean reaction time? Let's try it out:
letter_e_data = subject5$RT[subject5$Char=='e'] # subset only the RTs for letter e
letter_p_data = subject5$RT[subject5$Char=='p'] # subset only the RTs for letter p
mean(letter_e_data)
## [1] 169.1277
mean(letter_p_data)
## [1] 235.8571
Here’s the exercise:
Choose another subject from our keystroke data, and try out the above comparison of ‘p’ and ‘e’. Does it hold up to your expectations?
Choose a few letters that you think will also have a fast keystroke speed … can you find one that compares to 'e', since 'e' is, in fact, the most common letter in English and many other languages?
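To choose candidate letters for this exercise, you can count how often each character occurs in a subject's typing with table(), and sort the counts. A quick sketch:

```r
# count how often each character appears; sort() puts the most frequent last
sort(table(subject5$Char))
```

The most frequent characters in a subject's typing are good candidates for fast keystroke times.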
One last bonus demo
What if you just want to read what the subjects typed!? And more easily?! Like without having to scan through the sequence, argh!? R allows you to do all kinds of things with characters, too. Let's run a quick paste() function on the "Char" variable in one subject's data table. paste() does what it describes – it can COLLAPSE characters and paste them together into one longer string. Like this:
paste(subject5$Char,collapse='')
## [1] "his was a movie abiut a girl thst lived in india and a guy whi lived in america the girl worked in a bank and she came across the guys file whos card was being charged for things he did not buy the girl had alws wanted to go to san frs she had told the guy she lived in san fra because her job gad given her a fake name and and city the guy yold her he wss going to he in san fra and they could meet if she wanted sje meet with him wito him and left her parents a note the girl was supposes to get married but she stared ro fslwith the other guy the story contined where her parents caugr her snd bring her back to india but soon enough the guy realzes that her relly like her sk he follws her and they fall in love and the girl gets a parent approval "
Woo! Now you can easily read all the reviews. DO IT! I mean, if you want to. No pressure.
Okay see you later!