COGS 127, Fall 2015: Activity #1

Topic: Introducing Mode Analytics, SQL, and Reports

This is going to be a "weird" exercise at first, because it will be the first time we'll get our hands dirty with data. But remember: This semester "big data" is a major theme of our discussions, and this Activity 1 will form a foundation for our tool development later this semester. Ready for this party? I am. Let's go.

Important notice 1: It would be much easier to do this exercise on a laptop, but a tablet should work. For future exercises, when we begin to import data into Mode, it will probably be necessary to use a laptop; just a warning for now.

Important notice 2: This activity will be the focus of classes on October 5th and 7th. These two days will be "hands on," and our TA Maryam will go around and help those get setup. It may be possible to complete the exercise in the class over these two days.

Finally, notice 3: You may do this exercise in class with a partner, if you wish to work together on these steps. However, when submitting on CatCourses (see Step 7), all activities must be handed in individually, in your own words, with your own plots.

Step 1: Create a Mode Analytics account
Mode Analytics is an online service for uploading, storing, and sharing data, along with analyzing and even sharing analyses of your data. It is, in short, an integrated "data solution." It is also completely free for basic use.

(Note: Any data imported on Mode, the free version, is public and can be viewed by everyone. Of course, it's unlikely folks would look at your data... but it is important to bear in mind when we do future exercises with data: These are public.)

We will use Mode this semester because it will be a tool to "learn data." As we've discussed already in the domain of cognitive prosthetics and modern social systems -- we are awash, in all cases, with data. It is very important for a cognitive scientist to become familiar with new ways of storing, managing, analyzing and sharing those data.

So we'll start with Mode. We have permission to setup personal accounts for each of us. Please use your @ucmerced.edu email to create your account. Click here to go to the sign up page. Important: Be sure to click "Don't have a Google or O365 account" and use your @ucmerced.edu email (this was also discussed in class). (Note: If you are prompted to import a private data source, just ignore this -- we will not be using private data sources.)

Step 2: What is a data table?
Once you have an account and you're logged in, go to this link to see your first data table:

https://modeanalytics.com/rdale/tables/some_run_data

It would be useful to start thinking of a data table as a kind of spreadsheet, with which you are likely familiar. You have rows of data, and columns (or "fields") specifying different types of information. When you go to the link above, you'll see a data table for some human behavior -- some running data.

Each row represents a specific run on a specific date, and the different columns represent information about each run: type of run, number of miles, duration, etc. The rows in a data table are often called RECORDS and the columns are called FIELDS.

If you click "schema" Mode will tell you about the structure of the data table -- what are the fields, and what kind of data is stored in each field? Note the different types of data here: date, string, and number. Date is obvious, but strings and numbers are crucially different from each other. A field is of a "Data Type" that lets you do different things with it. For example, a number can be added to or divided by and so on. However, a string cannot. A string is just a sequence of characters or letters (like a tweet). You can't "add up" a tweet, or divide it, at least not arithmetically. But you can do other things with strings, such as inserting characters, removing characters, counting the characters, and so on.

So, in short, the data table needs to know which "type" of data each field is.

Now you know what a data table is. But the data table all by itself is boring -- it's just passive, simple data. What we need to do next is ask something of our data -- what's goin' on in these data? Are there patterns? Anything interesting? Can I plot some analysis on these data? To do this, we need queries.

Step 3: What is a query?
A query is a (frequently short) program that takes your data, and does some sorting or analysis of it. In order to write a program to do this with the data, we always need to use a specific programming language or system. Mode uses a common programming language called "Structured Query Language" or "SQL" (often pronounced "sequel"). Note: SQL is a very widespread programming language, in virtually all segments of science and industry, so learning SQL in Mode will transfer to pretty much anywhere else that uses data tables.

Let's just start very simple. A SQL program is just a short program that selects particular data. Let's do the simplest possible query on our running data here. When you're at the data table linked above, click the green "Query" link in the top right.

Now you'll see Mode's really fancy SQL editor. In fact, Mode should have already produced the simplest SQL program for you. It should look like this:

[Q1] SELECT * FROM rdale.some_run_data

Let's translate this SQL into simple natural language. These programs are called "queries" and this one says: "select everything from the table rdale.some_run_data." That's it. The asterisk means "everything" (meaning, select all the fields from the table). When you run this, Mode will show you the output below: All the records in the data table. Neat eh? You've just done your first Mode query. It was simple, but let's get a bit spicier here.

Step 4: Building a more useful query in Mode.
Let's do some analysis on our data here. We might have a very basic question about the data. For example: What is the longest type of run? We can run a query to do this in SQL. I've written one for you right here. Plug this into the query editor window in Mode, and some great stuff should happen:

[Q2] SELECT avg(miles),subtype FROM rdale.some_run_data GROUP BY subtype

This says, in simple natural language: "select the average of miles and the type of run from our table, and compute that average by grouping running data according to type." Notice how the output has changed? It should now group and average the runs by type.

Activity Write Up 1: In your activity write up, please answer the following questions: What is the longest type of run, and what is its average length? What is the shortest, and its length? Do you see a problem with the data output, and if so, what could it be?

Let's add one more detail to deal with that problem mentioned above... try this out:

[Q3] SELECT avg(miles),subtype FROM rdale.some_run_data WHERE subtype!='0' GROUP BY subtype

(Note: The symbol "!=" means "not equal to" so "WHERE subtype!=" means "only where the subtype is not equal to the string '0'". Important for later: You use quotes around strings, and no quotes around numbers -- what happens if you remove the quotes around '0' in the query above?)

Activity Write Up 2: What's different about the last two queries you tried? What did we add to Q3? What was the effect?

Step 5: Create a Mode report.
Finally, we can do some fun things in Mode, like create and share reports from our queries. I'll show you how to do it from Q3, the cleaned version of our data.

Click "Add Chart" right above the data output. Pretty cool eh? It already generates something useful. Go ahead and play with this "point and click" menu of options. This should be easy for you to do. Go to town on the grouped averages here and make a snazzy plot. "Go to town" is a technical term. It means trying to do something useful but crazy. CRAZY I TELL YOU.

After you've gone to town, let's get the chart into our design. This is easy. Just click "Insert Into Design," and the chart should pop into your query. Done.

Now you can share this chart with all your friends and family -- I'm sure you already had this idea brewing from the unbearable excitement of data. Click the "Share" button in the top right. You can simply copy and paste the URL to your report, or you can share on Twitter or Facebook. Go ahead, if you like... this isn't required for the activity, but you can show off data analysis to your friends. (Note: This report will be associated with YOUR Mode account, even though it is using my data table. You can also name your report near "Untitled Query" at the top of thep age. You can click this title and rename the report.)

Activity Write Up 3: Find a way to save your chart and insert it into a Word document. Do this for both your charts (this one, and the next one below). Make sure both are in a Word document, with all your responses (see Step 7).

Step 6: Let's do one more.
By now you should have an initial sense of the following parts of a query: SELECT, avg, FROM, WHERE, and GROUP BY. If you are new to this domain, it will take a bit of time to learn it, so don't fret. Let's just do one more query on our run data, and you can answer a few more questions about it.

To show you the flexibility of queries, let's do a bit more with the running data. Let's ask the question: What are the most common runs? We can add an additional process when we SELECT by doing what's called a "count". Check this out:

[Q4] SELECT avg(miles),count(subtype),subtype FROM rdale.some_run_data WHERE subtype!='0' GROUP BY subtype

Again, in everyday terms: "select the average miles, the count of the type of run, and the type of run from our table, omit the bad data ('0'), and group by the type of run." When you run Q4, you should get a new column in the output.

Activity Write Up 4: What is the most frequent type of run? Use a chart to plot the average miles (on the x axis) and the count (on the y axis), and make sure it's a "Scatter Plot." Do you see a relationship? Briefly describe it.

Step 7: How to hand in your activity.
Answer the 4 sets of questions above (i.e., responses to sections labeled "Activity Write Up #X"). Each answer should be several sentences. Don't cheap out on us. Include your responses in a Word document, along with pictures of two charts, one from Q3 and the other from Q4. We look forward to seeing what you get!

Grading: We will evaluate your responses based on the following criteria: correctness, quality (do you justify your answer?), completeness (is everything there?), and coherence (do your answers reflect a connected understanding of the content of the activity?). The activity will be out of 16 (4 criteria x 4 question sets).

Upload the activity to CatCourses once the assignment is made available this week.