Assignment 3: Exploratory Data Analysis

Description

In this assignment you will use visualization software to perform exploratory data analysis on a real-world dataset. The goal is to gain practice formulating and answering questions through visual analysis and to learn and critique a leading visualization tool.

You must work in pairs for this assignment. The writeup should be 3,000 – 5,000 words in length (10 – 20 pages with images). Submit it online by uploading it to the class dropbox.

Assignment

For this assignment, you will use a visualization tool to analyze a dataset. You may use Tableau Software or an alternate tool (see below).

OVERVIEW

First steps:

Step 1. Choose domain & data

Step 2. Profile the data

Step 3. Pose questions

Iterate as needed.

Create visualizations:

Interact with the data

Refine your questions

For your writeup:

Keep a record of your analysis

Prepare at least one final graphic and caption to answer an interesting question

During your exploration of the data, we encourage you to create and record various types of views, including bar charts, scatter plots, maps and time series as appropriate for the data and question you are exploring.  Note how different views support different questions and may reveal areas for further questions or exploration.

Data

Please use one of the following three datasets for the assignment. Each should be sufficiently rich for you to discover interesting information by exploring it. These datasets contain a mix of nominal, ordinal, quantitative, geographical, and temporal data.

While these datasets do not require a lot of pre-processing or coding on your part, they may contain empty fields, spelling errors, or other incomplete or faulty entries, which you may need to address as you explore.
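
As a rough illustration of how such problems might be surfaced before visual exploration, the following Python sketch counts empty fields per column and groups values that differ only in case or surrounding whitespace. The sample rows and the column names (species, state) are made up for illustration and are not taken from any of the assignment datasets.

```python
import csv
import io
from collections import Counter, defaultdict

# Hypothetical sample; the real files have different columns and far more rows.
SAMPLE_CSV = """species,state
Canada goose,CA
canada goose ,CA
,WA
Mourning dove,
"""


def count_empty_fields(rows):
    """Return a Counter mapping column name -> number of blank values."""
    empty = Counter()
    for row in rows:
        for column, value in row.items():
            if value is None or value.strip() == "":
                empty[column] += 1
    return empty


def spelling_variants(rows, column):
    """Group distinct raw values that match after trimming and lowercasing."""
    groups = defaultdict(set)
    for row in rows:
        raw = row[column] or ""
        key = raw.strip().lower()
        if key:
            groups[key].add(raw)
    # Keep only keys that appear under more than one raw spelling.
    return {key: sorted(raws) for key, raws in groups.items() if len(raws) > 1}


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
empty_counts = count_empty_fields(rows)
variants = spelling_variants(rows, "species")
print(dict(empty_counts))
print(variants)
```

A check like this only catches trivial variants. Genuine misspellings usually still need to be found by eye, for example by sorting a column's distinct values in the visualization tool.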

We have intentionally removed some of the details about contents and size of each dataset below. Discovering the content and scope of your chosen dataset is part of your exploration process.

File formats: We offer each dataset as a CSV (comma-separated values) text file. In addition to being readable by the visualization tools we will be using, a CSV file can be imported directly into Microsoft Excel. You are encouraged to look at the data to get a feel for its contents, structure and scale before beginning your analysis. You may use Tableau to look at the underlying data after connecting to it (use the spreadsheet-shaped button at the upper left under the word Data), or you may open the file in Excel or a text editor before loading it into the visualization tool.
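
If you prefer to inspect the file programmatically rather than in Excel or a text editor, a short script can report the column names, the row count, and the first few rows. The sketch below uses only Python's standard library; the inline sample and its column names are placeholders standing in for a downloaded file.

```python
import csv
import io
from itertools import islice

# Placeholder data standing in for one of the downloaded CSV files.
SAMPLE_CSV = """year,state,party,receipts
2008,CA,DEM,1200
2008,TX,REP,900
2010,CA,REP,400
"""


def peek(handle, preview_rows=2):
    """Return (column names, first rows, total data-row count) for a CSV stream."""
    reader = csv.reader(handle)
    header = next(reader)
    preview = list(islice(reader, preview_rows))
    # Count the remaining rows without holding them all in memory.
    total = len(preview) + sum(1 for _ in reader)
    return header, preview, total


# For a real file, pass open(path, newline="", encoding="utf-8") instead
# (the encoding is an assumption; adjust it if the file fails to decode).
header, preview, total = peek(io.StringIO(SAMPLE_CSV))
print(header)
print(preview)
print(total)
```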

Since each dataset includes a large amount of data, a large number of questions could be asked at many levels of detail. Using congressional candidate spending as an example, one might investigate spending and contributions at an aggregated level, breaking down the data by political party at the national level. Alternatively, it is equally valid to filter out many of the attributes or entire sections of the data and explore finances at a finer granularity, for example by investigating your own state and your local and neighboring congressional districts.
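
The two levels of detail described above correspond to two basic operations: aggregating over a grouping attribute and filtering down to a subset. A minimal Python sketch of both follows, using invented rows and hypothetical column names (party, state, district, spent) rather than the actual FEC columns.

```python
from collections import defaultdict

# Invented records for illustration; not real campaign finance figures.
CANDIDATES = [
    {"party": "DEM", "state": "CA", "district": "12", "spent": 500},
    {"party": "REP", "state": "CA", "district": "12", "spent": 300},
    {"party": "DEM", "state": "NY", "district": "03", "spent": 200},
    {"party": "REP", "state": "TX", "district": "07", "spent": 400},
]


def total_by(records, key, measure):
    """Sum `measure` over all records, grouped by the value of `key`."""
    totals = defaultdict(int)
    for record in records:
        totals[record[key]] += record[measure]
    return dict(totals)


def filter_to(records, **criteria):
    """Keep only records whose fields equal every given criterion."""
    return [r for r in records if all(r[k] == v for k, v in criteria.items())]


# Coarse view: national spending broken down by party.
national_by_party = total_by(CANDIDATES, "party", "spent")

# Fine view: one state only, broken down by congressional district.
california_only = filter_to(CANDIDATES, state="CA")
california_by_district = total_by(california_only, "district", "spent")

print(national_by_party)
print(california_by_district)
```

In Tableau the same choices are typically made by placing a dimension on a shelf (grouping) or adding a filter, so no code is required; the sketch only makes the distinction concrete.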

1. FAA wildlife strike data

This large dataset reports on collisions between wildlife and aircraft, based on pilot reporting collected by the United States Federal Aviation Administration.  Each row of data describes a single collision between some type of wildlife and an aircraft.  The data was downloaded from: http://www.faa.gov/airports/airport_safety/wildlife/database/.  The form pilots use to report a wildlife strike may be helpful in understanding some aspects of the data: http://wildlife.faa.gov/strikenew.aspx

The data can be downloaded here in a compressed format that is optimized for Tableau:

or here for use in an alternate tool:

2. Congressional candidate spending

This dataset is a financial summary of U.S. campaign contributions and spending for each two-year Congressional election cycle over a range of years. The data is published by the United States Federal Election Commission and includes summary information for every candidate for Congressional office (both the Senate and the House of Representatives). The data was downloaded from: http://www.fec.gov/finance/disclosure/ftpsum.shtml

The dataset can be downloaded here:

Each row of data represents a single candidate running for office in the given year (years are listed using the last year of the election cycle) and contains information about contributions, expenditures, party affiliation, state, congressional district (or none for the Senate), and outcome (win/loss/runoff). The dataset is fairly large. The only difference between the dataset given to you and the files on the FEC website is that we have concatenated data for multiple election cycles and added a year column, which makes trend analyses across cycles possible. The data is highly multidimensional, with a large number of columns. The FEC website includes a detailed description of each column.
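
The idea of concatenating per-cycle files and tagging each row with its year can be sketched as follows. This is an illustration only, not the script used to prepare the dataset; the per-cycle contents and column names are invented.

```python
import csv
import io

# Invented per-cycle data, keyed by the last year of the election cycle.
CYCLE_FILES = {
    2008: "candidate,party,receipts\nA. Smith,DEM,100\nB. Jones,REP,150\n",
    2010: "candidate,party,receipts\nA. Smith,DEM,120\n",
}


def concatenate_cycles(cycle_files):
    """Stack the rows of every cycle and tag each row with a `year` field."""
    combined = []
    for year in sorted(cycle_files):
        for row in csv.DictReader(io.StringIO(cycle_files[year])):
            combined.append({"year": year, **row})
    return combined


def total_receipts_by_year(rows):
    """A simple trend: total receipts per election cycle."""
    totals = {}
    for row in rows:
        totals[row["year"]] = totals.get(row["year"], 0) + int(row["receipts"])
    return totals


all_rows = concatenate_cycles(CYCLE_FILES)
trend = total_receipts_by_year(all_rows)
print(len(all_rows))
print(trend)
```

Because every row carries its own year, any measure can be grouped or plotted by cycle without going back to the separate source files.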

3. World economic data

This dataset is included as a sample dataset in Tableau 8.1.  It includes a variety of measures about countries of the world over some recent years.

Please provide a basic description of the contents, size (approximate # of rows and columns, length of time covered, geographical span included, rough number and/or range of dimensions) and quality (how complete, how many errors did you see, did quality issues impede your analysis) of the dataset you explored.

Here is a version of the data for use outside Tableau:

Tools

You may use Tableau or another visualization tool of your choosing for this assignment. Possible choices include GGobi and D3.

Grading (50 pts)

Assignments will be graded based on: