Assignment 3: Exploratory Data Analysis
Description
In this assignment you will use visualization software to perform exploratory data analysis on a real-world dataset. The goal is to gain practice formulating and answering questions through visual analysis and to learn and critique a leading visualization tool.
You must work in pairs for this assignment. The assignment should be 3,000 – 5,000 words in length (10 – 20 pages with images). You'll be turning it in online by uploading it to the class dropbox.
Assignment
For this assignment, you will use a visualization tool to analyze a data set. You may use Tableau Software or an alternate tool (see below).
OVERVIEW
First steps:
Step 1. Choose domain & data
Step 2. Profile the data
Step 3. Pose questions
Iterate as needed.
Create visualizations:
Interact with the data
Refine your questions
For your writeup:
Keep a record of your analysis
Prepare at least one final graphic and caption to answer an interesting question
During your exploration of the data, we encourage you to create and record various types of views, including bar charts, scatter plots, maps and time series as appropriate for the data and question you are exploring. Note how different views support different questions and may reveal areas for further questions or exploration.
- Look at the data and/or its description. Write down at least four initial questions that you think the data may answer, including a comparative question, a correlation question, a geographically-oriented question, and a time-related or trend question.
- Use the visualization tool to examine the data for answers to your initial questions. You may wish to look for (for example):
  - relationships between pairs of variables (correlations, clusters)
  - outliers of various kinds
  - trends
- You may wish to refine your initial questions based on what you find as you explore. For example, you may wish to pose a related question about a subset of the data. Filtering, sorting, or other operations may be helpful.
- Use the visualization tool to explore the dataset and look for other unexpected kinds of relations. Note what features of the visualized data attracted your attention/focus. Try to highlight or otherwise isolate the subset of the data that contains an interesting feature.
- For some datasets it can be helpful to transform some of the data (e.g. by computing averages or medians, by converting numbers to percentages, etc.). You may do some of these kinds of transformations if you feel it is necessary or helpful (Tableau supports this), but if it is not needed then leave the data as is.
- Write up a discussion of what you found -- both expected and unexpected. This can include relationships that did not appear even though you thought they might. Try to report on at least one interesting or surprising piece of information. Be sure to illustrate your points with screenshots, but please scale them so they aren't too large.
- For each of your final questions, create a visualization that answers it. Be sure to label your axes and include an appropriate caption.
- In your discussion, comment on your use of Tableau (or other tool). What features did you find useful? Which ones were intuitive to use, and which were hard to understand? Was there any functionality that the tool did not have that you wished it did? In other words, how would you improve on the tool?
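The transformations and subset operations mentioned in the list above (medians, percentages, filtering) can also be done outside the visualization tool before loading the data. The following is a minimal Python sketch using only the standard library. The column names `state`, `party`, and `spending` and the sample rows are hypothetical and do not reflect the actual layout of any of the assignment datasets.

```python
import csv
import io
from collections import defaultdict
from statistics import median

# Hypothetical sample rows; the real datasets use different columns.
SAMPLE = """state,party,spending
CA,DEM,100
CA,REP,300
NY,DEM,200
NY,REP,400
NY,DEM,600
"""

def load_rows(text):
    """Parse CSV text into a list of dicts, converting spending to float."""
    rows = list(csv.DictReader(io.StringIO(text)))
    for row in rows:
        row["spending"] = float(row["spending"])
    return rows

def median_by(rows, key, value):
    """Median of `value` for each distinct value of `key`."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row[value])
    return {k: median(v) for k, v in groups.items()}

def percent_share(rows, key, value):
    """Each group's share of the total of `value`, as a percentage."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[key]] += row[value]
    grand_total = sum(totals.values())
    return {k: 100.0 * t / grand_total for k, t in totals.items()}

rows = load_rows(SAMPLE)
ny_rows = [r for r in rows if r["state"] == "NY"]  # filter to a subset
medians = median_by(rows, "party", "spending")
shares = percent_share(rows, "state", "spending")
```

If you transform data this way, say so in your writeup so that readers know which values are derived rather than original.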
Data
Please use one of the following three datasets for the assignment. They should be sufficiently rich for you to be able to discover interesting information by exploring them. These sets contain a mix of nominal, ordinal, quantitative, geographical, and temporal data.
While these sets do not require a lot of pre-processing or coding on your part, the data may contain empty fields, spelling errors, or other incomplete or faulty data, which you may need to address as you explore.
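One way to get a first impression of such problems is to count empty fields per column and to look for values that differ only in capitalization or stray whitespace. The sketch below is one possible approach, not a required step. The columns `species`, `airport`, and `damage` and the sample rows are made up for illustration.

```python
import csv
import io
from collections import Counter, defaultdict

# Hypothetical sample with two empty "damage" fields, one empty "airport"
# field, and two spellings of the same species.
SAMPLE = """species,airport,damage
Canada goose,SFO,minor
canada goose ,SFO,
Gull,,none
Gull,JFK,
"""

def empty_counts(rows):
    """Count empty (or whitespace-only) values in each column."""
    counts = Counter()
    for row in rows:
        for column, value in row.items():
            if value is None or not value.strip():
                counts[column] += 1
    return dict(counts)

def spelling_variants(rows, column):
    """Group distinct raw values that match after trimming and lowercasing."""
    variants = defaultdict(set)
    for row in rows:
        raw = row[column]
        if raw and raw.strip():
            variants[raw.strip().lower()].add(raw)
    # Keep only normalized values that appear under more than one spelling.
    return {k: sorted(v) for k, v in variants.items() if len(v) > 1}

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
missing = empty_counts(rows)
variants = spelling_variants(rows, "species")
```

This only catches variants that differ by case or surrounding whitespace. Genuine misspellings would need manual inspection or fuzzy matching.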
We have intentionally removed some of the details about contents and size of each dataset below. Discovering the content and scope of your chosen dataset is part of your exploration process.
File formats: Each dataset is provided as a CSV (comma-separated values) text file. Besides being readable by the visualization tools we will be using, these files can be imported directly into Microsoft Excel. You are encouraged to look at the data to get a feel for its contents, structure and scale before beginning your analysis. You may use Tableau to look at the underlying data after connecting to it (use the spreadsheet-shaped button at the upper left under the word Data) or you may open the file in Excel or a text editor before loading it into the visualization tool.
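If you prefer a script to Excel or a text editor for this first look, a few lines of Python are enough to list the column names, show the first rows, and count the records. This is a minimal sketch. The in-memory sample and its columns `year`, `country`, and `gdp` are placeholders, not the contents of any provided file.

```python
import csv
import io
from itertools import islice

def peek(handle, n=3):
    """Return (column names, first n rows, total row count) for a CSV stream."""
    reader = csv.reader(handle)
    header = next(reader)
    first_rows = list(islice(reader, n))
    # Count the rows already read plus whatever remains in the stream.
    total = len(first_rows) + sum(1 for _ in reader)
    return header, first_rows, total

# Stand-in for a real file. With an actual dataset one would pass
# open("your_dataset.csv", newline="") instead (file name hypothetical).
sample = io.StringIO(
    "year,country,gdp\n"
    "2010,A,1.5\n"
    "2011,A,1.7\n"
    "2010,B,2.0\n"
    "2011,B,2.2\n"
)
header, first_rows, total = peek(sample, n=2)
```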
Since each dataset includes a large amount of data, a large number of questions could be asked at many levels of detail. Using congressional candidate spending as an example, one might want to investigate spending and contributions at an aggregated level, breaking down the data by political party at the national level. Alternatively, it is equally valid to filter out many of the attributes or entire sections of the data and explore, say, finances at a finer granularity, for example by investigating your own state and your local and neighboring congressional districts.
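The two levels of detail described above correspond to grouping on few keys over all rows versus filtering first and then grouping on more keys. The sketch below illustrates the idea on made-up records. The field names `party`, `state`, `district`, and `spent` are hypothetical, and the real FEC columns are named differently.

```python
from collections import defaultdict

# Hypothetical records for illustration only.
candidates = [
    {"party": "DEM", "state": "CA", "district": "12", "spent": 500.0},
    {"party": "REP", "state": "CA", "district": "12", "spent": 300.0},
    {"party": "DEM", "state": "CA", "district": "13", "spent": 200.0},
    {"party": "REP", "state": "TX", "district": "07", "spent": 400.0},
]

def total_by(rows, keys, value="spent"):
    """Sum `value` over rows grouped by the tuple of `keys`."""
    totals = defaultdict(float)
    for row in rows:
        totals[tuple(row[k] for k in keys)] += row[value]
    return dict(totals)

# Coarse view: national totals per party.
national = total_by(candidates, ["party"])

# Fine view: keep one state, then break down by district and party.
california = [r for r in candidates if r["state"] == "CA"]
by_district = total_by(california, ["district", "party"])
```

In Tableau the same distinction is made interactively by choosing which fields to group on and which filters to apply.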
1. FAA wildlife strike data
This large dataset reports on collisions between wildlife and aircraft, based on pilot reporting collected by the United States Federal Aviation Administration. Each row of data describes a single collision between some type of wildlife and an aircraft. The data was downloaded from: http://www.faa.gov/airports/airport_safety/wildlife/database/. The form pilots use to report a wildlife strike may be helpful in understanding some aspects of the data: http://wildlife.faa.gov/strikenew.aspx
The data can be downloaded here in a compressed format that is optimized for Tableau:
or here for use in an alternate tool:
2. Congressional candidate spending
This dataset is a financial summary of U.S. campaign finance contributions and spending for each 2-year Congressional election cycle for a range of years. These data sets are published by the United States Federal Election Commission, and include summary information for every candidate for Congressional office (both the Senate and House of Representatives). The data was downloaded from: http://www.fec.gov/finance/disclosure/ftpsum.shtml
The dataset can be downloaded here:
Each row of data represents a single candidate running for office in the given year (years are listed using the last year of the election cycle), and contains information about contributions, expenditures, party affiliation, state, congressional district (or none for the Senate), and outcome (win/loss/runoff). The dataset is fairly large. The only difference between the data set being given to you and the ones on the FEC website is that we have concatenated data for multiple election cycles and added a year column, enabling any number of trend analyses. As you'll undoubtedly notice, the data is highly multidimensional, with a large number of columns. The FEC website includes a detailed description of each column.
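To make the structure of the combined file concrete, the sketch below shows one way per-cycle files could be concatenated with an added year column. It is an illustration only, not the procedure actually used to prepare the dataset, and the columns `name` and `receipts` and the sample values are invented.

```python
import csv
import io

def concat_cycles(cycles):
    """Merge per-cycle CSV streams into one list of rows with a 'year' field.

    `cycles` maps the last year of each election cycle to a file-like object.
    """
    combined = []
    for year, handle in sorted(cycles.items()):
        for row in csv.DictReader(handle):
            row["year"] = year  # tag every row with its election cycle
            combined.append(row)
    return combined

# Hypothetical per-cycle inputs, keyed by the last year of the cycle.
cycles = {
    2010: io.StringIO("name,receipts\nA. Smith,100\nB. Jones,250\n"),
    2008: io.StringIO("name,receipts\nA. Smith,80\n"),
}
rows = concat_cycles(cycles)

# With a year column present, a simple trend is a sum per year.
receipts_by_year = {}
for row in rows:
    year = row["year"]
    receipts_by_year[year] = receipts_by_year.get(year, 0.0) + float(row["receipts"])
```

Because a candidate can appear in several cycles, remember that one person may occupy multiple rows when you count or aggregate.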
3. World economic data
This dataset is included as a sample dataset in Tableau 8.1. It includes a variety of measures about countries of the world over some recent years.
Please provide a basic description of the contents, size (approximate # of rows and columns, length of time covered, geographical span included, rough number and/or range of dimensions) and quality (how complete, how many errors did you see, did quality issues impede your analysis) of the dataset you explored.
Here is a version of the data for use outside Tableau:
- world economic data.csv (0.5M) (This one is only for use outside Tableau!)
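The basic dataset description requested above (approximate size, time span, geographical span, completeness) can be estimated with a short script if you are working outside Tableau. The sketch below is one possible way to compute these numbers. The columns `country`, `year`, `gdp`, and `population` and the sample rows are hypothetical and are not taken from the provided files.

```python
import csv
import io

# Hypothetical sample with one missing gdp and one missing population value.
SAMPLE = """country,year,gdp,population
A,2008,1.0,10
A,2009,,11
B,2008,2.0,
B,2010,2.5,21
"""

def profile(rows, year_column, place_column):
    """Summarize size, time span, geographic span, and completeness."""
    columns = list(rows[0].keys())
    years = sorted({int(r[year_column]) for r in rows if r[year_column]})
    # Number of non-empty values in each column.
    filled = {
        c: sum(1 for r in rows if r[c] and r[c].strip()) for c in columns
    }
    return {
        "rows": len(rows),
        "columns": len(columns),
        "year_range": (years[0], years[-1]),
        "places": len({r[place_column] for r in rows if r[place_column]}),
        "percent_complete": {c: 100.0 * n / len(rows) for c, n in filled.items()},
    }

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
summary = profile(rows, "year", "country")
```

Numbers like these cover only size and completeness. Your description should still comment on the errors you noticed and whether quality issues impeded your analysis.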
Tools
You may use Tableau or another visualization tool of your choosing for this assignment. Possible choices include GGobi and D3.
Grading (50 pts)
Assignments will be graded based on:
- Clear questions and applicable dataset
- Basic description of dataset contents, size & perceived quality
- The description of your visual exploration process
- Major view types included (bar charts, scatter plots, maps and time series) with appropriate related questions
- The depth of your analysis
- The design of your final visualizations
  - Instructive image (does it answer the question?)
  - Appropriate caption and description
  - Expressiveness/effectiveness of the visualization
- Comments and evaluation of the visualization tool including any improvements you might make.