# POLS/CSSS 503
# Lab Session 1: April 6, 2012
# Daniel Berliner

# Agenda:
# 1. Reading in datasets
# 2. Basic Data Tools and Plotting
# 3. Indexing Practice
# 4. Regression Basics


### 1. READING IN DATASETS

# finding out your working directory:
getwd()

# changing your working directory:
setwd("C:/Documents and Settings/Dan B/Desktop/POLS 503 TA 2012/Lab 1") # note that this is specific to my computer
# for mac?
#setwd("/Users/....")

dir() # returns the list of files in your working directory

# Reading in .csv files (note: this provides your bridge to Excel):
data1<-read.csv("rossoildata.csv", na.strings="")

# NOTE THAT DATA FOR THIS (and many subsequent) EXAMPLES COMES FROM THE FOLLOWING ARTICLE:
# Ross, Michael L. 2001. "Does oil hinder democracy?" World Politics. 53 (3): 325-361.

# Dealing with STATA datasets:
# To load STATA files you need to first load a library called "foreign":
library(foreign)
# Then just use the read.dta() command, just as you would with read.csv
#data2<-read.dta("data2.dta")

# and, there are tons of other data formats that you can use, although you
# probably won't need to worry about these for the time being given that almost
# all PoliSci datasets are available as either .csv or .dta files.


### 2. BASIC DATA TOOLS AND PLOTTING

# Dataframes are another class of R objects. Technically they're a type of list.
# You can think of them as just collections of vectors bound together in a
# matrix-like format. Conveniently, they automatically have variable names.
# They also can contain both character string variables and numeric variables
# at the same time, without all values being treated as character strings
# (as in a matrix).

# You can reference a single variable using the name of the dataframe followed
# by "$" followed by the name of the variable. This single variable is just a vector:
data1$oil

# You can use indexing to reference the column position of the variable, as you would for a matrix:
data1[,24]

# You can also index using the name of the variable:
data1[,"oil"]

# Some useful things:
names(data1)
summary(data1)
dim(data1)
head(data1)
tail(data1)
unique(data1$regime1)
sort(unique(data1$regime1))
table(data1$regime1)
quantile(data1$regime1, .9, na.rm=TRUE) # 90th percentile; na.rm=TRUE is needed if regime1 has missing values
?quantile
cor(data1$regime1, data1$GDPcap, use="complete.obs") # correlation between two variables
cor(data1[,c("regime1","oil","GDPcap","oecd")],use="complete.obs") # correlation matrix for 4 variables
# can you explain what each component of the above line of code is doing?

# Using the attach command:
attach(data1)
# You can now reference variables from the dataframe directly:
oil

# Q: What does "attach" do?
# A: When you read in a file like data1 you end up with an object called a
# dataframe. You can think of this as a special type of matrix (although we'll
# get into the details later in the course). What the attach command does is to
# let you access each column in the dataframe as if it were a separate vector
# object - i.e., just by using its column name. If you hadn't attached the
# dataframe, you would need to reference individual variables in the dataframe
# using constructs like data1$oil for the oil exports variable, etc. Sometimes
# it saves time to attach dataframes, other times it can be confusing or lead
# to errors. For instance, if working with multiple dataframe objects each of
# which had an Education variable, you wouldn't want to attach them, since
# "Education" would silently refer to whichever dataframe was attached most recently.

# You can create a very crude scatterplot using the plot() command with any two vectors.
# For example:
a<-1:10
b<- 100-exp((10-a)/2)
plot(a,b)

# adding lines:
lines(a,b) # you need to supply two vectors containing the x and y coordinates of the points

# adding bells and whistles:
plot(a,b, type="n", xlab="Time", ylab="Knowledge", main="R Learning Curve", bty="n", las=1)
lines(a,b, lwd=3, col="blue")

# Make a scatterplot of the relationship between oil and democracy:
plot(oil, regime1, xlab="Oil exports/GDP", ylab="Democracy", main="Relationship between Oil and Democracy")

# Also... histograms and boxplots
hist(data1$oil[data1$year==1995])
boxplot(data1$oil[data1$year==1995])
boxplot(regime1[oil<10], regime1[oil>=10 & oil<20], regime1[oil>=20])
# you can experiment with ?boxplot and ?hist to find ways of improving the presentation.

# other tools:
sort(oil)
round(oil,2)
sort(round(oil,2))

# How to subset specific variables (advisable before removing missing data):
# The Ross replication data has many poorly labeled variables. We want to focus on only 4, for now:
#   regime1: level of democracy
#   oil: oil exports as a percent of GDP
#   GDPcap: GDP per capita
#   oecd: indicator variable for OECD members
# We make a new dataframe, data2, containing only the columns with those
# variables, as well as the country and year indicators:
data2<-data1[,c(1,4,28,24,50,20)]
dim(data1)
dim(data2)
head(data2)

# If we do not know the column numbers for the variables we want, we can reference them by name:
data3<-data1[,c("cty_name","year","regime1","oil","GDPcap","oecd")]
# are the two identical?
identical(data2, data3)

# Now we can safely na.omit to remove missing data.
data2<-na.omit(data2) # Note that I am leaving the original, data1, untouched.
dim(data2) # Compare with dim(data1)
# What would happen if we ran na.omit() on the full data1 first, and THEN selected the variables we wanted? Why?

attach(data2) # attach data2 just for purposes of the following exercises:
# (the message notifies you that the variables from when we attached data1
# earlier are now masked by the ones in data2)


### 3. INDEXING PRACTICE (these are very difficult, so don't worry):

# 1. What years does the dataset cover?
# 2. How many observations are in the dataset for Albania? How many for Belarus?
# 3. What is the highest observed GDP per capita for the year 1985?
# 4. Which OECD country has the highest observed oil exports/GDP (regardless of year)? Which non-OECD country?
# 5. What is the average level of democracy for observations where oil exports make up less than 10 percent of GDP?
# 6. What is the average level of democracy among the top twenty percent of oil exporting countries in the year 1995?
#    What is the average among the bottom eighty percent?
# 7. What is the difference between the average level of democracy among countries with GDP per capita of less than
#    $10,000 and the average level among countries with greater than or equal to $10,000, for the year 1980?
#    What is this difference for 1990?
# 8. Make an interesting scatterplot.


### 4. REGRESSION BASICS (if time)

# To run a regression (y, x1, x2, x3 are placeholders for your own variable names)
res <- lm(y ~ x1 + x2 + x3)

# If you did not attach, you need to specify the dataset. You can also specify many other options.
res <- lm(y~x1+x2+x3,
          data=data)   # A dataframe containing y, x1, x2, etc.

# To print a summary
summary(res)

# To get the coefficients
res$coefficients # or coef(res)

# To get residuals
res$residuals # or resid(res)

# To get the variance-covariance matrix of the estimated coefficients
vcov(res)

# To get the standard errors
sqrt(diag(vcov(res)))

# To get the fitted values
predict(res)

# To get expected values for a new observation or dataset
predict(res,
        newdata,                  # a dataframe with same x vars as data, but new values
        interval = "confidence",  # alternative: "prediction"
        level = 0.95
        )
# This may not make sense now, but we'll come back to prediction in the future.
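
# A minimal runnable sketch of the regression commands above, using simulated
# data rather than the Ross dataset. Everything named here (sim, new_obs, the
# sample size, and the coefficient values 1, 2 and -0.5) is a made-up
# assumption for illustration only, not part of the lab data.
set.seed(503)                                    # make the simulated draws reproducible
n <- 200
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
sim$y <- 1 + 2*sim$x1 - 0.5*sim$x2 + rnorm(n)    # x3 has no true effect on y

res <- lm(y ~ x1 + x2 + x3, data = sim)          # no attach() needed when data= is given
summary(res)

# the standard errors computed by hand should match those reported by summary()
se <- sqrt(diag(vcov(res)))
all.equal(se, summary(res)$coefficients[, "Std. Error"])

# expected values (with 95% confidence intervals) for two hypothetical new observations
new_obs <- data.frame(x1 = c(0, 1), x2 = c(0, 0), x3 = c(0, 0))
pred <- predict(res, newdata = new_obs, interval = "confidence", level = 0.95)
pred   # columns: fit (expected value), lwr and upr (interval bounds)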