Intro to Data Science
1
Introduction
1.1
What does this book cover?
2
R: Environment for data analysis
2.1
Installing R and RStudio
2.2
First look at RStudio
2.3
Working with R in RStudio
2.4
Workspace variables: remembering your results
2.5
Data types
2.6
Writing scripts: repeating and improving your commands
2.7
Packages: re-using work done by others
2.8
Getting help
2.8.1
Available help sources
2.8.2
Asking for help
3
What is data?
3.1
What is data
3.2
Different kinds of data
3.2.1
Numeric data
3.2.2
Categorical Data
3.2.3
Text data
3.2.4
Other data types
3.3
Data and information
3.4
Data integrity: How is data collected?
3.5
Data storage, privacy, ethics
3.6
Data frame
3.6.1
CSV-files
3.6.2
Limitations
3.7
Loading and exploring data
3.8
What is data science
4
Using R and RStudio for data analysis
4.1
Why is R useful?
4.2
How to think about data processing
4.3
The tidyverse world
4.3.1
Prepare for analysis
4.3.2
Basic data description
4.3.3
Working with individual variables
5
More about data processing: dplyr pipelines
5.1
What are pipelines?
5.2
Writing pipelines: split the tasks into small parts
5.3
The most important functions for pipelines
5.3.1
Select variables with select
5.3.2
Filter observations with filter
5.3.3
Order results by arrange
5.3.4
Compute with mutate
5.3.5
Aggregate with summarize
5.4
Combining functions into pipelines
5.5
Other functions
5.5.1
Selecting rows
5.6
Grouped operations
5.7
Debugging pipelines
6
Preliminary data analysis
6.1
Different variable types
6.1.1
Numeric variables
6.1.2
Character (string) variables
6.1.3
Categorical variables
6.1.4
Logical variables
6.2
Preliminary data analysis
6.2.1
Is this a reasonable dataset?
6.2.2
Are the relevant variables good?
6.3
Missing values
6.4
How good are the variables?
6.5
What is not in data
6.6
Sampling, documentation
7
Descriptive statistics
7.1
Descriptive statistics and inferential statistics
7.2
Basic properties of data: location, range, distribution
7.2.1
Location
7.2.2
Spread
7.2.3
Distribution
7.3
Some specific data types
7.3.1
Dummy variables and rates
8
Asking questions and answering with data
8.1
Interesting questions, important questions and answerable questions
8.2
General questions and answerable questions
8.3
Example: who survived the Titanic wreck?
8.3.1
The question
8.3.2
Preliminary analysis
8.4
Sea ice cover: an example analysis
9
Answering questions: how to write reports
9.1
Answering the questions: writing
9.2
Analysis report: Titanic example
10
Visualizing data
10.1
ggplot visualization framework
10.2
Basic plot types
10.2.1
Histogram
10.2.2
Scatterplot
10.2.3
Line plot
10.2.4
Barplot
10.2.5
Boxplot and violin plot
10.3
Grouping data on plots
11
How are values related
11.1
Visualizing trend lines
11.2
Strength of the relationship: correlation
11.2.1
What is correlation
11.2.2
Example: how similar are iris flower parts
11.2.3
Example: basketball score and time played
11.2.4
Caveats of correlation
11.3
Measuring the relationship: linear regression
11.3.1
Linear regression: an example
11.3.2
Intercept and slope
11.3.3
Linear regression: prediction
11.3.4
Linear regression: definition
11.3.5
Linear regression: iris flower example
11.3.6
Linear regression: playing time and score in basketball
11.3.7
Pitfalls: correlation versus causation
11.3.8
Pitfalls: regression to mean
11.4
Linear regression: categorical variables
11.4.1
Binary (dummy) variables
11.4.2
Analyzing the difference with linear regression
11.5
Appendix: how to process Harden data
12
Statistical Inference
12.1
Population and sample
12.2
Different ways of sampling data
12.2.1
Complete sample
12.2.2
Random sample
12.2.3
Stratified sample
12.2.4
Representative and biased sample
12.3
Example: election polls
12.4
Statistical hypotheses
12.5
Confidence intervals
12.6
Appendix: random numbers
13
Logistic regression
13.1
What is logistic regression and what is it good for?
13.1.1
Fix the draft text below
14
Predictive modeling
14.1
Prediction versus inference
14.2
Predicting linear regression
14.2.1
How to predict with linear regression
14.2.2
RMSE/R2: model goodness
14.3
Categorization: predicting categorical outcomes
14.3.1
Predicting logistic regression results
14.3.2
How good is categorization? Confusion matrix
14.3.3
Accuracy, precision, recall
14.4
Extrapolation: dangerous
14.5
Machine Learning: similar issues
15
Model goodness
Appendix
A
Introduction to Machine Learning
A.1
ML: making computers spot patterns
A.2
Decision Trees
A.3
Overfitting
B
Dataset Description
B.1
Benefits
B.2
HadCRUT
B.3
Heart attack
B.4
Ice extent
B.5
Iris
B.6
Ncbirths: births in North Carolina
B.7
Titanic
C
R Cheatsheet
C.1
RStudio keyboard shortcuts
C.2
Handling packages
C.3
Loading and creating data
C.4
Describing data
C.5
Selecting observations
C.6
Computing
C.7
dplyr
C.7.1
main functions
C.7.2
Comparison operators for filtering
C.8
Data cleaning and processing
C.8.1
Converting into different formats
C.8.2
Other
D
Exercise Solutions
D.1
R
D.1.1
Workspace variables
D.1.2
Packages
D.2
Data
D.2.1
What is data
D.2.2
Data Frame
D.2.3
Loading and exploring data
D.3
Using R and RStudio for data analysis
D.3.1
The tidyverse world
D.4
dplyr pipelines
D.4.1
Writing pipelines
D.4.2
The most important functions for pipelines
D.4.3
Grouped operations
D.5
Preliminary data analysis
D.5.1
Preliminary data analysis
D.6
Descriptive statistics
D.6.1
Basic properties: location, range, distribution
D.7
Questions and answers
D.7.1
General questions and answerable questions
D.8
Visualizing data
D.8.1
Basic plot types
D.9
Relationships
D.9.1
Visualizing trend lines
Chapter 15
Model goodness
TBD