# EC 524, Winter 2022

Welcome to Economics 524 (424): Prediction and machine-learning in econometrics, taught by Ed Rubin and Andrew Dickinson.

## Schedule

**Lecture** Monday and Wednesday, 10:00a–11:20a (Pacific), 220 Chapman

**Lab** Friday, 12:00p–12:50p (Pacific), 220 Chapman

**Office hours**

## Syllabus

## Books

### Required books

### Suggested books

- R for Data Science
- Introduction to Data Science (not available without purchase)
- The Elements of Statistical Learning
- Data Science for Public Policy (not available without purchase)

## Lecture notes

**000 – Overview**

- Why do we have a class on prediction?
- How is prediction (and how are its tools) different from causal inference?
- Motivating examples

**Readings** Introduction in *ISL*

**001 – Statistical learning foundations**

- Why do we have a class on prediction?
- How is prediction (and how are its tools) different from causal inference?
- Motivating examples

**Readings**

- Prediction Policy Problems by Kleinberg *et al.* (2015)
- *ISL* Ch1
- *ISL* Start Ch2

**Supplements** Unsupervised character recognition

**002 – Model accuracy**

- Model accuracy
- Loss for regression and classification
- The variance-bias tradeoff
- The Bayes classifier
- KNN

**Readings**

- *ISL* Ch2–Ch3
- *Optional:* *100ML* Preface and Ch1–Ch4

**003 – Resampling methods**

- Review
- The validation-set approach
- Leave-one-out cross validation
- k-fold cross validation
- The bootstrap

**Readings**

- *ISL* Ch5
- *Optional:* *100ML* Ch5
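
The resampling ideas above can be sketched in a few lines of base R. A minimal 5-fold cross-validation example, using the built-in `mtcars` data and a simple linear model purely as stand-ins:

```r
# 5-fold cross validation by hand in base R.
# mtcars and lm() are stand-ins for whatever data/model you use.
set.seed(123)
k <- 5
n <- nrow(mtcars)
folds <- sample(rep(1:k, length.out = n))  # randomly assign each row a fold
cv_mse <- sapply(1:k, function(i) {
  train <- mtcars[folds != i, ]
  test  <- mtcars[folds == i, ]
  fit <- lm(mpg ~ wt + hp, data = train)
  mean((test$mpg - predict(fit, newdata = test))^2)  # fold-i validation MSE
})
mean(cv_mse)  # cross-validated estimate of the test MSE
```

Setting `k <- n` would give leave-one-out cross validation as a special case.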

**004 – Linear regression strikes back**

- Returning to linear regression
- Model performance and overfit
- Model selection—best subset and stepwise
- Selection criteria

**In between: tidymodels-ing**

- An introduction to preprocessing with `tidymodels` (Kaggle notebook)
- An introduction to modeling with `tidymodels` (Kaggle notebook)
- An introduction to resampling, model tuning, and workflows with `tidymodels` (Kaggle notebook)
- Introduction to `tidymodels`: Follow-up for Kaggle

**005 – Shrinkage methods**

(AKA: Penalized or regularized regression)

- Ridge regression
- Lasso
- Elasticnet
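
All three shrinkage methods can be sketched with the `glmnet` package (an assumption here, though it is a common choice and one `tidymodels` can wrap); the `alpha` mixing parameter moves between ridge and lasso:

```r
# Sketch: ridge, lasso, and elasticnet with cross-validated lambda,
# using mtcars as a toy dataset.
library(glmnet)
x <- model.matrix(mpg ~ ., data = mtcars)[, -1]  # predictor matrix, drop intercept
y <- mtcars$mpg
ridge <- cv.glmnet(x, y, alpha = 0)    # alpha = 0: ridge
lasso <- cv.glmnet(x, y, alpha = 1)    # alpha = 1: lasso
enet  <- cv.glmnet(x, y, alpha = 0.5)  # 0 < alpha < 1: elasticnet
coef(lasso, s = "lambda.min")          # coefficients at the CV-chosen lambda
```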

**006 – Classification intro**

- Introduction to classification
- Why not regression?
- But also: Logistic regression
- Assessment: Confusion matrix, assessment criteria, ROC, and AUC
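
A minimal sketch of the classification workflow in base R, with a made-up binary outcome (`high_mpg`) standing in for real labels:

```r
# Logistic regression plus a confusion matrix in base R.
# high_mpg is a toy outcome invented for illustration.
mtcars$high_mpg <- as.numeric(mtcars$mpg > 20)
fit  <- glm(high_mpg ~ wt + hp, data = mtcars, family = binomial)
phat <- predict(fit, type = "response")  # predicted probabilities
pred <- as.numeric(phat > 0.5)           # classify at a 0.5 threshold
table(predicted = pred, actual = mtcars$high_mpg)  # confusion matrix
```

Varying the 0.5 threshold and recomputing the table traces out the ROC curve.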

**007 – Decision trees**

- Introduction to trees
- Regression trees
- Classification trees—including the Gini index, entropy, and error rate

**008 – Ensemble methods**

- Introduction
- Bagging
- Random forests
- Boosting

**009 – Support vector machines**

- Hyperplanes and classification
- The maximal margin hyperplane/classifier
- The support vector classifier
- Support vector machines

## Projects

Planned projects

**000** Predicting sales price in housing data (Kaggle)

**Help:**

- A simple example/walkthrough
- Kaggle notebooks (from Connor Lennon)

**001** Validation and out-of-sample performance

**002** Cross validation, penalized regression, and `tidymodels`

**003** Nonlinear predictors

**004** MNIST image classification

## Class project

**Topic and group due by** **TBA**.

**Final project submission due by midnight on March 17th.**

## Lab notes

Approximate/planned topics…

**000 – Workflow and cleaning**

- General “best practices” for coding
- Working with RStudio
- The pipe (`%>%`)
- Cleaning and Kaggle follow up
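
A quick illustration of the pipe, assuming `dplyr` is installed (`mtcars` is just a built-in example dataset):

```r
# The pipe passes the left-hand result as the first argument
# of the right-hand function, so steps read top to bottom.
library(dplyr)
mtcars %>%
  filter(cyl == 4) %>%               # keep four-cylinder cars
  summarize(mean_mpg = mean(mpg))    # average mpg among them
```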

**001 – Workflow and cleaning** (continued)

- Finish previous lab on `dplyr`
- Working with projects
- Using `dplyr` and `ggplot2` to make insightful visuals
- How to fix a coding error

Housing data download

**002 – Validation**

- Creating a training and validation data set from your observations dataframe in R
- Writing a function to iterate over multiple models to test and compare MSEs
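
The two steps above can be sketched in base R, again with `mtcars` as a stand-in for the housing data:

```r
# Minimal sketch: an 80/20 train/validation split,
# then a validation-set MSE for one candidate model.
set.seed(123)
n       <- nrow(mtcars)
train_i <- sample(seq_len(n), size = floor(0.8 * n))
train   <- mtcars[train_i, ]
valid   <- mtcars[-train_i, ]
fit <- lm(mpg ~ wt + hp, data = train)  # fit on training data only
mse <- mean((valid$mpg - predict(fit, newdata = valid))^2)
mse
```

Wrapping the last two lines in a function of the model formula lets you loop over candidate models and compare their MSEs.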

**003 – Practice using tidymodels**

- Cleaning data quickly and efficiently with `tidymodels`
- R script used in the lab

**004 – Ridge, Lasso and Elasticnet Regressions in tidymodels**

- Ridge, Lasso, and Elasticnet regressions in `tidymodels` from start to finish with a new dataset
- Using the best model to then predict onto a test dataset

**005 – Forcing splits in tidymodels and penalized regression**

- Combining pre-split data together and then defining a custom split
- Running a Ridge, Lasso, or Elasticnet logistic regression in `tidymodels` using a fresh dataset
- Predicting the model onto test data and then viewing the confusion matrix

## Additional resources

### R

- RStudio’s recommendations for learning R, plus cheatsheets, books, and tutorials
- YaRrr! The Pirate’s Guide to R (free online)
- UO library resources/workshops
- Eugene R Users

### Data Science

- Python Data Science Handbook by Jake VanderPlas
- Elements of AI
- Caltech professor Yaser Abu-Mostafa: Lectures about machine learning on YouTube
- From Google

### Spatial data

- Geocomputation with R (free online)
- Spatial Data Science (free online)
- Applied Spatial Data Analysis with R