EC 524, Winter 2022

Welcome to Economics 524 (424): Prediction and machine-learning in econometrics, taught by Ed Rubin and Andrew Dickinson.

Schedule

Lecture Monday and Wednesday, 10:00a–11:20a (Pacific), 220 Chapman

Lab Friday, 12:00p–12:50p (Pacific), 220 Chapman

Office hours

  • Ed Rubin Tuesdays, 3:30p–5:00p, Zoom
  • Andrew Dickinson Thursdays, 2:00p–3:00p, Zoom

Syllabus

Books

Required books

Suggested books

Lecture notes

000 – Overview (Why predict?)

  1. Why do we have a class on prediction?
  2. How is prediction (and how are its tools) different from causal inference?
  3. Motivating examples

Formats .html | .pdf | .Rmd

Readings Introduction in ISL

001 – Statistical learning foundations

  1. Why do we have a class on prediction?
  2. How is prediction (and how are its tools) different from causal inference?
  3. Motivating examples

Formats .html | .pdf | .Rmd

Readings

Supplements Unsupervised character recognition

002 – Model accuracy

  1. Model accuracy
  2. Loss for regression and classification
  3. The variance-bias tradeoff
  4. The Bayes classifier
  5. KNN

Formats .html | .pdf | .Rmd

Readings

  • ISL Ch2–Ch3
  • Optional: 100ML Preface and Ch1–Ch4
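
As a rough illustration of the KNN item in the outline above, here is a minimal Python sketch of a K-nearest-neighbors classifier. The course materials themselves work in R; the function name `knn_classify`, the Euclidean distance choice, and the toy data are assumptions made only for this example.

```python
from collections import Counter


def knn_classify(train, x_new, k=3):
    """Predict a label for x_new by majority vote among its k nearest neighbors.

    train: list of (features, label) pairs, where features is a tuple of numbers.
    Distance is squared Euclidean (an assumption; other metrics are possible).
    """
    def sq_dist(features):
        return sum((a - b) ** 2 for a, b in zip(features, x_new))

    # Sort training rows by distance to x_new and keep the k closest.
    nearest = sorted(train, key=lambda row: sq_dist(row[0]))[:k]
    votes = Counter(label for _, label in nearest)
    # On a tie, Counter.most_common returns the label that was counted first.
    return votes.most_common(1)[0][0]


# Hypothetical toy data: two well-separated clusters.
toy = [((0, 0), "a"), ((0, 1), "a"), ((1, 0), "a"),
       ((5, 5), "b"), ((5, 6), "b"), ((6, 5), "b")]
label_near_origin = knn_classify(toy, (0.2, 0.1), k=3)
```

Small values of `k` give a flexible (low-bias, high-variance) classifier; larger values smooth the decision boundary.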

003 – Resampling methods

  1. Review
  2. The validation-set approach
  3. Leave-one-out cross validation
  4. k-fold cross validation
  5. The bootstrap

Formats .html | .pdf | .Rmd

Readings

  • ISL Ch5
  • Optional: 100ML Ch5
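
The resampling ideas in this lecture can be sketched in a few lines of Python (the course itself uses R). This is only an illustration under stated assumptions: the helper names `k_fold_indices` and `cv_mse_mean_model` are hypothetical, and the "model" is deliberately trivial (it predicts the training mean) so that the fold logic stays visible.

```python
import random


def k_fold_indices(n, k, seed=0):
    """Shuffle the indices 0..n-1 and split them into k roughly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    # Fold i takes every k-th shuffled index, starting at position i.
    return [idx[i::k] for i in range(k)]


def cv_mse_mean_model(y, k=5, seed=0):
    """k-fold CV estimate of MSE for a model that predicts the training mean."""
    folds = k_fold_indices(len(y), k, seed)
    fold_mses = []
    for held_out in folds:
        held = set(held_out)
        # Train on every observation outside the held-out fold.
        train = [y[i] for i in range(len(y)) if i not in held]
        pred = sum(train) / len(train)
        # Evaluate only on the held-out fold.
        mse = sum((y[i] - pred) ** 2 for i in held_out) / len(held_out)
        fold_mses.append(mse)
    return sum(fold_mses) / k


# Leave-one-out cross validation is the special case k = n.
loocv_mse = cv_mse_mean_model([1.0, 2.0, 3.0], k=3)
```

Setting `k` equal to the number of observations gives leave-one-out cross validation; `k = 5` or `k = 10` are common, cheaper choices.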

004 – Linear regression strikes back

  1. Returning to linear regression
  2. Model performance and overfit
  3. Model selection—best subset and stepwise
  4. Selection criteria

In between: tidymodels-ing

005 – Shrinkage methods

(AKA: Penalized or regularized regression)

  1. Ridge regression
  2. Lasso
  3. Elasticnet
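
To make the effect of the penalty concrete, the following Python sketch computes ridge and lasso coefficients for the simplest possible case: one predictor, no intercept, with `x` and `y` assumed already centered. The objective scaling (residual sum of squares plus `lam * b**2` for ridge, plus `lam * abs(b)` for lasso) and the function names are assumptions chosen for this example; software packages often scale the penalty differently.

```python
def ridge_coef(x, y, lam):
    """Minimize sum((y - b*x)^2) + lam * b^2 for a single centered predictor."""
    sxx = sum(xi * xi for xi in x)
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    # The first-order condition gives b = sxy / (sxx + lam).
    return sxy / (sxx + lam)


def lasso_coef(x, y, lam):
    """Minimize sum((y - b*x)^2) + lam * |b| for a single centered predictor."""
    sxx = sum(xi * xi for xi in x)
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    # Soft-threshold sxy by lam / 2, then rescale by sxx.
    shrunk = max(abs(sxy) - lam / 2, 0.0)
    sign = 1.0 if sxy >= 0 else -1.0
    return sign * shrunk / sxx


# Hypothetical toy data with an OLS slope of 2.
x = [-1.0, 0.0, 1.0]
y = [-2.0, 0.0, 2.0]
ridge_path = [ridge_coef(x, y, lam) for lam in (0.0, 2.0)]
lasso_path = [lasso_coef(x, y, lam) for lam in (0.0, 4.0, 8.0)]
```

Ridge shrinks the coefficient toward zero without reaching it, while a large enough lasso penalty sets it to exactly zero, which is why lasso can also act as a variable-selection method.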

006 – Classification intro

  1. Introduction to classification
  2. Why not regression?
  3. But also: Logistic regression
  4. Assessment: Confusion matrix, assessment criteria, ROC, and AUC
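
The assessment item above can be illustrated with a small Python sketch (the course uses R) that tallies a binary confusion matrix and derives a few common criteria from it. The function names and the 0/1 label coding are assumptions for this example only.

```python
def confusion_counts(actual, predicted):
    """Count true/false positives and negatives for 0/1 labels."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for a, p in zip(actual, predicted):
        if a == 1 and p == 1:
            counts["tp"] += 1
        elif a == 0 and p == 1:
            counts["fp"] += 1
        elif a == 0 and p == 0:
            counts["tn"] += 1
        else:
            counts["fn"] += 1
    return counts


def assessment(counts):
    """Accuracy, sensitivity (true-positive rate), and specificity (true-negative rate)."""
    tp, fp, tn, fn = counts["tp"], counts["fp"], counts["tn"], counts["fn"]
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }


# Hypothetical labels and predictions.
actual = [1, 1, 0, 0, 1]
predicted = [1, 0, 0, 1, 1]
cm = confusion_counts(actual, predicted)
metrics = assessment(cm)
```

An ROC curve traces sensitivity against one minus specificity as the classification threshold varies, and AUC is the area under that curve.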

007 – Decision trees

  1. Introduction to trees
  2. Regression trees
  3. Classification trees—including the Gini index, entropy, and error rate
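
The three impurity measures named above can be computed for a single tree node as in the Python sketch below (the course uses R). The function name `node_impurity` is hypothetical, and the entropy here uses log base 2, which is an assumption; other texts use the natural log, which only rescales the measure.

```python
import math


def node_impurity(class_counts):
    """Gini index, entropy, and classification error rate for one node.

    class_counts: number of observations of each class in the node.
    """
    total = sum(class_counts)
    # Class proportions; empty classes are dropped since 0 * log(0) is taken as 0.
    props = [c / total for c in class_counts if c > 0]
    gini = 1.0 - sum(p * p for p in props)           # equals sum of p * (1 - p)
    entropy = -sum(p * math.log2(p) for p in props)  # log base 2 is an assumption
    error_rate = 1.0 - max(props)                    # share not in the majority class
    return {"gini": gini, "entropy": entropy, "error_rate": error_rate}


mixed_node = node_impurity([5, 5])   # evenly split between two classes
pure_node = node_impurity([10, 0])   # only one class present
```

All three measures are zero for a pure node and largest when the classes are evenly mixed; Gini and entropy are more sensitive to changes in node purity than the error rate.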

008 – Ensemble methods

  1. Introduction
  2. Bagging
  3. Random forests
  4. Boosting

009 – Support vector machines

  1. Hyperplanes and classification
  2. The maximal margin hyperplane/classifier
  3. The support vector classifier
  4. Support vector machines

Projects

Planned projects

000 Predicting sales price in housing data (Kaggle)

Help:

001 Validation and out-of-sample performance

002 Cross validation, penalized regression, and tidymodels

003 Nonlinear predictors

004 MNIST image classification

Class project

Outline of the project

Topic and group due by TBA.

Final project submission due by midnight on March 17th.

Lab notes

Approximate/planned topics…

000 – Workflow and cleaning

  1. General “best practices” for coding
  2. Working with RStudio
  3. The pipe (%>%)
  4. Cleaning and Kaggle follow up

Formats .html | .pdf | .Rmd

001 – Workflow and cleaning (continued)

  1. Finish previous lab on dplyr
  2. Working with projects
  3. Using dplyr and ggplot2 to make insightful visuals
  4. How to fix a coding error

Housing data download

Formats .html | .Rmd

002 – Validation

  1. Creating a training and validation data set from your observations dataframe in R
  2. Writing a function to iterate over multiple models to test and compare MSEs
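
The lab itself works with dataframes in R. As a language-neutral illustration of the two steps above, here is a minimal Python sketch that splits observations into training and validation sets and then loops over candidate models to compare validation MSEs. All names (`train_validation_split`, `fit_mean`, `fit_line`, `compare_models`), the 80/20 split, and the toy data are assumptions for this example.

```python
import random


def train_validation_split(rows, frac_train=0.8, seed=0):
    """Shuffle (x, y) rows and split them into training and validation lists."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * frac_train)
    return shuffled[:cut], shuffled[cut:]


def fit_mean(train):
    """Model 1: always predict the training mean of y."""
    mean_y = sum(y for _, y in train) / len(train)
    return lambda x: mean_y


def fit_line(train):
    """Model 2: simple linear regression of y on x via least squares."""
    n = len(train)
    xbar = sum(x for x, _ in train) / n
    ybar = sum(y for _, y in train) / n
    sxx = sum((x - xbar) ** 2 for x, _ in train)
    slope = sum((x - xbar) * (y - ybar) for x, y in train) / sxx
    intercept = ybar - slope * xbar
    return lambda x: intercept + slope * x


def compare_models(models, rows, frac_train=0.8, seed=0):
    """Fit each model on the training set and return its validation MSE."""
    train, valid = train_validation_split(rows, frac_train, seed)
    results = {}
    for name, fit in models.items():
        predict = fit(train)
        results[name] = sum((y - predict(x)) ** 2 for x, y in valid) / len(valid)
    return results


# Hypothetical data that lie exactly on the line y = 2x + 1.
rows = [(x, 2 * x + 1) for x in range(20)]
mses = compare_models({"mean": fit_mean, "line": fit_line}, rows)
```

Because every model is scored on the same held-out observations, the resulting MSEs are directly comparable, and the model with the lowest validation MSE is the natural candidate to carry forward.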

003 – Practice using tidymodels

  1. Cleaning data quickly and efficiently with tidymodels
  2. R-script used in the lab

004 – Ridge, Lasso and Elasticnet Regressions in tidymodels

  1. Ridge, Lasso and Elasticnet regressions in tidymodels from start to finish with a new dataset.
  2. Using the best model to predict on a test dataset.

005 – Forcing splits in tidymodels and penalized regression

  1. Combining pre-split data together and then defining a custom split
  2. Running a Ridge, Lasso or Elasticnet logistic regression in tidymodels using a fresh dataset.
  3. Predicting on test data with the fitted model and then viewing the confusion matrix.

Additional resources

R

Data Science

Spatial data

GitHub
