In this lab, you will learn how to use linear regression to build a predictive model of California house prices.
## Predictive modeling
If you took CDS 101 in a previous semester, you should quickly review the new material added this semester about using models for prediction:
- Video 1: https://www.youtube.com/watch?v=HwZwWkxqafs
- Video 2: https://www.youtube.com/watch?v=SOvBYm9wyuA
Modeling for prediction (commonly referred to as machine learning) has different objectives from modeling for understanding/explanation.
In machine learning, we are most interested in whether our model will accurately predict future data points, and so we use a different workflow than when we model for understanding.
## About the Data
The dataset has already been randomly divided into training and test sets for you (in the `train` and `test` variables, respectively).
Each row contains information about a district in California; each column contains data on the characteristics of housing in those districts.
| Variable | Description |
|---|---|
| `longitude` | the E-W coordinate of the center of the district |
| `latitude` | the N-S coordinate of the center of the district |
| `housing_median_age` | the median age of houses in the district |
| `total_rooms` | the total number of rooms of houses in the district |
| `total_bedrooms` | the total number of bedrooms of houses in the district |
| `population` | the population of the district |
| `households` | the number of households in the district |
| `median_income` | the median income of the district |
| `median_house_value` | response variable: the median value of houses in the district |
| `ocean_proximity` | categorical variable that indicates if a district is inland, on the oceanfront, in a bay, etc. |
## Exercises
### Install packages
You will need to install the `lmvar` package by running these three lines of code once in the RStudio console:

```r
install.packages("remotes")
require(remotes)
install_version("lmvar", version = "1.5.2", repos = "http://cran.us.r-project.org")
```

Note that the third line may prompt you to update other packages. We recommend selecting the option that doesn’t update anything, as the updates will take a while. (To do this you may need to type the number `3` in the Console when it prompts you for a selection.)
1. Take a look at the `train` dataset in RStudio (remember that we should never look at our test dataset, otherwise we risk biasing our model).

   First, visualize the geographic distribution of the data by creating a scatter plot of the `train` data with longitude on the x-axis and latitude on the y-axis. To get an idea of the density of districts in different parts of the state, set the `alpha` parameter to a low-ish value (0.1 to 0.2); a minimal sketch follows this exercise.

   You should see three rough clusters where districts are more dense:

   - A southern coastal region
   - A northern coastal region
   - A northern inland region

   The northern inland region is the Central Valley, which contains a number of cities including Sacramento and Fresno. What are the other two regions with a large density of districts?
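   Here is a minimal sketch of the kind of plot this exercise asks for, assuming the tidyverse is loaded; the alpha of 0.1 is just one reasonable choice from the suggested range:

   ```r
   library(tidyverse)

   # Scatter plot of district locations; a low alpha makes densely
   # packed regions show up as darker areas
   ggplot(train) +
     geom_point(aes(x = longitude, y = latitude), alpha = 0.1)
   ```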
2. Take your code from Exercise 1 and copy it into a new chunk. Add a parameter inside the `aes()` function to color the points by the `median_house_value` variable.

   Where do most of the high house values seem to be located in California?
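   A minimal sketch of the colored version, building on the Exercise 1 plot above:

   ```r
   # Same plot, now coloring each district by its median house value
   ggplot(train) +
     geom_point(aes(x = longitude, y = latitude, color = median_house_value),
                alpha = 0.2)
   ```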
3. Visualize the relationships between the response variable `median_house_value` and the remaining continuous explanatory variables in the `train` dataset (i.e. not `latitude`, `longitude`, or `ocean_proximity`). To do this:

   - Use the `pivot_longer()` function (or the `gather()` function) to collect the remaining continuous explanatory variables into name and value columns.
   - Pipe the gathered data to ggplot and create scatter plots with the value column that you just created on the x-axis and `median_house_value` on the y-axis.
   - `facet_wrap()` over the name column that you created with `pivot_longer()` or `gather()`. Use the `scales = "free"` parameter to allow the axis limits to vary between facets.

   A sketch of how these steps fit together follows this list.
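   One way the steps might combine, assuming the variable names from the data dictionary above (the `names_to`/`values_to` labels are one reasonable choice):

   ```r
   # Gather the continuous explanatory variables into name/value columns,
   # then draw one faceted scatter plot per variable
   train %>%
     pivot_longer(
       cols = c(housing_median_age, total_rooms, total_bedrooms,
                population, households, median_income),
       names_to = "name", values_to = "value"
     ) %>%
     ggplot() +
     geom_point(aes(x = value, y = median_house_value), alpha = 0.1) +
     facet_wrap(~ name, scales = "free")
   ```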
   Using your graph, answer the following questions:

   - Which variable has the most obvious relationship with `median_house_value`?
   - What value does the response variable go up to? Do you think this will cause a problem for making predictions with a linear model, and why?
4. Visualize the relationship between the response variable `median_house_value` and the categorical variable `ocean_proximity` by creating a box plot (the response variable should be on the y-axis, the categorical variable on the x-axis).

   Which category seems to have most of its districts distributed at low median house prices?
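   A minimal sketch of this box plot, using the mapping described above:

   ```r
   # Distribution of median house value within each ocean_proximity category
   ggplot(train) +
     geom_boxplot(aes(x = ocean_proximity, y = median_house_value))
   ```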
5. Using the `lm()` function, create a simple linear model where `median_house_value` is the response variable and `median_income` is the explanatory variable. You will also need to supply the arguments `y = TRUE` and `x = TRUE` to the model, as in this example:

   ```r
   model_1 <- lm(... ~ ..., data = ..., y = TRUE, x = TRUE)
   ```

   Remember to use the training dataset…

   Calculate the k-fold cross-validation error of this model using the `cv.lm()` function (from the `lmvar` package):

   ```r
   cv.lm(model_1, k = ...)
   ```

   Use `k = 5`.
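   Putting the two steps together, a sketch might look like this (fit on the training data only):

   ```r
   library(lmvar)

   # Simple one-variable model; y = TRUE and x = TRUE are required by cv.lm()
   model_1 <- lm(median_house_value ~ median_income, data = train,
                 y = TRUE, x = TRUE)

   # 5-fold cross-validation; the output includes the root mean squared error
   cv.lm(model_1, k = 5)
   ```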
   With a regression model, we typically calculate our error (i.e. our inaccuracy) with the root mean square error (RMSE, i.e. the square root of the mean of the squared residuals). What is the validation RMSE of this model?

6. Repeat the modeling process to create a new linear model, but this time use all the explanatory variables (categorical and continuous, including latitude and longitude).
   Hint: as a shorthand, you can write `.` instead of writing out all the explanatory variables, i.e. `lm(y ~ .)`.

   Calculate and report the cross-validation error as before. Which model performs best? A sketch of the second model follows.
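   A sketch of the second model, again with the arguments `cv.lm()` needs (`model_2` is just a hypothetical name; use your own):

   ```r
   # "." stands for every column in train other than the response
   model_2 <- lm(median_house_value ~ ., data = train, y = TRUE, x = TRUE)

   # Same 5-fold cross-validation as before, for a fair comparison
   cv.lm(model_2, k = 5)
   ```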
7. Using the better of the two models that you created in the previous two exercises, calculate its error at making predictions on the test dataset. (As a reminder, you should not have touched the `test` dataframe before this question.)

   To do this we can use the `rmse()` function from the `modelr` package. The syntax of `rmse()` is:

   ```r
   rmse(<MODEL>, <DATA>)
   ```

   Pass in the model that did best at cross-validation (either `model_1` or `model_2`, or whatever you called the model in Exercise 6) and the `test` data.

   What is the root mean square error of this model on the `test` data? Is that better or worse than the error in cross-validation (i.e. is your model more or less accurate on the `test` data)?
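   For example, if `model_2` did best in cross-validation (a hypothetical choice; substitute whichever of your models won), the final check is one line:

   ```r
   library(modelr)

   # Out-of-sample error: RMSE of the chosen model on the held-out test set
   rmse(model_2, test)
   ```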
That’s the end of this lab. However, we’ve only just scratched the surface of what we could do with this dataset. If you are interested, you might try creating new variables (such as the average number of rooms per house, rather than the total number of rooms in the district, or the number of bedrooms per person), and see if those allow you to create a more accurate model. Although we didn’t have time to go into it here, the process of combining variables to create new, more useful ones is called feature engineering, and it is a key part of any machine learning project.
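A hypothetical starting point for this kind of feature engineering (the new column names here are our own, not part of the assignment):

```r
# Derive per-household and per-person ratios from the raw district counts
train_fe <- train %>%
  mutate(
    rooms_per_household = total_rooms / households,
    bedrooms_per_person = total_bedrooms / population
  )
```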
## How to Submit
To submit your lab assignment, follow the two steps below. Your lab will be graded for credit after you’ve completed both steps!
1. Save, commit, and push your completed R Markdown file so that everything is synchronized to GitHub. If you do this right, then you will be able to view your completed file on the GitHub website.
2. Knit your R Markdown document to the PDF format, export (download) the PDF file from RStudio Server, and then upload it to the Lab 9 posting on Blackboard.
## Credits
This lab is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. Exercises and instructions written by Dominic White. Dataset originally from “Hands-On Machine Learning with Scikit-Learn and TensorFlow” by Aurélien Géron.