In this lab, you will learn how to use linear regression to build a predictive model of California house prices.

Predictive modeling

If you took CDS 101 in a previous semester, you should quickly review the new material added this semester about using models for prediction:

Modeling for prediction (commonly referred to as machine learning) has different objectives from modeling for understanding/explanation.

In machine learning, we are most interested in whether our model will accurately predict future data points, and so we use a different workflow than when modeling for understanding.

About the Data

The dataset has already been randomly divided into training and test sets for you (in the train and test variables respectively).

Each row contains information about a district in California; each column contains data on the characteristics of housing in those districts.

• longitude: the east-west coordinate of the center of the district
• latitude: the north-south coordinate of the center of the district
• housing_median_age: the median age of the houses in the district
• total_rooms: the total number of rooms across all houses in the district
• total_bedrooms: the total number of bedrooms across all houses in the district
• population: the population of the district
• households: the number of households in the district
• median_income: the median income of the district
• median_house_value: the response variable; the median value of the houses in the district
• ocean_proximity: a categorical variable that indicates if a district is inland, on the oceanfront, in a bay, etc.

Exercises

Install packages

You will need to install the lmvar package by running the following code once in the RStudio console:

# Install the remotes package, which lets us install a specific version of another package
install.packages("remotes")
library(remotes)

# Install version 1.5.2 of the lmvar package
install_version("lmvar", version = "1.5.2", repos = "http://cran.us.r-project.org")

Note that the install_version command may prompt you to update other packages. We recommend selecting the option that doesn’t update anything, as the updates will take a while. (To do this you may need to type the number 3 in the Console when it prompts you for a selection.)

  1. Take a look at the train dataset in RStudio (remember that we should never look at our test dataset, as otherwise we risk biasing our model).

    First, visualize the geographic distribution of the data by creating a scatter plot of the train data with longitude on the x-axis and latitude on the y-axis.

    To get an idea of the density of districts in different parts of the state, set the alpha parameter to a low value (around 0.1 to 0.2).
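    A minimal sketch of this plot (assuming the tidyverse is loaded, so that ggplot2 is available) might look like:

     ggplot(data = train) +
       geom_point(mapping = aes(x = longitude, y = latitude), alpha = 0.1)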

    You should see three rough clusters where districts are denser:

    • A southern coastal region

    • A northern coastal region

    • A northern inland region

    The northern inland region is the Central Valley, which contains a number of cities including Sacramento and Fresno.

    What are the other two regions with a large density of districts?

  2. Take your code from Exercise 1, and copy it into a new chunk. Add a parameter inside the aes function to color the points by the median_house_value variable.
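    For example, building on the sketch from Exercise 1 (the exact structure of the aes call is an assumption about how you wrote yours):

     ggplot(data = train) +
       geom_point(mapping = aes(x = longitude, y = latitude, color = median_house_value), alpha = 0.1)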

    Where do most of the high house values seem to be located in California?

  3. Visualize the relationships between the response variable median_house_value and the remaining continuous explanatory variables in the train dataset (i.e. excluding latitude, longitude, and the categorical ocean_proximity).

    To do this:

    • Use the pivot_longer() function (or the older gather() function) to collect the remaining continuous explanatory variables into name and value columns.

    • Pipe the reshaped data to ggplot and create scatter plots with the value column that you just created on the x-axis and median_house_value on the y-axis.

    • facet_wrap over the name column that you created with pivot_longer() or gather(). Use the scales="free" parameter to allow the axis limits to vary between facets.
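    Putting these steps together, one possible sketch (assuming the tidyverse is loaded) is:

     train %>%
       # drop the variables we are not plotting against the response
       select(-longitude, -latitude, -ocean_proximity) %>%
       # collect the remaining explanatory variables into name/value columns
       pivot_longer(cols = -median_house_value) %>%
       ggplot(mapping = aes(x = value, y = median_house_value)) +
       geom_point(alpha = 0.1) +
       facet_wrap(~ name, scales = "free")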

    Using your graph, answer the following questions:

    1. Which variable has the most obvious relationship with median_house_value?

    2. What is the maximum value that the response variable reaches? Do you think this will cause a problem for making predictions with a linear model, and why?

  4. Visualize the relationship between the response variable median_house_value and the categorical variable ocean_proximity by creating a box plot (the response variable should be on the y-axis, the categorical variable on the x-axis).
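    For example (a minimal sketch):

     ggplot(data = train) +
       geom_boxplot(mapping = aes(x = ocean_proximity, y = median_house_value))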

    Which category seems to have most of its districts distributed at low median house prices?

  5. Using the lm function, create a simple linear model where median_house_value is the response variable and median_income is the explanatory variable. You will also need to supply the arguments y = TRUE and x = TRUE, which make the fitted model store its data (cv.lm requires this), as in this example:

     model_1 <- lm(... ~ ..., data = ..., y = TRUE, x = TRUE)

    Remember to use the training dataset…

    Calculate the k-fold cross validation error of this model using the cv.lm function (from the lmvar package):

     cv.lm(model_1, k = ...)

    Use k = 5. With a regression model, we typically calculate our error (i.e. our inaccuracy) with the root mean square error (RMSE, i.e. the square root of the mean of the squared residuals). What is the validation RMSE of this model?
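    Filled in, the calls might look like this (a sketch; printing the object returned by cv.lm displays the cross-validation errors, including the RMSE):

     model_1 <- lm(median_house_value ~ median_income, data = train,
                   y = TRUE, x = TRUE)
     cv_1 <- cv.lm(model_1, k = 5)
     cv_1  # the printed output reports the root mean squared error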

  6. Repeat the modeling process to create a new linear model, but this time use all the explanatory variables (categorical and continuous, including latitude and longitude).

    Hint

    As a shorthand, you can write . instead of writing out all the explanatory variables, i.e. lm(y ~ .)
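    For example (a sketch, assuming you call this model model_2):

     model_2 <- lm(median_house_value ~ ., data = train, y = TRUE, x = TRUE)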

    Calculate and report the cross-validation error as before.

    Which model performs best?

  7. Using the best of the two models that you created in the previous two exercises, calculate its error at making predictions on the test dataset. (As a reminder, you should not have touched the test dataframe before this question.)

    To do this, we can use the rmse function from the modelr package.

    The syntax of rmse is:

     rmse(<MODEL>, <DATA>)

    Pass in the model that did best at cross validation (either model_1 or model_2 [or whatever you called the model in Exercise 6]) and the test data.
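    For example, if model_2 had the lower cross-validation error (an assumption; substitute whichever of your models did best):

     library(modelr)
     rmse(model_2, test)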

    What is the root mean square error of this model on the test data? Is that better or worse than the error in cross-validation (i.e. is your model more or less accurate on the test data)?

That’s the end of this lab. However, we’ve only just scratched the surface of what we could do with this dataset. If you’re interested, you might try creating new variables (such as the average number of rooms per house, rather than the total number of rooms in the district, or the number of bedrooms per person), and see if those allow you to create a more accurate model. Although we didn’t have time to go into it here, the process of combining variables to create new, more useful ones is called feature engineering, and it is a key part of any machine learning project.
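As an illustration, hypothetical engineered variables (these column names are our own, not part of the lab) could be created with mutate():

train %>%
  mutate(
    rooms_per_household = total_rooms / households,    # average rooms per house
    bedrooms_per_person = total_bedrooms / population  # bedrooms per person
  )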

How to Submit

To submit your lab assignment, follow the two steps below. Your lab will be graded for credit after you’ve completed both steps!

Credits

This lab is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. Exercises and instructions written by Dominic White. Dataset originally from “Hands-On Machine Learning with Scikit-Learn and TensorFlow” by Aurélien Géron.