1. Introduction

In this project, we develop a spatial machine learning model to predict home prices in San Francisco’s housing market. Because existing housing market predictions are less accurate than we expected, this model aims to provide more comprehensive and accurate estimates of future home prices and to serve as a practical reference that helps home buyers make decisions.
Achieving an accurate prediction is difficult for several reasons. First, the original dataset is missing a large amount of data for several crucial variables, so we must impute those missing values properly to keep the dataset reliable. Second, the prediction needs external variables to improve its overall accuracy, and the external data may come at different scales. We therefore spend extra time on data collection and data wrangling to make the external data fit the scale of San Francisco. Third, when two predictors are highly correlated with each other, the prediction results are affected, so we also pay attention to predictor selection to keep the predictors independent and distinct.
The model includes both the internal characteristics of individual homes and the external characteristics of the city. We examine the correlation between the sale price and each predictor, and then combine the predictors into a single model that has the strongest overall relationship with price. The model is tested multiple times on data collected from sold houses in the San Francisco area. Once the fit is good and the errors are sufficiently small, we use the model to predict the prices of housing units currently on sale in the local housing market.
Based on the model, we found that internal characteristics largely determine the home price. The neighborhood of a home plays a minor role, and other external characteristics, such as crime rate and proximity to other facilities, usually have small impacts on price.

2. Data

2.1 Gathering data

We gathered data to assess four aspects of the houses: internal characteristics, demographic characteristics, availability of public services/amenities, and spatial structure.
For internal characteristics, the variables we used were mainly derived from the attribute table of the house sale data. For demographic characteristics, we created our variables from the 2015 5-year estimates of the American Community Survey, which we obtained from the U.S. Census Bureau website. For availability of public services, most of the data we used, including crime, school, and land use data, were downloaded from the SF open data portal. We also downloaded technology company location data from Connor Leech’s GitHub repository.

2.2 Feature engineering

To better predict the home price, we transformed the data we collected into useful features. Below is a brief introduction to how we created our features and the logic behind the process.

Spatial structure

Neighborhood: When buyers consider purchasing a house, the neighborhood is the most common scale at which they evaluate it. The neighborhood in which a house is located directly affects its price.

Internal characteristics

  • Sale year: Trends in the real estate market generally affect overall property values. We extract the sale year of each housing unit and assume that properties sold in the same year have similar prices. Also, a house that has been on sale for a long time without being sold may have a lower value, so we believe there is a negative relationship between the home price and sale year.

  • Age of house & built year missing: We assume the built year affects home prices, so we subtracted the built year from the current year (2015) to get the age of each house. If a housing unit has no built year, the incomplete record may indicate poor condition, so we assume a missing built year is negatively correlated with home price.

  • Lot area & lot area missing: Larger lots are usually associated with higher home values, so we assume lot area is a determining variable. A housing unit with a lot area of 0 may reflect other external factors, and we assume it has a negative impact on home price. During data processing, we replaced missing lot areas with the neighborhood median lot area.

  • Property area & property area missing: Houses with more interior space tend to have higher selling prices in the housing market. A housing unit with a property area of 0 may reflect other external factors, and we assume it has a negative impact on home price. During data processing, we replaced missing property areas with the neighborhood median property area.
  • No. rooms & bedrooms & bathrooms: A smaller house with more rooms can be worth more than a larger house without bedrooms and bathrooms, so the number of rooms correlates directly with home price. We classify the number of rooms into three categories. Because small differences in room count (1 or 2 rooms) may not lead to large differences in price, we believe the categories magnify the impact of room count on home price.
  • Missing rooms & bedrooms & bathrooms: If the value of room/bedroom/bathroom is zero in the dataset, the house may truly have no room/bedroom/bathroom, but the zero may also come from a bad record.
  • No. stories: Additional stories divide the house into more spaces and increase the property area, so we believe the number of stories has a positive relationship with home value. Because small differences in stories may not lead to large differences in price, we classify the number of stories into three categories.
  • Construction type: Homes with better construction types are usually worth more, so there should be some relationship between construction type and home price.
  • Zoning: Land use type relates to the property type of houses, so we believe there is a correlation between zoning and home price.
  • Seasons: Houses are listed for sale at different times of the year. Generally, the best time to buy a house is late summer or autumn, and during national holidays people may temporarily stop looking for houses. Thus, we assume the month of listing potentially affects the home price.
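
The internal-characteristic steps above (age from built year, missing-value flags, neighborhood-median imputation, and reclassification into three categories) can be sketched as follows. This is a minimal illustration rather than the code used in the project: the record keys, the function names, and the middle "3 Floors" category are assumptions made for the example, and only lot area is imputed to keep it short.

```python
from statistics import median

REFERENCE_YEAR = 2015  # the "current year" stated in the text


def engineer_internal_features(sales):
    """Return copies of the sale records with age, flags, and imputed lot area.

    `sales` is a list of dicts with hypothetical keys: 'neighborhood',
    'built_year' (None when missing) and 'lot_area' (0 when missing).
    """
    # Median lot area per neighborhood, using non-missing values only.
    by_hood = {}
    for s in sales:
        if s["lot_area"] > 0:
            by_hood.setdefault(s["neighborhood"], []).append(s["lot_area"])
    hood_median = {hood: median(vals) for hood, vals in by_hood.items()}

    out = []
    for s in sales:
        rec = dict(s)
        # Flag the missing value before it is overwritten by imputation.
        rec["built_year_missing"] = int(s["built_year"] is None)
        rec["lot_area_missing"] = int(s["lot_area"] == 0)
        if s["built_year"] is None:
            rec["age"] = None
        else:
            rec["age"] = REFERENCE_YEAR - s["built_year"]
        if rec["lot_area_missing"]:
            # Stays None if the whole neighborhood has no recorded lot area.
            rec["lot_area"] = hood_median.get(s["neighborhood"])
        out.append(rec)
    return out


def classify_stories(n_stories):
    """Collapse the story count into three categories (middle one assumed)."""
    if n_stories <= 2:
        return "Up to 2 Floors"
    if n_stories < 4:
        return "3 Floors"
    return "4+ Floors"
```

Keeping the missing-value flag as its own feature lets the model separate "imputed" records from observed ones, which matches the reasoning given for the missing-value variables.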

Demographic characteristics

  • No. families with at least $200,000 of annual income: This variable represents the highest income group, so we believe it is highly correlated with expensive housing units.
  • Pct. residents with a bachelor’s degree or higher: Well-educated residents tend to live in the same areas and share similar standards when they buy homes. If a neighborhood has a high percentage of well-educated residents, its homes may be worth more.
  • Pct. families below poverty level (log-transformed): People below the poverty level have low purchasing capacity, so homes in high-poverty areas tend to have lower values, and this variable can represent cheap housing units.
  • No. houses for sale: If the on-sale houses cluster together, they may share similar prices and other features. We believe this fine-scale spatial agglomeration would affect the housing value.

Amenities/public services

  • Distance to roads (log-transformed): According to city planning theory, close residential proximity to main roads can cause noise and safety problems, and those factors reduce home value. Thus, we calculate the distance from each housing unit to the nearest highway.
  • No. colleges within 1 mile: Students tend to live near campus, and 1 mile is a preferable walking distance. This high demand raises housing prices. We count the number of colleges within 1 mile of each house and assume a larger count relates to higher home prices.
  • No. technology companies within 1 mile: Because San Francisco is a technology hub, technology companies substantially increase its land value, which may further affect home prices. Also, because of local topography and poor transportation conditions, high-income programmers may prefer to live near their companies. We assume those local characteristics potentially raise the home price.
  • No. schools within 2 miles: Families with children tend to live near schools so that parents spend less time dropping off and picking up their kids. We therefore expect a positive relationship between schools and home prices. We drew a 2-mile-radius circle around each house and counted the number of schools within the circle. A larger number means more school options, which we assume raises the home price.
  • No. aggravated assaults within 1/4 mile: The number of crimes reflects the safety level and may correlate with home prices. We filtered the raw data to assault reports from 2012 to 2015 and then counted the number of crimes within a ¼-mile radius of each housing unit.
  • Distance to entertainment/retail land use: People care about whether it is convenient to reach other amenities, especially in a city with poor transportation conditions like San Francisco. If a house has high accessibility to the surrounding stores and facilities, it is worth more. There may be a correlation between the home price and the proximity to surrounding entertainment or retail facilities.
  • Distance to the nearest Whole Foods Market: Because Whole Foods Market is a high-end organic supermarket, we assume it is usually located in areas with high home prices.
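
The amenity features above reduce to two operations: counting points within a radius of each house and measuring the distance to the nearest point. A minimal sketch follows, assuming houses and amenities are (x, y) pairs in a projected coordinate system measured in feet. The unit, the function names, and the brute-force search are assumptions for illustration, not the project's actual implementation.

```python
import math

MILE_FT = 5280.0  # assumes projected coordinates measured in feet


def distance(a, b):
    """Euclidean distance between two (x, y) points in a projected CRS."""
    return math.hypot(a[0] - b[0], a[1] - b[1])


def count_within(house, points, radius):
    """Number of points (schools, colleges, assaults...) within `radius`."""
    return sum(1 for p in points if distance(house, p) <= radius)


def nearest_distance(house, points):
    """Distance from the house to the closest point (store, road vertex...)."""
    return min(distance(house, p) for p in points)


def log_nearest_distance(house, points):
    """Natural log of the nearest distance; undefined for a distance of 0."""
    return math.log(nearest_distance(house, points))
```

For example, `count_within(house, schools, 2 * MILE_FT)` would give the school count within 2 miles, and `count_within(house, assaults, MILE_FT / 4)` the assault count within a quarter mile. For a full city dataset, a spatial index would be faster than this pairwise loop.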

2.2.1 Summary statistics for all numerical variables

NAME OF VARIABLE | MEAN | MEDIAN | SD | MIN | MAX | FIRST QUARTILE | THIRD QUARTILE | CATEGORY
Age of the house | 82.11 | 87.00 | 25.95 | 0.00 | 146.00 | 68.00 | 103.00 | Internal characteristics
Lot area | 278009.95 | 257800.00 | 102115.39 | 0.00 | 1890500.00 | 235200.00 | 300000.00 | Internal characteristics
Property area | 1645.26 | 1494.00 | 771.77 | 187.00 | 24308.00 | 1160.00 | 1973.00 | Internal characteristics
Number of families with at least $200000 of annual income | 22.67 | 18.80 | 14.60 | 0.00 | 62.40 | 10.50 | 34.80 | Demographic characteristics
Percentage of residents with a bachelor’s degree or higher | 52.84 | 57.70 | 20.32 | 11.30 | 87.70 | 35.50 | 68.00 | Demographic characteristics
Log-transformed percentage of families below poverty level | 1.65 | 1.70 | 0.86 | 0.00 | 3.82 | 1.22 | 2.22 | Demographic characteristics
Log-transformed distance to closest road | 7.73 | 7.92 | 0.94 | 3.73 | 9.13 | 7.16 | 8.44 | Amenities/public services
Number of colleges within 1 mile | 1.73 | 1.00 | 2.58 | 1.00 | 29.00 | 1.00 | 1.00 | Amenities/public services
Number of schools within 2 miles | 102.78 | 100.00 | 41.28 | 24.00 | 216.00 | 67.00 | 133.00 | Amenities/public services
Number of technology companies within 1 mile | 8.46 | 1.00 | 35.52 | 1.00 | 481.00 | 1.00 | 1.00 | Amenities/public services
Number of aggravated assaults within 1/4 mile | 266.21 | 162.00 | 316.37 | 2.00 | 4653.00 | 87.00 | 310.00 | Amenities/public services
Distance to entertainment/retail land use | 701.43 | 600.14 | 499.45 | 6.68 | 4393.59 | 320.85 | 965.77 | Amenities/public services
Distance to the nearest Whole Foods Market | 6465.00 | 5564.77 | 3774.80 | 276.85 | 17473.18 | 3618.33 | 8533.08 | Amenities/public services

2.2.2 Summary statistics for all categorical variables

NAME OF VARIABLE | CATEGORY | MOST FREQUENT | LEAST FREQUENT
Whether lot area is missing | Internal characteristics | 0 | 1
Whether number of bedrooms is missing | Internal characteristics | 0 | 1
Number of bedrooms reclassified | Internal characteristics | 3 or 4 bedrooms | 5+ bedrooms
Whether number of bathrooms is missing | Internal characteristics | 0 | 1
Number of bathrooms reclassified | Internal characteristics | Up to 2 bathrooms | 5+ bathrooms
Whether property area is missing | Internal characteristics | 0 | 1
Whether number of rooms is missing | Internal characteristics | 0 | 1
Number of rooms reclassified | Internal characteristics | 5+ rooms | Up to 2 rooms
Whether built year is missing | Internal characteristics | 0 | 1
Number of stories reclassified | Internal characteristics | Up to 2 Floors | 4+ Floors
Construction type | Internal characteristics | D | S
Zoning | Internal characteristics | RH-1 | C-3-G
Sale year | Internal characteristics | 13 | 15
Sale date classified as seasons | Internal characteristics | Summer | Spring
Total number of houses for sale | Demographic characteristics | 0 | 11
Neighborhood | Spatial structure | Sunset/Parkside | Lincoln Park

2.3 Correlation Matrix

Here we present a correlation matrix that visualizes the correlations among our variables. For each pair of variables, a correlation coefficient is calculated. A correlation coefficient of 1 indicates a perfect positive linear correlation between the two variables, -1 indicates a perfect negative linear correlation, and 0 indicates no linear correlation.
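
As a reference for how each cell of such a matrix is obtained, the sketch below computes the Pearson correlation coefficient for every pair of columns. It is a plain-Python illustration under the assumption that Pearson's r is the coefficient used. The function names and the dictionary-of-lists input are hypothetical.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products of deviations, and the deviation norms.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def correlation_matrix(columns):
    """Nested dict of pairwise correlations for a {name: values} mapping."""
    names = list(columns)
    return {
        a: {b: pearson_r(columns[a], columns[b]) for b in names}
        for a in names
    }
```

Pairs of predictors with coefficients close to 1 or -1 are the ones to check during predictor selection, since they carry nearly the same information.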

2.4 Scatter plots of house price vs. independent variables

Below are four scatter plots that visualize the relationships between the house sale price and four independent variables: the number of colleges within 1 mile of the house, the number of technology companies within 1 mile of the house, the log-transformed distance between the house and the nearest road, and the percentage of residents with a bachelor’s degree or higher. As expected, the plots show a slightly positive correlation between the house price and these independent variables.

2.5 Mapping house prices in San Francisco

2.6 Mapping interesting predictors

3. Method

3.1 Overview

We aim to create an ordinary least squares (OLS) model that can accurately predict house values in San Francisco. To achieve this goal, we divided our feature-engineered data into two sets: training data (25%) and test data (75%). We trained our model using the training data and evaluated its performance using the test data. We ran a Moran’s I test to check for spatial autocorrelation in the errors of our model. To assess the generalizability of our model, we also ran a 100-fold cross-validation.

3.2 Validation

3.2.1 Generalizability

We ran a k-fold cross-validation (with k = 100) to test the generalizability of the model. With k = 100, the model is fit 100 times, and each time a different randomly selected subset of the original dataset is held out as the test data. The Mean Absolute Error (MAE) and Mean Absolute Percentage Error (MAPE) obtained from the 100 tests indicate the modeling accuracy. The standard deviation of the MAE gives a general sense of the generalizability of the model.
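
The procedure can be sketched as below. To stay self-contained, the example fits a single-predictor least-squares line instead of the full multi-variable model, so it illustrates the fold logic and the error metrics rather than reproducing our results. The function names and the `seed` parameter are assumptions.

```python
import random
from statistics import mean


def fit_simple_ols(xs, ys):
    """Closed-form least-squares intercept and slope for one predictor."""
    mean_x, mean_y = mean(xs), mean(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def k_fold_errors(xs, ys, k, seed=0):
    """Return per-fold MAE and MAPE lists from k-fold cross-validation."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    # Every observation lands in exactly one of the k folds.
    folds = [idx[i::k] for i in range(k)]
    maes, mapes = [], []
    for fold in folds:
        held_out = set(fold)
        train = [i for i in idx if i not in held_out]
        b0, b1 = fit_simple_ols([xs[i] for i in train], [ys[i] for i in train])
        abs_err = [abs(ys[i] - (b0 + b1 * xs[i])) for i in fold]
        maes.append(mean(abs_err))
        # MAPE as a proportion of the observed value (assumes ys are nonzero).
        mapes.append(mean(e / abs(ys[i]) for e, i in zip(abs_err, fold)))
    return maes, mapes
```

Taking `statistics.stdev(maes)` over the returned list gives the spread of MAE across folds. A small spread relative to the mean MAE suggests the model behaves similarly on different held-out subsets.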

3.2.2 Spatial Autocorrelation

We calculated Moran’s I to test for spatial autocorrelation in the model errors. It gives us a sense of whether nearby housing units tend to have similar errors. We also plotted the errors between the true prices and predicted prices to observe the generalizability of the model. In addition, we tested generalizability by comparing the MAE in high-income versus low-income census tracts, and in tracts where the majority household type is a married couple with children under 18 versus other tracts. The aim is to uncover otherwise unseen biases of the model across demographic groups.
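
For reference, Moran’s I for values x_1..x_n with spatial weights w_ij is I = (n / W) * sum_ij w_ij (x_i - mean)(x_j - mean) / sum_i (x_i - mean)^2, where W is the sum of all weights. The sketch below evaluates this on model residuals using binary weights (1 for pairs within a distance threshold, 0 otherwise). The weighting scheme, the threshold, and the function name are assumptions, since the weights used in the project are not specified here.

```python
import math


def morans_i(values, coords, threshold):
    """Global Moran's I with binary distance-band weights.

    `values` are the residuals, `coords` the matching (x, y) locations.
    Two observations are neighbors when their distance is <= threshold.
    """
    n = len(values)
    mean_v = sum(values) / n
    dev = [v - mean_v for v in values]

    cross_sum = 0.0   # sum of w_ij * dev_i * dev_j
    weight_sum = 0.0  # W, the sum of all weights
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            dx = coords[i][0] - coords[j][0]
            dy = coords[i][1] - coords[j][1]
            if math.hypot(dx, dy) <= threshold:
                cross_sum += dev[i] * dev[j]
                weight_sum += 1.0

    return (n / weight_sum) * (cross_sum / sum(d * d for d in dev))
```

Values near 0 suggest spatially random errors, while clearly positive values mean that nearby houses have similar errors. Judging significance would additionally need a permutation test, which this sketch omits.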

4. Results

4.1 Model results for training data

RSE | R-SQUARED | ADJUSTED R-SQUARED | F STATISTIC
0.2606917 | 0.7659334 | 0.7614718 | 171.673

Regression results for numerical variables

TERM | ESTIMATE | STD.ERROR | STATISTIC | P.VALUE
prop_as | 0.0001515 | 0.0000056 | 27.0821047 | 0.0000000
lot_as | 0.0000004 | 0.0000000 | 11.0727564 | 0.0000000
num_school_2mi | 0.0023296 | 0.0002572 | 9.0559018 | 0.0000000
num_houseintract..Families..Estimate…200.000.or.more. | 0.0046382 | 0.0005822 | 7.9668595 | 0.0000000
num_college_1mi | -0.0218460 | 0.0043353 | -5.0390548 | 0.0000005
num_houseintract..Percent..Estimate..Percent.bachelor.s.degree.or.higher. | 0.0028334 | 0.0005794 | 4.8905052 | 0.0000010
wholefoods_nn1 | -0.0000103 | 0.0000022 | -4.6992732 | 0.0000027
num_assaults_1_4mi | -0.0000938 | 0.0000204 | -4.5891150 | 0.0000045
num_tech_1mi | 0.0014896 | 0.0004105 | 3.6288868 | 0.0002867
num_houseintract..All.families….Percent.below.poverty.level..Estimate..Families._log | -0.0141474 | 0.0063207 | -2.2382646 | 0.0252350
min_dis_entertain | 0.0000162 | 0.0000078 | 2.0774540 | 0.0377959
Age | -0.0000122 | 0.0001455 | -0.0836384 | 0.9333463
Note: This is the summary of regression results for the numerical variables. The variables are sorted by p-value.

4.2 Evaluating predictions for test data

MAE | MAPE | R.SQUARED
218359.1 | 0.1974364 | 0.7659334

4.3 Plotting MSE for cross-validation outcome

4.4 Plotting observed vs. predicted house price

4.5 Mapping residuals

4.6 Moran’s I

4.7 Predicted house prices for all houses in dataset

4.8 MAPE by neighborhood

4.9 MAPE by neighborhood vs. mean house price by neighborhood

4.10 Generalizability: MAPE vs. demographic data

Marriage

Mean Absolute Error of test set sales by marriage context
Married.Household | Other
0.1810294 | 0.2347259

Income

Mean Absolute Error of test set sales by neighborhood income context
High.Income | Low.Income
0.2186747 | 0.173428

5. Discussion

5.1 Effectiveness

The model explains about 77% of the variance in house price. Overall, the results of the cross-validation show that our model has a reasonable level of generalizability.

5.2 Noticeable variables

Based on the value of R-squared, the education-related variables play important roles in predicting the house price. The percentage of individuals with a bachelor’s degree or higher and the proximity to schools and colleges are the most significant external variables in our predictions. Moreover, we considered San Francisco’s local character as a tech hub and used the locations of technology companies as a predictor. However, it contributed only a little to the predictive power of the model. Although its effect did not match our expectations, we still think it is an interesting variable that will be useful after more feature engineering.
Many external variables were used in the model, but in the end we found that the internal characteristics were the most decisive variables for home prices. Variables such as zoning and the number of rooms, bedrooms, and bathrooms largely increased the accuracy of the model. The neighborhood is another variable that largely affects the results, because houses within the same neighborhood usually have more similar prices.

5.3 Accuracy and predicted results

Based on the distribution of errors, the points with lower error were distributed across the map and were especially clustered in the north. The errors occurred mostly among the predictions for high-priced houses, and a few median-priced houses were underpredicted. The final map of predicted results showed the overall distribution of San Francisco home prices. The most expensive houses were mainly located in the northeastern area of San Francisco as well as the central city, and prices decline as one moves outward from these areas. The low-priced houses clustered in the south and southeast of San Francisco. The model predicted particularly well for houses in the south and the west: it predicted the majority of low-priced houses successfully and the small number of high-priced houses reasonably. Conversely, the model predicted relatively poorly in northeastern San Francisco, where many high-priced houses were over- or underpredicted and a few median-priced houses were underpredicted.

5.4 Generalizability

According to the map of predicted values and the results of the Moran’s I test, we found spatial autocorrelation in our prediction results. In this case, it means that prediction errors are similar among nearby houses. The model would perform well in predicting the large-scale price distribution, but it would not be very effective for predicting home prices at a finer scale. The distribution of MAE was relatively spread out, which means the generalizability of the model was not high enough.
The MAE for high-income versus low-income census tracts, and for tracts where the majority household type is a married couple with children under 18 versus other tracts, also reveals biases in the model. The model performed better in low-income-dominated census tracts and in census tracts that contain more married couples with children. In other words, the model is biased toward these two demographic groups, so its generalizability is not very high.
Besides showing the bias, these two results are also consistent with our prediction results for housing prices: the model predicts low-priced houses better, and the education-related variables carry more weight in the model.

6. Conclusion

We would like to recommend our model to Zillow. The model has a high R-squared value, indicating its capacity to explain a high percentage of variance within the test data. It also accounts for some spatial autocorrelation among the house prices.
As mentioned in the previous section, our model failed to account for all spatial autocorrelation. To improve the model, we need variables at a finer scale than the neighborhood. In addition, we found that many independent variables are not linearly related to the home price, so OLS may not be the best method for housing price prediction.