Building a Spatial Econometrics Model

Abstract

Before beginning, an overview of some of the literature on spatial autocorrelation in real estate prices suggests that building quality, accessibility, and neighborhood amenities are correlated with house value in statistical studies (see Dubin 1998; Malpezzi 2003; Tu et al. 2004; and Ismail 2006). According to Malpezzi (2003), the traditional hedonic model for rent is R = f(S, N, L, C, T), where rent (R) is a function of structural characteristics (S), neighborhood characteristics (N), location within the market (L), contract conditions (C), and the time of rental (T). I used variables similar to N, such as Walkability and Transit indexes, and found that they did not significantly correlate with housing value. Square footage, rental prices in renter-occupied homes, race, and areal educational attainment did appear significant in exploratory modeling, although those findings were contested by some follow-on results. More tuning is necessary to affirm these results.

Dependent Variable and Independent Variable #1

Given a random set of houses (below), how can we attempt to predict housing value? What variables can we rely on? Here is my journey…

I first create the neighborhood and weights matrix so that I can test for spatial autocorrelation with a Monte Carlo simulation of Moran’s I and see whether the total values are spatially clustered. It appears that they are only somewhat clustered, with a Moran’s I statistic of 0.43 (p-value = 0.001).

## 
##  Monte-Carlo simulation of Moran I
## 
## data:  Data$TOTALVAL 
## weights: Data_W  
## number of simulations + 1: 1000 
## 
## statistic = 0.4336, observed rank = 1000, p-value = 0.001
## alternative hypothesis: greater
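
To make the test above a little less of a black box, here is a minimal sketch of a permutation-based Moran's I in plain Python. It is not the R routine that produced the output; the function names (`morans_i`, `moran_mc`) and the ten-cell toy "street" are assumptions for illustration, and the weights are assumed to be row-standardized with every observation having at least one neighbor.

```python
import random


def morans_i(values, neighbors):
    """Moran's I with row-standardized weights.

    neighbors[i] lists the indices adjacent to observation i.
    Assumes every observation has at least one neighbor.
    """
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    num = 0.0
    for i, nbrs in enumerate(neighbors):
        w = 1.0 / len(nbrs)  # row-standardized: each row of weights sums to 1
        num += sum(w * dev[i] * dev[j] for j in nbrs)
    den = sum(d * d for d in dev)
    # The general formula is (n / S0) * num / den; with row-standardized
    # weights the sum of all weights S0 equals n, so the factor cancels.
    return num / den


def moran_mc(values, neighbors, nsim=999, seed=0):
    """Permutation test: shuffle values over locations and compare."""
    rng = random.Random(seed)
    observed = morans_i(values, neighbors)
    shuffled = list(values)
    at_least = 0
    for _ in range(nsim):
        rng.shuffle(shuffled)
        if morans_i(shuffled, neighbors) >= observed:
            at_least += 1
    # One-sided ("greater") pseudo p-value, counting the observed value itself
    p_value = (at_least + 1) / (nsim + 1)
    return observed, p_value


# Toy data: ten houses along one street, values rising steadily (clustered)
values = [float(v) for v in range(1, 11)]
neighbors = [[j for j in (i - 1, i + 1) if 0 <= j < 10] for i in range(10)]
observed, p_value = moran_mc(values, neighbors)
print(observed, p_value)
```

With 999 simulations the smallest possible pseudo p-value is 1/1000 = 0.001, which is what an observed rank of 1000 out of 1000 means in the output above.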

Year Built

Here I am simply loading in a variable extant within the dataset, “Year built,” and subtracting it from the current year in order to get an age profile.
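
As a quick sketch of that subtraction (in Python rather than the R used for the analysis), with the reference year passed in explicitly so the result does not change depending on when the code is run. The function name and the sample years are made up for illustration.

```python
def building_age(year_built, reference_year):
    """Age of a structure in years as of reference_year."""
    return reference_year - year_built


# Hypothetical construction years; 2019 is an assumed reference year
ages = [building_age(y, reference_year=2019) for y in (1925, 1978, 2004)]
print(ages)
```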

Variable 2

I decided to grab some census data on how many housing units in each tract are owner-occupied. I pulled this data “over” into the Data df, so that the variables would line up in order to create a model in the future. Does owner-occupied housing impact value? We’ll find out…

## Getting data from the 2013-2017 5-year ACS
## Downloading feature geometry from the Census website.  To cache shapefiles for use in future sessions, set `options(tigris_use_cache = TRUE)`.

Variable 3: PhD status

I started this one by grabbing data on higher education, thinking a house occupied by a PhD might have higher value. It doesn’t really. But, I created a spatial lag with Queen’s contiguity, and found that, while a home with a PhD might not be significant, living in a neighborhood with PhDs actually is. Interesting…
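
The idea behind the lag can be sketched on a toy grid. This is an illustration under loud assumptions: real tracts are irregular polygons, not grid cells, and the helper names (`queen_neighbors`, `spatial_lag`) and the sample values are hypothetical. Under Queen's contiguity a cell's neighbors are every cell that shares an edge or a corner with it, and the row-standardized spatial lag is simply the average of the neighbors' values.

```python
def queen_neighbors(rows, cols):
    """Map each grid cell to the cells sharing an edge or a corner with it."""
    nbrs = {}
    for r in range(rows):
        for c in range(cols):
            nbrs[r * cols + c] = [
                rr * cols + cc
                for rr in range(r - 1, r + 2)
                for cc in range(c - 1, c + 2)
                if (rr, cc) != (r, c) and 0 <= rr < rows and 0 <= cc < cols
            ]
    return nbrs


def spatial_lag(values, nbrs):
    """Row-standardized lag: mean of each cell's neighboring values."""
    return [
        sum(values[j] for j in nbrs[i]) / len(nbrs[i])
        for i in range(len(values))
    ]


# Hypothetical 3x3 grid of tracts with a made-up education measure
phd_share = [1.0, 2.0, 3.0,
             4.0, 5.0, 6.0,
             7.0, 8.0, 9.0]
nbrs = queen_neighbors(3, 3)
lag_edu = spatial_lag(phd_share, nbrs)
print(lag_edu[4])  # lag for the center cell, averaged over all 8 neighbors
```

The lagged variable describes the surroundings rather than the tract itself, which is why it can be significant even when the tract's own value is not.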

Variable 4: Whiteness

I noticed that some of the data we encountered suggested racial biases implicit in economic systems, so I dug back into the census to pull the white population. Does being white have an impact on the value of your house? Actually, it depends on how you think about it, but I did find that areas with more white residents have higher-value homes according to our data frame. This surely indicates structural inequalities in our economic system, expressed spatially.

Variable 5 (iffy): Walkability Index from EPA

This is data pulled from the EPA’s walkability index. I had to churn through a few variables before I found the one that mattered most. I wonder if the EPA intentionally took mixed income into account and is actually attempting to address structural inequalities in their data (i.e., if they simply have an index that privileges wealthy people, it will only reinforce systemic redlining). The only problem with this is that if wealthy people decide to move to mixed-income neighborhoods based on the EPA’s more diffuse walkability scores, it could help drive gentrification. The variable that correlates most with housing value is the “mix of employment types in a block group (such as retail, office or industrial). Higher values correlate with more walk trips.” The higher correlation for mix of employment types may be due to “location, location, location.” At the same time, I live in a mixed-income neighborhood sandwiched between a nice residential district, some nice retail, restaurants and pop-ups, and some factories, warehouses, and throughways. It would score well on the walkability variable here, but wouldn’t necessarily show higher home values than other areas, so the lack of significance makes sense. I left this in despite its low significance simply because it says something about walkability in Portland or the EPA’s index, or both.

Distance to…

Here is my distance variable. I was boring and chose schools, largely because my original idea, “distance to city hall,” was laughed at by a colleague. The second part of this code was enabled by Slack and Jordan’s help, which I seriously appreciate. I found that this distance variable actually shows significant correlation, which does not surprise me.
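
A nearest-school distance can be sketched in a few lines. This is only an illustration: the coordinates are invented, the function name `nearest_distance` is hypothetical, and straight-line distance assumes the points are in a projected (planar) coordinate system rather than raw latitude and longitude.

```python
import math


def nearest_distance(point, targets):
    """Straight-line distance from point to the closest target."""
    return min(math.dist(point, t) for t in targets)


# Made-up planar coordinates (same units on both axes)
houses = [(0.0, 0.0), (3.0, 4.0), (10.0, 0.0)]
schools = [(0.0, 5.0), (6.0, 8.0)]

nnd = [nearest_distance(h, schools) for h in houses]
print(nnd)
```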

Dummy variable…

Here, we are creating a ZIP Code dummy variable that registers as 1 each time the Dunthorpe ZIP Code comes up (and 0 otherwise). Let’s see if we can go fishing for some high house values off the banks of the Willamette.
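
A dummy variable of this kind is just a 0/1 flag. In the sketch below the ZIP codes are placeholders, not the real Dunthorpe ZIP Code, and the names `TARGET_ZIP` and `zip_dummy` are made up for illustration.

```python
TARGET_ZIP = "00000"  # placeholder value, NOT the actual Dunthorpe ZIP Code


def zip_dummy(zip_codes, target=TARGET_ZIP):
    """Return 1 for records in the target ZIP Code and 0 for all others."""
    return [1 if z == target else 0 for z in zip_codes]


zips = ["00000", "11111", "00000", "22222"]  # made-up records
dummy = zip_dummy(zips)
print(dummy)
```

Comparing ZIP codes as strings rather than integers avoids silently dropping leading zeros.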

Linear Model

Here, I simply create a linear model to see if I can get a decent R-squared value. Thanks largely to the inclusion of building square footage, I was able to reach a decent adjusted R-squared of about 0.65.

All my variables except for walkability scored fairly well in terms of p-value, suggesting that we can be confident that they play a significant role in housing prices. Elevation is negatively correlated with value, which surprised me because I had thought that higher-value homes would be in the hills (i.e., as elevation increases, so do values).

Similarly, it appears owner-occupied housing is negatively correlated with total value, perhaps because owners who rent their homes out are more inclined to maximize profit when they sell. I also ran a Variance Inflation Factor (VIF) test to see how the variables score, and they all came in under 5, which suggests that multicollinearity is not a serious problem and none of the coefficient variances are badly inflated.
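
For intuition on what a VIF measures: for predictor j it is 1 / (1 - R²), where R² comes from regressing predictor j on all the other predictors. The sketch below covers only the simplest case of two predictors, where that R² is just their squared correlation. The data and function names are invented for illustration, and this is not the routine used to produce the scores in this post.

```python
def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def vif_two_predictors(x1, x2):
    """VIF when the model has exactly two predictors: 1 / (1 - r^2)."""
    r_squared = pearson_r(x1, x2) ** 2
    return 1.0 / (1.0 - r_squared)


# Made-up predictors with a correlation of 0.8
x1 = [1, 2, 3, 4, 5]
x2 = [2, 1, 4, 3, 5]
vif = vif_two_predictors(x1, x2)
print(round(vif, 2))
```

Read the other way around, a VIF of roughly 3.1 (the highest score in this model) means about 68% of that predictor's variance is explained by the other predictors, since 1 - 1/3.1 ≈ 0.68.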

For next steps, I conduct a Lagrange Multiplier test to see which sort of spatial model would best hone our results and get our unfortunately high Root Mean Squared Error (RMSE) down below $100k at least.

## 
## Call:
## lm(formula = f, data = Data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -672533  -70329   -3437   62782 1164368 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     77310.731  12424.686   6.222 5.96e-10 ***
## Data$lag_edu      892.020     65.936  13.529  < 2e-16 ***
## Data$white         10.419      3.319   3.140 0.001717 ** 
## Data$elevation   -325.921     69.435  -4.694 2.86e-06 ***
## Data$BLDGSQFT     191.334      3.713  51.532  < 2e-16 ***
## Data$YEARBUILT    374.402     40.374   9.273  < 2e-16 ***
## Data$GIS_ACRES  61158.640  18351.663   3.333 0.000876 ***
## Data$owner        -64.256     10.140  -6.337 2.90e-10 ***
## Data$NND           29.178     11.103   2.628 0.008656 ** 
## Data$ZIP       -49976.274  10368.752  -4.820 1.55e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 121500 on 1987 degrees of freedom
## Multiple R-squared:  0.6473, Adjusted R-squared:  0.6457 
## F-statistic: 405.3 on 9 and 1987 DF,  p-value: < 2.2e-16
##   Data$lag_edu     Data$white Data$elevation  Data$BLDGSQFT Data$YEARBUILT 
##       1.425633       2.916669       1.456180       1.138810       1.033069 
## Data$GIS_ACRES     Data$owner       Data$NND       Data$ZIP 
##       1.140471       3.097827       1.154990       1.120396
##  Lagrange multiplier diagnostics for spatial dependence
## data:  
## model: lm(formula = f, data = Data)
## weights: Data_W
##  
##        statistic parameter   p.value    
## LMerr    2446.27         1 < 2.2e-16 ***
## LMlag    1553.59         1 < 2.2e-16 ***
## RLMerr   1121.33         1 < 2.2e-16 ***
## RLMlag    228.65         1 < 2.2e-16 ***
## SARMA    2674.92         2 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Error and Lag

My Lagrange Multiplier test shows that the Error Model will be the best fit: both the error and lag statistics are highly significant, but the robust error statistic (RLMerr) is the larger of the two. Under the simultaneous autoregressive linear error model, it appears that my p-values have increased. The education lag, owner-occupied, ZIP, and NND variables are still cookin’, but elevation is questionable at 0.153, whiteness is now 0.227 and can pretty much be thrown out, and the Walkability index is even more of a joke. However, our pseudo-R-squared is now up to 0.76, which is higher than the straight linear model, and the Akaike information criterion has dropped from 52,440 to 51,637, which means the error model fits the data better even after accounting for its extra parameter.
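
For reference, the standard textbook form of a spatial error model is written out below. The mapping of symbols to this analysis is my reading, not something the output states: y is the vector of total values, X the predictors, W the spatial weights (Data_W), and λ the spatial error coefficient reported as Lambda in the model output.

```latex
y = X\beta + u, \qquad u = \lambda W u + \varepsilon, \qquad \varepsilon \sim N(0, \sigma^2 I)
```

When λ = 0 the model collapses to ordinary least squares, so a large and significant λ is a sign that the OLS residuals were spatially correlated.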

Checking out our Root Mean Squared Error (RMSE) for the error model, we have made about a $26,000 improvement. To get this RMSE down further, we would need to consider better variables and, in particular, develop a better walkability index from scratch.
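
RMSE is the square root of the average squared gap between observed and predicted values, so it is expressed in dollars here. A minimal sketch with invented numbers (the function name and the four sample homes are assumptions, not data from this analysis):

```python
import math


def rmse(observed, predicted):
    """Root mean squared error between observed and predicted values."""
    n = len(observed)
    squared_errors = ((o - p) ** 2 for o, p in zip(observed, predicted))
    return math.sqrt(sum(squared_errors) / n)


# Made-up home values and model predictions, in dollars
observed = [300_000, 450_000, 250_000, 600_000]
predicted = [320_000, 430_000, 280_000, 560_000]
error = rmse(observed, predicted)
print(round(error))
```

Because the errors are squared before averaging, a handful of badly mispredicted homes can dominate the figure.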

## 
## Call:
## spdep::errorsarlm(formula = f, data = Data, listw = Data_W, na.action = na.omit)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -528150.6  -49391.3   -3875.1   40480.2 1196231.0 
## 
## Type: error 
## Coefficients: (asymptotic standard errors) 
##                   Estimate  Std. Error z value  Pr(>|z|)
## (Intercept)    122802.2091  22385.5701  5.4858 4.117e-08
## Data$lag_edu      639.1905    151.7360  4.2125 2.525e-05
## Data$white          5.2261      4.3227  1.2090  0.226663
## Data$elevation   -199.6109    139.8309 -1.4275  0.153431
## Data$BLDGSQFT     164.5577      3.1101 52.9107 < 2.2e-16
## Data$YEARBUILT    144.7116     33.0528  4.3782 1.197e-05
## Data$GIS_ACRES 166407.3196  15888.3944 10.4735 < 2.2e-16
## Data$owner        -37.1328     13.3035 -2.7912  0.005251
## Data$NND           24.9223     11.4944  2.1682  0.030142
## Data$ZIP       -45041.3411  14932.7613 -3.0163  0.002559
## 
## Lambda: 0.82383, LR test value: 805.26, p-value: < 2.22e-16
## Approximate (numerical Hessian) standard error: 0.020299
##     z-value: 40.586, p-value: < 2.22e-16
## Wald statistic: 1647.2, p-value: < 2.22e-16
## 
## Log likelihood: -25806.36 for error model
## ML residual variance (sigma squared): 9278100000, (sigma: 96323)
## Nagelkerke pseudo-R-squared: 0.76437 
## Number of observations: 1997 
## Number of parameters estimated: 12 
## AIC: 51637, (AIC for lm: 52440)
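
The AIC in the output above can be reproduced by hand from the reported log likelihood and parameter count, using the usual definition AIC = 2k - 2·logLik. A small sketch (the function name is my own; the two inputs are copied from the output):

```python
def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2 * log-likelihood."""
    return 2 * n_params - 2 * log_likelihood


# Values reported for the error model in the output above
aic_error = aic(log_likelihood=-25806.36, n_params=12)
print(round(aic_error))
```

Lower is better, and the gap of roughly 800 points relative to the linear model's 52440 is large.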

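Checking to see if we’re still spatially autocorrelated, our Moran’s statistic of about -0.03 on the error model residuals exclaims “negatory, Ghost Rider.” At the same time, our Breusch-Pagan (BP) test strongly suggests that we have some heteroskedasticity issues (BP = 239.74, p < 2.2e-16). That does not rule out the validity of our model, but the standard errors should be read with some caution.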

## 
##  Monte-Carlo simulation of Moran I
## 
## data:  mod_err$residuals 
## weights: Data_W  
## number of simulations + 1: 1000 
## 
## statistic = -0.025577, observed rank = 1, p-value = 0.999
## alternative hypothesis: greater
## 
##  studentized Breusch-Pagan test
## 
## data:  
## BP = 239.74, df = 9, p-value < 2.2e-16

Discussion

It appears home prices are determined largely by square footage, although some other spatial features and characteristics become important. Our spatial error model produced the best outcome, far better than an Ordinary Least Squares model. Long story short, if “location, location, location” is true, then location acts as a force multiplier for square footage value. However, more variables could be introduced to make this hedonic regression more accurate.

Works cited

Dubin, R.A. 1998. Predicting House Prices Using Multiple Listings Data. Journal of Real Estate Finance and Economics, 17(1): 35-59.

Ismail, S. 2006. Spatial Autocorrelation and Real Estate Studies: A Literature Review. Regional Science and Urban Economics. vol 35.

Malpezzi, S. 2003. Hedonic Pricing Models: a Selective and Applied Review. In: T. O’Sullivan and K. Gibb, eds, Housing Economics & Public Policy. Oxford: Blackwell.

Tu, Y., S. Yu, and H. Sun. 2004. Transaction-Based Office Price Indexes: A Spatiotemporal Modeling Approach. Real Estate Economics, 32(2): 297-328.