A New Approach
About a year ago, I published my first post on data&stuff. I applied econometric techniques to develop three least squares regression models to explain HDB resale flat prices. A year on, I’m re-visiting the expanded dataset (now includes an additional year of data) with new skills and knowledge. This time, I intend to apply proper data science techniques to accurately predict prices.
In this first post, I perform exploratory data analysis (EDA) on the dataset. In subsequent posts, I will develop a more complex regression model to predict resale flat prices.
# Import
import matplotlib as mpl
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import warnings
# Settings
%matplotlib inline
warnings.filterwarnings('ignore')
# Read data
hdb = pd.read_csv('resale-flat-prices-based-on-registration-date-from-jan-2015-onwards.csv')
Target: Resale Prices
As we can see, resale prices are right-skewed (mean is to the right of the median). The mean resale price transacted was a whopping $440,000. Singaporeans must be crazy rich to afford a resale flat in this era.
Date and Month Purchased
First, note that the month
feature combines both the month and the year. Let’s split these up while preserving the original notation.
# Rename month variable
hdb = hdb.rename(columns={'month': 'year_mth'})
# Add variables for month and year
hdb['year'] = pd.to_numeric(hdb.year_mth.str[:4])
hdb['month'] = pd.to_numeric(hdb.year_mth.str[5:])
From the graph below, we find that there are “hot” and “cold” periods for buying resale flats, with a surge in recent months. We note how lots of transactions take place on a regular basis: at least 1,000 per month. At the median price, that’s approximately $4.4 billion transacted per month.
Relation with Target
Plotting the median resale price from 2015 onwards, we find that the median price has remained stable over time. In addition, the variation in prices has remained relatively wide. Hence, as in my first post on HDB resale flat prices, we will assume that the relationship between the flat characteristics and resale flat prices is stable for all transactions in the dataset. In other words, we treat the transactions as having occurred within a single, stable time period.
Town
Relation with Target
We find high variability in resale flat prices across the respective towns. This tells us that towns are an important factor in predicting resale flat prices.
Flat Type
Relation to Target
Naturally, we would expect flats that are “high SES” to have a higher resale price:
Storey Range
Relation to Target
Conventional wisdom would tell us that the higher the storey, the nicer the view. The nicer the view, the higher the resale price. The data appears to agree.
Floor Area
Relation to Target
Conventional wisdom would also suggest a positive relationship between floor area and price. Yet again, the data appears to agree.
Flat Model
Relation to Target
There appears to be high variability in resale prices across flat types. This suggests that flat types will be useful for prediction.
Lease Commencement Date
Although we expect a higher price for later lease commencement dates, the relationship is not all that clear. Perhaps remaining lease is a bigger factor.
Relation to Target
Remaining Lease
Relation to Target
We find a positive relationship between resale price and the remaining years in lease from 50 to 90 years. However, from 90 years onwards (referring to Build-to-Order (BTO) flats sold in the last 5 years), the relationship weakens substantially, and the variation increases substantially as well. This suggests that we could create a special category for transactions of flats with 90 or more years remaining in their leases to predict resale flat prices.
Click here for the full Jupyter notebook.
Credits for image: Public Service Division
Credits for data: Data.gov.sg