Predicting House Price in Hong Kong #3

Date: 13 July 2020

In part 2, we talked about splitting the data into a training set and a testing set. In part 3, I would like to share some findings from the first round of data exploration.

Big Problem: Missing Data

I did not expect so much missing data in some of the key fields. I used the missingno package to examine the missing fields.

% of missing values
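The numbers behind a missing-value chart can also be computed directly with pandas. This is a minimal sketch on a toy frame: the column names follow the dataset, but the values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data; the values are made up.
df = pd.DataFrame({
    "Price$(M)": [5.2, 8.1, 3.9, 12.4],
    "Actual Size(feet)": [np.nan, 520.0, np.nan, np.nan],
    "Building age(year)": [np.nan, 30.0, 12.0, np.nan],
})

# isna() marks the missing cells; the column mean of that mask is the missing share.
missing_pct = (df.isna().mean() * 100).sort_values(ascending=False)
print(missing_pct.round(1))
```

Sorting in descending order puts the worst columns at the top, which makes the key fields with heavy gaps easy to spot.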

Build Size(feet), Actual Size(feet), Price/sq.ft, and Building age(year) each have more than 60% missing values. Intuitively, they should be highly correlated with the property price. This intuition is supported by the correlation plots:

Actual Size vs. Price
Age vs. Price

Sometimes, when a feature has more than 50% missing values, we may simply drop it. But I don’t want to do that for important features, which leaves me with a dilemma.
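To see what the 50% rule would do here, the sketch below applies it to the non-null counts from the column summary further down in this post. It assumes the frame has 5952 rows, which is the count shown for the fully populated columns. The threshold constant and variable names are my own.

```python
# Non-null counts copied from the column summary in this post.
# Assumption: the frame has 5952 rows, the count of the fully populated columns.
TOTAL_ROWS = 5952
non_null = {
    "Reg. Date": 5952,
    "Property Name": 5318,
    "Street": 5952,
    "Block": 2732,
    "Floor": 5918,
    "Flat": 5343,
    "Price$(M)": 5952,
    "Build Size(feet)": 559,
    "Actual Size(feet)": 1589,
    "Price/sq.ft": 559,
    "Building age(year)": 2063,
    "Type": 5952,
}

THRESHOLD = 0.5  # the "more than 50% missing" rule of thumb

# Share of missing rows per column.
missing_share = {col: 1 - n / TOTAL_ROWS for col, n in non_null.items()}
# Columns the rule of thumb would drop.
drop_candidates = [col for col, share in missing_share.items() if share > THRESHOLD]

for col in drop_candidates:
    print(f"{col}: {missing_share[col]:.1%} missing")
```

Under that assumption, the rule would flag Block as well as the four size, price-per-foot, and age fields. Applying it blindly would therefore remove almost every numeric feature.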

I am still working out a solution to this problem. One option is to look for other datasets, but I first need to understand how the data was collected and why there are so many missing values. Are those required fields? Another option is to find some way to fill in the missing values. I think I will try the first option first.
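For the second option, one simple baseline is median imputation combined with a flag that records which rows were filled. This is only a sketch of one possible approach, not something I have settled on, and the values below are made up.

```python
import numpy as np
import pandas as pd

# Made-up values for illustration only.
df = pd.DataFrame({"Building age(year)": [10.0, np.nan, 30.0, np.nan, 20.0]})

col = "Building age(year)"
# Record which rows were missing before filling, so a model can still use that signal.
df[col + "_missing"] = df[col].isna()
# Fill the gaps with the median of the observed values.
df[col] = df[col].fillna(df[col].median())

print(df)
```

With more than 60% of a column missing, most rows would end up with the same filled value, so this is unlikely to be enough on its own. In a train/test setup, the median should also be computed on the training set only and then reused on the testing set.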

Problem: Limited Features

If we look at the dataset, there are actually very few features.

#   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   Reg. Date           5952 non-null   object 
 1   Property Name       5318 non-null   object 
 2   Street              5952 non-null   object 
 3   Block               2732 non-null   object 
 4   Floor               5918 non-null   object 
 5   Flat                5343 non-null   object 
 6   Price$(M)           5952 non-null   float64
 7   Build Size(feet)    559 non-null    object 
 8   Actual Size(feet)   1589 non-null   object 
 9   Price/sq.ft         559 non-null    float64
 10  Building age(year)  2063 non-null   float64
 11  Type                5952 non-null   object

Without any feature engineering, we may be left with only six features: Reg. Date, Floor, Build Size(feet), Actual Size(feet), Building age(year), and Type. Moreover, Build Size(feet) and Actual Size(feet) are highly correlated with each other. For Property Name, Street, and Block, we may want to do geocoding, that is, converting address information to coordinates. Can we do one-hot encoding on the address (maybe the street)? I am not sure.
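Mechanically, one-hot encoding a street column is straightforward with pandas. The concern is that many distinct streets would produce a very wide and sparse table. The sketch below pools rarely seen streets into an "Other" bucket before encoding. The street labels and the MIN_COUNT cutoff are hypothetical.

```python
import pandas as pd

# Hypothetical street labels, for illustration only.
streets = pd.Series(
    ["A Road", "A Road", "B Road", "A Road", "C Road", "B Road"],
    name="Street",
)

MIN_COUNT = 2  # assumed cutoff: streets seen fewer times are pooled

counts = streets.value_counts()
rare = counts[counts < MIN_COUNT].index
# Pool rare streets so the encoding does not create one column per rarely seen street.
grouped = streets.where(~streets.isin(rare), "Other")

# One indicator column per remaining street label.
one_hot = pd.get_dummies(grouped, prefix="Street")
print(one_hot.columns.tolist())
```

Whether this beats geocoding is an open question. One-hot columns treat every street as unrelated to every other, while coordinates keep the information that two streets are close together.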

Some Observations

A lot of outliers in the target variable

Positive skewness and a multimodal distribution in the target variable

Price$(M) is positively correlated with Build Size and Actual Size, and negatively correlated with age
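Three of these observations (outliers, skewness, and the direction of the correlations) can be checked numerically as well as visually. The sketch below does so on made-up numbers, so the output says nothing about the real dataset. The 1.5 * IQR fence and the log transform are common conventions that I am assuming here, not choices already made in this series.

```python
import numpy as np
import pandas as pd

# Made-up values for illustration only; they are not taken from the dataset.
df = pd.DataFrame({
    "Price$(M)": [3.0, 4.0, 4.5, 5.0, 6.0, 7.5, 30.0],
    "Actual Size(feet)": [300.0, 380.0, 420.0, 450.0, 560.0, 700.0, 2400.0],
    "Building age(year)": [45.0, 40.0, 38.0, 30.0, 22.0, 15.0, 5.0],
})

price = df["Price$(M)"]

# Sample skewness: a positive value indicates a long right tail.
skew_raw = price.skew()
# A log transform is a common way to reduce right skew in prices.
skew_log = np.log(price).skew()

# 1.5 * IQR rule: flag prices far above the upper quartile.
q1, q3 = price.quantile([0.25, 0.75])
upper_fence = q3 + 1.5 * (q3 - q1)
n_high_outliers = int((price > upper_fence).sum())

# Pearson correlation of each feature with the price.
corr_with_price = df.corr()["Price$(M)"].drop("Price$(M)")

print(f"skewness: raw={skew_raw:.2f}, log={skew_log:.2f}")
print(f"prices above the upper fence: {n_high_outliers}")
print(corr_with_price)
```

Note that pandas computes each correlation only on rows where both columns are present, so on the real data these figures would rest on a small fraction of the rows.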

These observations are preliminary. I should fix the missing-data problem first.