Predicting House Price in Hong Kong #5

In #4, I talked about how to find the hidden APIs of the Centaline Android app. In this article, I am going to show you how to write a scraper to collect the data.

You can find the Scrapy Crawler in this Git directory.

3 Levels of Data

The scraping was done in 3 levels. First is the transaction level: the transaction API contains basic information about each transaction. Second is the transaction detail level: the transaction detail API contains more information on the transaction. Last is the building information level, which contains even more information about the building itself. For more details on the data fields, please refer to this wiki (the documentation is not completely done yet).

So the crawler first gets all transactions. Next, based on the returned transaction_id, it gets all transaction details. After that, based on the building_id obtained from the transaction details, it gets all building information.

My First Scrapy Project

I would say I am quite experienced in scraping data (LOL), but I had never used Scrapy to complete my tasks before; I mainly used the requests module and async programming. After using Scrapy, I don't think I will write my own scraper again. It is async by default, and by sticking to its structure, my code is much more organised than when I roll my own scraper.

Basic Structure

Folder Structure

The folder structure below is created by default when you run scrapy startproject centaline. Scrapy helps you organize your code!

centaline
   centaline
       spiders
           transaction_spider.py
       items.py
       pipelines.py
       settings.py

transaction_spider.py

The following skeleton shows you the basic structure of transaction_spider.py. Following our plan, I first request the level 1 url in start_requests by yielding a scrapy.Request. Each scrapy.Request specifies one of the callbacks parse, parse_2 and parse_3 for parsing the returned data. In parse, I parse the level 1 data and also yield the requests for level 2 data. In parse_2, I parse the level 2 data and yield the requests for level 3 data.

class TransactionSpider(scrapy.Spider):
    # some constants (such as urls, headers...)

    def start_requests(self):
        # some constants
        yield scrapy.Request(sth, sth, callback=self.parse, sth...)  # level 1 data

    def parse(self, response):
        # get the response into an Item
        yield item
        yield scrapy.Request(sth, sth, callback=self.parse_2, sth...)  # level 2 data

    def parse_2(self, response):
        # get the response into an Item
        yield item
        yield scrapy.Request(sth, sth, callback=self.parse_3, sth...)  # level 3 data

    def parse_3(self, response):
        # get the response into an Item
        yield item
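
To make the chaining more concrete, below is a minimal sketch of how the level 1 response can drive the level 2 and level 3 requests. It assumes the APIs return JSON; the URL templates DETAIL_URL and BUILDING_URL are placeholders I made up for illustration, not the real Centaline endpoints.

import json

import scrapy

# Placeholder URL templates for illustration only
DETAIL_URL = "https://example.com/transaction/{transaction_id}"
BUILDING_URL = "https://example.com/building/{building_id}"

class TransactionSpider(scrapy.Spider):
    name = "transactions"

    def parse(self, response):
        # level 1 -> level 2: one detail request per returned transaction
        for tx in json.loads(response.text):
            yield scrapy.Request(
                DETAIL_URL.format(transaction_id=tx["TransactionID"]),
                callback=self.parse_2,
            )

    def parse_2(self, response):
        # level 2 -> level 3: the detail record carries the building_id
        detail = json.loads(response.text)
        yield scrapy.Request(
            BUILDING_URL.format(building_id=detail["building_id"]),
            callback=self.parse_3,
        )

    def parse_3(self, response):
        # level 3: parse the building information into an item here
        pass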

You may be wondering what an item is. An item is simply a temporary container for the scraped data, where you can also do some transformations. For example, for level 1 transaction data, my item class CentalineTransactionsItem, stored in items.py (please refer to the folder structure), looks like the following:

class CentalineTransactionsItem(scrapy.Item):
    # define the fields for your item here like:
    TransactionID = scrapy.Field()
    IsRelated = scrapy.Field()
    RegDateString = scrapy.Field()
    CblgCode = scrapy.Field()
    CestCode  = scrapy.Field()
    Data_Source = scrapy.Field()
    Memorial = scrapy.Field()
    RegDate = scrapy.Field()
    InsDate = scrapy.Field()
    PostType = scrapy.Field()
    Price = scrapy.Field()
    Rental = scrapy.Field()
    RFT_NArea = scrapy.Field()
    RFT_UPrice = scrapy.Field()
    INT_GArea = scrapy.Field()
    INT_UPrice = scrapy.Field()
    CX = scrapy.Field()
    CY = scrapy.Field()
    c_estate = scrapy.Field()
    c_phase = scrapy.Field() 
    c_property = scrapy.Field()
    scp_c = scrapy.Field()
    scp_mkt = scrapy.Field()
    pc_addr = scrapy.Field()
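
To see how the item gets filled, here is a minimal sketch of a parse method that copies a JSON record from the transaction API into CentalineTransactionsItem. It assumes the API returns a JSON list whose keys match the field names above; the real response may be shaped differently.

import json

def parse(self, response):
    # assume the transaction API returns a JSON list of records whose
    # keys line up with the declared fields of CentalineTransactionsItem
    for record in json.loads(response.text):
        item = CentalineTransactionsItem()
        for field in item.fields:   # copy only the declared fields
            if field in record:
                item[field] = record[field]
        yield item                  # handed on to the pipelines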

If you are interested in learning more about Scrapy, I recommend the following short video series on YouTube.

Results

The above skeleton is a very simplified one. After some trial and error, I finally got all transaction data (3 levels) for the last 360 days. The resulting CentalineBuildingInfo.csv, CentalineTransactinsDetailItem.csv and CentalineTransactionItem.csv can be found in the staging_area folder.

Challenges

Writing the code was quite straightforward, but I got stuck on outputting three separate files, one for each level of data. By default, Scrapy outputs everything into one file.

Luckily, I found a solution on Stack Overflow. It turns out I needed to write a custom pipeline to do the job. I also added my own answer to the Stack Overflow question since the original answer is quite old and out of date.
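
For reference, here is a minimal sketch of such a pipeline, modelled on that Stack Overflow approach: it keeps one CsvItemExporter per item class and routes each item by its class name. The detail and building item class names below are my guesses based on the output file names, so check pipelines.py in the repo for the actual code.

# pipelines.py (sketch)
from scrapy.exporters import CsvItemExporter

class MultiCSVItemPipeline:
    # one CSV file per item class; the detail/building names are guesses
    item_classes = [
        "CentalineTransactionsItem",
        "CentalineTransactionsDetailItem",
        "CentalineBuildingInfoItem",
    ]

    def open_spider(self, spider):
        self.files = {name: open(name + ".csv", "wb") for name in self.item_classes}
        self.exporters = {name: CsvItemExporter(f) for name, f in self.files.items()}
        for exporter in self.exporters.values():
            exporter.start_exporting()

    def close_spider(self, spider):
        for exporter in self.exporters.values():
            exporter.finish_exporting()
        for f in self.files.values():
            f.close()

    def process_item(self, item, spider):
        # send the item to the exporter that matches its class name
        name = type(item).__name__
        if name in self.exporters:
            self.exporters[name].export_item(item)
        return item

The pipeline still has to be enabled through the ITEM_PIPELINES setting in settings.py before Scrapy will run it.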

Next Step

The next step will be doing EDA (Exploratory Data Analysis) again. I will check whether the missing data problem has improved and look for patterns in the data.