Predicting House Price in Hong Kong #5
In #4, I talked about how to find the hidden APIs of the Centaline Android app. In this article, I am going to show you how to write a scraper to collect the data.
You can find the Scrapy Crawler in this Git directory.
3 Levels of Data
The scraping was done in 3 levels. First, there is the transaction level: the transaction API contains basic information about each transaction. Second, there is the transaction detail level: the transaction detail API contains more information on the transaction. Last, there is the building information level, which contains even more information about the building itself. For more details on the data fields, please refer to this wiki (the documentation is not complete yet).
So, imagine the crawler will first get all transactions. Next, based on the returned transaction_id, it will get all transaction details. After that, based on the building_id obtained from the transaction details, it will get all building information.
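Conceptually, the flow looks like the sketch below. This is just an illustration of the chaining, not the scraper itself; the base URL and JSON field names are placeholders, not the real hidden endpoints from #4.

import requests

BASE = "https://example.invalid/api"  # placeholder, not the real Centaline endpoint

# Level 1: all transactions
for tx in requests.get(f"{BASE}/transactions").json():
    # Level 2: details keyed by the transaction_id returned in level 1
    detail = requests.get(f"{BASE}/transaction/{tx['transaction_id']}").json()
    # Level 3: building info keyed by the building_id returned in level 2
    building = requests.get(f"{BASE}/building/{detail['building_id']}").json()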
My First Scrapy Project
I would say I am quite experienced in scraping data (LOL), but I had never used Scrapy for my tasks before. I mainly used the requests module and async programming. After using Scrapy, I don't think I will write my own scraper again. It is async by default, and by sticking with its structure, my code is much more organised than when I wrote my own scraper.
Basic Structure
Folder Structure
The folder structure is created by default. Scrapy helps you organize your code!
centaline
    centaline
        spiders
            transaction_spider.py
        items.py
        pipelines.py
        settings.py
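If you are new to Scrapy, this layout is what the scrapy startproject centaline command generates for you, and you run a spider from inside the project with scrapy crawl followed by the spider's name.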
transaction_spider.py
The following skeleton shows you the basic structure of transaction_spider.py. Following our plan, I first request the starting URL in start_requests using the yield keyword. Each scrapy.Request also specifies a callback (parse, parse_2 or parse_3) for parsing the returned data. In parse, I parse the first level of data and also yield the requests for level 2 data. In parse_2, I parse the level 2 data and yield the requests for level 3 data.
class TransactionSpider(scrapy.Spider):
    # some constants (such as urls, headers...)

    def start_requests(self):
        # some constants
        yield scrapy.Request(sth, sth, callback=self.parse, sth...)  # level 1 data

    def parse(self, response):
        # get the response into an Item
        yield item
        yield scrapy.Request(sth, sth, callback=self.parse_2, sth...)  # level 2 data

    def parse_2(self, response):
        # get the response into an Item
        yield item
        yield scrapy.Request(sth, sth, callback=self.parse_3, sth...)  # level 3 data

    def parse_3(self, response):
        # get the response into an Item
        yield item
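To make the chaining a bit more concrete, here is a minimal sketch of the level 1 to level 2 hop only, using the item class introduced below. The URL template, JSON field names and the import path are my assumptions for illustration, not the real hidden API.

import json
import scrapy

from centaline.items import CentalineTransactionsItem  # assumed default project import path

class TransactionSpider(scrapy.Spider):
    name = "centaline"
    # placeholder URL template, not the real hidden endpoint
    detail_url = "https://example.invalid/api/transaction/{id}"

    def parse(self, response):
        for record in json.loads(response.text):
            # level 1: store the transaction itself
            yield CentalineTransactionsItem(TransactionID=record["TransactionID"])
            # level 2: chain a detail request off the transaction_id just parsed
            yield scrapy.Request(
                self.detail_url.format(id=record["TransactionID"]),
                callback=self.parse_2,
            )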
You may be wondering what an item is. An item is simply a temporary container for storing the scraped data, and you can probably do some transformation in it too. For example, for the level 1 transaction data, my item class CentalineTransactionsItem, stored in items.py (please refer to the folder structure), looks like the following:
class CentalineTransactionsItem(scrapy.Item):
    # define the fields for your item here like:
    TransactionID = scrapy.Field()
    IsRelated = scrapy.Field()
    RegDateString = scrapy.Field()
    CblgCode = scrapy.Field()
    CestCode = scrapy.Field()
    Data_Source = scrapy.Field()
    Memorial = scrapy.Field()
    RegDate = scrapy.Field()
    InsDate = scrapy.Field()
    PostType = scrapy.Field()
    Price = scrapy.Field()
    Rental = scrapy.Field()
    RFT_NArea = scrapy.Field()
    RFT_UPrice = scrapy.Field()
    INT_GArea = scrapy.Field()
    INT_UPrice = scrapy.Field()
    CX = scrapy.Field()
    CY = scrapy.Field()
    c_estate = scrapy.Field()
    c_phase = scrapy.Field()
    c_property = scrapy.Field()
    scp_c = scrapy.Field()
    scp_mkt = scrapy.Field()
    pc_addr = scrapy.Field()
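To show how an item behaves (it works like a dict, which is also where a small transformation can happen), here is a tiny hedged example; the record values and the price conversion are made up for illustration and assume the class above.

# the record below is invented sample data, not real output from the API
record = {"TransactionID": "T123", "Price": "8500000", "RegDateString": "2019-01-01"}

item = CentalineTransactionsItem()
item["TransactionID"] = record["TransactionID"]
item["Price"] = int(record["Price"])          # an example of a small transformation
item["RegDateString"] = record["RegDateString"]
print(dict(item))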
If you are interested in learning more about Scrapy, I recommend this short video series on YouTube.
Results
The above skeleton is a very simplified one. After some trial and error, I finally got all transaction data (3 levels) for the last 360 days. The resulting CentalineBuildingInfo.csv, CentalineTransactinsDetailItem.csv and CentalineTransactionItem.csv can be found in the staging_area folder.
Challenges
Writing the code was quite straightforward, but I was stuck on outputting three separate files, one for each level of data. By default, Scrapy outputs everything into one file.
Luckily, I found a solution on Stack Overflow. It turns out I need to write a custom item pipeline to do the job. I also added my own answer to the Stack Overflow question, since the original answer is quite old and out of date.
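For reference, a minimal sketch of such a pipeline looks like the one below. The output file names and the set of item class names are my assumptions (only CentalineTransactionsItem is shown above); the actual implementation lives in pipelines.py in the repo.

from scrapy.exporters import CsvItemExporter

class MultiCSVItemPipeline:
    # one output file per item class; the class names are assumed to
    # mirror the three levels of data
    item_types = ["CentalineTransactionsItem",
                  "CentalineTransactionsDetailItem",
                  "CentalineBuildingInfoItem"]

    def open_spider(self, spider):
        # open one CSV exporter per item type
        self.files = {name: open(name + ".csv", "wb") for name in self.item_types}
        self.exporters = {name: CsvItemExporter(f) for name, f in self.files.items()}
        for exporter in self.exporters.values():
            exporter.start_exporting()

    def close_spider(self, spider):
        for exporter in self.exporters.values():
            exporter.finish_exporting()
        for f in self.files.values():
            f.close()

    def process_item(self, item, spider):
        # route the item to the exporter that matches its class name
        name = type(item).__name__
        if name in self.exporters:
            self.exporters[name].export_item(item)
        return item

The pipeline then needs to be enabled in settings.py via the ITEM_PIPELINES setting.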
Next Step
The next step will be doing EDA (Exploratory Data Analysis) again. I will look at whether the missing data problem has improved, and try to figure out some patterns in the data.