How to Extract Data from Text Columns?
Imagine you need to extract the price from the “sell_price” column, what would you do? Before reading the solutions, leave a comment below!
ad_cat | sell_price |
住宅 屋苑 村屋 | 售 358 萬 |
住宅 屋苑 | 售 960 萬 |
住宅 屋苑 | 售 465 萬 |
住宅 屋苑 | 售 760 萬 |
住宅 屋苑 | 售 650 萬 |
住宅 屋苑 單幢式大廈 | 售 690 萬 |
住宅 屋苑 | 售 1,000 萬 |
住宅 屋苑 | 售 435 萬 |
住宅 屋苑 | 售 530 萬 |
住宅 屋苑 獨立屋 | 售 1,380 萬 |
If you are familiar with Excel, you may come up with this formula:
=MID(A2, SEARCH("售",A2) + 2, SEARCH("萬",A2)-2).
The first argument of MID
is the cell number. The second argument is the starting point of the text being extracted. The third argument is the ending point of the text being extracted. This is not the end. To convert currency format back to a number, you need to replace the ,
by empty string ''
.
This is not a bad idea! But it may fail in some cases like the ending string is not 萬
(10 thousand) but 千
(thousand). How can we solve this and make it more general?
Introducing Regular Expression (Regex)!!!
By using this code
^\W\s([+-]?[0-9]{1,3}(?:,?[0-9]{3})*(?:.[0-9]{2})?)\s\W
in Regex, we can extract the value in the text regardless of what’s in front or behind the value. It looks more complicated than Excel. Why don’t we just use Excel? To answer this, please keep reading.
Syntax | Meaning |
^ | Start of string |
\W | Any non word character |
\s | Any whitespace character |
[+-] | A single character of + or – |
? | Zero or one |
* | Zero or more |
[0-9] | A character in range 0-9 |
{1,3} | Between 1 to 3 |
(?:…) | Match everything enclosed but do not capture the value |
( ) | Match everything enclosed and capture the value |
The whole code can be read as “Let’s start with the start of the string. Match any non-word character then followed by a space. Then capture anything that starts with +/- sign but it is not necessary, follows by a 3 digits number which can repeat many times, follows by a comma but it is not necessary, follows by a dot but it is not necessary, follows by a 2 digits number but it is not necessary. After that space and a non-word character are expected.
Now, you will start to realize that Regex can be much more specific in what pattern to match instead of using Mid
function in Excel which you do not have control of the characters in the middle.
Regex can match a lot of things, for example, email addresses, phone numbers, URLs, logs, etc. If you need to match a certain pattern, before writing your own, please search Google or use Regexlib to find whether someone has done it before. You can also use Regex101 to test the code!
Another good news is that you don’t need to know any programing language to use Regex. You can use Regex101 to extract data and download it in CSV format, or use Google Sheet which supports Regex. Check it out at this link.
Comments ()