The Best Personalized Fundamental Stocks Scanner Setup (Part 2)
A few weeks ago, I shared with you where to find 10 years of financial statement data and how to use AWS Athena to query the files stored in AWS S3. In this post, I am going to show you how to write a program to download the data and store it in S3.
Downloading the files from the SEC website using Python is not that difficult:
- Find the pattern in the download links
- Use the `requests` package to get the files (see the sketch right after this list)
- Unzip the files and upload them to S3
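For the download step itself, a small helper along these lines should work. This is just a sketch: `download_zip` is an illustrative name (not part of the original script), and note that SEC.gov expects automated clients to identify themselves with a descriptive User-Agent header.

```python
import requests

# SEC.gov asks automated clients to identify themselves; use your own details here
HEADERS = {"User-Agent": "Your Name your.email@example.com"}

def download_zip(url, dest_path):
    """Download one zip archive from the SEC site and save it to dest_path."""
    resp = requests.get(url, headers=HEADERS, timeout=60)
    resp.raise_for_status()
    with open(dest_path, "wb") as f:
        f.write(resp.content)
    return dest_path
```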
Those are the basic steps, but there are some details we need to handle. For example, how do we keep track of the files we have already processed? We don't want to keep duplicate copies of the data in S3. And how can we select only certain file types to be uploaded?
Without further ado, let me show you the code:
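To keep things readable here, what follows is a minimal sketch of the overall flow rather than the exact script: `download_zip` (from the sketch above) and `get_zip_urls` (sketched further down) are illustrative helper names, the bucket and prefix are placeholders, and `find_url_in_file` and `upload_dir_s3` are the functions described later in this post.

```python
import logging
import os
import zipfile

logging.basicConfig(filename="processed_urls.log",
                    level=logging.INFO,
                    format="%(message)s")

# URLs recorded in the log file on previous runs (see find_url_in_file below)
processed = find_url_in_file("processed_urls.log")

# Zip links scraped from the SEC index page (see get_zip_urls below);
# the symmetric difference leaves only the links not processed yet
todo = set(get_zip_urls()) ^ processed

for url in sorted(todo):
    # File names look like 2020q1.zip, so the quarter becomes the folder name
    zipdate = os.path.basename(url).replace(".zip", "")
    if not os.path.exists(zipdate):
        os.makedirs(zipdate)

    # Download the archive and extract its contents into the zipdate folder
    local_zip = download_zip(url, zipdate + ".zip")
    with zipfile.ZipFile(local_zip) as zf:
        zf.extractall(zipdate)

    # Upload the extracted files to S3 (placeholder bucket/prefix),
    # then record the URL as processed
    upload_dir_s3(zipdate, "my-bucket", "financial-statements/" + zipdate)
    logging.info(url)
```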
It may look complicated, but the main part is the `for` loop: I basically loop through all the URLs found, create a `zipdate` folder if it does not exist yet, extract the downloaded zip files into that folder, and finally upload them to the S3 path.
To get the URLs for all the financial statement data, I used `BeautifulSoup` to find all `href` links on the webpage, then used a regex to keep only the links with a `.zip` extension.
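A sketch of that scraping step might look like the following; the index page URL here is an assumption about the current SEC site layout, and `get_zip_urls` is the illustrative name used in the sketch above.

```python
import re
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

# Index page listing the quarterly data sets -- an assumption about the SEC site layout
INDEX_URL = "https://www.sec.gov/dera/data/financial-statement-data-sets.html"
HEADERS = {"User-Agent": "Your Name your.email@example.com"}

def get_zip_urls():
    """Return absolute URLs for every link on the index page ending in .zip."""
    resp = requests.get(INDEX_URL, headers=HEADERS, timeout=60)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")
    zip_links = []
    for a in soup.find_all("a", href=True):
        if re.search(r"\.zip$", a["href"]):
            # Links on the page are relative, so resolve them against the index URL
            zip_links.append(urljoin(INDEX_URL, a["href"]))
    return zip_links
```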
Next, to maintain a record of processed URLs and prevent duplication, I used the `logging` package to create a log file containing the URLs already processed. Then I defined a function called `find_url_in_file` to retrieve those processed URLs and used a `set` operation to get the symmetric difference. Finally, I also check whether the files already exist in S3 inside the `upload_dir_s3` function.
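Here is roughly what those two helpers could look like. These are sketches under the assumptions above, not the original implementations: `find_url_in_file` reads the log back into a set, and the `upload_dir_s3` body below skips keys that already exist in the bucket using `list_objects_v2`, which may differ from how the original version does its check.

```python
import os
import boto3

def find_url_in_file(log_file="processed_urls.log"):
    """Read back the URLs recorded in the log file; return an empty set if it is new."""
    if not os.path.exists(log_file):
        return set()
    with open(log_file) as f:
        return {line.strip() for line in f if line.strip()}

def upload_dir_s3(local_dir, bucket, prefix):
    """Upload every file in local_dir to s3://bucket/prefix/, skipping existing keys."""
    s3 = boto3.client("s3")
    for name in os.listdir(local_dir):
        key = prefix + "/" + name
        # If an object with this key is already in the bucket, do not upload it again
        if s3.list_objects_v2(Bucket=bucket, Prefix=key).get("KeyCount", 0) > 0:
            continue
        s3.upload_file(os.path.join(local_dir, name), bucket, key)
```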