We all know how necessary data is and even more so «good data» to be able to exercise the cool task of Data Science. In my case, one of the recurring tasks to obtain that data is the scraping of websites to be able to then analyze the data or generate and automate tasks of my day-to-day.
We are going to expose here two use cases that can occur in this type of website, let’s start with the first in this first part:
1. In search of the lost .json.
As we said this type of website, when making a “request” provides us with an Html code that has nothing to do with what is shown on the web of our browser. For example, if we enter this website of calls for aid and financing of entrepreneurship projects of Banco Santander (https://www.santanderx.com/calls) there are a lot of data on calls that would be great to collect and be able to analyze to study where it puts its efforts in supporting university entrepreneurship. It would be great to connect and get the html that provides us with the tags where to extract the data, but in doing so… disappointment… shows us a «skeletal» Html.
To check it out, I use a very interesting feature in Firefox’s «Inspect» functionality. This is the Firefox Network analysis, which basically tells us the process of loading the different functions and data from the server to the browser.
When reloading the web, it generates the entire process of requesting data and code to the server that will show us successively:
Here, we must look at how it requests the data, in this case, it shows us two .json files that the server seems to return with the call and event data that is displayed in the browser. If we copy that url (right button above) and put it in the browser… voila! We obtain the complete data of the 6 calls that appear in the first place of the cover.
But we must notice that it tells us that there are a total of 600 calls, so we can play with the api rest (url) so as not to limit the number to 6 and thus obtain the complete data: https://api-manager.universia.net/santanderx-core/api/calls/find?offset=0&limit=6, we will change the 6 by 600 and we will obtain the .json with all the data of the calls that interest us. In the same way, we can do with the other url for events.
From here, we only have to treat the json in Python and import it to Pandas to work on the analysis of the data.
In my Github repository, I leave the Jupyter of data extraction and analysis if you want to dig a little deeper.