Once upon a time, a young street urchin from a kingdom far, far away was managing virtual machines and cron jobs for a living, but soon his life would be turned upside down when he discovered serverless.
Having a setup where a simple script runs once a day/week/month on a dedicated server is not ideal. Sure, it could be a small, free-tier virtual machine running in a public cloud, so who cares?
In this article I will give a quick how-to for replacing a simple data retrieval pipeline (like the one described in this article) with a completely serverless solution using the Google Cloud Platform.
Note: there might be a reference or two from Disney’s Aladdin movie which I recently watched!
Let’s imagine ourselves in the ancient kingdom of Agrabah. The protagonist of our story was looking for a new solution for his data pipeline when a giant, blue ghost came out of a magic lamp.
🧞 Three Wishes
The Genie granted three wishes:
- I wish to never have to worry about my infrastructure ever again
- I wish to only pay for the compute time my code is using to run
- and finally, I wish for automatic scaling with high availability
☁️ Answer is serverless
Back in the old days, this would probably have seemed impossible even for Big G.
But today all the major cloud providers have some sort of serverless option: AWS Lambda, MS Azure Functions, GCP Cloud Functions.
In this project, I implemented the following architecture to get a weekly dataset of audio data from the Spotify API:
In more detail this means:
☁️ the first Cloud Function is retrieving the data from the Spotify API and storing it as CSV
☁️ Cloud Storage is just for storing the CSVs
☁️ the second CF will load the data from the CSVs into BigQuery
☁️ BigQuery is our serverless data warehouse holding all the data
☁️ last but not least, Cloud Scheduler (which is not shown in the diagram above) will trigger the first CF once a week
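To make the first step concrete, here is a rough sketch of what the first Cloud Function could look like. The playlist ID, bucket name, file name, and the column subset are my own illustrative placeholders, not the exact code from the project:

```python
# Sketch of the first Cloud Function (HTTP-triggered): pull audio
# features from the Spotify API and store them as a CSV in a bucket.
# PLAYLIST_ID, BUCKET_NAME and the kept columns are assumptions.
import io

def rows_from_audio_features(features):
    """Keep only a few columns from Spotify audio-feature dicts."""
    keep = ["id", "danceability", "energy", "tempo"]  # assumed subset
    return [{k: f[k] for k in keep} for f in features if f is not None]

def fetch_and_store(request):
    """Cloud Function entry point for the HTTP trigger."""
    import pandas as pd
    import spotipy
    from spotipy.oauth2 import SpotifyClientCredentials
    from google.cloud import storage

    # Credentials come from the SPOTIPY_* environment variables.
    sp = spotipy.Spotify(auth_manager=SpotifyClientCredentials())
    track_ids = [t["track"]["id"]
                 for t in sp.playlist_items("PLAYLIST_ID")["items"]]
    df = pd.DataFrame(rows_from_audio_features(sp.audio_features(track_ids)))

    # Write the CSV straight from memory into the Cloud Storage bucket.
    buf = io.StringIO()
    df.to_csv(buf, index=False)
    storage.Client().bucket("BUCKET_NAME").blob("audio_features.csv") \
        .upload_from_string(buf.getvalue(), content_type="text/csv")
    return "ok"
```

Landing the CSV in a bucket (instead of writing to BigQuery directly) is what lets the second Cloud Function react to a storage event.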
☝️Command Line vs. Web Console
Interacting with GCP services can be done via the CLI or web console (REST API being another option). Personally, I find the web console very user-friendly and can absolutely recommend it. In fact, I did this project mostly using the web frontend.
☁️ Cloud Functions
Two Cloud Functions were needed here: one for retrieving the data from Spotify and the other one for loading the data into BigQuery.
Besides the function itself there is one major difference between them: the trigger.
The first one is going to be triggered by an HTTP request (like opening a URL with a web browser). The second one by a Cloud Storage event (whenever a new CSV is stored in the Cloud Storage bucket).
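In Python, that difference shows up directly in the entry-point signature (the function names here are my own):

```python
# HTTP trigger: the entry point receives a Flask request object,
# and the returned string becomes the HTTP response body.
def pull_from_spotify(request):
    return "triggered over HTTP"

# Cloud Storage trigger: the entry point receives an event dict
# describing the object that was just finalized in the bucket.
def load_into_bigquery(event, context):
    return f"new file: gs://{event['bucket']}/{event['name']}"
```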
🐍 Gotcha alert: my Cloud Function was crashing until I added a ‘greater-than’ sign in my requirements.txt file.
So I basically changed this:

pandas==0.22.0
spotipy==2.4.4
google-cloud-storage==1.15.0

to this:

pandas>=0.22.0
spotipy>=2.4.4
google-cloud-storage>=1.15.0
The second Cloud Function is going to take the CSVs from the Cloud Storage bucket and load them into BigQuery.
Setting up a dataset and table in BigQuery is fairly easy with the web console. Just make sure not to miss any columns or data types.
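Put together, the second Cloud Function could look roughly like this. The dataset/table name and the schema are illustrative placeholders, not the project's exact definitions:

```python
# Sketch of the second Cloud Function (Cloud Storage-triggered):
# load a freshly stored CSV into a BigQuery table.
def gcs_uri(event):
    """Build the gs:// URI of the object that triggered the function."""
    return f"gs://{event['bucket']}/{event['name']}"

def load_into_bigquery(event, context):
    from google.cloud import bigquery

    client = bigquery.Client()
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,               # skip the CSV header row
        write_disposition="WRITE_APPEND",  # keep accumulating weekly data
        schema=[                           # assumed columns and types
            bigquery.SchemaField("id", "STRING"),
            bigquery.SchemaField("danceability", "FLOAT"),
            bigquery.SchemaField("energy", "FLOAT"),
            bigquery.SchemaField("tempo", "FLOAT"),
        ],
    )
    client.load_table_from_uri(
        gcs_uri(event), "my_dataset.audio_features", job_config=job_config
    ).result()  # block until the load job finishes
```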
☁️ Cloud Scheduler
We could now trigger the whole data pipeline manually by opening the URL endpoint of the first Cloud Function in a web browser. But that’s not how we do things here in the kingdom of Agrabah 🧞.
With the help of Cloud Scheduler we will automatically trigger the URL endpoint at a specified schedule.
Below is the schedule for “every Monday at 12:30 AM”. The target needs to be HTTP and the method POST.
30 00 * * 1
🐍 Picking a schedule using the unix-cron format is another gotcha. This website can help.
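As a quick reminder, the five fields read minute, hour, day of month, month, day of week; here is a tiny helper of my own to make the schedule above less cryptic:

```python
# Label the five unix-cron fields, left to right.
def cron_fields(expr):
    names = ["minute", "hour", "day of month", "month", "day of week"]
    return dict(zip(names, expr.split()))

# cron_fields("30 00 * * 1") reads: minute 30, hour 00,
# any day of month, any month, day of week 1 (Monday).
```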
After everything is set up, we can just fly away on our magic carpet and let the automation do its thing.
T H E E N D
🔗 Moving your cron job to the cloud with Google Cloud Functions by Dustin Ingram
🔗 My GitHub repo for this project (includes all the notebooks, scripts and more)