In this blog, I will discuss the process of creating a Google Cloud Platform (GCP) pipeline. We'll begin by ingesting data from our customer support ticketing system, Cherwell, and finish by visualising the data in Power BI.
You can take a look at our GCP services, which complement what I'll be explaining in this blog, here.
This blog provides an alternative to orchestrating your pipeline with Cloud Composer, using Cloud Functions and scheduled queries instead. You may prefer this method because it is cheaper; however, by not using Cloud Composer, you miss out on its built-in logging and monitoring.
This will be an ETL pipeline, which means the raw data will be stored outside the data warehouse, and transformations will be done before loading the data into the data warehouse. The alternative is an ELT pipeline, which means the raw data is loaded directly into the data warehouse and transformations are performed within.
I'm assuming you already have a functional GCP account and a project set up. You can create a free account here with $300 worth of credits, but make sure you shut down all the resources we create in this blog once you've finished, otherwise you may be charged. Before you read on, check out our Google Cloud Consultancy Services to see how we could advise you.
First, you need to create a service account to grant the services used in the GCP pipeline access to one another.
The API request may use confidential credentials, so it is essential not to leave them in plain text; you can avoid this by using 'secrets'. Secrets restrict access to confidential information, so only authorised users can see it.
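As a sketch of how a Cloud Function might read such a secret with Secret Manager (the project and secret names below are illustrative assumptions, not values from the original pipeline):

```python
# Sketch: reading an API credential from Secret Manager instead of
# hard-coding it. Project and secret ids here are illustrative.

def secret_resource_name(project_id: str, secret_id: str, version: str = "latest") -> str:
    """Build the fully qualified Secret Manager resource name."""
    return f"projects/{project_id}/secrets/{secret_id}/versions/{version}"

def get_secret(project_id: str, secret_id: str) -> str:
    """Fetch and decode the secret payload (requires GCP credentials)."""
    # Imported inside the function so the pure helper above can be used
    # without the google-cloud-secret-manager package installed.
    from google.cloud import secretmanager

    client = secretmanager.SecretManagerServiceClient()
    response = client.access_secret_version(
        request={"name": secret_resource_name(project_id, secret_id)}
    )
    return response.payload.data.decode("UTF-8")
```

The calling Cloud Function then passes the returned value into the API request header rather than storing it in source control.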
The function used in step 4 not only makes an API request to the third-party software but also saves the results to Cloud Storage as JSON files. To do this, we need to create a bucket and a directory to store them.
To kick off the pipeline, I created a Cloud Function that makes an API request to the third-party software. In this example, the third-party software is the managed service 'Cherwell'.
Below is the code used to make API requests to Cherwell and to the UK bank holidays feed:
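A minimal sketch of such an extract function follows. The endpoint, headers, and bucket name are illustrative assumptions; Cherwell's real authentication flow and endpoints are in its API documentation.

```python
# Sketch of the extract step: request JSON over HTTP and write it to a
# Cloud Storage bucket. Bucket and path names are illustrative.
import json
from datetime import date

def blob_path(prefix: str, name: str, run_date: date) -> str:
    """Build a dated object path so each run's raw file is kept separately."""
    return f"{prefix}/{run_date.isoformat()}/{name}.json"

def fetch_and_store(url: str, headers: dict, bucket_name: str, path: str) -> None:
    """Call the API and upload the response body as a JSON blob (needs GCP creds)."""
    import requests  # lazy imports keep the pure helper usable offline
    from google.cloud import storage

    response = requests.get(url, headers=headers, timeout=60)
    response.raise_for_status()

    bucket = storage.Client().bucket(bucket_name)
    bucket.blob(path).upload_from_string(
        json.dumps(response.json()), content_type="application/json"
    )

# Example call (hypothetical bucket name); the bank holidays feed is public,
# while a Cherwell call would need an auth token in the headers first:
# fetch_and_store("https://www.gov.uk/bank-holidays.json", {}, "my-raw-bucket",
#                 blob_path("bank-holidays", "holidays", date.today()))
```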
Below are the modules I used in the requirements.txt:
BigQuery is used as the data warehouse in GCP to store the data that will be displayed in the reports. It is important to manually create a dataset to hold the tables generated by the Dataflow transformations in step 6.
To run a Dataflow job that transforms your data, you first need to create a classic template in the programming language of your choice (Java or Python) and then stage this template using the Cloud Shell. This allows it to be run by the Cloud Function in step 6 via the Dataflow REST API.
Below is the code I used to create my template in Python called dataflow-cherwell.py:
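As a hedged sketch of what such a template might contain: an Apache Beam pipeline that reads the raw JSON from Cloud Storage, flattens each record, and writes rows to BigQuery. The paths, field names, destination table, and the assumption of one JSON record per line are all illustrative, not the original listing.

```python
# Sketch of a dataflow-cherwell.py-style Beam pipeline.
import json

def to_row(record: dict) -> dict:
    """Flatten one raw ticket record into a flat BigQuery row (illustrative fields)."""
    return {
        "ticket_id": record.get("id"),
        "status": record.get("status"),
        "created": record.get("created"),
    }

def run(argv=None):
    # Imported lazily so to_row() can be unit-tested without apache-beam.
    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    with beam.Pipeline(options=PipelineOptions(argv)) as p:
        (
            p
            | "Read raw JSON" >> beam.io.ReadFromText("gs://my-raw-bucket/cherwell/*.json")
            | "Parse" >> beam.Map(json.loads)  # assumes one JSON record per line
            | "Flatten" >> beam.Map(to_row)
            | "Write to BigQuery" >> beam.io.WriteToBigQuery(
                "my-project:cherwell_dataset.tickets",
                schema="ticket_id:STRING,status:STRING,created:STRING",
                write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
            )
        )

# Entry point when staging or running (kept commented in this sketch):
# if __name__ == "__main__":
#     run()
```

To stage this as a classic template, you would typically run the script once with `--runner DataflowRunner` and a `--template_location` pointing at a Cloud Storage path, which writes the template file instead of executing the pipeline.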
Below are the modules I used in the requirements.txt:
Below are the arguments I used to stage the template in the Cloud Shell:
The next step of the pipeline is to trigger a Dataflow job that transforms the JSON files stored in Cloud Storage. This function should fire as soon as the data arrives in the Cloud Storage bucket.
Below is the code used to trigger the Dataflow job once the files have been stored in the Cloud Storage bucket:
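A sketch of such a trigger function, launching the staged classic template through the Dataflow REST API via `google-api-python-client`. The project, region, and template path are illustrative assumptions:

```python
# Sketch of a Cloud Function triggered by a Cloud Storage finalize event
# that launches the staged Dataflow template.

def launch_body(job_name: str, parameters: dict) -> dict:
    """Build the request body for the templates.launch call."""
    return {"jobName": job_name, "parameters": parameters}

def trigger_dataflow(event, context):
    """Background Cloud Function entry point (google.storage.object.finalize)."""
    # Lazy import keeps launch_body() usable without google-api-python-client.
    from googleapiclient.discovery import build

    service = build("dataflow", "v1b3", cache_discovery=False)
    service.projects().locations().templates().launch(
        projectId="my-project",              # illustrative
        location="europe-west2",             # illustrative
        gcsPath="gs://my-bucket/templates/cherwell",  # staged template path
        body=launch_body(
            "cherwell-transform",
            {"input": f"gs://{event['bucket']}/{event['name']}"},
        ),
    ).execute()
```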
Below are the modules used in the requirements.txt:
BigQuery allows you to perform SQL queries and transformations on the stored data, and it can connect to data analytics software such as Power BI. Table transformations should be stored as views rather than tables: a view stores only its defining query, not a copy of the data, so it takes up significantly less storage space while you can still query it and connect it to reports.
Cloud Logging lets you retrieve the exact point in time your tables were created in BigQuery. You can then create a sink in Cloud Logging that publishes to a Pub/Sub topic every time this log entry is generated.
Doing this will publish a message to the Pub/Sub topic whenever that specific table is created.
You will now need to create a Cloud Function that is triggered by the Pub/Sub topic created in step 8. This Cloud Function will run the scheduled query you created in step 8.
Below is the code used to run the scheduled query you created in step 8:
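A hedged sketch of such a function, using the BigQuery Data Transfer client to start a manual run of the scheduled query (the project, location, and transfer-config id are illustrative placeholders):

```python
# Sketch of a Pub/Sub-triggered Cloud Function that kicks off the
# scheduled query immediately.

def transfer_config_name(project_id: str, location: str, config_id: str) -> str:
    """Build the resource name of the scheduled query's transfer config."""
    return f"projects/{project_id}/locations/{location}/transferConfigs/{config_id}"

def run_scheduled_query(event, context):
    """Pub/Sub-triggered entry point: request a run of the scheduled query now."""
    from datetime import datetime, timezone
    from google.cloud import bigquery_datatransfer  # lazy import: needs credentials

    client = bigquery_datatransfer.DataTransferServiceClient()
    client.start_manual_transfer_runs(
        request={
            "parent": transfer_config_name("my-project", "europe-west2", "my-config-id"),
            "requested_run_time": datetime.now(timezone.utc),
        }
    )
```

The transfer-config id is visible in the scheduled query's details page in the BigQuery console.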
Below is the module I used in the requirements.txt:
BigQuery can connect to many different data analytics tools, but in my pipeline I decided to use Power BI because it is the software I am most familiar with. You can connect Power BI to BigQuery using DirectQuery, which allows Power BI to run reports directly on the data held within BigQuery. The benefits of this are that reports run on current data, you don't have to wait for the report to load the data, and less storage is used.
Finally, we will create a Cloud Scheduler job that kicks off the pipeline at a frequency of your choice.
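You can create the job in the console, or sketch it in code along these lines (all names, the region, and the function URL are illustrative assumptions):

```python
# Sketch: a Cloud Scheduler job that calls the ingestion Cloud Function
# over HTTP every morning.

def cron_daily_at(hour: int, minute: int = 0) -> str:
    """Return a unix-cron expression for 'every day at hour:minute'."""
    return f"{minute} {hour} * * *"

def create_schedule(project_id: str, region: str, function_url: str) -> None:
    from google.cloud import scheduler_v1  # lazy import: needs credentials

    client = scheduler_v1.CloudSchedulerClient()
    client.create_job(
        parent=f"projects/{project_id}/locations/{region}",
        job={
            "name": f"projects/{project_id}/locations/{region}/jobs/cherwell-ingest",
            "schedule": cron_daily_at(6),  # 06:00 every day
            "time_zone": "Europe/London",
            "http_target": {
                "uri": function_url,
                "http_method": scheduler_v1.HttpMethod.GET,
            },
        },
    )
```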
You have now developed a scheduled pipeline which fetches raw data from Cherwell, stores it in Cloud Storage, transforms it using Dataflow, loads it into BigQuery, and displays the data in Power BI, all without the need for Cloud Composer.
I have not compared the data pipeline with and without the use of Cloud Composer. Therefore, I would highly recommend weighing up the pros and cons of each before picking a solution.
Feel free to swap Cherwell for third-party software of your choice by adjusting the API calls, such as changing the authentication method and endpoints, which can be found in that software's API documentation. Furthermore, if you use different third-party software, storing data in Cloud Storage and transforming it with Dataflow may not be necessary. The data retrieved from Cherwell was not in a suitable format to be stored directly in BigQuery, but this may not be the case elsewhere; loading directly into BigQuery would change the pipeline's architecture from ETL to ELT.
Power BI can also be replaced with any data analytics software that has a supported connector to GCP.