Dan Sullivan

As cloud adoption allows companies to scale their technical infrastructure, understanding how to leverage all the data stored in cloud servers becomes a competitive imperative. Public cloud platforms allow data scientists to gather deep insights because the services support the full life cycle of data science, from data exploration and collection through to deploying models or explaining one’s findings.

With hybrid and multi-cloud infrastructures becoming more popular, companies don’t have to rely on a single vendor for every cloud need. Data teams have a wide array of tools and platforms to choose from, and several factors influence those choices. With that in mind, I encourage companies and my students to consider Google Cloud Platform (GCP) solutions for any data science applications within their multi-cloud structure.

GCP as a public cloud infrastructure provider is growing in popularity and currently ranks as the third-largest public cloud provider after Amazon’s AWS and Microsoft’s Azure. Every cloud provider offers its own advantages and disadvantages when it comes to features, but where I feel GCP really stands out from its competitors is in data science and machine learning. 

In this article, I’ll share the five differentiators that make GCP a powerful tool for data science teams.

1. Ease of use

One of the first things users notice about GCP is how easy it is to get started with virtual machines and cloud storage. Data scientists can spin up virtual machines and containers, upload data, and start analysis jobs, all from a graphical user interface. GCP also provides reasonable defaults for many infrastructure configuration parameters, which means data scientists spend less time configuring things like firewall rules and security groups.

If you are working with large data sets, you can upload data to Cloud Storage, where you can choose among several categories of storage. If you need low-latency access to data from different geographic areas, you can use multi-region storage; less frequently accessed data can be stored in Nearline or Coldline storage. Again, all of this can be done through a graphical user interface.
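
To make the storage choice above concrete, here is a minimal Python sketch of how a team might encode it as a rule of thumb. The function name, the access-frequency cutoffs (about once a quarter for Coldline, about once a month for Nearline), and the returned labels are my own illustrative assumptions, not official guidance; check current Cloud Storage documentation and pricing before relying on them.

```python
from typing import Tuple


def suggest_storage(reads_per_year: float, global_readers: bool = False) -> Tuple[str, str]:
    """Suggest a (storage class, location type) pair for a data set.

    The cutoffs below are illustrative assumptions, not official guidance.
    """
    if reads_per_year < 0:
        raise ValueError("reads_per_year must be non-negative")

    # Readers spread across geographic areas benefit from multi-region placement.
    location = "multi-region" if global_readers else "region"

    if reads_per_year <= 4:
        # Roughly once a quarter or less: treat as cold data.
        storage_class = "COLDLINE"
    elif reads_per_year <= 12:
        # Roughly once a month or less: treat as infrequently accessed.
        storage_class = "NEARLINE"
    else:
        # Anything read more often stays in the standard class.
        storage_class = "STANDARD"

    return storage_class, location


if __name__ == "__main__":
    print(suggest_storage(reads_per_year=2))
    print(suggest_storage(reads_per_year=10))
    print(suggest_storage(reads_per_year=365, global_readers=True))
```

The sketch deliberately ignores retrieval fees and minimum storage durations, which in practice matter as much as read frequency.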

2. Range of compute options

GCP offers a variety of computing resources, so you can pick the configuration that best fits your needs. If you need full control over servers and operating systems, you can use Compute Engine. Managed instance groups make it easy to create instances and scale them up and down automatically depending on demand.

If you prefer to deploy containers, Kubernetes Engine offers managed clusters while Cloud Run is a serverless option for running stateless containers. Both Compute Engine and Kubernetes Engine support the use of GPUs and TPUs.

3. Managed services for data science

Spending time configuring and managing servers takes away from time that could be spent analyzing data and building models. With GCP, teams can use managed services to reduce the operational overhead of common data science work. 

Cloud Dataproc is a managed Spark/Hadoop service that allows you to quickly spin up clusters. Unlike on-premises Spark clusters, which typically run continually, Dataproc clusters are usually ephemeral. You start them when you need them and shut them down when your job finishes, which can lead to significant savings.
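
A rough back-of-the-envelope calculation shows where the savings from ephemeral clusters come from. The Python sketch below compares a cluster that runs around the clock with one that runs only while jobs execute. The hourly rate and job hours are made-up numbers for illustration, and the function name is hypothetical; the sketch also ignores cluster startup time, storage, and any discounts.

```python
def monthly_cluster_costs(hourly_rate: float, job_hours_per_day: float, days: int = 30) -> dict:
    """Compare an always-on cluster with an ephemeral one over a month.

    hourly_rate is the combined cost of the whole cluster per hour
    (a hypothetical figure supplied by the caller).
    """
    if not 0 <= job_hours_per_day <= 24:
        raise ValueError("job_hours_per_day must be between 0 and 24")

    # An always-on cluster is billed for every hour of every day.
    always_on = hourly_rate * 24 * days
    # An ephemeral cluster is billed only while jobs are running.
    ephemeral = hourly_rate * job_hours_per_day * days

    return {
        "always_on": always_on,
        "ephemeral": ephemeral,
        "savings": always_on - ephemeral,
    }


if __name__ == "__main__":
    # Assumed figures: a cluster costing 2.00 per hour, running jobs 3 hours a day.
    costs = monthly_cluster_costs(hourly_rate=2.0, job_hours_per_day=3)
    print(costs)  # {'always_on': 1440.0, 'ephemeral': 180.0, 'savings': 1260.0}
```

The fewer hours per day your jobs actually run, the larger the gap between the two figures.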

Cloud Dataflow is a managed service for stream and batch processing and is well suited for pre-processing large data sets prior to analysis. A more recent addition to the GCP set of services, Cloud Data Fusion, is also available for extract, transform, and load (ETL) and extract, load, and transform (ELT) workflows.

4. Build models with SQL

With so much structured data stored in relational databases, SQL is an essential data science skill. GCP offers BigQuery, a managed analytical database that uses SQL as its query language.

More importantly, BigQuery allows users to create machine learning models directly in SQL, including linear regression, binary and multi-class logistic regression, k-means clustering, time series forecasting, and XGBoost models. It also allows users to run TensorFlow models. If you want to work with SQL and need to scale up to petabyte-volume data sets, then BigQuery is an option to consider.
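
To give a sense of what building a model in SQL looks like, the Python sketch below assembles a BigQuery ML CREATE MODEL statement for a supervised model. The helper function, the dataset, table, and column names, and the churn scenario are all hypothetical. The sketch only builds the SQL text; actually running it would require a BigQuery client and credentials, which are not shown. It also interpolates names without escaping, so it should only be used with trusted input.

```python
from typing import List


def build_create_model_sql(
    model_name: str,
    model_type: str,
    label_column: str,
    source_table: str,
    feature_columns: List[str],
) -> str:
    """Build a CREATE MODEL statement for a supervised BigQuery ML model.

    Unsupervised model types such as k-means take no label column,
    so they are outside the scope of this sketch.
    """
    if not feature_columns:
        raise ValueError("at least one feature column is required")

    # The label column must appear in the training query alongside the features.
    columns = ", ".join([*feature_columns, label_column])

    return (
        f"CREATE OR REPLACE MODEL `{model_name}`\n"
        f"OPTIONS (model_type = '{model_type}', input_label_cols = ['{label_column}']) AS\n"
        f"SELECT {columns}\n"
        f"FROM `{source_table}`"
    )


if __name__ == "__main__":
    # Hypothetical names: a churn classifier trained on a customers table.
    print(
        build_create_model_sql(
            model_name="my_dataset.churn_model",
            model_type="logistic_reg",
            label_column="churned",
            source_table="my_dataset.customers",
            feature_columns=["tenure_months", "monthly_spend"],
        )
    )
```

The point of the example is that training is expressed as an ordinary query: the SELECT defines the training data, and the OPTIONS clause names the model type and the label.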

5. Telling your story

Once you’ve finished your analysis, it’s time to build out the story behind the data and share those results across your organization. Tools like Google Data Studio enable teams to build interactive dashboards, including visualizations that can help non-technical team members better understand the data story.

Data Studio integrates with BigQuery as well as other services, including Google Analytics and Google Ads. With Google’s acquisition of the popular business intelligence platform Looker, customers now have a high-end analytics and reporting platform to help make sense of the growing volume of data pouring into a company.

To start taking advantage of GCP’s data capabilities, I recommend that you and your team build a foundational knowledge of the platform by completing the Google Associate Cloud Engineer Certification. Preparing for this exam builds an understanding of the GCP fundamentals required to plan and configure cloud solutions, monitor cloud operations, deploy applications, manage your company’s cloud environment, and more.

Page Last Updated: July 2020