Google Cloud Data Engineer Practice Exam Questions

Free GCP Certification Exam Topics Tests

Over the past few months, I’ve been helping software developers, solutions architects, DevOps engineers, and even Scrum Masters who have been displaced by AI and ML technologies learn new skills and earn new accreditations by getting certified on technologies that are in critically high demand.

In my opinion, one of the most reputable organizations providing credentials is Google, and one of their most respected designations is that of the Certified Google Cloud Professional Data Engineer.

So how do you get Google certified, and how do you do it quickly? I have a plan that has now helped thousands, and it’s a pretty simple strategy.

Google Cloud Certification Practice Exams

First, pick your designation of choice. In this case, it’s Google’s Professional Data Engineer certification.

Then look up the exam objectives and make sure they match your career goals and competencies.

The next step?

It’s not buying an online course or study guide. Instead, find a Google Professional Data Engineer exam simulator or a set of practice questions for the GCP Data Engineer exam. Yes, find a set of Data Engineer sample questions first and use them to drive your study.

First, go through your practice tests and just look at the GCP exam questions and answers. That will help you get familiar with what you know and what you don’t know.

When you find topics you don’t know, use AI and Machine Learning powered tools like ChatGPT, Cursor, or Claude to write tutorials for you on the topic.

Really take control of your learning and have the new AI and ML tools help you customize your learning experience by writing tutorials that teach you exactly what you need to know to pass the exam. It’s an entirely new way of learning.

About GCP Exam Dumps

And one thing I will say is this: try to avoid the Google Cloud Professional Data Engineer exam dumps. You want to get certified honestly; you don’t want to pass simply by memorizing somebody’s GCP Data Engineer braindump. There’s no integrity in that.

If you do want some real Google Cloud Data Engineer exam questions, I have over a hundred free exam questions and answers on my website, with almost 300 free exam questions and answers available if you register. But there are plenty of other great resources available on LinkedIn Learning, Udemy, and even YouTube, so check those out as well to help fine-tune your learning path.

The bottom line? Generative AI is changing the IT landscape in disruptive ways, and IT professionals need to keep up. One way to do that is to constantly update your skills.

Get learning, get certified, and stay on top of all the latest trends. You owe it to your future self to stay trained, stay employable, and stay knowledgeable about how to use and apply all of the latest technologies.

Now for the GCP Certified Data Engineer exam questions.

Git, GitHub & GitHub Copilot Certification Made Easy

Want to get certified on the most popular AI, ML & DevOps technologies of the day? These five resources will help you get GitHub certified in a hurry.

Get certified in the latest AI, ML and DevOps technologies. Advance your career today.

GCP Data Engineer Practice Exams

Question 1

At Nimbus Outfitters you are planning storage for 25 TB of CSV files as part of a new analytics pipeline on Google Cloud. Teams across the company will run aggregate queries with different processing engines while the files remain in Cloud Storage. You want to keep the cost of running these aggregate queries as low as possible and still allow shared access across tools. Which storage approach and schema setup should you choose?

  • ❏ A. Use Cloud Bigtable and run queries from an HBase shell on a Compute Engine VM

  • ❏ B. Store the data in Cloud Storage and create temporary external tables in BigQuery whenever users need to query

  • ❏ C. Keep the files in Cloud Storage and define permanent external tables in BigQuery that reference the CSV objects

  • ❏ D. Load the files into partitioned BigQuery managed tables and query directly in BigQuery

Question 2

Is it possible to convert an existing single master Cloud Dataproc cluster to a high availability cluster with three masters using gcloud, and if so which command should you run?

  • ❏ A. gcloud dataproc clusters repair my-ha-cluster --masters=3

  • ❏ B. You cannot change the master node count after the cluster is created

  • ❏ C. gcloud dataproc clusters update my-ha-cluster --num-masters=3

  • ❏ D. gcloud dataproc clusters create my-ha-cluster --num-masters=3

Question 3

At Meridian Metrics you plan to train a BigQuery ML linear regression model that estimates the likelihood that a site visitor will buy an item. Your source table includes a string field for the customer’s city which is known to be highly predictive. You want to keep preprocessing inside BigQuery with very little custom code while retaining the full signal from this categorical feature. What should you do?

  • ❏ A. Create a BigQuery view that removes the city field before model training

  • ❏ B. Use ML.HASH_BUCKET on the city field to turn it into a single numeric hash feature and train on that representation

  • ❏ C. Apply ML.TRANSFORM with ONE_HOT_ENCODER to the city field and train on the transformed output

  • ❏ D. Build a TensorFlow preprocessing pipeline that generates a city vocabulary and connect it to BigQuery ML

Question 4

Given 300 TB of training data accessed roughly every 30 days, with each job reading only a small subset, which Google Cloud Storage class provides low cost while remaining reliable and highly available?

  • ❏ A. Cloud Storage Archive

  • ❏ B. Cloud Storage Nearline class

  • ❏ C. Cloud Storage Coldline storage

Question 5

At AuroraRetail Co., you ingest several terabytes of event data from Google Analytics 4 into BigQuery each day. Customer attributes such as preferences and loyalty tiers are stored in two transactional systems. One is a Cloud SQL for MySQL instance and the other is a Cloud SQL for PostgreSQL instance that backs your CRM. The growth team wants to combine behavioral events with customer records to target customers active in the last year. They plan to run these campaigns about 120 times on a regular day and up to 360 times during major promotions. You must support frequent queries without placing heavy read load on the Cloud SQL systems. What should you do?

  • ❏ A. Create BigQuery connections to both Cloud SQL databases and run federated queries that join Cloud SQL tables with the BigQuery events for each campaign

  • ❏ B. Set up Datastream to continuously replicate the necessary tables from both Cloud SQL instances into BigQuery and run all campaign queries only in BigQuery

  • ❏ C. Trigger a Dataproc Serverless Spark job for each campaign to read from both Cloud SQL databases and from BigQuery directly

  • ❏ D. Create read replicas for both Cloud SQL databases and point BigQuery federated queries at the replicas to isolate the primaries

Question 6

You need event-driven orchestration where each new file in Cloud Storage triggers a Dataproc normalization followed by BigQuery transformations for about 350 tables, and the transformations can run for up to four hours. Which approach will minimize maintenance?

  • ❏ A. BigQuery Data Transfer Service with scheduled queries every 45 minutes

  • ❏ B. Cloud Composer DAG per table triggered by Cloud Storage finalize via Cloud Functions that runs Dataproc then BigQuery

  • ❏ C. Workflows triggered by Cloud Storage finalize via Eventarc that calls Dataproc then BigQuery

Question 7

BeaconPlay is a media startup that serves soccer fans around the globe. The platform offers live broadcasts and an on-demand library of recorded matches, and the lead engineer wants viewers to have consistent playback quality for the recorded videos no matter where they are located. Which Google Cloud service should be used to efficiently deliver the on-demand content to a worldwide audience?

  • ❏ A. Cloud Storage multi-region

  • ❏ B. Cloud Load Balancing

  • ❏ C. Cloud CDN

  • ❏ D. Cloud Storage Nearline

Question 8

Which data model is best suited for semistructured asset records in an interactive application where the schema changes approximately every three weeks?

  • ❏ A. Snowflake schema

  • ❏ B. Wide-column model

  • ❏ C. Document model

  • ❏ D. Star schema

Question 9

At Aurora Press you keep analytical files in both Google Cloud Storage and in Amazon S3, and everything is stored in North America. Analysts need to run up to date queries in BigQuery regardless of which cloud holds the data, and they must not receive direct permissions on either set of buckets. What should you implement to let them query the data through BigQuery while avoiding direct bucket access?

  • ❏ A. Use the Storage Transfer Service to replicate S3 objects into Cloud Storage and then build BigLake tables over the Cloud Storage data to query from BigQuery

  • ❏ B. Configure a BigQuery Omni connection to the S3 location and create external tables over data in both Cloud Storage and S3 for direct querying in BigQuery

  • ❏ C. Set up a BigQuery Omni connection to the S3 buckets and create BigLake tables that reference objects in both Cloud Storage and S3, then query them from BigQuery

  • ❏ D. Build a Dataflow pipeline that loads files from S3 into partitioned BigQuery native tables every 45 minutes and run queries on those tables

Question 10

In a streaming pipeline that ingests from Pub/Sub, processes with Dataflow, and writes to BigQuery, about 4% of events are malformed. How should you modify the Dataflow pipeline so that only valid records are written to BigQuery?

  • ❏ A. Define a Partition to split valid and invalid then drop the invalid branch

  • ❏ B. Use a ParDo to validate and emit only valid records

  • ❏ C. Configure a Pub/Sub dead-letter topic

  • ❏ D. Rely on BigQuery schema enforcement to reject bad rows

Question 11

After spending four days loading CSV files into a BigQuery table named WEB_EVENT_LOGS for Clearwater Goods, you realize the column evt_epoch stores event timestamps as strings that represent UNIX epoch times because you initially set every field to STRING for speed. You need to calculate session durations from these events and you want evt_epoch available as a TIMESTAMP so that future filters and joins are efficient. You want to make the smallest possible change while keeping future queries fast. What should you do?

  • ❏ A. Drop WEB_EVENT_LOGS, recreate it with evt_epoch defined as TIMESTAMP, and reload all historical data from the CSV files

  • ❏ B. Add two new columns named event_ts as TIMESTAMP and is_new as BOOLEAN, then reload all data in append mode with is_new set to true and query only those rows going forward

  • ❏ C. Add a TIMESTAMP column named event_ts to WEB_EVENT_LOGS, backfill it by converting evt_epoch, and use event_ts for all future queries

  • ❏ D. Write a query that casts evt_epoch to TIMESTAMP and writes the results to a new table WEB_EVENT_LOGS_NEW using event_ts as the TIMESTAMP column, then switch all pipelines and reports to the new table

  • ❏ E. Create a view named WEB_EVENT_VIEW that casts evt_epoch to TIMESTAMP on the fly and point all future queries to the view

Question 12

Which command sets the serve-ml Deployment to 5 replicas?

  • ❏ A. kubectl rollout restart deployment serve-ml

  • ❏ B. kubectl autoscale deployment serve-ml --min=5 --max=5

  • ❏ C. kubectl scale deployment serve-ml --replicas=5

  • ❏ D. gcloud container clusters resize fraud-cluster --num-nodes=5 --zone=us-central1-a

Question 13

Harborline Outfitters keeps tens of millions of records in a BigQuery date partitioned table named retail_ops.sales_events, and dashboards at example.com and internal services run aggregation queries dozens of times per minute. Each request calculates AVG, MAX and SUM across only the most recent 12 months of data, and the base table must preserve all historical rows for auditing. You want results that include brand new inserts while keeping compute cost, upkeep, and latency very low. What should you implement?

  • ❏ A. Enable BigQuery BI Engine and query retail_ops.sales_events with a filter for the last 12 months of partitions

  • ❏ B. Create a scheduled query that rebuilds a 12 month aggregate summary table every 30 minutes

  • ❏ C. Create a materialized view that aggregates retail_ops.sales_events and restricts it to the last 12 months of partitions

  • ❏ D. Create a materialized view on retail_ops.sales_events and configure a partition expiration policy on the base table so only the last 12 months are kept

Question 14

How should you design service account access to PII stored in Cloud Storage to enforce least privilege and enable auditability?

  • ❏ A. One shared service account per project with CMEK on the buckets

  • ❏ B. Default Compute Engine service account with Project Editor for all workloads

  • ❏ C. Service accounts per workload with IAM groups and least privilege roles

  • ❏ D. Individual service accounts for each employee to access data

Question 15

A fintech named OrionPay needs to orchestrate a multi stage analytics workflow that chains several Dataproc jobs and downstream Dataflow pipelines with strict task dependencies. The team wants a fully managed approach that provides retries, monitoring, and parameterized runs, and they must trigger it every weekday at 0315 UTC. Which Google Cloud service should they use to design and schedule this pipeline?

  • ❏ A. Workflows

  • ❏ B. Cloud Composer

  • ❏ C. Cloud Scheduler

  • ❏ D. Dataproc Workflow Templates

Question 16

In Dataplex, how should you assign roles so engineers have full control of the sales lake and analysts have read-only access to curated data in the refined zone while maintaining governance within Dataplex?

  • ❏ A. Grant the dataplex.dataReader role on the sales lake to engineering and grant the dataplex.dataOwner role on the refined zone to analytics

  • ❏ B. Assign dataplex.dataOwner on the sales lake to engineering and assign dataplex.dataReader on the refined zone to analytics

  • ❏ C. Grant BigQuery and Cloud Storage IAM roles directly on datasets and buckets for each group

  • ❏ D. Use Data Catalog policy tags and share BigQuery datasets without Dataplex roles

Question 17

HarborLight Retail needs to run both scheduled batch loads and real time event streams in Google Cloud Dataflow, and leaders expect predictable execution with correct aggregates even when some records show up late or arrive out of order. How should you design the pipeline so that results remain accurate in the presence of late and out of order events?

  • ❏ A. Configure sliding windows wide enough to cover lagging records

  • ❏ B. Use a single global window to simplify aggregation across all events

  • ❏ C. Assign event time timestamps and configure watermarks with allowed lateness and triggers

  • ❏ D. Enable Pub/Sub message ordering and rely on processing time windows for consistency

Question 18

Which Google Cloud solution provides global, strongly consistent ACID transactions with SQL access and supports concurrent updates across multiple regions at approximately 30 million operations per day?

  • ❏ A. Cloud SQL with BigQuery federation

  • ❏ B. Cloud Spanner with locking read write transactions

  • ❏ C. AlloyDB for PostgreSQL with read replicas

Question 19

Aurora Streams is moving its legacy warehouse to BigQuery and wants stronger collaboration across about 24 internal groups. The company needs a design that lets data producers securely publish curated read only datasets that others can easily discover and subscribe to without tickets. They also want subscribers to read the freshest data while keeping storage and operational costs low. Which approach should they use?

  • ❏ A. Grant bigquery.dataViewer on each producer dataset to every subscribing team

  • ❏ B. Use BigQuery Data Transfer Service to replicate shared datasets into a central exchange project on an hourly schedule

  • ❏ C. Publish datasets through BigQuery Analytics Hub and let teams subscribe to linked datasets

  • ❏ D. Catalog producer datasets in Dataplex and control access with tag based IAM for consumer projects

Question 20

For a Pub/Sub push subscription, how should you configure retry behavior and dead lettering so messages survive short outages, retry with gradual delays, and are routed to a different topic after 10 delivery attempts?

  • ❏ A. Use immediate retry and enable dead lettering to a different topic with a cap of 10 delivery attempts

  • ❏ B. Use exponential backoff for retries and configure a dead letter topic that is different from the source with a maximum of 10 delivery attempts

  • ❏ C. Set the acknowledgement deadline to 20 minutes

Google Professional Data Engineer Sample Questions Answered

Question 1

At Nimbus Outfitters you are planning storage for 25 TB of CSV files as part of a new analytics pipeline on Google Cloud. Teams across the company will run aggregate queries with different processing engines while the files remain in Cloud Storage. You want to keep the cost of running these aggregate queries as low as possible and still allow shared access across tools. Which storage approach and schema setup should you choose?

  • ✓ C. Keep the files in Cloud Storage and define permanent external tables in BigQuery that reference the CSV objects

The correct option is Keep the files in Cloud Storage and define permanent external tables in BigQuery that reference the CSV objects.

This approach lets all teams keep using the same CSV objects in Cloud Storage while BigQuery provides a shared schema and consistent SQL interface. With permanent external tables you avoid ingesting or duplicating 25 TB into BigQuery storage, which keeps storage spending down and preserves cross‑tool access to the files. You can manage schema centrally and control access with IAM, and queries are billed on the bytes BigQuery scans without extra data movement.
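For illustration, here is roughly what that permanent external table definition could look like through the Python BigQuery client. The project, dataset, and bucket names are hypothetical.

    from google.cloud import bigquery

    client = bigquery.Client()

    # Hypothetical dataset and bucket names. The CSV files stay in Cloud Storage
    # and BigQuery stores only the table definition, so other engines can keep
    # reading the same objects.
    client.query("""
    CREATE OR REPLACE EXTERNAL TABLE `my_project.analytics.order_exports`
    OPTIONS (
      format = 'CSV',
      uris = ['gs://nimbus-analytics-raw/orders/*.csv'],
      skip_leading_rows = 1
    )
    """).result()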

Use Cloud Bigtable and run queries from an HBase shell on a Compute Engine VM is incorrect because Cloud Bigtable is designed for low latency operational workloads rather than ad hoc analytical aggregation on CSV files. It does not provide a SQL analytics layer for scanning large flat files and moving 25 TB into Bigtable would add complexity without meeting the requirement to keep files in Cloud Storage for shared use.

Store the data in Cloud Storage and create temporary external tables in BigQuery whenever users need to query is incorrect because ad hoc temporary definitions add friction and inconsistency. The query cost is not lower than using a reusable definition and repeated setup hampers governance and reuse, whereas a permanent external definition provides a stable shared schema with less operational overhead.

Load the files into partitioned BigQuery managed tables and query directly in BigQuery is incorrect because it moves the data into BigQuery storage which duplicates the 25 TB and prevents other tools from using the same Cloud Storage files. Although partitioning can reduce scanned bytes for some queries, it violates the requirement to keep the files in Cloud Storage and adds ongoing BigQuery storage cost.

When the requirement says the files must remain in Cloud Storage and be accessible to multiple tools, think BigQuery external tables. Prefer a permanent external table for shared schemas and consistent access rather than ad hoc temporary definitions.

Question 2

Is it possible to convert an existing single master Cloud Dataproc cluster to a high availability cluster with three masters using gcloud, and if so which command should you run?

  • ✓ B. You cannot change the master node count after the cluster is created

The correct option is You cannot change the master node count after the cluster is created.

In Cloud Dataproc the number of master nodes is fixed at cluster creation time. If you need a high availability cluster with three masters you must create a new cluster with three masters and then move jobs or workflows to it. You cannot convert an existing single master cluster into a three master cluster using any gcloud operation.
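If you want to script the replacement cluster rather than click through the console, a sketch with the Python Dataproc client could look like the following. The project, region, cluster name, and machine types are placeholders.

    from google.cloud import dataproc_v1

    region = "us-central1"
    client = dataproc_v1.ClusterControllerClient(
        client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
    )

    # Hypothetical names. High availability is chosen at creation time by asking
    # for three masters; an existing single master cluster cannot be converted.
    operation = client.create_cluster(
        request={
            "project_id": "my-project",
            "region": region,
            "cluster": {
                "project_id": "my-project",
                "cluster_name": "my-ha-cluster",
                "config": {
                    "master_config": {"num_instances": 3, "machine_type_uri": "n1-standard-4"},
                    "worker_config": {"num_instances": 2, "machine_type_uri": "n1-standard-4"},
                },
            },
        }
    )
    operation.result()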

gcloud dataproc clusters repair my-ha-cluster --masters=3 is incorrect because the repair operation can replace failed instances or adjust workers but it does not support changing the number of master nodes, and the shown flag is not a supported way to add masters.

gcloud dataproc clusters update my-ha-cluster --num-masters=3 is incorrect because the update command does not allow modifying the master count and there is no supported flag to change masters on an existing cluster.

gcloud dataproc clusters create my-ha-cluster --num-masters=3 is incorrect for this question because it creates a new high availability cluster rather than converting an existing single master cluster.

When you see questions about changing core cluster topology, look for hints that the setting is immutable. If update or repair commands do not list a flag to change it, the safest answer is that you must recreate the resource with the desired configuration.

Question 3

At Meridian Metrics you plan to train a BigQuery ML linear regression model that estimates the likelihood that a site visitor will buy an item. Your source table includes a string field for the customer’s city which is known to be highly predictive. You want to keep preprocessing inside BigQuery with very little custom code while retaining the full signal from this categorical feature. What should you do?

  • ✓ C. Apply ML.TRANSFORM with ONE_HOT_ENCODER to the city field and train on the transformed output

The correct option is Apply ML.TRANSFORM with ONE_HOT_ENCODER to the city field and train on the transformed output.

This approach keeps preprocessing inside BigQuery with minimal custom code and preserves the full predictive signal from the categorical city field. One hot encoding creates a separate binary indicator for each city so the linear model can learn a distinct weight per category. Using the TRANSFORM clause also cleanly separates feature engineering from model definition which makes the workflow simpler and repeatable.
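As a rough sketch of the idea, the model definition might look something like the following, issued through the Python BigQuery client. The project, dataset, and feature names are hypothetical, and the exact one hot encoder function name and its OVER() clause should be confirmed against the current BigQuery ML preprocessing reference.

    from google.cloud import bigquery

    client = bigquery.Client()

    # Hypothetical names; assumes the ML.ONE_HOT_ENCODER preprocessing function
    # inside the TRANSFORM clause, which keeps the encoding attached to the model.
    client.query("""
    CREATE OR REPLACE MODEL `my_project.web_ml.purchase_model`
    TRANSFORM(
      ML.ONE_HOT_ENCODER(city) OVER() AS city_encoded,
      visit_count,
      label
    )
    OPTIONS (model_type = 'linear_reg', input_label_cols = ['label']) AS
    SELECT city, visit_count, purchased AS label
    FROM `my_project.web_analytics.visits`
    """).result()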

Create a BigQuery view that removes the city field before model training is incorrect because it discards a highly predictive feature which reduces model performance.

Use ML.HASH_BUCKET on the city field to turn it into a single numeric hash feature and train on that representation is incorrect because hashing compresses many categories into limited buckets which introduces collisions and loses information. A single numeric code can also mislead a linear model by implying an arbitrary ordering.

Build a TensorFlow preprocessing pipeline that generates a city vocabulary and connect it to BigQuery ML is unnecessary and adds complexity. The requirement is to keep preprocessing in BigQuery with little custom code and the built in transform with one hot encoding already meets that need.

When a categorical string feature is strongly predictive and you want to keep work inside BigQuery, think of TRANSFORM with one hot encoding. Hashing trades signal for compactness and removing the feature wastes information.

Question 4

Given 300 TB of training data accessed roughly every 30 days, with each job reading only a small subset, which Google Cloud Storage class provides low cost while remaining reliable and highly available?

  • ✓ B. Cloud Storage Nearline class

The correct option is Cloud Storage Nearline class. It provides low storage cost for data that is accessed about once per month while remaining highly durable and available.

Cloud Storage Nearline class is optimized for infrequent access on the order of 30 days. The per gigabyte retrieval fees remain manageable when each training job reads only a small subset of the data and the 30 day minimum storage duration matches the usage pattern. It retains very high durability and offers strong availability across regional, dual region, or multi region locations.
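A minimal sketch of creating a bucket with Nearline as its default storage class, using the Python client and a hypothetical bucket name:

    from google.cloud import storage

    client = storage.Client()

    # NEARLINE becomes the default class for new objects, which suits data that
    # is read roughly once every 30 days.
    bucket = storage.Bucket(client, name="nimbus-training-data")
    bucket.storage_class = "NEARLINE"
    client.create_bucket(bucket, location="US")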

Cloud Storage Archive targets data that is rarely accessed such as once a year. It has a much longer minimum storage duration and higher retrieval costs and latency, which make it unsuitable and more expensive for data you touch every month.

Cloud Storage Coldline storage is aimed at data accessed roughly once a quarter. It has a 90 day minimum storage duration and higher retrieval costs, so monthly access would typically cost more than Nearline and would not be the best fit.

Match the storage class to access frequency. Use Nearline for about monthly access, Coldline for quarterly, and Archive for yearly, and always factor in retrieval charges and minimum storage durations when only a small subset is read.

Question 5

At AuroraRetail Co., you ingest several terabytes of event data from Google Analytics 4 into BigQuery each day. Customer attributes such as preferences and loyalty tiers are stored in two transactional systems. One is a Cloud SQL for MySQL instance and the other is a Cloud SQL for PostgreSQL instance that backs your CRM. The growth team wants to combine behavioral events with customer records to target customers active in the last year. They plan to run these campaigns about 120 times on a regular day and up to 360 times during major promotions. You must support frequent queries without placing heavy read load on the Cloud SQL systems. What should you do?

  • ✓ B. Set up Datastream to continuously replicate the necessary tables from both Cloud SQL instances into BigQuery and run all campaign queries only in BigQuery

The correct option is Set up Datastream to continuously replicate the necessary tables from both Cloud SQL instances into BigQuery and run all campaign queries only in BigQuery.

This approach continuously captures changes from both Cloud SQL for MySQL and PostgreSQL and brings them into BigQuery with near real time freshness. You remove read pressure from the transactional systems and you run all 120 to 360 daily campaign queries inside BigQuery where large joins with GA4 event data scale well. Change data capture ensures the customer attributes remain current so you can reliably target customers active in the last year.
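Once the tables are replicated, every campaign run is just a BigQuery query. A sketch of what that join might look like, assuming Datastream lands the Cloud SQL tables in hypothetical datasets named mysql_crm and pg_crm and that the GA4 export uses the usual events_* tables:

    from google.cloud import bigquery

    client = bigquery.Client()

    # Hypothetical dataset and column names.
    audience = client.query("""
    SELECT c.customer_id, c.email, l.loyalty_tier
    FROM `my_project.ga4_events.events_*` AS e
    JOIN `my_project.mysql_crm.customers` AS c ON c.customer_id = e.user_id
    JOIN `my_project.pg_crm.loyalty` AS l ON l.customer_id = c.customer_id
    WHERE PARSE_DATE('%Y%m%d', e._TABLE_SUFFIX)
          >= DATE_SUB(CURRENT_DATE(), INTERVAL 1 YEAR)
    GROUP BY c.customer_id, c.email, l.loyalty_tier
    """).result()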

Create BigQuery connections to both Cloud SQL databases and run federated queries that join Cloud SQL tables with the BigQuery events for each campaign is not suitable because each query still reads from Cloud SQL and adds connection and throughput overhead. Federated queries have limitations and quotas and they do not scale well for frequent large joins, which risks performance issues on the databases.

Trigger a Dataproc Serverless Spark job for each campaign to read from both Cloud SQL databases and from BigQuery directly adds unnecessary complexity and latency and it repeatedly pulls from Cloud SQL which creates the same read load problem. The workload is analytic SQL that fits BigQuery better than spinning up many Spark jobs throughout the day.

Create read replicas for both Cloud SQL databases and point BigQuery federated queries at the replicas to isolate the primaries still leaves the replicas handling many ad hoc analytical reads and the same federation limits apply. This does not match the scale and frequency needed and it increases operational burden without solving the core load and scalability concerns.

When you see frequent analytical joins across BigQuery and OLTP data, think about using change data capture to land the operational tables in BigQuery and avoid direct federation so you protect transactional systems and gain scalable performance.

Question 6

You need event-driven orchestration where each new file in Cloud Storage triggers a Dataproc normalization followed by BigQuery transformations for about 350 tables, and the transformations can run for up to four hours. Which approach will minimize maintenance?

  • ✓ B. Cloud Composer DAG per table triggered by Cloud Storage finalize via Cloud Functions that runs Dataproc then BigQuery

The correct option is Cloud Composer DAG per table triggered by Cloud Storage finalize via Cloud Functions that runs Dataproc then BigQuery.

This approach is event driven from Cloud Storage object finalize events and uses Cloud Functions only as a lightweight trigger. The orchestration and dependency management live in Cloud Composer which is managed Airflow. It can fan out across hundreds of tables with clear task dependencies, retries, and monitoring, and it can monitor Dataproc jobs and BigQuery jobs that may run for hours. Using native operators for Dataproc and BigQuery reduces custom code and keeps maintenance low as your pipelines scale.

Composer is designed for heterogeneous workloads where a Spark job must run before SQL transforms. Airflow operators and sensors handle long running operations and backoff without you writing custom polling loops. You also get centralized logging, alerting, and parameterization so you can version and update per table logic in a consistent way.
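A stripped down DAG for one table might look something like this. The project, bucket, cluster, and stored procedure names are placeholders, and the Cloud Function is assumed to trigger the DAG through the Airflow REST API with the finalized object name in dag_run.conf.

    from datetime import datetime

    from airflow import DAG
    from airflow.providers.google.cloud.operators.bigquery import BigQueryInsertJobOperator
    from airflow.providers.google.cloud.operators.dataproc import DataprocSubmitJobOperator

    with DAG(
        dag_id="orders_table_pipeline",
        start_date=datetime(2024, 1, 1),
        schedule_interval=None,  # event driven rather than time based
        catchup=False,
    ) as dag:
        normalize = DataprocSubmitJobOperator(
            task_id="dataproc_normalize",
            project_id="my-project",
            region="us-central1",
            job={
                "placement": {"cluster_name": "etl-cluster"},
                "pyspark_job": {
                    "main_python_file_uri": "gs://my-etl-code/normalize.py",
                    "args": ["--input", "{{ dag_run.conf['object_name'] }}"],
                },
            },
        )

        transform = BigQueryInsertJobOperator(
            task_id="bigquery_transform",
            configuration={
                "query": {
                    "query": "CALL `my-project.warehouse.transform_orders`()",
                    "useLegacySql": False,
                }
            },
        )

        normalize >> transform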

BigQuery Data Transfer Service with scheduled queries every 45 minutes is not event driven and introduces delay and unnecessary runs when no new files arrive. It only schedules BigQuery SQL and cannot orchestrate a Dataproc job, which means it does not meet the requirement to start Dataproc on each file arrival.

Workflows triggered by Cloud Storage finalize via Eventarc that calls Dataproc then BigQuery can be wired to the event, but you would need to handcraft API calls and polling for Dataproc and BigQuery and then build fan out for roughly 350 tables. That increases operational code and complexity compared with managed Airflow operators and DAG patterns, which makes it a higher maintenance choice for this scale.

When you see heterogeneous steps across services and many parallel table pipelines, prefer managed orchestration with native operators and event triggers. Map each requirement to a service capability and confirm it supports event driven starts, long running jobs, and clear dependency management.

Question 7

BeaconPlay is a media startup that serves soccer fans around the globe. The platform offers live broadcasts and an on-demand library of recorded matches, and the lead engineer wants viewers to have consistent playback quality for the recorded videos no matter where they are located. Which Google Cloud service should be used to efficiently deliver the on-demand content to a worldwide audience?

  • ✓ C. Cloud CDN

The correct option is Cloud CDN.

Cloud CDN caches frequently accessed content at Google edge locations worldwide. This reduces latency and helps deliver consistent playback quality for recorded videos to viewers everywhere. It integrates with Cloud Storage and HTTP or HTTPS load balancers and serves media efficiently. Caching video segments near users reduces origin load and improves throughput, which is exactly what is needed for a global on demand library.

Cloud Storage multi-region stores objects redundantly across multiple locations for durability and availability. It does not provide edge caching or global content acceleration, so it alone cannot ensure low latency playback for a worldwide audience.

Cloud Load Balancing distributes traffic across backends and regions for scalability and uptime. It does not cache content at the edge and is not a content delivery network, so it will not on its own provide the consistent global performance needed for recorded video delivery.

Cloud Storage Nearline is a storage class designed for infrequently accessed data. It has higher access and retrieval costs and is not intended for serving frequently watched media, and it does not provide global delivery optimizations.

When a question emphasizes global delivery and low latency for static or recorded media, map the requirement to a CDN. Storage classes address cost and durability and load balancing addresses backend distribution, while the CDN solves edge caching and geographic proximity.

Question 8

Which data model is best suited for semistructured asset records in an interactive application where the schema changes approximately every three weeks?

  • ✓ C. Document model

The correct option is Document model. It best supports semi structured asset records in an interactive app where the schema changes about every three weeks.

A document database stores each asset as a self contained document that can include nested fields and arrays. This model allows fields to vary across records and supports adding or removing attributes without costly migrations. It fits interactive workloads because you can read and update entire documents efficiently and you can evolve the schema incrementally as requirements change.
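As a concrete example of that flexibility, here is how an asset record might be written to Firestore, which is one managed document store on Google Cloud. The collection and field names are made up, and a new field can simply appear in later documents without a migration.

    from google.cloud import firestore

    db = firestore.Client()
    db.collection("assets").document("asset-10482").set(
        {
            "name": "Forklift 7",
            "site": {"code": "WH-04", "city": "Reno"},
            "sensors": [{"type": "temp", "unit": "C"}, {"type": "vibration"}],
            "firmware_version": "2.7.1",  # field introduced in a later schema revision
        }
    )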

Snowflake schema targets analytical data warehouses with highly normalized dimensions and a rigid structure, which makes frequent schema changes disruptive and is not intended for interactive application access patterns.

Wide-column model is optimized for very large scale workloads with predictable access patterns and requires careful design of rows and column families, which makes frequent and ad hoc schema evolution harder and it does not naturally treat a single nested entity as a first class record.

Star schema is built for analytics and aggregation rather than operational interactivity, and it relies on a predefined structure where changes ripple through ETL and reporting, which is not ideal when the schema changes every few weeks.

Map the workload to the model. If you see interactive app and frequently changing semi structured data, think of a document store. If you see analytics and aggregations, think of star or snowflake schemas instead.

Question 9

At Aurora Press you keep analytical files in both Google Cloud Storage and in Amazon S3, and everything is stored in North America. Analysts need to run up to date queries in BigQuery regardless of which cloud holds the data, and they must not receive direct permissions on either set of buckets. What should you implement to let them query the data through BigQuery while avoiding direct bucket access?

  • ✓ C. Set up a BigQuery Omni connection to the S3 buckets and create BigLake tables that reference objects in both Cloud Storage and S3, then query them from BigQuery

The correct answer is Set up a BigQuery Omni connection to the S3 buckets and create BigLake tables that reference objects in both Cloud Storage and S3, then query them from BigQuery.

This approach lets analysts run in place and up to date queries across both clouds while keeping object permissions isolated. BigQuery Omni executes the processing near the Amazon S3 data so the data remains in AWS, and BigLake tables provide a unified BigQuery table interface over data in both Cloud Storage and S3. Access is enforced through BigQuery IAM on the tables rather than through direct permissions on the buckets, so analysts get only the BigQuery roles they need.

Because the tables reference the files directly, the queries reflect the latest objects without replication delays. You deploy the BigQuery Omni connection in the appropriate North America AWS region and manage the BigLake metadata in BigQuery so regional constraints are respected.
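The table definitions might look roughly like the following, with hypothetical connection IDs, datasets, and bucket paths. The first statement builds a BigLake table over Cloud Storage through a Cloud resource connection and the second builds one over S3 through a BigQuery Omni AWS connection.

    from google.cloud import bigquery

    client = bigquery.Client()

    client.query("""
    CREATE OR REPLACE EXTERNAL TABLE `my_project.analytics_us.gcs_sales`
    WITH CONNECTION `my_project.us.gcs_biglake_conn`
    OPTIONS (format = 'PARQUET', uris = ['gs://aurora-press-analytics/sales/*.parquet'])
    """).result()

    client.query("""
    CREATE OR REPLACE EXTERNAL TABLE `my_project.analytics_aws.s3_sales`
    WITH CONNECTION `my_project.aws-us-east-1.s3_omni_conn`
    OPTIONS (format = 'PARQUET', uris = ['s3://aurora-press-analytics/sales/*.parquet'])
    """).result()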

Use the Storage Transfer Service to replicate S3 objects into Cloud Storage and then build BigLake tables over the Cloud Storage data to query from BigQuery is not ideal because it duplicates data and introduces transfer schedules and lag, so queries are not truly in place or guaranteed to be current. It also sidesteps the requirement to query the data where it resides in either cloud.

Configure a BigQuery Omni connection to the S3 location and create external tables over data in both Cloud Storage and S3 for direct querying in BigQuery is not correct because cross cloud object storage querying with BigQuery Omni uses BigLake tables rather than classic external tables. BigLake provides the fine grained, BigQuery based authorization needed to avoid granting bucket permissions.

Build a Dataflow pipeline that loads files from S3 into partitioned BigQuery native tables every 45 minutes and run queries on those tables adds latency and operational overhead. It fails the up to date requirement because analysts may not see the latest data between loads.

When a scenario asks for cross cloud analytics that are in place, up to date, and without bucket permissions, prefer a combination of BigQuery querying through BigQuery Omni and governed access with BigLake tables rather than replication or batch ETL.

Question 10

In a streaming pipeline that ingests from Pub/Sub, processes with Dataflow, and writes to BigQuery, about 4% of events are malformed. How should you modify the Dataflow pipeline so that only valid records are written to BigQuery?

  • ✓ B. Use a ParDo to validate and emit only valid records

The correct option is Use a ParDo to validate and emit only valid records.

This approach lets you perform per element validation and parsing in your DoFn, then emit only records that pass validation to the main output. You can route malformed events to a separate side output for monitoring or drop them entirely. This keeps BigQuery streaming inserts clean and efficient since only valid rows are written and it avoids unnecessary retries and backlogs caused by rejected rows.
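Here is a small runnable sketch of that pattern in the Beam Python SDK. The field names and sample inputs are made up, and in the real pipeline the input PCollection would come from the Pub/Sub read.

    import json

    import apache_beam as beam

    class ValidateEvents(beam.DoFn):
        """Emit parsed valid events on the main output and bad records on a side output."""

        def process(self, message):
            try:
                event = json.loads(message.decode("utf-8"))
                if "user_id" not in event or "event_ts" not in event:
                    raise ValueError("missing required fields")
                yield event  # main output holds only valid records
            except Exception:
                yield beam.pvalue.TaggedOutput("invalid", message)

    with beam.Pipeline() as p:
        messages = p | beam.Create([b'{"user_id": "u1", "event_ts": 1700000000}', b"not json"])
        results = messages | beam.ParDo(ValidateEvents()).with_outputs("invalid", main="valid")
        results.valid | "KeepValid" >> beam.Map(print)      # would be written to BigQuery
        results.invalid | "KeepInvalid" >> beam.Map(print)  # optional dead letter sink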

Define a Partition to split valid and invalid then drop the invalid branch is not correct because Partition is intended for routing into a fixed number of buckets and it adds complexity without benefit for simple pass or fail validation. The recommended pattern for validation and optional dead lettering is to use a DoFn based transform with separate outputs.

Configure a Pub/Sub dead-letter topic is not correct because Pub/Sub dead lettering handles delivery failures when subscribers cannot successfully process or acknowledge messages. It does not perform content validation and it will not ensure that only valid records are written to BigQuery in a running Dataflow pipeline.

Rely on BigQuery schema enforcement to reject bad rows is not correct because letting BigQuery reject malformed rows causes insert errors and retries in streaming pipelines. This can increase latency and costs and it can also create backlogs. Validation should happen in the pipeline before writing.

When a question asks how to keep only valid data flowing into storage, think about applying validation in the transform layer. A ParDo with side outputs is often the cleanest way to filter and optionally capture bad records without slowing the sink.

Question 11

After spending four days loading CSV files into a BigQuery table named WEB_EVENT_LOGS for Clearwater Goods, you realize the column evt_epoch stores event timestamps as strings that represent UNIX epoch times because you initially set every field to STRING for speed. You need to calculate session durations from these events and you want evt_epoch available as a TIMESTAMP so that future filters and joins are efficient. You want to make the smallest possible change while keeping future queries fast. What should you do?

  • ✓ C. Add a TIMESTAMP column named event_ts to WEB_EVENT_LOGS, backfill it by converting evt_epoch, and use event_ts for all future queries

The correct option is Add a TIMESTAMP column named event_ts to WEB_EVENT_LOGS, backfill it by converting evt_epoch, and use event_ts for all future queries.

This approach uses BigQuery schema evolution to add a new nullable column without dropping or reloading the table. You can run a one time update to convert the string UNIX epoch to a TIMESTAMP using a function such as TIMESTAMP_SECONDS with a CAST, which preserves the existing data and pipelines. Once populated, queries can filter and join on a native TIMESTAMP which avoids per row casts and keeps future queries efficient. It is also the smallest change because the table name and references remain the same.
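The whole change can be two statements, shown here through the Python BigQuery client with a hypothetical project and dataset name.

    from google.cloud import bigquery

    client = bigquery.Client()
    table = "my_project.weblogs.WEB_EVENT_LOGS"  # hypothetical project and dataset

    # Add the typed column, then backfill it once from the string epoch values.
    client.query(f"ALTER TABLE `{table}` ADD COLUMN IF NOT EXISTS event_ts TIMESTAMP").result()
    client.query(f"""
    UPDATE `{table}`
    SET event_ts = TIMESTAMP_SECONDS(CAST(evt_epoch AS INT64))
    WHERE event_ts IS NULL
    """).result()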

Drop WEB_EVENT_LOGS, recreate it with evt_epoch defined as TIMESTAMP, and reload all historical data from the CSV files is unnecessary and disruptive. It requires a full reload and coordination with downstream users and it does not provide any advantage over adding and backfilling a new column.

Add two new columns named event_ts as TIMESTAMP and is_new as BOOLEAN, then reload all data in append mode with is_new set to true and query only those rows going forward duplicates data and complicates queries. It leaves historical rows unfixed unless you also backfill and it forces filters on a flag that adds operational risk without improving performance.

Write a query that casts evt_epoch to TIMESTAMP and writes the results to a new table WEB_EVENT_LOGS_NEW using event_ts as the TIMESTAMP column, then switch all pipelines and reports to the new table creates a parallel table and requires updates to pipelines and permissions. It is a larger change and duplicates storage when a simple in place schema addition and backfill is sufficient.

Create a view named WEB_EVENT_VIEW that casts evt_epoch to TIMESTAMP on the fly and point all future queries to the view keeps the column as a string and performs the cast at query time. This can slow filters and joins and prevents storage level optimizations on a typed column, so it does not meet the requirement to keep future queries fast.

When a question asks for the smallest change that keeps future queries efficient, prefer adding a nullable column and doing a one time backfill rather than rebuilding tables, creating new tables, or relying on views that compute types at query time.

Question 12

Which command sets the serve-ml Deployment to 5 replicas?

  • ✓ C. kubectl scale deployment serve-ml --replicas=5

The correct option is kubectl scale deployment serve-ml --replicas=5.

This command directly sets the replicas field on the Deployment to five and the ReplicaSet will create or remove Pods immediately to match that desired state. It is the straightforward way to change the number of Pods for a Deployment to an exact count.

kubectl rollout restart deployment serve-ml only forces a rolling restart of the existing Pods and it does not change the replica count at all.

kubectl autoscale deployment serve-ml --min=5 --max=5 creates a HorizontalPodAutoscaler which manages replica counts based on metrics rather than setting an immediate fixed count. It also depends on a metrics pipeline and is not the direct command to scale a Deployment to an exact number right now.

gcloud container clusters resize fraud-cluster --num-nodes=5 --zone=us-central1-a changes the number of nodes in the GKE cluster and it does not change the number of Pods in a specific Deployment.

Match the action to the Kubernetes resource. If the question targets a Deployment and asks for an exact replica count then choose the command that sets replicas directly rather than autoscaling or changing cluster size.

Question 13

Harborline Outfitters keeps tens of millions of records in a BigQuery date partitioned table named retail_ops.sales_events, and dashboards at example.com and internal services run aggregation queries dozens of times per minute. Each request calculates AVG, MAX and SUM across only the most recent 12 months of data, and the base table must preserve all historical rows for auditing. You want results that include brand new inserts while keeping compute cost, upkeep, and latency very low. What should you implement?

  • ✓ C. Create a materialized view that aggregates retail_ops.sales_events and restricts it to the last 12 months of partitions

The correct option is Create a materialized view that aggregates retail_ops.sales_events and restricts it to the last 12 months of partitions.

A materialized view precomputes AVG, MAX, and SUM and incrementally refreshes only the portions of data that change. This gives very low latency and cost for dashboards and services that run frequent aggregate queries. Restricting the materialized view to the most recent 12 months means queries scan far less data while the base table continues to hold all historical rows for auditing. BigQuery can also rewrite compatible queries to use the materialized view which reduces operational upkeep because clients do not need to change their SQL.
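A sketch of the shape this could take, with hypothetical column names. Here the view pre aggregates per partition date and the dashboard query applies the 12 month filter over the view, which keeps the scan small while the base table retains full history.

    from google.cloud import bigquery

    client = bigquery.Client()

    client.query("""
    CREATE MATERIALIZED VIEW `my_project.retail_ops.sales_daily_mv`
    PARTITION BY sale_date
    AS
    SELECT sale_date,
           SUM(amount) AS total_amount,
           COUNT(*) AS n_sales,
           MAX(amount) AS max_amount
    FROM `my_project.retail_ops.sales_events`
    GROUP BY sale_date
    """).result()

    rows = client.query("""
    SELECT SUM(total_amount) / SUM(n_sales) AS avg_amount,
           MAX(max_amount) AS max_amount,
           SUM(total_amount) AS sum_amount
    FROM `my_project.retail_ops.sales_daily_mv`
    WHERE sale_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 12 MONTH)
    """).result()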

Enable BigQuery BI Engine and query retail_ops.sales_events with a filter for the last 12 months of partitions is not the best fit because BI Engine is an in memory acceleration layer that does not precompute or incrementally maintain aggregates. You still pay for repeated scans or a large reservation and you do not get the same cost savings and simplicity that a preaggregated result provides.

Create a scheduled query that rebuilds a 12 month aggregate summary table every 30 minutes is inefficient and increases maintenance. It introduces staleness between runs and repeatedly recomputes the entire window which drives cost and fails the requirement for near real time results.

Create a materialized view on retail_ops.sales_events and configure a partition expiration policy on the base table so only the last 12 months are kept violates the requirement to preserve all historical rows for auditing because an expiration policy would delete older partitions from the base table.

When you see frequent aggregate queries that must stay fresh with low latency and cost, think materialized views. If the problem mentions an auditing need, avoid any option that expires or deletes base data.

Question 14

How should you design service account access to PII stored in Cloud Storage to enforce least privilege and enable auditability?

  • ✓ C. Service accounts per workload with IAM groups and least privilege roles

The correct option is Service accounts per workload with IAM groups and least privilege roles. The other options do not meet the requirements for least privilege and strong auditing.

This approach assigns a unique identity to each workload which lets you grant only the minimum Cloud Storage roles that workload needs at the bucket or even prefix level. Because each workload uses its own service account, Cloud Audit Logs clearly attribute every access to a distinct principal which improves traceability for PII access and simplifies incident response. Managing permissions through groups streamlines administration at scale while still keeping fine grained bindings to specific buckets, prefixes, and keys. You can also pair this with CMEK by granting only the necessary Cloud KMS key roles to the same workload identities which keeps both data access and encryption permissions tightly scoped.

This design reduces blast radius because revoking a single workload’s access is as simple as removing its service account from a group or role binding. It aligns with Google’s guidance to avoid basic roles and to use narrowly scoped predefined or custom roles for storage and key access which strengthens least privilege and auditability for sensitive data.
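As a small example of the binding itself, here is one way to grant a single workload identity read only access on a bucket using the Python Cloud Storage client. The bucket and service account names are placeholders.

    from google.cloud import storage

    client = storage.Client()
    bucket = client.bucket("hr-pii-exports")  # hypothetical bucket

    # Grant only object read access to the one workload identity that needs it.
    policy = bucket.get_iam_policy(requested_policy_version=3)
    policy.bindings.append(
        {
            "role": "roles/storage.objectViewer",
            "members": {"serviceAccount:payroll-report-job@my-project.iam.gserviceaccount.com"},
        }
    )
    bucket.set_iam_policy(policy)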

One shared service account per project with CMEK on the buckets is incorrect because a shared identity prevents per workload attribution in audit logs and violates least privilege. CMEK improves encryption control but it does not fix the lack of identity separation when many services use the same account.

Default Compute Engine service account with Project Editor for all workloads is incorrect because the default account is broadly shared and the Editor basic role is overly permissive. This combination undermines least privilege and makes it difficult to audit which workload accessed PII.

Individual service accounts for each employee to access data is incorrect because service accounts are intended for non human workloads. Human access should use user identities and groups with strong controls and approvals and should not rely on per person service accounts for PII.

When a question mentions PII or auditability, choose designs that give each workload its own identity and grant only the needed roles on specific resources. Avoid basic roles and the default service account. Remember that CMEK complements but does not replace least privilege IAM.

Question 15

A fintech named OrionPay needs to orchestrate a multi stage analytics workflow that chains several Dataproc jobs and downstream Dataflow pipelines with strict task dependencies. The team wants a fully managed approach that provides retries, monitoring, and parameterized runs, and they must trigger it every weekday at 0315 UTC. Which Google Cloud service should they use to design and schedule this pipeline?

  • ✓ B. Cloud Composer

The correct answer is Cloud Composer because it is a fully managed Apache Airflow service that can orchestrate multi stage pipelines across Dataproc and Dataflow with strict task dependencies, includes retries and monitoring, supports parameterized runs, and can be scheduled to run every weekday at 0315 UTC.

Airflow DAGs let you define ordered tasks that submit Dataproc jobs and then start Dataflow pipelines using native operators and sensors. You can configure per task retries and get centralized logging and monitoring in the service. You can pass parameters through DAG run configuration or templated fields and you can set a weekday cron schedule that runs at the required UTC time.
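The weekday 0315 UTC requirement maps to the cron expression 15 3 * * 1-5 on the DAG, assuming the environment keeps the default UTC timezone. A skeleton with placeholder project, cluster, and template names might look like this.

    from datetime import datetime

    from airflow import DAG
    from airflow.providers.google.cloud.operators.dataflow import DataflowTemplatedJobStartOperator
    from airflow.providers.google.cloud.operators.dataproc import DataprocSubmitJobOperator

    with DAG(
        dag_id="orionpay_analytics",
        start_date=datetime(2024, 1, 1),
        schedule_interval="15 3 * * 1-5",  # 03:15 UTC, Monday through Friday
        catchup=False,
    ) as dag:
        spark_stage = DataprocSubmitJobOperator(
            task_id="dataproc_aggregate",
            project_id="my-project",
            region="us-central1",
            job={
                "placement": {"cluster_name": "analytics-cluster"},
                "pyspark_job": {"main_python_file_uri": "gs://orionpay-jobs/aggregate.py"},
            },
        )

        dataflow_stage = DataflowTemplatedJobStartOperator(
            task_id="dataflow_enrich",
            project_id="my-project",
            location="us-central1",
            template="gs://orionpay-jobs/templates/enrich",
            parameters={"run_date": "{{ ds }}"},
        )

        spark_stage >> dataflow_stage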

Workflows can orchestrate API calls and it supports retries and parameter passing, however it lacks the rich Airflow operators for Dataproc and Dataflow and it does not include native cron scheduling on its own, so you would need an extra scheduler and more custom logic to manage complex task dependencies.

Cloud Scheduler only provides time based triggers for HTTP targets or Pub/Sub topics and it cannot model multi step dependencies or orchestrate Dataproc and Dataflow tasks with per task retries and detailed monitoring.

Dataproc Workflow Templates can orchestrate sequences of Dataproc jobs with dependencies and parameters, however they do not natively include Dataflow steps and would still need an external scheduler for weekday runs, so they do not meet the cross service orchestration requirement.

Match the requirement to the orchestration level. If you need to chain Dataproc and Dataflow with strict dependencies and retries and monitoring then look for the managed Airflow option. Use Cloud Scheduler only for simple time based triggers and consider Dataproc Workflow Templates when all steps are Dataproc. Workflows fits API centric flows but usually pairs with a scheduler.

Question 16

In Dataplex, how should you assign roles so engineers have full control of the sales lake and analysts have read-only access to curated data in the refined zone while maintaining governance within Dataplex?

  • ✓ B. Assign dataplex.dataOwner on the sales lake to engineering and assign dataplex.dataReader on the refined zone to analytics

The correct option is Assign dataplex.dataOwner on the sales lake to engineering and assign dataplex.dataReader on the refined zone to analytics.

Granting the dataOwner role on the lake to engineering centralizes data access through Dataplex and gives engineers comprehensive read and write access to the data in all assets attached to that lake. Granting the dataReader role on the refined zone to analytics limits analysts to read-only access on curated data in that specific zone. This follows the principle of least privilege and keeps governance in Dataplex because data roles applied at the lake or zone scope propagate to underlying BigQuery and Cloud Storage assets that are attached.

This approach cleanly separates responsibilities. Engineers can fully work with data across the lake while analysts are constrained to curated datasets in the refined zone. It also avoids scattering IAM on individual datasets and buckets which preserves centralized governance and simpler audits.

Grant the dataplex.dataReader role on the sales lake to engineering and grant the dataplex.dataOwner role on the refined zone to analytics is incorrect because it inverts the needed privileges. Engineers would be unable to make necessary changes in the lake and analysts would gain write capabilities in the refined zone which violates least privilege for curated data access.

Grant BigQuery and Cloud Storage IAM roles directly on datasets and buckets for each group is incorrect because it bypasses Dataplex and fragments governance. The requirement is to keep governance in Dataplex which means you should grant data roles at the lake or zone level so access is centrally managed and consistently applied to attached assets.

Use Data Catalog policy tags and share BigQuery datasets without Dataplex roles is incorrect because policy tags provide column level controls within BigQuery and do not manage access across all lake assets. This does not satisfy the need to keep governance in Dataplex or cover Cloud Storage assets that belong to the lake.

When a scenario says to keep governance in Dataplex prefer Dataplex data roles at the lake or zone. Map ownership or write needs to dataOwner and consumption to dataReader rather than granting IAM directly on datasets and buckets.

Question 17

HarborLight Retail needs to run both scheduled batch loads and real time event streams in Google Cloud Dataflow, and leaders expect predictable execution with correct aggregates even when some records show up late or arrive out of order. How should you design the pipeline so that results remain accurate in the presence of late and out of order events?

  • ✓ C. Assign event time timestamps and configure watermarks with allowed lateness and triggers

The correct option is Assign event time timestamps and configure watermarks with allowed lateness and triggers.

This approach uses event time to place each record in the correct logical window which preserves the true time semantics of the data. Watermarks provide a best effort signal of how far event time has progressed so the pipeline knows when it likely has seen all on time data for a window. Allowed lateness lets the window remain open for a bounded period so late records can still update results. Triggers control when to emit early on time and late results so you can produce timely outputs and then refine them as more data arrives. With appropriate accumulation mode the pipeline can update aggregates when late events show up which keeps results correct and predictable for both batch and streaming runs.
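A fragment of what that looks like in the Beam Python SDK, assuming events is a PCollection of dicts carrying an epoch second event_ts field. The window size, lateness allowance, and trigger cadence are illustrative.

    import apache_beam as beam
    from apache_beam.transforms.trigger import (
        AccumulationMode,
        AfterCount,
        AfterProcessingTime,
        AfterWatermark,
    )
    from apache_beam.transforms.window import FixedWindows, TimestampedValue

    windowed = (
        events
        | "ToEventTime" >> beam.Map(lambda e: TimestampedValue(e, e["event_ts"]))
        | "Window" >> beam.WindowInto(
            FixedWindows(5 * 60),
            trigger=AfterWatermark(early=AfterProcessingTime(60), late=AfterCount(1)),
            allowed_lateness=10 * 60,
            accumulation_mode=AccumulationMode.ACCUMULATING,
        )
    )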

Configure sliding windows wide enough to cover lagging records is not sufficient because widening windows only trades latency for some tolerance of delay and it still cannot guarantee correctness for arbitrarily late or out of order events. Without event time semantics watermarks allowed lateness and triggers the pipeline will either drop late data or place it in the wrong window.

Use a single global window to simplify aggregation across all events removes natural boundaries which leads to unbounded state and makes it difficult to reason about completeness. Even with triggers you lose predictable finality for aggregates and you still need event time watermarks and allowed lateness to handle out of order and late arrivals in a controlled way.

Enable Pub/Sub message ordering and rely on processing time windows for consistency does not address the core problem because ordering is not guaranteed end to end and processing time windows reflect when Dataflow sees messages rather than when events actually occurred. This leads to misattributed counts and incorrect aggregates whenever events are delayed or arrive out of order.

When a question mentions late or out of order events choose event time windowing with watermarks plus allowed lateness and triggers rather than processing time or message ordering. Then think about how results should accumulate as late data arrives.

Question 18

Which Google Cloud solution provides global, strongly consistent ACID transactions with SQL access and supports concurrent updates across multiple regions at approximately 30 million operations per day?

  • ✓ B. Cloud Spanner with locking read write transactions

The correct option is Cloud Spanner with locking read write transactions. It is the only Google Cloud database that delivers global, strongly consistent ACID transactions with SQL while supporting concurrent updates across multiple regions at the described scale.

Locking read write transactions in this service provide serializable isolation for reads and writes which is the highest level of transactional correctness for concurrent updates. It uses TrueTime to achieve external consistency across regions and replicates data synchronously so reads and writes remain strongly consistent worldwide. The workload of about 30 million operations per day is well within its horizontally scalable architecture.
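A minimal sketch of a locking read write transaction with the Python Spanner client, using made up instance, database, table, and account identifiers.

    from google.cloud import spanner

    client = spanner.Client()
    database = client.instance("orionpay-global").database("payments")

    def move_funds(transaction):
        # Both updates commit atomically under Spanner's serializable locking
        # semantics, even when the accounts are written from different regions.
        transaction.execute_update(
            "UPDATE Accounts SET balance = balance - 50 WHERE account_id = 'A-100'"
        )
        transaction.execute_update(
            "UPDATE Accounts SET balance = balance + 50 WHERE account_id = 'B-200'"
        )

    database.run_in_transaction(move_funds)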

Cloud SQL with BigQuery federation is not suitable because Cloud SQL is a regional service and cross region replication is asynchronous which does not provide strongly consistent multi region writes. Federation in BigQuery is for analytical querying of external data and it does not offer transactional guarantees or support for distributed ACID updates.

AlloyDB for PostgreSQL with read replicas is also not suitable because it is a regional system and its replicas are for reads. It does not offer globally strongly consistent multi region write transactions or external consistency for concurrent updates across regions.

When you see requirements that include global scope, strongly consistent ACID transactions, and multi region concurrency with SQL, map directly to Spanner with locking read write transactions. Performance figures like tens of millions of operations per day are a good fit for horizontally scalable distributed databases.

Question 19

Aurora Streams is moving its legacy warehouse to BigQuery and wants stronger collaboration across about 24 internal groups. The company needs a design that lets data producers securely publish curated read only datasets that others can easily discover and subscribe to without tickets. They also want subscribers to read the freshest data while keeping storage and operational costs low. Which approach should they use?

  • ✓ C. Publish datasets through BigQuery Analytics Hub and let teams subscribe to linked datasets

The correct approach is Publish datasets through BigQuery Analytics Hub and let teams subscribe to linked datasets.

With Analytics Hub producers can publish curated read only datasets as listings that subscribers can easily discover and subscribe to. A subscription creates linked datasets in the consumer project that reference the publisher tables in place, which means queries always see the freshest data. Because linked datasets do not copy storage, costs remain low and producers operate a single authoritative dataset while sharing at scale without tickets.

Grant bigquery.dataViewer on each producer dataset to every subscribing team is difficult to scale across about 24 groups and offers no discovery or subscription workflow. It increases administrative overhead and encourages ticket driven access management even though it can provide freshness.

Use BigQuery Data Transfer Service to replicate shared datasets into a central exchange project on an hourly schedule duplicates data, raises storage cost, and introduces staleness between runs. The service is designed for scheduled transfers and copies rather than a publisher subscriber exchange model.

Catalog producer datasets in Dataplex and control access with tag based IAM for consumer projects improves governance and discovery, yet it does not provide a subscription model with in place access. Tag based controls do not replace dataset level sharing in BigQuery, so teams would still need direct role management and would not get the simplicity and freshness of a linked dataset approach.

Map requirements for easy discovery, many consumers, read only sharing, freshest reads, and low storage to Analytics Hub with linked datasets. Options that copy data on a schedule usually mean higher cost and staler results.

Question 20

For a Pub/Sub push subscription, how should you configure retry behavior and dead lettering so messages survive short outages, retry with gradual delays, and are routed to a different topic after 10 delivery attempts?

  • ✓ B. Use exponential backoff for retries and configure a dead letter topic that is different from the source with a maximum of 10 delivery attempts

The correct option is Use exponential backoff for retries and configure a dead letter topic that is different from the source with a maximum of 10 delivery attempts.

This configuration satisfies all three requirements. Exponential backoff spaces out push delivery retries which helps messages survive short outages without overwhelming the endpoint and the interval grows as failures continue. A dead letter policy then moves the message to a separate topic after the tenth failed delivery which prevents loops and provides a clear handoff path for failed processing.
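A sketch of that subscription configuration with the Python Pub/Sub client follows. The project, topic, and endpoint names are placeholders, and the backoff bounds are just one reasonable choice.

    from google.cloud import pubsub_v1
    from google.protobuf import duration_pb2

    project = "my-project"
    subscriber = pubsub_v1.SubscriberClient()

    subscriber.create_subscription(
        request={
            "name": f"projects/{project}/subscriptions/orders-push-sub",
            "topic": f"projects/{project}/topics/orders",
            "push_config": {"push_endpoint": "https://example.com/pubsub/push"},
            "retry_policy": {
                "minimum_backoff": duration_pb2.Duration(seconds=10),   # start of exponential backoff
                "maximum_backoff": duration_pb2.Duration(seconds=600),  # cap on the delay
            },
            "dead_letter_policy": {
                "dead_letter_topic": f"projects/{project}/topics/orders-dead-letter",
                "max_delivery_attempts": 10,
            },
        }
    )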

Use immediate retry and enable dead lettering to a different topic with a cap of 10 delivery attempts is incorrect because immediate retry can flood the endpoint during an outage and it does not provide gradual retry behavior.

Set the acknowledgement deadline to 20 minutes is incorrect because the acknowledgement deadline does not control push retry pacing and it does not configure a dead letter route or enforce a delivery attempt limit.

When you see requirements for surviving short outages and gradual retries and routing after a fixed number of attempts, choose exponential backoff with a dead letter topic and set maxDeliveryAttempts to the specified value.

Jira, Scrum & AI Certification

Want to get certified on the most popular software development technologies of the day? These resources will help you get Jira certified, Scrum certified and even AI Practitioner certified so your resume really stands out.

You can even get certified in the latest AI, ML and DevOps technologies. Advance your career today.

Cameron McKenzie is an AWS Certified AI Practitioner, Machine Learning Engineer, Copilot Expert, Solutions Architect and author of many popular books in the software development and Cloud Computing space. His growing YouTube channel training devs in Java, Spring, AI and ML has well over 30,000 subscribers.