Certified GCP Data Engineer Sample Questions and Answers

Free GCP Certification Exam Topics Tests

The Google Cloud Professional Data Engineer exam validates your ability to design, build, and operationalize data processing systems on Google Cloud that support analytics, machine learning, governance, and reliability.

It focuses on core domains such as storage and data modeling, batch and stream processing, orchestration, security and compliance, monitoring, and cost optimization across services like BigQuery, Cloud Storage, Pub/Sub, Dataflow, Dataproc, Data Fusion, and Cloud Composer.

About GCP Exam Dumps

To prepare effectively, begin with GCP Professional Data Engineer Practice Questions, explore Real GCP Certified Data Engineer Exam Questions, and review a concise set of scenarios in the Professional Data Engineer Braindump designed for legitimate study use.

You can also drill with the Google Certified Data Engineer Exam Simulator, browse focused GCP Certified Professional Data Engineer Questions and Answers, skim a rapid-fire Google Certified Data Engineer Exam Dump for quick checks, try GCP Professional Data Engineer Sample Questions, and finish with a timed run on the Google Certified Data Engineer Exam Questions or a full Google Certified Professional Data Engineer Practice Test.

Each section of the Data Engineer Questions and Answers collection is designed to teach as well as test.

These materials reinforce essential Google Cloud data engineering concepts and provide clear explanations that help you understand why specific responses are correct, preparing you to think like an experienced data engineer.

Google Cloud Certification Practice Exams

For complete readiness, use the Exam Simulator and take full-length Practice Tests that reproduce the pacing and structure of the actual Google Cloud certification so you can manage time effectively and build confidence under test conditions.


Google Data Engineer Sample Questions

Question 1

Riverton Outfitters runs its enterprise analytics on BigQuery where fresh data is ingested every day and a transformation workflow reshapes it for business teams. The workflow changes frequently and some defects are not discovered until about 21 days later. You need a recovery approach that supports late detection while keeping backup storage costs low. How should you organize your BigQuery tables and manage backups?

  • ❏ A. Keep all records in a single BigQuery table and schedule compressed exports to Cloud Storage

  • ❏ B. Partition data into monthly tables and mirror them into a second dataset in BigQuery

  • ❏ C. Store data in separate monthly tables and regularly export and compress each month to Cloud Storage

  • ❏ D. Use monthly tables and rely on BigQuery snapshot decorators to restore the table to a prior state

Question 2

In Google Cloud, attempts to create a user managed key for a service account fail in both the Google Cloud console and the gcloud CLI. What is the most likely cause?

  • ❏ A. Missing roles/iam.serviceAccountKeyAdmin

  • ❏ B. VPC Service Controls perimeter blocks IAM API calls

  • ❏ C. Org Policy with constraints/iam.disableServiceAccountKeyCreation

  • ❏ D. The service account is disabled

Question 3

You manage the analytics platform at HarborTech Analytics which relies on BigQuery for storage. Your streaming pipelines read from Pub/Sub and load data into multiple BigQuery tables. After deploying an updated release of the ingestion jobs you observe a 60% jump in total daily bytes stored in BigQuery while Pub/Sub throughput remains unchanged. Only a few tables show daily partitions that are roughly twice as large as usual. What should you do to investigate and fix the underlying cause without risking data loss?

  • ❏ A. Create scheduled queries that periodically remove duplicates from the inflated partitions and distribute the cleanup SQL to other teams so they can run it if they see similar growth

  • ❏ B. Reconfigure the pipelines to use the BigQuery Storage Write API with exactly once semantics and depend on it to handle duplicate records automatically instead of investigating further

  • ❏ C. Inspect the impacted tables for duplicate rows then use BigQuery Audit Logs to find the jobs writing to them and correlate Dataflow job start times and code versions in Cloud Monitoring and stop any older pipeline instances that still write to the same sinks

  • ❏ D. Roll back to the prior pipeline release and restore the tables using BigQuery time travel to the point before the rollout then restart the Dataflow workers and seek the Pub/Sub subscription cursor back to the deployment timestamp to reprocess data

Question 4

A Dataflow streaming pipeline reads from Pub/Sub and writes to Pub/Sub with a 30 second delivery target, but messages are arriving late. Metrics show system lag of about 5 seconds, data freshness of about 40 seconds, and event time within about 3 seconds of publish time. What is the most likely cause and what should you do?

  • ❏ A. Output topic publish quota limits, request increase

  • ❏ B. Backlog on input subscription, scale Dataflow

  • ❏ C. Slow per-record processing, add workers

  • ❏ D. Excessive allowed lateness holds data, lower lateness

Question 5

At Northwind Outfitters you are moving point of sale fact data into BigQuery. One table named retail_dw.sales_events stores the timestamp of each purchase, the items bought, the outlet_id, and the outlet city and region. Analysts frequently run reports that count items sold during the last 45 days and explore trends by region, by city, and by outlet. You want the most efficient scans for filters on time and on location while keeping the model straightforward. How should you design the BigQuery table schema?

  • ❏ A. Do not partition the table and cluster by outlet_id then city then region

  • ❏ B. Use ingestion time partitioning and cluster by outlet_id then city then region

  • ❏ C. Partition the table on the purchase_timestamp column and cluster by region then city then outlet_id

  • ❏ D. Partition the table on the purchase_timestamp column and cluster by outlet_id then city then region

Question 6

A BigQuery ML classifier achieves 99 percent accuracy on the training set but only 70 percent on the validation set, indicating overfitting. Which actions should you take to improve generalization? (Choose 3)

  • ❏ A. Train for many more epochs

  • ❏ B. Reduce features with selection or dimensionality reduction

  • ❏ C. Increase regularization like higher L2 or dropout

  • ❏ D. Disable early stopping

  • ❏ E. Add more recent labeled data

Question 7

You operate streaming ETL jobs at NovaMetrics Labs and a long running Dataflow pipeline that processes events in real time has seen end to end latency rise from about 8 seconds to nearly 75 seconds. The execution graph shows that Dataflow has fused multiple transforms into one large stage which makes stage level metrics hard to interpret. You need to locate the exact part of the pipeline that is causing the slowdown and capture useful execution metrics without changing the business logic. What should you do?

  • ❏ A. Enable Streaming Engine and raise the maximum number of workers

  • ❏ B. Insert a Reshuffle boundary after each major logical step and inspect stage metrics in the Dataflow console

  • ❏ C. Introduce temporary sinks after each significant transform and compare write rates across components

  • ❏ D. Add verbose debug logging inside each ParDo and analyze the logs during execution

Question 8

Which Google Cloud service offers a managed PostgreSQL database that requires minimal changes to existing applications?

  • ❏ A. AlloyDB for PostgreSQL

  • ❏ B. Cloud Spanner

  • ❏ C. Cloud SQL

  • ❏ D. Cloud Bigtable

Question 9

An analytics team at CedarMart has trained a BigQuery ML model and asks you to design a serving pipeline for predictions. Your HTTP API must return a prediction for a single account_id with end to end latency under 80 milliseconds. The team generates results by running the statement SELECT predicted_label, account_id FROM ML.PREDICT(MODEL `ml_ops.churn_v2`, TABLE account_features_v3). How should you build the pipeline so that the application can respond for an individual account_id within 80 milliseconds while continuing to use that statement?

  • ❏ A. Export the model to Vertex AI and deploy an online prediction endpoint that the API calls

  • ❏ B. Create a Dataflow pipeline that runs the ML.PREDICT statement and reads results with BigQueryIO and grant the application service account the Dataflow Worker role

  • ❏ C. Run a Dataflow job that writes predictions for all account_ids from the ML.PREDICT statement to Bigtable and let the API read a single row by key using the Bigtable Reader role

  • ❏ D. Add a WHERE account_id filter to the statement and grant the application service account the BigQuery Data Viewer role

Question 10

In Pub/Sub, what should you configure to ensure a newly created subscription can immediately access the most recent 30 days of previously published messages?

  • ❏ A. Set subscription backlog retention to 30 days

  • ❏ B. Seek the new subscription to a timestamp 30 days ago

  • ❏ C. Set topic message retention to 30 days

Question 11

At the Aster Marine Institute you operate 1,200 ocean buoys that emit one metric every second with a timestamp. Your repository holds about 1.5 TB of data and is growing by roughly 2 GB per day. You must support two access patterns. The first is to fetch a single reading for a given buoy and exact timestamp with a typical latency in the single digit millisecond range. The second is to run daily analytical workloads that include joins across the full dataset. Which storage design should you implement so that both needs are satisfied?

  • ❏ A. Use Cloud SQL for PostgreSQL with a composite index on buoy ID and timestamp and schedule a daily export to BigQuery

  • ❏ B. Use BigQuery and concatenate buoy ID and timestamp as a primary key

  • ❏ C. Use Bigtable with a row key of buoy ID concatenated with timestamp and export to BigQuery once per day

  • ❏ D. Use Bigtable with a row key of buoy ID concatenated with metric name and export to BigQuery once per day

Question 12

A neural network with 10 input features trained on approximately 4 million labeled rows is underfitting on both the training and validation sets. What should you do to improve its performance?

  • ❏ A. Vertex AI hyperparameter tuning

  • ❏ B. Add feature crosses for feature interactions

  • ❏ C. Increase dropout during training

  • ❏ D. AutoML Tabular

Question 13

You manage centralized BigQuery datasets with authorized views that several departments at Orion Outfitters use. The marketing analytics team experiences large month to month swings in query charges under the on-demand model and they want a predictable monthly cost without changing how they access the shared views. What should you do to help them maintain a steady BigQuery query spend each month?

  • ❏ A. Establish a BigQuery processed bytes quota for the marketing group and cap the maximum scanned bytes per day

  • ❏ B. Create a BigQuery slot reservation with a baseline of 600 slots for the marketing group and keep autoscaling disabled, then bill them back from that commitment

  • ❏ C. Create a BigQuery Enterprise edition reservation with a baseline of 300 slots and enable autoscaling up to 900 slots for the marketing group

  • ❏ D. Create a BigQuery reservation with a baseline of 0 slots and enable autoscaling up to 800 slots for the marketing group

Question 14

In BigQuery, your queries filter on event_time and user_id across approximately 18 months of data, but a dry run shows a full table scan. How can you reduce the data scanned and cost with minimal changes to the SQL?

  • ❏ A. Add a LIMIT clause

  • ❏ B. Create daily sharded tables and query with a wildcard

  • ❏ C. Use time partitioning on event_time with clustering by user_id

  • ❏ D. Create a materialized view for the last 30 days

Question 15

Larkspur Retail has applications that publish clickstream and purchase events to a Pub/Sub topic and strict ordering is not needed. Analysts require fixed one hour non overlapping aggregations and only the aggregated outputs should be stored in BigQuery for reporting. The pipeline must scale to bursts of about two million events per minute while remaining cost efficient. How should you process the stream and load the results into BigQuery?

  • ❏ A. Trigger a Cloud Function on each Pub/Sub message to update counters and write hourly results to BigQuery

  • ❏ B. Use a Pub/Sub BigQuery subscription to stream messages directly into BigQuery and derive hourly summaries with scheduled queries

  • ❏ C. Run a streaming Dataflow pipeline that reads from Pub/Sub applies fixed one hour tumbling windows and writes each window’s aggregates to BigQuery

  • ❏ D. Schedule an hourly batch Dataflow job that reads the Pub/Sub backlog computes aggregates and outputs to BigQuery

Question 16

Which Google Cloud service should you use to create a conversational assistant that interacts with users and directs them to the appropriate support queue?

  • ❏ A. Contact Center AI

  • ❏ B. Text-to-Speech API

  • ❏ C. Dialogflow CX

  • ❏ D. Speech-to-Text API

Question 17

You are leading a data platform project at Cedar Finch Retail that requires a transactional database with strict ACID guarantees. The operations team wants the service to keep running during a zone outage and they expect failover to occur automatically with little manual work. Which Google Cloud approach should you adopt to meet these needs?

  • ❏ A. Bigtable instance with more than one cluster

  • ❏ B. Cloud SQL for MySQL with point in time recovery enabled

  • ❏ C. Cloud SQL for PostgreSQL with high availability enabled

  • ❏ D. BigQuery dataset using a multi region location

Question 18

A streaming Dataflow pipeline reads from Pub/Sub and writes to BigQuery, and it is backlogged while all six e2-standard-2 workers are at approximately 96 percent CPU utilization. Which changes would increase throughput? (Choose 2)

  • ❏ A. Buy more BigQuery slots

  • ❏ B. Increase the maximum worker cap

  • ❏ C. Move the pipeline to us-central1

  • ❏ D. Use larger worker machines

Question 19

Helios Manufacturing plans to move its on-premises Apache Hadoop environment to Google Cloud to run nightly and multi-hour batch processing for analytics. They want a managed approach that remains resilient to failures and keeps costs low for long-running jobs. How should they migrate and configure the platform to achieve these goals?

  • ❏ A. Install Hadoop and Spark on a 12-node Compute Engine managed instance group with standard machine types. Configure the Cloud Storage connector and keep data in Cloud Storage while changing paths from hdfs:// to gs://

  • ❏ B. Create a Dataproc cluster that uses SSD persistent disks and make about 40% of the workers preemptible. Store data in Cloud Storage and update job paths from hdfs:// to gs://

  • ❏ C. Provision a Dataproc cluster with standard persistent disks and use about 40% preemptible worker nodes. Keep datasets in Cloud Storage and update paths in your jobs from hdfs:// to gs://

  • ❏ D. Rebuild the pipelines to run on Dataflow and write outputs to BigQuery instead of HDFS or Cloud Storage

Question 20

What should you implement so that each Google Cloud project can view its BigQuery slot allocation with a 48 hour trend without building custom pipelines?

  • ❏ A. Cloud Monitoring with query/scanned_bytes

  • ❏ B. Looker Studio on BigQuery audit logs

  • ❏ C. Cloud Monitoring dashboard with BigQuery metric slots/allocated_for_project

  • ❏ D. Cloud Logging export with custom metric from totalSlotMs

Real Certified Google Data Engineer Questions Answered

Question 1

Riverton Outfitters runs its enterprise analytics on BigQuery where fresh data is ingested every day and a transformation workflow reshapes it for business teams. The workflow changes frequently and some defects are not discovered until about 21 days later. You need a recovery approach that supports late detection while keeping backup storage costs low. How should you organize your BigQuery tables and manage backups?

  • ✓ C. Store data in separate monthly tables and regularly export and compress each month to Cloud Storage

The correct option is Store data in separate monthly tables and regularly export and compress each month to Cloud Storage. This approach creates durable and low cost restore points that comfortably cover a 21 day detection window.

Organizing by month limits what you need to back up and what you need to restore. Closed months can be exported once, while the active month can be exported on a schedule to provide multiple recovery points. Compressed exports in Cloud Storage are inexpensive and independent of changes in BigQuery, so you can reload only the affected month into a temporary dataset and reapply transformations without paying to move the entire history.

Exports also avoid BigQuery time travel limits and survive table rewrites, which makes them well suited for late defect discovery. The monthly layout reduces both backup and recovery time and cost because you move only the minimal data required.
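As a rough sketch of this pattern, the statement below exports one closed month as compressed Avro using the BigQuery client library. The dataset, table, and bucket names are hypothetical placeholders.

```python
from google.cloud import bigquery

client = bigquery.Client()

# Hypothetical dataset, table, and backup bucket names.
export_sql = """
EXPORT DATA OPTIONS (
  uri = 'gs://riverton-bq-backups/sales/2025-05/*.avro',
  format = 'AVRO',
  compression = 'SNAPPY',
  overwrite = true
) AS
SELECT * FROM `analytics.sales_2025_05`
"""

# The export runs as a standard query job and writes compressed Avro
# files for that month to Cloud Storage.
client.query(export_sql).result()
```

Scheduling this statement once per month for closed months, and more frequently for the active month, gives the multiple low cost restore points the scenario calls for.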

Keep all records in a single BigQuery table and schedule compressed exports to Cloud Storage is inefficient because every export must include the entire table. That increases cost and slows recovery, and it prevents targeted restores when only a subset such as one month needs to be corrected.

Partition data into monthly tables and mirror them into a second dataset in BigQuery doubles BigQuery storage and still propagates logical errors to the mirror. It does not provide a cost efficient or isolated backup because the copy remains inside BigQuery and is subject to the same risks.

Use monthly tables and rely on BigQuery snapshot decorators to restore the table to a prior state cannot meet a 21 day requirement because decorators rely on time travel which is limited to at most seven days. This mechanism is not a substitute for long term backups.

When a scenario mentions detection later than the BigQuery time travel window, favor exports to Cloud Storage and organize data by time so you can back up and restore only what you need while keeping storage costs low.

Question 2

In Google Cloud, attempts to create a user managed key for a service account fail in both the Google Cloud console and the gcloud CLI. What is the most likely cause?

  • ✓ C. Org Policy with constraints/iam.disableServiceAccountKeyCreation

The correct option is Org Policy with constraints/iam.disableServiceAccountKeyCreation.

This organization policy constraint centrally blocks the creation of user managed service account keys regardless of whether the attempt is made in the Console or through gcloud. When constraints/iam.disableServiceAccountKeyCreation is enforced at the organization, folder, or project level, it overrides individual IAM permissions and causes key creation to fail consistently across interfaces.
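If you prefer to verify this from code rather than the console, a minimal sketch using the Org Policy client library (the google-cloud-org-policy package) might look like the following. The project ID is hypothetical and the same check can be run at the folder or organization level.

```python
from google.cloud import orgpolicy_v2

client = orgpolicy_v2.OrgPolicyClient()

# Hypothetical project ID; prints the effective policy, including any
# enforcement inherited from a folder or the organization.
policy = client.get_effective_policy(
    name="projects/my-project/policies/iam.disableServiceAccountKeyCreation"
)
print(policy)
```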

Missing roles/iam.serviceAccountKeyAdmin is not the most likely cause because other roles can grant the needed permission to create keys. If permissions were the only issue, the failure would present as a straightforward permission error and could be resolved by using another role that includes iam.serviceAccountKeys.create. The question points to a cross interface failure that is better explained by a governing policy rather than a single missing role.

VPC Service Controls perimeter blocks IAM API calls is incorrect because VPC Service Controls do not protect or restrict IAM API calls. VPC Service Controls apply to supported data services and do not block the IAM API needed to create service account keys.

The service account is disabled is unlikely to be the cause. Disabling a service account stops it from being used to obtain credentials and access resources, yet administrators can still manage the account. The described consistent failure in both tools is more accurately explained by an organization policy that prevents creating user managed keys.

If key creation fails in both tools and permissions appear correct, quickly check the effective organization policy at the project, folder, and organization levels for a constraint that disables user managed keys.

Question 3

You manage the analytics platform at HarborTech Analytics which relies on BigQuery for storage. Your streaming pipelines read from Pub/Sub and load data into multiple BigQuery tables. After deploying an updated release of the ingestion jobs you observe a 60% jump in total daily bytes stored in BigQuery while Pub/Sub throughput remains unchanged. Only a few tables show daily partitions that are roughly twice as large as usual. What should you do to investigate and fix the underlying cause without risking data loss?

  • ✓ C. Inspect the impacted tables for duplicate rows then use BigQuery Audit Logs to find the jobs writing to them and correlate Dataflow job start times and code versions in Cloud Monitoring and stop any older pipeline instances that still write to the same sinks

The correct option is Inspect the impacted tables for duplicate rows then use BigQuery Audit Logs to find the jobs writing to them and correlate Dataflow job start times and code versions in Cloud Monitoring and stop any older pipeline instances that still write to the same sinks.

The symptoms indicate duplicate writes because only some partitions are inflated while Pub/Sub throughput is unchanged. Start by querying the affected partitions for duplicated natural keys or other identifiers to confirm the issue. Use BigQuery Audit Logs to see which jobs inserted into those tables and from which principals or pipelines. Correlate those inserts with Dataflow job timelines and versions in Cloud Monitoring to identify overlapping or stray jobs that continued writing after the release. Stopping the older instances removes the duplicate source and preserves data without risky deletions.
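A quick duplicate check could look like the sketch below, assuming an ingestion time partitioned table and a natural key column named event_id, both of which are hypothetical.

```python
from google.cloud import bigquery

client = bigquery.Client()

# Hypothetical table, partition date, and key column.
dup_check_sql = """
SELECT event_id, COUNT(*) AS copies
FROM `analytics.orders_events`
WHERE DATE(_PARTITIONTIME) = '2025-06-12'
GROUP BY event_id
HAVING COUNT(*) > 1
ORDER BY copies DESC
LIMIT 100
"""

for row in client.query(dup_check_sql):
    print(row.event_id, row.copies)
```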

Create scheduled queries that periodically remove duplicates from the inflated partitions and distribute the cleanup SQL to other teams so they can run it if they see similar growth is incorrect because it treats the symptom rather than the cause and can delete valid late arriving or slowly changing data when keys are imperfect. It also adds cost and operational toil and does not prevent further duplication.

Reconfigure the pipelines to use the BigQuery Storage Write API with exactly once semantics and depend on it to handle duplicate records automatically instead of investigating further is incorrect because switching write mechanisms is a significant change that does not address the immediate problem or the duplicates already written. Exactly once semantics require careful use of idempotency and stream management and they do not fix overlapping pipeline instances that target the same sinks.

Roll back to the prior pipeline release and restore the tables using BigQuery time travel to the point before the rollout then restart the Dataflow workers and seek the Pub/Sub subscription cursor back to the deployment timestamp to reprocess data is incorrect because it risks data loss or further duplication and requires reprocessing that increases cost and downtime. Time travel retention may be insufficient and rolling back without finding the root cause can allow a stray writer to continue creating the problem.

When storage grows while input rate does not, suspect duplicate writes. Confirm with queries, then use BigQuery Audit Logs and Cloud Monitoring to trace writers before attempting time travel or reprocessing. Prefer actions that do not delete data until duplication is proven.

Question 4

A Dataflow streaming pipeline reads from Pub/Sub and writes to Pub/Sub with a 30 second delivery target, but messages are arriving late. Metrics show system lag of about 5 seconds, data freshness of about 40 seconds, and event time within about 3 seconds of publish time. What is the most likely cause and what should you do?

  • ✓ B. Backlog on input subscription, scale Dataflow

The correct option is Backlog on input subscription, scale Dataflow.

System lag is only about 5 seconds which means the pipeline is keeping up once elements are inside the system. Data freshness is about 40 seconds which means new data is arriving late to the pipeline and is being read from the source well behind real time. Event time is within about 3 seconds of publish time so timestamps are not the reason for delay. The pattern of low system lag with high freshness indicates the input Pub/Sub subscription has a backlog that the job is not draining fast enough. Increasing parallelism and reader throughput by adding workers so the job can pull more messages per second will reduce the backlog and help meet the 30 second delivery target.
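If the job is launched from an Apache Beam script or template, raising the worker ceiling can be as simple as the pipeline options sketched below. The project, region, and worker limit are hypothetical values.

```python
from apache_beam.options.pipeline_options import PipelineOptions

# Hypothetical project, region, and limits. A higher max_num_workers lets
# Dataflow autoscaling add readers until the Pub/Sub backlog drains.
options = PipelineOptions(
    runner="DataflowRunner",
    project="my-project",
    region="us-central1",
    streaming=True,
    autoscaling_algorithm="THROUGHPUT_BASED",
    max_num_workers=40,
)
```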

Output topic publish quota limits, request increase is unlikely because a constrained sink would cause workers to back up and system lag to grow well beyond 5 seconds. You would also expect publish errors or retries rather than primarily stale input data.

Slow per-record processing, add workers does not fit because slow transforms would manifest as high system lag. The metrics show the job processes data promptly once it is ingested.

Excessive allowed lateness holds data, lower lateness is not the cause because allowed lateness controls window triggering behavior and does not delay the ingestion of fresh messages from Pub/Sub. With event time close to publish time, lateness is not holding back timely output.

Compare system lag and data freshness. Low system lag with high freshness points to a source backlog and you should increase read throughput. High system lag points to insufficient processing or a slow sink. Check event time to rule out timestamp skew.

Question 5

At Northwind Outfitters you are moving point of sale fact data into BigQuery. One table named retail_dw.sales_events stores the timestamp of each purchase, the items bought, the outlet_id, and the outlet city and region. Analysts frequently run reports that count items sold during the last 45 days and explore trends by region, by city, and by outlet. You want the most efficient scans for filters on time and on location while keeping the model straightforward. How should you design the BigQuery table schema?

  • ✓ C. Partition the table on the purchase_timestamp column and cluster by region then city then outlet_id

The correct option is Partition the table on the purchase_timestamp column and cluster by region then city then outlet_id.

Partitioning on the purchase timestamp lets BigQuery prune to only the most recent 45 days when analysts filter by that range, which reduces scanned data and cost. Clustering by region first then city then outlet id groups data by the most common location predicates so queries that filter by region or city can skip large portions of the table. Ordering the cluster keys from broader geography to specific outlet makes those filters most effective while keeping the model as a single straightforward fact table.
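A minimal DDL sketch for this layout follows; the column types are assumptions based on the scenario description.

```python
from google.cloud import bigquery

client = bigquery.Client()

# Column types are assumed from the scenario; adjust to the real schema.
ddl = """
CREATE TABLE IF NOT EXISTS `retail_dw.sales_events`
(
  purchase_timestamp TIMESTAMP,
  items ARRAY<STRING>,
  outlet_id STRING,
  city STRING,
  region STRING
)
PARTITION BY DATE(purchase_timestamp)
CLUSTER BY region, city, outlet_id
"""
client.query(ddl).result()
```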

Do not partition the table and cluster by outlet_id then city then region is inefficient for the frequent 45 day time filters because without partitioning BigQuery must consider the entire table. Clustering alone cannot provide the same pruning for time range predicates and the cluster key order provides little benefit to region or city only filters.

Use ingestion time partitioning and cluster by outlet_id then city then region does not align partitions with purchase time, so filters on purchase_timestamp cannot benefit from partition pruning and late arriving or backfilled events would further reduce effectiveness. Placing outlet first in the cluster key also weakens performance for regional or city filters.

Partition the table on the purchase_timestamp column and cluster by outlet_id then city then region gets the time partitioning right but the cluster order is suboptimal for analyses that filter by region or city. Clustering is most effective when the filtered columns lead the cluster key, so putting outlet first limits the benefit for those common predicates.

Match your table partition column to the time field used in WHERE filters and choose cluster keys that users frequently filter on. Order cluster keys with the most commonly filtered column first so queries can skip more data.

Question 6

A BigQuery ML classifier achieves 99 percent accuracy on the training set but only 70 percent on the validation set, indicating overfitting. Which actions should you take to improve generalization? (Choose 3)

  • ✓ B. Reduce features with selection or dimensionality reduction

  • ✓ C. Increase regularization like higher L2 or dropout

  • ✓ E. Add more recent labeled data

The correct options are Reduce features with selection or dimensionality reduction, Increase regularization like higher L2 or dropout, and Add more recent labeled data.

A large gap between training and validation accuracy signals high variance. Using fewer and more informative inputs simplifies the hypothesis space and curbs variance which helps close that gap. In BigQuery ML you can select only the most relevant columns in your training query or apply PCA to compress correlated features.

Stronger regularization discourages memorization of noise and promotes smoother decision boundaries. In BigQuery ML you can raise L2 penalties for linear and logistic models and configure dropout in DNN classifiers to improve generalization.

More high quality and recent labeled examples expose the model to a broader and more current data distribution which reduces variance and combats concept drift. This often yields better validation performance than trying to further fit the existing training set.
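In BigQuery ML these levers are just model options. The hypothetical sketch below shows a DNN classifier with dropout raised and early stopping left on; for linear or logistic models you would raise l2_reg instead. Model, table, and column names are assumptions.

```python
from google.cloud import bigquery

client = bigquery.Client()

# Hypothetical model, training table, and label column.
create_model_sql = """
CREATE OR REPLACE MODEL `ml_ops.churn_classifier`
OPTIONS (
  model_type = 'DNN_CLASSIFIER',
  input_label_cols = ['churned'],
  hidden_units = [64, 32],
  dropout = 0.3,        -- stronger regularization
  early_stop = TRUE     -- keep early stopping enabled
) AS
SELECT * FROM `ml_ops.training_features`
"""
client.query(create_model_sql).result()
```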

Train for many more epochs is counterproductive in an overfitting scenario because it encourages the model to fit training noise even more which usually widens the gap to validation performance.

Disable early stopping removes a safeguard that halts training when validation metrics stall or degrade which increases the risk of overfitting rather than reducing it.

When training accuracy greatly exceeds validation accuracy think reduce complexity, increase regularization, and get more data. Retain early stopping and avoid longer training unless validation metrics are still improving.

Question 7

You operate streaming ETL jobs at NovaMetrics Labs and a long running Dataflow pipeline that processes events in real time has seen end to end latency rise from about 8 seconds to nearly 75 seconds. The execution graph shows that Dataflow has fused multiple transforms into one large stage which makes stage level metrics hard to interpret. You need to locate the exact part of the pipeline that is causing the slowdown and capture useful execution metrics without changing the business logic. What should you do?

  • ✓ B. Insert a Reshuffle boundary after each major logical step and inspect stage metrics in the Dataflow console

The correct answer is Insert a Reshuffle boundary after each major logical step and inspect stage metrics in the Dataflow console.

This approach creates explicit execution boundaries so Dataflow materializes data between steps and prevents excessive fusion. With separate stages, the monitoring interface exposes per stage throughput, backlog, and latency which lets you pinpoint the exact hotspot even when the original graph had been fused. It preserves the pipeline’s business logic because it only affects execution and observability and not the functional transformations.

Once the graph is segmented, you can identify where end to end latency accumulates by comparing stage metrics in the console. You can then adjust parallelism, partitioning, or resource settings in a targeted way rather than guessing across the entire fused stage.
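A sketch of the boundary insertion is shown below. The Pub/Sub source, parsing, and enrichment steps are hypothetical stand-ins for the existing business logic, which is otherwise unchanged.

```python
import json
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(streaming=True)

with beam.Pipeline(options=options) as p:
    (
        p
        | "ReadEvents" >> beam.io.ReadFromPubSub(
            subscription="projects/my-project/subscriptions/events-sub")
        | "Parse" >> beam.Map(json.loads)
        # Materialization boundary: breaks fusion so the Parse step gets
        # its own stage metrics in the Dataflow console.
        | "BreakFusionAfterParse" >> beam.Reshuffle()
        | "Enrich" >> beam.Map(lambda e: {**e, "source": "pos"})
        | "BreakFusionAfterEnrich" >> beam.Reshuffle()
        | "Serialize" >> beam.Map(lambda e: json.dumps(e).encode("utf-8"))
        | "WriteResults" >> beam.io.WriteToPubSub(
            topic="projects/my-project/topics/enriched-events")
    )
```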

Enable Streaming Engine and raise the maximum number of workers focuses on scaling resources and moving shuffle to the service, which can reduce worker load and sometimes latency. It does not isolate the problematic portion of a fused stage or improve the interpretability of stage level metrics, so it does not help you locate the exact slowdown.

Introduce temporary sinks after each significant transform and compare write rates across components changes the pipeline behavior by adding extra I/O and storage dependencies. This can distort performance, increase cost, and complicate cleanup, and it is unnecessary when you can create internal boundaries that expose accurate stage metrics without altering business logic.

Add verbose debug logging inside each ParDo and analyze the logs during execution introduces overhead and noisy logs, and fused execution still obscures where time is spent. Logs do not integrate with stage level metrics in the monitoring interface, so this does not reliably reveal the precise bottleneck.

When you see fused stages making metrics hard to interpret, think about adding an internal boundary to break fusion so you can compare per stage metrics without changing the pipeline’s logic.

Question 8

Which Google Cloud service offers a managed PostgreSQL database that requires minimal changes to existing applications?

  • ✓ C. Cloud SQL

The correct option is Cloud SQL because it provides a fully managed PostgreSQL database while typically requiring only minimal changes such as updating connection details.

Cloud SQL runs the community PostgreSQL engine and supports standard drivers and tools. It handles backups, patching, and high availability which lets most applications migrate with little or no code change.

AlloyDB for PostgreSQL is managed and PostgreSQL compatible and it is optimized for high performance and advanced features, yet migrations to it often involve more architectural and networking considerations, so it is not the simplest path when the goal is minimal changes. For ease of migration Cloud SQL is usually preferred.

Cloud Spanner is a globally distributed relational service and it does not run the PostgreSQL engine. Even with its PostgreSQL interface it has different semantics and limits which usually require significant application changes, so it does not meet the requirement.

Cloud Bigtable is a NoSQL wide column database and it is not PostgreSQL and it does not support SQL queries or relational schemas, so it would require extensive redesign and is not suitable for this use case.

When a question emphasizes minimal changes pick the managed service that keeps the same engine. For PostgreSQL on Google Cloud that is Cloud SQL.

Question 9

An analytics team at CedarMart has trained a BigQuery ML model and asks you to design a serving pipeline for predictions. Your HTTP API must return a prediction for a single account_id with end to end latency under 80 milliseconds. The team generates results by running the statement SELECT predicted_label, account_id FROM ML.PREDICT(MODEL `ml_ops.churn_v2`, TABLE account_features_v3). How should you build the pipeline so that the application can respond for an individual account_id within 80 milliseconds while continuing to use that statement?

  • ✓ C. Run a Dataflow job that writes predictions for all account_ids from the ML.PREDICT statement to Bigtable and let the API read a single row by key using the Bigtable Reader role

The correct option is Run a Dataflow job that writes predictions for all account_ids from the ML.PREDICT statement to Bigtable and let the API read a single row by key using the Bigtable Reader role. This design preserves the exact SQL the team already runs and moves the results into a store that can return a single key lookup well within the 80 millisecond requirement.

This pipeline can execute the ML.PREDICT query on a schedule or continuously and write results keyed by account_id into Bigtable. Bigtable is optimized for single row key reads with very low latency, so the API can fetch the prediction quickly and reliably. The application only needs the Bigtable Reader role, and it avoids per request query execution in BigQuery which is not designed for sub 100 millisecond point lookups.
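The serving side lookup could look like the sketch below; the project, instance, table, column family, and qualifier names are all hypothetical.

```python
from google.cloud import bigtable

# Hypothetical project, instance, table, and column names.
client = bigtable.Client(project="cedarmart-analytics")
table = client.instance("serving-instance").table("churn_predictions")

row = table.read_row(b"account-12345")  # row key is the account_id
if row is not None:
    cell = row.cells["preds"][b"predicted_label"][0]
    print(cell.value.decode("utf-8"))
```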

Export the model to Vertex AI and deploy an online prediction endpoint that the API calls is not appropriate because the requirement is to keep using the ML.PREDICT statement. Moving to Vertex AI changes the serving pattern and would require a separate feature serving path rather than reusing the existing BigQuery workflow.

Create a Dataflow pipeline that runs the ML.PREDICT statement and reads results with BigQueryIO and grant the application service account the Dataflow Worker role does not meet the latency goal because the API would still be tied to BigQuery results or to a running pipeline, which is not a per request serving system. The Dataflow Worker role is also unnecessary for an application that only needs to read predictions.

Add a WHERE account_id filter to the statement and grant the application service account the BigQuery Data Viewer role will not achieve 80 millisecond responses since BigQuery is built for analytical workloads and interactive queries often exceed this latency. It also changes the agreed upon statement rather than keeping it unchanged.

When strict low latency is required, identify opportunities to precompute predictions and place them in a store designed for key-based lookups. Avoid invoking analytical systems like BigQuery per request when you need sub 100 millisecond responses.

Question 10

In Pub/Sub, what should you configure to ensure a newly created subscription can immediately access the most recent 30 days of previously published messages?

  • ✓ C. Set topic message retention to 30 days

The correct option is Set topic message retention to 30 days.

This setting keeps published messages stored at the topic for the specified duration so any subscription that is created later can immediately read those retained messages. Pub/Sub delivers the retained messages to the new subscription during its initial catch up which achieves the goal of making the last 30 days available right away.
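As a sketch, topic retention can be set when the topic is created or later with an update. The project and topic names below are hypothetical, and 30 days is within the allowed topic retention range.

```python
from google.cloud import pubsub_v1
from google.protobuf import duration_pb2

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("my-project", "clickstream-events")

# Retain published messages on the topic for 30 days so that later
# subscriptions can still access them.
retention = duration_pb2.Duration(seconds=30 * 24 * 60 * 60)
publisher.create_topic(
    request={"name": topic_path, "message_retention_duration": retention}
)
```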

Set subscription backlog retention to 30 days is incorrect because backlog retention applies only to messages that belong to an existing subscription. A new subscription has no backlog to retain, so it cannot access messages that were published before it was created.

Seek the new subscription to a timestamp 30 days ago is incorrect because seeking only repositions the subscription cursor within messages that have been retained for that subscription. If the topic did not retain those older messages then a new subscription has nothing to read when seeking to that time.

When a question mentions new subscriptions needing historical messages, think topic retention. Subscription retention and seek help with replay for an existing subscription only.

Question 11

At the Aster Marine Institute you operate 1,200 ocean buoys that emit one metric every second with a timestamp. Your repository holds about 1.5 TB of data and is growing by roughly 2 GB per day. You must support two access patterns. The first is to fetch a single reading for a given buoy and exact timestamp with a typical latency in the single digit millisecond range. The second is to run daily analytical workloads that include joins across the full dataset. Which storage design should you implement so that both needs are satisfied?

  • ✓ C. Use Bigtable with a row key of buoy ID concatenated with timestamp and export to BigQuery once per day

The correct answer is Use Bigtable with a row key of buoy ID concatenated with timestamp and export to BigQuery once per day.

This design aligns with the point read requirement because a row key that combines buoy ID and timestamp lets you directly address the single row that contains the reading. That gives consistently low latency lookups in the single digit millisecond range when modeled correctly. The daily export to BigQuery provides an analytical store that is built for full scans and joins, which satisfies the daily workload across the whole dataset.

The volume and growth rate fit a time series pattern that a wide column NoSQL store handles very well. Writes of one metric per buoy per second are modest and spread across keys which avoids contention. You can schedule a recurring Dataflow pipeline to move the latest data so BigQuery remains ready for the daily analytical jobs.
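The key design and the matching write and point read might look like this hypothetical sketch; the project, instance, table, and column family names are assumptions.

```python
from google.cloud import bigtable

# Hypothetical project, instance, table, and column family names.
client = bigtable.Client(project="aster-marine")
table = client.instance("buoy-telemetry").table("readings")

# Row key combines buoy ID and the exact reading timestamp, so fetching a
# single reading is one direct row lookup.
row_key = "buoy-0042#2025-03-18T08:15:03Z".encode("utf-8")

# Write one reading.
row = table.direct_row(row_key)
row.set_cell("metrics", b"value", b"13.7")
row.commit()

# Point read for the same buoy and timestamp.
reading = table.read_row(row_key)
print(reading.cells["metrics"][b"value"][0].value)
```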

Use Cloud SQL for PostgreSQL with a composite index on buoy ID and timestamp and schedule a daily export to BigQuery is not a good fit because although indexed point reads can be fast, managing a multi-terabyte and growing time series in a relational database is inefficient and costly, and it is not optimized for large daily analytical joins across the full dataset.

Use BigQuery and concatenate buoy ID and timestamp as a primary key is unsuitable for the point read requirement because BigQuery is an analytical engine with query latencies that are typically seconds rather than single digit milliseconds, and it does not provide primary key based single row lookups.

Use Bigtable with a row key of buoy ID concatenated with metric name and export to BigQuery once per day does not match the access pattern because you need to fetch by exact timestamp. A key based on metric name would require scanning by time within a metric which prevents fast direct reads for a specific timestamp.

Map the access pattern to the row key. For single digit millisecond point reads choose a key that directly addresses the row and use BigQuery only for large scale scans and joins. When you see daily analytics plus point lookups think operational store plus warehouse and plan a reliable ingest or export path.

Question 12

A neural network with 10 input features trained on approximately 4 million labeled rows is underfitting on both the training and validation sets. What should you do to improve its performance?

  • ✓ B. Add feature crosses for feature interactions

The correct option is Add feature crosses for feature interactions.

Underfitting in both training and validation indicates the model capacity or features are insufficient. With 10 features and about 4 million labeled rows, there is ample data to support richer interactions. Adding feature crosses explicitly models relationships between features that a simple network may not capture, which increases expressive power and often boosts accuracy on tabular data.
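In BigQuery ML an explicit cross can be added in a TRANSFORM clause, as in the hedged sketch below; the model, table, and column names are assumptions and the classifier type is chosen only for illustration.

```python
from google.cloud import bigquery

client = bigquery.Client()

# Hypothetical model, table, and column names.
create_model_sql = """
CREATE OR REPLACE MODEL `ml_ops.demand_dnn`
TRANSFORM (
  ML.FEATURE_CROSS(STRUCT(region, product_category)) AS region_x_category,
  region,
  product_category,
  unit_price,
  label
)
OPTIONS (
  model_type = 'DNN_CLASSIFIER',
  input_label_cols = ['label']
) AS
SELECT region, product_category, unit_price, label
FROM `ml_ops.training_rows`
"""
client.query(create_model_sql).result()
```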

Vertex AI hyperparameter tuning can help search for better hyperparameters, however it cannot create new informative signals from the existing inputs. When a model is underfitting across training and validation, improving feature expressiveness is usually more effective than only tuning knobs.

Increase dropout during training adds regularization and reduces capacity, which generally worsens underfitting instead of alleviating it.

AutoML Tabular is a change of tooling rather than a direct fix for this neural network and it does not specifically address the lack of feature interactions that is causing underfitting. This wording maps to the older AutoML Tables branding which has been integrated into Vertex AI, so it is less likely to appear on newer exams.

When both training and validation metrics are low, think underfitting. Prioritize richer features or added capacity such as feature crosses over stronger regularization or only tweaking tuning tools.

Question 13

You manage centralized BigQuery datasets with authorized views that several departments at Orion Outfitters use. The marketing analytics team experiences large month to month swings in query charges under the on-demand model and they want a predictable monthly cost without changing how they access the shared views. What should you do to help them maintain a steady BigQuery query spend each month?

  • ✓ B. Create a BigQuery slot reservation with a baseline of 600 slots for the marketing group and keep autoscaling disabled, then bill them back from that commitment

The correct option is Create a BigQuery slot reservation with a baseline of 600 slots for the marketing group and keep autoscaling disabled, then bill them back from that commitment.

A fixed slot commitment gives the team dedicated capacity and a steady hourly rate which translates into a predictable monthly bill. Disabling autoscaling ensures there are no burst charges so spend remains flat even when usage spikes. You can assign the reservation to the marketing projects or folders without changing how they query the centralized authorized views because reservations control compute capacity while authorized views control data access.
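One hedged sketch of setting this up programmatically uses the BigQuery Reservation API. The admin project, location, reservation name, and assignee project below are hypothetical, and edition and commitment options are omitted.

```python
from google.cloud import bigquery_reservation_v1 as reservation

client = reservation.ReservationServiceClient()
admin_parent = "projects/billing-admin/locations/US"  # hypothetical admin project

# Fixed baseline of 600 slots with no autoscaling configured.
res = client.create_reservation(
    parent=admin_parent,
    reservation_id="marketing-fixed",
    reservation=reservation.Reservation(slot_capacity=600),
)

# Route the marketing project's query jobs to that reservation.
client.create_assignment(
    parent=res.name,
    assignment=reservation.Assignment(
        assignee="projects/marketing-analytics",
        job_type=reservation.Assignment.JobType.QUERY,
    ),
)
```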

Establish a BigQuery processed bytes quota for the marketing group and cap the maximum scanned bytes per day is not suitable because quotas restrict or fail queries rather than provide a predictable budget. Bytes processed limits are enforced per job or per user and do not convert on demand costs into a fixed monthly amount for a department.

Create a BigQuery Enterprise edition reservation with a baseline of 300 slots and enable autoscaling up to 900 slots for the marketing group would reintroduce variable charges whenever autoscaling adds slots. That defeats the goal of a steady monthly spend even though capacity can scale.

Create a BigQuery reservation with a baseline of 0 slots and enable autoscaling up to 800 slots for the marketing group provides no committed capacity and makes all usage variable. This leads to unpredictable month to month costs as charges depend entirely on how much the system scales.

When a question emphasizes predictable cost for BigQuery, choose committed slots with autoscaling turned off. Quotas and autoscaling address control and performance but they do not guarantee a steady monthly bill.

Question 14

In BigQuery, your queries filter on event_time and user_id across approximately 18 months of data, but a dry run shows a full table scan. How can you reduce the data scanned and cost with minimal changes to the SQL?

  • ✓ C. Use time partitioning on event_time with clustering by user_id

Only Use time partitioning on event_time with clustering by user_id is correct.

This approach lets BigQuery prune entire date partitions when the query filters on event_time, which prevents scanning months that are not needed. The clustering by user_id then organizes data within each partition so queries that filter on a specific user read far fewer blocks. Because your existing filters already use event_time and user_id you gain these benefits with little or no query rewrite which directly reduces scanned bytes and cost.
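After the table is partitioned and clustered, a dry run is an easy way to confirm that the unchanged filters now prune data; the table name below is hypothetical.

```python
from google.cloud import bigquery

client = bigquery.Client()

# A dry run estimates bytes without running or billing the query.
job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
sql = """
SELECT COUNT(*)
FROM `analytics.app_events`
WHERE event_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
  AND user_id = 'u-1234'
"""
job = client.query(sql, job_config=job_config)
print(f"Bytes that would be scanned: {job.total_bytes_processed}")
```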

Add a LIMIT clause is wrong because BigQuery bills by bytes processed before applying the limit, so it does not reduce the amount of data scanned and a dry run would still show a full table scan.

Create daily sharded tables and query with a wildcard is wrong because date sharding is a legacy pattern that is less efficient and harder to manage than native partitioned tables, and wildcard queries can still scan many shards. This also requires more structural and SQL changes rather than the minimal change requested.

Create a materialized view for the last 30 days is wrong because it would only help for recent data while your workload spans about eighteen months. Queries outside that window would still scan the base table and would not meaningfully reduce overall scanned bytes.

When a query filters on time and another key, think partition on the time column and then cluster on the key. Remember that a LIMIT clause does not reduce bytes processed, and prefer native partitioned tables over date sharding.

Question 15

Larkspur Retail has applications that publish clickstream and purchase events to a Pub/Sub topic and strict ordering is not needed. Analysts require fixed one hour non overlapping aggregations and only the aggregated outputs should be stored in BigQuery for reporting. The pipeline must scale to bursts of about two million events per minute while remaining cost efficient. How should you process the stream and load the results into BigQuery?

  • ✓ C. Run a streaming Dataflow pipeline that reads from Pub/Sub applies fixed one hour tumbling windows and writes each window’s aggregates to BigQuery

The correct choice is Run a streaming Dataflow pipeline that reads from Pub/Sub applies fixed one hour tumbling windows and writes each window’s aggregates to BigQuery.

This approach ingests the stream from Pub/Sub and performs fixed one hour tumbling windows that exactly match the non overlapping aggregation requirement. It then writes only the aggregated results to BigQuery which satisfies the constraint to store aggregates only. It can auto scale to handle bursts near two million events per minute and remains cost efficient because it avoids per message invocations and does not persist raw events in BigQuery.

Event time windowing and watermarks allow each hour to close correctly and handle late data appropriately. Aggregating in the pipeline and writing summaries keeps storage and query costs low while delivering ready to use hourly reports.
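A compact sketch of the windowed aggregation follows; the topic, output table, and the assumption that each message counts as one event are hypothetical.

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.transforms.window import FixedWindows

options = PipelineOptions(streaming=True)

with beam.Pipeline(options=options) as p:
    (
        p
        | "Read" >> beam.io.ReadFromPubSub(
            topic="projects/my-project/topics/click-events")
        | "OnePerEvent" >> beam.Map(lambda msg: ("events", 1))
        | "HourlyWindows" >> beam.WindowInto(FixedWindows(60 * 60))
        | "CountPerWindow" >> beam.CombinePerKey(sum)
        | "ToRow" >> beam.Map(lambda kv: {"metric": kv[0], "event_count": kv[1]})
        | "WriteAggregates" >> beam.io.WriteToBigQuery(
            "my-project:reporting.hourly_event_counts",
            schema="metric:STRING,event_count:INTEGER",
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
            create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
        )
    )
```

A production pipeline would normally also record each window's start time in the output row, for example by reading the window inside a DoFn, so that analysts can chart the hourly trend directly.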

Trigger a Cloud Function on each Pub/Sub message to update counters and write hourly results to BigQuery is not cost efficient at this volume because invoking a function for every message adds high overhead and coordinating stateful hourly counters is difficult. It also risks BigQuery write hot spots and quota issues during bursts and making precise hourly windows reliable is challenging.

Use a Pub/Sub BigQuery subscription to stream messages directly into BigQuery and derive hourly summaries with scheduled queries violates the requirement to store only aggregated outputs because it lands raw events first. It increases storage and query costs and introduces batch latency from scheduled queries rather than delivering continuous aggregation.

Schedule an hourly batch Dataflow job that reads the Pub/Sub backlog computes aggregates and outputs to BigQuery does not provide continuous processing and can fall behind during bursts as backlogs grow. It complicates achieving exactly once hourly aggregation and risks gaps or duplicates between runs while a continuous streaming pipeline handles these concerns natively.

Map the requirements to streaming features. When you see fixed windows on a Pub/Sub stream and a constraint to keep only aggregated results in BigQuery, choose a streaming Dataflow pipeline with windowing and a BigQuery sink. Be wary of options that store raw events or invoke per message functions.

Question 16

Which Google Cloud service should you use to create a conversational assistant that interacts with users and directs them to the appropriate support queue?

  • ✓ C. Dialogflow CX

The correct option is Dialogflow CX.

It is designed to build enterprise conversational virtual agents that manage multi turn conversations with stateful flows and intent routing. It supports webhook fulfillment and handoff to human agents through integrations with contact center platforms, which enables routing users to the correct support queue.

Contact Center AI is a broader solution portfolio that brings together virtual agents, agent assist, and insights. It is not the specific service you use to design and manage the conversational flows and routing logic, so it is not the best fit for this question.

Text-to-Speech API converts text into audio and provides natural sounding voices. It does not handle intents, session state, or queue routing, so it cannot by itself build a conversational assistant.

Speech-to-Text API transcribes audio into text. It does not provide dialog management, fulfillment, or routing capabilities, so it is not the right choice for building the assistant.

When a question asks for a full conversational agent with intent handling and routing choose Dialogflow CX. If the focus is only on converting speech to or from text then think of Speech-to-Text or Text-to-Speech instead.

Question 17

You are leading a data platform project at Cedar Finch Retail that requires a transactional database with strict ACID guarantees. The operations team wants the service to keep running during a zone outage and they expect failover to occur automatically with little manual work. Which Google Cloud approach should you adopt to meet these needs?

  • ✓ C. Cloud SQL for PostgreSQL with high availability enabled

The correct option is Cloud SQL for PostgreSQL with high availability enabled.

This choice provides a fully managed relational database that enforces strict ACID guarantees, which suits transactional workloads. The high availability setting creates a primary and a standby in different zones within the same region with synchronous replication, and the service automatically fails over to the standby during a zone outage with little to no manual work.

Bigtable instance with more than one cluster can improve availability using replication and multi cluster routing, yet Bigtable is a NoSQL wide column store that offers atomicity only at the single row level. It does not provide relational multi row ACID transactions, which makes it unsuitable for a strict transactional database requirement.

Cloud SQL for MySQL with point in time recovery enabled helps recover from corruption or accidental changes by restoring to a specific timestamp, but PITR does not deliver automatic failover during a zone outage. Without the high availability setting this instance would still experience downtime or require manual intervention.

BigQuery dataset using a multi region location offers durable, highly available storage for analytical workloads, but BigQuery is a serverless data warehouse and is not intended for OLTP use cases or strict ACID transactional semantics.

Match keywords to capabilities. When you see ACID and automatic zone failover look for managed relational services with a high availability setting rather than backup features or analytics platforms.

Question 18

A streaming Dataflow pipeline reads from Pub/Sub and writes to BigQuery, and it is backlogged while all six e2-standard-2 workers are at approximately 96 percent CPU utilization. Which changes would increase throughput? (Choose 2)

  • ✓ B. Increase the maximum worker cap

  • ✓ D. Use larger worker machines

The correct options are Increase the maximum worker cap and Use larger worker machines.

The pipeline is CPU bound because all workers are running near full utilization. Allowing the service to add more workers spreads the work across additional VMs which increases parallelism and raises total throughput. This directly addresses the observed CPU saturation.

Giving each worker a bigger machine type adds more vCPUs and memory per worker which lets each worker process more elements per unit time. This is effective when transforms are compute intensive or when additional memory reduces garbage collection pressure and serialization overhead.

Buy more BigQuery slots is not helpful here because slots govern query and some processing operations while streaming ingestion to BigQuery uses a separate ingestion path with its own quotas. Adding slots would not relieve a CPU bottleneck in the Dataflow workers or increase streaming write throughput.

Move the pipeline to us-central1 is not a generic fix for a compute bottleneck. Region choice should align with data locations to avoid cross region latency, yet changing regions without adding compute capacity will not resolve sustained 96 percent CPU on workers.

Identify the observed bottleneck first and pick changes that add that resource. If CPU is saturated then scale out or scale up. Remember that BigQuery slots help queries and not streaming inserts.

Question 19

Helios Manufacturing plans to move its on-premises Apache Hadoop environment to Google Cloud to run nightly and multi-hour batch processing for analytics. They want a managed approach that remains resilient to failures and keeps costs low for long-running jobs. How should they migrate and configure the platform to achieve these goals?

  • ✓ C. Provision a Dataproc cluster with standard persistent disks and use about 40% preemptible worker nodes. Keep datasets in Cloud Storage and update paths in your jobs from hdfs:// to gs://

The correct option is Provision a Dataproc cluster with standard persistent disks and use about 40% preemptible worker nodes. Keep datasets in Cloud Storage and update paths in your jobs from hdfs:// to gs://.

This choice gives you a fully managed Hadoop and Spark environment that reduces operational overhead and improves resilience. Keeping datasets in Cloud Storage separates storage from compute and provides durable data independent of the cluster, so losing workers only triggers task retries rather than data loss. Using a mix with roughly forty percent preemptible workers lowers costs for long running batch jobs while nonpreemptible primaries keep the cluster stable and the scheduler can resubmit interrupted tasks. Standard persistent disks are cost effective because the main data sits in Cloud Storage and local disks are mainly used for shuffle and temporary spill.
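One way to express this cluster shape with the Dataproc client library is sketched below. The project, region, machine types, and node counts are hypothetical, and secondary workers are preemptible by default.

```python
from google.cloud import dataproc_v1

region = "us-central1"
client = dataproc_v1.ClusterControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

# Hypothetical sizing: 6 primary workers plus 4 preemptible secondary
# workers (about 40 percent), all on standard persistent disks.
cluster = {
    "project_id": "helios-analytics",
    "cluster_name": "nightly-batch",
    "config": {
        "master_config": {
            "num_instances": 1,
            "machine_type_uri": "n2-standard-8",
            "disk_config": {"boot_disk_type": "pd-standard"},
        },
        "worker_config": {
            "num_instances": 6,
            "machine_type_uri": "n2-standard-8",
            "disk_config": {"boot_disk_type": "pd-standard"},
        },
        "secondary_worker_config": {"num_instances": 4},
    },
}

operation = client.create_cluster(
    request={
        "project_id": "helios-analytics",
        "region": region,
        "cluster": cluster,
    }
)
operation.result()
```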

Install Hadoop and Spark on a 12-node Compute Engine managed instance group with standard machine types. Configure the Cloud Storage connector and keep data in Cloud Storage while changing paths from hdfs:// to gs:// is not right because you would be managing the entire stack on virtual machines and that undermines the goal of a managed approach. It also complicates failure handling for stateful services and provides no Hadoop aware cluster management.

Create a Dataproc cluster that uses SSD persistent disks and make about 40% of the workers preemptible. Store data in Cloud Storage and update job paths from hdfs:// to gs:// is close but it increases cost by using SSD persistent disks when the data lives in Cloud Storage. For these workloads standard persistent disks are usually sufficient unless the jobs are unusually heavy on local shuffle I/O.

Rebuild the pipelines to run on Dataflow and write outputs to BigQuery instead of HDFS or Cloud Storage does not meet the requirement for a straightforward migration of existing Hadoop jobs. It requires a substantial rewrite to a different processing model and a change in storage targets which is unnecessary to achieve the stated goals.

When a question asks for minimal change migration of Hadoop with low cost and resilience, look for a managed service, externalize data to Cloud Storage, and mix in preemptible workers while keeping primaries nonpreemptible. Choose standard disks unless local I/O is clearly the bottleneck.

Question 20

What should you implement so that each Google Cloud project can view its BigQuery slot allocation with a 48 hour trend without building custom pipelines?

  • ✓ C. Cloud Monitoring dashboard with BigQuery metric slots/allocated_for_project

The correct option is Cloud Monitoring dashboard with BigQuery metric slots/allocated_for_project.

This choice uses the native BigQuery Reservations metric that Cloud Monitoring already collects, so you can add it to a dashboard and set the time range to 48 hours with no custom ingestion or transformation. Each project can view its own slot allocation trend directly in Monitoring because the metric is scoped and labeled for project level visibility.
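For completeness, the same metric can also be read with the Monitoring API, as in this hypothetical sketch, although a dashboard alone satisfies the requirement. The project ID is a placeholder.

```python
import time
from google.cloud import monitoring_v3

client = monitoring_v3.MetricServiceClient()
project_name = "projects/my-project"

# 48 hour window ending now.
now = int(time.time())
interval = monitoring_v3.TimeInterval(
    {"start_time": {"seconds": now - 48 * 3600}, "end_time": {"seconds": now}}
)

results = client.list_time_series(
    request={
        "name": project_name,
        "filter": 'metric.type = "bigquery.googleapis.com/slots/allocated_for_project"',
        "interval": interval,
        "view": monitoring_v3.ListTimeSeriesRequest.TimeSeriesView.FULL,
    }
)
for series in results:
    for point in series.points:
        print(point.interval.end_time, point.value)
```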

Cloud Monitoring with query/scanned_bytes is about data volume processed by queries and it does not represent slot allocation, so it cannot show how many slots are allocated to a project.

Looker Studio on BigQuery audit logs would require collecting and modeling logs, and audit logs do not provide a direct metric of project slot allocation. This does not meet the requirement to avoid building custom pipelines.

Cloud Logging export with custom metric from totalSlotMs requires setting up log sinks, parsing job statistics, and creating a custom metric, which is exactly the type of custom pipeline the question asks you to avoid. It also focuses on job consumption rather than the allocation level of slots for a project.

When you see a requirement to visualize resource usage without building pipelines, prefer built in Cloud Monitoring metrics and dashboards and match the metric name to the exact resource asked about.

Jira, Scrum & AI Certification

Want to get certified on the most popular software development technologies of the day? These resources will help you get Jira certified, Scrum certified and even AI Practitioner certified so your resume really stands out.

You can even get certified in the latest AI, ML and DevOps technologies. Advance your career today.

Cameron McKenzie is an AWS Certified AI Practitioner, Machine Learning Engineer, Copilot Expert, Solutions Architect and author of many popular books in the software development and Cloud Computing space. His growing YouTube channel training devs in Java, Spring, AI and ML has well over 30,000 subscribers.