Google Data Engineer Certification Exam Dumps and Braindumps

All GCP questions come from my GCP Data Engineer Udemy course and certificationexams.pro
Free GCP Certification Exam Topics Tests
Despite the title of this section, this is not a GCP braindump in the traditional sense and it does not promote cheating.
Traditionally, the term braindump referred to someone taking an exam, memorizing the questions, and sharing them online for others to use, which is unethical and violates certification agreements.
This set of resources is intended for legitimate learning. Start with GCP Professional Data Engineer Practice Questions, assess with Real GCP Certified Data Engineer Exam Questions, and use the concise Professional Data Engineer Braindump as a quick study aid that teaches, not cheats.
For timed drills and realistic pacing, use the Google Certified Data Engineer Exam Simulator, deepen understanding with GCP Certified Professional Data Engineer Questions and Answers, perform rapid checks with the Google Certified Data Engineer Exam Dump, and refine skills with GCP Professional Data Engineer Sample Questions and the full Google Certified Data Engineer Exam Questions or the comprehensive Google Certified Professional Data Engineer Practice Test.
Google Cloud Certification Practice Exams
Each question set has been carefully written to align with the official Google Cloud Professional Data Engineer exam objectives.
They mirror the tone, logic, and technical depth of real exam scenarios without copying any protected content.
Items are designed to help you learn, reason, and master key domains such as designing data storage systems, building and operationalizing data processing solutions, operationalizing machine learning models, ensuring solution quality and reliability, and managing security and compliance on Google Cloud.
If you can answer these questions and explain why the distractors are incorrect, you will not only be prepared to pass the real exam but also gain a solid understanding of how to evaluate tradeoffs in storage formats, optimize BigQuery performance, implement streaming and batch pipelines, secure data with IAM and encryption, and monitor costs and SLIs for production workloads.
About GCP Exam Dumps
You can call this a GCP exam dump if you like, but remember the purpose here is to teach through detailed explanations, realistic examples, and insights that help you think like a Google engineer.
Study with focus, practice consistently, and approach your certification with integrity.
Use the GCP Exam Simulator and the Google Certified Professional Practice Tests to prepare effectively and move closer to earning your Google Cloud Professional Data Engineer certification.
Google Data Engineer Exam Dump Questions
Question 1
Metro City Library is moving its catalog from an on premises warehouse to BigQuery and it tracks books along with contributor details such as authors and publication years. At present the author records are kept in a separate table that links to the books table through a common key. Based on Google’s recommended BigQuery schema practices, how should you model the data to get the best performance when patrons view lists of checked out books with their authors?
-
❏ A. Keep the normalized schema and expose a view that joins book_main to author_dim for every request
-
❏ B. Create a wide table with top level columns such as author_first_name and author_last_name
-
❏ C. Create a single denormalized BigQuery table that stores each book and nests a repeated STRUCT column named authors with the author attributes
-
❏ D. Store only author_id values in an ARRAY on the book record and keep attributes in the author_dim table
Question 2
In Apache Beam, which feature enables a DoFn to access a small reference dataset for each element when that dataset is not included in the main PCollection?
-
❏ A. CoGroupByKey
-
❏ B. Side input
-
❏ C. State and timers
-
❏ D. Custom window
Question 3
At RetailNexus, analysts run the same BigQuery query against a single fact table many times per day to refresh a KPI dashboard. The table is about 2 GB and a subset of rows is updated roughly 15 times per hour. You want to speed up these repeated reads without changing the schema or the query logic. Which optimization should you apply?
-
❏ A. Enable BigQuery BI Engine for the dataset
-
❏ B. Rely on the BigQuery results cache for subsequent identical runs
-
❏ C. Create a materialized view on the table and query the view instead
-
❏ D. Reserve additional BigQuery slots using Reservations
Question 4
Which Google Cloud managed relational database offers global scalability and high availability with minimal operational overhead for a workload of about 25 million transactions per month?
-
❏ A. AlloyDB for PostgreSQL
-
❏ B. Cloud Spanner
-
❏ C. Cloud SQL for PostgreSQL
-
❏ D. Cloud Bigtable
Question 5
You are building a streaming analytics feature for a scooter dispatch platform named GlideCity that must highlight neighborhoods with surging requests so available riders can be rebalanced. Events from mobile apps and vehicle trackers flow into Pub/Sub for ingestion and processing. Scooter location pings arrive every 6 seconds and customer trip requests arrive continuously. Your pipeline must compute rolling supply and demand aggregates every 3 seconds that cover the most recent 45 seconds of data, and it must store the results in a store that dashboards can read with very low latency. Which approach should you take?
-
❏ A. Set up a Dataflow streaming job that uses a session window and write aggregates to BigQuery
-
❏ B. Build a Dataflow pipeline with a hopping window and store the results in Cloud Bigtable
-
❏ C. Create a Dataflow pipeline that applies a hopping window and write the rolling aggregates to Memorystore for Redis
-
❏ D. Run a Dataflow pipeline with a tumbling window and persist the computed metrics in Memorystore for Redis
Question 6
How should you enable a 30 day contractor to develop and test a Cloud Dataflow pipeline using BigQuery data while safeguarding PII and keeping the production environment isolated?
-
❏ A. VPC Service Controls with prod dataset read access
-
❏ B. De identified dataset in separate project
-
❏ C. Cloud Dataflow Developer on production project
-
❏ D. BigQuery authorized views on production data
Question 7
SummitTrail Logistics runs a Cloud Spanner database that stores address reference data where table geo_countries lists countries and table geo_provinces lists provinces that reference their country_id through a foreign key. Analytical and transactional queries join these tables against about 8 million address rows and latency has increased during peak hours. Following Google guidance for Spanner data modeling, what change should you make to improve query performance?
-
❏ A. Create a secondary index on geo_provinces.country_id and rewrite joins to use the index
-
❏ B. Make geo_provinces an interleaved child of geo_countries using the country_id parent key
-
❏ C. Store all provinces for each country in one STRING column like “CA,ON,QC,BC” and parse it when needed
-
❏ D. Flatten the schema by denormalizing so each province row repeats its country attributes
Question 8
During peak periods your BigQuery dashboards and ad hoc queries slow down, and you suspect queued jobs and slot contention. Which tools should you use to inspect job metadata and slot utilization to identify the bottleneck?
-
❏ A. Use Cloud Monitoring and create an alert on BigQuery slot utilization over 90 percent
-
❏ B. Use BigQuery Admin Resource Charts with INFORMATION_SCHEMA JOBS and JOBS_TIMELINE
-
❏ C. Analyze BigQuery audit logs in Cloud Logging
Question 9
Your analytics team at Aurora Outfitters needs to move datasets from an on premises environment into BigQuery on Google Cloud. Some sources are continuous feeds at about 2 MB per second and others are nightly file drops that total roughly 5 TB. You must programmatically mask sensitive fields before the data lands in BigQuery and you want to keep costs low while supporting both streaming and batch ingestion. What should you build?
-
❏ A. Use Cloud Data Fusion with the Cloud DLP plugin to de identify records in the pipeline and then write to BigQuery
-
❏ B. Configure BigQuery Data Transfer Service to load your data and then call the Cloud DLP API to mask sensitive fields after tables are populated
-
❏ C. Develop a Dataflow pipeline with the Apache Beam SDK for Python that supports both streaming and batch and call Cloud DLP from the pipeline to mask data before writing to BigQuery
-
❏ D. Run Spark jobs on Dataproc that call the Cloud DLP API and then write the results into BigQuery
Question 10
How can you allow other Google Cloud projects to query only aggregated results from your BigQuery tables without copying data and ensure that the querying projects incur the query costs?
-
❏ A. Materialized aggregate table in a shared dataset
-
❏ B. Authorized view with cross-project sharing
-
❏ C. Column-level access policies

Question 11
Your team at Northwind Sports operates a BigQuery dataset named crm_data with a table called customers_2026 that stores sensitive attributes such as full_name and street_address. You must share this data with two internal groups under different restrictions. The analytics group can query all customer rows but must not see any sensitive columns, and the support group needs every column but only for customers whose contract_status is ‘ACTIVE’. You configured an authorized dataset and attached policy tags to the sensitive fields, yet the analytics group still sees those fields in query results. Which actions will correctly enforce the intended access controls? (Choose 2)
-
❏ A. Use VPC Service Controls to isolate the BigQuery project and block access to sensitive columns
-
❏ B. Remove the bigquery.dataViewer role on the authorized dataset from the analytics group
-
❏ C. Grant access through authorized views that omit sensitive columns for analytics and apply a row-level filter for support
-
❏ D. Tighten the policy tag taxonomy so tagged fields are not visible to the analytics group
-
❏ E. Remove the Data Catalog Fine-Grained Reader role from the analytics group
Question 12
How should you configure a Pub/Sub push subscription to handle brief outages by retaining unacknowledged messages, retry deliveries without overwhelming the service, and send messages to a dead letter topic after 10 attempts?
-
❏ A. Increase the acknowledgement deadline to 9 minutes
-
❏ B. Immediate redelivery retries and dead-letter to another topic after 10 attempts
-
❏ C. Configure exponential backoff retries and dead-letter to a separate topic with a 10 attempt limit
-
❏ D. Use exponential backoff retries and dead-letter to the original topic with a 10 attempt limit
Question 13
Seabright Insights, a retail analytics startup, plans to run several separate data programs. Each program needs its own Compute Engine virtual machines, Cloud Storage buckets, and Cloud Functions, and all programs must follow the same compliance rules. How should the team structure the Google Cloud resource hierarchy so the controls are applied consistently and are simple to manage?
-
❏ A. Create a folder for each program and place those folders inside one project, then apply the required constraints on that project
-
❏ B. Put all resources for every program into a single project and use labels with Organization Policy conditions to enforce the rules
-
❏ C. Create one project per program and group all projects under a single folder, then set the constraints on that folder so they inherit to each project
-
❏ D. Create a separate organization for each program and attach the constraints at the organization level
Question 14
Which Google Cloud storage option provides ACID transactions, horizontal scalability, and efficient range queries on non-primary key columns for a 25 TB workload?
-
❏ A. Cloud Bigtable with a row key on sale_date and customer_id
-
❏ B. Cloud Spanner with secondary indexes on non key columns
-
❏ C. AlloyDB for PostgreSQL with secondary indexes
Question 15
You are building a low latency analytics layer for AeroTrack Labs using BigQuery streaming inserts. Each message includes a globally unique event_id and an event_time value, yet retries can produce duplicate rows that may arrive up to 45 seconds late. You need interactive queries to return exactly one row per event without modifying the ingestion pipeline. Which query pattern should you use to consistently remove duplicates at read time?
-
❏ A. Use the LAG window function with PARTITION BY event_id and filter rows where the previous value is not null
-
❏ B. Run a BigQuery MERGE to upsert into a table keyed by event_id and have queries read from that table
-
❏ C. Use ROW_NUMBER with PARTITION BY event_id and ORDER BY event_time descending then keep rows where the sequence equals 1
-
❏ D. Sort by event_time in descending order and return only one row with LIMIT 1
Question 16
Which transformation should you apply to the email field before loading it into BigQuery so the two datasets can still be joined while preventing analysts from viewing PII?
-
❏ A. Enable BigQuery dynamic data masking with a policy tag
-
❏ B. Encrypt emails with Cloud KMS and store the ciphertext
-
❏ C. Use Cloud DLP deterministic FPE FFX before loading
-
❏ D. Compute a salted SHA256 hash during load
Question 17
CedarPeak Logistics is standardizing analytics on BigQuery and wants to adopt ELT practices. The engineering team is fluent in SQL and wants a developer friendly workspace where they can write modular SQL, track changes with Git, declare dependencies between datasets, run tests, and schedule daily builds for about 180 tables in the analytics_ops dataset. Which Google Cloud service should they use to develop and manage these SQL transformation pipelines?
-
❏ A. Cloud Composer with BigQuery operators
-
❏ B. BigQuery scheduled queries
-
❏ C. Dataform
-
❏ D. Cloud Data Fusion
Question 18
How should you design Cloud Storage and Dataproc to meet a 15 minute RPO, provide low read latency during normal operations, and ensure processing continues if us-east1 becomes unavailable?
-
❏ A. US multi region Cloud Storage bucket and Dataproc in us-east1 with failover to us-central1
-
❏ B. Single region bucket in us-east1 with Object Versioning and Autoclass
-
❏ C. Dual region bucket us-east1 and us-central1 with turbo replication and local reads plus Dataproc failover
-
❏ D. Dual region bucket us-east1 and us-central1 but read through us-central1 during normal use
Question 19
Your company runs a real time ticket resale platform where buyers often place identical bids for the same listing within a few hundred milliseconds and the requests hit different regional application servers. Each bid event includes a listing identifier, the bid amount, a user identifier, and the event timestamp. You need to combine these bid events into a single real time stream so you can determine which user placed the first bid for each listing. What should you do?
-
❏ A. Store bid events in a local MySQL instance on every application server and periodically merge them into a central MySQL database
-
❏ B. Have each application server publish bid events to Cloud Pub/Sub as they happen and use Dataflow with a pull subscription to process the stream and assign the first bid per listing
-
❏ C. Write all bid events from the application servers to a shared network file system and then run an Apache Hadoop batch job to find the earliest bid
-
❏ D. Push bid events from Cloud Pub/Sub to a custom HTTPS endpoint that writes each event into Cloud SQL
Question 20
A training job on Vertex AI takes about 60 hours. You cannot change the compute resources, adjust the batch size, or use distributed training. Which change would most directly reduce the total training time?
-
❏ A. Enable Vertex AI hyperparameter tuning
-
❏ B. Use a smaller training dataset
-
❏ C. Use Vertex AI Pipelines
Question 21
Riverview Savings Cooperative operates branches across three EU regions and needs a storage platform for real time account ledger updates that requires full ACID properties and standard SQL access, and it must scale horizontally while preserving strong transactional consistency. Which Google Cloud solution should you implement?
-
❏ A. AlloyDB for PostgreSQL with cross region read replicas
-
❏ B. Cloud SQL with federated queries from BigQuery
-
❏ C. Cloud Spanner with locking read write transactions
-
❏ D. Cloud Spanner with stale reads enabled
Question 22
A set of nightly Hadoop MapReduce jobs running on Dataproc now processes 15 TB of data and is missing the SLA. You need to improve performance without increasing costs. What should you do?
-
❏ A. Move the batch to BigQuery scheduled queries
-
❏ B. Use Spark on the existing Dataproc cluster with YARN
-
❏ C. Rewrite pipelines for Dataflow with Apache Beam
-
❏ D. Increase Dataproc worker count and size
Question 23
BlueTrail Logistics operates a fleet of autonomous forklifts that send telemetry and environmental readings to an analytics platform for early fault detection. The devices currently post to a custom REST endpoint at about 50 thousand small events per minute, which causes the service to fall behind during bursts and drop some data. The machine learning team asks you to redesign ingestion so that traffic spikes are absorbed and downstream processing stays reliable without losing messages. What should you do?
-
❏ A. Write events using BigQuery streaming inserts and have analytics jobs read the destination table
-
❏ B. Have devices upload data to a Cloud Storage bucket and let the ingestion job process the new objects
-
❏ C. Publish device messages to a Cloud Pub/Sub topic and let the ingestion service pull from a subscription
-
❏ D. Send data to a Cloud SQL for PostgreSQL instance and have the pipeline query the table for new rows
Question 24
Before building BigQuery pipelines, how should an organization quantify data quality issues across 60 days of extracted datasets?
-
❏ A. Load into BigQuery and rely on load errors
-
❏ B. Use Cloud DLP to scan the extracts
-
❏ C. Profile the extracts and run a data quality assessment
-
❏ D. Configure Dataplex data quality rules after loading
Question 25
NovaRetail operates a Standard Tier Memorystore for Redis instance in its production project and needs to conduct a quarterly disaster recovery drill that faithfully reproduces a Redis failover while ensuring production data remains untouched. What should the team do?
-
❏ A. Initiate a manual failover using the limited-data-loss protection mode on the production Memorystore for Redis instance
-
❏ B. Create a Standard Tier Memorystore for Redis instance in a staging project and run a manual failover with the force-data-loss protection mode
-
❏ C. Add an additional replica to the production Redis instance and then perform a manual failover with the force-data-loss protection mode
-
❏ D. Create a Standard Tier Memorystore for Redis instance in a staging project and perform a manual failover with the limited-data-loss protection mode
Google Data Engineer Exam Braindump Questions

Question 1
Metro City Library is moving its catalog from an on premises warehouse to BigQuery and it tracks books along with contributor details such as authors and publication years. At present the author records are kept in a separate table that links to the books table through a common key. Based on Google’s recommended BigQuery schema practices, how should you model the data to get the best performance when patrons view lists of checked out books with their authors?
-
✓ C. Create a single denormalized BigQuery table that stores each book and nests a repeated STRUCT column named authors with the author attributes
The correct option is Create a single denormalized BigQuery table that stores each book and nests a repeated STRUCT column named authors with the author attributes.
This approach follows BigQuery guidance to denormalize and use nested and repeated fields for one to many relationships such as books to authors. By storing author attributes inside each book row as a repeated STRUCT, you avoid joins at query time when patrons view lists of checked out books, which improves latency and reduces bytes scanned. This structure naturally supports multiple authors per book and lets queries select only needed fields for efficient reads.
Keep the normalized schema and expose a view that joins book_main to author_dim for every request is incorrect because a view does not materialize the join and every request would still perform the join, which increases cost and latency compared to denormalized nested data.
Create a wide table with top level columns such as author_first_name and author_last_name is incorrect because a fixed set of author columns does not model multiple authors cleanly and either forces duplication of book rows or many sparse columns, which is inflexible and inefficient.
Store only author_id values in an ARRAY on the book record and keep attributes in the author_dim table is incorrect because you would still need to join to fetch author attributes that patrons expect to see, which reintroduces the join cost that denormalization is meant to avoid.
When you see a one to many relationship that is frequently read together, prefer denormalizing with nested and repeated fields to avoid joins. Look for hints about performance on read and minimizing bytes scanned.
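As a rough illustration of this pattern, the sketch below defines a books table with a repeated authors STRUCT using the BigQuery Python client. The project, dataset, and field names are assumptions chosen for the example.

```python
# Hedged sketch: a denormalized books table with a nested, repeated authors field.
from google.cloud import bigquery

client = bigquery.Client()

schema = [
    bigquery.SchemaField("book_id", "STRING", mode="REQUIRED"),
    bigquery.SchemaField("title", "STRING"),
    bigquery.SchemaField("publication_year", "INTEGER"),
    # One book can have many authors, so authors is a repeated RECORD (STRUCT).
    bigquery.SchemaField(
        "authors",
        "RECORD",
        mode="REPEATED",
        fields=[
            bigquery.SchemaField("author_id", "STRING"),
            bigquery.SchemaField("first_name", "STRING"),
            bigquery.SchemaField("last_name", "STRING"),
        ],
    ),
]

table = bigquery.Table("my-project.library.books", schema=schema)
client.create_table(table)  # patron queries now read authors without any join
```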
Question 2
In Apache Beam, which feature enables a DoFn to access a small reference dataset for each element when that dataset is not included in the main PCollection?
-
✓ B. Side input
The correct option is Side input.
A Side input lets a DoFn access a small reference dataset that is not part of the main PCollection. It is provided per element and per window and is intended for read only auxiliary data such as lookup tables. Beam materializes this data appropriately for each window so the DoFn can use it alongside each incoming element.
CoGroupByKey joins multiple keyed PCollections on their keys. It is used to merge datasets by key rather than to supply a small reference dataset to each element within a DoFn.
State and timers enable per key stateful processing and time based callbacks in streaming pipelines. They are for managing evolving per key state rather than providing a static reference dataset to all elements.
Custom window defines how elements are grouped into windows based on time. Windowing configures grouping and triggering behavior and does not supply auxiliary data to a DoFn.
When a question asks how to feed a small read only dataset to every element in a DoFn, think side inputs. If it mentions joining multiple keyed PCollections, consider CoGroupByKey instead.
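The minimal sketch below shows the side input pattern with the Beam Python SDK. The lookup values and elements are made up for illustration.

```python
# Hedged sketch: pass a small lookup PCollection to a transform as a side input.
import apache_beam as beam

with beam.Pipeline() as p:
    lookup = p | "Lookup" >> beam.Create([("US", "United States"), ("CA", "Canada")])
    events = p | "Events" >> beam.Create(["US", "CA", "US"])

    enriched = events | "Enrich" >> beam.Map(
        # country_names arrives as a dict that is available for every element.
        lambda code, country_names: (code, country_names.get(code, "unknown")),
        country_names=beam.pvalue.AsDict(lookup),
    )
    enriched | "Print" >> beam.Map(print)
```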
Question 3
At RetailNexus, analysts run the same BigQuery query against a single fact table many times per day to refresh a KPI dashboard. The table is about 2 GB and a subset of rows is updated roughly 15 times per hour. You want to speed up these repeated reads without changing the schema or the query logic. Which optimization should you apply?
-
✓ C. Create a materialized view on the table and query the view instead
The correct option is Create a materialized view on the table and query the view instead.
It precomputes the query result and keeps it fresh with incremental maintenance as the base table changes, so repeated runs read precomputed data rather than rescanning the full table. This is well suited when only a subset of rows changes many times per hour because BigQuery only recomputes the parts that changed. You do not need to alter the table schema and you only need to point your query to the materialized view.
Enable BigQuery BI Engine for the dataset is not the best fit here because it accelerates supported interactive analytics with in memory processing but it does not precompute and incrementally maintain the exact result of your query. With frequent updates you still incur recomputation, SQL feature support can be limiting, and capacity sizing is an extra concern, so it is less reliable than a materialized view for this pattern.
Rely on the BigQuery results cache for subsequent identical runs will not help because the cache is invalidated whenever the underlying table changes. Since rows are updated many times per hour the cache would rarely be used and most runs would still scan the table.
Reserve additional BigQuery slots using Reservations does not address the root cause because slots guarantee capacity for concurrency rather than avoid rescanning unchanged data. For a 2 GB table that is read repeatedly the real speedup comes from precomputing the result, which is exactly what a materialized view provides.
Check whether the data changes between runs. If it does then the results cache will not help and a materialized view is often the fastest way to serve repeated reads.
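A hedged sketch of the approach with the BigQuery Python client follows, where the project, dataset, and column names are assumptions.

```python
# Hedged sketch: precompute the dashboard aggregation as a materialized view.
from google.cloud import bigquery

client = bigquery.Client()

client.query(
    """
    CREATE MATERIALIZED VIEW `my-project.analytics.daily_kpis_mv` AS
    SELECT store_id, DATE(order_ts) AS order_date, SUM(revenue) AS total_revenue
    FROM `my-project.analytics.fact_orders`
    GROUP BY store_id, order_date
    """
).result()

# The dashboard then points at the view, for example:
# SELECT * FROM `my-project.analytics.daily_kpis_mv` WHERE order_date = CURRENT_DATE()
```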
Question 4
Which Google Cloud managed relational database offers global scalability and high availability with minimal operational overhead for a workload of about 25 million transactions per month?
-
✓ B. Cloud Spanner
The correct option is Cloud Spanner because it is a managed relational database that provides global scalability and high availability with minimal operations.
Cloud Spanner is a fully managed relational service that delivers strong consistency, automatic multi region replication, and transparent failover. It scales horizontally to handle millions of transactions per month while minimizing operational overhead through managed maintenance and backups.
AlloyDB for PostgreSQL offers high performance and high availability, yet it is primarily a regional service and does not provide the same global horizontal scalability and multi region transactional consistency that Cloud Spanner is designed for.
Cloud SQL for PostgreSQL is a managed instance service that works well for regional deployments and smaller scale needs. It does not offer seamless global scaling or cross region transactional consistency and typically requires more operational effort to scale and achieve high availability compared with Cloud Spanner.
Cloud Bigtable is a managed NoSQL wide column database and is not a relational service. It is optimized for analytical or large scale key value workloads rather than transactional relational workloads, so it does not meet the requirement for a managed relational database.
Match the keywords to the product. If you see global relational scale with strong consistency and minimal operations, think Cloud Spanner. If the need is regional PostgreSQL compatibility, consider other services, and remember that Bigtable is not relational.
Question 5
You are building a streaming analytics feature for a scooter dispatch platform named GlideCity that must highlight neighborhoods with surging requests so available riders can be rebalanced. Events from mobile apps and vehicle trackers flow into Pub/Sub for ingestion and processing. Scooter location pings arrive every 6 seconds and customer trip requests arrive continuously. Your pipeline must compute rolling supply and demand aggregates every 3 seconds that cover the most recent 45 seconds of data, and it must store the results in a store that dashboards can read with very low latency. Which approach should you take?
-
✓ C. Create a Dataflow pipeline that applies a hopping window and write the rolling aggregates to Memorystore for Redis
The correct option is Create a Dataflow pipeline that applies a hopping window and write the rolling aggregates to Memorystore for Redis.
A hopping window with a 45 second size and a 3 second period matches the requirement to compute rolling aggregates that always cover the most recent 45 seconds and refresh every 3 seconds. In Apache Beam on Dataflow this is the same concept as a sliding window, which continuously advances and produces overlapping windows that align with the requested cadence. Writing the results to Memorystore for Redis gives very low read latency and fast updates, which suits live dashboards that must reflect fresh counts for supply and demand.
Set up a Dataflow streaming job that uses a session window and write aggregates to BigQuery is not appropriate because session windows group events by gaps in activity rather than fixed time spans, so they do not produce a consistent 45 second rolling view. BigQuery is optimized for analytics rather than subsecond lookups, and streaming inserts can have higher latency for dashboard reads.
Build a Dataflow pipeline with a hopping window and store the results in Cloud Bigtable selects a low latency database, yet it is less suitable than an in memory cache for rapidly refreshed rolling aggregates that dashboards poll frequently. Bigtable is excellent for large scale time series and wide key access patterns, but for hot counters updated every few seconds and read with very low latency, Redis is a better fit.
Run a Dataflow pipeline with a tumbling window and persist the computed metrics in Memorystore for Redis uses a nonoverlapping window type that cannot produce a rolling 45 second view updated every 3 seconds. Tumbling windows emit only at fixed, nonoverlapping boundaries, so the dashboard would miss intermediate updates between boundaries.
Translate the requirement into window parameters by mapping the coverage to the window size and the refresh rate to the window period. Then choose the sink by the latency the dashboard needs and prefer in memory caches when reads must be extremely fast and continuously refreshed.
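A simplified sketch of the windowing with the Beam Python SDK and the redis-py client is shown below. The Pub/Sub topic, parsing logic, key format, and Memorystore address are assumptions, and a real job would also set streaming pipeline options.

```python
# Hedged sketch: 45 second windows that advance every 3 seconds, written to Redis.
import apache_beam as beam
from apache_beam import window

def write_to_redis(kv):
    import redis  # redis-py, connected per call only to keep the sketch short
    neighborhood, demand = kv
    r = redis.Redis(host="10.0.0.3", port=6379)  # Memorystore IP is illustrative
    r.set(f"demand:{neighborhood}", demand)

with beam.Pipeline() as p:
    (
        p
        | "Read" >> beam.io.ReadFromPubSub(topic="projects/my-project/topics/trip-requests")
        | "Parse" >> beam.Map(lambda msg: (msg.decode("utf-8"), 1))
        # Hopping (sliding) windows, 45 second size with a 3 second period.
        | "Window" >> beam.WindowInto(window.SlidingWindows(size=45, period=3))
        | "Count" >> beam.CombinePerKey(sum)
        | "Write" >> beam.Map(write_to_redis)
    )
```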
Question 6
How should you enable a 30 day contractor to develop and test a Cloud Dataflow pipeline using BigQuery data while safeguarding PII and keeping the production environment isolated?
-
✓ B. De identified dataset in separate project
The correct option is De identified dataset in separate project. This approach lets the contractor build and test with representative data while removing PII and keeping all activity outside the production environment.
Create a new project and copy only the needed BigQuery tables into a de-identified dataset so sensitive fields are masked or tokenized. Use tooling such as Cloud DLP or SQL transforms to de-identify data, then grant the contractor only the minimal BigQuery and Dataflow permissions in that project. Run Dataflow staging and worker resources there so production data, quotas, and IAM remain isolated.
VPC Service Controls with prod dataset read access is not sufficient because it still gives direct read access to production data. VPC Service Controls primarily help reduce data exfiltration risk and do not remove PII or provide the isolation required for a temporary contractor.
Cloud Dataflow Developer on production project places development in the production environment and increases the risk of accidental impact. It also does not prevent exposure to PII and violates the goal of keeping the contractor away from production.
BigQuery authorized views on production data can restrict columns or rows, yet the work still runs against production and any misconfiguration can expose sensitive data. It also fails to provide the environment isolation expected for a short term contractor.
When the scenario mentions contractors or short term access, prefer a separate project with least privilege and a de-identified dataset. Solutions that still touch production rarely meet isolation and PII protection requirements.
Question 7
SummitTrail Logistics runs a Cloud Spanner database that stores address reference data where table geo_countries lists countries and table geo_provinces lists provinces that reference their country_id through a foreign key. Analytical and transactional queries join these tables against about 8 million address rows and latency has increased during peak hours. Following Google guidance for Spanner data modeling, what change should you make to improve query performance?
-
✓ B. Make geo_provinces an interleaved child of geo_countries using the country_id parent key
The correct choice is Make geo_provinces an interleaved child of geo_countries using the country_id parent key.
Interleaving the child under its parent co-locates province rows with their country rows on the same key range which makes joins between countries and provinces much cheaper. With a primary key that starts with country_id the reads for country and its provinces become contiguous so Spanner can serve both transactional and analytical joins with fewer cross-partition hops and lower latency during peak load. This follows Spanner schema design guidance that favors locality when the access pattern frequently joins a small child set to its parent.
Create a secondary index on geo_provinces.country_id and rewrite joins to use the index is not the best fix for this pattern because the join still requires fetching base table rows and can increase write amplification due to index maintenance. It may help isolated filters by country but it does not deliver the same locality benefits as interleaving for frequent parent child joins.
Store all provinces for each country in one STRING column like “CA,ON,QC,BC” and parse it when needed is an anti-pattern because it harms queryability and prevents efficient joins and referential integrity. Parsing strings adds CPU cost and makes filtering and joining against 8 million address rows inefficient.
Flatten the schema by denormalizing so each province row repeats its country attributes duplicates data and increases the risk of inconsistency and higher write costs. It still does not provide the data locality that an interleaved parent child design gives for the joins described.
When a Spanner question describes frequent parent child joins with a modest number of children per parent, think of interleaving to gain data locality and reduce cross-partition work. Validate whether the workload is mostly joins rather than simple filters since indexes help selective predicates while interleaving helps locality for joins.
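A hedged sketch of the interleaved definition with the Spanner Python client follows. The instance, database, and column definitions are assumptions for the example.

```python
# Hedged sketch: recreate geo_provinces as an interleaved child of geo_countries.
from google.cloud import spanner

client = spanner.Client()
database = client.instance("addresses-instance").database("addresses-db")

operation = database.update_ddl([
    """
    CREATE TABLE geo_provinces (
        country_id  STRING(36) NOT NULL,
        province_id STRING(36) NOT NULL,
        name        STRING(256)
    ) PRIMARY KEY (country_id, province_id),
      INTERLEAVE IN PARENT geo_countries ON DELETE CASCADE
    """
])
operation.result()  # wait for the schema change to complete
```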
Question 8
During peak periods your BigQuery dashboards and ad hoc queries slow down, and you suspect queued jobs and slot contention. Which tools should you use to inspect job metadata and slot utilization to identify the bottleneck?
-
✓ B. Use BigQuery Admin Resource Charts with INFORMATION_SCHEMA JOBS and JOBS_TIMELINE
The correct option is Use BigQuery Admin Resource Charts with INFORMATION_SCHEMA JOBS and JOBS_TIMELINE.
BigQuery Admin Resource Charts let you visualize reservation and slot utilization over time so you can see when slots are saturated and whether there is backlog. Pairing that with INFORMATION_SCHEMA JOBS and JOBS_TIMELINE gives you per job metadata and execution timelines so you can confirm jobs were queued and map them to periods of high slot usage. Together these tools directly reveal whether slot contention or queued jobs are the source of the slowdown.
Use Cloud Monitoring and create an alert on BigQuery slot utilization over 90 percent is not sufficient for diagnosis because an alert only tells you that utilization was high. It does not provide the detailed job level metadata or timelines needed to attribute the contention to specific queries.
Analyze BigQuery audit logs in Cloud Logging can confirm that jobs ran and who triggered them but it does not report slot utilization or queue wait times. Audit logs lack the timeline and reservation insights required to pinpoint slot contention.
Start by checking slot utilization in Admin Resource Charts and then query INFORMATION_SCHEMA JOBS and JOBS_TIMELINE to tie queued jobs to the spike, and use alerts only for awareness not for root cause.
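For example, a diagnostic query like the following sketch summarizes slot usage and queued jobs from JOBS_TIMELINE_BY_PROJECT. The region qualifier and the one day lookback are assumptions you would adapt.

```python
# Hedged sketch: find the busiest one second periods and how many jobs were queued.
from google.cloud import bigquery

client = bigquery.Client()

sql = """
SELECT
  period_start,
  SUM(period_slot_ms) / 1000 AS slot_seconds,
  COUNTIF(state = 'PENDING') AS queued_jobs
FROM `region-us`.INFORMATION_SCHEMA.JOBS_TIMELINE_BY_PROJECT
WHERE job_creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
GROUP BY period_start
ORDER BY slot_seconds DESC
LIMIT 20
"""
for row in client.query(sql):
    print(row.period_start, row.slot_seconds, row.queued_jobs)
```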
Question 9
Your analytics team at Aurora Outfitters needs to move datasets from an on premises environment into BigQuery on Google Cloud. Some sources are continuous feeds at about 2 MB per second and others are nightly file drops that total roughly 5 TB. You must programmatically mask sensitive fields before the data lands in BigQuery and you want to keep costs low while supporting both streaming and batch ingestion. What should you build?
-
✓ C. Develop a Dataflow pipeline with the Apache Beam SDK for Python that supports both streaming and batch and call Cloud DLP from the pipeline to mask data before writing to BigQuery
The correct option is Develop a Dataflow pipeline with the Apache Beam SDK for Python that supports both streaming and batch and call Cloud DLP from the pipeline to mask data before writing to BigQuery.
A Beam pipeline on Dataflow uses one unified programming model for both streaming and batch so it can handle the continuous 2 MB per second feeds and the 5 TB nightly loads with the same codebase. You can invoke Cloud DLP from within the pipeline so sensitive fields are de identified before any write occurs to BigQuery which satisfies the requirement to mask data prior to landing. The service is serverless and autoscaling which helps keep costs low for variable throughput and it offers native connectors to Pub/Sub, Cloud Storage, and BigQuery for both streaming inserts and efficient batch loads.
This approach also supports cost control because you can stream only the real time feeds while using batch loads from Cloud Storage to BigQuery for the nightly 5 TB files. The masking happens in flight which avoids ever storing raw sensitive values in BigQuery or in intermediate tables.
Use Cloud Data Fusion with the Cloud DLP plugin to de identify records in the pipeline and then write to BigQuery can perform masking, yet it introduces additional platform and execution costs and is better aligned to managed ETL patterns than to a single unified pipeline that efficiently handles both continuous streaming and large nightly batch at low cost.
Configure BigQuery Data Transfer Service to load your data and then call the Cloud DLP API to mask sensitive fields after tables are populated violates the requirement to mask data before it lands in BigQuery and DTS is focused on scheduled batch transfers from supported sources rather than custom on premises streaming feeds.
Run Spark jobs on Dataproc that call the Cloud DLP API and then write the results into BigQuery could achieve the transformation but you must provision and manage clusters and keep resources running for streaming which raises operational overhead and cost compared to a fully managed serverless runner. It also does not provide the same streamlined experience for one pipeline that covers both modes.
When a question asks for both streaming and batch with low cost and masking that happens before data lands, look for a serverless unified pipeline that can call DLP inline. This often points to Dataflow with Apache Beam rather than DTS, Data Fusion, or Dataproc.
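A skeleton of the masking step, with the Pub/Sub and Cloud Storage reads and the BigQuery writes omitted, might look like the following. The info types, masking configuration, and project path are assumptions for illustration.

```python
# Hedged sketch: a DoFn that calls Cloud DLP to mask findings before loading.
import apache_beam as beam

class MaskWithDlp(beam.DoFn):
    def setup(self):
        from google.cloud import dlp_v2
        self.dlp = dlp_v2.DlpServiceClient()
        self.parent = "projects/my-project/locations/global"

    def process(self, record):
        response = self.dlp.deidentify_content(
            request={
                "parent": self.parent,
                "inspect_config": {"info_types": [{"name": "EMAIL_ADDRESS"}]},
                "deidentify_config": {
                    "info_type_transformations": {
                        "transformations": [{
                            "primitive_transformation": {
                                "character_mask_config": {"masking_character": "#"}
                            }
                        }]
                    }
                },
                "item": {"value": record},
            }
        )
        yield response.item.value

# Used as ... | beam.ParDo(MaskWithDlp()) | beam.io.WriteToBigQuery(...) in both
# the streaming and the batch branches of the pipeline.
```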
Question 10
How can you allow other Google Cloud projects to query only aggregated results from your BigQuery tables without copying data and ensure that the querying projects incur the query costs?
-
✓ B. Authorized view with cross-project sharing
The correct option is Authorized view with cross-project sharing.
An authorized view exposes only the results of a SQL query to other projects while keeping the underlying tables secured. You define the aggregation in the view so consumers can query only the aggregated output and cannot see the raw rows. Because a view is virtual there is no data copy and the data remains in your dataset.
With Authorized view with cross-project sharing you grant the view access to the source dataset and you grant the consumer project permission to query the view. When consumers run the query from their own project the query job is billed to their project so they pay for the processing.
Materialized aggregate table in a shared dataset is not appropriate because it creates and stores a separate table which means copying or materializing data and it requires refresh and management. It also does not inherently enforce that consumers can query only aggregated results beyond what you precompute.
Column-level access policies restrict which columns are visible but they still expose row-level data for the allowed columns and they do not force aggregation. They do not provide a way to share only a computed aggregate result set across projects.
When a question stresses only aggregated results and without copying data think authorized views. Confirm who pays by checking which project runs the query job and ensure the consumer runs it from their project.
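As a rough sketch, the setup could look like the following with the BigQuery Python client, where the project, dataset, and view names are assumptions.

```python
# Hedged sketch: create an aggregate-only view and authorize it on the source dataset.
from google.cloud import bigquery

client = bigquery.Client(project="data-owner-project")

# 1. A view that exposes only aggregated results.
view = bigquery.Table("data-owner-project.shared_views.daily_sales_summary")
view.view_query = """
    SELECT region, DATE(order_ts) AS order_date, SUM(amount) AS total_sales
    FROM `data-owner-project.sales_raw.orders`
    GROUP BY region, order_date
"""
view = client.create_table(view)

# 2. Authorize the view against the raw dataset so it can read the source tables.
source = client.get_dataset("data-owner-project.sales_raw")
entries = list(source.access_entries)
entries.append(bigquery.AccessEntry(None, "view", view.reference.to_api_repr()))
source.access_entries = entries
client.update_dataset(source, ["access_entries"])

# Consumer projects are then granted access to shared_views only, and the query
# jobs they run against the view are billed to their own projects.
```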
Question 11
Your team at Northwind Sports operates a BigQuery dataset named crm_data with a table called customers_2026 that stores sensitive attributes such as full_name and street_address. You must share this data with two internal groups under different restrictions. The analytics group can query all customer rows but must not see any sensitive columns, and the support group needs every column but only for customers whose contract_status is ‘ACTIVE’. You configured an authorized dataset and attached policy tags to the sensitive fields, yet the analytics group still sees those fields in query results. Which actions will correctly enforce the intended access controls? (Choose 2)
-
✓ C. Grant access through authorized views that omit sensitive columns for analytics and apply a row-level filter for support
-
✓ E. Remove the Data Catalog Fine-Grained Reader role from the analytics group
The correct actions are Grant access through authorized views that omit sensitive columns for analytics and apply a row-level filter for support and Remove the Data Catalog Fine-Grained Reader role from the analytics group.
Grant access through authorized views that omit sensitive columns for analytics and apply a row-level filter for support ensures the analytics group can query only the columns exposed by the view while the support group can see every column but only rows where the filter returns ACTIVE customers. Authorized views let you share query results without granting direct table access, which means users can only see the columns that the view selects and only the rows that the view or policy allows.
Remove the Data Catalog Fine-Grained Reader role from the analytics group is necessary because column-level security with policy tags is enforced through Data Catalog IAM on policy tags. If users hold that role on the relevant policy tag taxonomy, they can read tagged columns. Removing it ensures the sensitive columns are not readable by analytics, which aligns with the intended restriction.
Use VPC Service Controls to isolate the BigQuery project and block access to sensitive columns is incorrect because VPC Service Controls protect against data exfiltration across service boundaries and do not provide column-level masking or row filtering within BigQuery.
Remove the bigquery.dataViewer role on the authorized dataset from the analytics group is not a fix for column-level or row-level restrictions. It would simply prevent the group from accessing objects in that dataset and would not selectively hide sensitive columns or filter rows in the shared data.
Tighten the policy tag taxonomy so tagged fields are not visible to the analytics group is incorrect because taxonomy structure alone does not change access. Access to tagged columns is controlled by IAM on policy tags, so removing the Fine-Grained Reader role is what enforces the restriction.
When you see requirements that combine column masking and row filtering, think of authorized views or row-level security for sharing and remember that policy-tagged columns are enforced through Data Catalog IAM, especially the Fine-Grained Reader role.
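A hedged sketch of both controls expressed in SQL and submitted through the BigQuery client is shown below. The view name, group address, and column list are assumptions.

```python
# Hedged sketch: column restriction through a view plus a row-level policy.
from google.cloud import bigquery

client = bigquery.Client()

# Analytics group: an authorized view that omits the sensitive columns.
client.query("""
CREATE OR REPLACE VIEW `my-project.crm_shared.customers_no_pii` AS
SELECT customer_id, contract_status, signup_date
FROM `my-project.crm_data.customers_2026`
""").result()
# The view is then authorized on crm_data, as in the previous sketch.

# Support group: a row access policy that exposes only ACTIVE customers.
client.query("""
CREATE OR REPLACE ROW ACCESS POLICY support_active_only
ON `my-project.crm_data.customers_2026`
GRANT TO ('group:support@example.com')
FILTER USING (contract_status = 'ACTIVE')
""").result()
```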
Question 12
How should you configure a Pub/Sub push subscription to handle brief outages by retaining unacknowledged messages, retry deliveries without overwhelming the service, and send messages to a dead letter topic after 10 attempts?
-
✓ C. Configure exponential backoff retries and dead-letter to a separate topic with a 10 attempt limit
The correct option is Configure exponential backoff retries and dead-letter to a separate topic with a 10 attempt limit.
This configuration preserves messages until they are acknowledged and it protects your service during brief outages by spacing out retries rather than sending them all at once. A dead letter policy that uses a separate topic with a maximum of 10 delivery attempts ensures messages are moved after 10 tries so you can inspect or reprocess them without blocking the main pipeline.
Increase the acknowledgement deadline to 9 minutes is not sufficient because push delivery failures are retried by the service independent of the acknowledgement deadline and this setting does not provide a dead letter path or protect your endpoint from bursts.
Immediate redelivery retries and dead-letter to another topic after 10 attempts does not meet the requirement to avoid overload because immediate retries can flood the endpoint during an outage. Using spaced retries is the recommended pattern.
Use exponential backoff retries and dead-letter to the original topic with a 10 attempt limit is not valid because the dead letter topic must be different from the source topic and sending failed messages back to the same topic can create loops and confusion.
Translate each requirement to a feature. Preserve messages and avoid overload points to retry backoff and brief outages favor exponential backoff. Dead-letter after a count means configure a dead letter topic on a separate topic with a specific max delivery attempts.
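A minimal sketch of such a subscription with the Pub/Sub client library might look like this, where the project, topic, subscription, and endpoint names are placeholders.

```python
# Hedged sketch: push subscription with exponential backoff and a dead letter topic.
from google.cloud import pubsub_v1
from google.protobuf import duration_pb2

project = "my-project"
subscriber = pubsub_v1.SubscriberClient()

subscription = subscriber.create_subscription(
    request={
        "name": subscriber.subscription_path(project, "orders-push-sub"),
        "topic": f"projects/{project}/topics/orders",
        "push_config": pubsub_v1.types.PushConfig(
            push_endpoint="https://ingest.example.com/push"
        ),
        # Space out redeliveries instead of hammering the endpoint during an outage.
        "retry_policy": pubsub_v1.types.RetryPolicy(
            minimum_backoff=duration_pb2.Duration(seconds=10),
            maximum_backoff=duration_pb2.Duration(seconds=600),
        ),
        # After 10 failed deliveries, move the message to a separate topic.
        "dead_letter_policy": pubsub_v1.types.DeadLetterPolicy(
            dead_letter_topic=f"projects/{project}/topics/orders-dead-letter",
            max_delivery_attempts=10,
        ),
    }
)
print(subscription.name)
```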
Question 13
Seabright Insights, a retail analytics startup, plans to run several separate data programs. Each program needs its own Compute Engine virtual machines, Cloud Storage buckets, and Cloud Functions, and all programs must follow the same compliance rules. How should the team structure the Google Cloud resource hierarchy so the controls are applied consistently and are simple to manage?
-
✓ C. Create one project per program and group all projects under a single folder, then set the constraints on that folder so they inherit to each project
The correct option is Create one project per program and group all projects under a single folder, then set the constraints on that folder so they inherit to each project.
This approach places a consistent control point at the folder where you can set Organization Policies and IAM once and have them automatically inherited by every project for each program. Inheritance ensures uniform compliance while allowing each program to have its own isolated project for quotas, permissions, lifecycle, and billing scoping. This matches the Google Cloud resource hierarchy in which an organization contains folders and folders contain projects, and the resources such as Compute Engine, Cloud Storage, and Cloud Functions live inside those projects. When you apply constraints on the folder they flow down to all child projects and therefore to the resources within them.
Create a folder for each program and place those folders inside one project, then apply the required constraints on that project is incorrect because folders do not live inside projects. Projects are children of folders or the organization. Even aside from the hierarchy error, attaching constraints at the project would force you to duplicate and maintain policies separately on each project rather than setting them once at a higher level.
Put all resources for every program into a single project and use labels with Organization Policy conditions to enforce the rules is incorrect because a single project removes isolation for quotas and IAM and increases blast radius. Labels are metadata for filtering and cost reporting and they are not a security boundary, and most Organization Policy constraints are not enforced by label, so this would not provide consistent or reliable control.
Create a separate organization for each program and attach the constraints at the organization level is incorrect because an organization maps to a Google Workspace or Cloud Identity account and most companies operate a single organization. Creating multiple organizations complicates identity, billing, and policy management and makes collaboration across programs harder.
When you need consistent controls across many isolated workloads, think one project per workload under a shared folder so policies inherit. Remember that labels are for organization and cost reporting and are not a security boundary, and that organization policy is best applied at a level that matches the scope you want.
Question 14
Which Google Cloud storage option provides ACID transactions, horizontal scalability, and efficient range queries on non-primary key columns for a 25 TB workload?
-
✓ B. Cloud Spanner with secondary indexes on non key columns
The correct option is Cloud Spanner with secondary indexes on non key columns because it provides ACID transactions, horizontal scalability, and efficient range queries on indexed non primary key columns for a 25 TB workload.
Spanner is a distributed relational database that offers fully managed ACID transactions with strong consistency across rows and tables. It automatically shards data to scale horizontally while preserving transactional guarantees, which makes it well suited for tens of terabytes and beyond. Secondary indexes in Spanner enable efficient range scans on non primary key columns, which addresses the requirement for performant queries on alternate attributes.
Cloud Bigtable with a row key on sale_date and customer_id is not suitable because it does not support multirow ACID transactions and it lacks secondary indexes. It depends on row key design for query efficiency, so it would optimize only for that single access pattern and would not provide general range queries on other columns.
AlloyDB for PostgreSQL with secondary indexes provides ACID and rich indexing, yet it does not offer automatic horizontal write scalability across nodes. It can store 25 TB with strong performance, but scaling writes would require manual sharding at the application layer, so it does not meet the horizontal scalability requirement.
When you see a need for ACID plus transparent horizontal scalability and efficient secondary indexes on non primary key columns, map that to Spanner. If the design centers on a single access pattern through a row key, think Bigtable.
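As a small illustration, a secondary index could be added with the Spanner client as in the sketch below, where the instance, database, table, and column names are assumptions.

```python
# Hedged sketch: a secondary index that supports range scans on a non key column.
from google.cloud import spanner

database = spanner.Client().instance("retail-instance").database("sales-db")
database.update_ddl([
    "CREATE INDEX idx_sales_by_date ON sales(sale_date)"
]).result()

# A range query can then target the index, for example:
# SELECT sale_id, sale_date, amount
# FROM sales@{FORCE_INDEX=idx_sales_by_date}
# WHERE sale_date BETWEEN '2024-01-01' AND '2024-03-31'
```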
Question 15
You are building a low latency analytics layer for AeroTrack Labs using BigQuery streaming inserts. Each message includes a globally unique event_id and an event_time value, yet retries can produce duplicate rows that may arrive up to 45 seconds late. You need interactive queries to return exactly one row per event without modifying the ingestion pipeline. Which query pattern should you use to consistently remove duplicates at read time?
-
✓ C. Use ROW_NUMBER with PARTITION BY event_id and ORDER BY event_time descending then keep rows where the sequence equals 1
The correct option is Use ROW_NUMBER with PARTITION BY event_id and ORDER BY event_time descending then keep rows where the sequence equals 1.
Use ROW_NUMBER with PARTITION BY event_id and ORDER BY event_time descending then keep rows where the sequence equals 1 is the standard read-time deduplication pattern in BigQuery. It assigns a ranking within each event_id based on event_time and returns only the top ranked row, which yields exactly one row per event. This works well with streaming inserts where retries may produce duplicates for up to a short period because duplicates share the same event_id and usually the same event_time, and the window function still selects a single canonical record per key. If ties are possible on event_time you can add a secondary tiebreaker such as an ingestion timestamp to keep this deterministic while still using the same pattern.
Use the LAG window function with PARTITION BY event_id and filter rows where the previous value is not null is incorrect because that filter keeps nonfirst rows in the partition, which returns the duplicates rather than eliminating them. Even if inverted, LAG does not provide a straightforward way to retain exactly one row while deterministically selecting the most recent record per key.
Run a BigQuery MERGE to upsert into a table keyed by event_id and have queries read from that table is incorrect because it requires write-time deduplication and modification of the data rather than removing duplicates at read time. It adds operational overhead and does not meet the requirement to avoid changing the ingestion pipeline.
Sort by event_time in descending order and return only one row with LIMIT 1 is incorrect because it returns only a single row for the entire dataset instead of one row per event_id, so it does not deduplicate per key.
When a question asks for exactly one row per key at read time, think window functions. Partition by the unique key and use ROW_NUMBER ordered by a recency field, then keep only rank 1. Add a secondary tiebreaker if timestamps can be equal.
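The pattern as a runnable query, submitted through the BigQuery Python client against an assumed table name, might look like this sketch.

```python
# Hedged sketch: keep exactly one row per event_id at read time.
from google.cloud import bigquery

client = bigquery.Client()

sql = """
SELECT * EXCEPT(rn)
FROM (
  SELECT
    *,
    ROW_NUMBER() OVER (
      PARTITION BY event_id
      ORDER BY event_time DESC
    ) AS rn
  FROM `my-project.telemetry.events_stream`
)
WHERE rn = 1
"""
for row in client.query(sql):
    print(dict(row))
```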

Question 16
Which transformation should you apply to the email field before loading it into BigQuery so the two datasets can still be joined while preventing analysts from viewing PII?
-
✓ C. Use Cloud DLP deterministic FPE FFX before loading
The correct option is Use Cloud DLP deterministic FPE FFX before loading.
With Use Cloud DLP deterministic FPE FFX before loading you apply a deterministic, format-preserving cryptographic transformation using a centrally managed key so the same email always becomes the same token in every dataset. This keeps the columns joinable since equal inputs map to equal tokens. Analysts cannot view PII because they only see tokens, and only controlled reidentification workflows that hold the key can reverse the transformation.
Enable BigQuery dynamic data masking with a policy tag is not appropriate because masking occurs only at query time and does not transform the stored data. It also does not produce a derived token that can be used to join across datasets, and users with sufficient permissions may still access the raw values.
Encrypt emails with Cloud KMS and store the ciphertext is unsuitable because standard symmetric encryption in Cloud KMS is non-deterministic, so the same email will not yield the same ciphertext each time. That breaks joins and the ciphertext is not format preserving, which can also complicate schema design.
Compute a salted SHA256 hash during load is weak because email addresses have low entropy, which leaves the hashes susceptible to dictionary or brute force attacks, especially if a shared salt is needed to keep datasets joinable. It also provides no controlled reidentification path that many governance workflows require.
When a question asks to keep data joinable while hiding PII, prefer a deterministic pseudonymization method that can be reversed only by authorized processes. If an option uses masking it likely does not change stored values, and if it uses random encryption or hashing then joins will usually break or security may be weak.
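A hedged sketch of a de-identify configuration for this transformation follows. The field name, key ring path, and wrapped key bytes are placeholders, and because FPE FFX operates over a declared alphabet the exact configuration would need to be adapted to your email values.

```python
# Hedged sketch: deterministic FPE FFX transformation for the email field.
deidentify_config = {
    "record_transformations": {
        "field_transformations": [{
            "fields": [{"name": "email"}],
            "primitive_transformation": {
                "crypto_replace_ffx_fpe_config": {
                    "crypto_key": {
                        "kms_wrapped": {
                            "wrapped_key": b"...",  # data key wrapped with Cloud KMS
                            "crypto_key_name": (
                                "projects/my-project/locations/global/"
                                "keyRings/dlp/cryptoKeys/email-fpe"
                            ),
                        }
                    },
                    "common_alphabet": "ALPHA_NUMERIC",
                }
            },
        }]
    }
}
# Passed as deidentify_config in a dlp_v2 deidentify_content request whose item is
# a table containing the email column, run before the rows are loaded to BigQuery.
```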
Question 17
CedarPeak Logistics is standardizing analytics on BigQuery and wants to adopt ELT practices. The engineering team is fluent in SQL and wants a developer friendly workspace where they can write modular SQL, track changes with Git, declare dependencies between datasets, run tests, and schedule daily builds for about 180 tables in the analytics_ops dataset. Which Google Cloud service should they use to develop and manage these SQL transformation pipelines?
-
✓ C. Dataform
The correct option is Dataform.
Dataform provides a developer friendly SQL workspace inside BigQuery where engineers can write modular SQL in folders and files, declare dependencies between datasets, and build a lineage graph. It integrates with Git so teams can version control changes and use pull requests for reviews. It supports SQL based tests and assertions to validate data quality, and it can schedule daily runs that build a large number of models which fits the need to rebuild about 180 tables in the analytics_ops dataset.
Because Dataform is purpose built for ELT in BigQuery, it handles incremental models, environment configuration, and dependency ordering without requiring Python or external orchestration. This lets a SQL fluent team manage transformations cleanly and reliably while staying within the BigQuery ecosystem.
Cloud Composer with BigQuery operators is an orchestration service based on Airflow and it is well suited for complex cross service workflows, yet it does not provide a first class SQL development workspace, integrated Git version control, or built in SQL testing for dataset models. It adds orchestration overhead that is unnecessary for a primarily SQL based ELT workflow in BigQuery.
BigQuery scheduled queries can run SQL on a schedule, but they do not natively manage inter query dependencies across many models, do not offer a modular project structure with Git integration, and do not include built in testing. Coordinating 180 tables with ordered builds and lineage is difficult with this option.
Cloud Data Fusion focuses on visual ETL and integration pipelines and often leverages Dataflow or Dataproc for execution. It is not optimized for a SQL first ELT development experience in BigQuery and it lacks the native SQL project structure, dependency graph, Git workflows, and testing that the team requires.
When a scenario emphasizes a SQL first workflow with Git version control, dependency graphs, tests, and native scheduling inside BigQuery, choose Dataform. If the need is only orchestration across many services, think about Cloud Composer, and if it is only simple timing of a single query, think about scheduled queries.
Question 18
How should you design Cloud Storage and Dataproc to meet a 15 minute RPO, provide low read latency during normal operations, and ensure processing continues if us-east1 becomes unavailable?
-
✓ C. Dual region bucket us-east1 and us-central1 with turbo replication and local reads plus Dataproc failover
The correct answer is Dual region bucket us-east1 and us-central1 with turbo replication and local reads plus Dataproc failover.
This design satisfies a 15 minute recovery point objective because Turbo Replication provides an SLA that new objects are replicated to both regions within 15 minutes. With a dual region bucket that stores data in both us-east1 and us-central1, Dataproc jobs can read locally from the same region during normal operations, which keeps read latency low. If us-east1 becomes unavailable, you can fail over the Dataproc workload to us-central1 and continue processing with local reads against the same bucket.
This approach aligns storage and compute in each region so steady state reads stay local and disaster recovery is predictable because the replication window is governed by the Turbo Replication commitment.
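A minimal sketch of that bucket configuration with the Cloud Storage Python client follows, assuming a hypothetical bucket name and a recent client library version that supports custom dual-region placement.

```python
from google.cloud import storage
from google.cloud.storage.constants import RPO_ASYNC_TURBO

client = storage.Client()

# Hypothetical bucket name; the region pair matches the scenario.
bucket = client.create_bucket(
    "example-analytics-landing",
    location="US",
    data_locations=["US-EAST1", "US-CENTRAL1"],  # custom dual-region pair
)

# Turbo replication targets replication of new objects to both regions within 15 minutes.
bucket.rpo = RPO_ASYNC_TURBO
bucket.patch()

print(bucket.name, bucket.location, bucket.rpo)
```

The Dataproc side of the design is operational rather than code, which means keeping cluster templates or initialization actions ready so the same jobs can be launched in us-central1 against this bucket during a failover.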
US multi region Cloud Storage bucket and Dataproc in us-east1 with failover to us-central1 is not correct because multi region buckets do not support Turbo Replication, so you do not have an SLA that guarantees replication within 15 minutes. Without that commitment the 15 minute recovery point objective is not assured even though a compute failover is described.
Single region bucket in us-east1 with Object Versioning and Autoclass is not correct because a single region bucket cannot continue processing if us-east1 is unavailable. Versioning protects against overwrite and delete operations and Autoclass manages storage classes, but neither provides cross region replication or a disaster recovery capability.
Dual region bucket us-east1 and us-central1 but read through us-central1 during normal use is not correct because forcing reads through us-central1 increases latency during normal operations for workloads in us-east1 and it omits Turbo Replication and a clear Dataproc failover plan, so the 15 minute recovery point objective and continuity requirements are not met.
When a question specifies an RPO in minutes, look for features with an explicit replication commitment such as Turbo Replication for dual region Cloud Storage. Then co locate compute with data for local reads during steady state and plan a region level failover path for the workload.
Question 19
Your company runs a real time ticket resale platform where buyers often place identical bids for the same listing within a few hundred milliseconds and the requests hit different regional application servers. Each bid event includes a listing identifier, the bid amount, a user identifier, and the event timestamp. You need to combine these bid events into a single real time stream so you can determine which user placed the first bid for each listing. What should you do?
-
✓ B. Have each application server publish bid events to Cloud Pub/Sub as they happen and use Dataflow with a pull subscription to process the stream and assign the first bid per listing
The correct option is Have each application server publish bid events to Cloud Pub/Sub as they happen and use Dataflow with a pull subscription to process the stream and assign the first bid per listing.
This approach creates a single global stream with low latency and high throughput, which is ideal when identical bids can arrive within milliseconds from different regions. A streaming pipeline can key events by listing identifier and use event timestamps, windowing, and stateful processing to select the earliest bid while handling out-of-order arrivals and late data. Using a pull subscription lets the workers control flow and scale horizontally so the system remains reliable under spikes.
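A simplified version of that pipeline could look like the following Apache Beam sketch. It assumes a hypothetical subscription path and JSON bid events carrying listing_id, user_id, bid_amount, and event_timestamp fields, and it takes the minimum tuple per listing within short fixed windows.

```python
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.transforms import window

# Hypothetical subscription; each Pub/Sub message is a JSON-encoded bid event.
SUBSCRIPTION = "projects/example-project/subscriptions/bid-events"


def parse_bid(message: bytes):
    bid = json.loads(message.decode("utf-8"))
    # Key by listing and order the value tuple so the earliest timestamp compares smallest.
    return bid["listing_id"], (bid["event_timestamp"], bid["user_id"], bid["bid_amount"])


options = PipelineOptions(streaming=True)
with beam.Pipeline(options=options) as pipeline:
    (
        pipeline
        | "ReadBids" >> beam.io.ReadFromPubSub(subscription=SUBSCRIPTION)
        | "Parse" >> beam.Map(parse_bid)
        | "Window" >> beam.WindowInto(window.FixedWindows(5))
        | "EarliestBidPerListing" >> beam.CombinePerKey(min)  # earliest event_timestamp wins
        | "Print" >> beam.Map(print)
    )
```

A production pipeline would typically add explicit event-time timestamps, allowed lateness, and a sink such as BigQuery or Bigtable, but the keying and combine pattern stays the same.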
Store bid events in a local MySQL instance on every application server and periodically merge them into a central MySQL database is incorrect because it introduces merge delays and conflicts, cannot reliably order near-simultaneous events across regions in real time, and does not provide streaming semantics or scalable processing for deduplication and first-event selection.
Write all bid events from the application servers to a shared network file system and then run an Apache Hadoop batch job to find the earliest bid is incorrect because it is a batch solution that adds significant latency, which prevents real time determination of the first bid, and it does not handle continuous streaming with event-time ordering.
Push bid events from Cloud Pub/Sub to a custom HTTPS endpoint that writes each event into Cloud SQL is incorrect because it turns the problem into high-QPS transactional writes without stream processing semantics, risks duplicates due to at-least-once delivery, and lacks native event-time and windowing capabilities needed to consistently pick the first bid at scale.
When you see requirements for real time processing across regions and handling out-of-order events, map them to Pub/Sub for ingestion and Dataflow for streaming with event time, windowing, and state. Prefer managed streaming services over databases or batch jobs for low latency ordering and deduplication.
Question 20
A training job on Vertex AI takes about 60 hours. You cannot change the compute resources, adjust the batch size, or use distributed training. Which change would most directly reduce the total training time?
-
✓ B. Use a smaller training dataset
The correct option is Use a smaller training dataset because with fixed compute, fixed batch size, and no distributed training, the most direct way to reduce wall clock time is to reduce the number of examples the training job must process.
Reducing the dataset size lowers the number of steps per epoch and the total amount of I/O and compute performed. This shortens each epoch and the overall run time. There is a trade-off with model quality, yet it is the only option presented that directly cuts the work the trainer must do under the stated constraints.
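As a rough back of the envelope illustration with made up numbers, the sketch below shows how halving the number of examples roughly halves the steps per epoch and therefore the wall clock time when batch size and step time are fixed.

```python
import math

# Illustrative numbers only; real step time depends on the model and hardware.
SECONDS_PER_STEP = 0.9
BATCH_SIZE = 128
EPOCHS = 20


def estimated_hours(num_examples: int) -> float:
    steps_per_epoch = math.ceil(num_examples / BATCH_SIZE)
    return steps_per_epoch * EPOCHS * SECONDS_PER_STEP / 3600


print(round(estimated_hours(30_000_000), 1))  # full dataset
print(round(estimated_hours(15_000_000), 1))  # half the examples, roughly half the time
```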
Enable Vertex AI hyperparameter tuning is not correct because tuning runs many training trials to search the space of parameters, which typically increases total elapsed time for the experiment and does not shorten any single training run.
Use Vertex AI Pipelines is not correct because pipelines orchestrate steps and dependencies but they do not make the underlying training step run faster. The training component will take the same time given the same resources and data.
When compute and batch size are fixed, look for levers that reduce the amount of work like fewer steps or fewer examples. Be cautious of options that add orchestration or experiments since they usually increase overall time rather than shorten a single run.
Question 21
Riverview Savings Cooperative operates branches across three EU regions and needs a storage platform for real time account ledger updates that requires full ACID properties and standard SQL access, and it must scale horizontally while preserving strong transactional consistency. Which Google Cloud solution should you implement?
-
✓ C. Cloud Spanner with locking read write transactions
The correct option is Cloud Spanner with locking read write transactions.
Cloud Spanner with locking read write transactions provides full ACID guarantees with externally consistent transactions and standard SQL while scaling horizontally. You can deploy multi region instances in the EU that preserve strong transactional consistency for both reads and writes which is essential for real time ledger updates. Its locking read write transactions offer serializable isolation so concurrent updates remain correct as you scale.
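A minimal sketch of such a ledger update with the Cloud Spanner Python client follows; the instance, database, table, and column names are hypothetical.

```python
from google.cloud import spanner

client = spanner.Client()
# Hypothetical instance, database, table, and column names.
database = client.instance("ledger-instance").database("ledger-db")


def post_ledger_entry(transaction, account_id: str, amount_cents: int) -> None:
    # Both statements commit atomically. Spanner acquires locks, retries the whole
    # function on transient aborts, and provides serializable isolation.
    transaction.execute_update(
        "UPDATE Accounts SET BalanceCents = BalanceCents + @amount WHERE AccountId = @id",
        params={"amount": amount_cents, "id": account_id},
        param_types={"amount": spanner.param_types.INT64, "id": spanner.param_types.STRING},
    )
    transaction.execute_update(
        "INSERT INTO LedgerEntries (EntryId, AccountId, AmountCents) "
        "VALUES (GENERATE_UUID(), @id, @amount)",
        params={"amount": amount_cents, "id": account_id},
        param_types={"amount": spanner.param_types.INT64, "id": spanner.param_types.STRING},
    )


# Runs as a locking read write transaction and commits both changes or neither.
database.run_in_transaction(post_ledger_entry, "acct-1001", 2500)
```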
AlloyDB for PostgreSQL with cross region read replicas does not support horizontally scalable multi region writes with strong consistency. Cross region replicas are read only and are designed for read scalability and disaster recovery which means they cannot meet the requirement for strongly consistent transactional updates across regions.
Cloud SQL with federated queries from BigQuery targets analytics use cases where BigQuery can read from Cloud SQL. This pairing does not add horizontal write scalability or strong multi region transactional consistency and Cloud SQL primarily scales vertically which makes it unsuitable for a high scale ledger.
Cloud Spanner with stale reads enabled intentionally relaxes consistency by allowing reads at a past timestamp which improves latency for some workloads but does not satisfy the need for real time strong consistency in a ledger system.
When you see ACID with horizontal scale and strong consistency across regions, map the requirement to Cloud Spanner read write transactions. Be cautious with replicas or features like stale reads since they often trade freshness for latency.
Question 22
A set of nightly Hadoop MapReduce jobs running on Dataproc now processes 15 TB of data and is missing the SLA. You need to improve performance without increasing costs. What should you do?
-
✓ B. Use Spark on the existing Dataproc cluster with YARN
The correct option is Use Spark on the existing Dataproc cluster with YARN.
This choice replaces disk heavy MapReduce with an in memory execution engine that reduces I/O and accelerates shuffles and aggregations. It runs on the current Dataproc cluster with the same resources, so you do not add nodes or larger machine types, and you keep cost flat while improving runtime to help meet the SLA.
It also minimizes migration effort because both engines are available on Dataproc. You can adjust job submission and code with relatively small changes rather than performing a full platform rewrite.
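If the jobs can be expressed with Spark DataFrames, a nightly aggregation might be rewritten along the lines of the sketch below, using hypothetical bucket paths and column names, and submitted to the existing cluster as a PySpark job, for example with gcloud dataproc jobs submit pyspark.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Hypothetical bucket paths and column names for the nightly batch.
spark = SparkSession.builder.appName("nightly-shipment-rollup").getOrCreate()

events = spark.read.parquet("gs://example-bucket/raw/shipments/")

daily = (
    events
    .groupBy("route_id", F.to_date("event_time").alias("event_date"))
    .agg(
        F.count("*").alias("event_count"),
        F.sum("weight_kg").alias("total_weight_kg"),
    )
)

# Spark keeps intermediate results in memory where possible, which avoids the
# repeated disk materialization MapReduce performs between stages.
daily.write.mode("overwrite").parquet("gs://example-bucket/curated/daily_rollup/")
```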
Move the batch to BigQuery scheduled queries is not appropriate because it requires rewriting the workload in SQL and moving data into BigQuery. Query pricing is based on the data processed, which can increase cost, and the work runs outside your Dataproc cluster, so it does not meet the constraint of improving performance without higher cost.
Rewrite pipelines for Dataflow with Apache Beam would require a significant migration and validation effort and introduces a different pricing model. This is unlikely to help you meet the immediate nightly SLA and it can increase cost during and after the transition.
Increase Dataproc worker count and size would likely speed up the jobs but it directly raises cluster cost which violates the requirement to achieve faster results without higher cost.
When you see a requirement for faster results without higher cost first consider engine or configuration improvements that use the same infrastructure. Prefer in memory processing and better resource utilization before scaling hardware or migrating platforms.
Question 23
BlueTrail Logistics operates a fleet of autonomous forklifts that send telemetry and environmental readings to an analytics platform for early fault detection. The devices currently post to a custom REST endpoint at about 50 thousand small events per minute, which causes the service to fall behind during bursts and drop some data. The machine learning team asks you to redesign ingestion so that traffic spikes are absorbed and downstream processing stays reliable without losing messages. What should you do?
-
✓ C. Publish device messages to a Cloud Pub/Sub topic and let the ingestion service pull from a subscription
The correct option is Publish device messages to a Cloud Pub/Sub topic and let the ingestion service pull from a subscription.
Cloud Pub/Sub decouples producers from consumers and provides durable buffering with elastic throughput so it easily absorbs traffic spikes. Producers can publish quickly and acknowledgements ensure at least once delivery. Consumers pull with flow control and can scale horizontally, so downstream processing stays reliable at its own pace. Message retention and automatic redelivery protect against transient failures, so you avoid losing messages while keeping latency low.
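A minimal pull consumer with flow control might look like the sketch below; the project, subscription, and handler are hypothetical placeholders.

```python
from concurrent.futures import TimeoutError

from google.cloud import pubsub_v1

# Hypothetical project and subscription names.
subscriber = pubsub_v1.SubscriberClient()
subscription_path = subscriber.subscription_path("example-project", "forklift-telemetry-sub")


def handle_event(payload: bytes) -> None:
    # Placeholder for the real fault detection preprocessing.
    print(f"received {len(payload)} bytes")


def callback(message: pubsub_v1.subscriber.message.Message) -> None:
    handle_event(message.data)
    message.ack()  # acknowledge only after the event has been handled


# Flow control caps outstanding messages so bursts buffer in Pub/Sub, not in the consumer.
flow_control = pubsub_v1.types.FlowControl(max_messages=500)
streaming_pull_future = subscriber.subscribe(
    subscription_path, callback=callback, flow_control=flow_control
)

with subscriber:
    try:
        streaming_pull_future.result(timeout=300)
    except TimeoutError:
        streaming_pull_future.cancel()
        streaming_pull_future.result()
```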
Write events using BigQuery streaming inserts and have analytics jobs read the destination table is not a message ingestion solution. Streaming inserts are subject to quotas and throttling during bursts which can lead to failed inserts that you must retry yourself. This approach does not provide durable queuing or backpressure management and it couples producers to warehouse availability.
Have devices upload data to a Cloud Storage bucket and let the ingestion job process the new objects is better suited to batch files than to many small, high frequency events. Writing one object per event creates high overhead and can run into request rate limits and inefficient object listing. Notifications and object processing introduce latency and there is no native retry and redelivery model for individual messages.
Send data to a Cloud SQL for PostgreSQL instance and have the pipeline query the table for new rows uses a relational database as a queue which is an anti pattern. It does not scale well for bursty high write rates, adds polling complexity and contention, and lacks durable queue semantics like retention, acknowledgements, and redelivery.
When you see requirements to absorb bursts and avoid message loss think about decoupling producers from consumers with a managed, durable messaging service. Match cues like at least once delivery, retention, and independent scaling to Cloud Pub/Sub.
Question 24
Before building BigQuery pipelines, how should an organization quantify data quality issues across 60 days of extracted datasets?
-
✓ C. Profile the extracts and run a data quality assessment
The correct option is Profile the extracts and run a data quality assessment. This approach quantifies data quality across the 60 days of extracts before any pipeline design and it produces concrete metrics on completeness, validity, uniqueness and similar dimensions. It lets you establish a baseline and trends so you can address issues early and avoid rework in BigQuery.
By profiling and assessing quality on the extracts, you compute statistics such as null counts, distinct counts, value distributions, and pattern conformance, and you can evaluate rules against expected ranges or formats. This yields measurable results that support planning and remediation prior to ingestion and aligns with best practices for building reliable pipelines.
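For instance, a lightweight profiling pass over local copies of the extracts can be sketched with pandas as shown below, where the file pattern and the order_total validity rule are hypothetical.

```python
import glob

import pandas as pd

# Hypothetical local copies of the 60 days of extracts.
frames = [pd.read_csv(path) for path in glob.glob("extracts/orders_*.csv")]
df = pd.concat(frames, ignore_index=True)

profile = pd.DataFrame({
    "dtype": df.dtypes.astype(str),
    "null_pct": (df.isna().mean() * 100).round(2),
    "distinct_count": df.nunique(),
})

duplicate_rows = int(df.duplicated().sum())
negative_totals = int((df["order_total"] < 0).sum())  # example validity rule on a hypothetical column

print(profile)
print(f"duplicate rows: {duplicate_rows}, negative totals: {negative_totals}")
```

The same kinds of metrics can be produced by dedicated profiling tooling, but the point is that they are measured on the extracts before any pipeline is built.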
Load into BigQuery and rely on load errors is not appropriate because load errors mainly surface schema or type mismatches and they do not measure core quality dimensions like duplicates or out of range values. It also requires loading data before you understand its fitness which can waste time and cost.
Use Cloud DLP to scan the extracts focuses on discovering and classifying sensitive information rather than measuring general data quality. It will not quantify null rates, duplicates, referential integrity or validity checks that are needed for a quality assessment.
Configure Dataplex data quality rules after loading happens too late for this requirement since you need to quantify issues in the extracts before building pipelines. Post ingestion rules help enforce quality in production but they do not provide the upfront assessment across the historical extracts you were asked to measure.
First decide whether the question asks for assessment before ingestion or after. If you must quantify issues across source extracts ahead of design then choose profiling and a dedicated data quality assessment rather than relying on load errors or classification tools.
Question 25
NovaRetail operates a Standard Tier Memorystore for Redis instance in its production project and needs to conduct a quarterly disaster recovery drill that faithfully reproduces a Redis failover while ensuring production data remains untouched. What should the team do?
-
✓ B. Create a Standard Tier Memorystore for Redis instance in a staging project and run a manual failover with the force-data-loss protection mode
The correct choice is Create a Standard Tier Memorystore for Redis instance in a staging project and run a manual failover with the force-data-loss protection mode.
This approach isolates the drill from production so the team can validate application behavior during a real failover while keeping production data safe. The staging instance mirrors the production topology, and a manual failover that promotes the replica even when it is not fully synchronized best reproduces the abrupt primary loss that can occur during real incidents.
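A sketch of triggering such a drill with the google-cloud-redis client library is shown below, assuming a hypothetical staging project, region, and instance name; the exact request types can vary by client library version, and the same operation can also be run from the gcloud CLI.

```python
from google.cloud import redis_v1

# Hypothetical staging project, region, and instance name.
INSTANCE_NAME = "projects/staging-project/locations/us-central1/instances/dr-drill-redis"

client = redis_v1.CloudRedisClient()

request = redis_v1.FailoverInstanceRequest(
    name=INSTANCE_NAME,
    # FORCE_DATA_LOSS promotes the replica immediately, mimicking an abrupt primary failure.
    data_protection_mode=redis_v1.FailoverInstanceRequest.DataProtectionMode.FORCE_DATA_LOSS,
)

operation = client.failover_instance(request=request)
operation.result()  # block until the failover drill completes
print("Failover drill finished for", INSTANCE_NAME)
```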
Initiate a manual failover using the limited-data-loss protection mode on the production Memorystore for Redis instance is wrong because it directly affects production and violates the requirement to keep production data untouched. Even with a protection mode that tries to minimize data loss the operation still introduces disruption to the live environment.
Add an additional replica to the production Redis instance and then perform a manual failover with the force-data-loss protection mode is incorrect because Standard Tier supports one primary with a single replica and you cannot add extra replicas. It also impacts production which the scenario forbids.
Create a Standard Tier Memorystore for Redis instance in a staging project and perform a manual failover with the limited-data-loss protection mode is not the best answer because it does not faithfully simulate a sudden primary failure. The protection mode attempts to avoid data loss which reduces fidelity for disaster recovery drills that should validate behavior under worst case conditions.
When a question asks to test failover without touching production look for isolation such as a staging project and then choose the mode that mirrors a real outage. In Redis Standard Tier that usually means preferring a setting that simulates an abrupt primary loss rather than one that minimizes data loss.
Jira, Scrum & AI Certification |
---|
Want to get certified on the most popular software development technologies of the day? These resources will help you get Jira certified, Scrum certified and even AI Practitioner certified so your resume really stands out.
You can even get certified in the latest AI, ML and DevOps technologies. Advance your career today. |
Cameron McKenzie is an AWS Certified AI Practitioner, Machine Learning Engineer, Copilot Expert, Solutions Architect and author of many popular books in the software development and Cloud Computing space. His growing YouTube channel training devs in Java, Spring, AI and ML has well over 30,000 subscribers.