GCP Certified Data Practitioner Sample Questions

Free Google Cloud Data Associate Exam Topics Test

The Google Cloud Data Practitioner Associate certification validates your ability to manage, prepare, and analyze data using Google Cloud tools and services. It focuses on core areas such as BigQuery, Cloud Storage, data pipelines, ETL workflows, and managing secure, scalable datasets that support modern analytics and AI initiatives.

To prepare effectively, start with the GCP Associate Data Practitioner Practice Questions. These questions mirror the tone, logic, and structure of the real certification exam and help you become familiar with Google’s testing format. You can also explore Real GCP Certified Data Practitioner Exam Questions to test yourself against authentic, scenario-based challenges that simulate real data management tasks on the Google Cloud Platform.

For focused study, review GCP Associate Data Practitioner Sample Questions that cover topics such as IAM configuration, dataset permissions, SQL queries in BigQuery, and troubleshooting common issues in Cloud Storage.

Google Certified Data Practitioner Exam Simulator

Each section of the GCP Certified Associate Data Practitioner Questions and Answers collection is designed to teach as well as test. These materials reinforce critical Google Cloud data concepts and provide explanations to help you understand why certain responses are correct, preparing you to think like a real data associate.

For full readiness, use the Google Certified Data Practitioner Exam Simulator and take full-length Google Certified Associate Data Practitioner Practice Tests. These simulations reproduce the pacing and structure of the actual exam, helping you manage time effectively and gain confidence in a real testing environment.

If you prefer focused sessions, try the Google Certified Data Practitioner Exam Dump and Associate Data Practitioner Braindump collections. These organize questions by domain such as data ingestion, transformation, and access control, allowing you to strengthen specific knowledge areas.

Google Cloud Certification Practice Exams

Working through these Google Certified Data Practitioner Exam Questions will build the analytical and technical skills required to manage data efficiently across the Google Cloud ecosystem. By mastering these exercises, you’ll be prepared to implement secure, optimized data workflows and support data-driven decision-making in your organization.

Start today with the GCP Associate Data Practitioner Practice Questions. Train using the Exam Simulator and measure your progress with complete practice tests. Prepare to earn your certification and advance your career as a Google Cloud Data Associate.

GCP Data Practitioner Associate Sample Questions

Question 1

Aurora Mutual is building an analytics workspace in BigQuery for a dataset named cust360 in the US region. Only a small group of approved users may view columns that contain PII, yet most analysts must still run queries that return trends and aggregated metrics without exposing the raw PII fields. Using Google Cloud native capabilities, what should the data team implement?

  • ❏ A. Place PII in a separate BigQuery table with restricted access and join only when necessary

  • ❏ B. Use Cloud DLP to automatically redact PII from all BigQuery query results

  • ❏ C. Apply BigQuery column level security with policy tags and publish authorized views for analysts

  • ❏ D. Configure BigQuery row level access policies to prevent exposure of PII columns

Question 2

Which approach trains product recommendations directly in BigQuery to keep processing close to the data?

  • ❏ A. Vertex AI Matching Engine

  • ❏ B. Build a custom TensorFlow recommender on Vertex AI

  • ❏ C. BigQuery ML matrix factorization for recommendations

  • ❏ D. Dataproc Spark MLlib ALS

Question 3

The analytics team at example.com needs to load Google Analytics data into BigQuery automatically every 24 hours for leadership dashboards, and they want a managed schedule that runs without manual steps once configured. Which service should they use to set up this recurring import?

  • ❏ A. Cloud Dataflow

  • ❏ B. BigQuery Data Transfer Service

  • ❏ C. Cloud Composer

  • ❏ D. gcloud CLI

Question 4

In Cloud Storage, how can you make about 45 images publicly accessible while keeping all other objects private?

  • ❏ A. Turn on “Uniform bucket-level access” and grant the “Viewer” role to “allAuthenticatedUsers” on the bucket

  • ❏ B. Keep bucket private with “Fine-grained access control” and grant “allUsers” Reader on only the required objects

  • ❏ C. Enable “Public Access Prevention” and add object ACLs for anonymous read

  • ❏ D. Keep “Uniform bucket-level access” enabled and distribute signed URLs for the selected images

Question 5

At Crestline Retail you are studying purchasing patterns in BigQuery. The table named retail_ledger contains the columns shopper_id, transaction_date, item_id and spend_amount. You need to find the total amount each shopper spent during the past nine months and keep only those shoppers whose combined spend is greater than 1500 dollars. Which SQL statement should you run in BigQuery to achieve this?

  • ❏ A. SELECT shopper_id, SUM(spend_amount) AS total_spent FROM retail_ledger WHERE transaction_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 9 MONTH) GROUP BY item_id HAVING total_spent > 1500

  • ❏ B. SELECT shopper_id, AVG(spend_amount) AS total_spent FROM retail_ledger WHERE transaction_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 9 MONTH) GROUP BY shopper_id HAVING total_spent > 1500

  • ❏ C. SELECT shopper_id, SUM(spend_amount) AS total_spent FROM retail_ledger WHERE transaction_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 9 MONTH) GROUP BY shopper_id HAVING total_spent > 1500

  • ❏ D. SELECT shopper_id, SUM(spend_amount) AS total_spent FROM retail_ledger WHERE transaction_date < DATE_ADD(CURRENT_DATE(), INTERVAL 9 MONTH) GROUP BY shopper_id HAVING total_spent > 1500

Question 6

How should you load only new or changed rows from an on premises Oracle 12c database into BigQuery every 30 minutes?

  • ❏ A. Database Migration Service to Cloud SQL then query to BigQuery

  • ❏ B. Dataflow JDBC pipeline with incremental loads every 30 minutes

  • ❏ C. Cloud Functions with JDBC to BigQuery

  • ❏ D. Dataproc with Sqoop then load to BigQuery

Question 7

Rivermark Apparel receives CSV files in a Cloud Storage bucket every 24 hours and must validate, transform, and load the data into BigQuery as a repeatable pipeline. What Google Cloud service setup should they choose to coordinate the end to end workflow?

  • ❏ A. Cloud Run jobs scheduled with Cloud Scheduler

  • ❏ B. Cloud Functions for validation with Dataflow for transformation and BigQuery for loading

  • ❏ C. Cloud Composer to orchestrate validation, transformation, and loading across Cloud Storage and BigQuery

  • ❏ D. Dataproc for validation and transformation with BigQuery Data Transfer Service to load the results

Question 8

Which Google Cloud data store best supports high throughput time series telemetry that will reach five billion rows in three months and is queried by device ID and contiguous time ranges?

  • ❏ A. Cloud Spanner

  • ❏ B. Bigtable

  • ❏ C. Cloud SQL

  • ❏ D. BigQuery

Question 9

Blue Harbor Logistics needs a real time telemetry pipeline for refrigerated trucks that send JSON readings for temperature and GPS at about 20,000 messages per minute. The events must be transformed into a consistent schema and streamed into BigQuery to power live dashboards. The platform should auto scale, keep operational work low, and rely on managed Google Cloud services. What should you implement?

  • ❏ A. Run a container on Cloud Run that polls device endpoints every 30 seconds, cleans the payloads, and inserts rows into BigQuery with the BigQuery API

  • ❏ B. Send incoming events to Cloud Storage and trigger Cloud Functions to validate and transform files before loading them into BigQuery in small batches

  • ❏ C. Build a streaming Dataflow pipeline that reads from Pub/Sub, applies the required transformations, and writes to BigQuery using the Storage Write API

  • ❏ D. Configure a Pub/Sub BigQuery subscription that writes raw messages into a BigQuery table without transformations

Question 10

In BigQuery ML time series forecasting for the next 12 weeks, which metric best measures how close predictions are to actual values?

  • ❏ A. R squared

  • ❏ B. Mean absolute percentage error

  • ❏ C. Root mean squared error

  • ❏ D. ROC AUC

Question 11

Riverton Apparel captures web and app event logs from phones, tablets and laptops and stores them in BigQuery with 18 months of history. Analysts routinely build audience segments using metrics such as session length, bounce rate and navigation paths. You need to calculate these measures quickly and also reduce query costs for repeated analyses. Which BigQuery capability should you adopt to boost performance and simplify these behavioral calculations?

  • ❏ A. Materialized views

  • ❏ B. Federated queries

  • ❏ C. Partitioned tables with clustering

  • ❏ D. User-defined functions

Question 12

Which Google Cloud service supports bandwidth throttling during business hours and provides detailed transfer logs when migrating 150 TB from on premises NAS to Cloud Storage?

  • ❏ A. gcloud storage CLI

  • ❏ B. Google Cloud Storage Transfer Service

  • ❏ C. BigQuery Data Transfer Service

Question 13

The analytics team at mcnz.com uses Looker and needs a reusable field that calculates profit margin for each SKU, defined as (sales_revenue minus product_cost) divided by sales_revenue, so the same logic is available in multiple Explores and dashboards. Which LookML construct should you define in the model to implement this calculation once and make it widely available?

  • ❏ A. view

  • ❏ B. measure

  • ❏ C. dimension

  • ❏ D. derived table

Question 14

Which Google Cloud service offers the lowest cost managed storage for rarely accessed audit logs that must be retained for 15 years?

  • ❏ A. BigQuery

  • ❏ B. Cloud Storage Archive tier

  • ❏ C. Cloud Storage Nearline

  • ❏ D. Cloud Storage Coldline

Question 15

A data analyst at Solstice Outfitters has about 48 CSV files with customer profiles stored in a Cloud Storage bucket. Before building downstream ingestion into BigQuery the analyst wants a fast way to reveal data quality problems such as missing values and inconsistent data types. What should the analyst do first?

  • ❏ A. Use Pub/Sub to stream the CSV files into BigQuery for real time validation

  • ❏ B. Load a representative sample into a temporary BigQuery table and run SQL to check for nulls, type mismatches, and other anomalies

  • ❏ C. Convert the CSV files to JSON with Dataflow before loading into BigQuery

  • ❏ D. Infer schemas on a small sample with Vertex AI Workbench to detect inconsistencies

Question 16

Which managed GCP service should you use to schedule a nightly Dataproc PySpark job at 0245 UTC that reads from Cloud Storage and writes results to BigQuery table foo.region_sales?

  • ❏ A. Cloud Scheduler

  • ❏ B. Dataproc Workflow Templates

  • ❏ C. Cloud Composer

  • ❏ D. BigQuery Scheduled Queries

Question 17

A regional credit union stores daily encrypted database snapshots in a Cloud Storage bucket for regulatory compliance and disaster recovery. Each snapshot is kept for five years, and the data is rarely accessed except during audits or when restoring systems. The organization wants to minimize storage costs while keeping the data durable and retrievable when needed. Which lifecycle policy should you configure?

  • ❏ A. Store backups in Nearline storage and delete them after five years

  • ❏ B. Keep backups in Standard storage for the first one year then transition to Archive and delete them after five years

  • ❏ C. Place backups in Archive storage and delete them after five years

  • ❏ D. Store backups in Coldline storage and delete them after five years

Question 18

Which Google Cloud service should you use to store and analyze 30 TB of tabular data using standard SQL for dashboards and ad hoc queries with daily ingestion?

  • ❏ A. Cloud SQL

  • ❏ B. Dataproc

  • ❏ C. BigQuery

  • ❏ D. Cloud Spanner

Question 19

At BlueRiver Freight, the analytics team discovers that the BigQuery table shipping_contacts contains inconsistent address abbreviations such as “Rd” and “Road” and many values include leading, trailing, or repeated internal spaces. You want to standardize the address text and remove superfluous whitespace directly in BigQuery SQL without moving the data to another service. Which approach should you take?

  • ❏ A. Use REPLACE to enforce standard abbreviations and LTRIM and RTRIM to strip spaces

  • ❏ B. Cloud Dataflow with Apache Beam

  • ❏ C. Use REGEXP_REPLACE to normalize address patterns and TRIM to strip unnecessary whitespace

  • ❏ D. Export to Cloud Data Fusion for transformations and reload the dataset into BigQuery

Question 20

How should you minimize cost for BigQuery data that is heavily queried for 6 months and must remain available for 10 years for audits?

  • ❏ A. Use BigQuery long term storage after 120 days and delete after 10 years

  • ❏ B. Partition by date and export older than 6 months to Cloud Storage Archive with a 10 year delete rule

  • ❏ C. Export older data to Cloud Storage Nearline with a 10 year delete rule

Google Cloud Data Practitioner Sample Questions Answered

Question 1

Aurora Mutual is building an analytics workspace in BigQuery for a dataset named cust360 in the US region. Only a small group of approved users may view columns that contain PII, yet most analysts must still run queries that return trends and aggregated metrics without exposing the raw PII fields. Using Google Cloud native capabilities, what should the data team implement?

  • ✓ C. Apply BigQuery column level security with policy tags and publish authorized views for analysts

The correct option is Apply BigQuery column level security with policy tags and publish authorized views for analysts.

BigQuery column level security with policy tags lets you tag PII columns and restrict them to a small approved group. Users who are not granted access through the policy tag cannot select those sensitive columns, which enforces least privilege at the column level. You can then create and share authorized views that return only non sensitive fields or aggregated results so analysts can explore trends without ever reading the raw PII.
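
As an illustration only, a minimal sketch of the pattern might look like the following, where the analytics_views dataset, the customers table inside cust360, and the region and signup_date columns are assumed names rather than anything given in the question.

  CREATE OR REPLACE VIEW `analytics_views.cust360_trends` AS
  SELECT
    region,                                   -- non sensitive attribute
    DATE_TRUNC(signup_date, MONTH) AS month,  -- aggregated trend dimension, no raw PII
    COUNT(*) AS customer_count
  FROM `cust360.customers`
  GROUP BY region, month;
  -- The analytics_views dataset is then added as an authorized dataset, or the view as an
  -- authorized view, on cust360 so analysts can query it without direct table access,
  -- while policy tags keep the PII columns limited to the approved group.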

Place PII in a separate BigQuery table with restricted access and join only when necessary relies on modeling and process rather than enforceable column permissions. It complicates analysis with extra joins, still risks exposure whenever the restricted table is joined in or a query is misconfigured, and it does not provide fine grained column protection inside a single table.

Use Cloud DLP to automatically redact PII from all BigQuery query results is not how DLP works. Cloud DLP can inspect and de identify data through jobs or transformations, yet it does not transparently intercept and redact every BigQuery query result for all workloads.

Configure BigQuery row level access policies to prevent exposure of PII columns filters rows and does not hide or protect specific columns. Row level policies cannot satisfy the requirement to restrict access to PII fields while still allowing broad access to non sensitive columns.

Exam Tip

When the requirement is to protect specific fields yet still enable wide analytical access, look for column level controls with policy tags and the use of authorized views. If an option claims to automatically redact query results for every query it is likely not a native BigQuery capability.

Question 2

Which approach trains product recommendations directly in BigQuery to keep processing close to the data?

  • ✓ C. BigQuery ML matrix factorization for recommendations

The correct option is BigQuery ML matrix factorization for recommendations.

BigQuery ML supports a matrix factorization model that is purpose built for recommendation use cases and is trained with standard SQL directly inside BigQuery. This keeps computation close to the data which reduces data movement and operational overhead while leveraging BigQuery scale for large interaction datasets.
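
As a hedged sketch rather than a prescribed solution, training such a model might look like the following, where the retail dataset, the orders table, and the purchase_count signal are assumed names built from the kind of interaction data the question describes.

  CREATE OR REPLACE MODEL `retail.product_recs`
  OPTIONS (
    model_type = 'matrix_factorization',
    feedback_type = 'implicit',      -- purchase counts rather than explicit star ratings
    user_col = 'shopper_id',
    item_col = 'item_id',
    rating_col = 'purchase_count'
  ) AS
  SELECT shopper_id, item_id, COUNT(*) AS purchase_count
  FROM `retail.orders`
  GROUP BY shopper_id, item_id;
  -- ML.RECOMMEND(MODEL `retail.product_recs`) then returns predicted scores per shopper and item.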

Vertex AI Matching Engine focuses on vector similarity search over embeddings and it does not train collaborative filtering recommendation models. It also runs outside BigQuery which means it does not meet the requirement to train directly where the data resides.

Build a custom TensorFlow recommender on Vertex AI can work for recommendations yet training would occur on Vertex AI infrastructure after exporting or streaming data from BigQuery. That approach offers flexibility but it does not keep processing inside BigQuery.

Dataproc Spark MLlib ALS trains with ALS on Spark clusters that you manage on Dataproc and it requires moving data or setting up connectors from BigQuery. This is not training directly in BigQuery and adds cluster management overhead.

Exam Tip

When a question emphasizes keeping training in BigQuery with SQL and minimizing data movement, look for BigQuery ML options rather than general Vertex AI or Dataproc choices.

Question 3

The analytics team at example.com needs to load Google Analytics data into BigQuery automatically every 24 hours for leadership dashboards, and they want a managed schedule that runs without manual steps once configured. Which service should they use to set up this recurring import?

  • ✓ B. BigQuery Data Transfer Service

The correct option is BigQuery Data Transfer Service.

BigQuery Data Transfer Service provides a managed way to schedule recurring imports of supported SaaS data into BigQuery. It can be configured to run every 24 hours and then it automatically executes without manual steps. It also offers built in monitoring and retries which suits recurring analytics workloads and dashboard refreshes.

Cloud Dataflow is a stream and batch processing service that would require you to build and operate a custom pipeline and a separate scheduler. This is unnecessary when a managed transfer service already supports the use case.

Cloud Composer is a managed orchestration service that runs Airflow and it is better for complex workflows. It does not provide a native Google Analytics to BigQuery connector and would add operational overhead for a simple daily import.

gcloud CLI is a command line tool and not a managed scheduler. You would need to script and trigger jobs yourself which contradicts the requirement for a fully managed schedule without manual steps.

Exam Tip

When a question asks for a managed recurring import from a Google or SaaS source into BigQuery, map it to the BigQuery Data Transfer Service rather than building pipelines or adding orchestration.

Question 4

In Cloud Storage, how can you make about 45 images publicly accessible while keeping all other objects private?

  • ✓ B. Keep bucket private with “Fine-grained access control” and grant “allUsers” Reader on only the required objects

The correct choice is Keep bucket private with “Fine-grained access control” and grant “allUsers” Reader on only the required objects. This allows only the selected images to be publicly accessible while every other object remains private.

Using fine grained access control lets you apply object ACLs to just those 45 images. Grant the allUsers entity the Reader permission on those objects and leave the rest unchanged. The bucket itself stays private and there is no public access to objects that you did not modify.

Turn on “Uniform bucket-level access” and grant the “Viewer” role to “allAuthenticatedUsers” on the bucket would expose the entire bucket to anyone with a Google account and not just your organization. Uniform bucket level access also removes object ACLs, so you cannot selectively publish only certain objects.

Enable “Public Access Prevention” and add object ACLs for anonymous read cannot work because Public Access Prevention blocks granting public access through both IAM and ACLs, so anonymous reads would be rejected.

Keep “Uniform bucket-level access” enabled and distribute signed URLs for the selected images is not appropriate when you want objects to be publicly accessible in an ongoing way. Signed URLs are meant for temporary or controlled sharing and require managing expirations and links rather than making the objects persistently public.

Exam Tip

When only a subset of objects must be public, think fine grained object ACLs with allUsers. Uniform bucket level access removes object ACLs and applies IAM at the bucket level, and Public Access Prevention blocks any public sharing.

Question 5

At Crestline Retail you are studying purchasing patterns in BigQuery. The table named retail_ledger contains the columns shopper_id, transaction_date, item_id and spend_amount. You need to find the total amount each shopper spent during the past nine months and keep only those shoppers whose combined spend is greater than 1500 dollars. Which SQL statement should you run in BigQuery to achieve this?

  • ✓ C. SELECT shopper_id, SUM(spend_amount) AS total_spent FROM retail_ledger WHERE transaction_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 9 MONTH) GROUP BY shopper_id HAVING total_spent > 1500

The correct option is SELECT shopper_id, SUM(spend_amount) AS total_spent FROM retail_ledger WHERE transaction_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 9 MONTH) GROUP BY shopper_id HAVING total_spent > 1500.

This query correctly limits rows to the past nine months by comparing the transaction date to the current date minus nine months. It aggregates spend by shopper using SUM and groups by shopper_id which produces one row per shopper. It then uses a HAVING filter on the aggregated total to keep only shoppers whose total exceeds 1500.

SELECT shopper_id, SUM(spend_amount) AS total_spent FROM retail_ledger WHERE transaction_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 9 MONTH) GROUP BY item_id HAVING total_spent > 1500 is wrong because it groups by item_id while selecting shopper_id which causes an invalid grouping and does not produce per shopper totals.

SELECT shopper_id, AVG(spend_amount) AS total_spent FROM retail_ledger WHERE transaction_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 9 MONTH) GROUP BY shopper_id HAVING total_spent > 1500 is wrong because it calculates an average rather than the required sum.

SELECT shopper_id, SUM(spend_amount) AS total_spent FROM retail_ledger WHERE transaction_date < DATE_ADD(CURRENT_DATE(), INTERVAL 9 MONTH) GROUP BY shopper_id HAVING total_spent > 1500 is wrong because it uses a future cutoff which would include almost all historical transactions rather than only the past nine months.

Exam Tip

When filtering on an aggregate use HAVING after you group and make sure your grouping columns match the nonaggregated columns in the select list. For rolling time windows anchor the filter to the current date with DATE_SUB or similar functions rather than adding time into the future.

Question 6

How should you load only new or changed rows from an on premises Oracle 12c database into BigQuery every 30 minutes?

  • ✓ B. Dataflow JDBC pipeline with incremental loads every 30 minutes

The correct option is Dataflow JDBC pipeline with incremental loads every 30 minutes. This best meets the requirement to pull only new or changed rows on a fixed schedule and land them directly in BigQuery.

This approach uses the JDBC connector to issue a parameterized query that filters on a high water mark column such as an updated timestamp or an incrementing ID so only deltas are read. The pipeline writes the results straight to BigQuery which avoids unnecessary staging. You can run it every 30 minutes by deploying a reusable template and scheduling executions which gives reliable throughput, parallelism, and built in retry behavior for production grade ingestion from Oracle.
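
For illustration, the incremental read that the pipeline issues over JDBC is usually just a delta query against a change tracking column, with the table and column names below assumed rather than taken from the question.

  -- last_watermark is the high water mark saved after the previous successful run
  -- and passed to the query as a bind parameter on each 30 minute execution.
  SELECT order_id, customer_id, order_status, updated_at
  FROM sales.orders
  WHERE updated_at > :last_watermark
  ORDER BY updated_at;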

Database Migration Service to Cloud SQL then query to BigQuery is not appropriate because Database Migration Service is designed to migrate or replicate databases into Cloud SQL rather than feed BigQuery. It introduces an extra hop and added complexity without solving the requirement to capture and load only changed rows into the warehouse.

Cloud Functions with JDBC to BigQuery is a poor fit because functions are optimized for short lived event driven tasks and are not ideal for database polling, connection pooling, or bulk data movement on a strict schedule. Timeouts and concurrency characteristics make them unreliable for consistent incremental extracts from Oracle.

Dataproc with Sqoop then load to BigQuery adds heavy operational overhead for a frequent 30 minute cadence and Sqoop itself has been retired which makes it a less viable choice on newer exams. It would also require staging and custom logic to implement incremental change capture to reach BigQuery.

Exam Tip

When you see a requirement for incremental relational loads into BigQuery on a fixed schedule, prefer managed data pipelines that have native connectors and template scheduling. Avoid multi hop migrations and general purpose serverless code for ongoing database ingestion.

Question 7

Rivermark Apparel receives CSV files in a Cloud Storage bucket every 24 hours and must validate, transform, and load the data into BigQuery as a repeatable pipeline. What Google Cloud service setup should they choose to coordinate the end to end workflow?

  • ✓ C. Cloud Composer to orchestrate validation, transformation, and loading across Cloud Storage and BigQuery

The correct choice is Cloud Composer to orchestrate validation, transformation, and loading across Cloud Storage and BigQuery.

Cloud Composer provides a managed Apache Airflow environment that is designed to coordinate multi step data workflows. It can run on a schedule that matches the daily file arrival, use sensors to detect objects in Cloud Storage, run validation and transformation tasks, and then load the results into BigQuery. It also offers dependency management, retries, alerting, and centralized monitoring which are essential for a reliable and repeatable pipeline.

Cloud Run jobs scheduled with Cloud Scheduler is primarily a time based trigger that starts a containerized job and it does not provide native directed acyclic graph orchestration, cross service dependencies, or built in task level retries and lineage that are standard in a workflow orchestrator.

Cloud Functions for validation with Dataflow for transformation and BigQuery for loading lists useful components but it lacks a dedicated orchestrator. You could chain triggers, yet this becomes brittle as complexity grows and it makes dependency handling, retries, and centralized observability more difficult than using a workflow service.

Dataproc for validation and transformation with BigQuery Data Transfer Service to load the results combines a processing cluster with a transfer tool that is intended for supported source to BigQuery transfers. It does not coordinate custom validation steps and job dependencies across services, so it does not meet the need for end to end workflow orchestration.

Exam Tip

When a question asks for end to end coordination across multiple services, look for an orchestrator that provides scheduling, dependencies, and retries such as Cloud Composer rather than assembling individual compute or transfer tools.

Question 8

Which Google Cloud data store best supports high throughput time series telemetry that will reach five billion rows in three months and is queried by device ID and contiguous time ranges?

  • ✓ B. Bigtable

The correct option is Bigtable.

Bigtable is a wide column NoSQL database that excels at very high write throughput and low latency reads at massive scale. You can design the row key to start with the device ID followed by a time component which keeps data in lexicographic order. This design enables efficient range scans over contiguous time windows for a single device and it scales comfortably to billions of rows with predictable performance.

Cloud Spanner is horizontally scalable and strongly consistent, yet it is a relational service optimized for distributed transactions and relational schemas. For write heavy time series that are accessed by key and time range, a wide column store purpose built for range scans is a better fit.

Cloud SQL is a managed relational database that primarily scales vertically. It would struggle with sustained high ingestion rates and with managing billions of rows, and sharding to meet the scale adds operational complexity and does not match the access pattern as efficiently.

BigQuery is an analytical data warehouse that is optimized for large batch analytics rather than low latency per device lookups. Streaming inserts have quotas and per row costs, and frequent small writes are not ideal. It is better suited as a downstream analytics sink rather than the primary time series store for this workload.

Exam Tip

Start by mapping the access pattern to the storage model. For high throughput time series with queries by device ID and contiguous time ranges choose a store that supports ordered keys and efficient range scans, and design the row key as device identifier then timestamp.

Question 9

Blue Harbor Logistics needs a real time telemetry pipeline for refrigerated trucks that send JSON readings for temperature and GPS at about 20,000 messages per minute. The events must be transformed into a consistent schema and streamed into BigQuery to power live dashboards. The platform should auto scale, keep operational work low, and rely on managed Google Cloud services. What should you implement?

  • ✓ C. Build a streaming Dataflow pipeline that reads from Pub/Sub, applies the required transformations, and writes to BigQuery using the Storage Write API

The correct option is Build a streaming Dataflow pipeline that reads from Pub/Sub, applies the required transformations, and writes to BigQuery using the Storage Write API.

This design uses a fully managed streaming pipeline that scales automatically and keeps operational work low. Messages are ingested into Pub/Sub for durable, elastic buffering, then the pipeline parses the JSON, validates and normalizes fields, handles late or out of order data if needed, and enforces a consistent schema. It then streams into BigQuery using the high throughput write interface which provides low latency, strong ordering within streams, and excellent reliability for real time analytics.

This approach directly supports the required throughput of about twenty thousand messages per minute, delivers near real time updates for dashboards, and centralizes transformation logic in a single managed service rather than spreading it across custom code and triggers.

Run a container on Cloud Run that polls device endpoints every 30 seconds, cleans the payloads, and inserts rows into BigQuery with the BigQuery API is not a good fit because polling introduces unnecessary latency and complexity, and doing row inserts through the API is inefficient at this scale. This creates more operational burden and does not provide the stateful streaming features needed for robust transformation.

Send incoming events to Cloud Storage and trigger Cloud Functions to validate and transform files before loading them into BigQuery in small batches is primarily a batch pattern and would add file staging overhead and higher latency. It does not meet the real time requirement and would struggle to keep dashboards live while also increasing operational coordination.

Configure a Pub/Sub BigQuery subscription that writes raw messages into a BigQuery table without transformations cannot satisfy the need for a consistent schema because it bypasses the transformation layer. It can be useful when events already match the target schema, but here the pipeline must validate and transform the data before loading.

Exam Tip

When you see requirements for real time ingestion, transformations, and low operations with auto scaling, map the pattern to Pub/Sub ingestion, a streaming Dataflow pipeline for processing, and BigQuery streaming writes.

Question 10

In BigQuery ML time series forecasting for the next 12 weeks, which metric best measures how close predictions are to actual values?

  • ✓ C. Root mean squared error

The correct option is Root mean squared error.

This metric computes the square root of the average squared differences between forecasts and actual values. It is expressed in the same units as the target so it directly reflects how far predictions are from the true values. It also penalizes larger errors more strongly which aligns well with evaluating forecasting accuracy in BigQuery ML.
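
As a sketch under assumed table names, once the forecast rows produced by ML.FORECAST are joined to the observed values, RMSE can be computed directly in SQL.

  -- forecast_12w holds the ML.FORECAST output and weekly_actuals holds observed values;
  -- both table names and the week_start and actual_value columns are hypothetical.
  SELECT
    SQRT(AVG(POW(a.actual_value - f.forecast_value, 2))) AS rmse
  FROM `analytics.forecast_12w` AS f
  JOIN `analytics.weekly_actuals` AS a
    ON a.week_start = DATE(f.forecast_timestamp);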

R squared measures the proportion of variance explained by the model but it does not represent the typical magnitude of prediction errors in the target units so it is not the best choice for closeness.

Mean absolute percentage error reports error as a percentage which can be useful across scales but it can be undefined when actuals are zero and can overemphasize errors on small values so it is less reliable when the goal is closeness in original units.

ROC AUC evaluates ranking quality for classification problems and does not apply to regression or time series forecasting tasks.

Exam Tip

When a question asks how close predictions are to actual values prefer error metrics in the target units such as RMSE. Exclude classification metrics and be cautious with percentage metrics if zeros may occur.

Question 11

Riverton Apparel captures web and app event logs from phones, tablets and laptops and stores them in BigQuery with 18 months of history. Analysts routinely build audience segments using metrics such as session length, bounce rate and navigation paths. You need to calculate these measures quickly and also reduce query costs for repeated analyses. Which BigQuery capability should you adopt to boost performance and simplify these behavioral calculations?

  • ✓ C. Partitioned tables with clustering

The correct option is Partitioned tables with clustering.

This approach lets you partition the event logs by a date or timestamp column and then organize data within each partition by keys that analysts frequently filter or group on. That combination prunes entire partitions for date range queries and reduces bytes scanned within each partition for common audience segment filters. As a result, calculations for session length, bounce rate and navigation paths run faster and cost less across the 18 months of history. It also simplifies repeated analyses because the physical layout is aligned with typical behavioral workloads.
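
A minimal sketch of that layout, assuming a hypothetical events table with an event_timestamp column and commonly filtered keys such as user_id and event_name, might look like this.

  CREATE TABLE `analytics.events_partitioned`
  PARTITION BY DATE(event_timestamp)   -- prunes whole partitions for date range filters
  CLUSTER BY user_id, event_name       -- cuts bytes scanned inside each partition
  AS
  SELECT * FROM `analytics.events`;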

Materialized views are best for a stable aggregate over a specific query pattern and they have limitations with complex joins, windowed sessionization and frequent changes to segment definitions. They would not flex well for varied behavioral analyses and may not reduce cost for ad hoc exploration.

Federated queries are for querying data stored outside BigQuery. Your data already resides in BigQuery, so using this feature would not improve performance and could introduce unnecessary overhead.

User-defined functions help with code reuse and readability, but they do not change how much data is scanned and they usually do not lower cost or significantly speed up queries on large behavioral datasets.

Exam Tip

Look for workloads that filter by time and by a few selective columns. The best move is to partition by a date or timestamp and then cluster by the most commonly filtered or grouped keys to reduce scanned bytes and improve speed.

Question 12

Which Google Cloud service supports bandwidth throttling during business hours and provides detailed transfer logs when migrating 150 TB from on premises NAS to Cloud Storage?

  • ✓ B. Google Cloud Storage Transfer Service

The correct option is Google Cloud Storage Transfer Service.

Google Cloud Storage Transfer Service is designed for large migrations from on premises NAS into Cloud Storage and supports scheduling so you can run jobs only outside business hours or limit activity during the day. It also provides bandwidth controls so you can throttle throughput when needed and it writes detailed operational and audit logs to Cloud Logging so you can track performance and troubleshoot issues.

gcloud storage CLI can copy data to Cloud Storage but it does not provide enterprise grade scheduling windows, centrally managed bandwidth throttling for on premises agents, or comprehensive job level logging suitable for a 150 TB migration.

BigQuery Data Transfer Service moves data into BigQuery from sources such as SaaS applications and Cloud Storage. It does not migrate files from NAS to Cloud Storage and it does not offer bandwidth throttling for this use case.

Exam Tip

When a question mentions very large data volumes with needs like bandwidth control, scheduling, and detailed logs, think of the managed migration services rather than basic command line tools. Match the destination to the service focus and prefer solutions that integrate with Cloud Logging for auditability.

Question 13

The analytics team at mcnz.com uses Looker and needs a reusable field that calculates profit margin for each SKU, defined as (sales_revenue minus product_cost) divided by sales_revenue, so the same logic is available in multiple Explores and dashboards. Which LookML construct should you define in the model to implement this calculation once and make it widely available?

  • ✓ B. measure

The correct option is measure.

A measure is designed for aggregated calculations and ratios such as profit margin where you divide the difference between sales revenue and product cost by sales revenue. Defining this once in LookML as a measure makes the field reusable across Explores and dashboards and ensures it is computed at the query grain while respecting filters and groupings. This is how you centralize business logic so analysts get a consistent profit margin everywhere.
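
For context, the aggregation that such a measure effectively generates when grouped by SKU resembles the query below, where the table and SKU column names are assumed for illustration while sales_revenue and product_cost come from the question.

  SELECT
    item_sku,
    SAFE_DIVIDE(SUM(sales_revenue) - SUM(product_cost), SUM(sales_revenue)) AS profit_margin
  FROM `commerce.order_items`
  GROUP BY item_sku;
  -- In LookML this logic lives in one measure so every Explore and dashboard reuses a
  -- single definition instead of re deriving the ratio.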

The view option is not a field type that performs the calculation. It is a container for fields and choosing view alone would not implement the profit margin logic.

A dimension is row level and is not appropriate for an aggregated ratio. Using a dimension for profit margin risks incorrect results because the math should be done after aggregation at query time.

A derived table creates a new table from a query and is used for precomputing or shaping data. It is unnecessary for a simple reusable field and would add avoidable complexity compared to a single measure.

Exam Tip

When you see a calculation that needs aggregation or a ratio that must respect filters and groupings, think measure. If it is a row level attribute, think dimension. Reserve derived tables for when you must precompute or reshape data.

Question 14

Which Google Cloud service offers the lowest cost managed storage for rarely accessed audit logs that must be retained for 15 years?

  • ✓ B. Cloud Storage Archive tier

The correct option is Cloud Storage Archive tier because it provides the lowest cost managed storage for data that is rarely accessed and it is ideal for retaining audit logs for 15 years.

This tier is purpose built for long term archival where reads are infrequent. It offers the lowest per gigabyte storage price among Cloud Storage classes and supports retention policies with Bucket Lock so you can enforce multi year compliance requirements.

BigQuery is an analytics warehouse rather than an archival object store. Even with its long term storage pricing, which applies after a table goes unmodified for 90 days, it is generally more expensive for multi year cold retention and query based access adds cost overhead.

Cloud Storage Nearline is optimized for data accessed about once a month or less and its storage price is higher than the lowest cost class in Cloud Storage. It is not the best fit for data that must be kept for 15 years with very rare access.

Cloud Storage Coldline is designed for data accessed less than once a quarter or once a year and while it is cheaper than Nearline it still costs more than the lowest cost class in Cloud Storage, so it is not the most cost effective choice for 15 year retention.

Exam Tip

Map frequency keywords to storage classes. When you see rarely accessed and very long retention think Archive. Monthly access points to Nearline and yearly access points to Coldline.

Question 15

A data analyst at Solstice Outfitters has about 48 CSV files with customer profiles stored in a Cloud Storage bucket. Before building downstream ingestion into BigQuery the analyst wants a fast way to reveal data quality problems such as missing values and inconsistent data types. What should the analyst do first?

  • ✓ B. Load a representative sample into a temporary BigQuery table and run SQL to check for nulls, type mismatches, and other anomalies

The correct option is Load a representative sample into a temporary BigQuery table and run SQL to check for nulls, type mismatches, and other anomalies. This gives the analyst quick visibility into missing values and inconsistent types while avoiding unnecessary pipeline work.

By loading a small sample into a temporary table the analyst can immediately profile the data using simple SQL. Queries with COUNTIF reveal how many NULLs exist in each column. SAFE_CAST exposes type mismatches without failing the query. Additional checks for uniqueness, value ranges, and pattern conformity are straightforward, and sampling keeps cost and turnaround time low while still exposing systemic issues. BigQuery can load CSVs directly from Cloud Storage and this approach is purpose built for rapid validation before a wider ingestion.
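
A quick profiling pass over that sample might look like the query below, where the staging table and the specific column names are assumed for illustration.

  SELECT
    COUNT(*) AS row_count,
    COUNTIF(customer_id IS NULL) AS missing_customer_ids,
    COUNTIF(email IS NULL OR email = '') AS missing_emails,
    -- SAFE_CAST returns NULL instead of erroring, which surfaces type mismatches
    COUNTIF(signup_date IS NOT NULL AND SAFE_CAST(signup_date AS DATE) IS NULL) AS bad_dates,
    COUNTIF(lifetime_value IS NOT NULL AND SAFE_CAST(lifetime_value AS NUMERIC) IS NULL) AS bad_amounts
  FROM `staging.customer_profiles_sample`;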

Use Pub/Sub to stream the CSV files into BigQuery for real time validation is not appropriate because Pub/Sub targets event streaming rather than batch files in Cloud Storage. Standing up streaming infrastructure adds complexity and time and it does not provide built in data quality checks for this exploratory task.

Convert the CSV files to JSON with Dataflow before loading into BigQuery adds unnecessary processing and cost. BigQuery natively loads CSV and changing file format does not improve the ability to detect missing values or type issues compared with simple SQL checks.

Infer schemas on a small sample with Vertex AI Workbench to detect inconsistencies relies on notebooks that are better suited for experimentation and modeling. Ad hoc schema inference here is slower and less reliable than loading a sample into BigQuery and running targeted SQL validations.

Exam Tip

When the question asks for the fastest way to assess file quality in Cloud Storage think of a small sample in BigQuery with quick SQL checks. Use SAFE_CAST to surface type issues and COUNTIF to find missing values before you build pipelines.

Question 16

Which managed GCP service should you use to schedule a nightly Dataproc PySpark job at 0245 UTC that reads from Cloud Storage and writes results to BigQuery table foo.region_sales?

  • ✓ B. Dataproc Workflow Templates

The correct option is Dataproc Workflow Templates because it is the native managed orchestration for Dataproc jobs. It lets you define the PySpark step and run it on a recurring schedule, so a nightly 0245 UTC run that reads from Cloud Storage and writes to the BigQuery table foo.region_sales fits this capability.

This service lets you package the cluster configuration and the PySpark job in a reusable template and then schedule executions on a predictable cadence. You gain built in reliability features such as retries and job dependency handling while avoiding the overhead of running your own scheduler, which makes it a straightforward choice for a single recurring Dataproc job that moves data from Cloud Storage to BigQuery.

Cloud Scheduler on its own only triggers HTTP targets or Pub/Sub topics and it does not define or manage Dataproc jobs or clusters. You would still need additional components to submit and manage the PySpark job, so it is not the best single managed service to choose for this use case.

Cloud Composer is a managed Airflow environment that can orchestrate complex pipelines, yet it is heavier to operate and more costly for a simple nightly Dataproc job. It is better suited when you have multi step DAGs and cross service dependencies rather than a single scheduled job.

BigQuery Scheduled Queries only schedules SQL statements executed by BigQuery. It cannot run a Dataproc PySpark job that reads from Cloud Storage and writes the results to BigQuery, so it does not meet the requirement.

Exam Tip

First map the workload to the native service. For a Dataproc job think Workflow Templates. Use Composer only when you need complex DAGs and use Scheduled Queries only for BigQuery SQL.

Question 17

A regional credit union stores daily encrypted database snapshots in a Cloud Storage bucket for regulatory compliance and disaster recovery. Each snapshot is kept for five years, and the data is rarely accessed except during audits or when restoring systems. The organization wants to minimize storage costs while keeping the data durable and retrievable when needed. Which lifecycle policy should you configure?

  • ✓ C. Place backups in Archive storage and delete them after five years

The correct option is Place backups in Archive storage and delete them after five years.

This choice fits data that is rarely accessed and must be kept for a long period. It provides the lowest storage cost among the classes for infrequently accessed data while maintaining high durability and millisecond retrieval so audits and restores remain practical. A lifecycle rule that deletes the objects after five years enforces the retention requirement and reduces ongoing storage spend.

Store backups in Nearline storage and delete them after five years is not optimal because Nearline is intended for data accessed about once per month and it costs more than the colder classes for long term retention. For multi year archives with very rare access there is a more cost effective class.

Keep backups in Standard storage for the first one year then transition to Archive and delete them after five years adds unnecessary cost because Standard is priced for frequent access which the scenario does not require. Starting in the coldest suitable class avoids paying higher first year costs.

Store backups in Coldline storage and delete them after five years is close but still not the lowest cost for data that will be kept for years and accessed very rarely. A colder class offers better price efficiency for this access pattern.

Exam Tip

Start with the expected access frequency and choose the coldest class that still meets retrieval needs. Pair it with a lifecycle rule to auto delete at the required retention period and remember minimum storage duration and access costs during audits.

Question 18

Which Google Cloud service should you use to store and analyze 30 TB of tabular data using standard SQL for dashboards and ad hoc queries with daily ingestion?

  • ✓ C. BigQuery

The correct option is BigQuery.

It is a fully managed serverless data warehouse that supports standard SQL and is designed for interactive analytics at the tens of terabytes scale and beyond. It delivers high concurrency for dashboards and ad hoc exploration and it supports daily ingestion through batch loads or low latency streaming without requiring you to manage infrastructure.
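
For example, the daily ingestion can be a plain load statement, with the table name and Cloud Storage path below assumed purely for illustration.

  LOAD DATA INTO `analytics.daily_sales`
  FROM FILES (
    format = 'CSV',
    skip_leading_rows = 1,
    uris = ['gs://example-ingest/sales/2025-01-15/*.csv']  -- hypothetical bucket and path
  );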

Cloud SQL targets transactional relational workloads and while it supports SQL it is not optimized for large scale analytical scans around 30 TB and would require extensive tuning and resources that make it a poor fit for BI style queries.

Dataproc provides managed Hadoop and Spark clusters that are well suited for batch processing and custom pipelines. It is not a serverless SQL analytics service for fast interactive dashboards and it introduces cluster management overhead that does not match the requirement.

Cloud Spanner is a globally distributed relational database for strongly consistent transactional workloads. It is not intended for large scale analytical querying for dashboards and ad hoc analysis.

Exam Tip

Look for cues like TB scale, standard SQL, dashboards, and ad hoc queries. These usually point to a serverless analytics warehouse such as BigQuery rather than OLTP databases or Hadoop style clusters.

Question 19

At BlueRiver Freight, the analytics team discovers that the BigQuery table shipping_contacts contains inconsistent address abbreviations such as “Rd” and “Road” and many values include leading, trailing, or repeated internal spaces. You want to standardize the address text and remove superfluous whitespace directly in BigQuery SQL without moving the data to another service. Which approach should you take?

  • ✓ C. Use REGEXP_REPLACE to normalize address patterns and TRIM to strip unnecessary whitespace

The correct option is Use REGEXP_REPLACE to normalize address patterns and TRIM to strip unnecessary whitespace.

REGEXP_REPLACE lets you match patterns rather than only exact substrings, which means you can standardize abbreviations like Road and Rd with word boundaries or other patterns and avoid accidental partial replacements. You can also collapse repeated internal spaces by replacing one or more whitespace characters with a single space. TRIM efficiently removes leading and trailing spaces so together these functions deliver complete cleanup directly in BigQuery SQL without moving data.
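
As a sketch, with the dataset and address column names assumed, the whole cleanup fits in one expression.

  SELECT
    TRIM(
      REGEXP_REPLACE(                                  -- collapse repeated internal spaces
        REGEXP_REPLACE(address, r'\bRd\b', 'Road'),    -- standardize one abbreviation
        r'\s+', ' '
      )
    ) AS clean_address
  FROM `logistics.shipping_contacts`;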

Use REPLACE to enforce standard abbreviations and LTRIM and RTRIM to strip spaces is not ideal because REPLACE only performs literal substitutions, which makes it difficult to target whole words and variants and can lead to unintended changes. LTRIM and RTRIM remove only leading and trailing spaces and they do not fix repeated internal spaces.

Cloud Dataflow with Apache Beam does not meet the requirement to work directly in BigQuery since it shifts the work to another service and adds operational overhead for a simple text normalization task.

Export to Cloud Data Fusion for transformations and reload the dataset into BigQuery also violates the requirement because it moves data out of BigQuery and reintroduces it, which adds cost and latency for a task that SQL can handle natively.

Exam Tip

When a question says to do it in BigQuery prefer built in SQL functions. Use REGEXP_REPLACE for pattern based edits and TRIM for edges, and remember that TRIM does not collapse internal whitespace.

Question 20

How should you minimize cost for BigQuery data that is heavily queried for 6 months and must remain available for 10 years for audits?

  • ✓ B. Partition by date and export older than 6 months to Cloud Storage Archive with a 10 year delete rule

The correct option is Partition by date and export older than 6 months to Cloud Storage Archive with a 10 year delete rule.

This works because date partitioning lets you keep the recent six months of data in BigQuery for fast and heavy querying while scanning only the needed partitions to reduce query costs. Once the data cools after six months you export it to Cloud Storage Archive where storage is priced for very infrequent access and is therefore much cheaper for long term retention. A lifecycle delete rule set to ten years enforces the retention requirement automatically and it removes the need for manual cleanup.

Keeping only the recent partitions in BigQuery minimizes both storage and query costs during the active period. Moving older data to Archive preserves availability for audits since it can still be retrieved when needed and the overall spend remains low for the long tail of retention.
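
A hedged sketch of the export step, assuming a partitioned table named warehouse.transactions with an event_date partition column and a hypothetical archive bucket, might look like this.

  EXPORT DATA OPTIONS (
    uri = 'gs://example-audit-archive/transactions/*.parquet',  -- hypothetical bucket
    format = 'PARQUET'
  ) AS
  SELECT *
  FROM `warehouse.transactions`
  WHERE event_date < DATE_SUB(CURRENT_DATE(), INTERVAL 6 MONTH);
  -- A lifecycle rule on the bucket then deletes the exported objects after 10 years,
  -- and the old partitions can be dropped from BigQuery once the export is verified.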

Use BigQuery long term storage after 120 days and delete after 10 years is not optimal because long term storage pricing applies automatically after a table or partition goes 90 days without modification, so there is nothing to configure at 120 days. More importantly this keeps all data in BigQuery for ten years which costs more than moving cold and rarely accessed data to Cloud Storage Archive for the audit period.

Export older data to Cloud Storage Nearline with a 10 year delete rule is less cost effective because Nearline is intended for data accessed roughly once a month while Archive is the lowest cost option for data that is rarely accessed such as annual or ad hoc audits over a decade.

Exam Tip

Map data temperature to services. Keep hot data in BigQuery with partitioning to limit scan costs and move cold data to the right Cloud Storage class with lifecycle policies. Pick Archive for rare access over many years and Nearline only when monthly access is expected.

Cameron McKenzie is an AWS Certified AI Practitioner, Machine Learning Engineer, Copilot Expert, Solutions Architect and author of many popular books in the software development and Cloud Computing space. His growing YouTube channel training devs in Java, Spring, AI and ML has well over 30,000 subscribers.