GCP Google Certified DevOps Engineer Practice Exams

GCP Certification Exam Topics and Tests

Over the past few months, I have been helping cloud engineers, DevOps specialists, and infrastructure professionals prepare for the GCP Certified Professional DevOps Engineer certification. A good start? Prepare with GCP Professional DevOps Engineer Practice Questions and Real GCP Certified DevOps Engineer Exam Questions.

Through my training programs and the free GCP Certified Professional DevOps Engineer Questions and Answers available at certificationexams.pro, I have identified common areas where candidates benefit from deeper understanding.

Google Cloud Certification Exam Simulators

That insight helped shape a comprehensive set of GCP Professional DevOps Engineer Sample Questions that closely match the tone, logic, and challenge of the official Google Cloud exam.

You can also explore the GCP Certified Professional DevOps Engineer Practice Test to measure your readiness. Each question includes clear explanations that reinforce key concepts such as automation pipelines, SLO management, monitoring, and alerting.

These materials are not about memorization.

They focus on helping you build the analytical and technical skills needed to manage Google Cloud environments with confidence.

Real Google Cloud Exam Questions

If you are looking for Google Certified DevOps Engineer Exam Questions, this resource provides authentic, scenario-based exercises that capture the structure and complexity of the real exam.

The Google Certified DevOps Engineer Exam Simulator recreates the pacing and experience of the official test, helping you practice under realistic conditions.

You can also review the Professional DevOps Engineer Braindump-style study sets grouped by domain to reinforce your understanding through applied practice. Study consistently, practice diligently, and approach the exam with confidence.

With the right preparation, you will join a community of skilled DevOps professionals trusted by organizations worldwide.


Question 1

Blue Harbor Capital wants to streamline how it exports Google Cloud logs for analysis and to choose a configuration that balances storage cost with data retention. The team plans to keep only the necessary logs in BigQuery for long term analytics while avoiding unnecessary spend. What approach should they use?

  • ❏ A. Export all logs to Cloud Storage without filtering, process with Dataflow to remove unwanted records, then load curated results into BigQuery for historical reporting

  • ❏ B. Create one sink per log category and route to Pub/Sub for streaming analysis, then write into BigQuery using a Dataflow pipeline

  • ❏ C. Create a single Cloud Logging sink with an advanced filter that exports only required entries to BigQuery and set table or partition expiration to control retention and costs

  • ❏ D. Export every log to BigQuery without filters and later use SQL to select the needed records, then rely on BigQuery Data Transfer Service to manage retention

Question 2

Which approach ensures container images are vulnerability scanned and blocks GKE deployment when high severity issues are found?

  • ❏ A. Binary Authorization with signed images

  • ❏ B. GKE default configuration

  • ❏ C. Artifact Registry scanning with Cloud Build gate

  • ❏ D. Cloud Deploy only

Question 3

You manage a latency sensitive API on Google Compute Engine for the analytics startup BlueKite Insights that runs in us-central1, and leadership requires business continuity with an RTO under 45 seconds if a whole zone goes down. You need a design that will shift traffic automatically without manual steps in the event of a zonal outage. What should you set up?

  • ❏ A. Use a zonal managed instance group and enable automatic restart and live migration

  • ❏ B. Configure an external HTTP(S) Load Balancer with a single backend service in one zone

  • ❏ C. Create a regional managed instance group that distributes instances across at least two zones in the region

  • ❏ D. Use Cloud DNS failover to switch between two unmanaged instance groups that both run in the same zone

Question 4

Which GCP service should you use to centrally manage encryption keys with the strongest protection and automatic rotation to reduce blast radius?

  • ❏ A. VPC Service Controls

  • ❏ B. Cloud KMS with automatic rotation

  • ❏ C. Secret Manager

  • ❏ D. Inject secrets at provisioning

Question 5

A logistics startup named TallyRoute runs its development services on Google Kubernetes Engine. In this environment the applications emit very chatty logs, and developers inspect them with kubectl logs and do not rely on Cloud Logging. There is no common log schema across these services. You want to lower Cloud Logging spending related to application logs while still retaining GKE operational logs for troubleshooting. What should you do?

  • ❏ A. Run gcloud container clusters update dev-west1 --logging=SYSTEM for the development cluster

  • ❏ B. Add an exclusion on the _Default sink that filters out workload entries with resource.type = "k8s_container" and severity <= DEBUG

  • ❏ C. Run gcloud logging sinks update _Default --disabled in the development project

  • ❏ D. Create a Log Router sink that exports all k8s_container logs to BigQuery and set table expiration to 2 days

Question 6

Which solution lets Cloud Build run builds with private VPC access to call internal APIs without using public endpoints and with minimal operations?

  • ❏ A. Cloud Deploy

  • ❏ B. Internal HTTP(S) Load Balancer

  • ❏ C. Private pools for Cloud Build

  • ❏ D. External HTTP(S) Load Balancer with Cloud Armor

Question 7

You are the on call engineer at Lumina Metrics, a retail analytics startup that runs critical services on Google Cloud. A severity one outage was declared 20 minutes ago and customers cannot load dashboards. You need to organize responders and choose communication methods so the team can restore service quickly and safely. What should you do?

  • ❏ A. Ask one engineer to fill every role including Incident Commander, Communications Lead, and Operations Lead and use only email threads to share updates

  • ❏ B. Let all responders work independently without assigning roles and coordinate through ad hoc messages to reduce overhead

  • ❏ C. Appoint an Incident Commander, assign distinct Communications and Operations leads, and coordinate in a persistent real time chat channel for collaboration and decision tracking

  • ❏ D. Create a Cloud Pub/Sub topic for the incident and post updates there while leaving roles informal to save time

Question 8

For a UK only website in europe-west2 using the Envoy based external HTTP(S) load balancer, which network tier and scope minimize cost while meeting the constraints?

  • ❏ A. Premium Tier with a global external HTTP(S) load balancer

  • ❏ B. Standard Tier with a regional internal HTTP(S) load balancer

  • ❏ C. Standard Tier with a regional external HTTP(S) Application Load Balancer

  • ❏ D. Premium Tier with a regional external HTTP(S) load balancer

Question 9

Your platform team operates a multi-tier application on Google Cloud. During a midweek change window that lasted 90 minutes, a teammate updated a VPC firewall rule and accidentally blocked a critical backend, which caused a production incident that impacted many users at example.com. The team wants to follow Google recommendations to reduce the risk of this type of mistake. What should you do?

  • ❏ A. Perform firewall updates only during a scheduled maintenance window

  • ❏ B. Automate all infrastructure updates so that humans do not edit resources directly

  • ❏ C. Require peer review and approval for every change before it is rolled out

  • ❏ D. Enable VPC Firewall Rules Logging and alert on high deny rates in Cloud Monitoring

Question 10

Which GCP service should manage secrets so a CI/CD pipeline for GKE Autopilot avoids exposing values in source or logs and allows rotating credentials every 60 days without changing pipeline code?

  • ❏ A. Cloud Storage with CMEK

  • ❏ B. Check Kubernetes Secrets into Git

  • ❏ C. Secret Manager with IAM for Cloud Build and GKE

  • ❏ D. Cloud KMS encrypted blobs in repo

Question 11

You are the DevOps lead at Trailforge Books where your microservices application runs on Google Cloud and you must improve runtime performance while gaining clear visibility into resource consumption. You plan to use Google Cloud’s operations suite for observability and alerting. Which actions should you take to meet these goals? (Choose 2)

  • ❏ A. Use Cloud Trace to analyze distributed latency and pinpoint bottlenecks so you can tune the service

  • ❏ B. Configure Cloud Monitoring to collect CPU and memory metrics for all services and create alerting policies with threshold conditions

  • ❏ C. Turn off Cloud Logging to reduce latency and lower resource usage

  • ❏ D. Publish a custom “requests_per_second” metric to Cloud Monitoring and configure Cloud Run to autoscale directly from that metric

  • ❏ E. Deploy an in house Prometheus and Grafana stack instead of using the operations suite for monitoring

Question 12

A Cloud Build pipeline stops producing container images after a recent cloudbuild.yaml change. Following SRE practices for root cause analysis and safe rollback, what should you do?

  • ❏ A. Disable the build trigger then build and push images from a developer laptop

  • ❏ B. Increase the Cloud Build timeout and run the build again

  • ❏ C. Diff the last known good cloudbuild.yaml against the current Git revision and revert or fix the regression

  • ❏ D. Rotate the credentials used by the Cloud Build push step then run the build again

Question 13

A DevOps team at Pinecrest Analytics plans to manage Google Cloud infrastructure as code, and they require a declarative configuration that can be stored in Git and that can automate both creation and updates of resources across multiple projects. Which service should they use?

  • ❏ A. Config Connector

  • ❏ B. Google Cloud Console

  • ❏ C. Google Cloud Deployment Manager

  • ❏ D. Google Cloud Build

Question 14

Which GCP services should you use to trace and correlate interservice latency and errors in a GKE microservices application to identify the root cause?

  • ❏ A. Cloud Profiler and Cloud Logging

  • ❏ B. Cloud Trace and Cloud Monitoring

  • ❏ C. Network Intelligence Center

  • ❏ D. VPC Service Controls and Cloud Armor

Question 15

At Riverbeam Media your SRE team enabled a small canary for a new checkout feature in a GCP hosted web application. Within eight minutes your alerts report a sharp rise in HTTP 500 responses and the p95 latency has increased significantly. You want to minimize customer impact as fast as possible. What should you do first?

  • ❏ A. Start a detailed root cause investigation using Cloud Trace and Cloud Logging

  • ❏ B. Immediately add more backend instances to try to absorb the load

  • ❏ C. Revert the canary rollout right away so traffic goes back to the last stable version

  • ❏ D. Begin documenting the incident timeline for the postmortem

Question 16

Which workflow should you use to reduce Terraform merge conflicts and ensure only approved changes reach the main source of truth in Google Cloud?

  • ❏ A. Cloud Source Repositories with direct commits to main and Cloud Build apply on push

  • ❏ B. Git with feature branches and PRs with reviews and automated Terraform checks and main as source of truth

  • ❏ C. Versioned Cloud Storage bucket as the canonical Terraform code store with manual object renames

Question 17

You support an order processing service for Riverview Retail that runs on Compute Engine instances. The compliance team requires that an alert be sent if end to end transaction latency is greater than 3 seconds, and it must only notify if that condition continues for more than 10 minutes. How should you implement this in Google Cloud Monitoring to meet the requirement?

  • ❏ A. Build a Cloud Function that runs for each transaction and sends an alert whenever processing time surpasses 3 seconds

  • ❏ B. Define a request latency SLO in Service Monitoring and configure an error budget burn rate alert over a 10 minute window

  • ❏ C. Create a Cloud Monitoring alert that evaluates the 99th percentile transaction latency and fires when it remains above 3 seconds for at least 10 minutes

  • ❏ D. Configure a Cloud Monitoring alert on the average transaction latency that triggers when it is above 3 seconds for 10 minutes

Question 18

Which GCP design enables secure Cloud Storage uploads, minimal cost when idle, and rapid CPU bound batch processing immediately when files arrive?

  • ❏ A. GKE with a watcher and a worker deployment that scales down when idle

  • ❏ B. Cloud Storage with IAM and a Cloud Function that scales a Compute Engine MIG with an image that auto shuts down

  • ❏ C. Cloud Run jobs triggered by Pub/Sub notifications from Cloud Storage

Question 19

AuroraPay runs a fintech platform on Cloud Spanner as its primary database and needs to roll out a schema change that adds three secondary indexes and modifies two existing tables on a multi region instance. The release must keep latency and throughput impact as low as possible during business hours. What rollout plan should the team use to perform this change?

  • ❏ A. Delete the affected tables and drop the old indexes, then recreate the schema in one maintenance window

  • ❏ B. Clone the database into a new Spanner instance using backup and restore, apply the schema changes there, then cut over all traffic

  • ❏ C. Apply the schema updates in phases by creating the new indexes first and waiting for backfill to finish, then alter the existing tables

  • ❏ D. Execute the full set of schema changes at once so the total change window is shortest

Question 20

How can developers test the latest Cloud Run revision without routing any production traffic to it?

  • ❏ A. Shift all traffic to LATEST using gcloud run services update-traffic

  • ❏ B. Deploy with --no-traffic and a tag then use the tag URL

  • ❏ C. Grant roles/run.invoker and call the private URL

  • ❏ D. Use Cloud Load Balancing to route tester IPs to LATEST

Question 21

A fintech company named LumenPay runs a payments API on Compute Engine and forwards application logs to Cloud Logging. During an audit you learn that some records contain PII and every sensitive value begins with the prefix custinfo. You must keep those matching entries in a restricted storage location for later investigation and you must stop those entries from being retained in Cloud Logging. What should you do?

  • ❏ A. Configure the Ops Agent with a filter that drops log lines containing custinfo and then use Storage Transfer Service to upload the filtered content to a locked Cloud Storage bucket

  • ❏ B. Create a Pub/Sub sink with a filter for custinfo and trigger a Cloud Function that stores the messages in BigQuery with customer managed encryption keys

  • ❏ C. Set up a logs router sink with an advanced filter that matches custinfo and route matching entries to a Cloud Storage bucket then add a logs based exclusion with the same filter so Cloud Logging does not retain them

  • ❏ D. Create a basic logs filter for custinfo and configure a sink that exports matching records to Cloud Storage while relying on default retention for the rest of the logs

Question 22

Which Google Cloud approach best ensures 99.9 percent availability and low latency with efficient autoscaling while adopting a cloud native architecture?

  • ❏ A. Cloud Run

  • ❏ B. GKE with microservices and autoscaling

  • ❏ C. App Engine standard

Question 23

BlueRiver Capital is adopting Spinnaker to roll out a service that warms an in memory cache of about 3 GB during startup and it typically finishes initialization in about 4 minutes. You want the canary analysis to be fair and to reduce bias from cold start effects in the cache. How should you configure the canary comparison?

  • ❏ A. Compare the canary with the existing deployment of the current production version

  • ❏ B. Compare the canary with a new deployment of the previous production version

  • ❏ C. Compare the canary with a fresh deployment of the current production version

  • ❏ D. Compare the canary with a baseline built from a 30 day rolling average of Cloud Monitoring production metrics

Question 24

A vendor requires a JSON service account key and does not support Workload Identity Federation. The organization policy iam.disableServiceAccountKeyCreation blocks key creation. What should you do to complete the integration while following best practices?

  • ❏ A. Disable iam.disableServiceAccountKeyCreation at the organization root

  • ❏ B. Use Workload Identity Federation

  • ❏ C. Add a temporary project exception for iam.disableServiceAccountKeyCreation to create one user managed key then remove it

  • ❏ D. Create the key in a different project without the constraint and share it

Question 25

You are writing a blameless post incident review for a 35 minute outage at BrightWave Media that impacted about 70 percent of customers during the evening peak. Your goal is to help the organization avoid a repeat of this type of failure in the future. Which sections should you include in the report to best support long term prevention? (Choose 2)

  • ❏ A. A comparison of this incident’s severity to earlier incidents

  • ❏ B. A prioritized remediation plan with specific actions owners and due dates

  • ❏ C. A list of employees to blame for the outage

  • ❏ D. A complete export of Cloud Monitoring dashboards and raw logs from the affected window

  • ❏ E. A clear analysis of the primary cause and contributing factors of the outage

GCP DevOps Professional Exam Dump Answers

Question 1

Blue Harbor Capital wants to streamline how it exports Google Cloud logs for analysis and to choose a configuration that balances storage cost with data retention. The team plans to keep only the necessary logs in BigQuery for long term analytics while avoiding unnecessary spend. What approach should they use?

  • ✓ C. Create a single Cloud Logging sink with an advanced filter that exports only required entries to BigQuery and set table or partition expiration to control retention and costs

The correct option is Create a single Cloud Logging sink with an advanced filter that exports only required entries to BigQuery and set table or partition expiration to control retention and costs.

This approach filters at the source so only the logs you actually need are exported to BigQuery, which directly lowers ingestion and storage costs. You can use the Logging query language to build precise filters, and because everything flows through a single sink with an advanced filter the export stays simple to manage.

BigQuery provides native lifecycle controls so you can set table or partition expiration to automatically remove older data. This keeps long term analytics feasible while preventing unnecessary spend without building extra pipelines or manual deletion jobs.
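
For reference, a minimal sketch of this setup from the command line follows. The project, dataset, filter, and retention values are illustrative placeholders rather than values prescribed by the scenario, and the sink's writer identity still needs BigQuery Data Editor access on the dataset.

```bash
# Route only the required entries to BigQuery as partitioned tables
# (project, dataset, and filter values are illustrative).
gcloud logging sinks create analytics-logs-sink \
  bigquery.googleapis.com/projects/my-project/datasets/required_logs \
  --use-partitioned-tables \
  --log-filter='resource.type="gce_instance" AND severity>=WARNING'

# Cap retention with a default partition expiration of 90 days (in seconds)
# so BigQuery drops older data automatically.
bq update --default_partition_expiration 7776000 my-project:required_logs
```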

Export all logs to Cloud Storage without filtering, process with Dataflow to remove unwanted records, then load curated results into BigQuery for historical reporting is inefficient because it exports everything and stores it before curation which increases storage and processing cost and complexity. It also adds unnecessary steps when you can filter directly at the sink and load only the needed entries into BigQuery.

Create one sink per log category and route to Pub/Sub for streaming analysis, then write into BigQuery using a Dataflow pipeline adds operational overhead and cost that the scenario does not require. Pub/Sub and Dataflow are suited to real time streaming use cases, yet the goal here is controlled long term analytics in BigQuery with simple retention, which is met by filtering at export and using BigQuery expiration.

Export every log to BigQuery without filters and later use SQL to select the needed records, then rely on BigQuery Data Transfer Service to manage retention drives up ingestion and storage cost and does not solve retention. The BigQuery Data Transfer Service does not manage table retention, while table or partition expiration is the correct mechanism.

Filter at the source using Cloud Logging sinks and use BigQuery expiration to manage retention. Look for options that reduce data volume up front and apply built in lifecycle controls rather than building extra pipelines.

Question 2

Which approach ensures container images are vulnerability scanned and blocks GKE deployment when high severity issues are found?

  • ✓ C. Artifact Registry scanning with Cloud Build gate

The correct option is Artifact Registry scanning with Cloud Build gate.

This approach uses Artifact Registry vulnerability scanning to evaluate images as they are pushed and it records severity information that can be queried. A Cloud Build pipeline can check the scan results and fail the build when high severity issues are found which prevents the deployment to GKE. Because the gate runs before deployment it ensures that noncompliant images never reach the cluster.
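
A rough sketch of what such a gate could look like as a shell step inside a Cloud Build pipeline follows. The image path, severity threshold, and the way the scan output is parsed are assumptions, so adjust them to the actual scan report format in your project.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Image digest produced earlier in the pipeline (path and variable are illustrative).
IMAGE="us-central1-docker.pkg.dev/my-project/apps/api@${DIGEST}"

# Pull the vulnerability report recorded by Artifact Registry scanning.
REPORT=$(gcloud artifacts docker images describe "$IMAGE" \
  --show-package-vulnerability --format=json)

# Fail the build, and therefore block the later GKE deploy step,
# when any CRITICAL or HIGH finding appears in the report.
if echo "$REPORT" | grep -Eq '"(effectiveSeverity|severity)": *"(CRITICAL|HIGH)"'; then
  echo "High severity vulnerabilities found, blocking deployment" >&2
  exit 1
fi
```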

Binary Authorization with signed images focuses on verifying signatures and provenance rather than evaluating vulnerability severity. Without a policy tied to vulnerability scan results or an attestation for vulnerabilities it would not automatically block deployment for high severity findings, so it does not meet the requirement as stated.

GKE default configuration does not perform vulnerability scanning and it does not block deployments based on vulnerability severity by default.

Cloud Deploy only orchestrates releases and can run verifications, yet it does not scan images on its own and cannot enforce a severity based block without integrating with a scanner or a build step.

When the requirement is to block deployment, look for a CI or CD gate that reads scanner results and enforces a severity threshold. Signing proves trust, while scanning provides risk data that a gate can act on.

Question 3

You manage a latency sensitive API on Google Compute Engine for the analytics startup BlueKite Insights that runs in us-central1, and leadership requires business continuity with an RTO under 45 seconds if a whole zone goes down. You need a design that will shift traffic automatically without manual steps in the event of a zonal outage. What should you set up?

  • ✓ C. Create a regional managed instance group that distributes instances across at least two zones in the region

The correct choice is Create a regional managed instance group that distributes instances across at least two zones in the region.

This design places identical instances in multiple zones within the region, which removes the single zone as a point of failure. When you place the group behind an external HTTP(S) Load Balancer, health checks quickly detect a zonal failure and the load balancer routes traffic only to healthy instances in the surviving zones. This provides automatic failover without manual steps and can meet a recovery time objective under 45 seconds with typical health check settings.

Managed instance groups also provide autohealing and uniform configuration which increases resilience and keeps the fleet consistent as traffic shifts across zones.
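
A minimal sketch of creating such a group follows, assuming an instance template, health check, and load balancer backend already exist. Names, region, and size are illustrative.

```bash
# Regional MIG spread across three zones (template, names, and size are illustrative).
gcloud compute instance-groups managed create api-mig \
  --region=us-central1 \
  --zones=us-central1-a,us-central1-b,us-central1-c \
  --template=api-template \
  --size=6

# Autohealing with a health check so failed VMs are recreated automatically.
gcloud compute instance-groups managed update api-mig \
  --region=us-central1 \
  --health-check=api-health-check \
  --initial-delay=60
```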

Use a zonal managed instance group and enable automatic restart and live migration is not sufficient because a zonal group keeps all instances in one zone. Automatic restart only restarts a VM in the same zone and live migration helps during host maintenance events. Neither solves a full zonal outage and traffic cannot shift to another zone automatically.

Configure an external HTTP(S) Load Balancer with a single backend service in one zone does not meet the requirement because when that zone fails all backends become unhealthy and the load balancer has nowhere else to send traffic. There is no cross zone redundancy.

Use Cloud DNS failover to switch between two unmanaged instance groups that both run in the same zone cannot work because both groups are in the same zone and would fail together. DNS based failover is also constrained by record time to live and client caching which can exceed 45 seconds, and unmanaged instance groups do not provide autohealing.

When you see a need for automatic failover across zones with a tight RTO, choose a regional managed instance group with an external HTTP(S) Load Balancer and health checks. Be wary of DNS based designs because TTL caching can delay failover.

Question 4

Which GCP service should you use to centrally manage encryption keys with the strongest protection and automatic rotation to reduce blast radius?

  • ✓ B. Cloud KMS with automatic rotation

The correct option is Cloud KMS with automatic rotation.

Cloud KMS centrally manages cryptographic keys with fine grained IAM, audit logging, and organization wide visibility. It supports key rotation policies that automatically create new key versions on a schedule which limits the blast radius if a key is exposed. For the strongest protection, Cloud KMS also allows HSM backed keys using the HSM protection level while still benefiting from the same central management and automatic rotation capabilities.
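
As a small illustration, a key ring and an HSM backed key with a rotation schedule might be created as follows. The names, location, and rotation values are placeholders.

```bash
# Key ring and HSM-backed key with automatic 90-day rotation
# (names, location, and schedule are illustrative).
gcloud kms keyrings create payments-ring --location=us-central1

gcloud kms keys create payments-key \
  --keyring=payments-ring \
  --location=us-central1 \
  --purpose=encryption \
  --protection-level=hsm \
  --rotation-period=90d \
  --next-rotation-time=2025-01-01T00:00:00Z
```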

VPC Service Controls protects access to Google Cloud APIs by creating service perimeters to reduce data exfiltration risk. It does not create, store, rotate, or manage cryptographic keys, so it does not meet the requirement.

Secret Manager stores application secrets such as API keys and passwords and is not a key management system. While you can rotate secrets, it does not centrally manage or automatically rotate cryptographic keys like Cloud KMS.

Inject secrets at provisioning describes an operational pattern rather than a Google Cloud service. It does not provide centralized key management or enforce automatic key rotation.

Map the requirement words to services. If the question emphasizes central key management and automatic rotation with strongest protection then think of Cloud KMS and HSM backed keys rather than data perimeter or application secret storage services.

Question 5

A logistics startup named TallyRoute runs its development services on Google Kubernetes Engine. In this environment the applications emit very chatty logs, and developers inspect them with kubectl logs and do not rely on Cloud Logging. There is no common log schema across these services. You want to lower Cloud Logging spending related to application logs while still retaining GKE operational logs for troubleshooting. What should you do?

  • ✓ B. Add an exclusion on the _Default sink that filters out workload entries with resource.type = "k8s_container" and severity <= DEBUG

The correct option is Add an exclusion on the _Default sink that filters out workload entries with resource.type = "k8s_container" and severity <= DEBUG.

This exclusion prevents chatty application container logs at low severities from being ingested into Cloud Logging, which directly reduces logging costs. It leaves GKE system and control plane logs untouched because those are not emitted with the k8s_container resource type. Developers can continue to use kubectl logs for development troubleshooting while the project retains GKE operational visibility in Cloud Logging.
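
A sketch of the exclusion follows. The exclusion name is arbitrary and the filter simply mirrors the one described in the option.

```bash
# Stop ingesting low severity workload container logs while leaving
# GKE system and control plane logs untouched.
gcloud logging sinks update _Default \
  --add-exclusion=name=exclude-dev-container-debug,filter='resource.type="k8s_container" AND severity<=DEBUG'
```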

Run gcloud container clusters update dev-west1 --logging=SYSTEM for the development cluster is not the best choice because it disables all workload log collection and it also omits API server logs unless you explicitly include them. The requirement is to keep GKE operational logs for troubleshooting and this change can remove important control plane visibility.

Run gcloud logging sinks update _Default --disabled in the development project is incorrect because disabling the default sink broadly stops routing most logs, including crucial GKE system and control plane logs. That violates the requirement to retain operational logs.

Create a Log Router sink that exports all k8s_container logs to BigQuery and set table expiration to 2 days does not reduce Cloud Logging ingestion costs because logs are ingested before export, and it introduces additional BigQuery storage and query costs. It also adds little value given the lack of a common schema and the developers' reliance on kubectl logs.

When asked to reduce Cloud Logging costs, think about targeted exclusions that stop low value logs from being ingested. Filtering by resource.type and severity often preserves critical operational logs while cutting spend.

Question 6

Which solution lets Cloud Build run builds with private VPC access to call internal APIs without using public endpoints and with minimal operations?

  • ✓ C. Private pools for Cloud Build

The correct option is Private pools for Cloud Build.

Private pools for Cloud Build run your builds on dedicated workers in your project that you can connect to your VPC. Builds run without public IP addresses and can reach internal endpoints over private RFC 1918 addresses. This directly satisfies the need to call internal APIs without using public endpoints. Because private pools are a managed Cloud Build feature, you get minimal operational overhead while keeping network control in your project.
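
A minimal sketch of creating and using a private pool follows, assuming the private services access peering for the VPC is already in place. Pool, project, and network names are illustrative.

```bash
# Private worker pool peered to the VPC (pool, region, and network names are illustrative).
gcloud builds worker-pools create internal-pool \
  --region=us-central1 \
  --peered-network=projects/my-project/global/networks/prod-vpc

# Run a build on the private pool so build steps can reach internal endpoints.
gcloud builds submit \
  --config=cloudbuild.yaml \
  --worker-pool=projects/my-project/locations/us-central1/workerPools/internal-pool
```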

Cloud Deploy is a release orchestration service that promotes artifacts to targets. It does not provide a private network execution environment for builds and therefore does not meet the requirement for private VPC access during build time.

Internal HTTP(S) Load Balancer exposes internal services behind an internal frontend, but it does not change how Cloud Build workers connect. Using this alone would not place builds inside your VPC, so it does not ensure private build connectivity without public endpoints.

External HTTP(S) Load Balancer with Cloud Armor is designed for public endpoints with web security policies. It relies on public access which conflicts with the requirement to avoid public endpoints for build traffic and it adds unnecessary complexity for this use case.

When a question asks for build access to internal services with minimal operations, look for a managed feature that puts the build runtime inside your network. For Cloud Build this points to private pools rather than load balancers or deployment tools.

Question 7

You are the on call engineer at Lumina Metrics, a retail analytics startup that runs critical services on Google Cloud. A severity one outage was declared 20 minutes ago and customers cannot load dashboards. You need to organize responders and choose communication methods so the team can restore service quickly and safely. What should you do?

  • ✓ C. Appoint an Incident Commander, assign distinct Communications and Operations leads, and coordinate in a persistent real time chat channel for collaboration and decision tracking

The correct option is Appoint an Incident Commander, assign distinct Communications and Operations leads, and coordinate in a persistent real time chat channel for collaboration and decision tracking.

This approach creates clear ownership and decision making which reduces confusion and speeds restoration. The Incident Commander directs priorities and risk, the Operations lead focuses on technical diagnosis and changes, and the Communications lead provides timely and consistent stakeholder updates. A persistent real time chat channel gives all responders a single place to collaborate, capture decisions, and maintain context which supports handoffs and later review.

Ask one engineer to fill every role including Incident Commander, Communications Lead, and Operations Lead and use only email threads to share updates is inefficient and risky because it overloads one person and creates bottlenecks. Email is not a real time medium for coordination and it fragments information which slows decision making during a critical outage.

Let all responders work independently without assigning roles and coordinate through ad hoc messages to reduce overhead leads to duplicated work, conflicting actions, and unclear authority. High severity incidents need explicit roles and a single channel to keep actions aligned and safe.

Create a Cloud Pub/Sub topic for the incident and post updates there while leaving roles informal to save time misuses a system to system messaging service and does not support human collaboration. Leaving roles informal increases confusion and risk while Pub/Sub does not provide the interactive discussion and decision tracking that incident response requires.

Favor options that establish clear roles with an Incident Commander and named leads and use a single real time channel that preserves history for coordination and decisions. Be cautious when answers rely on email, ad hoc messaging, or informal ownership.

Question 8

For a UK only website in europe-west2 using the Envoy based external HTTP(S) load balancer, which network tier and scope minimize cost while meeting the constraints?

  • ✓ C. Standard Tier with a regional external HTTP(S) Application Load Balancer

The correct option is Standard Tier with a regional external HTTP(S) Application Load Balancer.

This choice keeps traffic within the region and uses the Envoy based regional external Application Load Balancer that is designed for localized audiences. It avoids global anycast and long haul transit which reduces egress and data processing costs while still providing a public endpoint for users in the UK.
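
A partial sketch of the Standard Tier regional setup follows. It assumes the backend service, URL map, target proxy, and proxy-only subnet already exist, and all names are illustrative.

```bash
# Reserve a Standard Tier regional external address in europe-west2.
gcloud compute addresses create uk-web-ip \
  --region=europe-west2 \
  --network-tier=STANDARD

# Forwarding rule for the regional external Application Load Balancer.
# The target HTTP proxy, URL map, backend service, and proxy-only subnet
# are assumed to already exist.
gcloud compute forwarding-rules create uk-web-fr \
  --region=europe-west2 \
  --network-tier=STANDARD \
  --load-balancing-scheme=EXTERNAL_MANAGED \
  --address=uk-web-ip \
  --ports=80 \
  --target-http-proxy=uk-web-proxy \
  --target-http-proxy-region=europe-west2
```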

Premium Tier with a global external HTTP(S) load balancer is unnecessary for a UK only audience because it uses global anycast and worldwide edge presence which typically costs more and is intended for global reach.

Standard Tier with a regional internal HTTP(S) load balancer cannot serve a public website because it is only reachable on internal IP addresses within your VPC and therefore does not meet the requirement for an external site.

Premium Tier with a regional external HTTP(S) load balancer provides no benefit for a single region UK audience and generally costs more than Standard Tier for the same regional delivery so it does not minimize cost.

Match scope and tier to the traffic pattern. If users are in one region then choose regional scope and prefer Standard Tier for lower cost. Use Premium Tier and global scope only when you truly need worldwide ingress and global routing.

Question 9

Your platform team operates a multi-tier application on Google Cloud. During a midweek change window that lasted 90 minutes, a teammate updated a VPC firewall rule and accidentally blocked a critical backend, which caused a production incident that impacted many users at example.com. The team wants to follow Google recommendations to reduce the risk of this type of mistake. What should you do?

  • ✓ C. Require peer review and approval for every change before it is rolled out

The correct option is Require peer review and approval for every change before it is rolled out.

Requiring peer review and approval adds a second knowledgeable person to validate intent, scope, and blast radius before any configuration is applied. This practice helps catch mistakes like an overly broad deny rule and creates an auditable control point. You can implement peer review and approval with infrastructure as code and gated promotions in your build and deploy pipelines which aligns with recommended safe change practices.

Perform firewall updates only during a scheduled maintenance window is not a preventive control. The incident already happened during a change window and a window does not reduce the chance of a bad rule being pushed. It only limits when changes occur.

Automate all infrastructure updates so that humans do not edit resources directly is valuable but incomplete. Automation without peer review and approval can push a bad change faster to more systems. The better control is to combine automation with review.

Enable VPC Firewall Rules Logging and alert on high deny rates in Cloud Monitoring helps with detection after the fact. It does not prevent the misconfiguration and users can still be impacted before alerts are processed and acted upon.

When a question asks how to prevent outages from configuration mistakes, prefer controls that add verification before rollout such as reviews and approvals rather than reactive monitoring or scheduling changes.

Question 10

Which GCP service should manage secrets so a CI/CD pipeline for GKE Autopilot avoids exposing values in source or logs and allows rotating credentials every 60 days without changing pipeline code?

  • ✓ C. Secret Manager with IAM for Cloud Build and GKE

The correct option is Secret Manager with IAM for Cloud Build and GKE.

This service centrally stores secrets and provides fine grained access control through IAM which lets you grant only the Cloud Build service account and the GKE Autopilot workloads the permissions they need. You can set a rotation policy for every 60 days and rely on secret versioning so pipeline and workload configurations can reference the latest version without any code changes. It reduces the risk of leaking values in source or logs because the platform retrieves secrets at runtime and masks secret values in build logs when used through supported integrations.
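
A short sketch of this flow follows. The secret name, value, and service account are placeholders, and GKE workloads would typically read the secret through Workload Identity and the Secret Manager API rather than from the pipeline itself.

```bash
# Store the credential and grant the pipeline read access
# (secret name, value, and service account are illustrative).
echo -n "s3cr3t-value" | gcloud secrets create db-password \
  --replication-policy=automatic \
  --data-file=-

# Allow the Cloud Build service account to read the latest version at build time.
gcloud secrets add-iam-policy-binding db-password \
  --member="serviceAccount:123456789@cloudbuild.gserviceaccount.com" \
  --role="roles/secretmanager.secretAccessor"

# Rotating the credential is just adding a new version; anything that
# references the latest version picks it up without code changes.
echo -n "new-value" | gcloud secrets versions add db-password --data-file=-
```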

Cloud Storage with CMEK is designed for object storage and customer managed encryption keys do not provide secret specific features like automatic rotation, versioned secret access, or tight IAM bindings at the secret level. Using it for secrets increases operational overhead and the risk of accidental exposure.

Check Kubernetes Secrets into Git exposes sensitive data in source control and audit trails which violates best practices and makes rotation error prone and manual.

Cloud KMS encrypted blobs in repo uses a key management service rather than a secret management service, which requires custom encryption and decryption handling in the pipeline, risks accidental logging during decryption, and complicates rotation since you would have to re-encrypt and update references rather than simply advancing a secret version.

When you see requirements for centralized secret storage, fine grained IAM, seamless rotation, and no code changes, think Secret Manager with the relevant service identities rather than storage buckets or raw KMS usage.

Question 11

You are the DevOps lead at Trailforge Books where your microservices application runs on Google Cloud and you must improve runtime performance while gaining clear visibility into resource consumption. You plan to use Google Cloud’s operations suite for observability and alerting. Which actions should you take to meet these goals? (Choose 2)

  • ✓ A. Use Cloud Trace to analyze distributed latency and pinpoint bottlenecks so you can tune the service

  • ✓ B. Configure Cloud Monitoring to collect CPU and memory metrics for all services and create alerting policies with threshold conditions

The correct options are Use Cloud Trace to analyze distributed latency and pinpoint bottlenecks so you can tune the service and Configure Cloud Monitoring to collect CPU and memory metrics for all services and create alerting policies with threshold conditions.

Use Cloud Trace to analyze distributed latency and pinpoint bottlenecks so you can tune the service directly addresses application performance in a microservices environment. Tracing shows end to end request paths and highlights where latency is introduced so you can focus optimization on the most impactful services and calls. It also helps validate improvements by comparing traces before and after changes.

Configure Cloud Monitoring to collect CPU and memory metrics for all services and create alerting policies with threshold conditions provides the resource visibility you need. System metrics like CPU utilization and memory usage reveal saturation and inefficiencies across services. Threshold based alerting notifies you early when resources approach unsafe levels so you can intervene or scale appropriately and it ties cleanly into dashboards and SLO monitoring.

Turn off Cloud Logging to reduce latency and lower resource usage is incorrect because turning off logs removes critical observability and does not reliably improve performance. A better approach is to tune log levels, use sampling and exclusions, and retain essential logs for troubleshooting and security.

Publish a custom “requests_per_second” metric to Cloud Monitoring and configure Cloud Run to autoscale directly from that metric is incorrect because Cloud Run fully managed does not autoscale from custom Cloud Monitoring metrics. It scales based on request concurrency and configured min and max instances so you cannot wire a custom requests per second metric to control scaling.

Deploy an in house Prometheus and Grafana stack instead of using the operations suite for monitoring is incorrect because the requirement is to use the operations suite. Running your own stack adds operational overhead and duplicates capabilities that are already provided natively including integrations, alerting, and dashboards.

Map the signal to the right tool. Use traces for latency and request paths, metrics for resource usage and alerts, and logs for detailed diagnostics. When an option suggests disabling a core observability signal it is usually a red flag.

Question 12

A Cloud Build pipeline stops producing container images after a recent cloudbuild.yaml change. Following SRE practices for root cause analysis and safe rollback, what should you do?

  • ✓ C. Diff the last known good cloudbuild.yaml against the current Git revision and revert or fix the regression

The correct option is Diff the last known good cloudbuild.yaml against the current Git revision and revert or fix the regression.

This approach follows change focused troubleshooting and safe rollback practices. The build broke right after a configuration change which strongly suggests the failure is change induced. Comparing the last known good configuration with the current revision quickly isolates the exact regression. Reverting to the known good state restores delivery fast while you continue root cause analysis in a controlled and auditable way. Using version control keeps the pipeline reproducible and prevents side effects from ad hoc fixes.
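
A minimal sketch of that investigation and rollback with plain Git commands follows, with placeholder commit identifiers.

```bash
# Find the last revision where the pipeline was green, then inspect what changed.
git log --oneline -- cloudbuild.yaml
git diff <last-good-sha> HEAD -- cloudbuild.yaml

# Roll back safely by reverting the offending commit rather than force-pushing,
# so the change history stays auditable.
git revert <bad-commit-sha>
git push origin main
```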

Disable the build trigger then build and push images from a developer laptop is risky and violates reproducibility and supply chain controls. It bypasses Cloud Build automation and auditability and can introduce untracked differences in toolchains and credentials which increases risk during an incident rather than reducing it.

Increase the Cloud Build timeout and run the build again treats a symptom that is not indicated by the scenario. The failure started after a configuration change which points to a logic or config error rather than a time limit. Increasing the timeout delays feedback and does not address the root cause.

Rotate the credentials used by the Cloud Build push step then run the build again is not aligned with the trigger for the failure. The issue followed a cloudbuild.yaml change rather than an authentication event. Rotating credentials without evidence can add new variables and create further disruptions.

When a failure begins right after a configuration change first compare the new revision with the last known good and roll back to a safe state. Prefer version controlled fixes and reversible changes over ad hoc tweaks like timeouts or credentials.

Question 13

A DevOps team at Pinecrest Analytics plans to manage Google Cloud infrastructure as code, and they require a declarative configuration that can be stored in Git and that can automate both creation and updates of resources across multiple projects. Which service should they use?

  • ✓ C. Google Cloud Deployment Manager

The correct option is Google Cloud Deployment Manager because it provides a declarative infrastructure as code model that you can store in Git and it supports automated creation and updates of resources across multiple projects.

With Deployment Manager you define resources in YAML configurations and you can use Jinja or Python templates for reuse and composition. You can commit these files to version control and use previews and updates to apply changes safely. Teams can reuse the same templates in different projects and can target specific projects during deployments, which aligns with multi project management requirements.
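
A small illustrative configuration and the commands to create and update a deployment follow. The resource properties are placeholders and not part of the exam scenario.

```bash
# Minimal declarative config that can live in Git (properties are illustrative).
cat > config.yaml <<'EOF'
resources:
- name: demo-vm
  type: compute.v1.instance
  properties:
    zone: us-central1-a
    machineType: zones/us-central1-a/machineTypes/e2-small
    disks:
    - deviceName: boot
      boot: true
      autoDelete: true
      initializeParams:
        sourceImage: projects/debian-cloud/global/images/family/debian-12
    networkInterfaces:
    - network: global/networks/default
EOF

# Create the deployment, and later apply updates from the same file.
gcloud deployment-manager deployments create demo-deployment --config=config.yaml
gcloud deployment-manager deployments update demo-deployment --config=config.yaml
```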

Config Connector also offers declarative configuration and works well with Git based workflows, yet it requires a Kubernetes control plane and manages resources through Kubernetes custom resources. The question does not mention Kubernetes and it asks for a native Google Cloud service to automate creation and updates directly, therefore Config Connector is not the best fit here.

The Google Cloud Console is an interactive web interface for manual administration and it does not provide infrastructure as code or Git based workflows. It cannot automatically create and update resources from declarative configurations.

Google Cloud Build is a CI and CD service that runs pipelines and automation. It can execute infrastructure tools in a pipeline, yet it is not itself a declarative infrastructure service and it does not define resource configurations.

When you see the keywords declarative and stored in Git and you do not see any mention of Kubernetes then map to services that natively define infrastructure as code on Google Cloud rather than build or console tools.

Question 14

Which GCP services should you use to trace and correlate interservice latency and errors in a GKE microservices application to identify the root cause?

  • ✓ B. Cloud Trace and Cloud Monitoring

The correct answer is Cloud Trace and Cloud Monitoring.

Cloud Trace provides distributed tracing for microservices so you can follow requests across services in GKE and see spans and timing that reveal interservice latency and errors. Cloud Monitoring collects metrics and builds dashboards and alerts, and it correlates service metrics with traces so you can identify where latency or failures originate. Used together they let you trace requests end to end and correlate the observed latency and error signals to find the root cause.

Cloud Profiler and Cloud Logging is not ideal for tracing interservice latency. Cloud Profiler samples CPU and memory to optimize code performance while Cloud Logging stores logs and can help with error visibility but it does not give you distributed request traces across services.

Network Intelligence Center focuses on network topology, reachability analysis, and path performance. It does not trace application requests across microservices or correlate application errors to specific service calls.

VPC Service Controls and Cloud Armor address security by reducing data exfiltration risk and providing web application firewall and DDoS protection. These services do not provide distributed tracing or application performance correlation.

When a question mentions interservice latency or finding the root cause in microservices, look for distributed tracing and pair it with metrics and alerting. In GKE this often means Cloud Trace with Cloud Monitoring.

Question 15

At Riverbeam Media your SRE team enabled a small canary for a new checkout feature in a GCP hosted web application. Within eight minutes your alerts report a sharp rise in HTTP 500 responses and the p95 latency has increased significantly. You want to minimize customer impact as fast as possible. What should you do first?

  • ✓ C. Revert the canary rollout right away so traffic goes back to the last stable version

The correct action is Revert the canary rollout right away so traffic goes back to the last stable version.

This choice follows standard incident response where you mitigate customer impact first. A canary is designed to be easy to undo so moving traffic back to the last known good version quickly stops the spike in HTTP 500s and reduces the p95 latency back toward normal. Once stability is restored you can proceed with investigation and longer term fixes without ongoing user harm.
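
If the canary happened to be running on Cloud Run, the revert could be as simple as pinning traffic back to the stable revision, as in the sketch below. The service and revision names are assumptions, and a GKE or Spinnaker rollout would use its own rollback mechanism instead.

```bash
# Send 100 percent of traffic back to the last known good revision
# (service, region, and revision names are illustrative).
gcloud run services update-traffic checkout \
  --region=us-central1 \
  --to-revisions=checkout-00042-stable=100
```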

Start a detailed root cause investigation using Cloud Trace and Cloud Logging is not the first step because investigation takes time and does not immediately reduce user impact. You should restore service health first and then use tracing and logs to find the cause.

Immediately add more backend instances to try to absorb the load is unlikely to help because a surge in 500 errors with a canary usually indicates a functional or configuration regression rather than simple capacity shortage. Scaling out can waste resources and still return errors.

Begin documenting the incident timeline for the postmortem is an important practice but it should not come before mitigation. Stabilize the service first and then complete timeline documentation during and after the incident.

When a canary causes rapid SLO violations prioritize mitigation first. Roll back to the last known good version to stop the bleeding, then investigate and document.

Question 16

Which workflow should you use to reduce Terraform merge conflicts and ensure only approved changes reach the main source of truth in Google Cloud?

  • ✓ B. Git with feature branches and PRs with reviews and automated Terraform checks and main as source of truth

The correct option is Git with feature branches and PRs with reviews and automated Terraform checks and main as source of truth.

This workflow isolates changes in feature branches which reduces merge conflicts by keeping unrelated work separate. Pull requests enable peer review and policy enforcement so only reviewed and approved changes can be merged. Automated checks run Terraform formatting, validation, and plan in continuous integration and block merges when checks fail. Protecting the main branch ensures it remains the single source of truth and that only approved and tested changes reach it.
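
The automated checks in that workflow are typically a handful of Terraform commands run by CI on every pull request, for example as sketched below. Backend configuration and plan handling are simplified here.

```bash
# Checks a CI job might run on every pull request before human review.
terraform fmt -check -recursive
terraform init -input=false
terraform validate
terraform plan -input=false -out=tfplan

# Only after the pull request is approved and merged does a separate
# pipeline apply the reviewed plan.
terraform apply -input=false tfplan
```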

Cloud Source Repositories with direct commits to main and Cloud Build apply on push is not appropriate because direct commits bypass peer review and CI checks and increase the likelihood of conflicts and misconfigurations. In addition, Cloud Source Repositories is deprecated which makes it less likely to be favored on newer exams.

Versioned Cloud Storage bucket as the canonical Terraform code store with manual object renames is incorrect because Cloud Storage is object storage rather than a version control system. It does not provide branching, merging, pull requests, or review workflows and manual renames are error prone and do not prevent conflicts or unauthorized changes.

When a question asks about reducing conflicts and enforcing approvals look for workflows that use feature branches, pull requests, automated checks, and a protected main branch. Be cautious of answers that rely on direct commits or object storage for source control.

Question 17

You support an order processing service for Riverview Retail that runs on Compute Engine instances. The compliance team requires that an alert be sent if end to end transaction latency is greater than 3 seconds, and it must only notify if that condition continues for more than 10 minutes. How should you implement this in Google Cloud Monitoring to meet the requirement?

  • ✓ C. Create a Cloud Monitoring alert that evaluates the 99th percentile transaction latency and fires when it remains above 3 seconds for at least 10 minutes

The correct choice is Create a Cloud Monitoring alert that evaluates the 99th percentile transaction latency and fires when it remains above 3 seconds for at least 10 minutes.

This approach uses a metrics-based alert on the latency metric and evaluates tail performance rather than the mean. In Cloud Monitoring you can require that the condition remains true for a specified duration so the alert only notifies if the threshold is exceeded continuously for ten minutes. This directly satisfies the end to end latency threshold and persistence requirement and avoids false positives from short spikes.
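
One possible shape of such a policy, expressed as a file and created with the alpha policies command, is sketched below. The metric type is a hypothetical custom metric and notification channels are omitted, so treat this as an outline rather than a ready-to-use policy.

```bash
# Alert only when p99 latency stays above 3 seconds for a full 10 minutes.
cat > latency-policy.yaml <<'EOF'
displayName: Transaction latency above 3s for 10 minutes
combiner: OR
conditions:
- displayName: p99 latency > 3s
  conditionThreshold:
    filter: 'metric.type="custom.googleapis.com/transaction_latency" AND resource.type="gce_instance"'
    comparison: COMPARISON_GT
    thresholdValue: 3
    duration: 600s
    aggregations:
    - alignmentPeriod: 60s
      perSeriesAligner: ALIGN_PERCENTILE_99
EOF

gcloud alpha monitoring policies create --policy-from-file=latency-policy.yaml
```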

Build a Cloud Function that runs for each transaction and sends an alert whenever processing time surpasses 3 seconds is incorrect because it pushes alerting into custom code and would generate a high volume of notifications on transient spikes. It also does not natively enforce a ten minute sustained breach without additional complex stateful logic.

Define a request latency SLO in Service Monitoring and configure an error budget burn rate alert over a 10 minute window is not appropriate because burn rate alerts detect accelerated SLO budget consumption rather than a simple threshold that must persist for a set duration. The requirement is a strict latency threshold that must remain breached for ten minutes, not an error budget depletion pattern.

Configure a Cloud Monitoring alert on the average transaction latency that triggers when it is above 3 seconds for 10 minutes is risky because the average can mask tail latency, which means many slow transactions could be hidden by faster ones. The compliance requirement is better met by monitoring the tail behavior rather than the mean.

When an alert must fire only after a condition persists, look for the duration setting in the alert condition. For latency thresholds prefer percentiles to capture tail latency rather than averages that can hide outliers.

Question 18

Which GCP design enables secure Cloud Storage uploads, minimal cost when idle, and rapid CPU bound batch processing immediately when files arrive?

  • ✓ B. Cloud Storage with IAM and a Cloud Function that scales a Compute Engine MIG with an image that auto shuts down

The correct option is Cloud Storage with IAM and a Cloud Function that scales a Compute Engine MIG with an image that auto shuts down.

This design secures uploads by using Cloud Storage IAM so only authorized principals or signed URLs can write objects. It reacts immediately because a Cloud Storage event triggers the Cloud Function as soon as a file lands. The function can increase the size of a managed instance group from zero to the required capacity so VMs start quickly and perform the CPU bound work. The custom image or startup script finishes the batch processing and then shuts the instance down so the autoscaler returns the group to zero when there is no workload. This provides near zero cost while idle and rapid scale out on demand.
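
A rough sketch of the event wiring and the scaling call follows. The bucket, function, runtime, and group names are illustrative, and the function body itself is omitted.

```bash
# The object finalize event on the bucket triggers the function, which scales
# the MIG up; the processing image shuts each VM down when its batch completes.
gcloud functions deploy scale-batch-workers \
  --runtime=python311 \
  --trigger-resource=uploads-bucket \
  --trigger-event=google.storage.object.finalize \
  --entry-point=on_upload

# Inside the function (or for manual testing), resizing is a single call.
gcloud compute instance-groups managed resize batch-workers \
  --zone=us-central1-a \
  --size=4
```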

GKE with a watcher and a worker deployment that scales down when idle does not minimize cost when idle because you still pay for cluster resources such as nodes and control plane fees. It also adds operational complexity compared to a simple event driven trigger from Cloud Storage.

Cloud Run jobs triggered by Pub/Sub notifications from Cloud Storage is not supported because Cloud Run jobs are executed on demand or by a scheduler and they do not have direct event triggers from Pub/Sub or Cloud Storage. You would need an intermediary component to start a job which adds latency and complexity and does not meet the requirement as written.

Map each requirement to a native capability. Secure uploads points to Cloud Storage IAM or signed URLs. Minimal idle cost usually means scale to zero using serverless or an autoscaled MIG. Immediate processing suggests event driven triggers from Cloud Storage to Cloud Functions or Eventarc.

Question 19

AuroraPay runs a fintech platform on Cloud Spanner as its primary database and needs to roll out a schema change that adds three secondary indexes and modifies two existing tables on a multi region instance. The release must keep latency and throughput impact as low as possible during business hours. What rollout plan should the team use to perform this change?

  • ✓ C. Apply the schema updates in phases by creating the new indexes first and waiting for backfill to finish, then alter the existing tables

The correct option is Apply the schema updates in phases by creating the new indexes first and waiting for backfill to finish, then alter the existing tables. This approach uses Cloud Spanner online schema changes and separates the most resource intensive work from the table alterations so it keeps latency and throughput impact lower during business hours on a multi region instance.

Creating secondary indexes in Spanner is an online operation that triggers a backfill which runs in the background while the database continues to serve reads and writes. By building the indexes first and waiting for backfill to complete, you ensure queries can immediately benefit from the new indexes when you later modify the tables. This reduces full scans and large read workloads during the table changes and it shortens any lock hold times for DDL steps that must briefly synchronize metadata.

Phasing the rollout also avoids stacking multiple long running backfills and table alterations at the same time which would compete for CPU, memory, and IOPS across replicas in a multi region configuration. Spreading the work keeps resource usage smoother and helps maintain steady performance during business hours.
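A rough sketch of the phased rollout with gcloud, using placeholder instance, database, index, and column names:

# Phase 1: create the secondary indexes as online operations.
gcloud spanner databases ddl update payments-db \
  --instance=aurora-multiregion \
  --ddl='CREATE INDEX PaymentsByMerchant ON Payments(MerchantId)'

# Watch the long running DDL operations and wait for the backfill to finish.
gcloud spanner operations list \
  --instance=aurora-multiregion \
  --database=payments-db \
  --type=DATABASE_UPDATE_DDL

# Phase 2: only after the backfill completes, alter the existing tables.
gcloud spanner databases ddl update payments-db \
  --instance=aurora-multiregion \
  --ddl='ALTER TABLE Payments ADD COLUMN SettlementNote STRING(256)'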

Delete the affected tables and drop the old indexes, then recreate the schema in one maintenance window is disruptive and risks significant downtime and data loss. Spanner supports online schema changes so deleting and recreating objects is unnecessary and would violate the requirement to minimize impact.

Clone the database into a new Spanner instance using backup and restore, apply the schema changes there, then cut over all traffic introduces a complex cutover with potential data divergence because backups are point in time. It adds risk and downtime without benefit since Spanner can apply these changes online in place.

Execute the full set of schema changes at once so the total change window is shortest concentrates multiple index backfills and table alterations concurrently which amplifies contention and resource spikes. This is more likely to increase latency and reduce throughput during peak hours compared to a phased rollout.

When a question mentions Spanner schema changes during business hours, look for an approach that uses online changes with phased rollout and waits for index backfill to complete before altering tables.

Question 20

How can developers test the latest Cloud Run revision without routing any production traffic to it?

  • ✓ B. Deploy with --no-traffic and a tag then use the tag URL

The correct option is Deploy with --no-traffic and a tag then use the tag URL.

Deploying with the no traffic flag creates the new revision without sending any requests to it. Adding a revision tag generates a dedicated tag URL that directly addresses that specific revision. Testers can use the tag URL to exercise the new code while the default service URL continues routing all production traffic to the stable revision. This isolates testing and satisfies the requirement to avoid routing any production traffic to the latest revision.
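For example, assuming a service named payments-api and a tag named canary, both of which are placeholders, the deployment looks roughly like this:

# Deploy the new revision without shifting any production traffic to it.
gcloud run deploy payments-api \
  --image=us-docker.pkg.dev/my-project/apps/payments-api:v2 \
  --region=us-central1 \
  --no-traffic \
  --tag=canary

# Testers call the tag URL, which is roughly of the form
# https://canary---payments-api-<hash>-uc.a.run.app, while the default
# service URL keeps serving the stable revision.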

Shift all traffic to LATEST using gcloud run services update-traffic is incorrect because it explicitly moves production traffic to the newest revision which violates the requirement to keep production traffic away from the new deployment during testing.

Grant roles/run.invoker and call the private URL is incorrect because the default service URL still respects the service traffic splits. Without a tag or a separate service there is no guaranteed way to reach the new revision directly, so this does not isolate testing from production.

Use Cloud Load Balancing to route tester IPs to LATEST is incorrect because a load balancer can route to a Cloud Run service but it cannot select an untagged internal revision such as the latest one. Only tagged revisions or separate services are addressable for targeted routing.

When a question asks how to test a new Cloud Run revision without affecting users, look for options that mention no traffic, a revision tag, and a tag URL. Be wary of choices that move traffic to LATEST or rely on external routing tricks because those usually send production traffic to the new revision.

Question 21

A fintech company named LumenPay runs a payments API on Compute Engine and forwards application logs to Cloud Logging. During an audit you learn that some records contain PII and every sensitive value begins with the prefix custinfo. You must keep those matching entries in a restricted storage location for later investigation and you must stop those entries from being retained in Cloud Logging. What should you do?

  • ✓ C. Set up a logs router sink with an advanced filter that matches custinfo and route matching entries to a Cloud Storage bucket then add a logs based exclusion with the same filter so Cloud Logging does not retain them

The correct answer is Set up a logs router sink with an advanced filter that matches custinfo and route matching entries to a Cloud Storage bucket then add a logs based exclusion with the same filter so Cloud Logging does not retain them.

This approach uses the Cloud Logging router to capture only entries that contain the custinfo prefix and delivers them to a Cloud Storage bucket where you can apply restrictive IAM controls and retention policies for investigations. Adding a logs based exclusion with the same filter prevents those entries from being stored in Cloud Logging buckets. This satisfies the requirement to keep the sensitive records in a restricted location while ensuring they are not retained in Cloud Logging.
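A hedged sketch of the two pieces, assuming a bucket named restricted-pii-logs and a contains style match on custinfo in the payload, with the bucket name and exact filter as illustrative values:

# Route matching entries to a locked down Cloud Storage bucket.
gcloud logging sinks create custinfo-sink \
  storage.googleapis.com/restricted-pii-logs \
  --log-filter='textPayload:custinfo'
# Remember to grant the sink's writer identity object creation rights on the bucket.

# Keep the same entries out of the _Default Logging bucket.
gcloud logging sinks update _Default \
  --add-exclusion=name=drop-custinfo,filter=textPayload:custinfo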

Configure the Ops Agent with a filter that drops log lines containing custinfo and then use Storage Transfer Service to upload the filtered content to a locked Cloud Storage bucket is incorrect because dropping the lines at the agent loses the data you need to keep. Storage Transfer Service is not used to collect and stream VM logs to Cloud Storage from agents and this path would not meet the requirement to preserve the sensitive entries.

Create a Pub/Sub sink with a filter for custinfo and trigger a Cloud Function that stores the messages in BigQuery with customer managed encryption keys is incorrect because it does not stop Cloud Logging from retaining the entries as it lacks an exclusion. It also stores the records in BigQuery which is not required and adds unnecessary components.

Create a basic logs filter for custinfo and configure a sink that exports matching records to Cloud Storage while relying on default retention for the rest of the logs is incorrect because it omits the logs based exclusion. Without an exclusion the sensitive entries would still be retained in Cloud Logging.

When a requirement says keep specific logs and also prevent their retention in Logging, pair a sink using an advanced filter with a matching logs based exclusion. Watch for keywords like prefix or contains that hint you must use an advanced filter rather than a basic one.

Question 22

Which Google Cloud approach best ensures 99.9 percent availability and low latency with efficient autoscaling while adopting a cloud native architecture?

  • ✓ B. GKE with microservices and autoscaling

The correct option is GKE with microservices and autoscaling.

With GKE you can deploy microservices across multi zone or regional clusters to remove single zone failure and support a 99.9 percent availability goal. Replicated pods behind Google Cloud Load Balancing, together with readiness and liveness probes, help maintain service health during failures and updates, which supports low latency under load.

Autoscaling is efficient on GKE because you can combine the Horizontal Pod Autoscaler to scale pods on metrics, the Vertical Pod Autoscaler to right size resources, and the Cluster Autoscaler to add or remove nodes. This gives fine grained control so the platform can react quickly to demand while reducing costs when traffic drops, and it aligns well with cloud native design using containers and declarative configuration.
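As an illustration with placeholder names, a regional cluster with node autoscaling plus a Horizontal Pod Autoscaler on one deployment covers the node and pod scaling layers:

# Regional cluster with the cluster autoscaler enabled per zone.
gcloud container clusters create payments-cluster \
  --region=us-central1 \
  --num-nodes=1 \
  --enable-autoscaling --min-nodes=1 --max-nodes=5

# Horizontal Pod Autoscaler for a microservice deployment.
kubectl autoscale deployment checkout \
  --cpu-percent=70 --min=3 --max=15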

The Cloud Run option is powerful for stateless request driven services and it does support fast scaling with minimal operations. However it offers less control over scheduling, topology, and advanced autoscaling policies than GKE, and cold starts or regional deployment choices can make strict low latency and availability objectives harder without extra configuration. It also lacks the full Kubernetes feature set that complex microservice topologies often need.

The App Engine standard option provides managed runtimes and autoscaling but it is more opinionated and less flexible for container based microservices. It has fewer tuning options for advanced autoscaling and traffic management than GKE, and it does not provide the same multi zone orchestration model that helps you design for high availability.

Map the requirement keywords to platform capabilities. When you see high availability, low latency, and efficient autoscaling for microservices, think Kubernetes on Google Cloud with regional clusters, HPA, and the Cluster Autoscaler.

Question 23

BlueRiver Capital is adopting Spinnaker to roll out a service that warms an in memory cache of about 3 GB during startup and it typically finishes initialization in about 4 minutes. You want the canary analysis to be fair and to reduce bias from cold start effects in the cache. How should you configure the canary comparison?

  • ✓ C. Compare the canary with a fresh deployment of the current production version

The correct configuration is to Compare the canary with a fresh deployment of the current production version.

This approach creates a control that experiences the same cold start behavior as the canary. Both instances warm the cache at the same time and under the same traffic which reduces bias from startup transients. Spinnaker canary analysis is designed to compare an experiment and a control that are as similar as possible other than the change under test. A freshly deployed control of the current version keeps environment and startup conditions aligned so the metrics reflect the code change rather than differences in cache warmth or runtime age.

Compare the canary with the existing deployment of the current production version is incorrect because the long running control instance already has a warm cache and stable runtime characteristics. That would make the comparison unfair since the canary would be penalized by cold start effects that the control does not experience.

Compare the canary with a new deployment of the previous production version is incorrect because it introduces a version mismatch that confounds the analysis. Differences in metrics could be caused by the version change rather than the canary conditions and it no longer isolates the impact of the new release.

Compare the canary with a baseline built from a 30 day rolling average of Cloud Monitoring production metrics is incorrect because a historical baseline does not run side by side and cannot capture current startup effects or daily variations. Canary analysis is most reliable when the control runs concurrently and shares the same traffic and environment.

When a service has significant startup transients, choose a control that is a fresh deployment of the same version and run it side by side with the canary. If the platform allows it, start the analysis window after the expected warm up period to further reduce noise.

Question 24

A vendor requires a JSON service account key and does not support Workload Identity Federation. The organization policy iam.disableServiceAccountKeyCreation blocks key creation. What should you do to complete the integration while following best practices?

  • ✓ C. Add a temporary project exception for iam.disableServiceAccountKeyCreation to create one user managed key then remove it

Add a temporary project exception for iam.disableServiceAccountKeyCreation to create one user managed key then remove it is correct because it provides the minimal and temporary allowance needed to meet the vendor requirement while keeping the organization-wide control intact.

This approach follows least privilege and change control. You create a project-level exception only long enough to generate a single user managed key for the specific service account. You then immediately re-enable the constraint. You should grant only the required roles to that service account, store the key securely, monitor its use with audit logs, and plan to retire it when the vendor can support a more secure federation model.
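A sketch of the sequence with the legacy org policy commands, using placeholder project and service account names:

# Temporarily stop enforcing the constraint on this project only.
gcloud resource-manager org-policies disable-enforce \
  iam.disableServiceAccountKeyCreation \
  --project=vendor-integration-project

# Create the single user managed key for the vendor.
gcloud iam service-accounts keys create vendor-key.json \
  --iam-account=vendor-sa@vendor-integration-project.iam.gserviceaccount.com

# Re-enable enforcement immediately afterwards.
gcloud resource-manager org-policies enable-enforce \
  iam.disableServiceAccountKeyCreation \
  --project=vendor-integration-project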

Disable iam.disableServiceAccountKeyCreation at the organization root is incorrect because it weakens the control across all projects. It broadens risk unnecessarily and does not follow best practice to scope exceptions as narrowly as possible.

Use Workload Identity Federation is incorrect because the vendor does not support it in this scenario. Although WIF is the preferred method to avoid long-lived keys, it is not feasible here.

Create the key in a different project without the constraint and share it is incorrect because it bypasses governance intent and complicates access management and auditing. It can also conflict with organization policy inheritance and increases the blast radius by distributing credentials across projects.

When a control blocks a needed change, create a temporary exception at the smallest possible scope, complete the task, then promptly revert the setting.

Question 25

You are writing a blameless post incident review for a 35 minute outage at BrightWave Media that impacted about 70 percent of customers during the evening peak. Your goal is to help the organization avoid a repeat of this type of failure in the future. Which sections should you include in the report to best support long term prevention? (Choose 2)

  • ✓ B. A prioritized remediation plan with specific actions, owners, and due dates

  • ✓ E. A clear analysis of the primary cause and contributing factors of the outage

The correct options are A prioritized remediation plan with specific actions, owners, and due dates and A clear analysis of the primary cause and contributing factors of the outage.

A strong review must drive concrete change. This plan turns learning into measurable work that is tracked to completion. It establishes ownership, timelines, and priority so that the most impactful fixes happen first and the organization can verify that prevention steps are actually implemented.

Understanding why the failure happened is essential for prevention. This analysis distinguishes the primary cause from contributing factors, maps how the incident unfolded, and reveals the systemic gaps that allowed it to escalate. With that clarity the team can design targeted fixes, improve detection and response, and reduce the chance of recurrence.

A comparison of this incident’s severity to earlier incidents does not directly support prevention. Historical context can be useful for reporting, yet it does not by itself identify what to fix or who will do the work.

A list of employees to blame for the outage contradicts blameless culture and discourages honest reporting and learning. Assigning blame to individuals obscures systemic issues and reduces the effectiveness of long term improvements.

A complete export of Cloud Monitoring dashboards and raw logs from the affected window may be helpful as supporting data, but a data dump without synthesized findings does not guide action. The review should summarize insights derived from data rather than include exhaustive raw outputs.

When a question highlights a blameless review, choose options that emphasize root cause and contributing factors along with actionable remediation that names owners and due dates. Avoid choices that focus on blame, raw data dumps, or comparisons that do not lead to prevention.

Jira, Scrum & AI Certification

Want to get certified on the most popular software development technologies of the day? These resources will help you get Jira certified, Scrum certified and even AI Practitioner certified so your resume really stands out.

You can even get certified in the latest AI, ML and DevOps technologies. Advance your career today.

Cameron McKenzie is an AWS Certified AI Practitioner, Machine Learning Engineer, Copilot Expert, Solutions Architect and author of many popular books in the software development and Cloud Computing space. His growing YouTube channel training devs in Java, Spring, AI and ML has well over 30,000 subscribers.