Free AWS Certified Data Engineer Sample Questions

All questions come from my AWS Engineer Udemy course and certificationexams.pro
Free AWS Data Engineer Exam Topics Test
The AWS Certified Data Engineer Associate exam validates your ability to design, build, and maintain data processing systems that support analytics and business intelligence across AWS services. It focuses on key domains including data ingestion, transformation, storage optimization, and query performance.
To prepare effectively, begin with the AWS Data Engineer Associate Practice Questions. These questions mirror the tone, logic, and structure of the real certification exam and help you become familiar with AWS’s question style and reasoning approach. You can also explore Real AWS Data Engineer Exam Questions for authentic, scenario-based challenges that simulate real data engineering tasks.
For focused study, review AWS Data Engineer Sample Questions covering Glue ETL pipelines, Redshift optimization, S3 data partitioning, and troubleshooting common issues.
AWS Certification Exam Simulator
Each section of the AWS Data Engineer Questions and Answers collection is designed to teach as well as test. These materials reinforce essential AWS data concepts and provide clear explanations that help you understand why specific responses are correct.
For complete readiness, use the AWS Data Engineer Associate Exam Simulator and take full-length Certified Data Engineer Associate Exam Questions. These simulations reproduce the pacing and structure of the actual certification exam so you can manage your time effectively and gain confidence under real test conditions.
If you prefer focused study sessions, try the AWS Data Engineer Exam Dump and AWS Data Engineer Certification Braindump collections. These organize questions by topic such as data transformation, workflow orchestration, schema evolution, and governance, allowing you to strengthen your knowledge in key areas.
Working through these exercises builds the analytical and practical skills needed to design efficient pipelines and ensure data integrity across AWS environments. Start your preparation today with the AWS Data Engineer Associate Practice Questions and measure your progress using the AWS Data Engineer Associate Exam Simulator. Prepare to earn your certification and advance your career as a trusted AWS Data Engineer.
AWS Data Engineer Associate Sample Questions
Question 1
A digital media analytics firm runs Apache Hive on Amazon EMR. Around midday, roughly 90% of daily Hive queries execute in a short burst and performance degrades, yet monitoring shows HDFS usage consistently stays below 12%. What change should be made to improve performance during these spikes?
-
❏ A. Configure uniform instance groups for core and task nodes and scale based on the CloudWatch CapacityRemainingGB metric
-
❏ B. Configure EC2 Spot Fleet for EMR core and task nodes and scale on the YARNMemoryAvailablePercentage metric
-
❏ C. Enable instance groups for core and task nodes and drive automatic scaling using the YARNMemoryAvailablePercentage CloudWatch metric
-
❏ D. Turn on EMR managed scaling but target HDFS CapacityRemainingGB rather than YARN memory utilization
Question 2
For daily time-based scans on S3 telemetry grouped by sensor_model, which storage format and layout maximize query performance and cost efficiency?
-
❏ A. Apache Parquet, partitioned by sensor_model, sorted by event_time
-
❏ B. Compressed CSV, partitioned by event_date, sorted by health_state
-
❏ C. Apache ORC, partitioned by event_date, sorted by sensor_model
-
❏ D. Apache Iceberg on S3, monthly partitions
Question 3
An online brokerage is assessing its workloads with the AWS Well-Architected Tool to tighten security. They want centralized control over identities and credentials with routine rotation, and they also need database passwords to rotate automatically without interrupting applications. Which AWS services should they choose to satisfy these requirements?
-
❏ A. AWS CloudTrail and AWS Systems Manager Automation
-
❏ B. AWS Well-Architected Tool and AWS Key Management Service (KMS)
-
❏ C. AWS Identity and Access Management (IAM) and AWS Secrets Manager
-
❏ D. Amazon Cognito and AWS Secrets Manager
Question 4
Which approach processes Amazon Kinesis Data Streams records and updates Amazon RDS with sub-15-ms latency, auto-scaling, and minimal operations?
-
❏ A. Amazon Kinesis Data Analytics (Apache Flink) writing directly to Amazon RDS
-
❏ B. AWS Lambda consumers for Kinesis Data Streams updating Amazon RDS
-
❏ C. EC2 Auto Scaling workers polling Kinesis Data Streams and persisting to Amazon RDS
-
❏ D. AWS Glue streaming job loading to Amazon RDS
Question 5
A retail analytics group at Meridian Outfitters runs complex Amazon Athena queries against clickstream logs stored in Amazon S3, with the AWS Glue Data Catalog holding the table metadata. Query planning has slowed significantly because the table is split into hundreds of thousands of partitions across date and region keys. The team wants to cut Athena planning overhead and speed up query execution while continuing to use Athena. Which actions would best achieve these goals? (Choose 2)
-
❏ A. Use AWS Glue jobs to merge small S3 objects into larger files across all partitions
-
❏ B. Enable partition projection in Athena for the table and define partition value ranges and formats
-
❏ C. Implement AWS Glue Elastic Views to manage partitions for the Athena table
-
❏ D. Convert the dataset in S3 to Apache Parquet with compression and update the table metadata
-
❏ E. Rely on Athena DML to automatically reduce the number of partitions during queries
Question 6
Athena queries over large CSV data in S3 are slow and costly. What should you do to reduce data scanned and improve performance without moving off Athena?
-
❏ A. Use Athena CTAS to generate daily Parquet extracts and query those
-
❏ B. Enable Athena materialized views over the CSV tables
-
❏ C. Run AWS Glue ETL to convert CSV to Parquet with partitions and update the Data Catalog
-
❏ D. Upgrade to Athena engine v3 and enable query result reuse
Question 7
A digital advertising analytics startup uses Amazon Athena to query clickstream logs in Amazon S3 with SQL. Over the last 9 months, the number of data engineers tripled and Athena costs spiked. Most routine queries finish in under five seconds and read only small amounts of data. The company wants to enforce different hourly and daily caps on the total data scanned per workgroup to control spend without blocking short queries. What should you configure in Athena?
-
❏ A. Define multiple per-query limits using Athena per-query data usage controls
-
❏ B. AWS Budgets
-
❏ C. Set several workgroup-wide data usage limits in Athena to enforce hourly and daily scanned-data thresholds
-
❏ D. Configure a single workgroup-wide limit that combines all hourly and daily thresholds into one setting
Question 8
Which AWS service enables scheduled subscriptions to third-party datasets with automatic delivery to Amazon S3 every 6 hours?
-
❏ A. Amazon Kinesis Data Firehose
-
❏ B. AWS Data Exchange service
-
❏ C. AWS Glue crawler
-
❏ D. Amazon AppFlow
Question 9
VeloTrack Logistics runs a write-intensive payments service on Amazon RDS for PostgreSQL, and the primary instance shows sustained CPU utilization above 85 percent during peak posting windows, causing slow transactions. Which actions should a data engineer take to directly reduce CPU pressure on the database instance? (Choose 2)
-
❏ A. Implement Amazon ElastiCache to cache frequent lookups and reduce read traffic
-
❏ B. Migrate the DB instance to a larger class with more vCPU and memory
-
❏ C. Create an Amazon RDS read replica to move reporting reads off the primary
-
❏ D. Turn on Amazon RDS Performance Insights and tune the highest-CPU queries
-
❏ E. Increase the Provisioned IOPS allocation for the DB storage
Question 10
During an AWS DMS migration with near-zero downtime, how can you verify the target matches the source before cutover while minimizing source impact?
-
❏ A. AWS Glue Data Quality
-
❏ B. Enable AWS DMS data validation
-
❏ C. Amazon Aurora zero-ETL integration
-
❏ D. AWS DMS premigration assessment
Question 11
A regional media analytics firm named BrightStream wants developers to be able to start and stop specific Amazon EC2 instances only between 08:00 and 18:00 on weekdays, and the platform team needs to see exactly which identity performed each change while also automating monthly patching across all instances. Which combination of AWS services and features should be used to meet these goals?
-
❏ A. IAM policies with tag-based permissions, AWS CloudWatch Logs for change tracking, and AWS Systems Manager Patch Manager
-
❏ B. IAM policy conditions using aws:CurrentTime with date operators for business-hour access, AWS CloudTrail for auditing, and AWS Systems Manager Patch Manager
-
❏ C. AWS Config, Amazon EventBridge Scheduler, and AWS Systems Manager State Manager
-
❏ D. AWS Systems Manager Session Manager, AWS CloudTrail, and Amazon EC2 Auto Scaling
Question 12
Which Lake Formation features enable a central catalog, cross-service discovery, and fine-grained, tag-based governance for datasets from Amazon S3 and a JDBC source?
-
❏ A. IAM policies + Glue ETL jobs + AWS RAM
-
❏ B. Glue Data Catalog with LF tag-based access control + Lake Formation blueprints
-
❏ C. Glue Crawlers + Lake Formation blueprints + IAM roles
-
❏ D. Glue Data Catalog + Lake Formation permissions (no LF tags) + Glue jobs
Question 13
At LumaRide Mobility, analysts run the same KPI dashboard query every 10 minutes against Amazon Redshift. The statement applies heavy aggregations and joins across about 28 TB of historical data in a fact_trips table, causing long runtimes. Which approach will yield the largest performance gain for this repeatedly executed query?
-
❏ A. Use Amazon Redshift Spectrum to read external tables in Amazon S3 instead of local Redshift storage
-
❏ B. Create a standard view that encapsulates the SQL and query the view when needed
-
❏ C. Create a materialized view in Amazon Redshift with automatic refresh so the aggregations are precomputed and stored
-
❏ D. Rely on the Amazon Redshift result cache by ensuring the exact same query text is executed each time
Question 14
Which AWS Organizations policy enforces organization-wide maximum permissions when attached to the root or OUs?
-
❏ A. AWS IAM Access Analyzer
-
❏ B. AWS Organizations Service Control Policies (SCPs)
-
❏ C. IAM permissions boundaries
-
❏ D. AWS Control Tower
Question 15
A data engineer at a regional logistics startup runs heavy analytics approximately every six weeks. For each cycle, they launch a new Amazon Redshift provisioned cluster, process queries for about three hours, export results and snapshots to an Amazon S3 bucket, and then delete the cluster. They want to keep these periodic analyses while avoiding capacity planning, patching, and lifecycle scripting for clusters. Which approach will achieve this with the least ongoing operational effort?
-
❏ A. Use Amazon EventBridge Scheduler to trigger an AWS Step Functions workflow that creates a Redshift cluster, runs the jobs, copies data to S3, and then terminates the cluster
-
❏ B. Use Amazon Redshift Serverless to run the analytics on demand with automatic scaling
-
❏ C. Configure Redshift zero-ETL integrations to handle the batch analytics workload
-
❏ D. Purchase Amazon Redshift reserved node offerings for the cluster to simplify operations
Question 16
How should you migrate an existing Hive metastore (about 6,400 tables) to a serverless, low-cost metadata catalog for EMR Spark and Hive on AWS?
-
❏ A. EMR cluster with self-hosted Hive metastore
-
❏ B. AWS Glue Data Catalog import for Hive metastore
-
❏ C. Amazon RDS with AWS DMS for the metastore
-
❏ D. AWS Glue Crawlers to rebuild tables from S3
Question 17
A media analytics startup, StreamQuant, uses an Amazon Redshift warehouse that holds roughly nine years of event records. Compliance requires keeping all historical data, but analysts primarily query the most recent 45 days for near real-time dashboards. How can the team organize storage and queries to reduce cost while preserving fast performance for recent data?
-
❏ A. Use Dense Compute (DC2) nodes to store all historical and recent data in the Redshift cluster
-
❏ B. Archive historical data in Amazon S3 Glacier and run active workloads on Amazon Redshift DS2 nodes
-
❏ C. Adopt Amazon Redshift RA3 with managed storage, unload older partitions to Amazon S3 and query them with Redshift Spectrum while keeping hot data in the cluster
-
❏ D. Move older records into an Amazon RDS database and keep only recent data in Amazon Redshift
Question 18
Which AWS services best ingest about 120 million daily clickstream events, enable fast search with aggregations, and deliver interactive dashboards?
-
❏ A. Amazon MSK, Amazon Redshift, Amazon OpenSearch Service
-
❏ B. Amazon Kinesis Data Streams, Amazon Athena, Amazon Managed Grafana
-
❏ C. Amazon Kinesis Data Firehose, Amazon OpenSearch Service, Amazon QuickSight
-
❏ D. Amazon Kinesis Data Firehose, Amazon Redshift, Amazon QuickSight
Question 19
A mobile gaming studio uses Amazon Kinesis Data Streams to collect gameplay and clickstream events from its apps. During limited-time tournaments, traffic can spike up to 12x within 15 minutes. The team wants the stream to scale with these surges automatically and avoid managing shard counts or scaling scripts. Which configuration should they choose?
-
❏ A. Use Kinesis Data Streams with enhanced fan-out to boost consumer throughput during bursts
-
❏ B. Use Kinesis Data Streams in on-demand capacity mode so the stream automatically scales with traffic
-
❏ C. Migrate to Amazon Kinesis Data Firehose for automatic scaling to handle spikes
-
❏ D. Use Kinesis Data Streams in provisioned capacity and add or split shards manually during surges
Question 20
In Amazon Kinesis Data Streams, how can a consumer restart and resume from its last committed sequence number to avoid reprocessing events?
-
❏ A. AWS Lambda trigger with stateless processing
-
❏ B. Enable enhanced fan-out and use LATEST
-
❏ C. SubscribeToShard with enhanced fan-out. Rely on the stream to track offsets
-
❏ D. Kinesis Client Library with DynamoDB shard checkpoints
AWS Certified Data Engineer Sample Questions and Answers
Question 1
A digital media analytics firm runs Apache Hive on Amazon EMR. Around midday, roughly 90% of daily Hive queries execute in a short burst and performance degrades, yet monitoring shows HDFS usage consistently stays below 12%. What change should be made to improve performance during these spikes?
-
✓ C. Enable instance groups for core and task nodes and drive automatic scaling using the YARNMemoryAvailablePercentage CloudWatch metric
The correct answer is Enable instance groups for core and task nodes and drive automatic scaling using the YARNMemoryAvailablePercentage CloudWatch metric.
The workload is compute and memory bound during the query surge, while HDFS usage is low, so scaling on YARN memory headroom is appropriate. EMR automatic scaling for instance groups supports CloudWatch metrics such as YARNMemoryAvailablePercentage to add or remove nodes during load spikes.
Configure uniform instance groups for core and task nodes and scale based on the CloudWatch CapacityRemainingGB metric is not suitable because CapacityRemainingGB measures HDFS storage space and does not reflect compute or memory pressure, which is the actual bottleneck.
Configure EC2 Spot Fleet for EMR core and task nodes and scale on the YARNMemoryAvailablePercentage metric is invalid since EC2 Spot Fleet is not used to provision EMR core or task nodes. EMR relies on instance groups or instance fleets to manage cluster nodes.
Turn on EMR managed scaling but target HDFS CapacityRemainingGB rather than YARN memory utilization is misguided because managed scaling decisions are based on YARN memory and vCore metrics, not HDFS capacity, and the low HDFS usage indicates storage is not the limiting factor.
When HDFS utilization is low but queries slow during peaks, think compute or memory bound. For EMR, scale on YARN metrics like YARNMemoryAvailablePercentage, and remember EC2 Spot Fleet is not how you provision EMR core or task nodes.
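For reference, a minimal boto3 sketch of such a scale-out rule is shown below. The cluster ID, instance group ID, and thresholds are placeholders, and the cluster must already have an EMR auto scaling role configured.

```python
import boto3

emr = boto3.client("emr")

# Attach an automatic scaling policy to a task instance group that adds nodes
# when YARN memory headroom drops below 15 percent.
emr.put_auto_scaling_policy(
    ClusterId="j-EXAMPLECLUSTER",            # placeholder cluster ID
    InstanceGroupId="ig-EXAMPLETASKGROUP",   # placeholder instance group ID
    AutoScalingPolicy={
        "Constraints": {"MinCapacity": 2, "MaxCapacity": 20},
        "Rules": [
            {
                "Name": "ScaleOutOnLowYarnMemory",
                "Action": {
                    "SimpleScalingPolicyConfiguration": {
                        "AdjustmentType": "CHANGE_IN_CAPACITY",
                        "ScalingAdjustment": 4,
                        "CoolDown": 300,
                    }
                },
                "Trigger": {
                    "CloudWatchAlarmDefinition": {
                        "ComparisonOperator": "LESS_THAN",
                        "EvaluationPeriods": 1,
                        "MetricName": "YARNMemoryAvailablePercentage",
                        "Namespace": "AWS/ElasticMapReduce",
                        "Period": 300,
                        "Statistic": "AVERAGE",
                        "Threshold": 15.0,
                        "Unit": "PERCENT",
                    }
                },
            }
        ],
    },
)
```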
Question 2
For daily time-based scans on S3 telemetry grouped by sensor_model, which storage format and layout maximize query performance and cost efficiency?
-
✓ C. Apache ORC, partitioned by event_date, sorted by sensor_model
The best choice is Apache ORC, partitioned by event_date, sorted by sensor_model.
Columnar formats like ORC enable predicate pushdown, compression, and selective column reads, dramatically reducing scanned bytes. Partitioning by event_date aligns with daily time filters to prune entire partitions, and sorting within partitions by sensor_model improves grouping and aggregation efficiency.
The option Apache Parquet, partitioned by sensor_model, sorted by event_time is suboptimal because time-based queries would have to read many model partitions, defeating partition pruning. Sorting by event_time does not help when aggregating by model.
The option Compressed CSV, partitioned by event_date, sorted by health_state uses a row-oriented format without predicate pushdown or column pruning, leading to higher scan costs and slower queries even with date partitions.
The option Apache Iceberg on S3, monthly partitions introduces overly coarse partitions for daily workloads, which increases scanned data. Without partitioning and ordering aligned to the query patterns, performance suffers.
Prefer columnar formats (ORC or Parquet) over CSV/JSON for analytics. Match partition keys to the most common filter predicates (often event_date). For group-by-heavy queries on a dimension, sort or cluster within partitions by that dimension. Use appropriately sized files and consistent schemas. Look for keywords like daily scans, group by, and cost efficiency to steer toward columnar + date partitioning + sort-by-dimension.
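A short PySpark sketch of this layout, assuming hypothetical S3 paths and the column names from the scenario:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("telemetry-layout").getOrCreate()

# Hypothetical raw input location
df = spark.read.json("s3://example-telemetry/raw/")

(
    df.repartition("event_date")              # group rows for one output partition per day
      .sortWithinPartitions("sensor_model")   # cluster rows by the group-by dimension
      .write.mode("overwrite")
      .partitionBy("event_date")              # date partitions enable pruning for daily scans
      .orc("s3://example-telemetry/curated/orc/")
)
```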
Question 3
An online brokerage is assessing its workloads with the AWS Well-Architected Tool to tighten security. They want centralized control over identities and credentials with routine rotation, and they also need database passwords to rotate automatically without interrupting applications. Which AWS services should they choose to satisfy these requirements?
-
✓ C. AWS Identity and Access Management (IAM) and AWS Secrets Manager
The correct choice is AWS Identity and Access Management (IAM) and AWS Secrets Manager.
IAM provides centralized identity, access policies, and credential governance, while Secrets Manager integrates with databases to rotate credentials automatically using rotation Lambdas and RDS integrations without application downtime.
AWS CloudTrail and AWS Systems Manager Automation are not a fit because CloudTrail audits API activity rather than managing identities or credentials, and Systems Manager Automation is not a purpose-built service for seamless database secret rotation.
AWS Well-Architected Tool and AWS Key Management Service (KMS) are inadequate here since the Well-Architected Tool is advisory and KMS rotates encryption keys, not database passwords or application secrets.
Amazon Cognito and AWS Secrets Manager is closer, but Cognito addresses end-user authentication for applications, not centralized workforce IAM and credential policies across the AWS account that the scenario requires.
Match the need for centralized identity governance with IAM and the need for automatic database secret rotation with AWS Secrets Manager. Avoid confusing key rotation in KMS with secret rotation for credentials.
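As an illustration, enabling rotation on an existing database secret with boto3 might look like this. The secret name and rotation Lambda ARN are placeholders, and Secrets Manager can provision the rotation function for supported RDS engines.

```python
import boto3

secrets = boto3.client("secretsmanager")

# Turn on automatic rotation every 30 days for a database credential secret.
secrets.rotate_secret(
    SecretId="prod/payments/postgres",  # placeholder secret name
    RotationLambdaARN="arn:aws:lambda:us-east-1:123456789012:function:SecretsManagerRotation",
    RotationRules={"AutomaticallyAfterDays": 30},
)
```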
Question 4
Which approach processes Amazon Kinesis Data Streams records and updates Amazon RDS with sub-15-ms latency, auto-scaling, and minimal operations?
-
✓ B. AWS Lambda consumers for Kinesis Data Streams updating Amazon RDS
AWS Lambda consumers for Kinesis Data Streams updating Amazon RDS is the best fit for low-latency, event-driven processing with minimal operations. Lambda integrates natively with Kinesis Data Streams, automatically scales with shard throughput, and can be tuned for per-record processing by setting small batch sizes and a near-zero batch window. Using Amazon RDS Proxy with Lambda helps pool connections to RDS for stability at high concurrency.
Amazon Kinesis Data Analytics (Apache Flink) writing directly to Amazon RDS is heavier to build and operate and requires custom sinks or JDBC, which adds latency and complexity. It is not the minimal-ops path for OLTP writes.
EC2 Auto Scaling workers polling Kinesis Data Streams and persisting to Amazon RDS increases operational overhead for capacity, patching, and fault handling, conflicting with the minimal-ops requirement.
AWS Glue streaming job loading to Amazon RDS uses micro-batching via Spark, which typically cannot meet sub-15-ms latencies and is not designed for fine-grained OLTP updates.
When the question stresses minimal operations and event-driven from Kinesis Data Streams, think Lambda. For very low latency, tune Lambda event source mapping with small batch size and batch window. For database connection scaling, pair Lambda with Amazon RDS Proxy. Consider ParallelizationFactor for additional per-shard concurrency and ensure downstream idempotency.
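A hedged boto3 sketch of such an event source mapping, with placeholder stream and function names:

```python
import boto3

lambda_client = boto3.client("lambda")

# Map a Kinesis data stream to a Lambda consumer tuned for low-latency,
# per-record processing.
lambda_client.create_event_source_mapping(
    EventSourceArn="arn:aws:kinesis:us-east-1:123456789012:stream/payments-events",
    FunctionName="update-rds-consumer",
    StartingPosition="LATEST",
    BatchSize=1,                          # process records as they arrive
    MaximumBatchingWindowInSeconds=0,     # do not wait to fill a batch
    ParallelizationFactor=2,              # extra concurrency per shard
)
```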
Question 5
A retail analytics group at Meridian Outfitters runs complex Amazon Athena queries against clickstream logs stored in Amazon S3, with the AWS Glue Data Catalog holding the table metadata. Query planning has slowed significantly because the table is split into hundreds of thousands of partitions across date and region keys. The team wants to cut Athena planning overhead and speed up query execution while continuing to use Athena. Which actions would best achieve these goals? (Choose 2)
-
✓ B. Enable partition projection in Athena for the table and define partition value ranges and formats
-
✓ D. Convert the dataset in S3 to Apache Parquet with compression and update the table metadata
The most effective improvements come from reducing both planning overhead and the amount of data scanned. Enable partition projection in Athena for the table and define partition value ranges and formats eliminates expensive partition enumeration in the AWS Glue Data Catalog, which directly targets slow query planning. Convert the dataset in S3 to Apache Parquet with compression and update the table metadata cuts the bytes read per query, improving execution time.
Use AWS Glue jobs to merge small S3 objects into larger files across all partitions can help with read efficiency but does not resolve the core issue of too many partitions driving planning latency.
Implement AWS Glue Elastic Views to manage partitions for the Athena table is unrelated to Athena’s partition metadata and was discontinued, so it is not a viable solution on newer exams.
Rely on Athena DML to automatically reduce the number of partitions during queries is not feasible because Athena does not auto-merge partitions. Re-partitioning requires explicitly rewriting data with a new layout.
When Athena planning is slow due to many partitions, think partition projection to avoid catalog enumeration and columnar formats like Parquet to reduce scan size and cost.
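One way to enable projection on an existing table is an ALTER TABLE statement run through Athena. The database, table, partition ranges, and S3 template below are placeholders, and the same properties can also be set directly on the Glue table.

```python
import boto3

athena = boto3.client("athena")

# Define projected ranges and formats for the date and region partition keys so
# Athena can compute partition locations instead of enumerating them in the
# Glue Data Catalog.
ddl = """
ALTER TABLE clickstream.events SET TBLPROPERTIES (
  'projection.enabled' = 'true',
  'projection.dt.type' = 'date',
  'projection.dt.range' = '2021/01/01,NOW',
  'projection.dt.format' = 'yyyy/MM/dd',
  'projection.region.type' = 'enum',
  'projection.region.values' = 'us,eu,apac',
  'storage.location.template' = 's3://example-clickstream/events/${dt}/${region}/'
)
"""

athena.start_query_execution(
    QueryString=ddl,
    WorkGroup="primary",
    ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},
)
```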
Question 6
Athena queries over large CSV data in S3 are slow and costly. What should you do to reduce data scanned and improve performance without moving off Athena?
-
✓ C. Run AWS Glue ETL to convert CSV to Parquet with partitions and update the Data Catalog
The best approach is to Run AWS Glue ETL to convert CSV to Parquet with partitions and update the Data Catalog.
Columnar formats like Parquet with compression and partitioning drastically reduce bytes scanned, which directly lowers Athena cost and improves performance. Using Glue ETL provides a scalable, managed pipeline, and the Data Catalog stays accurate for efficient predicate pushdown and partition pruning.
The option Use Athena CTAS to generate daily Parquet extracts and query those can work for ad hoc conversions, but relying on Athena as a continuous ETL mechanism for large, ongoing ingestion is brittle and lacks robust orchestration and partition management.
The option Enable Athena materialized views over the CSV tables may accelerate specific repeated aggregations, but it still refreshes from CSV and does not broadly reduce the underlying scan footprint across diverse queries.
The option Upgrade to Athena engine v3 and enable query result reuse can help repeat queries but does not fix the root cause of high scan volume from unpartitioned CSV. Data layout optimization remains necessary.
For Athena performance, prioritize columnar formats (Parquet/ORC), compression, and partitioning. Maintain accurate metadata in the AWS Glue Data Catalog and leverage partition pruning via aligned S3 prefixes. Engine upgrades, result reuse, and materialized views help in narrow cases but do not replace proper data layout optimization.
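A minimal AWS Glue (PySpark) job sketch for the conversion, assuming a hypothetical catalog database, table, and output path; a crawler or catalog-update option can then register the Parquet table.

```python
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

# Read the cataloged CSV table and rewrite it as partitioned, compressed Parquet.
args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

dyf = glue_context.create_dynamic_frame.from_catalog(
    database="weblogs", table_name="raw_csv_events"   # placeholder catalog names
)

glue_context.write_dynamic_frame.from_options(
    frame=dyf,
    connection_type="s3",
    connection_options={
        "path": "s3://example-weblogs/curated/parquet/",  # placeholder output path
        "partitionKeys": ["event_date"],
    },
    format="glueparquet",
    format_options={"compression": "snappy"},
)

job.commit()
```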
Question 7
A digital advertising analytics startup uses Amazon Athena to query clickstream logs in Amazon S3 with SQL. Over the last 9 months, the number of data engineers tripled and Athena costs spiked. Most routine queries finish in under five seconds and read only small amounts of data. The company wants to enforce different hourly and daily caps on the total data scanned per workgroup to control spend without blocking short queries. What should you configure in Athena?
-
✓ C. Set several workgroup-wide data usage limits in Athena to enforce hourly and daily scanned-data thresholds
The correct choice is Set several workgroup-wide data usage limits in Athena to enforce hourly and daily scanned-data thresholds.
Athena workgroups support multiple data usage control limits per workgroup, and each limit can target a specific period such as hourly or daily with its own threshold. This directly satisfies the need for different aggregate caps while allowing short queries to continue running.
Define multiple per-query limits using Athena per-query data usage controls is incorrect because Athena allows only one per-query limit per workgroup, and it governs individual query scan size rather than aggregate usage across a period.
AWS Budgets is not appropriate because it provides monitoring and alerts for spend or usage but does not enforce or stop Athena queries based on data scanned within a workgroup.
Configure a single workgroup-wide limit that combines all hourly and daily thresholds into one setting is not feasible because each workgroup limit in Athena has a single time window. You need separate limits for hourly and daily caps.
When the requirement mentions aggregate data scanned over time in Athena, think workgroup-wide usage limits. When it mentions capping a single query’s scan size, think per-query limit. Remember there is one per-query limit per workgroup but multiple workgroup-wide limits are allowed.
Question 8
Which AWS service enables scheduled subscriptions to third-party datasets with automatic delivery to Amazon S3 every 6 hours?
-
✓ B. AWS Data Exchange service
The correct choice is AWS Data Exchange service because it is purpose-built for subscribing to third-party datasets and automating their delivery into Amazon S3 on a recurring schedule, minimizing custom code and ongoing operations.
Amazon Kinesis Data Firehose is for streaming ingestion to S3 and other targets but does not handle marketplace subscriptions or vendor feeds.
AWS Glue crawler only catalogs data already in S3 or other sources and does not ingest external datasets.
Amazon AppFlow integrates data from supported SaaS applications but is not intended for subscribing to external marketplace datasets.
When you see keywords like subscribe to third-party data, marketplace, and automatic/recurring delivery to S3, prefer AWS Data Exchange. If a solution would require custom scheduling, retries, or vendor API handling, it is likely not the managed option the exam expects.
Question 9
VeloTrack Logistics runs a write-intensive payments service on Amazon RDS for PostgreSQL, and the primary instance shows sustained CPU utilization above 85 percent during peak posting windows, causing slow transactions. Which actions should a data engineer take to directly reduce CPU pressure on the database instance? (Choose 2)
-
✓ B. Migrate the DB instance to a larger class with more vCPU and memory
-
✓ D. Turn on Amazon RDS Performance Insights and tune the highest-CPU queries
Turn on Amazon RDS Performance Insights and tune the highest-CPU queries is correct because it reveals the SQL and waits driving CPU, enabling targeted optimizations such as query rewrites and indexing to cut compute usage.
Migrate the DB instance to a larger class with more vCPU and memory is also correct since adding CPU capacity directly relieves saturation for a write-heavy OLTP workload when tuning alone is insufficient.
Implement Amazon ElastiCache to cache frequent lookups and reduce read traffic is not appropriate here because caching primarily benefits read-heavy access patterns and does not reduce CPU spent on inserts and updates.
Create an Amazon RDS read replica to move reporting reads off the primary does not solve the problem because replicas handle reads only, while the primary remains responsible for all writes and stays CPU-bound.
Increase the Provisioned IOPS allocation for the DB storage addresses I/O latency and throughput but not compute exhaustion, so it will not meaningfully reduce CPU unless the bottleneck is storage-related.
When RDS CPU is persistently high for a write-heavy workload, first use Performance Insights to find and fix expensive SQL, then consider vertical scaling. Read replicas and caching mainly help read-intensive scenarios.
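For example, Performance Insights can be switched on for an existing instance with boto3; the instance identifier is a placeholder.

```python
import boto3

rds = boto3.client("rds")

# Enable Performance Insights on the primary so the top CPU-consuming SQL
# statements can be identified and tuned.
rds.modify_db_instance(
    DBInstanceIdentifier="payments-primary",   # placeholder identifier
    EnablePerformanceInsights=True,
    PerformanceInsightsRetentionPeriod=7,      # days; 7 is the free retention tier
    ApplyImmediately=True,
)
```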
Question 10
During an AWS DMS migration with near-zero downtime, how can you verify the target matches the source before cutover while minimizing source impact?
-
✓ B. Enable AWS DMS data validation
Enable AWS DMS data validation is correct because it performs row-by-row and aggregate comparisons between source and target, continues during CDC, and surfaces discrepancies so you can confirm accuracy before cutover with minimal impact on the source.
The option AWS DMS premigration assessment is incorrect because it evaluates compatibility and task readiness, not data equality.
AWS Glue Data Quality is incorrect as it targets lakehouse datasets and rules-based checks rather than validating relational database migrations executed by DMS.
Amazon Aurora zero-ETL integration is incorrect because it delivers data to Amazon Redshift for analytics and does not validate source-to-target parity.
When you see verify target matches source and near-zero downtime with CDC, think DMS data validation. Premigration assessments address readiness, not correctness. Lakehouse data-quality tools and analytics integrations are distractors for database migration validation.
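A sketch of enabling validation on an existing task with boto3; the task ARN is a placeholder and only the validation section of the task settings is shown.

```python
import json

import boto3

dms = boto3.client("dms")

# Turn on validation for a full-load-plus-CDC task so DMS compares source and
# target rows while replication continues.
settings = {
    "ValidationSettings": {
        "EnableValidation": True,
        "ThreadCount": 5,
    }
}

dms.modify_replication_task(
    ReplicationTaskArn="arn:aws:dms:us-east-1:123456789012:task:EXAMPLETASK",
    ReplicationTaskSettings=json.dumps(settings),
)
```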
Question 11
A regional media analytics firm named BrightStream wants developers to be able to start and stop specific Amazon EC2 instances only between 08:00 and 18:00 on weekdays, and the platform team needs to see exactly which identity performed each change while also automating monthly patching across all instances. Which combination of AWS services and features should be used to meet these goals?
-
✓ B. IAM policy conditions using aws:CurrentTime with date operators for business-hour access, AWS CloudTrail for auditing, and AWS Systems Manager Patch Manager
The correct approach is to apply time-bound authorization at the IAM layer, capture API-level audit trails, and use a managed patching capability. IAM policy conditions using aws:CurrentTime with date operators for business-hour access, AWS CloudTrail for auditing, and AWS Systems Manager Patch Manager meets all three requirements. IAM policies can use aws:CurrentTime with DateLessThan/DateGreaterThan to constrain actions like StartInstances and StopInstances to specific hours and days. CloudTrail records API activity and the identity involved, enabling accountability. Patch Manager automates scanning and applying patches on a schedule across instances.
IAM policies with tag-based permissions, AWS CloudWatch Logs for change tracking, and AWS Systems Manager Patch Manager is incomplete because tag conditions do not enforce a time window, and CloudWatch Logs does not natively capture API caller identity the way CloudTrail does.
AWS Config, Amazon EventBridge Scheduler, and AWS Systems Manager State Manager is unsuitable since Config focuses on configuration snapshots and drift, EventBridge Scheduler cannot enforce permissions, and State Manager is not the primary patch automation feature.
AWS Systems Manager Session Manager, AWS CloudTrail, and Amazon EC2 Auto Scaling does not satisfy the time-gated authorization for EC2 API actions, and Auto Scaling is not used for patch orchestration.
When the requirement says who did what for AWS APIs, think CloudTrail. When you need allow only during these hours, use IAM policy time-based condition keys like aws:CurrentTime. For automated OS updates on instances, look for Systems Manager Patch Manager.
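A simplified boto3 sketch of such a policy, showing the aws:CurrentTime condition keys for a single business-hours window; the account ID, region, and dates are placeholders.

```python
import json

import boto3

iam = boto3.client("iam")

# Allow start/stop of EC2 instances only inside one 08:00-18:00 UTC window.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["ec2:StartInstances", "ec2:StopInstances"],
            "Resource": "arn:aws:ec2:us-east-1:123456789012:instance/*",
            "Condition": {
                "DateGreaterThan": {"aws:CurrentTime": "2025-06-02T08:00:00Z"},
                "DateLessThan": {"aws:CurrentTime": "2025-06-02T18:00:00Z"},
            },
        }
    ],
}

iam.create_policy(
    PolicyName="DevStartStopBusinessHours",
    PolicyDocument=json.dumps(policy),
)
```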
Question 12
Which Lake Formation features enable a central catalog, cross-service discovery, and fine-grained, tag-based governance for datasets from Amazon S3 and a JDBC source?
-
✓ B. Glue Data Catalog with LF tag-based access control + Lake Formation blueprints
Glue Data Catalog with LF tag-based access control + Lake Formation blueprints is correct because it delivers the central, cross-service metadata catalog (Glue Data Catalog), fine-grained governance via Lake Formation tag-based access control (LF-TBAC) for column/row-level and attribute-based permissions, and automated ingestion from both Amazon S3 and JDBC sources using Lake Formation blueprints. This combination directly addresses discoverability, centralized metadata, and department-level fine-grained permissions in a governed data lake.
The option IAM policies + Glue ETL jobs + AWS RAM is incorrect because IAM and RAM cannot provide Lake Formation’s column- and row-level controls or tag-based, cross-service enforcement, which are required for fine-grained governance.
The option Glue Crawlers + Lake Formation blueprints + IAM roles is incorrect since, while crawlers and blueprints support discovery and ingestion, IAM roles alone cannot implement Lake Formation’s fine-grained authorization model across analytics services.
The option Glue Data Catalog + Lake Formation permissions (no LF tags) + Glue jobs is incorrect because it omits LF-TBAC. Without tags, managing fine-grained, scalable, attribute-based permissions across many datasets and teams becomes difficult and does not meet the requirement for tag-based governance.
Map a centralized catalog to the Glue Data Catalog. Associate fine-grained, scalable governance with Lake Formation, especially LF-TBAC for tag-driven, cross-service permissions. For ingesting from S3 and JDBC into a governed lake, think Lake Formation blueprints or Glue-based ingestion integrated with Lake Formation permissions.
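A minimal boto3 sketch of the LF-TBAC flow, assuming hypothetical database, table, tag, and role names:

```python
import boto3

lf = boto3.client("lakeformation")

# 1. Create an LF-tag for department-level governance.
lf.create_lf_tag(TagKey="department", TagValues=["finance", "marketing"])

# 2. Attach the tag to a cataloged table.
lf.add_lf_tags_to_resource(
    Resource={"Table": {"DatabaseName": "sales_lake", "Name": "orders"}},
    LFTags=[{"TagKey": "department", "TagValues": ["finance"]}],
)

# 3. Grant SELECT to a role on every table carrying that tag value.
lf.grant_permissions(
    Principal={"DataLakePrincipalIdentifier": "arn:aws:iam::123456789012:role/FinanceAnalysts"},
    Resource={
        "LFTagPolicy": {
            "ResourceType": "TABLE",
            "Expression": [{"TagKey": "department", "TagValues": ["finance"]}],
        }
    },
    Permissions=["SELECT"],
)
```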
Question 13
At LumaRide Mobility, analysts run the same KPI dashboard query every 10 minutes against Amazon Redshift. The statement applies heavy aggregations and joins across about 28 TB of historical data in a fact_trips table, causing long runtimes. Which approach will yield the largest performance gain for this repeatedly executed query?
-
✓ C. Create a materialized view in Amazon Redshift with automatic refresh so the aggregations are precomputed and stored
The best choice is Create a materialized view in Amazon Redshift with automatic refresh so the aggregations are precomputed and stored.
Materialized views persist the results of expensive joins and aggregations and can auto-refresh incrementally, providing consistently faster performance for repeated dashboard queries.
Use Amazon Redshift Spectrum to read external tables in Amazon S3 instead of local Redshift storage is not optimal because Spectrum often performs worse for complex aggregations than local Redshift tables and does not precompute results.
Create a standard view that encapsulates the SQL and query the view when needed does not improve performance because it still executes the full query each time. It is only a saved SELECT.
Rely on the Amazon Redshift result cache by ensuring the exact same query text is executed each time can help when data is unchanged and queries are identical, but cache invalidation is common in frequently updated KPI pipelines, making it less effective than a materialized view.
For recurring, heavy aggregations in Redshift, think materialized views with auto-refresh. Regular views do not precompute, Spectrum targets external data access, and the result cache is opportunistic and invalidates on changes.
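As an illustration, the materialized view could be created through the Redshift Data API; the cluster, secret, and column names are placeholders, and eligibility for incremental refresh depends on the query.

```python
import boto3

redshift_data = boto3.client("redshift-data")

# Precompute the dashboard aggregation once and let Redshift keep it fresh.
sql = """
CREATE MATERIALIZED VIEW mv_daily_trip_kpis
AUTO REFRESH YES
AS
SELECT trip_date, city, COUNT(*) AS trips, SUM(fare_amount) AS revenue
FROM fact_trips
GROUP BY trip_date, city
"""

redshift_data.execute_statement(
    ClusterIdentifier="lumaride-dw",   # placeholder cluster name
    Database="prod",
    SecretArn="arn:aws:secretsmanager:us-east-1:123456789012:secret:redshift-creds",
    Sql=sql,
)
```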
Question 14
Which AWS Organizations policy enforces organization-wide maximum permissions when attached to the root or OUs?
-
✓ B. AWS Organizations Service Control Policies (SCPs)
AWS Organizations Service Control Policies (SCPs) are the correct choice because they provide preventive, organization-level guardrails by defining the maximum permissions that member accounts can have. When attached to the root or OUs, SCPs centrally constrain what IAM users and roles (including the root user) can do in those accounts, ensuring least-privilege boundaries across the organization.
The option AWS Control Tower is not correct because it is a governance orchestration service that sets up a landing zone and applies guardrails, but it uses SCPs underneath. It is not the direct policy mechanism that enforces maximum permissions.
AWS IAM Access Analyzer is incorrect because it is a detective service that analyzes resource policies for unintended access and does not enforce permissions.
IAM permissions boundaries are incorrect because they limit permissions for individual IAM principals within a single account and are not centrally attached at the organization root or OUs. They do not provide organization-wide guardrails.
When you see phrasing like maximum permissions, guardrails, and attach at the root or OU, think SCPs. Distinguish preventive controls (SCPs) from detective ones (Access Analyzer). Remember that Control Tower is an orchestration layer that leverages SCPs, and that permissions boundaries are per-principal IAM constructs, not org-wide policies.
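A small boto3 sketch of creating and attaching an SCP, using a placeholder OU ID and a simple deny statement for illustration:

```python
import json

import boto3

org = boto3.client("organizations")

# Guardrail that prevents member accounts from leaving the organization.
scp = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Deny",
            "Action": "organizations:LeaveOrganization",
            "Resource": "*",
        }
    ],
}

policy = org.create_policy(
    Content=json.dumps(scp),
    Description="Prevent member accounts from leaving the organization",
    Name="DenyLeaveOrganization",
    Type="SERVICE_CONTROL_POLICY",
)

# Attach the SCP to an OU (or the root) so it caps permissions for all accounts below it.
org.attach_policy(
    PolicyId=policy["Policy"]["PolicySummary"]["Id"],
    TargetId="ou-root-exampleid",   # placeholder OU ID
)
```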
Question 15
A data engineer at a regional logistics startup runs heavy analytics approximately every six weeks. For each cycle, they launch a new Amazon Redshift provisioned cluster, process queries for about three hours, export results and snapshots to an Amazon S3 bucket, and then delete the cluster. They want to keep these periodic analyses while avoiding capacity planning, patching, and lifecycle scripting for clusters. Which approach will achieve this with the least ongoing operational effort?
-
✓ B. Use Amazon Redshift Serverless to run the analytics on demand with automatic scaling
The best fit is Use Amazon Redshift Serverless to run the analytics on demand with automatic scaling.
Redshift Serverless eliminates cluster provisioning, sizing, patching, and shutdown workflows, which directly minimizes operational overhead for infrequent, bursty analytics. You pay only for the compute used during the runs, and scaling is handled automatically.
Use Amazon EventBridge Scheduler to trigger an AWS Step Functions workflow that creates a Redshift cluster, runs the jobs, copies data to S3, and then terminates the cluster still leaves you responsible for defining infrastructure templates, security, upgrades, and teardown logic across multiple services, which increases complexity.
Configure Redshift zero-ETL integrations to handle the batch analytics workload addresses data ingestion into Redshift but does not remove the need to manage compute or clusters for query execution, so it does not solve the stated problem.
Purchase Amazon Redshift reserved node offerings for the cluster to simplify operations targets long-running, steady workloads for cost savings and still requires you to manage provisioned clusters, which is misaligned with intermittent runs and the goal of minimal ops.
When the requirement emphasizes the least operational effort for intermittent analytics, think serverless data warehousing. Orchestration solutions still imply infrastructure management, and reserved capacity is for steady state usage rather than occasional batch runs.
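A one-time boto3 setup sketch for Redshift Serverless, with placeholder names and base capacity; subsequent analysis cycles simply connect and run queries.

```python
import boto3

rss = boto3.client("redshift-serverless")

# Namespace holds databases and users; workgroup provides the compute.
# There is no cluster to create, patch, or delete afterward.
rss.create_namespace(
    namespaceName="periodic-analytics",
    dbName="analytics",
)

rss.create_workgroup(
    workgroupName="periodic-analytics-wg",
    namespaceName="periodic-analytics",
    baseCapacity=32,   # base RPUs; compute scales automatically with the workload
)
```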
Question 16
How should you migrate an existing Hive metastore (about 6,400 tables) to a serverless, low-cost metadata catalog for EMR Spark and Hive on AWS?
-
✓ B. AWS Glue Data Catalog import for Hive metastore
The best approach is AWS Glue Data Catalog import for Hive metastore.
Glue Data Catalog is a fully managed, serverless metadata repository that integrates natively with Amazon EMR for Spark and Hive. It provides a supported path to import an existing Hive metastore, preserving table definitions and enabling centralized, low-cost catalog management without running database servers.
The option EMR cluster with self-hosted Hive metastore is incorrect because it requires operating a persistent cluster or external metastore services, which is not serverless and increases cost and operational burden.
The option Amazon RDS with AWS DMS for the metastore is also not serverless and introduces database administration and cost that Glue avoids.
The option AWS Glue Crawlers to rebuild tables from S3 is not a migration path. It attempts to infer schemas and may miss existing properties, partitions, or serde settings, and does not directly import the Hive metastore.
When a question emphasizes serverless, low cost, and centralized metadata for EMR/Hive, prioritize AWS Glue Data Catalog. If a choice involves managing RDS or an EMR-hosted metastore, it is typically not the serverless answer. Rebuilding metadata with crawlers or query services is not the same as migrating an existing metastore.
Question 17
A media analytics startup, StreamQuant, uses an Amazon Redshift warehouse that holds roughly nine years of event records. Compliance requires keeping all historical data, but analysts primarily query the most recent 45 days for near real-time dashboards. How can the team organize storage and queries to reduce cost while preserving fast performance for recent data?
-
✓ C. Adopt Amazon Redshift RA3 with managed storage, unload older partitions to Amazon S3 and query them with Redshift Spectrum while keeping hot data in the cluster
The best approach is to tier data by using Adopt Amazon Redshift RA3 with managed storage, unload older partitions to Amazon S3 and query them with Redshift Spectrum while keeping hot data in the cluster.
RA3 nodes separate compute from managed storage so you keep frequently accessed data local for speed while placing cold data in S3. Redshift Spectrum lets you query S3 through external tables without loading it back, minimizing cluster storage cost and preserving performance.
Use Dense Compute (DC2) nodes to store all historical and recent data in the Redshift cluster forces all data to reside on local SSDs, which is expensive for multi-year retention and lacks the elasticity and cost benefits of managed storage and S3 tiering.
Archive historical data in Amazon S3 Glacier and run active workloads on Amazon Redshift DS2 nodes is not viable because Glacier is designed for archival with slow retrieval and DS2 is an older generation that provides less favorable price-performance for modern analytics.
Move older records into an Amazon RDS database and keep only recent data in Amazon Redshift introduces an OLTP engine that is not optimized for analytical scans and does not integrate natively for federated querying of large historical datasets.
When you see hot vs. cold analytics data in Redshift, look for RA3 managed storage plus Redshift Spectrum for S3-based cold data. Avoid options that push analytics data to Glacier or OLTP engines like RDS.
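A hedged sketch of the tiering steps through the Redshift Data API, with placeholder identifiers, role ARNs, and paths; external tables for the unloaded files can then be defined in the Glue Data Catalog, for example with a crawler.

```python
import boto3

redshift_data = boto3.client("redshift-data")

# 1. Unload records older than 45 days to Parquet in S3, partitioned by date.
# 2. Create a Spectrum external schema backed by the Glue Data Catalog.
statements = [
    """
    UNLOAD ('SELECT * FROM events WHERE event_date < DATEADD(day, -45, CURRENT_DATE)')
    TO 's3://example-archive/events/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftSpectrumRole'
    FORMAT PARQUET
    PARTITION BY (event_date)
    """,
    """
    CREATE EXTERNAL SCHEMA IF NOT EXISTS spectrum_archive
    FROM DATA CATALOG DATABASE 'events_archive'
    IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftSpectrumRole'
    CREATE EXTERNAL DATABASE IF NOT EXISTS
    """,
]

for sql in statements:
    redshift_data.execute_statement(
        ClusterIdentifier="streamquant-ra3",   # placeholder cluster name
        Database="prod",
        SecretArn="arn:aws:secretsmanager:us-east-1:123456789012:secret:redshift-creds",
        Sql=sql,
    )
```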
Question 18
Which AWS services best ingest about 120 million daily clickstream events, enable fast search with aggregations, and deliver interactive dashboards?
-
✓ C. Amazon Kinesis Data Firehose, Amazon OpenSearch Service, Amazon QuickSight
Amazon Kinesis Data Firehose, Amazon OpenSearch Service, Amazon QuickSight is the best fit because Firehose provides managed, near real-time delivery of streaming events into OpenSearch indices for low-latency search and aggregations, and QuickSight offers interactive dashboards for analysts.
The option Amazon MSK, Amazon Redshift, Amazon OpenSearch Service adds Kafka complexity and centers analytics on a data warehouse rather than a search engine, which does not align with fast exploratory search needs.
The option Amazon Kinesis Data Streams, Amazon Athena, Amazon Managed Grafana omits a search index and relies on S3 queries via Athena, which increases latency and limits ad-hoc log search. Grafana is also less suited for analyst BI dashboards.
The option Amazon Kinesis Data Firehose, Amazon Redshift, Amazon QuickSight provides ingestion and BI, but Redshift is not optimized for search-first, schemaless log exploration with faceted aggregations.
When the requirement emphasizes fast exploratory search with aggregations over event logs, think OpenSearch. For fully managed streaming delivery into OpenSearch with minimal operations, think Kinesis Data Firehose. For analyst-facing dashboards, QuickSight is the expected BI service.
Question 19
A mobile gaming studio uses Amazon Kinesis Data Streams to collect gameplay and clickstream events from its apps. During limited-time tournaments, traffic can spike up to 12x within 15 minutes. The team wants the stream to scale with these surges automatically and avoid managing shard counts or scaling scripts. Which configuration should they choose?
-
✓ B. Use Kinesis Data Streams in on-demand capacity mode so the stream automatically scales with traffic
The correct choice is Use Kinesis Data Streams in on-demand capacity mode so the stream automatically scales with traffic.
On-demand capacity mode adjusts to unpredictable spikes without shard planning, providing automatic scaling as throughput changes.
Use Kinesis Data Streams with enhanced fan-out to boost consumer throughput during bursts is not sufficient because enhanced fan-out improves consumer read performance, not write capacity or shard scaling.
Migrate to Amazon Kinesis Data Firehose for automatic scaling to handle spikes changes the architecture and addresses delivery, not the ingestion stream’s shard scaling requirements.
Use Kinesis Data Streams in provisioned capacity and add or split shards manually during surges contradicts the requirement to avoid manual intervention and relies on operational management of shard counts.
When you see unpredictable spikes and a requirement for no shard management, think Kinesis Data Streams on-demand. Enhanced fan-out is for consumer read scaling only, and provisioned mode implies operational scaling overhead.
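For example, a new stream can be created in on-demand mode with boto3 (an existing provisioned stream can be switched with update_stream_mode); the stream name is a placeholder.

```python
import boto3

kinesis = boto3.client("kinesis")

# On-demand mode: no shard count to specify, capacity adjusts with traffic.
# An existing provisioned stream can be switched with kinesis.update_stream_mode.
kinesis.create_stream(
    StreamName="gameplay-events",
    StreamModeDetails={"StreamMode": "ON_DEMAND"},
)
```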
Question 20
In Amazon Kinesis Data Streams, how can a consumer restart and resume from its last committed sequence number to avoid reprocessing events?
-
✓ D. Kinesis Client Library with DynamoDB shard checkpoints
The correct choice is Kinesis Client Library with DynamoDB shard checkpoints.
KCL manages shard leases and persists per-shard sequence-number checkpoints in Amazon DynamoDB. When the consumer restarts, it reads the stored checkpoint and resumes from the exact committed position, preventing rereads of already processed records.
The option AWS Lambda trigger with stateless processing is incorrect because Lambda integrations with Kinesis use at-least-once delivery and can produce retries and duplicates. Lambda does not maintain a precise, persistent sequence checkpoint for your consumer. You would need to implement your own state management.
The option Enable enhanced fan-out and use LATEST is incorrect. Enhanced fan-out improves read throughput and isolation, and LATEST only sets the initial starting position. Neither provides persistent checkpointing across restarts.
The option SubscribeToShard with enhanced fan-out. Rely on the stream to track offsets is incorrect because Kinesis Data Streams does not track consumer offsets for you. Without KCL (or your own persistent checkpoint store), you cannot guarantee an exact resume point.
When you see keywords like “resume at exact position,” “sequence number,” or “checkpoint,” think KCL + DynamoDB. Features like enhanced fan-out, on-demand capacity, or retention address throughput and scalability, not consumer state. For Lambda consumers, design for idempotency and duplicates rather than exact offset control.
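To make the checkpoint-and-resume pattern concrete, here is a simplified boto3 sketch of what KCL automates for you (lease management, multi-shard coordination, and retries are omitted); the table, stream, and shard names are placeholders.

```python
import boto3

kinesis = boto3.client("kinesis")
dynamodb = boto3.resource("dynamodb")
checkpoints = dynamodb.Table("consumer-checkpoints")   # hypothetical checkpoint table

STREAM = "gameplay-events"
SHARD = "shardId-000000000000"

def load_checkpoint(shard_id):
    """Return the last committed sequence number for a shard, if any."""
    item = checkpoints.get_item(Key={"shard_id": shard_id}).get("Item")
    return item["sequence_number"] if item else None

def save_checkpoint(shard_id, sequence_number):
    """Persist progress so a restart resumes after this record."""
    checkpoints.put_item(Item={"shard_id": shard_id, "sequence_number": sequence_number})

# Resume after the last committed position, or from the start if none exists.
last_seq = load_checkpoint(SHARD)
iterator = kinesis.get_shard_iterator(
    StreamName=STREAM,
    ShardId=SHARD,
    ShardIteratorType="AFTER_SEQUENCE_NUMBER" if last_seq else "TRIM_HORIZON",
    **({"StartingSequenceNumber": last_seq} if last_seq else {}),
)["ShardIterator"]

batch = kinesis.get_records(ShardIterator=iterator, Limit=100)
for record in batch["Records"]:
    # process(record) ... then commit progress
    save_checkpoint(SHARD, record["SequenceNumber"])
```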
Cameron McKenzie is an AWS Certified AI Practitioner, Machine Learning Engineer, Copilot Expert, Solutions Architect and author of many popular books in the software development and Cloud Computing space. His growing YouTube channel, which trains developers in Java, Spring, AI and ML, has well over 30,000 subscribers.