Certified AWS Data Engineer Exam Dumps and Braindumps

All questions come from my AWS Data Engineer Udemy course and certificationexams.pro

AWS Data Engineer Exam Topics Tests

Despite the title of this article, this is not an AWS Data Engineer Certification Braindump in the traditional sense. I do not believe in cheating.

Traditionally, the term “braindump” referred to someone taking an exam, memorizing the questions, and sharing them online for others to use. That practice is unethical and violates the AWS certification agreement. It fosters no integrity, no real learning, and no professional growth.

This is not an AWS Data Engineer Exam Dump. All of these questions come from my AWS Data Engineer course and from the certificationexams.pro website, which offers hundreds of free AWS Data Engineer Associate Practice Questions.

AWS Exam Simulator

Each question has been carefully written to align with the official AWS Certified Data Engineer exam objectives. They mirror the tone, logic, and technical depth of real AWS scenarios, but none are copied from the actual test. Every question is designed to help you learn, reason, and master AWS concepts such as data modeling, governance, and data pipeline optimization in the right way.

If you can answer these questions and understand why the incorrect options are wrong, you will not only pass the AWS Data Engineer Associate exam but also gain a solid understanding of how to design, manage, and optimize data pipelines effectively. So if you want to call this your AWS Data Engineer Exam Dump, that is fine, but remember that every question here is built to teach, not to cheat.

Each item includes detailed explanations, realistic examples, and insights that help you think like a data engineer during the exam. Study with focus, practice consistently, and approach your certification with integrity. Success as an AWS Data Engineer comes not from memorizing answers but from understanding how data ingestion, transformation, and governance work together to deliver business value.

Use the AWS Data Engineer Associate Exam Simulator and AWS Data Engineer Sample Questions to prepare effectively and move closer to earning your certification.

AWS Data Engineer Exam Dumps and Braindumps

Question 1

A compliance team at a regional credit union produces regulatory audit summaries only three times each year. They orchestrate a fault-tolerant report workflow with AWS Step Functions that includes retries, and the source data resides in Amazon S3. The dataset is about 300 TB and must be accessible with millisecond latency when needed. Which Amazon S3 storage class will minimize cost while meeting these needs?

  • ❏ A. Amazon S3 Intelligent-Tiering

  • ❏ B. Amazon S3 Glacier Flexible Retrieval

  • ❏ C. Amazon S3 Standard-IA

  • ❏ D. Amazon S3 Standard

Question 2

Which approaches enforce Region-scoped access to an Amazon S3 data lake with minimal maintenance? (Choose 2)

  • ❏ A. AWS Lake Formation LF-tags for Region

  • ❏ B. S3 Access Points per Region with per-team policies

  • ❏ C. IAM ABAC with S3 and principal Region tags

  • ❏ D. AWS Organizations SCP restricting S3 to one Region

  • ❏ E. S3 VPC gateway endpoints per Region

Question 3

A data engineer at a fintech analytics firm must run a series of Amazon Athena queries against data in Amazon S3 every day. Some of the queries can run for more than 25 minutes. The team wants the most economical way to kick off each query and reliably wait for it to finish before starting the next one. Which approaches should they implement? (Choose 2)

  • ❏ A. Build an AWS Step Functions state machine that invokes a Lambda function to submit the Athena query and then uses a Wait state to poll get_query_execution until completion before triggering the next query

  • ❏ B. Use an AWS Glue Python shell job to call the Athena start_query_execution API for each query

  • ❏ C. Use an AWS Lambda function to programmatically call the Athena start_query_execution API for each query

  • ❏ D. Create an AWS Step Functions workflow that starts an AWS Glue Python shell job and then uses a Wait state to poll get_query_execution until the query completes before proceeding

  • ❏ E. Use Amazon Managed Workflows for Apache Airflow with a sensor to monitor Athena query completion and trigger subsequent tasks

Question 4

Which SageMaker Feature Store storage should be used to serve features for online inference in under 15 ms?

  • ❏ A. Amazon DynamoDB

  • ❏ B. Feature Store online store

  • ❏ C. Offline store

  • ❏ D. Amazon ElastiCache for Redis

Question 5

A mobile gaming studio is launching a new leaderboard web service that experiences highly irregular traffic with brief surges lasting a few minutes at a time. They need a relational database that can automatically scale capacity up and down and reduce costs by charging only for what is actually used. Which AWS managed database should they choose?

  • ❏ A. Amazon DynamoDB on-demand

  • ❏ B. Amazon Aurora Serverless v2

  • ❏ C. Amazon Redshift with elastic resize

  • ❏ D. Amazon RDS for MySQL with Read Replicas

Question 6

Which combination of AWS features restricts viewers to specified IP ranges and keeps an S3 origin private when serving static files through CloudFront? (Choose 2)

  • ❏ A. Attach an AWS WAF web ACL with an allow-list IP set to the CloudFront distribution

  • ❏ B. Create a VPC network ACL allowing those IPs and associate it with CloudFront

  • ❏ C. Configure CloudFront origin access control and allow only that principal in the S3 bucket policy

  • ❏ D. Attach an AWS WAF web ACL with IP match to the S3 bucket policy

  • ❏ E. Use an S3 bucket policy with aws:SourceIp for the viewer IP ranges

Question 7

A fintech startup, LumaPay, organizes its Amazon S3 data lake using Hive-style partitions in object key paths such as s3://lumapay-datalake/ingest/year=2026/month=03/day=09. The team needs the AWS Glue Data Catalog to reflect new partitions as soon as files are written so that analytics jobs can query the latest data with the least possible delay. Which approach should they use?

  • ❏ A. Schedule an AWS Glue crawler to run at the start of every hour

  • ❏ B. Run the MSCK REPAIR TABLE command after each daily load

  • ❏ C. Have the writer job call the AWS Glue create_partition API via Boto3 immediately after writing to Amazon S3

  • ❏ D. Use Amazon EventBridge to start an AWS Glue crawler on S3 ObjectCreated events

Question 8

How can you store and join live Kinesis Data Streams events with the last 12 hours of data in Amazon Redshift Serverless with under 45-second latency and minimal operations?

  • ❏ A. Amazon Managed Service for Apache Flink writing to Redshift via JDBC

  • ❏ B. Kinesis Data Firehose delivery to Amazon Redshift

  • ❏ C. Amazon Redshift streaming ingestion from Kinesis Data Streams into a materialized view

  • ❏ D. Land to Amazon S3 and query with Redshift Spectrum

Question 9

A biotech analytics firm plans to run a web portal on an Auto Scaling group that can scale up to 18 Amazon EC2 instances across two Availability Zones. The application needs a shared file system that all instances can mount simultaneously for read and write operations, with high availability and elastic capacity. Which AWS storage service should the team choose?

  • ❏ A. Amazon S3

  • ❏ B. Amazon EBS Multi-Attach

  • ❏ C. Amazon Elastic File System (EFS)

  • ❏ D. Amazon EC2 instance store

Question 10

How do you configure cross-Region copy of KMS-encrypted Amazon Redshift automated and manual snapshots with 90-day retention?

  • ❏ A. Create snapshot copy grant in the source Region with its KMS key. Enable copy from the destination

  • ❏ B. Create snapshot copy grant in the destination using its KMS key. Enable cross-Region copy on the source with 90-day retention

  • ❏ C. Use AWS Backup cross-Region copy for Redshift

  • ❏ D. Use a multi-Region KMS key in the source. No snapshot copy grant needed

Question 11

A retail analytics startup needs to define and roll out AWS Glue job configurations as infrastructure as code, and it wants new objects in an Amazon S3 bucket to invoke an AWS Lambda function that starts those Glue jobs. Which AWS service or tool is the most suitable to implement this with minimal boilerplate and native event wiring?

  • ❏ A. AWS CloudFormation

  • ❏ B. AWS CDK

  • ❏ C. AWS SAM (Serverless Application Model)

  • ❏ D. AWS CodeDeploy

Question 12

How can you deliver only DynamoDB item updates to an existing Amazon Redshift cluster in near real time (under 60 seconds) at about 300,000 updates per day with bursts up to 9,000 per minute?

  • ❏ A. AWS DMS change replication from DynamoDB to Amazon Redshift

  • ❏ B. Enable DynamoDB Streams (new image) and use Lambda to send records to a Kinesis Data Firehose delivery stream for Amazon Redshift

  • ❏ C. Use DynamoDB Streams to land updates in Amazon S3 via Lambda and query with Redshift Spectrum

  • ❏ D. Use an AWS Glue streaming job to consume DynamoDB Streams and write to Amazon Redshift

Question 13

A genomics analytics startup runs a weekly Spark batch on Amazon EC2 that executes for about 48 hours and produces roughly 12 TB of temporary shuffle and scratch files per run. The team needs the lowest-latency reads and writes to data that resides directly on the instance while jobs are running, and they want to keep costs low for this short-lived workload. Which EC2 storage option should they choose?

  • ❏ A. Amazon EBS General Purpose SSD (gp3)

  • ❏ B. Amazon FSx for Lustre

  • ❏ C. Instance Store Volumes

  • ❏ D. Amazon S3 Standard

Question 14

When copying an encrypted EBS snapshot to another Region, how do you ensure it stays encrypted with a customer managed KMS key?

  • ❏ A. Enable EBS encryption by default in the destination Region and copy without specifying a key

  • ❏ B. Choose a customer managed KMS key in the destination Region during the copy

  • ❏ C. Reuse the same KMS key from the source Region

  • ❏ D. Create a grant on the source KMS key and reference it during the copy

Question 15

SkyTrail Studios is launching a real-time trivia app that sends answer submissions from mobile devices to a backend that computes standings and updates a public leaderboard. During live shows, traffic can spike to 1.5 million events per minute for about 25 minutes. The team must process submissions in the order they arrive for each player, persist results in a highly available database, and keep operations work to a minimum. Which architecture should they adopt?

  • ❏ A. Send score events to an Amazon SQS FIFO queue, process them with an Auto Scaling group of Amazon EC2 workers, and persist results in Amazon RDS for MySQL

  • ❏ B. Ingest updates with Amazon MSK, consume them on an Amazon EC2 fleet, and store processed items in Amazon DynamoDB

  • ❏ C. Stream updates into Amazon Kinesis Data Streams, invoke AWS Lambda to process records, and write results to Amazon DynamoDB

  • ❏ D. Publish updates to Amazon EventBridge, trigger AWS Lambda to process, and store results in Amazon Aurora Serverless v2

Question 16

Which AWS service lets multiple consumers read the same ordered clickstream and replay records for up to 30 days?

  • ❏ A. Amazon SQS

  • ❏ B. Kinesis Data Streams

  • ❏ C. Amazon DynamoDB Streams

  • ❏ D. Amazon Kinesis Data Firehose

Question 17

A video streaming startup stores title metadata in Amazon DynamoDB and wants to improve query performance for several access patterns. The team needs to fetch items by titleId as well as by genre and by studio, and they want the data to stay consistent across these queries. Which combination of secondary indexes should they configure to satisfy these access patterns?

  • ❏ A. Local Secondary Index (LSI) with titleId as the partition key and studio as the sort key. Global Secondary Index (GSI) with genre as the partition key

  • ❏ B. Amazon OpenSearch Service

  • ❏ C. Global Secondary Index (GSI) with genre as the partition key and another GSI with studio as the partition key

  • ❏ D. Two Global Secondary Indexes (GSIs), both using titleId as the partition key but with different sort keys for genre and studio

Question 18

Which methods automate consistent EMR cluster initialization with required libraries at launch? (Choose 2)

  • ❏ A. AWS Systems Manager Run Command after startup

  • ❏ B. Use EMR bootstrap actions from scripts in S3

  • ❏ C. Store scripts in DynamoDB and trigger with Lambda

  • ❏ D. Launch with an EMR custom AMI that includes the libs

  • ❏ E. Add a first EMR step to install packages

Question 19

A media analytics startup runs a serverless data integration stack on AWS Glue. The data engineer must regularly crawl a Microsoft SQL Server database called OpsDB and its table txn_logs_2024 every 12 hours, then orchestrate the end-to-end extract, transform, and load steps so that the processed data lands in an Amazon S3 bucket named s3://media-raw-zone. Which AWS service or feature will most cost-effectively coordinate the crawler and ETL jobs as a single pipeline?

  • ❏ A. AWS Step Functions

  • ❏ B. AWS Glue workflows

  • ❏ C. AWS Glue DataBrew

  • ❏ D. AWS Glue Studio

Question 20

Which AWS Glue transform can probabilistically match and deduplicate records across datasets without a shared key?

  • ❏ A. AWS Glue DataBrew

  • ❏ B. AWS Entity Resolution

  • ❏ C. AWS Glue FindMatches transform

  • ❏ D. AWS Glue Relationalize

Question 21

At Riverton Books, analysts run Amazon Athena queries that join a very large fact_orders table with a much smaller dim_stores table using an equality condition. The fact table has roughly 75 million rows while the lookup table has only a few thousand rows, and the join is performing poorly. What change to the join should you make to improve performance without altering the results?

  • ❏ A. Place the small table on the left and the big table on the right

  • ❏ B. Switch the Athena workgroup to the newest engine release

  • ❏ C. Move the larger dataset to the left side of the join and the smaller table to the right side

  • ❏ D. Partition the small lookup table by the join key

Question 22

Which AWS services should be combined to automatically detect and mask sensitive columns in S3 and enforce role-based access so analysts get masked data while a preprocessing role can read raw data? (Choose 2)

  • ❏ A. Amazon Macie

  • ❏ B. S3 Object Lambda with Lambda redaction

  • ❏ C. AWS Glue DataBrew

  • ❏ D. IAM policies on S3 prefixes

  • ❏ E. Amazon Comprehend

Question 23

A sports analytics startup ingests real-time clickstream and playback telemetry that can spike to 180,000 events per minute and wants to use Spark to transform the data before landing curated outputs in an Amazon S3 data lake. The team needs a managed and cost effective ETL service that can scale automatically with changing load and also allows per job compute tuning to control costs. Which service and configuration should they choose?

  • ❏ A. Amazon EMR with manually provisioned Spark clusters

  • ❏ B. AWS Glue Crawler

  • ❏ C. AWS Glue with configurable DPUs for Spark jobs and optional autoscaling

  • ❏ D. Amazon Kinesis Data Analytics

Question 24

Which AWS service pair provides streaming ingestion and stateful real-time analytics for anomaly detection under 500 ms?

  • ❏ A. Amazon Data Firehose and Amazon Redshift

  • ❏ B. Kinesis Data Streams plus Amazon Managed Service for Apache Flink

  • ❏ C. AWS Glue Streaming and Amazon S3

  • ❏ D. Amazon Kinesis Data Streams with AWS Lambda

Question 25

A logistics analytics company, ParcelMetrics, runs several Amazon ECS task types on Amazon EC2 instances within a shared ECS cluster. Each task must write its result files and state to a common store that is accessible by all tasks. Every run produces roughly 35 MB per task, as many as 500 tasks can run at once, and even with ongoing archiving the total storage footprint is expected to stay below 800 GB. Which storage approach will best support sustained high-frequency reads and writes for this workload?

  • ❏ A. Create a shared Amazon DynamoDB table accessible by all ECS tasks

  • ❏ B. Use Amazon EFS in Bursting Throughput mode

  • ❏ C. Use Amazon EFS in Provisioned Throughput mode

  • ❏ D. Mount a single Amazon EBS volume to the ECS cluster instances

AWS Data Engineer Associate Exam Dumps and Braindumps Answered

Question 1

A compliance team at a regional credit union produces regulatory audit summaries only three times each year. They orchestrate a fault-tolerant report workflow with AWS Step Functions that includes retries, and the source data resides in Amazon S3. The dataset is about 300 TB and must be accessible with millisecond latency when needed. Which Amazon S3 storage class will minimize cost while meeting these needs?

  • ✓ C. Amazon S3 Standard-IA

The correct choice is Amazon S3 Standard-IA.

The team needs millisecond retrieval but reads the data only a few times per year, so Standard-IA’s lower storage price with per-GB retrieval fees is typically the most economical, and it offers the same low-latency performance characteristics as S3 Standard. With retries in the Step Functions workflow, the 99.9% availability of Standard-IA is acceptable for batch-style report generation.

Amazon S3 Intelligent-Tiering is ideal when access patterns are unpredictable, but here the rare-access pattern is known, making the monitoring and automation charges unnecessary overhead compared to Standard-IA.

Amazon S3 Standard provides millisecond access but is priced for frequent access, so at hundreds of terabytes it would cost more than Standard-IA for this seldom-accessed data.

Amazon S3 Glacier Flexible Retrieval does not meet the millisecond latency requirement because restores typically take minutes to hours, so it is unsuitable for on-demand report builds.

When data is accessed rarely but must be retrieved with millisecond latency, look to S3 Standard-IA over S3 Standard or archival classes, and consider S3 Intelligent-Tiering only if access patterns are unknown or change over time.

Question 2

Which approaches enforce Region-scoped access to an Amazon S3 data lake with minimal maintenance? (Choose 2)

  • ✓ A. AWS Lake Formation LF-tags for Region

  • ✓ C. IAM ABAC with S3 and principal Region tags

AWS Lake Formation LF-tags for Region and IAM ABAC with S3 and principal Region tags both provide scalable, low-ops ways to enforce Region-scoped access. Lake Formation centralizes fine-grained permissions over S3-backed tables using LF-tags, enabling consistent Region scoping across analytics services with minimal policy sprawl. ABAC with S3 and principal tags uses IAM condition keys to automatically grant or deny based on matching Region tags, reducing the need to manage many distinct policies.
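
For the ABAC half, a minimal sketch of what such a tag-matching policy could look like is shown below in Python with boto3. The role name, bucket name, and the Region tag key are assumptions for illustration, not values from the question.

```python
import json
import boto3

# Hypothetical ABAC policy: allow reads only when the object's Region tag
# matches the Region tag on the calling principal.
abac_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowReadWhenRegionTagsMatch",
            "Effect": "Allow",
            "Action": "s3:GetObject",
            "Resource": "arn:aws:s3:::example-datalake/*",
            "Condition": {
                "StringEquals": {
                    "s3:ExistingObjectTag/Region": "${aws:PrincipalTag/Region}"
                }
            },
        }
    ],
}

iam = boto3.client("iam")
iam.put_role_policy(
    RoleName="example-analyst-role",          # assumed role name
    PolicyName="region-scoped-s3-access",
    PolicyDocument=json.dumps(abac_policy),
)
```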

The option S3 Access Points per Region with per-team policies is incorrect because it increases the number of access points and policies to manage and does not inherently enforce Region alignment automatically.

The option AWS Organizations SCP restricting S3 to one Region is incorrect because SCPs are coarse-grained guardrails that cannot express dataset-level Region constraints, and blocking S3 API calls outside a single Region would affect every workload in the account rather than scoping access to specific datasets.

The option S3 VPC gateway endpoints per Region is incorrect because endpoints address networking, not authorization, and do not enforce object-level Region-based access.

On the exam, look for solutions that are tag-driven and centralized for governance. Keywords like ABAC, LF-tags, and fine-grained permissions usually indicate scalable and low-maintenance authorization patterns. Be cautious of options that only change network paths or that require many distinct resource policies, as these typically add operational burden without solving authorization.

Question 3

A data engineer at a fintech analytics firm must run a series of Amazon Athena queries against data in Amazon S3 every day. Some of the queries can run for more than 25 minutes. The team wants the most economical way to kick off each query and reliably wait for it to finish before starting the next one. Which approaches should they implement? (Choose 2)

  • ✓ A. Build an AWS Step Functions state machine that invokes a Lambda function to submit the Athena query and then uses a Wait state to poll get_query_execution until completion before triggering the next query

  • ✓ C. Use an AWS Lambda function to programmatically call the Athena start_query_execution API for each query

The most cost-effective pattern is to submit queries asynchronously and use a serverless orchestrator to wait and advance only when a query completes. Use an AWS Lambda function to programmatically call the Athena start_query_execution API for each query cheaply starts queries without holding compute for the full runtime. Build an AWS Step Functions state machine that invokes a Lambda function to submit the Athena query and then uses a Wait state to poll get_query_execution until completion before triggering the next query adds low-cost, reliable coordination for long-running queries that may exceed Lambda’s timeout.
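
A minimal sketch of the two Lambda handlers such a state machine could invoke is shown below using boto3. The database name, query text, and results bucket are placeholders, and the Wait/Choice loop itself lives in the Step Functions definition.

```python
import boto3

athena = boto3.client("athena")

def submit_query(event, context):
    """First task: start the Athena query asynchronously and return its ID."""
    response = athena.start_query_execution(
        QueryString="SELECT * FROM daily_trades WHERE trade_date = current_date",
        QueryExecutionContext={"Database": "analytics_db"},   # assumed database
        ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},
    )
    return {"QueryExecutionId": response["QueryExecutionId"]}

def check_query(event, context):
    """Polling task: called after a Wait state to report the current status."""
    state = athena.get_query_execution(
        QueryExecutionId=event["QueryExecutionId"]
    )["QueryExecution"]["Status"]["State"]
    # A Choice state loops back to Wait until state is SUCCEEDED, FAILED, or CANCELLED.
    return {"QueryExecutionId": event["QueryExecutionId"], "State": state}
```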

Use an AWS Glue Python shell job to call the Athena start_query_execution API for each query is less cost-effective because Glue Python shell jobs are billed by DPU minutes even when only making API calls.

Create an AWS Step Functions workflow that starts an AWS Glue Python shell job and then uses a Wait state to poll get_query_execution until the query completes before proceeding works functionally but remains more expensive than Lambda plus Step Functions.

Use Amazon Managed Workflows for Apache Airflow with a sensor to monitor Athena query completion and trigger subsequent tasks is operationally heavier and incurs always-on environment costs, which is not the cheapest option for simple daily coordination.

For long-running asynchronous operations like Athena queries, submit with a lightweight client (Lambda) and orchestrate with Step Functions Wait and polling, avoiding long-lived compute such as Glue jobs or managed Airflow when cost is a priority.

Question 4

Which SageMaker Feature Store storage should be used to serve features for online inference in under 15 ms?

  • ✓ B. Feature Store online store

Feature Store online store is correct because it is the SageMaker Feature Store tier designed for real-time inference, delivering millisecond-level reads from feature groups for online scoring.
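
For context, a real-time read from the online store goes through the sagemaker-featurestore-runtime client, as in the short sketch below. The feature group name and record identifier are illustrative.

```python
import boto3

runtime = boto3.client("sagemaker-featurestore-runtime")

# Fetch the latest feature values for one record at inference time.
record = runtime.get_record(
    FeatureGroupName="customer-features",            # assumed feature group
    RecordIdentifierValueAsString="customer-1234",   # assumed record ID
)
features = {f["FeatureName"]: f["ValueAsString"] for f in record["Record"]}
print(features)
```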

Amazon DynamoDB is incorrect because, while fast, it is not the managed serving layer of SageMaker Feature Store and would require custom pipelines and governance to maintain feature consistency and lineage.

Offline store is incorrect because it is optimized for batch and historical analytics on Amazon S3, with latency unsuitable for online inference.

Amazon ElastiCache for Redis is incorrect because it is an external cache, not the Feature Store online serving option, and would add custom integration overhead and potential drift between online and offline data.

When you see strict real-time latency for feature retrieval (for example, sub-10–20 ms) tied to SageMaker Feature Store, choose the online store. For training, batch scoring, or analytics, choose the offline store. Watch for wording like online inference vs. batch analytics to quickly map to the correct store.

Question 5

A mobile gaming studio is launching a new leaderboard web service that experiences highly irregular traffic with brief surges lasting a few minutes at a time. They need a relational database that can automatically scale capacity up and down and reduce costs by charging only for what is actually used. Which AWS managed database should they choose?

  • ✓ B. Amazon Aurora Serverless v2

Amazon Aurora Serverless v2 is the best fit because it is a relational engine that automatically scales database capacity in fine-grained units and bills per second, making it ideal for intermittent, spiky workloads while optimizing cost.

Amazon DynamoDB on-demand is built for NoSQL workloads, so it does not meet the explicit requirement for a relational database.

Amazon RDS for MySQL with Read Replicas can scale read traffic but cannot automatically adjust compute or write capacity and still requires instance sizing and manual scaling.

Amazon Redshift with elastic resize targets analytical workloads and is not intended for operational relational use cases that need rapid, on-demand scaling for brief bursts.

For unpredictable relational workloads requiring pay-per-use and automatic scaling, think Aurora Serverless. For unpredictable NoSQL workloads, think DynamoDB on-demand.

Question 6

Which combination of AWS features restricts viewers to specified IP ranges and keeps an S3 origin private when serving static files through CloudFront? (Choose 2)

  • ✓ A. Attach an AWS WAF web ACL with an allow-list IP set to the CloudFront distribution

  • ✓ C. Configure CloudFront origin access control and allow only that principal in the S3 bucket policy

The correct combination is to enforce the viewer IP allow-list at CloudFront and to keep the S3 origin private with a CloudFront-origin identity. Use Attach an AWS WAF web ACL with an allow-list IP set to the CloudFront distribution to permit only the specified source IP ranges to reach the distribution, and use Configure CloudFront origin access control and allow only that principal in the S3 bucket policy so the S3 bucket denies public access and only accepts requests that originate from CloudFront.
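
The origin side of this pattern can be expressed as a bucket policy that trusts only the CloudFront service principal for a specific distribution, roughly as in the boto3 sketch below. The account ID, bucket name, and distribution ID are placeholders.

```python
import json
import boto3

# Allow only requests signed by CloudFront (OAC) for this one distribution.
bucket_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowCloudFrontServicePrincipalReadOnly",
            "Effect": "Allow",
            "Principal": {"Service": "cloudfront.amazonaws.com"},
            "Action": "s3:GetObject",
            "Resource": "arn:aws:s3:::example-static-assets/*",
            "Condition": {
                "StringEquals": {
                    "AWS:SourceArn": "arn:aws:cloudfront::111122223333:distribution/EDFDVBD6EXAMPLE"
                }
            },
        }
    ],
}

s3 = boto3.client("s3")
s3.put_bucket_policy(Bucket="example-static-assets", Policy=json.dumps(bucket_policy))
```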

Create a VPC network ACL allowing those IPs and associate it with CloudFront is invalid because NACLs are scoped to VPC subnets and do not apply to the global CloudFront service.

Attach an AWS WAF web ACL with IP match to the S3 bucket policy is impossible since AWS WAF cannot attach to S3.

Use an S3 bucket policy with aws:SourceIp for the viewer IP ranges fails for CloudFront origins because S3 evaluates the request from CloudFront or the OAC principal, not the end-user viewer IP.

Separate viewer-level controls from origin privacy. Apply IP allow-lists with AWS WAF on the CloudFront distribution, and secure S3 origins with OAC (or OAI, though OAC is the modern choice). Remember that S3 does not see viewer IPs when requests come via CloudFront, and CloudFront does not support security groups or VPC NACLs.

Question 7

A fintech startup, LumaPay, organizes its Amazon S3 data lake using Hive-style partitions in object key paths such as s3://lumapay-datalake/ingest/year=2026/month=03/day=09. The team needs the AWS Glue Data Catalog to reflect new partitions as soon as files are written so that analytics jobs can query the latest data with the least possible delay. Which approach should they use?

  • ✓ C. Have the writer job call the AWS Glue create_partition API via Boto3 immediately after writing to Amazon S3

The lowest-latency method is to register partitions as part of the write path, so the catalog reflects changes immediately. When the ingestion process writes new partitioned objects, calling the AWS Glue CreatePartition API avoids waiting for scans or schedules.

Have the writer job call the AWS Glue create_partition API via Boto3 immediately after writing to Amazon S3 is correct because it synchronously updates the Glue Data Catalog as data lands, eliminating crawler or repair delays.
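
A minimal writer-side sketch using boto3 might look like the following. The catalog database, table name, and Parquet format details are assumptions for illustration.

```python
import boto3

glue = boto3.client("glue")

def register_partition(year: str, month: str, day: str) -> None:
    """Register the new partition right after the writer finishes its S3 upload."""
    location = f"s3://lumapay-datalake/ingest/year={year}/month={month}/day={day}/"
    glue.create_partition(
        DatabaseName="lakehouse_db",      # assumed catalog database
        TableName="ingest_events",        # assumed catalog table
        PartitionInput={
            "Values": [year, month, day],
            "StorageDescriptor": {
                "Location": location,
                "InputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat",
                "OutputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat",
                "SerdeInfo": {
                    "SerializationLibrary": "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe"
                },
            },
        },
    )

register_partition("2026", "03", "09")
```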

Schedule an AWS Glue crawler to run at the start of every hour is slower because it introduces a schedule delay and the crawler itself can take time to analyze paths, so recent partitions are not instantly queryable.

Run the MSCK REPAIR TABLE command after each daily load is batch-oriented and forces a metadata scan to discover partitions, which adds latency and unnecessary cost compared to direct registration.

Use Amazon EventBridge to start an AWS Glue crawler on S3 ObjectCreated events reduces waiting but still relies on a crawler run that can take minutes, so it is not the least-latency option.

When a question emphasizes least latency or immediate partition visibility, prefer producer-side partition registration in the Glue Data Catalog over crawlers or MSCK REPAIR TABLE.

Question 8

How can you store and join live Kinesis Data Streams events with the last 12 hours of data in Amazon Redshift Serverless with under 45-second latency and minimal operations?

  • ✓ C. Amazon Redshift streaming ingestion from Kinesis Data Streams into a materialized view

The correct choice is Amazon Redshift streaming ingestion from Kinesis Data Streams into a materialized view.

This feature natively ingests events from Kinesis Data Streams into a Redshift materialized view with seconds-level latency and minimal setup. Because the data is in Redshift, you can immediately join it with existing tables while meeting low-latency and operational simplicity goals in Redshift Serverless.
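
As a rough sketch, the one-time setup could be submitted through the Redshift Data API as below. The workgroup, database, IAM role ARN, stream name, and selected metadata columns are assumptions; check the Redshift streaming ingestion documentation for the exact DDL your stream needs.

```python
import boto3

rsd = boto3.client("redshift-data")

# Map the Kinesis stream into Redshift, then define an auto-refreshing
# materialized view over it (DDL shown here is illustrative).
ddl_statements = [
    """
    CREATE EXTERNAL SCHEMA kinesis_src
    FROM KINESIS
    IAM_ROLE 'arn:aws:iam::111122223333:role/redshift-streaming-role'
    """,
    """
    CREATE MATERIALIZED VIEW live_events AUTO REFRESH YES AS
    SELECT approximate_arrival_timestamp,
           partition_key,
           sequence_number,
           kinesis_data
    FROM kinesis_src."clickstream-events"
    """,
]

for statement in ddl_statements:
    rsd.execute_statement(
        WorkgroupName="analytics-serverless",   # assumed Redshift Serverless workgroup
        Database="dev",
        Sql=statement,
    )
```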

The option Kinesis Data Firehose delivery to Amazon Redshift introduces an extra hop and typically uses S3 staging with micro-batch COPY, which adds latency and components compared to native streaming ingestion.

The option Land to Amazon S3 and query with Redshift Spectrum keeps data external and increases latency. It also does not satisfy the requirement to store the streaming events in Redshift.

The option Amazon Managed Service for Apache Flink writing to Redshift via JDBC can work but requires custom application code, connector management, checkpoints, scaling, and operational overhead, so it is not the least-effort approach.

When you see keywords like minimal operations and sub-minute latency for joining Kinesis streams with Redshift tables, look for Redshift streaming ingestion into a materialized view. If the data must reside in Redshift for joins, avoid Spectrum-only solutions.

Question 9

A biotech analytics firm plans to run a web portal on an Auto Scaling group that can scale up to 18 Amazon EC2 instances across two Availability Zones. The application needs a shared file system that all instances can mount simultaneously for read and write operations, with high availability and elastic capacity. Which AWS storage service should the team choose?

  • ✓ C. Amazon Elastic File System (EFS)

Amazon Elastic File System (EFS) is the correct choice because it delivers a fully managed, multi-AZ NFS file system that many EC2 instances can mount at the same time for concurrent reads and writes. It scales automatically to handle changing workloads and is designed for high availability across Availability Zones.

Amazon S3 is unsuitable because it is object storage rather than a mountable, POSIX-compliant file system, and it does not provide native file locking or shared file semantics for concurrent writers.

Amazon EBS Multi-Attach allows a limited form of shared block access only within a single AZ and requires cluster-aware file systems or applications to avoid data corruption, making it a poor fit for a general web application needing a shared file system across AZs.

Amazon EC2 instance store is ephemeral and tied to the instance lifecycle, is not shared across instances, and therefore cannot satisfy the requirement for a durable, common file system.

When you see concurrent access, shared file system, and multi-AZ for EC2 fleets, think EFS. Use EBS for single-instance block storage (or niche cluster-aware cases with Multi-Attach), and remember S3 is object storage, not a POSIX file system.

Question 10

How do you configure cross-Region copy of KMS-encrypted Amazon Redshift automated and manual snapshots with 90-day retention?

  • ✓ B. Create snapshot copy grant in the destination using its KMS key. Enable cross-Region copy on the source with 90-day retention

The correct setup is to create a snapshot copy grant in the destination Region using a KMS key in that Region, then enable cross-Region snapshot copy on the source cluster and set the retention to 90 days. This ensures both automated and manual snapshots are copied, remain encrypted in the destination with its KMS key, and are retained for the specified period. Therefore, Create snapshot copy grant in the destination using its KMS key. Enable cross-Region copy on the source with 90-day retention is correct.
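
Expressed with boto3, the two calls might look like the sketch below. The Region names, cluster identifier, grant name, and KMS key ARN are placeholders; note the grant is created with a client pointed at the destination Region.

```python
import boto3

# Step 1: in the DESTINATION Region, create the grant with a key in that Region.
dest = boto3.client("redshift", region_name="us-west-2")
dest.create_snapshot_copy_grant(
    SnapshotCopyGrantName="audit-copy-grant",
    KmsKeyId="arn:aws:kms:us-west-2:111122223333:key/1234abcd-12ab-34cd-56ef-1234567890ab",
)

# Step 2: on the SOURCE cluster, enable cross-Region copy with 90-day retention.
source = boto3.client("redshift", region_name="us-east-1")
source.enable_snapshot_copy(
    ClusterIdentifier="audit-cluster",
    DestinationRegion="us-west-2",
    RetentionPeriod=90,                    # automated snapshot copies
    ManualSnapshotRetentionPeriod=90,      # manual snapshot copies
    SnapshotCopyGrantName="audit-copy-grant",
)
```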

The option Create snapshot copy grant in the source Region with its KMS key. Enable copy from the destination is incorrect because snapshot copy is configured on the source cluster, and the snapshot copy grant must be created in the destination Region with a key from that Region.

The option Use AWS Backup cross-Region copy for Redshift is incorrect because Redshift uses its native snapshot copy mechanism. AWS Backup is not the method for KMS-encrypted Redshift snapshot cross-Region replication.

The option Use a multi-Region KMS key in the source. No snapshot copy grant needed is incorrect because Redshift requires a destination-Region KMS key and an associated snapshot copy grant to encrypt snapshots in the target Region.

For Redshift encrypted snapshots, remember the flow: destination Region KMS key and snapshot copy grant first, then enable snapshot copy on the source cluster and set the retention. Watch for distractors that reverse source/destination roles or suggest unrelated services. When you see cross-Region encrypted snapshots, think destination key + snapshot copy grant and configure copy on the source.

Question 11

A retail analytics startup needs to define and roll out AWS Glue job configurations as infrastructure as code, and it wants new objects in an Amazon S3 bucket to invoke an AWS Lambda function that starts those Glue jobs. Which AWS service or tool is the most suitable to implement this with minimal boilerplate and native event wiring?

  • ✓ C. AWS SAM (Serverless Application Model)

The best choice is AWS SAM (Serverless Application Model) because it streamlines serverless infrastructure as code, natively supports S3 event sources for Lambda, and still lets you include AWS Glue resources via CloudFormation within the same template. This provides a concise, repeatable deployment for the Lambda trigger and the Glue job definitions together.

AWS CloudFormation can absolutely deploy Glue, Lambda, and S3 notifications, but it typically involves more boilerplate and lacks SAM’s serverless conveniences like simplified event wiring and packaging.

AWS CDK can also implement the entire stack, yet the question favors a serverless-focused tool with built-in event source mappings and packaging, which SAM offers out of the box.

AWS CodeDeploy is oriented toward deploying application versions and does not provision Glue jobs or manage S3-to-Lambda notifications, so it does not meet the IaC and serverless trigger requirements.

When you see S3 events invoking Lambda and a need for concise serverless IaC, think AWS SAM. If you need maximum flexibility across services and languages, consider AWS CDK, and for raw templates or broader coverage, use CloudFormation.

Question 12

How can you deliver only DynamoDB item updates to an existing Amazon Redshift cluster in near real time (under 60 seconds) at about 300,000 updates per day with bursts up to 9,000 per minute?

  • ✓ B. Enable DynamoDB Streams (new image) and use Lambda to send records to a Kinesis Data Firehose delivery stream for Amazon Redshift

Enable DynamoDB Streams (new image) and use Lambda to send records to a Kinesis Data Firehose delivery stream for Amazon Redshift is the best fit. DynamoDB Streams provides change data capture of only modified items. Using the new image ensures the full, updated item is emitted. Lambda scales with traffic spikes and transforms or enriches as needed. Kinesis Data Firehose then buffers and micro-batches records to Amazon Redshift, meeting sub-minute latency and handling bursty throughput with built-in retry and backoff.
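
A simplified Lambda handler for this pattern might look like the following sketch. The delivery stream name and the newline-delimited JSON framing are assumptions.

```python
import json
import boto3

firehose = boto3.client("firehose")
DELIVERY_STREAM = "ddb-updates-to-redshift"   # assumed Firehose delivery stream

def handler(event, context):
    """Triggered by DynamoDB Streams; forwards each new item image to Firehose."""
    records = []
    for record in event["Records"]:
        if record["eventName"] in ("INSERT", "MODIFY"):
            new_image = record["dynamodb"]["NewImage"]
            records.append({"Data": (json.dumps(new_image) + "\n").encode("utf-8")})
    if records:
        # Firehose buffers and micro-batches these into Redshift via COPY.
        firehose.put_record_batch(DeliveryStreamName=DELIVERY_STREAM, Records=records)
```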

The option AWS DMS change replication from DynamoDB to Amazon Redshift is not appropriate because DMS does not stream DynamoDB changes directly into Redshift and typically stages via Amazon S3, which adds latency and operational overhead for near real-time needs.

The option Use DynamoDB Streams to land updates in Amazon S3 via Lambda and query with Redshift Spectrum does not load the data into Redshift tables. It enables external querying, which does not satisfy the requirement to send the modified rows to the existing Redshift cluster.

The option Use an AWS Glue streaming job to consume DynamoDB Streams and write to Amazon Redshift is not suitable because Glue streaming jobs do not natively read from DynamoDB Streams. Implementing this would require custom bridges and increases complexity and latency risk versus the native Streams + Lambda + Firehose pattern.

When you see near real-time CDC from DynamoDB into Redshift with minimal latency and spiky volumes, look for the pattern DynamoDB Streams + Lambda + Kinesis Data Firehose (Redshift destination). If the requirement is to analyze data in place on S3 rather than load into Redshift, consider Redshift Spectrum. If you need batch ingestion or migration, consider AWS DMS with S3 staging. Pay attention to Streams images: keys only does not include changed attributes. Use new image (or new and old if you need before and after values).

Question 13

A genomics analytics startup runs a weekly Spark batch on Amazon EC2 that executes for about 48 hours and produces roughly 12 TB of temporary shuffle and scratch files per run. The team needs the lowest-latency reads and writes to data that resides directly on the instance while jobs are running, and they want to keep costs low for this short-lived workload. Which EC2 storage option should they choose?

  • ✓ C. Instance Store Volumes

The best choice is Instance Store Volumes because the workload writes large amounts of temporary data and needs the lowest latency and highest IOPS directly on the instance. Instance store is physically attached storage that is ideal for ephemeral scratch and shuffle files and is cost-effective for short-lived runs since it comes with the instance.

Amazon EBS General Purpose SSD (gp3) delivers good performance and durability, but it is network-attached block storage with additional provisioned capacity costs, and it cannot match the latency of local NVMe for transient scratch.

Amazon FSx for Lustre provides very high throughput for shared file access, yet it is a network file system that adds cost and complexity when local, disposable storage is required.

Amazon S3 Standard is object storage with exceptional durability and scalability but it is accessed over the network and is not designed for the low-latency block I/O demanded by on-instance processing.

Map keywords to storage: ephemeral scratch, lowest latency, and on-instance point to instance store; durable block points to EBS; shared POSIX suggests EFS or FSx; and object indicates S3.

Question 14

When copying an encrypted EBS snapshot to another Region, how do you ensure it stays encrypted with a customer managed KMS key?

  • ✓ B. Choose a customer managed KMS key in the destination Region during the copy

Choose a customer managed KMS key in the destination Region during the copy is correct because KMS keys are regional. When you copy an encrypted EBS snapshot across Regions, the snapshot is re-encrypted in the target Region. To keep control with a customer managed key, you must explicitly select a customer managed KMS key that exists in the destination Region during the copy operation. If you do not specify a key, EBS uses the AWS managed key for EBS in that Region.
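
In boto3 the copy is issued against the destination Region, roughly as below. The snapshot ID, Region names, and key ARN are placeholders.

```python
import boto3

# Call CopySnapshot from the DESTINATION Region and pass a CMK that exists there.
ec2_dest = boto3.client("ec2", region_name="eu-west-1")

ec2_dest.copy_snapshot(
    SourceRegion="us-east-1",
    SourceSnapshotId="snap-0123456789abcdef0",
    Encrypted=True,
    KmsKeyId="arn:aws:kms:eu-west-1:111122223333:key/1234abcd-12ab-34cd-56ef-1234567890ab",
    Description="Cross-Region copy encrypted with a customer managed key",
)
```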

The option Enable EBS encryption by default in the destination Region and copy without specifying a key is incorrect because default encryption uses the AWS managed key aws/ebs, not a customer managed key.

The option Reuse the same KMS key from the source Region is incorrect since KMS keys are Region scoped and cannot be used directly in another Region.

The option Create a grant on the source KMS key and reference it during the copy is incorrect because grants control permissions, not Region scope, and do not allow cross-Region key usage.

Look for cross-Region plus KMS phrasing. KMS keys are regional. For EBS snapshot copies, you must specify the destination Region customer managed KMS key to maintain governance. If you see default encryption mentioned, remember it implies the AWS managed key unless a specific CMK is chosen.

Question 15

SkyTrail Studios is launching a real-time trivia app that sends answer submissions from mobile devices to a backend that computes standings and updates a public leaderboard. During live shows, traffic can spike to 1.5 million events per minute for about 25 minutes. The team must process submissions in the order they arrive for each player, persist results in a highly available database, and keep operations work to a minimum. Which architecture should they adopt?

  • ✓ C. Stream updates into Amazon Kinesis Data Streams, invoke AWS Lambda to process records, and write results to Amazon DynamoDB

The best fit is Stream updates into Amazon Kinesis Data Streams, invoke AWS Lambda to process records, and write results to Amazon DynamoDB.

Kinesis Data Streams handles large, bursty ingestion while preserving record order per partition key, Lambda integrates natively to scale consumers with managed checkpointing, and DynamoDB delivers highly available, low-latency persistence with minimal ops.
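
A pared-down Lambda consumer for this pipeline might resemble the sketch below. The table name and item attributes are assumptions; per-player ordering comes from using the player ID as the Kinesis partition key.

```python
import base64
import json
import boto3

table = boto3.resource("dynamodb").Table("leaderboard")   # assumed table name

def handler(event, context):
    """Records arrive in order per partition key (the player ID)."""
    for record in event["Records"]:
        payload = json.loads(base64.b64decode(record["kinesis"]["data"]))
        table.put_item(
            Item={
                "playerId": payload["playerId"],
                "sequence": record["kinesis"]["sequenceNumber"],
                "answer": payload["answer"],
                "score": payload.get("score", 0),
            }
        )
```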

Send score events to an Amazon SQS FIFO queue, process them with an Auto Scaling group of Amazon EC2 workers, and persist results in Amazon RDS for MySQL increases operational burden due to EC2 and RDS management and faces FIFO throughput constraints, making it harder to absorb very large spikes.

Ingest updates with Amazon MSK, consume them on an Amazon EC2 fleet, and store processed items in Amazon DynamoDB can meet ordering but requires managing MSK clusters and EC2 consumers, which conflicts with the requirement to minimize management overhead.

Publish updates to Amazon EventBridge, trigger AWS Lambda to process, and store results in Amazon Aurora Serverless v2 does not guarantee strict ordering and is not intended for sustained high-throughput streaming workloads, and relational writes may struggle under spiky ingest compared to DynamoDB.

When you see requirements for very high-throughput ingestion, per-key ordering, and minimal operations, think Kinesis Data Streams plus Lambda for processing and DynamoDB for durable, highly available storage.

Question 16

Which AWS service lets multiple consumers read the same ordered clickstream and replay records for up to 30 days?

  • ✓ B. Kinesis Data Streams

Kinesis Data Streams is correct because it preserves per-shard ordering, supports multiple independent consumers (including enhanced fan-out), and allows configurable retention up to 365 days, enabling deterministic replays within a 30-day window. Consumers can use sequence numbers or iterators to reprocess the same data in order at a later time.
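
For reference, raising retention to 30 days is a single API call, sketched below with an assumed stream name.

```python
import boto3

kinesis = boto3.client("kinesis")

# 30 days = 720 hours; Kinesis Data Streams supports retention up to 8,760 hours.
kinesis.increase_stream_retention_period(
    StreamName="clickstream",        # assumed stream name
    RetentionPeriodHours=720,
)
```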

The option Amazon SQS is incorrect because messages are typically removed after consumption and the maximum message retention is 14 days, which does not meet a 30-day replay requirement and does not support multiple independent consumers re-reading the same messages from the same queue.

The option Amazon DynamoDB Streams is incorrect because it is limited to item-change streams with roughly 24-hour retention, not a general-purpose clickstream with long replay needs.

The option Amazon Kinesis Data Firehose is incorrect because it is a delivery service to sinks and does not provide replayable storage or multiple ordered consumers.

When you see requirements for ordered events, multiple independent consumers, and replays beyond a few days, think Kinesis Data Streams. If the question mentions delivery to destinations without reprocessing or multiple consumers, that points to Firehose. If it mentions queue semantics and deletion on consume, that is SQS, which also has shorter retention limits.

Question 17

A video streaming startup stores title metadata in Amazon DynamoDB and wants to improve query performance for several access patterns. The team needs to fetch items by titleId as well as by genre and by studio, and they want the data to stay consistent across these queries. Which combination of secondary indexes should they configure to satisfy these access patterns?

  • ✓ C. Global Secondary Index (GSI) with genre as the partition key and another GSI with studio as the partition key

Global Secondary Index (GSI) with genre as the partition key and another GSI with studio as the partition key is correct because only GSIs allow you to query by alternative partition keys beyond the base table’s titleId, enabling efficient lookups by genre and by studio. While GSI reads are eventually consistent, DynamoDB automatically propagates updates to the indexes, and GSIs are the proper choice for these access patterns.
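
A table definition with the two GSIs could look like the boto3 sketch below. The index names, attribute types, and the ALL projection are assumptions made for illustration.

```python
import boto3

dynamodb = boto3.client("dynamodb")

dynamodb.create_table(
    TableName="title_metadata",
    BillingMode="PAY_PER_REQUEST",
    AttributeDefinitions=[
        {"AttributeName": "titleId", "AttributeType": "S"},
        {"AttributeName": "genre", "AttributeType": "S"},
        {"AttributeName": "studio", "AttributeType": "S"},
    ],
    KeySchema=[{"AttributeName": "titleId", "KeyType": "HASH"}],
    GlobalSecondaryIndexes=[
        {   # query titles by genre without knowing titleId
            "IndexName": "genre-index",
            "KeySchema": [{"AttributeName": "genre", "KeyType": "HASH"}],
            "Projection": {"ProjectionType": "ALL"},
        },
        {   # query titles by studio without knowing titleId
            "IndexName": "studio-index",
            "KeySchema": [{"AttributeName": "studio", "KeyType": "HASH"}],
            "Projection": {"ProjectionType": "ALL"},
        },
    ],
)
```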

Local Secondary Index (LSI) with titleId as the partition key and studio as the sort key. Global Secondary Index (GSI) with genre as the partition key is incorrect because the LSI still requires the titleId partition key and therefore cannot serve a studio-only lookup across different items.

Two Global Secondary Indexes (GSIs), both using titleId as the partition key but with different sort keys for genre and studio is incorrect since using titleId as the partition key on both indexes prevents direct queries by genre or studio without knowing the titleId.

Amazon OpenSearch Service is incorrect because it is a separate search service and not a DynamoDB secondary index, so it does not fulfill the requirement to implement DynamoDB-native indexes.

If you must query by attributes that are not the base partition key, use a GSI; LSIs share the base partition key and can offer strongly consistent reads, but they cannot support cross-partition lookups by a different attribute.

Question 18

Which methods automate consistent EMR cluster initialization with required libraries at launch? (Choose 2)

  • ✓ B. Use EMR bootstrap actions from scripts in S3

  • ✓ D. Launch with an EMR custom AMI that includes the libs

The best ways to ensure identical, automated EMR initialization are Use EMR bootstrap actions from scripts in S3 and Launch with an EMR custom AMI that includes the libs.

Bootstrap actions run during instance provisioning on every node, providing deterministic, repeatable setup. A custom EMR AMI bakes dependencies into the image so every cluster node starts with the exact same software baseline.
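
A bootstrap action is declared at launch time, for example as in this hedged run_job_flow sketch; the release label, instance sizing, roles, and script path are placeholders.

```python
import boto3

emr = boto3.client("emr")

emr.run_job_flow(
    Name="spark-etl-cluster",
    ReleaseLabel="emr-7.1.0",                       # assumed EMR release
    Applications=[{"Name": "Spark"}],
    Instances={
        "InstanceGroups": [
            {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"InstanceRole": "CORE", "InstanceType": "m5.xlarge", "InstanceCount": 2},
        ],
        "KeepJobFlowAliveWhenNoSteps": False,
    },
    BootstrapActions=[
        {
            "Name": "install-required-libs",
            "ScriptBootstrapAction": {
                "Path": "s3://example-bootstrap/install_libs.sh",   # assumed script
                "Args": ["numpy", "pyarrow"],
            },
        }
    ],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
```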

AWS Systems Manager Run Command after startup is post-provision and can execute after EMR services start, which risks configuration drift and non-deterministic ordering.

Store scripts in DynamoDB and trigger with Lambda is not supported because EMR does not pull bootstrap artifacts from DynamoDB and Lambda does not run on cluster nodes at boot.

Add a first EMR step to install packages executes only after the cluster is up, so it cannot guarantee libraries are present before daemons start and is not the recommended mechanism for base initialization.

When you see phrases like at launch, every node, and no manual steps, prefer EMR-native initialization: bootstrap actions from S3 or a custom AMI. Avoid post-provision tools or external triggers for base cluster setup.

Question 19

A media analytics startup runs a serverless data integration stack on AWS Glue. The data engineer must regularly crawl a Microsoft SQL Server database called OpsDB and its table txn_logs_2024 every 12 hours, then orchestrate the end-to-end extract, transform, and load steps so that the processed data lands in an Amazon S3 bucket named s3://media-raw-zone. Which AWS service or feature will most cost-effectively coordinate the crawler and ETL jobs as a single pipeline?

  • ✓ B. AWS Glue workflows

The correct choice is AWS Glue workflows.

It provides native orchestration for AWS Glue crawlers and Glue jobs, enabling you to set dependencies, triggers, and parameters so the crawl and ETL steps execute and are tracked as one pipeline. Because it is built into Glue, it is typically more cost-effective than introducing a separate orchestration service for a Glue-centric workflow.
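
A minimal wiring of the workflow in boto3 might look like the sketch below. The workflow, crawler, and job names and the cron expression are assumptions.

```python
import boto3

glue = boto3.client("glue")

glue.create_workflow(Name="opsdb-to-s3-pipeline")

# Scheduled trigger: start the crawler every 12 hours.
glue.create_trigger(
    Name="crawl-opsdb-every-12h",
    WorkflowName="opsdb-to-s3-pipeline",
    Type="SCHEDULED",
    Schedule="cron(0 */12 * * ? *)",
    Actions=[{"CrawlerName": "opsdb-txn-logs-crawler"}],   # assumed crawler name
    StartOnCreation=True,
)

# Conditional trigger: run the ETL job only after the crawl succeeds.
glue.create_trigger(
    Name="run-etl-after-crawl",
    WorkflowName="opsdb-to-s3-pipeline",
    Type="CONDITIONAL",
    Predicate={
        "Conditions": [
            {
                "LogicalOperator": "EQUALS",
                "CrawlerName": "opsdb-txn-logs-crawler",
                "CrawlState": "SUCCEEDED",
            }
        ]
    },
    Actions=[{"JobName": "opsdb-to-media-raw-zone"}],      # assumed Glue job name
    StartOnCreation=True,
)
```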

AWS Step Functions can coordinate Glue jobs and crawlers, but it adds per-state transition charges and additional complexity, so it is not the most cost-effective choice when all tasks live within Glue.

AWS Glue Studio focuses on visually authoring and monitoring individual Glue jobs. It does not provide the same end-to-end workflow coordination of multiple jobs and crawlers.

AWS Glue DataBrew is geared toward interactive, no-code data preparation by analysts and is not an orchestration mechanism for scheduled crawlers and ETL pipelines.

When the scenario emphasizes orchestrating Glue crawlers and multiple Glue jobs as a single unit and hints at cost efficiency, look for Glue workflows. If the problem spans many different AWS services or requires human approval steps, Step Functions is more likely.

Question 20

Which AWS Glue transform can probabilistically match and deduplicate records across datasets without a shared key?

  • ✓ C. AWS Glue FindMatches transform

AWS Glue FindMatches transform is correct because it is the AWS Glue machine learning transform designed for probabilistic record linkage and deduplication when there is no shared identifier across datasets. It learns matching rules from labeled examples (or can run with minimal labeling) and scores potential matches to consolidate records at scale.

AWS Glue DataBrew is incorrect because while it can remove duplicates using specified columns, it does not perform ML-based fuzzy/entity matching across datasets without a common key.

AWS Entity Resolution is incorrect in this context because it is a separate managed service, not a Glue transform. The question specifically asks for a Glue transformation.

AWS Glue Relationalize is incorrect because it flattens nested structures into relational tables and does not identify duplicate or matching entities.

When you see cues like no common identifier, deduplicate, and a requirement to do it inside AWS Glue, think of FindMatches. If the question broadens beyond Glue to cross-application/entity matching, consider AWS Entity Resolution instead.

Question 21

At Riverton Books, analysts run Amazon Athena queries that join a very large fact_orders table with a much smaller dim_stores table using an equality condition. The fact table has roughly 75 million rows while the lookup table has only a few thousand rows, and the join is performing poorly. What change to the join should you make to improve performance without altering the results?

  • ✓ C. Move the larger dataset to the left side of the join and the smaller table to the right side

The best fix is to ensure the small lookup table is on the right-hand side of the join and the large table is on the left. In Athena’s distributed hash join, the right-side relation is built into an in-memory hash table and broadcast to workers, so keeping that side small reduces memory consumption and network overhead. This typically yields a noticeable speedup for equijoins with highly asymmetric table sizes.

Move the larger dataset to the left side of the join and the smaller table to the right side is correct because Athena builds and broadcasts the right side of an equality join, and making it the small table optimizes the build phase.

Place the small table on the left and the big table on the right is incorrect because it forces Athena to build a large hash table from the big dataset, increasing memory usage and slowing execution.

Switch the Athena workgroup to the newest engine release is not the right fix because engine version changes alone do not correct suboptimal join order. Workgroups mainly control settings like data scan limits and version pinning.

Partition the small lookup table by the join key is not helpful here because the table is tiny and partitioning does not address the broadcast build behavior of Athena’s hash join, so it offers negligible benefit.

For Athena equijoins, put the small table on the right side so the broadcast build is tiny. Think right side = build side and keep it small for faster joins.

Question 22

Which AWS services should be combined to automatically detect and mask sensitive columns in S3 and enforce role-based access so analysts get masked data while a preprocessing role can read raw data? (Choose 2)

  • ✓ C. AWS Glue DataBrew

  • ✓ D. IAM policies on S3 prefixes

AWS Glue DataBrew plus IAM policies on S3 prefixes together satisfy automated detection/masking and role-based access. DataBrew offers built-in transforms to mask or hash sensitive columns and can output sanitized datasets to a separate S3 prefix, aligning with a low-maintenance, scalable pattern. IAM policies restrict access to the raw S3 location for a preprocessing role while granting analysts read access only to the masked prefix, cleanly separating duties and enforcing least privilege.
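
The access-control half can be as simple as an inline IAM policy that separates the raw and masked prefixes, as in this sketch with placeholder bucket, prefix, and role names.

```python
import json
import boto3

iam = boto3.client("iam")

# Analysts read only the masked prefix; the raw prefix is explicitly denied.
analyst_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "s3:GetObject",
            "Resource": "arn:aws:s3:::example-datalake/masked/*",
        },
        {
            "Effect": "Deny",
            "Action": "s3:GetObject",
            "Resource": "arn:aws:s3:::example-datalake/raw/*",
        },
    ],
}

iam.put_role_policy(
    RoleName="analyst-role",               # assumed analyst role
    PolicyName="masked-data-only",
    PolicyDocument=json.dumps(analyst_policy),
)
```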

The option Amazon Macie is incorrect because it classifies sensitive data but does not mask it or enforce masked-vs-raw access.

S3 Object Lambda with Lambda redaction is not appropriate for large-scale tabular analytics. It transforms objects per request and does not automatically detect sensitive columns, adding operational complexity.

Amazon Comprehend focuses on unstructured text PII detection and is not efficient or native for columnar masking across large CSV/Parquet datasets in S3.

Look for combinations that both transform data (masking) and control access (RBAC). Prefer low-code, scalable services for pipelines such as DataBrew for masking and simple S3/IAM boundaries for raw vs masked. Be wary of tools that only discover PII without masking or that target unstructured data when the workload is structured.

Question 23

A sports analytics startup ingests real-time clickstream and playback telemetry that can spike to 180,000 events per minute and wants to use Spark to transform the data before landing curated outputs in an Amazon S3 data lake. The team needs a managed and cost effective ETL service that can scale automatically with changing load and also allows per job compute tuning to control costs. Which service and configuration should they choose?

  • ✓ C. AWS Glue with configurable DPUs for Spark jobs and optional autoscaling

The best choice is AWS Glue with configurable DPUs for Spark jobs and optional autoscaling because it offers serverless Spark ETL, integrates natively with S3, lets you tune compute per job via DPUs to optimize cost, and can elastically scale to handle fluctuating throughput.
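
Per-job compute tuning might look like the sketch below; the job name, role, script location, and worker count are assumptions, and the auto scaling argument applies to Glue 3.0 and later.

```python
import boto3

glue = boto3.client("glue")

# Worker type and count set the job's DPU footprint (each G.1X worker = 1 DPU).
glue.create_job(
    Name="telemetry-curation",
    Role="arn:aws:iam::111122223333:role/glue-etl-role",       # assumed role
    GlueVersion="4.0",
    Command={
        "Name": "glueetl",
        "ScriptLocation": "s3://example-scripts/curate_telemetry.py",
        "PythonVersion": "3",
    },
    WorkerType="G.1X",
    NumberOfWorkers=10,
    DefaultArguments={"--enable-auto-scaling": "true"},        # scale workers with load
)
```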

Amazon EMR with manually provisioned Spark clusters is less suitable here because you must manage and right-size clusters yourself, which increases operational overhead and can lead to idle costs when load drops.

AWS Glue Crawler only discovers and catalogs schema and does not perform the actual ETL transformations or run Spark jobs.

Amazon Kinesis Data Analytics is focused on Flink or SQL streaming analytics and does not provide serverless Spark ETL with DPU-based compute controls for S3 data lake pipelines.

When you see keywords like serverless Spark, auto scaling, and tune compute per job, think AWS Glue jobs with configurable DPUs. A Glue Crawler is catalog only, and manually managed EMR contradicts the auto-scaling, low-ops requirement.

Question 24

Which AWS service pair provides streaming ingestion and stateful real-time analytics for anomaly detection under 500 ms?

  • ✓ B. Kinesis Data Streams plus Amazon Managed Service for Apache Flink

Kinesis Data Streams plus Amazon Managed Service for Apache Flink is correct because it combines scalable stream ingestion with a fully managed Flink runtime for stateful, low-latency processing, enabling sub-second anomaly detection with keyed state and event-time windows.

The option Amazon Data Firehose and Amazon Redshift is incorrect because Firehose delivers in batches and Redshift is a data warehouse meant for analytic queries after ingestion, not continuous sub-second processing.

The option AWS Glue Streaming and Amazon S3 is incorrect because Glue streaming performs micro-batch ETL to S3 and does not provide stateful, sub-second analytics.

The option Amazon Kinesis Data Streams with AWS Lambda is incorrect because while Lambda can process records quickly, it lacks rich, long-lived state and complex windowing required for robust anomaly detection at scale.

When you see stateful, windowed, or sub‑second stream analytics, think Flink on top of a streaming backbone like Kinesis Data Streams. Firehose is delivery, not analytics. Lambda is event-driven compute but not a full stream processing engine. Glue Streaming is micro-batch ETL, not ultra-low-latency stream analytics.

Question 25

A logistics analytics company, ParcelMetrics, runs several Amazon ECS task types on Amazon EC2 instances within a shared ECS cluster. Each task must write its result files and state to a common store that is accessible by all tasks. Every run produces roughly 35 MB per task, as many as 500 tasks can run at once, and even with ongoing archiving the total storage footprint is expected to stay below 800 GB. Which storage approach will best support sustained high-frequency reads and writes for this workload?

  • ✓ C. Use Amazon EFS in Provisioned Throughput mode

The best choice is Use Amazon EFS in Provisioned Throughput mode.

This mode provides predictable throughput regardless of how much data is stored, which fits small-capacity but high-throughput patterns. EFS also enables many EC2 instances and ECS tasks to read and write concurrently, matching the need for shared, highly parallel access.
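
Creating the file system with provisioned throughput is a single call, sketched below; the throughput figure is an assumed value sized for the concurrent task load rather than a recommendation.

```python
import boto3

efs = boto3.client("efs")

# Provisioned Throughput decouples performance from the (small) stored size.
efs.create_file_system(
    CreationToken="parcelmetrics-shared-results",
    PerformanceMode="generalPurpose",
    ThroughputMode="provisioned",
    ProvisionedThroughputInMibps=256,      # assumed throughput target
    Encrypted=True,
)
```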

Use Amazon EFS in Bursting Throughput mode is less suitable because baseline throughput scales with the amount of data stored, and with under a terabyte of data you may not achieve sustained high IO for hundreds of concurrent tasks without padding.

Create a shared Amazon DynamoDB table accessible by all ECS tasks does not fit due to the 400 KB item size limit, which forces fragmentation of 35 MB outputs and complicates high-frequency reads and writes for file-like workloads.

Mount a single Amazon EBS volume to the ECS cluster instances is not ideal since EBS is per-instance by default. While Multi-Attach exists for specific volumes and Nitro instances, it has limits and does not provide the scalable, multi-writer shared file system semantics needed.

When storage size is relatively small but the workload needs steady, high throughput for many clients, prefer Amazon EFS Provisioned Throughput to avoid padding and to guarantee performance.

Cameron McKenzie is an AWS Certified AI Practitioner, Machine Learning Engineer, Copilot Expert, Solutions Architect and author of many popular books in the software development and Cloud Computing space. His growing YouTube channel training devs in Java, Spring, AI and ML has well over 30,000 subscribers.