DP-100 Practice Tests on Data Scientist Associate Exam Topics

DP-100 Azure Data Scientist Certification Exam Topics

Want to pass the DP-100 certification exam on your first try? This collection of DP-100 exam questions will help you understand core concepts and prepare for the real DP-100 test.

These questions come from my Udemy training and the certificationexams.pro website, resources that have helped many students pass the DP-100 certification.

DP-100 Practice Questions

These are not DP-100 exam dumps or braindumps. They are carefully written questions that resemble what you will see on the real DP-100 certification exam.

Good luck on these practice questions, and even better luck on the official DP-100 exam.

Git, GitHub & GitHub Copilot Certification Made Easy

Want to get certified on the most popular AI, ML & DevOps technologies of the day? These five resources will help you get GitHub certified in a hurry.

Get certified in the latest AI, ML and DevOps technologies. Advance your career today.

Azure DP-100 Sample Questions

Bistro Solace in Brooklyn is a well-regarded neighborhood restaurant founded by Maria Chen and Daniel Ortiz, and they are experimenting with machine learning to improve customer personalization. The team has adopted Microsoft Azure and plans to use Hyperdrive in Azure Machine Learning to tune model hyperparameters, and they cannot enable any early stopping policy because of their audit requirements. The hyperparameter space is a learning rate that can be any value between 0.0005 and 0.2 and a batch size that must be one of 8, 16, or 32. Daniel is deciding which sampling strategy to pick and asks for your recommendation. Which sampling strategies should you recommend to Daniel? (Choose 2)

  • ❏ A. Grid sampling

  • ❏ B. Hyperband sampling

  • ❏ C. Random sampling

  • ❏ D. Bayesian sampling

Which model evaluation statistic is most commonly referred to as R squared?

  • ❏ A. Explained variance score

  • ❏ B. Relative error

  • ❏ C. Coefficient of determination

  • ❏ D. Relative squared error

Which statement correctly describes managed compute clusters in a cloud analytics environment?

  • ❏ A. Compute clusters are limited to CPU only machine types

  • ❏ B. Clusters generally have higher downtime than single compute instances

  • ❏ C. Clusters must be resized manually and do not scale automatically

  • ❏ D. Managed compute clusters can automatically scale their node count to match workload demand

Nova Financial Systems was founded by Clara Boone and has grown into a major independent fintech firm. The team routes their machine learning workloads and training datasets over a virtual private network when running experiments and training models. Why does Nova use a VPN for its machine learning pipelines and model training?

  • ❏ A. To route traffic over encrypted private links such as Cloud VPN or Dedicated Interconnect

  • ❏ B. To encrypt traffic and confine training resources to a private virtual network

  • ❏ C. To administer access by assigning IAM roles to users and services

  • ❏ D. To allow developers to run model experiments without relying on cloud connectivity

A data science team at Northbridge Labs needs to connect Azure Machine Learning with other Azure resources and reuse a massively parallel processing data platform that their group uses in other initiatives. They want to pick a compute target that can make use of an existing MPP Spark environment in their Azure workspace. Which compute target should they choose?

  • ❏ A. Azure Machine Learning compute instance

  • ❏ B. An autoscaling compute cluster that grows from three to six nodes

  • ❏ C. Azure Kubernetes Service

  • ❏ D. Synapse Analytics Spark pool

A data science team at Harborview Analytics trained a regression model to forecast quarterly revenue and they want to evaluate how accurate its predictions are. Which evaluation metric best matches the description “The mean of absolute differences between predicted and actual values measured in the same units as the target where a smaller value signals better accuracy”?

  • ❏ A. Coefficient of Determination R2

  • ❏ B. Root Mean Squared Error RMSE

  • ❏ C. Mean Absolute Percentage Error MAPE

  • ❏ D. Mean Absolute Error MAE

  • ❏ E. Relative Absolute Error RAE

You are training a regression model using Contoso Automated Machine Learning and your dataset contains missing values and categorical features with few unique categories. You want the AutoML run to automatically fill missing entries and transform categorical columns as part of the training pipeline. What configuration should you set to guarantee those preprocessing steps occur?

  • ❏ A. Set the featurization parameter to “disabled”

  • ❏ B. Set the featurization parameter to “custom”

  • ❏ C. Set the featurization parameter to “auto”

  • ❏ D. Set the featurization parameter to “enabled”

The Metro Gazette, located in Riverton, modernized its infrastructure under editor in chief Eleanor Blake and adopted Microsoft Azure for new initiatives. Liam Reed is tasked with building a computer vision workflow to locate and extract the boundaries of several items within a photograph using Azure AutoML. Which type of computer vision model should Liam choose to perform this task effectively?

  • ❏ A. Multi-label image classification

  • ❏ B. Instance segmentation

  • ❏ C. Object detection

  • ❏ D. Multi-class image classification

A regional lender is using an automated machine learning tool to train a natural language processing classifier and they must pick training settings and evaluation procedures that align with responsible AI practices. What considerations should guide their decisions?

  • ❏ A. Rely only on accuracy metrics and ignore input preprocessing

  • ❏ B. Choose highly complex neural networks without evaluating interpretability

  • ❏ C. Emphasize ethical practices by documenting preprocessing steps and preferring interpretable models

  • ❏ D. Cloud AutoML

Real world datasets frequently include missing entries, incorrect recordings, and sampling biases that can skew analysis results. Which steps are appropriate when inspecting and preparing such data for modeling and evaluation? (Choose 4)

  • ❏ A. Inspect records for misentered or corrupted values

  • ❏ B. Use Cloud Dataflow to preprocess and transform the dataset

  • ❏ C. Confirm that the dataset size is sufficient to reflect real world variability

  • ❏ D. Identify and impute missing values

  • ❏ E. Check raw data for sampling or measurement bias

After you create an Orion Machine Learning workspace you can manage the assets and compute resources needed to train and deploy models, and the platform provides several types of compute targets for experimentation, training, and hosting. Which of the following compute resource types can you create inside an Orion Machine Learning workspace? (Choose 3)

  • ❏ A. Managed Kubernetes clusters

  • ❏ B. Virtual machine size

  • ❏ C. Compute instances

  • ❏ D. Google Kubernetes Engine

  • ❏ E. Compute clusters

  • ❏ F. Inference clusters

Within the context of the CloudVista machine learning environment, simple models with small datasets can usually be trained in a single pass, but larger datasets and more complex models require iterative training that repeatedly applies the model to training data, compares outputs to expected labels, and adjusts parameters until a suitable fit is found. Hyperparameters control how the model is updated during these training iterations, and preprocessing denotes transformations applied before data is given to the model. The most common preprocessing step is [__]?

  • ❏ A. Standardize features to have zero mean and unit variance

  • ❏ B. Normalize features to lie between zero and one

  • ❏ C. Remove anomalous values from the dataset

  • ❏ D. Google Cloud Storage

At Harbor Data Labs the analytics group uses several techniques in Azure Machine Learning Studio to validate models that predict continuous outcomes. Which algorithm listed below focuses on minimizing the differences between observed values and predicted values and is therefore most appropriate for fitting a linear relationship?

  • ❏ A. Fast Forest Quantile Regression

  • ❏ B. Ridge Regression

  • ❏ C. Boosted Decision Tree Regression

  • ❏ D. Linear Regression

Lumen and Oak is an upscale Brooklyn bistro founded by Emma Carter and Daniel Shaw who are exploring ways to streamline their operations. They have adopted Microsoft Azure and you were hired to lead several initiatives. The team published a machine learning solution built with Azure Machine Learning Designer as a real time web service on an Azure Kubernetes Service inference compute cluster and they did not change the deployed endpoint settings. You need to provide the application developers with the values required to invoke the endpoint and you will work with an intern to gather the details. Which values should you supply to the application developers to call the endpoint? (Choose 2)

  • ❏ A. The run identifier for the inference pipeline execution

  • ❏ B. The URL of the web service endpoint

  • ❏ C. The name of the Azure Kubernetes Service cluster hosting the service

  • ❏ D. The container image name used for the deployment

  • ❏ E. The endpoint authentication key

Rita Chen recently joined Aurora Air, which is expanding into international routes. The airline wants to label passenger feedback with sentiments such as “negative”, “neutral”, and “positive”. Rita is using Azure AutoML to build a classification model and she must remove columns that behave like record identifiers because they do not help prediction. Which preprocessing transformation should she apply?

  • ❏ A. Convert categorical variables to numeric representations

  • ❏ B. Fill in missing values with imputation methods

  • ❏ C. Drop fields that have very high cardinality

  • ❏ D. Apply feature engineering to create derived attributes

UrbanBite Restaurants is a U.S. quick service chain led by Marco Lema and based in Boulder, Colorado. The company plans to open locations overseas and this has created IT challenges so Marco has asked for your assistance. The immediate goal is to provision a shared data science environment for the analytics team. The training dataset for models exceeds 45 GB in size. Models must be developed using either Caffe2 or Chainer frameworks. Data scientists need to build machine learning pipelines and train models on their personal laptops while online and while offline. Laptops must be able to receive pipeline updates once they reconnect to the network. Which data science environment best meets these requirements?

  • ❏ A. Azure Databricks

  • ❏ B. Azure Kubernetes Service

  • ❏ C. Azure Machine Learning Designer

  • ❏ D. Azure Machine Learning

A data engineer must transfer a large dataset from Contoso Machine Learning Studio into a Weka environment and must convert the files into a format that Weka can read. Which conversion module will best produce a file compatible with Weka?

  • ❏ A. Export as CSV

  • ❏ B. Convert to TFRecord

  • ❏ C. Convert to ARFF

  • ❏ D. Export as LIBSVM format

Ashford Analytics is a data company started by Adam Ashford and it is valued at over forty million dollars. After founding the Ashford Trust Adam became the company chairman and lead technologist. Adam has asked you to assist as his engineering group adopts Azure Databricks for their machine learning workloads. During a team workshop the engineers examined Driver and Executor roles in Databricks and they observed that Spark parallelism uses clusters made up of a Driver and one or more executors. The lead engineer wants to know: what type of object is submitted work partitioned into on a Spark cluster?

  • ❏ A. Arrays

  • ❏ B. Sessions

  • ❏ C. Stages

  • ❏ D. Jobs

Atlas Dataworks is auditing a dataset and a junior analyst asks you to interpret a NumPy array with the shape (3, 25). The analyst is learning how array shapes describe data layouts and requests a simple explanation of what the tuple (3, 25) communicates about the elements inside the array. How would you explain this shape?

  • ❏ A. BigQuery

  • ❏ B. The structure represents a single dimensional sequence with 75 elements

  • ❏ C. The array contains three elements whose values are three and twenty five

  • ❏ D. A two dimensional array made up of three rows each containing 25 elements

A data engineer at a regional bookseller is preparing sales data with pandas in Python and notices entries that repeat the same values in “cust_id” and “order_id”. They must remove duplicate rows so only the first occurrence of each pair is kept. Which pandas call will perform this operation?

  • ❏ A. duplicated(subset=["cust_id", "order_id"])

  • ❏ B. drop_duplicates(subset="cust_id", keep='first')

  • ❏ C. drop_duplicates(subset=["cust_id", "order_id"], keep='first')

  • ❏ D. drop_duplicates(keep='first')

A data science team at Meridian Analytics needs to install the Azure Machine Learning CLI extension into their environment. The extension adds command line operations for managing Azure Machine Learning resources. Which software must already be installed before adding this extension?

  • ❏ A. Google Cloud SDK

  • ❏ B. Azure CLI

  • ❏ C. Azure PowerShell

  • ❏ D. Power Apps

Scenario: Blue Ridge Analytics is a consulting firm started by Jordan Hale, who assembled a small analytics group to support client reporting. The team is analyzing sales records that are kept in a pandas DataFrame named sales_df. The DataFrame includes the columns year, month, day, and total_sales. Which code snippet should be used to compute the average total_sales value efficiently?

  • ❏ A. sales_df['total_sales'].median()

  • ❏ B. sales_df['total_sales'].average()

  • ❏ C. sales_df['total_sales'].mean()

  • ❏ D. mean(sales_df['total_sales'])

Scenario: Meridian Bistro in San Francisco was opened by Lena Ortiz and Omar Hale, and they have adopted Microsoft Azure to modernize operations and hired you to lead several IT projects. The current assignment is to use Azure Machine Learning Designer to construct a pipeline that trains a classification model and then make that trained model available as an online service. What steps must be completed before you can deploy the trained model as a service?

  • ❏ A. Register the trained model in the Azure Machine Learning model registry

  • ❏ B. Create an inference pipeline derived from the training pipeline

  • ❏ C. Add an Evaluate Model module into the original training pipeline

  • ❏ D. Clone the training pipeline and swap the algorithm to a regression learner

Scenario: Marlowe Textiles is a family run retailer with several stores across Greater Manchester and it recently purchased a small fashion label based in Barcelona. As part of the consolidation the company is migrating its systems into Marlowe’s Microsoft Azure environment and the CTO has hired you as an Azure consultant to guide the integration. The current work stream focuses on Azure Machine Learning. The engineering team provisioned an Azure Machine Learning compute target named ComputeA using the STANDARD_D2 virtual machine image. ComputeA is currently idle and has zero active nodes. A developer set a Python variable ws to reference the Azure Machine Learning workspace and then runs this Python code

from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import ComputeTargetException

the_cluster_name = "ComputeA"
try:
    the_cluster = ComputeTarget(workspace=ws, name=the_cluster_name)
    print("Step1")
except ComputeTargetException:
    config = AmlCompute.provisioning_configuration(vm_size="STANDARD_DS13_v2", max_nodes=6)
    the_cluster = ComputeTarget.create(ws, the_cluster_name, config)
    print("Step2")

The CTO is concerned that the output is not matching expectations and asks whether Step1 will be printed to the console. Will Step1 be printed to the screen?

  • ❏ A. No the script will print Step2 instead

  • ❏ B. Yes the text Step1 will be printed to the screen

  • ❏ C. An unhandled exception will occur and the program will fail

  • ❏ D. It depends on workspace authentication and the run may fail if ws is invalid

When configuring an Automated Machine Learning experiment in Contoso AI which setting is not available to change during model creation?

  • ❏ A. Set the experiment timeout to 90 minutes

  • ❏ B. Select R squared as the main evaluation metric

  • ❏ C. Limit training to only the XGBoost algorithm

  • ❏ D. Turn on a native switch to train solely on a fraction of the input dataset

Lena Park is a computer vision engineer at Novex Security who is using Azure AutoML with the Azure Machine Learning Python SDK v2 to build a model that finds vehicles in rooftop camera images and returns bounding box coordinates for each detected vehicle. Which AutoML task should she choose so the trained model outputs bounding box coordinates for detected vehicles?

  • ❏ A. azure.ai.ml.automl.image_instance_segmentation

  • ❏ B. azure.ai.ml.automl.image_classification

  • ❏ C. azure.ai.ml.automl.image_object_detection

  • ❏ D. azure.ai.ml.automl.image_classification_multilabel

ArchForm Studio is an architecture firm in Chicago that was founded by Elena Park and is moving its legacy infrastructure to Microsoft Azure. The team plans to use Horovod to train a deep neural network and has set up Horovod across three servers each with two GPUs to support synchronized distributed training. Within the Horovod training script what is the main purpose of hvd.callbacks.BroadcastGlobalVariablesCallback(0)?

  • ❏ A. Use tf.distribute.MirroredStrategy for synchronous GPU training

  • ❏ B. To average evaluation metrics across participants at the end of each epoch

  • ❏ C. To send the final model variables from rank 0 to other processes after training finishes

  • ❏ D. To broadcast the initial model variables from rank 0 so every worker begins with identical parameters

After deploying a model for online inference to an endpoint in a cloud project what is the default method to invoke that endpoint?

  • ❏ A. Python client library

  • ❏ B. Cloud Console UI

  • ❏ C. REST API

  • ❏ D. gcloud CLI

Maya Rivera is the lead for a computer vision initiative at Harborview Research Center and she is working with surveillance footage from a hospital loading zone to train a model with the Azure Machine Learning Python SDK v2. The objective is to output bounding box coordinates for vehicles detected in the images. Which Azure AutoML image task should she choose to have the trained model produce bounding box coordinates for vehicles?

  • ❏ A. azure.ai.ml.automl.image_classification

  • ❏ B. AutoML Vision

  • ❏ C. azure.ai.ml.automl.image_instance_segmentation

  • ❏ D. azure.ai.ml.automl.image_object_detection

Scenario The Pacific Collegiate Wrestling League was founded by promoter Marcus Bell and he is using cloud solutions to modernize the organization. He has asked for guidance on their Microsoft Azure setup. The analytics team is training a classification model and plans to measure performance with k fold cross validation on a small sample of data. The data scientist must select a value for the k parameter which determines how many splits the dataset will be divided into for the cross validation procedure. Which value should they pick?

  • ❏ A. k=5

  • ❏ B. k=0.5

  • ❏ C. k=10

  • ❏ D. k=1

You are moving from a data engineering role into data science and you often encounter the phrase “data wrangling” while working with Google Cloud Vertex AI. How would you describe “data wrangling” in the Vertex AI workflow?

  • ❏ A. Splitting a dataset into training and evaluation subsets

  • ❏ B. Data wrangling is the interactive process of cleaning structuring and enriching raw datasets so they meet the input expectations of a machine learning pipeline

  • ❏ C. Managing storage access versioning and collaboration for datasets across a team

  • ❏ D. Using Cloud Dataprep or Dataflow to automate large scale extract transform load jobs

A data scientist at a small analytics startup is preparing k fold cross validation for a classification model and must choose the number of folds to balance evaluation thoroughness and compute time on a limited dataset. Which k value is most suitable?

  • ❏ A. Set k to 5 folds

  • ❏ B. Use leave one out cross validation with k equal to the number of training examples

  • ❏ C. Set k to 10 folds

  • ❏ D. Set k to 2 folds

Harborview Loans is a regional mortgage firm with branches across Oregon and it is run by Nora and Alan Pierce. Priya Rao leads a new ML Designer experiment and she needs to wrangle CSV files stored in an Azure Blob Storage container named finance-archive with a folder called pricing-data to build a loan pricing model. She wants the simplest method to access and process the files inside a notebook while minimizing setup steps. What approach should she choose?

  • ❏ A. Register the storage account as a datastore in the workspace and create a data asset for the files

  • ❏ B. Use Azure Storage Explorer to download the blob files to the compute instance before working with them

  • ❏ C. Open the blobs directly in the notebook by using the blob URI together with a SAS token

  • ❏ D. Employ the Azure Machine Learning Python SDK v2 to register and access the data programmatically

Aurora Club in Meridian City is an upscale venue and it also serves as the headquarters for Damien Voss’s side operations, and you were hired to advise their analytics group. The team is building a scikit-learn regression model in Python and the dataset contains several numeric fields plus one text column named 'ItemCategory' with the values SportsCars, Motorbikes, Yachts, and Trucks. They defined a mapping dictionary CategoryCode = {'SportsCars': 1, 'Motorbikes': 2, 'Yachts': 3, 'Trucks': 4} and they intend to create a numeric column with dataset['CategoryCode'] = dataset['ItemCategory'].[?] They will then use CategoryCode with other numeric features to fit LinearRegression. What expression should replace [?] to transform the text categories into numeric values that scikit-learn can accept?

  • ❏ A. apply(CategoryCode)

  • ❏ B. map(CategoryCode)

  • ❏ C. gen(CategoryCode)

  • ❏ D. transpose(CategoryCode)

Scenario: The Aurora Lounge is Riverton’s upscale nightclub and a discreet front for a local syndicate. You are acting as a consultant to improve their analytics processes. The team is working with a Python DataFrame named revenue_df and they need to convert it from a wide layout to a long layout using pandas.melt. The wide DataFrame contains the columns shop, 2019, and 2020, and its rows are StoreA 40 30, StoreB 70 80, and StoreC 52 58. The expected long format should have the columns shop, year, and value with one row per store per year. A developer left placeholders [A], [B], and [C] in this snippet

import pandas as pd
revenue_df = pd.melt([A], id_vars='[B]', value_vars=[C])

Which arguments should replace the placeholders so that the code performs the intended unpivoting operation?

  • ❏ A. [A] revenue_df, [B] shop, [C] ["2019", "2020"]

  • ❏ B. [A] dataFrame, [B] shop, [C] ["2019", "2020"]

  • ❏ C. [A] revenue_df, [B] StoreA, StoreB, StoreC, [C] ["year"]

  • ❏ D. [A] bigquery, [B] value, [C] "shop"

In a Jupyter notebook you have a registered Workspace object named project_ws. Which call retrieves the workspace default datastore?

  • ❏ A. project_ws.get_dataset()

  • ❏ B. project_ws.get_default_datastore()

  • ❏ C. project_ws.default()

  • ❏ D. project_ws.find()

Scenario: Meridian Biotech analytics team led by Ana Rivera and James Park are deploying Azure Machine Learning to improve hiring model fairness and performance. They are using Grid Search to tune a binary classifier that predicts whether applicants will be hired. They want the classifier to select equal proportions of candidates from each category in the Gender attribute. Which parity constraint should they enforce to obtain equal selection rates across the gender groups?

  • ❏ A. True positive rate parity

  • ❏ B. Error rate parity

  • ❏ C. Demographic parity

  • ❏ D. False positive rate parity

  • ❏ E. Equalized odds

You are advising Sentinel Analytics and you are meeting with Maria who leads the data engineering team about Azure Machine Learning. The group plans to deploy models through batch endpoints as part of an ETL workflow and they must create the deployment definition. Which class should you recommend for building the deployment definition?

  • ❏ A. Pipeline

  • ❏ B. OnlineDeployment

  • ❏ C. ParallelRunConfig

  • ❏ D. BatchDeployment

While configuring a training run in Contoso Machine Learning you may pick a custom or prebuilt runtime environment. Which additional resource must you also define to specify where the training will run?

  • ❏ A. Artifact Registry

  • ❏ B. Object storage bucket

  • ❏ C. A designated compute target for executing the training job

  • ❏ D. Operating system image

An analytics team at Nimbus Data needs to register a Synapse Spark pool as an attached compute resource from the Azure Machine Learning Studio compute creation wizard. What sequence of steps correctly registers the Spark pool?

  • ❏ A. Select a Spark pool then attach it to a newly created Synapse workspace then enable the compute managed identity and after saving go to Synapse Studio to give the managed identity the Synapse Administrator role

  • ❏ B. Choose an existing Synapse workspace then enable a workspace managed identity and assign that identity the Azure ML Administrator role

  • ❏ C. Select the existing Synapse workspace then choose the Spark pool within that workspace then enable the managed identity for the compute resource then save the compute and use Synapse Studio to grant the managed identity the Synapse Administrator role

  • ❏ D. Pick an existing Spark pool then enable a managed identity for the Synapse workspace and assign that identity the Azure ML Administrator role

The Midtown Chronicle is a regional Chicago newspaper based in the Lakeside Building and its lead developer Alex Chen is building MLOps to automate workflows. Alex wants to ensure that model training starts automatically whenever the code repository receives proposed changes. What action should Alex take to enable this automation?

  • ❏ A. Configure a push webhook on the repository to notify the pipeline endpoint

  • ❏ B. Create a new feature branch in the repository

  • ❏ C. Open a pull request in the repository hosting service

  • ❏ D. Clone the repository to a local development system

Helix Analytics is the research division of Meridian Biotech and it is overseen by Ava Carter and Jonah Reed. They intend to use Microsoft Machine Learning to improve operational outcomes and they have asked you to advise on privacy practices. While chatting over coffee Ava wants a concise explanation of how differential privacy protects individuals in published summaries. How would you briefly explain differential privacy to Ava?

  • ❏ A. Homomorphic encryption

  • ❏ B. Inject random noise into analytic outputs so aggregated metrics reflect the dataset yet vary unpredictably

  • ❏ C. Substitute numeric entries with their column average for analysis

  • ❏ D. Google Cloud Data Loss Prevention

A data scientist at Aurora Insights has completed a binary classification model and they will use precision as the primary evaluation metric. Which visualization technique best shows how precision changes across different classification thresholds?

  • ❏ A. Violin plot

  • ❏ B. Calibration curve

  • ❏ C. Precision recall curve

  • ❏ D. Receiver operating characteristic curve

Which machine learning task is most suitable to run inside a containerized deployment on a managed Kubernetes cluster?

  • ❏ A. Data preparation

  • ❏ B. Model inference service

  • ❏ C. Model training

  • ❏ D. Data loading

The boutique firm Brightman Harlow Quinn represents enhanced individuals in regulatory and injury matters and an analyst named Rivera is building a machine learning experiment and needs to import tabular data into an Azure Machine Learning dataset using the fewest ingestion steps to accelerate model training. Which data format should Rivera choose to minimize the number of steps when loading into an Azure Machine Learning table?

  • ❏ A. A directory of image files in an Azure Blob storage container

  • ❏ B. A set of newline delimited JSON files in a cloud folder

  • ❏ C. A single CSV file hosted at a public HTTP URL

  • ❏ D. A single Parquet file in Azure Blob Storage using a shared access signature

Scenario: Arcadia Analytics is a data science firm founded after the Arcadia Trust. It is valued at more than thirty-five million dollars and is led by CEO Daniel Pierce. Daniel requested help as his engineering team prepares to use Microsoft Azure Machine Learning, and they are rehearsing how to construct a DataFrame from Row instances and in-memory records using Apache Spark. Which method should they call to create a DataFrame object?

  • ❏ A. Call pandas.DataFrame()

  • ❏ B. Use createOrReplaceTempView() and then query the view

  • ❏ C. Use a DF.create() instance method

  • ❏ D. Call spark.createDataFrame()

Meridian Solutions was founded by Isabel Hart and now has a market value exceeding forty-two million dollars. Ms Hart established the firm soon after launching the Hart Foundation and she has asked you to advise her IT group on their Microsoft Azure Machine Learning deployment. The engineers are unsure about the nature of the driver and the executor programs and they want to know: what type of process do the driver and the executors represent within Azure Machine Learning?

  • ❏ A. Cloud Dataflow

  • ❏ B. Python processes

  • ❏ C. SQL processes

  • ❏ D. Java processes

  • ❏ E. C++ processes

  • ❏ F. JSON processors

A data scientist at a retail analytics startup is using pandas in Python and finds NaN values in a numeric column named “PaymentAmount” inside the DataFrame “sales_df”. If the scientist needs to substitute those missing numeric values with 0.02 directly in the existing DataFrame what single line of code accomplishes this?

  • ❏ A. sales_df.replace(value={"PaymentAmount":0.02}, inplace=True)

  • ❏ B. sales_df.fillna("PaymentAmount"=0.02)

  • ❏ C. sales_df.fillna(value={"PaymentAmount":0.02}, inplace=True)

  • ❏ D. sales_df.dropna(inplace=True)

Nolan’s Burgers is a regional burger chain competing with FryKing and they have engaged you as a consultant for Microsoft Azure machine learning projects. You are leading a meeting about model training and the team plans to use scikit learn to fit a regression model on historical sales records. To ensure the model makes reliable predictions on new transactions what evaluation strategy should they adopt?

  • ❏ A. Apply k fold cross validation with scikit learn to estimate model performance

  • ❏ B. Reserve a randomly selected portion of the data for training and keep a distinct held back portion for testing

  • ❏ C. Train the regression model on the entire dataset and then measure performance on the same observations

  • ❏ D. Select the examples closest to the mean for training and then evaluate on the complete dataset

Aurora Insights is adapting its workflows to Microsoft Azure and lead engineer Maya Rios must attach to an existing CPU based compute cluster named cpu-node-2 by using Azure ML Python SDK v2. Which code snippet should Maya use to connect to the cpu-node-2 compute cluster?

  • ❏ A. cpu_compute_target = "gpu-node-3" cpu_cluster = ml_client.compute.get(cpu_compute_target)

  • ❏ B. cpu_cluster = ml_client.compute.get(cpu_compute_target) cpu_compute_target = "cpu-node-2"

  • ❏ C. cpu_compute_target = "cpu-node-2" cpu_cluster = ml_client.compute.get(cpu_compute_target)

  • ❏ D. cpu_compute_target = "cpu-node-2"

You are a machine learning engineer using the Azure Machine Learning SDK for Python v1 together with notebook based workflows to train models. You have already created a compute target, built an environment, and written a Python training script. Which SDK class should you create to package the script with its environment and submit the training job to the compute target?

  • ❏ A. RunConfiguration

  • ❏ B. ScriptRun

  • ❏ C. Run

  • ❏ D. ScriptRunConfig

QuantumWorks is an advanced dimensional engineering firm based in Marston and it functions as a division of Helixium Inc under the leadership of Daniel Cross. Daniel manages a team of data scientists that includes Ana Velasquez who is a top practitioner at QuantumWorks. Ana must retrieve a dataset from a publicly accessible repository hosted on example.com and load it into a Jupyter notebook inside an Azure Machine Learning workspace for quick experimentation. Which protocol should Ana use to fetch the data in her Jupyter notebook?

  • ❏ A. git

  • ❏ B. azureml

  • ❏ C. abfss

  • ❏ D. http(s)

Harrison Realty Group is a large property firm with holdings across multiple markets and they contracted you to create a machine learning experiment in Azure Machine Learning Designer to predict which houses will sell within the next 45 days. You must choose a classification algorithm that can learn complex non linear relationships among features and generalize well to unseen listings. Which algorithm should you select to model non linear relationships effectively?

  • ❏ A. Two Class Support Vector Machine

  • ❏ B. Boosted Decision Tree classifier for two classes

  • ❏ C. Two Class Logistic Regression

  • ❏ D. Linear Regression Model

When converting an exploratory Jupyter notebook into a reproducible production training script what characteristics typically set script based training apart from notebook based development? (Choose 3)

  • ❏ A. Scripts are deployed to Vertex AI with no modifications

  • ❏ B. Scripts are written for automated repeatable training workflows that prioritize consistency and efficiency

  • ❏ C. Scripts are primarily intended for ad hoc exploratory data analysis

  • ❏ D. Script files execute code sequentially when invoked which provides a deterministic and controlled execution flow compared to interactive notebooks

  • ❏ E. Scripts are generally lean and focus on production code rather than extensive visualizations or experimental notes

At Acme Analytics you run a cloud based machine learning pipeline that uses the Permutation Feature Importance module. When selecting evaluation metrics to assess model performance which metrics can be chosen? (Choose 2)

  • ❏ A. AUC ROC

  • ❏ B. Chi squared statistic

  • ❏ C. Precision

  • ❏ D. Accuracy

DP-100 Sample Questions Answered

Bistro Solace in Brooklyn is a well-regarded neighborhood restaurant founded by Maria Chen and Daniel Ortiz, and they are experimenting with machine learning to improve customer personalization. The team has adopted Microsoft Azure and plans to use Hyperdrive in Azure Machine Learning to tune model hyperparameters, and they cannot enable any early stopping policy because of their audit requirements. The hyperparameter space is a learning rate that can be any value between 0.0005 and 0.2 and a batch size that must be one of 8, 16, or 32. Daniel is deciding which sampling strategy to pick and asks for your recommendation. Which sampling strategies should you recommend to Daniel? (Choose 2)

  • ✓ C. Random sampling

  • ✓ D. Bayesian sampling

Random sampling and Bayesian sampling are correct.

Random sampling works well here because it can natively sample from a continuous range for the learning rate and from a small set of categorical values for the batch size. It does not depend on any early stopping policy so it satisfies the audit requirement that early stopping remain disabled.

Bayesian sampling is also appropriate because it handles mixed parameter types and uses information from previous trials to suggest promising hyperparameter combinations, which is useful when you want to be efficient with trials and you cannot use early stopping.
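
As a rough sketch of how this search space can be declared with the Azure Machine Learning Python SDK v1 Hyperdrive classes, both samplers accept the mixed continuous and categorical space without any early termination policy. The script argument names below are illustrative assumptions.

from azureml.train.hyperdrive import RandomParameterSampling, BayesianParameterSampling, uniform, choice

# Mixed search space with a continuous learning rate and a categorical batch size
search_space = {
    "--learning_rate": uniform(0.0005, 0.2),
    "--batch_size": choice(8, 16, 32),
}

random_sampling = RandomParameterSampling(search_space)      # works without early stopping
bayesian_sampling = BayesianParameterSampling(search_space)  # learns from earlier trials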

Grid sampling is incorrect because the learning rate is a continuous interval and a grid would force an arbitrary discretization or an impractically large number of combinations, making grid search inefficient for this problem.

Hyperband sampling is incorrect because Hyperband relies on early stopping to allocate resources and prune trials, and the requirement forbids enabling any early stopping policy.

When choosing a sampling strategy first identify whether parameters are continuous or categorical and then check whether the optimizer requires early stopping. That will let you eliminate incompatible methods quickly.

Which model evaluation statistic is most commonly referred to as R squared?

  • ✓ C. Coefficient of determination

The correct answer is Coefficient of determination.

Coefficient of determination is the statistic most commonly called R squared. It expresses the proportion of variance in the target variable that is explained by the model and is the standard definition used in statistics and machine learning.

The coefficient of determination is computed as one minus the ratio of the residual sum of squares to the total sum of squares. That formulation makes clear that the metric compares model error to the error of predicting the mean and that higher values indicate better explanatory power up to a maximum of one.
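
As a quick illustration, the following scikit-learn sketch with made up numbers shows that the coefficient of determination matches the one minus residual over total sum of squares formulation described above.

import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([3.0, 5.0, 7.5, 10.0])
y_pred = np.array([2.8, 5.3, 7.0, 9.6])

ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
print(1 - ss_res / ss_tot)                      # coefficient of determination by hand
print(r2_score(y_true, y_pred))                 # same value reported by scikit-learn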

Explained variance score is related because it also measures explained variability but it is not the canonical term R squared. Differences in implementation and handling of bias or multioutput targets can cause the two metrics to differ so the standard name remains the coefficient of determination.

Relative error denotes an error expressed relative to the true value and it is not a measure of explained variance so it is not R squared.

Relative squared error is a normalized squared error metric that compares model squared error to a baseline and it is not the coefficient of determination and therefore it is not called R squared.

When you see R squared on the exam think of the phrase coefficient of determination and look for answers that mention proportion of variance explained.

Which statement correctly describes managed compute clusters in a cloud analytics environment?

  • ✓ D. Managed compute clusters can automatically scale their node count to match workload demand

Managed compute clusters can automatically scale their node count to match workload demand is correct.

Managed clusters are built to adjust capacity as jobs start and finish. The control plane observes resource use and applies autoscaling policies to add or remove worker nodes so the cluster matches workload demand. This approach helps maintain performance while avoiding unnecessary cost compared to leaving capacity fixed.
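
As a hedged sketch of how autoscaling is typically configured, an Azure Machine Learning compute cluster defined with the Python SDK v2 scales between the minimum and maximum node counts you set. The cluster name and VM size below are placeholders.

from azure.ai.ml.entities import AmlCompute

# Cluster scales out to 4 nodes when jobs queue up and back to 0 when idle
cpu_cluster = AmlCompute(
    name="cpu-cluster",               # placeholder name
    size="STANDARD_DS3_v2",           # placeholder VM size
    min_instances=0,
    max_instances=4,
    idle_time_before_scale_down=120,  # seconds before idle nodes are released
)
# ml_client.compute.begin_create_or_update(cpu_cluster)  # submit with an authenticated MLClient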

Compute clusters are limited to CPU only machine types is incorrect because cloud providers offer a range of machine types and hardware accelerators. Clusters can include GPUs or memory optimized instances when analytics workloads need them.

Clusters generally have higher downtime than single compute instances is incorrect because clusters are usually more resilient than a single instance. A distributed cluster can tolerate individual node failures and orchestration systems often replace or reroute work away from failed nodes automatically.

Clusters must be resized manually and do not scale automatically is incorrect because managed analytics clusters commonly support automatic resizing through autoscaling policies. Manual resizing may be available but it is not the only mode of operation for managed services.

When a question pairs managed with automatically or scale think about autoscaling features and choose answers that mention dynamic node counts.

Nova Financial Systems was founded by Clara Boone and has grown into a major independent fintech firm. The team routes their machine learning workloads and training datasets over a virtual private network when running experiments and training models. Why does Nova use a VPN for its machine learning pipelines and model training?

  • ✓ B. To encrypt traffic and confine training resources to a private virtual network

To encrypt traffic and confine training resources to a private virtual network is correct.

A VPN encrypts data in transit and places training instances and datasets on private IP space inside a VPC so that traffic does not traverse the public internet. This protects sensitive financial data and keeps compute and storage resources isolated, which is why Nova routes its machine learning pipelines over a VPN.

To route traffic over encrypted private links such as Cloud VPN or Dedicated Interconnect is incorrect because it mixes mechanisms and implies both are equivalent encrypted links. Dedicated Interconnect provides a private high bandwidth connection but it is not an encryption mechanism by itself, and the question is asking about encrypting and confining training traffic rather than naming link types.

To administer access by assigning IAM roles to users and services is incorrect because IAM controls identity and permissions at the resource level and does not by itself provide network encryption or isolate network traffic into a private virtual network.

To allow developers to run model experiments without relying on cloud connectivity is incorrect because a VPN still depends on network connectivity to the cloud and its purpose is to secure and isolate traffic rather than enable offline operation.

When a question ties networking and machine learning think about the security goal first. Look for answers about encrypting data in transit and keeping resources in a private VPC rather than answers about identity or offline operation.

A data science team at Northbridge Labs needs to connect Azure Machine Learning with other Azure resources and reuse a massively parallel processing data platform that their group uses in other initiatives. They want to pick a compute target that can make use of an existing MPP Spark environment in their Azure workspace. Which compute target should they choose?

  • ✓ D. Synapse Analytics Spark pool

The correct option is Synapse Analytics Spark pool.

Synapse Analytics Spark pool lets Azure Machine Learning attach to and submit jobs to an existing massively parallel processing Spark environment in the workspace so the team can reuse their MPP data platform and integrate with other Azure resources without reconfiguring a separate cluster. It acts as a remote compute target for distributed Spark workloads and is designed for the scale and shared usage patterns that Synapse Spark pools provide.

Azure Machine Learning compute instance is a single user development VM that is intended for interactive development and not for attaching to an MPP Spark environment or running large scale distributed Spark jobs.

An autoscaling compute cluster that grows from three to six nodes would provide more compute than a single instance but it does not reuse an existing Synapse Spark pool or other MPP Spark environment that the team already manages so it does not meet the requirement to connect to the existing platform.

Azure Kubernetes Service is suitable for containerized training and scalable deployments and it is not the right choice when you need to reuse a Synapse or other MPP Spark pool because AKS does not provide a native Spark pool integration.

When a question states reuse of an existing MPP Spark environment look for compute targets that explicitly support Spark integration such as Synapse Spark pools rather than single node instances or generic clusters that cannot attach to the managed Spark service.

A data science team at Harborview Analytics trained a regression model to forecast quarterly revenue and they want to evaluate how accurate its predictions are. Which evaluation metric best matches the description “The mean of absolute differences between predicted and actual values measured in the same units as the target where a smaller value signals better accuracy”?

  • ✓ D. Mean Absolute Error MAE

Mean Absolute Error MAE is correct because it is defined as the mean of the absolute differences between predicted and actual values and it is measured in the same units as the target where a smaller value signals better accuracy.

This metric computes the average of absolute deviations from the ground truth and it is directly interpretable in the original units of the target variable which matches the description in the question.
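
A short scikit-learn sketch with illustrative numbers shows that the metric is simply the average absolute difference, expressed in the same units as the target.

import numpy as np
from sklearn.metrics import mean_absolute_error

actual = np.array([120.0, 150.0, 90.0, 200.0])     # quarterly revenue in target units
predicted = np.array([110.0, 155.0, 100.0, 190.0])

print(np.mean(np.abs(actual - predicted)))         # mean of absolute differences
print(mean_absolute_error(actual, predicted))      # same value from scikit-learn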

Coefficient of Determination R2 is incorrect because it is a unitless measure of the proportion of variance explained by the model and it does not represent an average absolute difference in the target units.

Root Mean Squared Error RMSE is incorrect because although it is measured in the same units it computes the square root of the mean of squared errors and therefore emphasizes larger errors rather than taking the mean of absolute differences.

Mean Absolute Percentage Error MAPE is incorrect because it reports errors as percentages of the actual values and therefore is not measured in the same units as the target and can be unstable when actual values are near zero.

Relative Absolute Error RAE is incorrect because it expresses the total absolute error relative to a baseline such as the error of a naive predictor and it is a relative ratio rather than the simple mean of absolute differences in the target units.

When the question explicitly states mean of absolute differences and refers to the target units choose MAE. If the question mentions percentages then consider MAPE and if it mentions variance explained then consider R2.

You are training a regression model using Contoso Automated Machine Learning and your dataset contains missing values and categorical features with few unique categories. You want the AutoML run to automatically fill missing entries and transform categorical columns as part of the training pipeline. What configuration should you set to guarantee those preprocessing steps occur?

  • ✓ C. Set the featurization parameter to “auto”

The correct option is Set the featurization parameter to “auto”.

Set the featurization parameter to “auto” instructs the Automated Machine Learning run to perform automatic featurization. It will impute missing values and apply appropriate encodings to categorical columns, including handling columns with few unique categories, so you do not need to preprocess those fields manually before training.
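
In the Azure Machine Learning Python SDK v1 this corresponds to the featurization setting on AutoMLConfig. The sketch below is minimal and the dataset variable and label column name are placeholder assumptions.

from azureml.train.automl import AutoMLConfig

# train_ds is assumed to be an existing TabularDataset with missing values and categorical columns
automl_config = AutoMLConfig(
    task="regression",
    training_data=train_ds,        # placeholder dataset
    label_column_name="revenue",   # placeholder target column
    featurization="auto",          # impute missing values and encode categorical columns
    primary_metric="normalized_root_mean_squared_error",
)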

Set the featurization parameter to “disabled” is incorrect because disabling featurization prevents any automatic preprocessing and the run will not fill missing entries or encode categorical features for you.

Set the featurization parameter to “custom” is incorrect because a custom setting requires you to provide specific featurization rules or a custom pipeline, so automatic imputation and encoding are not guaranteed unless you define them.

Set the featurization parameter to “enabled” is incorrect because it does not represent the automatic heuristic driven featurization provided by Set the featurization parameter to “auto” and therefore does not guarantee the same automatic handling of missing values and categorical transformations.

When a setting is labeled auto in AutoML it usually means the service will handle preprocessing and feature engineering for you, so choose it when you want the run to automatically manage missing values and categorical encoding.

The Metro Gazette, located in Riverton, modernized its infrastructure under editor in chief Eleanor Blake and adopted Microsoft Azure for new initiatives. Liam Reed is tasked with building a computer vision workflow to locate and extract the boundaries of several items within a photograph using Azure AutoML. Which type of computer vision model should Liam choose to perform this task effectively?

  • ✓ C. Object detection

The correct answer is Object detection.

Object detection models predict both the class label and the location of each instance in an image by returning bounding boxes and coordinates for every detected object. This makes object detection the appropriate choice when you need to locate and extract the boundaries of several items within a photograph, and Azure AutoML includes object detection as a supported vision task for this purpose.
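
Other questions in this set reference the SDK v2 factory function for this task. A hedged sketch of configuring an AutoML object detection job follows, where the data path and label column name are placeholder assumptions.

from azure.ai.ml import automl, Input

# AutoML vision job that learns to return class labels plus bounding boxes
job = automl.image_object_detection(
    training_data=Input(type="mltable", path="azureml://datastores/photos/paths/train"),  # placeholder path
    target_column_name="label",  # placeholder label column
)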

Multi-label image classification is incorrect because it only assigns one or more labels to the entire image and does not provide locations or bounding boxes for individual items.

Instance segmentation is incorrect for this question because it produces pixel-accurate masks for each object rather than the bounding boxes that are typically used for simple boundary extraction, and the exam scenario expects the object detection approach supported by AutoML.

Multi-class image classification is incorrect because it assigns a single label to the whole image and cannot locate or extract multiple object boundaries within the photo.

When a question asks to locate or extract items in an image look for models that return bounding boxes or coordinates. That phrase usually points to object detection rather than classification.

A regional lender is using an automated machine learning tool to train a natural language processing classifier and they must pick training settings and evaluation procedures that align with responsible AI practices. What considerations should guide their decisions?

  • ✓ C. Emphasize ethical practices by documenting preprocessing steps and preferring interpretable models

The correct answer is Emphasize ethical practices by documenting preprocessing steps and preferring interpretable models.

Emphasize ethical practices by documenting preprocessing steps and preferring interpretable models is correct because documenting preprocessing creates transparency and reproducibility, and it makes it easier to audit data transformations for bias or errors. Preferring interpretable models helps stakeholders and regulators understand predictions, and it supports fair lending practices and actionable remediation when issues are found.

Practical evaluation under this approach includes multiple performance metrics, representative test sets, and fairness checks rather than relying on a single number. It also includes human review and clear records of how data was prepared, so that model behavior can be investigated and explained.

Rely only on accuracy metrics and ignore input preprocessing is wrong because accuracy alone can hide class imbalance and fairness problems, and ignoring preprocessing prevents detection of biased or corrupted inputs.

Choose highly complex neural networks without evaluating interpretability is wrong because highly complex models may perform well but they are often hard to explain, and that lack of interpretability is a poor fit for regulated lending decisions where explanations and accountability are required.

Cloud AutoML is a less suitable choice in this context and it is also an older branding that has largely been consolidated into Vertex AI AutoML. Automated tools can help, but using them without explicit documentation, interpretability checks, and fairness evaluation does not meet responsible AI practices and references to legacy services may be less likely on newer exams.

When you see options about model choices and evaluation favor the ones that mention documentation, interpretability, and multiple evaluation metrics rather than a single accuracy figure.

Real world datasets frequently include missing entries incorrect recordings and sampling biases that can skew analysis results. Which steps are appropriate when inspecting and preparing such data for modeling and evaluation? (Choose 4)

  • ✓ A. Inspect records for misentered or corrupted values

  • ✓ C. Confirm that the dataset size is sufficient to reflect real world variability

  • ✓ D. Identify and impute missing values

  • ✓ E. Check raw data for sampling or measurement bias

The correct options are Inspect records for misentered or corrupted values, Confirm that the dataset size is sufficient to reflect real world variability, Identify and impute missing values, and Check raw data for sampling or measurement bias.

Inspect records for misentered or corrupted values is important because validating individual entries reveals typos, formatting issues, and corrupted fields that can silently break training pipelines or skew metrics. Looking at raw records helps you decide whether simple cleaning or more extensive data repair is needed.

Identify and impute missing values matters because many algorithms cannot handle missing fields and because systematic missingness can bias results. Choosing an imputation strategy requires understanding the missingness mechanism and the impact on downstream tasks.

Check raw data for sampling or measurement bias matters because models trained on biased samples will not generalize to the target population. Assessing how data were collected and measuring differences across subgroups helps you spot bias and decide on correction or evaluation strategies.

Confirm that the dataset size is sufficient to reflect real world variability is necessary because small or nonrepresentative datasets can lead to overfitting and unstable performance estimates. Ensuring enough examples across important segments guides model selection and validation design.
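
A small pandas sketch with hypothetical column names illustrates the inspection and imputation steps described above.

import pandas as pd

df = pd.DataFrame({
    "age": [34, None, 29, 41],
    "income": [52000, 61000, None, 58000],
})

print(df.isna().sum())     # count missing entries per column
print(df.describe())       # spot misentered or extreme values
df["age"] = df["age"].fillna(df["age"].median())          # impute missing ages with the median
df["income"] = df["income"].fillna(df["income"].mean())   # impute missing incomes with the mean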

Use Cloud Dataflow to preprocess and transform the dataset is incorrect because it names a specific tool rather than a fundamental inspection or preparation step. Cloud Dataflow can be used for large scale preprocessing but the question asks about appropriate data inspection and preparation activities rather than a required platform.

Favor answers that describe general data quality and representativeness processes rather than specific tools. Focus first on detecting bias, missingness, and corruption before choosing preprocessing technologies.

After you create an Orion Machine Learning workspace you can manage the assets and compute resources needed to train and deploy models, and the platform provides several types of compute targets for experimentation training and hosting. Which of the following compute resource types can you create inside an Orion Machine Learning workspace? (Choose 3)

  • ✓ A. Managed Kubernetes clusters

  • ✓ C. Compute instances

  • ✓ E. Compute clusters

The correct options are Managed Kubernetes clusters, Compute instances, and Compute clusters.

Managed Kubernetes clusters are provided as a workspace compute target so you can run containerized experiments and training on a Kubernetes environment that the platform manages or provisions for you.

Compute instances are single node compute resources that the workspace creates and manages for interactive development and smaller training runs.

Compute clusters are multi node clusters that the workspace can create for distributed training and larger scale batch workloads.

Virtual machine size is not a separate compute resource type. It describes an instance SKU or machine flavor but the workspace exposes compute resources as instances or clusters rather than a stand alone resource called Virtual machine size.

Google Kubernetes Engine is a specific cloud provider service. The workspace lists managed Kubernetes clusters as a compute type but it does not present a provider specific resource named Google Kubernetes Engine as a workspace compute option.

Inference clusters is not a named compute target in the workspace. Model hosting and inference use the platform's managed compute or endpoints and the term Inference clusters is not used as a distinct resource type.

Focus on the platform level compute types that a workspace can create and manage such as Managed Kubernetes clusters, Compute instances, and Compute clusters instead of vendor specific service names when you answer these questions.

Within the context of the CloudVista machine learning environment simple models with small datasets can usually be trained in a single pass but larger datasets and more complex models require iterative training that repeatedly applies the model to training data compares outputs to expected labels and adjusts parameters until a suitable fit is found. Hyperparameters control how the model is updated during these training iterations and preprocessing denotes transformations applied before data is given to the model. The most common preprocessing step is [__]?

  • ✓ B. Normalize features to lie between zero and one

The correct option is Normalize features to lie between zero and one.

Scaling features to a 0 to 1 range is a very common preprocessing step because it brings all inputs into a common numeric range and helps many learning algorithms train more reliably. Normalization speeds convergence for gradient based optimizers and prevents variables with large numeric ranges from dominating distance or similarity calculations.
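
A minimal scikit-learn sketch with toy values shows min max normalization mapping each feature into the 0 to 1 range.

import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.array([[10.0, 200.0],
              [20.0, 400.0],
              [30.0, 600.0]])

scaler = MinMaxScaler()              # rescales each column to lie between 0 and 1
X_scaled = scaler.fit_transform(X)
print(X_scaled)                      # column-wise (x - min) / (max - min)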

Standardize features to have zero mean and unit variance is a common alternative that centers and scales data and it is often used when values are approximately normally distributed. It is not the answer in this question because the prompt asked for the most common single preprocessing step and normalization to the 0 to 1 range was specified as the correct choice.

Remove anomalous values from the dataset is an important data cleaning task but it is not the single most common preprocessing operation described in the question. Removing outliers changes the dataset composition and is separate from simple feature scaling.

Google Cloud Storage is a storage service and not a data preprocessing step, so it does not answer the question about transformations applied before data is given to a model.

When you see preprocessing questions focus on feature scaling first and consider which method maps values to a fixed range for model training.

At Harbor Data Labs the analytics group uses several techniques in Azure Machine Learning Studio to validate models that predict continuous outcomes. Which algorithm listed below focuses on minimizing the differences between observed values and predicted values and is therefore most appropriate for fitting a linear relationship?

  • ✓ D. Linear Regression

The correct answer is Linear Regression.

Linear Regression fits a straight line relationship between predictors and a continuous outcome by selecting coefficients that minimize the sum of squared differences between observed values and predicted values using ordinary least squares. This focus on minimizing residuals makes it the most appropriate choice when the goal is to model a linear relationship for continuous targets.
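
As a small illustration, the snippet below fits an ordinary least squares line with scikit-learn on made-up data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([2.1, 4.2, 5.9, 8.1])

# LinearRegression chooses coefficients that minimize the sum of squared residuals
model = LinearRegression().fit(X, y)
print(model.coef_, model.intercept_)
```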

Fast Forest Quantile Regression is incorrect because it is an ensemble tree method designed to estimate conditional quantiles rather than to fit a simple linear relationship. It models complex and potentially non linear patterns and is not aimed at minimizing ordinary least squares residuals.

Ridge Regression is incorrect in this context because it is a regularized variant of linear regression that adds an L2 penalty to coefficient sizes. It still targets continuous outcomes but it trades off residual minimization against coefficient shrinkage to control overfitting, so it is not the plain ordinary least squares method the question describes.

Boosted Decision Tree Regression is incorrect because boosted trees build ensembles of decision trees to capture nonlinear relationships and interactions. They do not fit a single linear equation and they optimize loss functions through boosting rather than performing ordinary least squares linear fitting.

Look for keywords such as minimize the differences or least squares to pick ordinary linear regression when the exam asks for the method that fits a linear relationship for continuous outcomes.

Lumen and Oak is an upscale Brooklyn bistro founded by Emma Carter and Daniel Shaw who are exploring ways to streamline their operations. They have adopted Microsoft Azure and you were hired to lead several initiatives. The team published a machine learning solution built with Azure Machine Learning Designer as a real time web service on an Azure Kubernetes Service inference compute cluster and they did not change the deployed endpoint settings. You need to provide the application developers with the values required to invoke the endpoint and you will work with an intern to gather the details. Which values should you supply to the application developers to call the endpoint? (Choose 2)

  • ✓ B. The URL of the web service endpoint

  • ✓ E. The endpoint authentication key

The correct options are The URL of the web service endpoint and The endpoint authentication key.

The URL of the web service endpoint is the network address or scoring URI that client applications must call to send HTTP requests to the deployed model. Without this URL the application does not know where to send inference requests.

The endpoint authentication key is required when the deployed endpoint is secured so that the service accepts requests from authorized clients. The client includes this key in the request header or query string to authenticate each call.
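
A minimal sketch of how a client might call the service with Python requests follows, where the scoring URI, key, and input schema are hypothetical placeholders rather than values from the scenario.

```python
import json
import requests

scoring_uri = "http://<aks-endpoint>/api/v1/service/bistro-model/score"  # hypothetical URL
api_key = "<endpoint-authentication-key>"                                # hypothetical key

headers = {
    "Content-Type": "application/json",
    "Authorization": f"Bearer {api_key}",  # key based authentication for the web service
}
payload = {"Inputs": {"input1": [{"feature1": 1.0, "feature2": 2.0}]}}   # hypothetical schema

response = requests.post(scoring_uri, data=json.dumps(payload), headers=headers)
print(response.json())
```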

The run identifier for the inference pipeline execution is useful for tracking and debugging the pipeline run but it is not required by client code to invoke the live web service.

The name of the Azure Kubernetes Service cluster hosting the service is an infrastructure detail that does not serve as the request URL or an authorization credential and so it is not needed to call the endpoint.

The container image name used for the deployment identifies the container artifact used by the deployment and can help with troubleshooting, but it is not required for making REST calls to the running service.

When preparing to call a real time Azure ML endpoint gather the endpoint URL and the authentication key from the deployed service settings and validate them with a simple curl or Postman request before integrating into your application.

Rita Chen recently joined Aurora Air which is expanding into international routes. The airline wants to label passenger feedback with sentiments such as “negative” “neutral” and “positive”. Rita is using Azure AutoML to build a classification model and she must remove columns that behave like record identifiers because they do not help prediction. Which preprocessing transformation should she apply?

  • ✓ C. Drop fields that have very high cardinality

The correct option is Drop fields that have very high cardinality.

Fields with very high cardinality often behave like record identifiers or near unique keys and they do not help a classification model generalize. Keeping those columns can introduce noise and lead to overfitting so removing them helps Automated ML concentrate on predictive attributes such as the text feedback.
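
As a rough sketch, identifier-like columns can be spotted by counting distinct values per column, as in this made-up example.

```python
import pandas as pd

df = pd.DataFrame({
    "feedback_id": [101, 102, 103, 104, 105],  # near unique, behaves like a record identifier
    "route": ["JFK-LHR", "JFK-LHR", "SFO-NRT", "SFO-NRT", "JFK-LHR"],
    "sentiment": ["positive", "negative", "neutral", "positive", "negative"],
})

# flag non-label columns where every row has a distinct value
id_like = [c for c in df.columns if c != "sentiment" and df[c].nunique() == len(df)]
df = df.drop(columns=id_like)
print(df.columns.tolist())   # ['route', 'sentiment']
```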

Convert categorical variables to numeric representations is a useful preprocessing step when categories are meaningful and have moderate cardinality. Encoding does not solve the problem of identifier like columns and turning near unique categories into numeric features can create many sparse values that still harm the model.

Fill in missing values with imputation methods is appropriate when you have nulls in your data but it does not address columns that act as record identifiers. Imputation will not make an identifier any more predictive for sentiment labeling.

Apply feature engineering to create derived attributes can improve model performance when applied to informative fields such as text or user demographics. Creating derived features from an identifier is unlikely to help so it is better to drop identifier like columns first and then apply feature engineering to the remaining data.

When you see columns with almost unique values per row look for record identifiers and remove them before training. Identifiers do not generalize and removing them often improves automated model selection.

UrbanBite Restaurants is a U.S. quick service chain led by Marco Lema and based in Boulder, Colorado. The company plans to open locations overseas and this has created IT challenges so Marco has asked for your assistance. The immediate goal is to provision a shared data science environment for the analytics team. The training dataset for models exceeds 45 GB in size. Models must be developed using either Caffe2 or Chainer frameworks. Data scientists need to build machine learning pipelines and train models on their personal laptops while online and while offline. Laptops must be able to receive pipeline updates once they reconnect to the network. Which data science environment best meets these requirements?

  • ✓ D. Azure Machine Learning

Azure Machine Learning is the correct choice because it provides a managed, end to end data science platform that supports large training datasets, reproducible environments, and offline capable workflows.

Azure Machine Learning supports large data by using datastores and cloud storage and it can run scalable training on remote compute so the 45 GB dataset can be handled without forcing all work to happen on a laptop. The service lets you define Azure Machine Learning environments with Conda and Docker so custom frameworks such as Caffe2 and Chainer can be installed and reproduced both in the cloud and on developer laptops. It also supports building and versioning pipelines and components and you can run experiments locally with the SDK or package environments so engineers can work offline and then synchronize runs, updated pipelines, and registered models when they reconnect.

Azure Databricks is designed for Spark based big data processing and collaborative analytics but it does not provide the same managed model lifecycle, environment packaging, and offline laptop development and pipeline distribution features that Azure Machine Learning provides.

Azure Kubernetes Service is a container orchestration platform and it does not itself provide data science tooling, model lifecycle management, or an easy offline development and pipeline distribution workflow for individual data scientists.

Azure Machine Learning Designer is a visual, low code interface for building experiments and it can be useful for quick prototyping but it is not as flexible for custom frameworks like Caffe2 or Chainer and it does not address offline laptop development and robust pipeline distribution as fully as the full Azure Machine Learning service.

When a question calls for offline development and custom framework support, choose a managed platform that supports reproducible environments and pipeline or model versioning so artifacts can be synced when developers reconnect.

A data engineer must transfer a large dataset from Contoso Machine Learning Studio into a Weka environment and must convert the files into a format that Weka can read. Which conversion module will best produce a file compatible with Weka?

  • ✓ C. Convert to ARFF

The correct option is Convert to ARFF.

The ARFF format is Weka’s native data format and it includes a header section that defines each attribute and its type as well as the data section that follows. Converting to ARFF produces a file that preserves attribute names and types so Weka can load the dataset directly without additional schema inference or mapping.

The Azure Machine Learning Studio conversion module that generates ARFF files is designed to produce output compatible with Weka and similar tools that expect the ARFF header and data layout. That makes Convert to ARFF the best choice when the target environment is Weka.

Note that Azure Machine Learning Studio classic modules are being superseded by the newer Azure Machine Learning service so exam content may favor current tooling, but the ARFF format remains the correct interoperable choice for Weka.

Export as CSV is not the best answer because CSV files do not include attribute type declarations or a formal header that describes nominal values and data types. Weka can import CSV with extra work or with a CSV loader, but CSV does not guarantee the explicit metadata ARFF provides.

Convert to TFRecord is incorrect because TFRecord is a TensorFlow binary format and it is not readable by Weka without custom conversion. TFRecord is optimized for TensorFlow pipelines rather than traditional Weka workflows.

Export as LIBSVM format is incorrect because LIBSVM uses a sparse input format tailored for libsvm and liblinear. That format does not include the ARFF style attribute metadata that Weka expects and so it is not directly compatible without conversion.

When a question asks for a file that is directly compatible with Weka think ARFF since it carries both attribute declarations and data. If an option mentions a TensorFlow or libsvm format then it is likely not the direct answer.

Ashford Analytics is a data company started by Adam Ashford and it is valued at over forty million dollars. After founding the Ashford Trust Adam became the company chairman and lead technologist. Adam has asked you to assist as his engineering group adopts Azure Databricks for their machine learning workloads. During a team workshop the engineers examined Driver and Executor roles in Databricks and they observed that Spark parallelism uses clusters made up of a Driver and one or more executors. The lead engineer wants to know what type of object submitted work is partitioned into on a Spark Cluster?

  • ✓ D. Jobs

The correct option is Jobs.

When you submit work to Spark the unit that represents that submitted work is a Job. The driver creates a Job for each action and the scheduler decomposes the Job into Stages based on shuffle boundaries and then into tasks that run on executors.

The driver coordinates and schedules Jobs and assigns the tasks for each stage to executors so the work executes in parallel across the cluster.

Arrays is incorrect because Spark does not partition submitted work into arrays. Spark partitions data into RDD or DataFrame partitions and it organizes execution as jobs, stages, and tasks.

Sessions is incorrect because a SparkSession or interactive session provides the runtime context for running work but it is not the object that submitted work is partitioned into.

Stages is incorrect because stages are subdivisions within a Job and not the top level object that represents the submitted work. Stages are produced by the scheduler after a Job is submitted.

Focus on the level asked by the question. A Spark action creates a Job which is then split into stages and tasks. Pay attention to whether the question asks for the top level object or its subdivisions.

Atlas Dataworks is auditing a dataset and a junior analyst asks you to interpret a NumPy array with the shape (3, 25). The analyst is learning how array shapes describe data layouts and requests a simple explanation of what the tuple (3, 25) communicates about the elements inside the array. How would you explain this shape?

  • ✓ D. A two dimensional array made up of three rows each containing 25 elements

The correct answer is A two dimensional array made up of three rows each containing 25 elements.

This means the array has two dimensions and the shape tuple gives the length of each axis. With shape (3, 25) the first value is the number of rows and the second value is the number of elements in each row so there are three rows and each row contains 25 values. The total number of elements is 3 times 25 which equals 75.

In NumPy you index a two dimensional array with two indices such as arr[0, 0] to access the element in the first row and first column. The shape tuple therefore tells you the layout of the data and how to iterate or index into it.
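
A quick illustration in NumPy:

```python
import numpy as np

arr = np.arange(75).reshape(3, 25)   # three rows of 25 elements each

print(arr.shape)   # (3, 25)
print(arr.size)    # 75 total elements
print(arr[0, 0])   # element in the first row and first column
```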

BigQuery is incorrect because it is the name of a Google Cloud product and not a description of an array shape. The question refers to the structure of a NumPy array and not to a database service.

The structure represents a single dimensional sequence with 75 elements is incorrect because although 3 times 25 equals 75 the shape tuple shows two dimensions. A one dimensional array would be represented by a single number such as (75,) rather than a two element tuple.

The array contains three elements whose values are three and twenty five is incorrect because the shape describes counts per axis and not the actual values stored in the array. The tuple (3, 25) describes sizes of dimensions and not element contents.

When you see a shape tuple read it left to right as sizes for each axis and multiply the numbers to get the total number of elements.

A data engineer at a regional bookseller is preparing sales data with pandas in Python and notices entries that repeat the same values in “cust_id” and “order_id”. They must remove duplicate rows so only the first occurrence of each pair is kept. Which pandas call will perform this operation?

  • ✓ C. drop_duplicates(subset=["cust_id", "order_id"], keep='first')

The correct answer is drop_duplicates(subset=["cust_id", "order_id"], keep='first').

drop_duplicates(subset=["cust_id", "order_id"], keep='first') tells pandas to compare only the cust_id and order_id columns when identifying duplicates and to keep the first row for each duplicate pair. Using subset as a list ensures the pair is treated together and keep='first' preserves the initial occurrence for each unique pair.
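
A small sketch with made-up sales rows shows the behavior.

```python
import pandas as pd

sales = pd.DataFrame({
    "cust_id":  [1, 1, 2, 2],
    "order_id": [10, 10, 20, 21],
    "amount":   [9.5, 9.5, 4.0, 4.0],
})

# keep only the first row for each (cust_id, order_id) pair
deduped = sales.drop_duplicates(subset=["cust_id", "order_id"], keep="first")
print(deduped)   # the second (1, 10) row is removed
```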

duplicated(subset=["cust_id", "order_id"]) is not correct because duplicated only returns a boolean mask that marks which rows are duplicates. It does not remove rows by itself and would need to be combined with boolean indexing to drop rows.

drop_duplicates(subset="cust_id", keep='first') is not correct because that only considers the cust_id column. Deduplicating on cust_id alone keeps just one row per customer, so legitimate rows with the same cust_id but different order_id values would be dropped incorrectly.

drop_duplicates(keep='first') is not correct because when subset is omitted pandas compares all columns. That means only rows that are identical across every column will be dropped and duplicates defined by just the cust_id and order_id pair may not be removed if other columns differ.

When you need to remove duplicates based on multiple columns use drop_duplicates with a list for subset and set keep='first' to preserve the earliest row for each group.

A data science team at Meridian Analytics needs to install the Azure Machine Learning CLI extension into their environment. The extension adds command line operations for managing Azure Machine Learning resources. Which software must already be installed before adding this extension?

  • ✓ B. Azure CLI

The correct answer is Azure CLI.

The Azure Machine Learning CLI extension is an add‑on for the Azure CLI and so it requires the Azure CLI to be installed first. The extension registers additional commands under the az command namespace and depends on the Azure CLI core for authentication and command dispatch.

Google Cloud SDK is incorrect because it is the command line tool for Google Cloud and has no role in installing Azure CLI extensions.

Azure PowerShell is incorrect because PowerShell uses a different command framework and module system and it is not the prerequisite for Azure CLI extensions.

Power Apps is incorrect because it is a low code app platform and it is unrelated to installing command line extensions for Azure Machine Learning.

Remember that an extension typically attaches to a base tool so identify the host CLI before choosing an answer. For Azure Machine Learning CLI questions the base tool to look for is the Azure CLI.

Scenario Blue Ridge Analytics is a consulting firm started by Jordan Hale who assembled a small analytics group to support client reporting. The team is analyzing sales records that are kept in a Pandas DataFrame named sales_df. The DataFrame includes the columns year month day total_sales. Which code snippet should be used to compute the average total_sales value efficiently?

  • ✓ C. sales_df['total_sales'].mean()

The correct answer is sales_df['total_sales'].mean().

sales_df['total_sales'].mean() calls the pandas Series mean() method which performs a vectorized computation in optimized code so it computes the arithmetic average efficiently and handles missing values according to pandas semantics.

sales_df['total_sales'].median() is incorrect because the median is the middle value and not the arithmetic mean that the question asks for.

sales_df['total_sales'].average() is incorrect because pandas Series does not have an average() method and attempting to call it will raise an AttributeError. Use the mean() method instead.

mean(sales_df['total_sales']) is incorrect in this context because mean is not a Python built in function, so the call only works if you import a mean function from another library, and it is not the idiomatic pandas approach which is to call the Series mean() method.

When a question asks for an aggregate on a pandas column look for the Series method name like mean() and remember that pandas methods are called as attributes on the Series object.

Scenario: Meridian Bistro in San Francisco was opened by Lena Ortiz and Omar Hale, and they have adopted Microsoft Azure to modernize operations and hired you to lead several IT projects. The current assignment is to use Azure Machine Learning Designer to construct a pipeline that trains a classification model and then make that trained model available as an online service. What steps must be completed before you can deploy the trained model as a service?

  • ✓ B. Create an inference pipeline derived from the training pipeline

Create an inference pipeline derived from the training pipeline is the correct step you must complete before deploying the trained model as an online service.

You derive an inference pipeline so that the pipeline contains only the preprocessing and scoring components that are needed at runtime and so that any training only modules are removed. The inference pipeline is what you convert into a web service endpoint or what you publish for online deployment because it defines the inputs and outputs for scoring and it includes the final trained model and any transformation steps required at inference time.

Register the trained model in the Azure Machine Learning model registry is not the required step in this Designer scenario because Designer deployment workflows focus on converting the pipeline to an inference pipeline and publishing that pipeline as a service. Registering a model is a common pattern in other workflows but it is not the specific prerequisite asked for here.

Add an Evaluate Model module into the original training pipeline is incorrect because evaluation is used to measure model quality during training and it does not produce an inference pipeline or the scoring logic needed for deployment. Evaluation helps decide if a model is good enough but it does not make the model deployable by itself.

Clone the training pipeline and swap the algorithm to a regression learner is wrong because changing to a regression algorithm would change the problem type from classification to regression and it is not a deployment requirement. Deployment requires preparing a scoring pipeline not changing the model type.

When a question mentions deploying a model from the Designer look for steps that create or export an inference pipeline that isolates scoring logic and input output schemas for serving.

Scenario: Marlowe Textiles is a family run retailer with several stores across Greater Manchester and it recently purchased a small fashion label based in Barcelona. As part of the consolidation the company is migrating its systems into Marlowe’s Microsoft Azure environment and the CTO has hired you as an Azure consultant to guide the integration. The current work stream focuses on Azure Machine Learning. The engineering team provisioned an Azure Machine Learning compute target named ComputeA using the STANDARD_D2 virtual machine image. ComputeA is currently idle and has zero active nodes. A developer set a Python variable ws to reference the Azure Machine Learning workspace and then runs this code:

```python
from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import ComputeTargetException

the_cluster_name = "ComputeA"

try:
    the_cluster = ComputeTarget(workspace=ws, name=the_cluster_name)
    print("Step1")
except ComputeTargetException:
    config = AmlCompute.provisioning_configuration(vm_size="STANDARD_DS13_v2", max_nodes=6)
    the_cluster = ComputeTarget.create(ws, the_cluster_name, config)
    print("Step2")
```

The CTO is concerned that the output is not matching expectations and asks whether Step1 will be printed to the console. Will Step1 be printed to the screen?

  • ✓ B. Yes the text Step1 will be printed to the screen

The correct answer is Yes the text Step1 will be printed to the screen.

The call ComputeTarget(workspace=ws, name=the_cluster_name) attaches to an existing compute target when it exists. ComputeA was already provisioned and it can be referenced even when it is idle and has zero nodes. Because the compute target exists the try block succeeds and the code executes the print statement that produces Step1.

No the script will print Step2 instead is incorrect because Step2 is only printed from the except block. That except block runs when a ComputeTargetException is raised for a missing or non reachable compute target. In the given scenario ComputeA exists so the except block is not executed.

An unhandled exception will occur and the program will fail is incorrect because the code is already catching ComputeTargetException and a normal successful lookup does not raise an exception. The program will therefore reach the print in the try block rather than failing.

It depends on workspace authentication and the run may fail if ws is invalid is misleading in this context. Workspace authentication problems could cause other errors, but the question states a valid workspace reference ws and a preprovisioned ComputeA. Under those stated conditions the lookup succeeds and Step1 is printed.

When a question describes a preprovisioned compute target remember that the SDK call to attach or get the target succeeds even if the cluster has zero active nodes. Focus first on what the code does in the try block and only consider authentication or runtime errors if the scenario mentions them or the exception is raised.

When configuring an Automated Machine Learning experiment in Contoso AI which setting is not available to change during model creation?

  • ✓ D. Turn on a native switch to train solely on a fraction of the input dataset

Turn on a native switch to train solely on a fraction of the input dataset is the correct setting that is not available to change during model creation.

Automated Machine Learning workflows require any dataset subsetting or sampling to be handled before the AutoML run starts or through a separate preprocessing step. There is no built in toggle during the model creation UI that simply tells the system to train only on a fraction of the input data and nothing else.

Set the experiment timeout to 90 minutes is incorrect because AutoML configurations let you set a maximum run duration or timeout when you create the experiment.

Select R squared as the main evaluation metric is incorrect because you can choose the evaluation metric for the task type and R squared is available for regression tasks in automated experiment settings.

Limit training to only the XGBoost algorithm is incorrect because AutoML platforms commonly allow you to include or exclude specific algorithms or to provide a custom list so you can restrict the search to XGBoost if you want.

When a question asks what you cannot change during model creation think about whether the action is a data preprocessing step or a model configuration option. If it is about sampling or modifying the dataset then it is often done before the AutoML run and not via a single runtime toggle. Preprocessing is usually separate from the AutoML model creation screen.

Lena Park is a computer vision engineer at Novex Security who is using Azure AutoML with the Azure Machine Learning Python SDK v2 to build a model that finds vehicles in rooftop camera images and returns bounding box coordinates for each detected vehicle. Which AutoML task should she choose so the trained model outputs bounding box coordinates for detected vehicles?

  • ✓ C. azure.ai.ml.automl.image_object_detection

The correct answer is azure.ai.ml.automl.image_object_detection.

The azure.ai.ml.automl.image_object_detection task trains models to locate objects in images and it returns bounding box coordinates along with class labels for each detected object which makes it appropriate for finding vehicles and reporting their box coordinates.
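
A minimal sketch with the SDK v2 follows, where the data asset path, compute name, and experiment name are placeholders rather than values from the scenario.

```python
from azure.ai.ml import automl, Input
from azure.ai.ml.constants import AssetTypes

training_data = Input(type=AssetTypes.MLTABLE, path="azureml:vehicle-train:1")  # hypothetical data asset

job = automl.image_object_detection(
    compute="gpu-cluster",                  # hypothetical compute target
    experiment_name="vehicle-detection",    # hypothetical experiment name
    training_data=training_data,
    target_column_name="label",
    primary_metric="mean_average_precision",
)
# submit with an authenticated MLClient, for example ml_client.jobs.create_or_update(job)
```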

azure.ai.ml.automl.image_instance_segmentation is not correct because it produces pixel level segmentation masks for objects rather than returning bounding box coordinates.

azure.ai.ml.automl.image_classification is not correct because it assigns a single label to an entire image and it does not provide object locations or bounding boxes.

azure.ai.ml.automl.image_classification_multilabel is not correct because it can predict multiple labels for an image but it still does not output bounding boxes for object localization.

When a question asks for object locations or coordinates choose an object detection task rather than classification or segmentation.

ArchForm Studio is an architecture firm in Chicago that was founded by Elena Park and is moving its legacy infrastructure to Microsoft Azure. The team plans to use Horovod to train a deep neural network and has set up Horovod across three servers each with two GPUs to support synchronized distributed training. Within the Horovod training script what is the main purpose of hvd.callbacks.BroadcastGlobalVariablesCallback(0)?

  • ✓ D. To broadcast the initial model variables from rank 0 so every worker begins with identical parameters

The correct answer is To broadcast the initial model variables from rank 0 so every worker begins with identical parameters.

The hvd.callbacks.BroadcastGlobalVariablesCallback(0) makes the process with rank 0 send its model variables to all other processes at the start of training so every worker begins from the same parameter values. This initial broadcast prevents divergence that would occur if workers started from different random initializations and it is commonly used with Keras models when running Horovod for synchronized distributed training.
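
Below is a condensed sketch of the usual Keras pattern with Horovod, using a toy model and random data rather than the team's actual training script.

```python
import numpy as np
import tensorflow as tf
import horovod.tensorflow.keras as hvd

hvd.init()  # one process per GPU across the three servers

model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(4,))])
opt = hvd.DistributedOptimizer(tf.keras.optimizers.SGD(0.01 * hvd.size()))
model.compile(optimizer=opt, loss="mse")

x, y = np.random.rand(256, 4), np.random.rand(256, 1)
model.fit(
    x, y,
    epochs=2,
    verbose=1 if hvd.rank() == 0 else 0,
    # rank 0 broadcasts its initial weights so every worker starts from identical parameters
    callbacks=[hvd.callbacks.BroadcastGlobalVariablesCallback(0)],
)
```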

Use tf.distribute.MirroredStrategy for synchronous GPU training is incorrect because that option refers to TensorFlow native distribution strategies and not to what the Horovod broadcast callback does. MirroredStrategy is an alternative approach but it does not describe the purpose of BroadcastGlobalVariablesCallback.

To average evaluation metrics across participants at the end of each epoch is incorrect because broadcasting global variables only synchronizes model parameters. Averaging metrics is performed with collective operations such as allreduce rather than with a broadcast callback.

To send the final model variables from rank 0 to other processes after training finishes is incorrect because the BroadcastGlobalVariablesCallback is used to synchronize variables at initialization. It is not intended to propagate final model weights after training ends.

Remember that a broadcast callback in Horovod is about initial synchronization from the root rank so focus on aligning model parameters before training when you see this type of callback.

After deploying a model for online inference to an endpoint in a cloud project what is the default method to invoke that endpoint?

  • ✓ C. REST API

REST API is the correct option for invoking an endpoint by default after deploying a model for online inference.

The cloud prediction endpoints are exposed as RESTful services and the primary programmatic interface is the REST API. You call the endpoint with standard HTTP requests to the predict method to obtain online predictions and this is the default protocol used for automated or production traffic.

Python client library is not the default method even though it provides a convenient SDK. The client library is a higher level wrapper that itself calls the underlying REST API.

Cloud Console UI is not the default method for programmatic invocation because the console is intended for manual testing and exploration rather than automated requests.

gcloud CLI is not the default method either because it is a command line tool that wraps the same REST API and is used for scripting and management rather than being the native protocol.

When a question asks how to invoke a deployed endpoint programmatically choose the REST API unless the question explicitly asks for a client library or CLI.

Maya Rivera is the lead for a computer vision initiative at Harborview Research Center and she is working with surveillance footage from a hospital loading zone to train a model with the Azure Machine Learning Python SDK v2 and the objective is to output bounding box coordinates for vehicles detected in the images. Which Azure AutoML image task should she choose to have the trained model produce bounding box coordinates for vehicles?

  • ✓ D. azure.ai.ml.automl.image_object_detection

azure.ai.ml.automl.image_object_detection is correct.

The azure.ai.ml.automl.image_object_detection task is designed to train models that detect and localize objects in images and it returns class labels together with bounding box coordinates for each detected object. Using the Azure Machine Learning Python SDK v2 you select the image object detection task when you need the model to output coordinates for vehicle bounding boxes in surveillance frames.

azure.ai.ml.automl.image_classification is incorrect because image classification predicts a single class label for an entire image or multiple labels for an image and it does not produce bounding box coordinates for object localization.

AutoML Vision is incorrect in this context because it is a generic product name and not the specific AutoML SDK v2 task that yields bounding box outputs. Exams and SDK usage require selecting the explicit object detection task rather than a high level product name.

azure.ai.ml.automl.image_instance_segmentation is incorrect for this question because instance segmentation predicts pixel level masks for each object instance and not the simple bounding box coordinates that the question requests. Although it does provide detailed localization, it is not the direct match for a bounding box output requirement.

When a question asks for predicted bounding box coordinates look for the task name that explicitly mentions object detection rather than classification or segmentation.

Scenario The Pacific Collegiate Wrestling League was founded by promoter Marcus Bell and he is using cloud solutions to modernize the organization. He has asked for guidance on their Microsoft Azure setup. The analytics team is training a classification model and plans to measure performance with k fold cross validation on a small sample of data. The data scientist must select a value for the k parameter which determines how many splits the dataset will be divided into for the cross validation procedure. Which value should they pick?

  • ✓ C. k=10

The correct option is k=10.

k=10 is commonly chosen for small datasets because it balances bias and variance and it makes efficient use of limited data by training on most of the sample in each fold while still providing multiple independent test evaluations. With ten folds each run trains on about 90% of the data and tests on about 10% which yields more stable performance estimates than very small k values.

k=10 does increase computation because the model is trained ten times but this cost is often acceptable for small samples where getting a reliable estimate of generalization is more important than minimizing runtime.
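
For example, with scikit-learn on a small synthetic sample:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=200, random_state=0)   # small sample

# ten folds: each run trains on about 90% of the data and tests on the remaining 10%
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=10)
print(scores.mean(), scores.std())
```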

k=5 is a valid and common choice but it uses fewer folds so its performance estimates can have higher variance on small datasets. It is more appropriate when compute is constrained but it is not the preferred default for very small samples.

k=0.5 is not valid because k must be an integer greater than or equal to two. Fractional values do not define discrete folds and are not accepted by standard cross validation implementations.

k=1 is not valid for k fold cross validation because one fold leaves no independent test split and most libraries require at least two folds. A value of one therefore does not perform cross validation.

When the dataset is small prefer k=10 to reduce variance and make the most of your data and remember that k must be an integer of two or more.

You are moving from a data engineering role into data science and you often encounter the phrase “data wrangling” while working with Google Cloud Vertex AI. How would you describe “data wrangling” in the Vertex AI workflow?

  • ✓ B. Data wrangling is the interactive process of cleaning structuring and enriching raw datasets so they meet the input expectations of a machine learning pipeline

The correct option is Data wrangling is the interactive process of cleaning structuring and enriching raw datasets so they meet the input expectations of a machine learning pipeline.

Data wrangling refers to the hands on work of preparing raw data so machine learning models can consume it. This includes cleaning noisy or missing values, converting types and formats, normalizing or scaling features, joining and reshaping tables, and creating or encoding features. In Vertex AI this step is part of dataset preparation and aims to make the data match the pipeline and model input expectations.

Splitting a dataset into training and evaluation subsets is only one small part of preparing data for models and does not capture the broader interactive cleaning and enrichment activities that define wrangling.

Managing storage access versioning and collaboration for datasets across a team describes dataset management and governance tasks rather than the hands on transformation and cleaning that data wrangling involves.

Using Cloud Dataprep or Dataflow to automate large scale extract transform load jobs names tools and automation that can help with parts of data preparation but it frames wrangling as just using specific services and as purely automated ETL. Data wrangling emphasizes the interactive cleaning and structuring steps that ensure data meets ML input expectations.

When identifying data wrangling on the exam look for options that describe the interactive cleaning structuring and enrichment of data rather than options that mention only a single step or only the tools used.

A data scientist at a small analytics startup is preparing k fold cross validation for a classification model and must choose the number of folds to balance evaluation thoroughness and compute time on a limited dataset. Which k value is most suitable?

  • ✓ C. Set k to 10 folds

The correct answer is Set k to 10 folds.

Set k to 10 folds is a common compromise because it produces a reasonably low variance estimate of model performance while keeping compute proportional to ten training runs instead of one run per example. Ten fold cross validation tends to be more thorough than very small k values and it is far less computationally expensive than leave one out cross validation, so it fits the requirement to balance evaluation thoroughness and compute time on a limited dataset.

Set k to 5 folds is a plausible alternative when compute is tighter because it reduces the number of training runs, but it gives a less thorough assessment and typically has higher variance than ten folds, so it is not the best match for the scenario described.

Use leave one out cross validation with k equal to the number of training examples trains one model per example which is often prohibitively costly and it can produce high variance estimates for many learners, so it does not meet the requirement to balance compute time with thorough evaluation.

Set k to 2 folds is too coarse for reliable performance estimation because each fold uses only half the data for training, which increases variance and yields unstable evaluation results, so it is not suitable when a reasonably thorough assessment is needed.

When a question asks to balance thoroughness and compute, pick the commonly recommended compromise such as k = 10 unless the dataset size or strict compute limits clearly point to a smaller k.

Harborview Loans is a regional mortgage firm with branches across Oregon and it is run by Nora and Alan Pierce. Priya Rao leads a new ML Designer experiment and she needs to wrangle CSV files stored in an Azure Blob Storage container named finance-archive with a folder called pricing-data to build a loan pricing model. She wants the simplest method to access and process the files inside a notebook while minimizing setup steps. What approach should she choose?

  • ✓ C. Open the blobs directly in the notebook by using the blob URI together with a SAS token

The correct option is Open the blobs directly in the notebook by using the blob URI together with a SAS token.

This approach is the simplest because you can generate a time limited SAS token and append it to the blob URI to read CSV files directly from a notebook without extra workspace configuration or data registration. Using the blob URL with a SAS token lets you call pandas.read_csv or stream the blob from Python and it keeps setup minimal for an experiment or ad hoc work.

Using the blob URI together with a SAS token also allows you to limit permissions and expiry on the token so you can grant only the access needed while avoiding longer lived credentials. It is therefore convenient for quick data wrangling in a compute instance while still maintaining access control.
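
A minimal sketch of reading one of the CSV files directly follows, where the storage account name, file name, and SAS token are placeholders.

```python
import pandas as pd

# hypothetical account, blob path, and SAS token generated with read permission
sas_url = (
    "https://harborviewstorage.blob.core.windows.net/"
    "finance-archive/pricing-data/loans.csv"
    "?<sas-token>"
)
pricing_df = pd.read_csv(sas_url)
print(pricing_df.head())
```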

Register the storage account as a datastore in the workspace and create a data asset for the files is not the simplest path because registering a datastore and creating a data asset requires workspace configuration and additional steps that are useful for production or repeatable datasets but they add setup overhead for a quick notebook task.

Use Azure Storage Explorer to download the blob files to the compute instance before working with them is suboptimal because it forces you to copy data and manage local files. This adds manual steps and can lead to duplicated data and slower iteration compared with reading blobs directly from the notebook.

Employ the Azure Machine Learning Python SDK v2 to register and access the data programmatically is more involved because the SDK registration and programmatic access are powerful for reproducible pipelines but they require more code and workspace setup than simply using a SAS URL for ad hoc access.

When you need quick, ad hoc access from a notebook use a SAS token with the blob URL. For repeatable or production workflows register data as a datastore or data asset.

Aurora Club in Meridian City is an upscale venue and it also serves as the headquarters for Damien Voss’s side operations and you were hired to advise their analytics group. The team is building a scikit-learn regression model in Python and the dataset contains several numeric fields plus one text column named ‘ItemCategory’ with values SportsCars Motorbikes Yachts Trucks. They defined a mapping dictionary CategoryCode = { ‘SportsCars’ : 1 , ‘Motorbikes’ : 2 , ‘Yachts’ : 3 , ‘Trucks’ : 4 } and they intend to create a numeric column with dataset[ ‘CategoryCode’ ] = dataset[ ‘ItemCategory’ ][?] They will then use CategoryCode with other numeric features to fit LinearRegression. What expression should replace [?] to transform the text categories into numeric values that scikit-learn can accept?

  • ✓ B. map(CategoryCode)

The correct option is map(CategoryCode).

Use map(CategoryCode) because pandas Series.map accepts a dictionary and returns a new Series where each category string is replaced by the corresponding integer. Assigning that Series to dataset['CategoryCode'] produces a numeric column that scikit-learn LinearRegression can accept as an input feature.
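
Using the values from the scenario, the mapping looks like this.

```python
import pandas as pd

dataset = pd.DataFrame({"ItemCategory": ["SportsCars", "Yachts", "Trucks"]})
CategoryCode = {"SportsCars": 1, "Motorbikes": 2, "Yachts": 3, "Trucks": 4}

# Series.map performs a dictionary lookup for every value in the column
dataset["CategoryCode"] = dataset["ItemCategory"].map(CategoryCode)
print(dataset)
```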

apply(CategoryCode) is incorrect because Series.apply expects a function or callable to apply to each element and passing a dictionary will not perform the simple key lookup mapping that map provides. apply is for custom element wise operations rather than direct dictionary replacement.

gen(CategoryCode) is incorrect because there is no standard pandas method named gen and attempting to use it would raise an attribute error or fail to exist on the Series object.

transpose(CategoryCode) is incorrect because transpose only changes the orientation of an array or DataFrame and it does not map string categories to integers. It also does not accept a mapping dictionary for this purpose.

When converting string categories to numeric values in pandas use Series.map with a dictionary for direct mappings and consider Series.astype('category').cat.codes when you want compact integer codes managed by pandas.

Scenario The Aurora Lounge is Riverton’s upscale nightclub and a discreet front for a local syndicate. You are acting as a consultant to improve their analytics processes. The team is working with a Python DataFrame named revenue_df and they need to convert it from a wide layout to a long layout using pandas.melt. The wide DataFrame contains the columns shop 2019 2020 and the rows 0 StoreA 40 30 1 StoreB 70 80 2 StoreC 52 58. The expected long format should have the columns shop year value and one row per store per year. A developer left placeholders [A] [B] and [C] in this snippet import pandas as pd revenue_df = pd.melt([A], id_vars='[B]’, value_vars=[C]) Which arguments should replace the placeholders so that the code performs the intended unpivoting operation?

  • ✓ A. [A] revenue_df, [B] shop, [C] ["2019", "2020"]

The correct option is [A] revenue_df, [B] shop, [C] ["2019", "2020"].

The first argument to pandas.melt should be the DataFrame to unpivot so it must be the variable revenue_df. The id_vars argument should be the identifier column name that stays as is which in this case is 'shop'. The value_vars argument should be a list of the year columns to melt into rows which are '2019' and '2020'. By default pandas.melt will produce a 'variable' column and a 'value' column and you can rename those with var_name and value_name if you need 'year' and 'value'.
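
Putting the arguments together, with var_name and value_name supplied so the output columns are named year and value:

```python
import pandas as pd

revenue_df = pd.DataFrame({
    "shop": ["StoreA", "StoreB", "StoreC"],
    "2019": [40, 70, 52],
    "2020": [30, 80, 58],
})

long_df = pd.melt(
    revenue_df,
    id_vars="shop",
    value_vars=["2019", "2020"],
    var_name="year",      # rename the default 'variable' column
    value_name="value",
)
print(long_df)   # one row per store per year
```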

[A] dataFrame, [B] shop, [C] ["2019", "2020"] is incorrect because the first placeholder must be the actual DataFrame variable revenue_df. Using a generic name like dataFrame will fail unless that exact variable exists and it is not the intended object in the question.

[A] revenue_df, [B] StoreA, StoreB, StoreC, [C] ["year"] is incorrect because id_vars should be the column name that identifies rows such as 'shop' and not the row values StoreA and StoreB. Also value_vars must list the source columns to unpivot like '2019' and '2020' and not a single string 'year'.

[A] bigquery, [B] value, [C] "shop" is incorrect because the first argument must be the DataFrame revenue_df and not bigquery. The id_vars should be the identifier column 'shop' and not 'value'. The value_vars should be a list of columns to melt and not the single string 'shop'.

When using pandas.melt remember to pass the actual DataFrame as the first argument and use id_vars to keep identifier columns and value_vars to list the columns to unpivot. Use var_name and value_name to set friendly output column names.

In a Jupyter notebook you have a registered Workspace object named project_ws. Which call retrieves the workspace default datastore?

  • ✓ B. project_ws.get_default_datastore()

The correct option is project_ws.get_default_datastore().

This Workspace method returns the workspace default Datastore object which you can use to read and write files and to connect datasets and compute against the workspace storage.
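
A short sketch with the v1 SDK, assuming the workspace is loaded from a local config file:

```python
from azureml.core import Workspace

project_ws = Workspace.from_config()              # or any existing Workspace object
default_ds = project_ws.get_default_datastore()   # returns the default Datastore
print(default_ds.name)
```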

project_ws.get_dataset() is incorrect because the Workspace API does not expose a method by that name and dataset retrieval is handled through the Dataset APIs rather than a workspace method for datastores.

project_ws.default() is incorrect because there is no such method on the Workspace class and it will not return the default datastore.

project_ws.find() is incorrect because this is not the documented way to retrieve the workspace default datastore and it does not return the Datastore object that get_default_datastore does.

When you are unsure check the Workspace class reference in the SDK to confirm exact method names and return types and remember that methods containing datastore usually return Datastore objects.

Scenario: Meridian Biotech analytics team led by Ana Rivera and James Park are deploying Azure Machine Learning to improve hiring model fairness and performance. They are using Grid Search to tune a binary classifier that predicts whether applicants will be hired. They want the classifier to select equal proportions of candidates from each category in the Gender attribute. Which parity constraint should they enforce to obtain equal selection rates across the gender groups?

  • ✓ C. Demographic parity

The correct option is Demographic parity.

Demographic parity requires that the model select the positive outcome at the same rate for each group defined by the sensitive attribute. In the hiring context this means equal proportions of candidates are chosen from each Gender group, which matches the team's stated goal of equal selection rates.
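
One common way to enforce this constraint is with the open source Fairlearn package, which Azure Machine Learning's fairness tooling builds on. The sketch below uses synthetic data and is only an illustration of the GridSearch with DemographicParity pattern, not the team's actual pipeline.

```python
import numpy as np
import pandas as pd
from fairlearn.reductions import DemographicParity, GridSearch
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(200, 3)), columns=["f1", "f2", "f3"])
y = rng.integers(0, 2, size=200)                       # hired / not hired (synthetic)
gender = pd.Series(rng.choice(["F", "M"], size=200))   # sensitive attribute (synthetic)

sweep = GridSearch(
    LogisticRegression(max_iter=1000),
    constraints=DemographicParity(),   # enforce equal selection rates across groups
    grid_size=20,
)
sweep.fit(X, y, sensitive_features=gender)
candidates = sweep.predictors_   # mitigated models to compare for fairness and accuracy
```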

True positive rate parity is not correct because it concerns matching the rate of correctly predicted positives across groups rather than matching the overall selection rate. That metric equalizes recall and does not guarantee equal proportions chosen.

Error rate parity is not correct because it focuses on equalizing overall error or misclassification rates across groups instead of matching positive selection rates. Equal errors does not imply equal selection proportions.

False positive rate parity is not correct because it aims to equalize the rate of false positives across groups. That controls one type of error but does not ensure the same fraction of candidates are selected in each group.

Equalized odds is not correct because it requires both true positive rates and false positive rates to be equal across groups. This is a stricter condition and it still may not produce equal overall selection rates unless base rates happen to align.

When a question asks for equal selection rates or equal proportions chosen across groups, pick demographic parity. Read the wording carefully to separate selection rate goals from error or recall based goals.

You are advising Sentinel Analytics and you are meeting with Maria who leads the data engineering team about Azure Machine Learning. The group plans to deploy models through batch endpoints as part of an ETL workflow and they must create the deployment definition. Which class should you recommend for building the deployment definition?

  • ✓ D. BatchDeployment

The correct option is BatchDeployment.

The BatchDeployment class in the Azure Machine Learning v2 SDK is designed to build deployment definitions for batch endpoints. It lets you specify the model or asset to deploy, the environment and dependencies, the target compute, the entry script and inputs and outputs, and other batch specific settings that are needed for ETL or scheduled inferencing.
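
A minimal sketch with the v2 SDK follows, where the endpoint, model, and compute names are hypothetical placeholders.

```python
from azure.ai.ml.entities import BatchDeployment

deployment = BatchDeployment(
    name="etl-scoring",
    endpoint_name="nimbus-batch-endpoint",   # hypothetical batch endpoint
    model="azureml:etl-model:1",             # hypothetical registered model
    compute="cpu-cluster",                   # hypothetical compute cluster
    instance_count=2,
    max_concurrency_per_instance=2,
    mini_batch_size=10,
    output_file_name="predictions.csv",
)
# create it with an authenticated MLClient, for example
# ml_client.batch_deployments.begin_create_or_update(deployment)
```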

Pipeline is not correct because a pipeline is used to orchestrate multi step workflows and experiments rather than to define a deployment for a batch endpoint.

OnlineDeployment is not correct because that class targets real time or low latency endpoints and is used for serving online predictions rather than for batch scoring.

ParallelRunConfig is not correct and is considered a legacy pattern for batch inferencing in older SDK versions. It has been superseded by the v2 batch deployment model and is less likely to appear on current exams.

When you see questions about batch versus real time deployments think about the endpoint type first. BatchDeployment maps to bulk or scheduled scoring and OnlineDeployment maps to low latency serving.

While configuring a training run in Contoso Machine Learning you may pick a custom or prebuilt runtime environment. Which additional resource must you also define to specify where the training will run?

  • ✓ C. A designated compute target for executing the training job

The correct option is A designated compute target for executing the training job.

You choose a runtime environment to define the software and libraries that the training will use but you must also specify where that software will actually run. The designated compute target is the resource that executes the training workload and it can be a managed training service, a specific VM type, or a cluster with GPUs or TPUs depending on the needs of the job.

Artifact Registry is incorrect because it is a place to store container images and packages and it does not itself execute training jobs.

Object storage bucket is incorrect because buckets hold datasets and model artifacts but they are not the compute resource that runs training.

Operating system image is incorrect because most modern training setups run inside containers or managed runtimes and the OS image is not the primary resource you must define when selecting where to execute a training job.

When you pick a runtime remember to also pick the compute target that will run the job and check quotas and machine types before submitting the training run.

An analytics team at Nimbus Data needs to register a Synapse Spark pool as an attached compute resource from the Azure Machine Learning Studio compute creation wizard. What sequence of steps correctly registers the Spark pool?

  • ✓ C. Select the existing Synapse workspace then choose the Spark pool within that workspace then enable the managed identity for the compute resource then save the compute and use Synapse Studio to grant the managed identity the Synapse Administrator role

The correct option is Select the existing Synapse workspace then choose the Spark pool within that workspace then enable the managed identity for the compute resource then save the compute and use Synapse Studio to grant the managed identity the Synapse Administrator role.

Registering an attached Synapse Spark pool requires you to pick the Synapse workspace first and then choose the specific Spark pool inside that workspace. You must enable the managed identity for the compute resource so Azure ML can use that identity. After you save the compute, you grant the managed identity the Synapse Administrator role from Synapse Studio so the compute has the permissions it needs to run jobs and access workspace artifacts.

Select a Spark pool then attach it to a newly created Synapse workspace then enable the compute managed identity and after saving go to Synapse Studio to give the managed identity the Synapse Administrator role is incorrect because you do not attach a Spark pool by first creating a new workspace in the Azure ML compute wizard. The correct flow selects the existing workspace and then the pool inside it rather than attaching a pool to a newly created workspace.

Choose an existing Synapse workspace then enable a workspace managed identity and assign that identity the Azure ML Administrator role is incorrect because enabling the workspace managed identity and assigning the Azure ML Administrator role is not the required step. The compute resource needs its managed identity enabled and you must grant that identity the Synapse Administrator role so the compute can operate within Synapse.

Pick an existing Spark pool then enable a managed identity for the Synapse workspace and assign that identity the Azure ML Administrator role is incorrect because the managed identity must be enabled for the compute resource and the role required is Synapse Administrator. Assigning the Azure ML Administrator role to the workspace managed identity does not give the compute the necessary Synapse permissions.

When attaching a Synapse Spark pool make sure to enable the compute managed identity and then assign the Synapse Administrator role to that identity from Synapse Studio before trying to run jobs.

The Midtown Chronicle is a regional Chicago newspaper based in the Lakeside Building and its lead developer Alex Chen is building MLOps to automate workflows. Alex wants to ensure that model training starts automatically whenever the code repository receives proposed changes. What action should Alex take to enable this automation?

  • ✓ C. Open a pull request in the repository hosting service

Open a pull request in the repository hosting service is correct.

When you Open a pull request in the repository hosting service the hosting service emits an event that CI systems and pipeline triggers can respond to and start automated model training. Many MLOps pipelines are configured to run validation training and tests on proposed changes so the team can catch regressions before merging.

For example you can configure Cloud Build or another CI tool to trigger on pull request events and then invoke Vertex AI pipelines or training jobs. That pattern ensures proposed code and model changes are tested automatically while keeping the main branch stable.

Configure a push webhook on the repository to notify the pipeline endpoint is not the best answer because a push webhook triggers on push events and does not specifically represent proposed changes made via a pull request. A push webhook could run builds for branch pushes but it will not automatically cover the pull request lifecycle unless you explicitly configure those events.

Create a new feature branch in the repository is not sufficient to start automated training by itself because creating a branch does not send an event to CI systems unless you push commits or open a pull request. Branch creation alone usually requires an additional action to trigger automation.

Clone the repository to a local development system is incorrect because cloning is a manual operation and it does not trigger remote CI pipelines. Local development does not cause cloud training to start automatically.

When a question mentions proposed changes think about repository events such as pull requests because CI and MLOps automation typically react to those events rather than to local clones or branch creation alone.

Helix Analytics is the research division of Meridian Biotech and it is overseen by Ava Carter and Jonah Reed. They intend to use Microsoft Machine Learning to improve operational outcomes and they have asked you to advise on privacy practices. While chatting over coffee Ava wants a concise explanation of how differential privacy protects individuals in published summaries. How would you briefly explain differential privacy to Ava?

  • ✓ B. Inject random noise into analytic outputs so aggregated metrics reflect the dataset yet vary unpredictably

The correct option is Inject random noise into analytic outputs so aggregated metrics reflect the dataset yet vary unpredictably.

This approach means adding carefully calibrated random noise to query answers so results show accurate aggregates while limiting how much any one person's data can change the output in a detectable way. Differential privacy creates a mathematical bound on how much influence an individual's data can have on published summaries and that bound is controlled by a privacy parameter often called epsilon. The method trades a small amount of accuracy for a strong privacy guarantee and it works across repeated analyses when a privacy budget is managed.
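
A toy sketch can make the idea concrete. This is not a production mechanism, just the classic textbook construction of Laplace noise scaled by sensitivity over epsilon applied to a simple count.

```python
# Toy illustration of the differential privacy idea, not a production library.
import numpy as np

def noisy_count(records, epsilon=1.0, sensitivity=1.0):
    """Return a count with Laplace noise. Smaller epsilon means more noise and more privacy."""
    noise = np.random.laplace(loc=0.0, scale=sensitivity / epsilon)
    return len(records) + noise

patients = ["a", "b", "c", "d", "e"]
print(noisy_count(patients, epsilon=0.5))  # varies unpredictably around the true count of 5
```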

Homomorphic encryption is incorrect because it allows computation on encrypted data without revealing raw values but it does not itself provide the statistical privacy guarantee that differential privacy gives when publishing aggregated outputs.

Substitute numeric entries with their column average for analysis is incorrect because simple averaging or masking can still leak information and it does not provide a formal privacy guarantee or a controllable privacy loss parameter like differential privacy.

Google Cloud Data Loss Prevention is incorrect because that is a tool for discovering and redacting sensitive data and it is not the definition of differential privacy. It is also a product specific to Google Cloud and not the privacy mechanism described by the correct choice.

When a question asks about protecting published summaries look for mentions of adding calibrated noise or a privacy parameter such as epsilon because those phrases usually signal differential privacy rather than encryption or simple masking.

A data scientist at Aurora Insights has completed a binary classification model and they will use precision as the primary evaluation metric. Which visualization technique best shows how precision changes across different classification thresholds?

  • ✓ C. Precision recall curve

The correct answer is Precision recall curve.

A Precision recall curve is the best choice because it directly shows how precision varies as you change the classification threshold. The curve is produced by varying the decision threshold and computing precision and recall at each point so you can see the trade off and where precision rises or falls as the threshold moves.
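
For example, scikit-learn can compute the points of such a curve from predicted probabilities. The synthetic, imbalanced data and the logistic regression model below are only illustrative.

```python
# Synthetic data and a simple model just to produce the curve points.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, weights=[0.85, 0.15], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
scores = model.predict_proba(X_test)[:, 1]

# precision[i] is the precision achieved when classifying at thresholds[i]
precision, recall, thresholds = precision_recall_curve(y_test, scores)
for p, t in list(zip(precision, thresholds))[:5]:
    print(f"threshold={t:.2f} precision={p:.2f}")
```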

Violin plot is incorrect because it visualizes the distribution of a numeric variable and it does not show classifier performance across thresholds.

Calibration curve is incorrect because it compares predicted probabilities to observed outcome frequencies to assess probability calibration and it does not plot precision as the threshold changes.

Receiver operating characteristic curve is incorrect because it plots true positive rate against false positive rate across thresholds and it does not display precision. The ROC can also be less informative than a precision recall curve when classes are imbalanced.

When a question asks how a metric changes with threshold look for plots that are built by varying the decision threshold. A precision recall curve directly shows changes in precision as the threshold moves.

Which machine learning task is most suitable to run inside a containerized deployment on a managed Kubernetes cluster?

  • ✓ B. Model inference service

The correct answer is Model inference service.

Running a model inference service inside a container on a managed Kubernetes cluster is a natural fit because inference workloads are typically stateless and require low latency and predictable scaling. Kubernetes provides autoscaling, load balancing, rolling updates and resource isolation that suit serving containerized models in production.

Containers make it easy to package the model code and its dependencies so the same image can be deployed across environments. A managed Kubernetes cluster also allows you to attach GPU nodes when needed and to integrate health checks and observability for continuous serving.

Data preparation is usually a batch oriented ETL step that runs as part of a data pipeline. It is better served by managed data processing tools and pipeline runners that handle long running jobs and large scale transformations.

Model training often requires distributed compute, long running GPU instances and specialized orchestration for checkpoints and hyperparameter tuning. Managed training services or dedicated training clusters are typically a better fit than a production inference deployment.

Data loading is an ingestion or streaming task that belongs in storage and pipeline services. It does not usually match the stateless, low latency serving pattern that you deploy as a containerized service on Kubernetes.

When a question mentions containerized deployment and managed Kubernetes look for workloads that are stateless, require low latency, and benefit from horizontal scaling. Those clues usually point to model inference.

The boutique firm Brightman Harlow Quinn represents enhanced individuals in regulatory and injury matters and an analyst named Rivera is building a machine learning experiment and needs to import tabular data into an Azure Machine Learning dataset using the fewest ingestion steps to accelerate model training. Which data format should Rivera choose to minimize the number of steps when loading into an Azure Machine Learning table?

  • ✓ C. A single CSV file hosted at a public HTTP URL

A single CSV file hosted at a public HTTP URL is correct.

Azure Machine Learning can create a tabular dataset directly from a single public CSV by providing the file URL to the dataset factory, so Rivera can load the table with the fewest ingestion steps and without creating storage mounts or generating credentials.
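
With the SDK v1 dataset factory this is only a couple of lines. The URL below is a placeholder for the public file, and in SDK v2 the equivalent would be a data asset or MLTable definition.

```python
# SDK v1 sketch; the URL is a placeholder for the public CSV.
from azureml.core import Dataset

csv_url = "https://example.com/data/listings.csv"
tabular_ds = Dataset.Tabular.from_delimited_files(path=csv_url)
df = tabular_ds.to_pandas_dataframe()
print(df.head())
```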

A directory of image files in an Azure Blob storage container is incorrect because images are not tabular data and building a table would require additional preprocessing to extract features or labels before creating a tabular dataset.

A set of newline delimited JSON files in a cloud folder is incorrect because newline delimited JSON usually needs parsing and normalization to a tabular schema, so ingesting those files typically requires extra conversion steps compared with a single delimited CSV.

A single Parquet file in Azure Blob Storage using a shared access signature is incorrect for this question because although Parquet is a supported columnar format it often requires ensuring valid storage credentials or generating a shared access signature, which introduces additional steps compared with a single public CSV URL.

Choose the option that requires the least authentication and the least preprocessing when the question asks to minimize ingestion steps. A single public file in a tabular delimited format is usually the fastest path into an Azure ML tabular dataset.

Scenario: The company Arcadia Analytics is a data science firm founded after the Arcadia Trust and it is valued at more than thirty-five million dollars and it is led by CEO Daniel Pierce. Daniel requested help as his engineering team prepares to use Microsoft Azure Machine Learning and they are rehearsing how to construct a DataFrame from Row instances and in-memory records using Apache Spark. Which method should they call to create a DataFrame object?

  • ✓ D. Call spark.createDataFrame()

Call spark.createDataFrame() is correct. This SparkSession method creates a distributed DataFrame from an RDD of Row objects, from a list of Row instances, or from other in-memory records and it can take an explicit schema or infer one when appropriate.

Using Call spark.createDataFrame() is the standard way in PySpark to build a DataFrame from Row instances or Python collections because the method constructs the proper Spark SQL schema and distributes the data across the cluster for processing.
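
A minimal PySpark sketch of the pattern, with made up rows, looks like this.

```python
from pyspark.sql import Row, SparkSession

spark = SparkSession.builder.appName("arcadia-demo").getOrCreate()

rows = [
    Row(customer="Ava", total=42.50),
    Row(customer="Noah", total=17.25),
]

df = spark.createDataFrame(rows)  # builds a distributed DataFrame from in-memory Rows
df.show()
```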

Call pandas.DataFrame() is incorrect because that function builds a local pandas DataFrame in memory and it does not produce a Spark DataFrame directly. You can convert a pandas DataFrame into a Spark DataFrame by passing it to spark.createDataFrame, but the question asks which method creates a Spark DataFrame from Row instances or in-memory records.

Use createOrReplaceTempView() and then query the view is incorrect because createOrReplaceTempView registers a DataFrame as a temporary SQL view for queries and it does not create a DataFrame from raw Row objects. You must already have a DataFrame before you call createOrReplaceTempView.

Use a DF.create() instance method is incorrect because there is no DF.create method in the Spark DataFrame API. The correct factory method for constructing DataFrames from in-memory data is spark.createDataFrame.

Remember that DataFrame constructors live on the SparkSession in PySpark so look for spark.createDataFrame when the question asks how to build a Spark DataFrame from Rows or in-memory records.

Meridian Solutions was founded by Isabel Hart and now has a market value exceeding forty-two million dollars. Ms Hart established the firm soon after launching the Hart Foundation and she has asked you to advise her IT group on their Microsoft Azure Machine Learning deployment. The engineers are unsure about the nature of the driver and the executor programs and they want to know what type of process the driver and the executors represent within Azure Machine Learning?

  • ✓ D. Java processes

Java processes is the correct answer. The Spark driver and executors run as JVM based Java processes when you run Spark workloads in Azure Machine Learning environments that use Spark or Databricks clusters.

The driver is the main control process and the executors perform the distributed tasks and they are implemented on the Java Virtual Machine so they run as Java processes. If you use PySpark your Python code communicates with the JVM via a bridge such as Py4J, but the actual task execution on the executors happens inside Java processes on the cluster.

Cloud Dataflow is not a type of process. It is a managed data processing service on Google Cloud and it does not describe the runtime process model for Spark driver and executors in Azure.

Python processes is incorrect because while user code can be written in Python with PySpark the driver and executors execute on the JVM so they are Java processes rather than native Python processes.

SQL processes is incorrect because SQL is a query language and not a descriptor of the underlying process type for driver and executor programs.

C processes is incorrect because Spark is based on the JVM and does not run its driver and executors as native C programs in typical deployments.

JSON processors is incorrect because JSON is a data interchange format and not a process type for driver and executor programs.

When you see driver and executor in a question think of Apache Spark and the JVM rather than the programming language you used to author the job.

A data scientist at a retail analytics startup is using pandas in Python and finds NaN values in a numeric column named “PaymentAmount” inside the DataFrame “sales_df”. If the scientist needs to substitute those missing numeric values with 0.02 directly in the existing DataFrame what single line of code accomplishes this?

  • ✓ C. sales_df.fillna(value={"PaymentAmount":0.02}, inplace=True)

The correct answer is sales_df.fillna(value={"PaymentAmount":0.02}, inplace=True).

This call uses DataFrame.fillna with a dictionary that maps the column name to the replacement value and it uses inplace=True so the original DataFrame is updated. fillna will replace NaN entries in the numeric PaymentAmount column with 0.02 without altering other columns.
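
A small self contained example with made up values shows the effect.

```python
import numpy as np
import pandas as pd

sales_df = pd.DataFrame({"PaymentAmount": [19.99, np.nan, 5.00, np.nan],
                         "OrderId": [1, 2, 3, 4]})

# Replace NaN only in PaymentAmount and modify the existing DataFrame
sales_df.fillna(value={"PaymentAmount": 0.02}, inplace=True)
print(sales_df)
```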

sales_df.replace(value={"PaymentAmount":0.02}, inplace=True) is incorrect because replace as written does not target NaN specifically. To replace missing values with replace you would need to specify the value to replace such as numpy.nan or use a different mapping. The provided call will not reliably substitute NaN with 0.02.

sales_df.fillna("PaymentAmount"=0.02) is incorrect because it is not valid Python syntax and fillna expects the replacements to be passed with value= or as a dictionary rather than using a quoted name as a keyword. This form would also omit inplace so it would not update the original DataFrame.

sales_df.dropna(inplace=True) is incorrect because dropna removes rows that contain missing values rather than substituting them with a specified value. That would delete records instead of filling PaymentAmount with 0.02.

Use fillna with a dictionary to target specific columns and pass inplace=True when you want to modify the existing DataFrame rather than creating a copy.

Nolan’s Burgers is a regional burger chain competing with FryKing and they have engaged you as a consultant for Microsoft Azure machine learning projects. You are leading a meeting about model training and the team plans to use scikit learn to fit a regression model on historical sales records. To ensure the model makes reliable predictions on new transactions what evaluation strategy should they adopt?

  • ✓ B. Reserve a randomly selected portion of the data for training and keep a distinct held back portion for testing

Reserve a randomly selected portion of the data for training and keep a distinct held back portion for testing is the correct choice.

Reserving a randomly selected training set and keeping a separate held back test set gives you an unbiased estimate of how the model will perform on new transactions. The held back test set must not be used during model fitting or hyperparameter tuning so it accurately reflects generalization to unseen data.

You can perform model selection and hyperparameter tuning with resampling inside the training data, but the final assessment should always use the distinct test set that was kept separate from training.
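
A short scikit-learn sketch of the hold out pattern, using synthetic data in place of the historical sales records, looks like this.

```python
# Synthetic data stands in for the historical sales records.
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=500, n_features=5, noise=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

model = LinearRegression().fit(X_train, y_train)   # fit on the training portion only
print(r2_score(y_test, model.predict(X_test)))     # score on the held back portion
```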

Apply k fold cross validation with scikit learn to estimate model performance is not the best single answer here because cross validation is useful for estimating and selecting models but it does not replace a final held out test set for an unbiased evaluation. Cross validation can be used inside the training data for tuning, but you still need a distinct test set for the final performance measure.

Train the regression model on the entire dataset and then measure performance on the same observations is wrong because evaluating on the same data used for training produces overly optimistic metrics and does not reveal whether the model will generalize to new data. This approach risks severe overfitting and gives a misleading view of real world performance.

Select the examples closest to the mean for training and then evaluate on the complete dataset is incorrect because that sampling strategy creates a biased training set that is not representative of the full distribution. Training on examples near the mean will harm the model's ability to learn edge cases and will not produce reliable predictions on the diverse transactions in the full dataset.

Preserve a held out test set for the final evaluation and use cross validation only within the training data for model selection.

Aurora Insights is adapting its workflows to Microsoft Azure and lead engineer Maya Rios must attach to an existing CPU based compute cluster named cpu-node-2 by using Azure ML Python SDK v2. Which code snippet should Maya use to connect to the cpu-node-2 compute cluster?

  • ✓ C. cpu_compute_target = "cpu-node-2" cpu_cluster = ml_client.compute.get(cpu_compute_target)

cpu_compute_target = "cpu-node-2" cpu_cluster = ml_client.compute.get(cpu_compute_target) is correct.

This snippet assigns the compute target name to cpu_compute_target and then calls ml_client.compute.get with that name so the Azure ML SDK v2 retrieves the existing CPU compute cluster and assigns it to cpu_cluster.
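
In context the call would look roughly like the following, assuming placeholder workspace details for the MLClient.

```python
from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential

ml_client = MLClient(
    DefaultAzureCredential(),
    subscription_id="<subscription-id>",
    resource_group_name="<resource-group>",
    workspace_name="<workspace-name>",
)

cpu_compute_target = "cpu-node-2"
cpu_cluster = ml_client.compute.get(cpu_compute_target)  # retrieves the existing cluster
print(cpu_cluster.name, cpu_cluster.type)
```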

cpu_compute_target = "gpu-node-3" cpu_cluster = ml_client.compute.get(cpu_compute_target) is incorrect because it uses a GPU node name instead of the required CPU cluster name so it would target the wrong resource.

cpu_cluster = ml_client.compute.get(cpu_compute_target) cpu_compute_target = "cpu-node-2" is incorrect because it calls get before the variable is defined, which raises a NameError since cpu_compute_target is referenced before it is assigned rather than being set first and then passed to the call.

cpu_compute_target = "cpu-node-2" is incorrect because it only assigns the name and does not call ml_client.compute.get to retrieve or attach to the compute cluster.

When choosing code snippet answers confirm the exact resource name and the order of operations so the SDK call receives the variable after it is set.

You are a machine learning engineer using the Azure Machine Learning SDK for Python v1 together with notebook based workflows to train models. You have already created a compute target, built an environment, and written a Python training script. Which SDK class should you create to package the script with its environment and submit the training job to the compute target?

  • ✓ D. ScriptRunConfig

The correct answer is ScriptRunConfig.

ScriptRunConfig in the Azure Machine Learning SDK v1 packages your training script, the environment you built, and the compute target into a single configuration object. You create a ScriptRunConfig with parameters like source_directory, script, compute_target, and environment and then pass it to Experiment.submit to run the job on the target compute.
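
A minimal SDK v1 sketch of that flow might look like the following, where the source directory, script name, compute target name, and environment name are assumptions for illustration.

```python
# SDK v1 sketch; the folder, script, compute name, and environment name are placeholders.
from azureml.core import Environment, Experiment, ScriptRunConfig, Workspace

ws = Workspace.from_config()
env = Environment.get(ws, name="my-training-env")  # an environment you built earlier

config = ScriptRunConfig(
    source_directory="./src",       # folder containing the training script
    script="train.py",              # the script to run
    compute_target="cpu-cluster",   # the compute target you created
    environment=env,
)

run = Experiment(ws, "train-regression").submit(config)  # returns a Run object
run.wait_for_completion(show_output=True)
```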

RunConfiguration holds run level settings such as the environment and target configuration and it can shape how a run executes, but it is not the object you pass to Experiment.submit to package and submit a script. It is often combined with a ScriptRunConfig but it does not itself represent the runnable job configuration.

ScriptRun is not the SDK v1 class used to define and submit a training job configuration, and it is not the object that packages the script, environment, and compute for submission.

Run represents a submitted and running or completed job and is returned by Experiment.submit. You do not instantiate a Run to submit work because it is the result of submitting a configuration such as a ScriptRunConfig.

When the question asks which class packages a script, its environment, and a compute target in Azure ML SDK v1 remember to use ScriptRunConfig and submit it via an Experiment.

QuantumWorks is an advanced dimensional engineering firm based in Marston and it functions as a division of Helixium Inc under the leadership of Daniel Cross. Daniel manages a team of data scientists that includes Ana Velasquez who is a top practitioner at QuantumWorks. Ana must retrieve a dataset from a publicly accessible repository hosted on example.com and load it into a Jupyter notebook inside an Azure Machine Learning workspace for quick experimentation. Which protocol should Ana use to fetch the data in her Jupyter notebook?

  • ✓ D. http(s)

The correct option is http(s).

http(s) is appropriate because the dataset is hosted on a publicly accessible website and web servers expose files over HTTP or HTTPS. Ana can fetch the file directly inside a Jupyter notebook by using standard tools and libraries such as requests, urllib, curl, or by using pandas functions that accept a URL. This approach does not require Azure specific datastores or authentication when the resource is public.
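
For a quick check, something like the following works directly in a notebook cell. The URL is a placeholder for the public dataset.

```python
import pandas as pd

url = "https://example.com/datasets/measurements.csv"  # placeholder public file
df = pd.read_csv(url)  # pandas accepts http(s) URLs directly
df.head()
```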

git is not correct because git is a version control protocol and it only retrieves repositories. It would be applicable only if the data were stored inside a git repository and Ana intended to clone that repository rather than download a file from a public website.

azureml is not correct because that scheme refers to Azure Machine Learning resource identifiers and service constructs rather than a general web transfer protocol. It is used inside Azure ML for referencing assets and datasets but it is not the protocol used to fetch a file from an arbitrary public URL on the internet.

abfss is not correct because that scheme is for Azure Data Lake Storage Gen2 access and it requires an Azure storage account endpoint and proper authentication. It is not used to download files from an external public website such as example.com.

When a file is publicly hosted try a quick requests.get() or pandas.read_csv() call in the notebook to verify the http or https URL works before creating an Azure ML datastore or dataset.

Harrison Realty Group is a large property firm with holdings across multiple markets and they contracted you to create a machine learning experiment in Azure Machine Learning Designer to predict which houses will sell within the next 45 days. You must choose a classification algorithm that can learn complex non linear relationships among features and generalize well to unseen listings. Which algorithm should you select to model non linear relationships effectively?

  • ✓ B. Boosted Decision Tree classifier for two classes

Boosted Decision Tree classifier for two classes is the correct choice for this task.

This classifier uses an ensemble of decision trees built by boosting, which allows it to learn complex non linear interactions among features while reducing bias and improving generalization across unseen listings. It works well with mixed numeric and categorical inputs and often requires less manual feature transformation than linear models, which makes it a strong fit for predicting whether a property will sell within a given timeframe.
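
The Designer module is configured visually, but the same idea can be sketched in code with a gradient boosted tree classifier. The scikit-learn analogue below uses synthetic data and is not the Designer module itself.

```python
# A scikit-learn analogue on synthetic data, not the Designer module itself.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=12, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = GradientBoostingClassifier(n_estimators=200, learning_rate=0.1, max_depth=3)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))  # boosted trees capture non linear feature interactions
```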

Two Class Support Vector Machine is not selected because, while SVMs can model non linear patterns with kernels, they tend to scale poorly to very large datasets and require careful kernel and parameter tuning, which can make them less practical for large, diverse property portfolios.

Two Class Logistic Regression is not appropriate because it is fundamentally a linear classifier and cannot capture complex non linear relationships unless you explicitly create nonlinear features first.

Linear Regression Model is not suitable because it predicts a continuous numeric value rather than a binary class, so it does not directly address a classification problem like whether a house will sell within 45 days.

When a question asks for modeling complex non linear relationships in tabular data think of ensemble tree methods such as boosted trees or random forests rather than linear models.

When converting an exploratory Jupyter notebook into a reproducible production training script what characteristics typically set script based training apart from notebook based development? (Choose 3)

  • ✓ B. Scripts are written for automated repeatable training workflows that prioritize consistency and efficiency

  • ✓ D. Script files execute code sequentially when invoked which provides a deterministic and controlled execution flow compared to interactive notebooks

  • ✓ E. Scripts are generally lean and focus on production code rather than extensive visualizations or experimental notes

The correct options are Scripts are written for automated repeatable training workflows that prioritize consistency and efficiency, Script files execute code sequentially when invoked which provides a deterministic and controlled execution flow compared to interactive notebooks, and Scripts are generally lean and focus on production code rather than extensive visualizations or experimental notes.

Scripts are written to be run in automated pipelines and production environments and they are designed to accept arguments, handle dependencies, and integrate with CI and scheduling systems. This emphasis on repeatability and efficiency makes scripts suitable for production training jobs and for deployment to services like Vertex AI after packaging and testing.

Script files execute top to bottom when run which provides a deterministic and controlled execution flow. This contrasts with interactive notebooks where cell order and hidden state can lead to nonreproducible results. The sequential execution model makes it easier to test, debug, and validate training runs and to reproduce results reliably.

Scripts are typically lean and focused on production concerns such as data ingestion, model training loops, logging, and saving artifacts. They usually omit extensive exploratory visualizations and narrative notes that are common in notebooks. This focus reduces runtime overhead and simplifies maintenance and reuse in production pipelines.
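
A stripped down, parameter driven script might look like the sketch below. The label column name and the hyperparameter are placeholders, and the point is the structure rather than the model.

```python
# train.py - a lean, parameter driven sketch; the "label" column name is a placeholder
import argparse

import joblib
import pandas as pd
from sklearn.linear_model import LogisticRegression


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--data-path", required=True)
    parser.add_argument("--output-path", default="model.joblib")
    parser.add_argument("--reg-strength", type=float, default=1.0)
    args = parser.parse_args()

    df = pd.read_csv(args.data_path)                # data ingestion
    X, y = df.drop(columns=["label"]), df["label"]  # "label" is an assumed column

    model = LogisticRegression(C=args.reg_strength, max_iter=1000).fit(X, y)
    joblib.dump(model, args.output_path)            # save the trained artifact


if __name__ == "__main__":
    main()
```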

Scripts are deployed to Vertex AI with no modifications is incorrect because most scripts require packaging, dependency specification, and an appropriate entry point or containerization before they can run on Vertex AI. Deployment to a managed training service normally involves some adaptation to meet runtime and environment requirements.

Scripts are primarily intended for ad hoc exploratory data analysis is incorrect because exploratory analysis is the main purpose of notebooks. Scripts are intended for repeatable, automated, and maintainable workflows rather than for interactive, ad hoc investigation.

When converting a notebook make the code parameter driven and ensure it runs top to bottom so you can test it locally and then run it reliably in CI or on Vertex AI.

At Acme Analytics you run a cloud based machine learning pipeline that uses the Permutation Feature Importance module. When selecting evaluation metrics to assess model performance which metrics can be chosen? (Choose 2)

  • ✓ C. Precision

  • ✓ D. Accuracy

The correct options are Precision and Accuracy.

Permutation Feature Importance evaluates how much a chosen evaluation metric changes when a feature is randomly permuted. It therefore needs a metric that can be calculated from the model predictions on individual examples. Precision and accuracy both compare predicted labels to true labels on a per example basis, so they are appropriate metrics to use with permutation importance.
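
Outside the pipeline module, the same idea can be sketched with scikit-learn's permutation_importance, here scored with accuracy and precision on synthetic data.

```python
# scikit-learn analogue of the idea; synthetic data and a simple model for illustration.
from sklearn.datasets import make_classification
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

for metric in ("accuracy", "precision"):
    result = permutation_importance(
        model, X_test, y_test, scoring=metric, n_repeats=5, random_state=0
    )
    print(metric, result.importances_mean.round(3))
```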

AUC ROC is incorrect because it is a ranking based summary across thresholds rather than a simple per example prediction metric. Permutation approaches typically require pointwise metrics and do not generally use ranking metrics like AUC.

Chi squared statistic is incorrect because it is a statistical test often used for assessing association or for feature selection, and it is not a direct measure of predictive performance computed from model predictions in the way accuracy or precision are.

Remember to pick evaluation metrics that are computed from model predictions on individual examples when dealing with permutation importance. Metrics such as precision and accuracy are usually valid choices.

Jira, Scrum & AI Certification

Want to get certified on the most popular software development technologies of the day? These resources will help you get Jira certified, Scrum certified and even AI Practitioner certified so your resume really stands out.

You can even get certified in the latest AI, ML and DevOps technologies. Advance your career today.

Cameron McKenzie is an AWS Certified AI Practitioner, Machine Learning Engineer, Copilot Expert, Solutions Architect and author of many popular books in the software development and Cloud Computing space. His growing YouTube channel training devs in Java, Spring, AI and ML has well over 30,000 subscribers.