Which strategy is the MOST EFFECTIVE for your ML training job while minimizing cost and ensuring the job completes successfully?

Exam: MLA-C01

You are an ML engineer at a data analytics company tasked with training a deep learning model on a large, computationally intensive dataset. The training job can tolerate interruptions and is expected to run for several hours or even days, depending on the available compute resources. The company has a limited budget for cloud infrastructure, so you need to minimize costs as much as possible.

Which strategy is the MOST EFFECTIVE for your ML training job while minimizing cost and ensuring the job completes successfully?
A . Start the training job using only Spot Instances to minimize cost, and switch to On-Demand instances manually if any Spot Instances are interrupted during training
B . Use Amazon SageMaker Managed Spot Training to dynamically allocate Spot Instances for the training job, automatically retrying any interrupted instances via checkpoints
C . Deploy the training job on a fixed number of On-Demand EC2 instances to ensure stability, and manually add Spot Instances as needed to speed up the job during off-peak hours
D . Use Amazon EC2 Auto Scaling to automatically add Spot Instances to the training job based on demand, and configure the job to continue processing even if some Spot Instances are interrupted

Answer: B

Explanation:

Correct option:

Use Amazon SageMaker Managed Spot Training to dynamically allocate Spot Instances for the training job, automatically retrying any interrupted instances via checkpoints

Managed Spot Training runs training jobs on Amazon EC2 Spot Instances instead of On-Demand Instances. You choose which training jobs use Spot Instances and set a stopping condition that specifies how long SageMaker waits for a job to run on Spot capacity. Spot Instances can be interrupted, so jobs may take longer to start or finish. You can configure a Managed Spot Training job to use checkpoints: SageMaker copies checkpoint data from a local path to Amazon S3, and when the job is restarted, SageMaker copies the data from Amazon S3 back into the local path. The training job can then resume from the last checkpoint instead of starting over.

Source: https://aws.amazon.com/blogs/aws/managed-spot-training-save-up-to-90-on-your-amazon-sagemaker-training-jobs/
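The settings described above map onto a handful of keyword arguments of the SageMaker Python SDK's Estimator (use_spot_instances, max_run, max_wait, checkpoint_s3_uri, checkpoint_local_path). The following is a minimal sketch, not a complete training setup: the helper managed_spot_kwargs and the bucket name are hypothetical, and the block deliberately does not import the SDK so that it runs anywhere.

```python
from typing import Any, Dict


def managed_spot_kwargs(
    checkpoint_s3_uri: str,
    max_run_seconds: int,
    max_wait_seconds: int,
    checkpoint_local_path: str = "/opt/ml/checkpoints",
) -> Dict[str, Any]:
    """Build the Spot-related keyword arguments for a SageMaker Estimator.

    Hypothetical helper for illustration; only the dictionary keys are
    real Estimator parameter names.
    """
    # max_wait covers time spent waiting for Spot capacity plus training
    # time, so it must be at least as large as max_run.
    if max_wait_seconds < max_run_seconds:
        raise ValueError("max_wait must be greater than or equal to max_run")
    return {
        "use_spot_instances": True,
        "max_run": max_run_seconds,
        "max_wait": max_wait_seconds,
        "checkpoint_s3_uri": checkpoint_s3_uri,
        "checkpoint_local_path": checkpoint_local_path,
    }


spot_kwargs = managed_spot_kwargs(
    checkpoint_s3_uri="s3://example-bucket/checkpoints/my-job",  # hypothetical bucket
    max_run_seconds=24 * 3600,   # allow up to one day of training
    max_wait_seconds=48 * 3600,  # allow up to two days including waiting for capacity
)
# With the SDK installed, these would be passed together with the usual
# required arguments (image, role, instance type and count):
# estimator = sagemaker.estimator.Estimator(..., **spot_kwargs)
```

Setting max_wait well above max_run gives SageMaker room to wait for Spot capacity to return after an interruption; the exact values here are placeholders.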

Incorrect options:

Use Amazon EC2 Auto Scaling to automatically add Spot Instances to the training job based on demand, and configure the job to continue processing even if some Spot Instances are interrupted – Amazon EC2 Auto Scaling can add Spot Instances based on demand, but it only manages instance capacity. It does not checkpoint training state or resume an interrupted training job, so you would have to build that handling yourself. SageMaker Managed Spot Training provides this for ML training workloads without extra tooling.

Deploy the training job on a fixed number of On-Demand EC2 instances to ensure stability, and manually add Spot Instances as needed to speed up the job during off-peak hours – A fixed number of On-Demand EC2 instances provides stability, but that baseline is billed at the full On-Demand price, and manually adding Spot Instances adds operational complexity. For a job that tolerates interruptions, this does not minimize cost. Automating the process with SageMaker is more efficient.

Start the training job using only Spot Instances to minimize cost, and switch to On-Demand instances manually if any Spot Instances are interrupted during training – Starting with only Spot Instances minimizes costs, but switching to On-Demand instances manually requires someone to monitor the job and react, which risks delays whenever Spot capacity becomes unavailable, and the On-Demand fallback raises cost. SageMaker Managed Spot Training offers a more reliable and automated solution.
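Whichever option is chosen, resuming after an interruption depends on the training script itself writing checkpoints and reloading the latest one at startup; SageMaker only synchronizes the checkpoint directory with Amazon S3. The toy loop below is a sketch of that pattern under stated assumptions: the train function, the checkpoint.json file name, and the stop_after parameter (which simulates a Spot interruption) are all hypothetical, and a real job would save model and optimizer state rather than just an epoch counter.

```python
import json
import os
import tempfile
from typing import Optional


def train(checkpoint_dir: str, total_epochs: int, stop_after: Optional[int] = None) -> int:
    """Run or resume a toy training loop; return the epochs run in this call."""
    path = os.path.join(checkpoint_dir, "checkpoint.json")

    # Resume from the last saved epoch if a checkpoint exists.
    start_epoch = 0
    if os.path.exists(path):
        with open(path) as f:
            start_epoch = json.load(f)["epoch"]

    epochs_run = 0
    for epoch in range(start_epoch, total_epochs):
        if stop_after is not None and epochs_run == stop_after:
            break  # simulated Spot interruption
        # ... one epoch of real training would go here ...
        epochs_run += 1

        # Write to a temporary file and rename it, so an interruption
        # during the write cannot leave a half-written checkpoint.
        tmp_path = path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump({"epoch": epoch + 1}, f)
        os.replace(tmp_path, path)
    return epochs_run


with tempfile.TemporaryDirectory() as ckpt_dir:
    first = train(ckpt_dir, total_epochs=10, stop_after=4)  # interrupted after 4 epochs
    second = train(ckpt_dir, total_epochs=10)               # picks up at epoch 4
    print(first, second)  # prints "4 6"
```

In a SageMaker job, checkpoint_dir would typically be the local checkpoint path (by default /opt/ml/checkpoints), so the second run only repeats the work done since the last saved checkpoint.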

References:

https://docs.aws.amazon.com/sagemaker/latest/dg/model-managed-spot-training.html

https://aws.amazon.com/blogs/aws/managed-spot-training-save-up-to-90-on-your-amazon-sagemaker-training-jobs/