Which approach allows the Specialist to use all the data to train the model?

exams MLS-C01 V2 MLS-C01 exam 0 Comments

A Machine Learning Specialist is developing a custom video recommendation model for an application. The dataset used to train this model is very large with millions of data points and is hosted in an Amazon S3 bucket. The Specialist wants to avoid loading all of this data onto an Amazon SageMaker notebook instance because it would take hours to move and will exceed the attached 5 GB Amazon EBS volume on the notebook instance.

Which approach allows the Specialist to use all the data to train the model?
A . Load a smaller subset of the data into the SageMaker notebook and train locally. Confirm that the training
code is executing and the model parameters seem reasonable. Initiate a SageMaker training job using the full dataset from the S3 bucket using Pipe input mode.
B . Launch an Amazon EC2 instance with an AWS Deep Learning AMI and attach the S3 bucket to the instance. Train on a small amount of the data to verify the training code and hyperparameters. Go back to Amazon SageMaker and train using the full dataset
C . Use AWS Glue to train a model using a small subset of the data to confirm that the data will be compatible with Amazon SageMaker. Initiate a SageMaker training job using the full dataset from the S3 bucket using Pipe input mode.
D . Load a smaller subset of the data into the SageMaker notebook and train locally. Confirm that the training code is executing and the model parameters seem reasonable. Launch an Amazon EC2 instance with an AWS Deep Learning AMI and attach the S3 bucket to train the full dataset.

Answer: A

Explanation:

Pipe input mode is a feature of Amazon SageMaker that allows streaming large datasets from Amazon S3 directly to the training algorithm without downloading them to the local disk. This reduces the startup time, disk space, and cost of training jobs. Pipe input mode is supported by most of the built-in algorithms and can also be used with custom training algorithms. To use Pipe input mode, the data needs to be in a binary format such as protobuf recordIO or TFRecord. The training code needs to use the PipeModeDataset class to read the data from the named pipe provided by SageMaker. To verify that the training code and the model parameters are working as expected, it is recommended to train locally on a smaller subset of the data before launching a full-scale training job on SageMaker. This approach is faster and more efficient than the other options, which involve either downloading the full dataset to an EC2 instance or using AWS Glue, which is not designed for training machine learning models.

References:

Using Pipe input mode for Amazon SageMaker algorithms Using Pipe Mode with Your Own Algorithms PipeModeDataset Class