Given the size and nature of the dataset, which SageMaker input mode and AWS Cloud Storage configuration is the MOST SUITABLE for this use case?

exams MLA-C01 MLA-C01 exam 0 Comments

You are a data scientist at a healthcare company developing a machine learning model to analyze medical imaging data, such as X-rays and MRIs, for disease detection. The dataset consists of 10 million high-resolution images stored in Amazon S3, amounting to several terabytes of data. The training process requires processing these images efficiently to avoid delays due to I/O bottlenecks, and you must ensure that the chosen data access method aligns with the large dataset size and the high throughput requirements of the model.

Given the size and nature of the dataset, which SageMaker input mode and AWS Cloud Storage configuration is the MOST SUITABLE for this use case?
A . Select the Pipe input mode to stream the data directly from Amazon S3 to the training instances, allowing the model to start processing data immediately without requiring local storage for the entire dataset
B . Use the File input mode with EFS (Amazon Elastic File System) to mount the dataset across multiple instances, ensuring data is shared and accessible during distributed training
C . Implement the FastFile input mode with FSx for Lustre, to enable on-demand streaming of data chunks from Amazon S3 with low latency and high throughput
D . Use the File input mode to download the entire dataset from Amazon S3 to the training instances’
local storage before starting the training process, ensuring that all data is available locally during training

Answer: A

Explanation:

Correct option:

Select the Pipe input mode to stream the data directly from Amazon S3 to the training instances, allowing the model to start processing data immediately without requiring local storage for the entire dataset

In pipe mode, data is pre-fetched from Amazon S3 at high concurrency and throughput, and streamed into a named pipe, which also known as a First-In-First-Out (FIFO) pipe for its behavior. Each pipe may only be read by a single process.

Pipe input mode is designed for large datasets, allowing data to be streamed directly from Amazon S3 into the training instances. This minimizes disk usage and allows training to begin immediately as the data streams in, making it ideal for your scenario where high throughput and efficiency are critical.

via – https://docs.aws.amazon.com/sagemaker/latest/dg/model-access-training-data.html

Incorrect options:

Use the File input mode to download the entire dataset from Amazon S3 to the training instances’ local storage before starting the training process, ensuring that all data is available locally during training – The File input mode downloads the entire dataset to the training instance before starting the training job. For a dataset as large as yours, this would lead to significant delays and require large amounts of local storage, which is not optimal for efficiency or cost.

Implement the FastFile input mode with FSx for Lustre, to enable on-demand streaming of data chunks from Amazon S3 with low latency and high throughput – FastFile mode is useful for scenarios where you need rapid access to data with low latency, but it is best suited for workloads with many small files. You should note that FastFile mode can be used only while accessing data from Amazon S3 and not with Amazon FSx for Lustre. So, this option acts as a distractor.

Use the File input mode with EFS (Amazon Elastic File System) to mount the dataset across multiple instances, ensuring data is shared and accessible during distributed training – Using Amazon EFS for the given use case requires transferring the medical imaging data from Amazon S3 into Amazon EFS, which leads to unnecessary data transfer as well as data storage costs. So, this option is ruled out.

References:

https://docs.aws.amazon.com/sagemaker/latest/dg/model-access-training-data.html

https://aws.amazon.com/about-aws/whats-new/2021/10/amazon-sagemaker-fast-file-mode/