Which storage scheme is MOST appropriate for this scenario?

MLS-C01 exam

A Data Science team is designing a dataset repository where it will store a large amount of training data commonly used in its machine learning models. Because Data Scientists may create an arbitrary number of new datasets every day, the solution must scale automatically and be cost-effective. It must also be possible to explore the data using SQL.

Which storage scheme is MOST appropriate for this scenario?
A. Store datasets as files in Amazon S3.
B. Store datasets as files in an Amazon EBS volume attached to an Amazon EC2 instance.
C. Store datasets as tables in a multi-node Amazon Redshift cluster.
D. Store datasets as global tables in Amazon DynamoDB.

Answer: A

Explanation:

The best storage scheme for this scenario is to store datasets as files in Amazon S3. Amazon S3 is a scalable, cost-effective, and durable object storage service that can store any amount and type of data. Amazon S3 also supports querying data using SQL with Amazon Athena, a serverless interactive query service that can analyze data directly in S3. This way, the Data Science team can easily explore and analyze their datasets without having to load them into a database or a compute instance. The other options are not as suitable for this scenario because:

Storing datasets as files in an Amazon EBS volume attached to an Amazon EC2 instance would limit the scalability and availability of the data: an EBS volume can only be attached to instances in its own Availability Zone, must be provisioned with a fixed size in advance, and is capped at 16 TiB for most volume types. EBS storage also costs more per GB than S3, requires provisioning and managing EC2 instances, and offers no built-in way to query the files with SQL.

Storing datasets as tables in a multi-node Amazon Redshift cluster would incur higher costs and complexity than using S3 and Athena. Amazon Redshift is a data warehouse service that is optimized for analytical queries over structured or semi-structured data. However, it requires setting up and maintaining a cluster of nodes, loading data into tables, and choosing the right distribution and sort keys for optimal performance. Moreover, a provisioned Redshift cluster is billed for its nodes for as long as it runs, whether or not queries are executing, while S3 and Athena only charge for the amount of data stored and scanned, respectively.
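The billing difference described above can be sketched with some simple arithmetic. This is a rough model only: the function names are hypothetical, the rates in the example are illustrative placeholders rather than actual AWS prices, and it ignores details such as cluster storage charges, request fees, and reserved pricing.

```python
HOURS_PER_MONTH = 730  # common approximation for an average month


def cluster_monthly_cost(nodes, price_per_node_hour):
    """Always-on cluster: every node is billed for every hour, queried or not."""
    return nodes * price_per_node_hour * HOURS_PER_MONTH


def serverless_monthly_cost(stored_tb, price_per_tb_month,
                            scanned_tb, price_per_tb_scanned):
    """Pay-per-use model: storage is billed per TB-month, queries per TB scanned."""
    return stored_tb * price_per_tb_month + scanned_tb * price_per_tb_scanned


# Placeholder rates, chosen only to make the arithmetic easy to follow.
always_on = cluster_monthly_cost(nodes=2, price_per_node_hour=1.0)
pay_per_use = serverless_monthly_cost(stored_tb=10, price_per_tb_month=25.0,
                                      scanned_tb=4, price_per_tb_scanned=5.0)
print(f"always-on cluster: {always_on:.2f}")
print(f"storage + scans:   {pay_per_use:.2f}")
```

With occasional, exploratory queries the scanned volume stays small, so the pay-per-use total is dominated by storage, while the cluster cost is the same regardless of how much it is used.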

Storing datasets as global tables in Amazon DynamoDB is a poor fit for large training datasets. DynamoDB is a key-value and document database service designed for fast, key-based access at any scale, not for bulk file storage or analytical scans. Each item is limited to 400 KB, so datasets would have to be split into many small items, and storage costs more per GB than S3; global tables add further cost by replicating the data across Regions. DynamoDB also does not support general SQL queries: PartiQL offers SQL-compatible syntax for item-level operations but no joins or aggregations, so exploring the data with SQL would require an additional service such as Amazon EMR or AWS Glue.
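As an illustration of the S3 plus Athena approach from answer A, the following is a minimal sketch that assumes the datasets are CSV files with a header row stored under an S3 prefix. The bucket name, table name, column list, and the helper names `external_table_ddl` and `run_query` are all hypothetical. `run_query` uses the boto3 Athena client and needs boto3 installed plus valid AWS credentials, so it is defined but not called here.

```python
def external_table_ddl(table, columns, s3_location):
    """Build an Athena CREATE EXTERNAL TABLE statement for CSV files in S3.

    `columns` maps column names to Athena/Hive SQL types.
    """
    cols = ",\n  ".join(f"`{name}` {sql_type}" for name, sql_type in columns.items())
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {table} (\n  {cols}\n)\n"
        "ROW FORMAT DELIMITED FIELDS TERMINATED BY ','\n"
        f"LOCATION '{s3_location}'\n"
        "TBLPROPERTIES ('skip.header.line.count'='1')"
    )


def run_query(sql, database, output_location):
    """Submit a query to Athena and return its execution ID.

    Requires boto3 and AWS credentials; imported lazily so the DDL helper
    above can be used without AWS access.
    """
    import boto3

    athena = boto3.client("athena")
    response = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )
    return response["QueryExecutionId"]


# Hypothetical dataset: the bucket, prefix, and columns are placeholders.
ddl = external_table_ddl(
    "iris",
    {"sepal_length": "double", "species": "string"},
    "s3://example-bucket/datasets/iris/",
)
print(ddl)
```

Once the table is registered, an exploratory query such as `SELECT species, COUNT(*) FROM iris GROUP BY species` could be submitted through `run_query`. The files stay in S3 throughout, and new datasets only need a new prefix and table definition.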

References:

Amazon S3 – Cloud Object Storage

Amazon Athena – Interactive SQL Queries for Data in Amazon S3

Amazon EBS – Amazon Elastic Block Store (EBS)

Amazon Redshift – Data Warehouse Solution – AWS

Amazon DynamoDB – NoSQL Cloud Database Service