You have been asked to develop an input pipeline for an ML training model that processes images from disparate sources at low latency. You discover that your input data does not fit in memory.

How should you create a dataset following Google-recommended best practices?
A. Create a tf.data.Dataset.prefetch transformation.
B. Convert the images to tf.Tensor objects, and then run tf.data.Dataset.from_tensor_slices().
C. Convert the images to tf.Tensor objects, and then run tf.data.Dataset.from_tensors().
D. Convert the images into TFRecords, store the images in Cloud Storage, and then use the tf.data API to read the images for training.

Answer: D

Explanation:

Cited from the Google documentation: to construct a Dataset from data that fits in memory, use tf.data.Dataset.from_tensors() or tf.data.Dataset.from_tensor_slices(). When input data is stored in files, ideally in the recommended TFRecord format, use tf.data.TFRecordDataset(). In other words, from_tensors() and from_tensor_slices() are for in-memory data, while tf.data.TFRecordDataset is for data read from storage. Because the images here do not fit in memory, the recommended practice is to convert them to TFRecords, store them in Cloud Storage, and read them with the tf.data API.

https://cloud.google.com/architecture/ml-on-gcp-best-practices#store-image-video-audio-and-unstructured-data-on-cloud-storage
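As a rough sketch of what option D looks like in practice (the bucket path, shard naming, and feature names below are assumptions for illustration, not part of the question):

```python
import tensorflow as tf

# Hypothetical GCS path to sharded TFRecord files; replace with your own bucket.
filenames = tf.io.gfile.glob("gs://my-bucket/images/train-*.tfrecord")

# Assumed feature spec: each record holds a JPEG-encoded image and an integer label.
feature_spec = {
    "image": tf.io.FixedLenFeature([], tf.string),
    "label": tf.io.FixedLenFeature([], tf.int64),
}

def parse_example(serialized):
    example = tf.io.parse_single_example(serialized, feature_spec)
    image = tf.io.decode_jpeg(example["image"], channels=3)
    return image, example["label"]

dataset = (
    tf.data.TFRecordDataset(filenames, num_parallel_reads=tf.data.AUTOTUNE)
    .map(parse_example, num_parallel_calls=tf.data.AUTOTUNE)
    .shuffle(1024)
    .batch(32)
    # Option A's prefetch is a useful final step in the pipeline,
    # but it is not a substitute for reading TFRecords from Cloud Storage.
    .prefetch(tf.data.AUTOTUNE)
)
```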

" Store image, video, audio and unstructured data on Cloud Storage Store these data in large container formats on Cloud Storage. This applies to sharded TFRecord files if you’re using TensorFlow, or Avro files if you’re using any other framework. Combine many individual images, videos, or audio clips into large files, as this will improve your read and write throughput to Cloud Storage. Aim for files of at least 100mb, and between 100 and 10,000 shards. To enable data management, use Cloud Storage buckets and directories to group the shards. "
