Google Professional Machine Learning Engineer Online Training

Question #91

You are creating a deep neural network classification model using a dataset with categorical input values. Certain columns have a cardinality greater than 10,000 unique values.

How should you encode these categorical values as input into the model?

A . Convert each categorical value into an integer value.
B . Convert the categorical string data to one-hot hash buckets.
C . Map the categorical variables into a vector of boolean values.
D . Convert each categorical value into a run-length encoded string.


Question #92

You need to train a natural language model to perform text classification on product descriptions that contain millions of examples and 100,000 unique words. You want to preprocess the words individually so that they can be fed into a recurrent neural network.

What should you do?

A . Create a one-hot encoding of words, and feed the encodings into your model.
B . Identify word embeddings from a pre-trained model, and use the embeddings in your model.
C . Sort the words by frequency of occurrence, and use the frequencies as the encodings in your model.
D . Assign a numerical value to each word from 1 to 100,000 and feed the values as inputs in your model.


Correct Answer: B

Explanation:

Option A is incorrect because one-hot encoding each word and feeding the encodings into your model is not an efficient way to preprocess individual words for a natural language model. One-hot encoding represents a categorical variable as a binary vector in which each element corresponds to one category, exactly one element is 1, and the rest are 0 [1]. With a vocabulary of 100,000 words, every word becomes a sparse 100,000-dimensional vector, which requires a lot of memory and computation and does not capture the semantic similarity or relationship between words [2].
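The cost and the lack of similarity information can be seen in a minimal sketch. The four-word vocabulary and the sequence length of 50 are assumptions chosen for illustration; only the 100,000-word vocabulary size comes from the question.

```python
def one_hot(index: int, vocab_size: int) -> list:
    """Return a vector of length vocab_size with a single 1 at position index."""
    if not 0 <= index < vocab_size:
        raise ValueError("index out of range")
    vec = [0] * vocab_size
    vec[index] = 1
    return vec


def dot(a: list, b: list) -> int:
    """Dot product, used here as a crude similarity measure."""
    return sum(x * y for x, y in zip(a, b))


# Hypothetical toy vocabulary; the question's vocabulary has 100,000 words.
vocab = ["apple", "banana", "orange", "laptop"]
word_to_index = {word: i for i, word in enumerate(vocab)}

apple = one_hot(word_to_index["apple"], len(vocab))
banana = one_hot(word_to_index["banana"], len(vocab))
laptop = one_hot(word_to_index["laptop"], len(vocab))

# Any two distinct words have dot product 0, so "apple" looks no closer
# to "banana" than it does to "laptop".
print(dot(apple, banana), dot(apple, laptop))  # 0 0

# Size arithmetic: one vocabulary-sized vector per word in the sequence.
VOCAB_SIZE = 100_000   # from the question
SEQUENCE_LENGTH = 50   # assumed number of words per product description
values_per_example = VOCAB_SIZE * SEQUENCE_LENGTH
print(values_per_example)  # 5000000
```

Under these assumptions a single 50-word description already expands to five million values, almost all of them zero.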

Option B is correct because identifying word embeddings from a pre-trained model and using them in your model is a good way to preprocess individual words for a natural language model. Word embeddings are low-dimensional, dense vectors that represent the meaning and usage of words in a continuous space [3]. They can be learned from a large text corpus with methods such as word2vec and GloVe, or with contextual models such as BERT [4]. Using pre-trained word embeddings can save time and resources and can improve the performance of the natural language model, especially when the training data is limited or noisy [5].
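As a rough illustration of why dense vectors help, the sketch below looks words up in a small embedding table and compares them with cosine similarity. The three-dimensional vectors are made up for this example and do not come from any real pre-trained model, whose vectors would have many more dimensions.

```python
import math

# Made-up vectors for illustration only (not from word2vec, GloVe, or BERT).
toy_embeddings = {
    "apple":  [0.9, 0.8, 0.1],
    "banana": [0.8, 0.9, 0.2],
    "laptop": [0.1, 0.2, 0.9],
}
EMBEDDING_DIM = 3


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def embed(words, table, dim=EMBEDDING_DIM):
    """Map each word to its vector; unknown words get a zero vector."""
    return [table.get(word, [0.0] * dim) for word in words]


fruit_sim = cosine_similarity(toy_embeddings["apple"], toy_embeddings["banana"])
device_sim = cosine_similarity(toy_embeddings["apple"], toy_embeddings["laptop"])

# With these toy values, the two fruits are closer to each other than
# a fruit is to a device.
print(fruit_sim > device_sim)  # True

# A sequence of words becomes a sequence of short dense vectors that a
# recurrent network could consume one step at a time.
sequence = embed(["apple", "laptop", "unknownword"], toy_embeddings)
print(len(sequence), len(sequence[0]))  # 3 3
```

In a real model, the table would be loaded from the pre-trained embeddings and each word would cost only as many values as the embedding dimension, not one value per vocabulary entry.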

Option C is incorrect because sorting the words by frequency of occurrence, and using the frequencies as the encodings in your model is not a meaningful way to preprocess the words individually for a natural language model. This method implies that the frequency of a word is a good indicator of its importance or relevance, which may not be true. For example, the word “the” is very frequent but not very informative, while the word “unicorn” is rare but more distinctive. Moreover, this method does not capture the semantic similarity or relationship between words, and may introduce noise or bias into the model.
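A small sketch of this problem, using an invented ten-word corpus: when a word's count is used as its encoding, unrelated words that happen to occur equally often become indistinguishable.

```python
from collections import Counter

# Invented toy corpus for illustration.
corpus = "the red apple and the green apple and the laptop".split()
counts = Counter(corpus)

# Encode every word by its frequency in the corpus.
encoded = [counts[word] for word in corpus]
print(encoded)  # [3, 1, 2, 2, 3, 1, 2, 2, 3, 1]

# Group words by their encoding to expose the collisions.
by_frequency = {}
for word, count in counts.items():
    by_frequency.setdefault(count, []).append(word)

print(sorted(by_frequency[1]))  # ['green', 'laptop', 'red']
print(sorted(by_frequency[2]))  # ['and', 'apple']
```

Here "apple" and "and" receive the same value, as do "red", "green", and "laptop", so the model cannot tell them apart from the encoding alone.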

Option D is incorrect because assigning a numerical value to each word from 1 to 100,000 and feeding the values as inputs in your model is not a valid way to preprocess the words individually for a natural language model. This method implies an ordinal relationship between the words, which may not be true. For example, assigning the values 1, 2, and 3 to the words “apple”, “banana”, and “orange” does not make sense, as there is no inherent order among these fruits. Moreover, this method does not capture the semantic similarity or relationship between words, and may confuse the model with irrelevant or misleading information.
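The arbitrariness of integer IDs can be sketched as follows. Both ID assignments are hypothetical and equally valid, yet they imply opposite "distances" between the same words.

```python
def id_distance(word_to_id, a, b):
    """Absolute difference between two integer word IDs."""
    return abs(word_to_id[a] - word_to_id[b])


# Two equally arbitrary numberings of the same three words.
assignment_1 = {"apple": 1, "banana": 2, "orange": 3}
assignment_2 = {"apple": 1, "orange": 2, "banana": 3}

for name, ids in (("assignment_1", assignment_1), ("assignment_2", assignment_2)):
    print(
        name,
        "apple-banana:", id_distance(ids, "apple", "banana"),
        "apple-orange:", id_distance(ids, "apple", "orange"),
    )
# assignment_1 apple-banana: 1 apple-orange: 2
# assignment_2 apple-banana: 2 apple-orange: 1
```

A model that treats the raw IDs as numeric magnitudes would learn from an ordering that carries no information about the words themselves.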

References:

1. One-hot encoding
2. Word embeddings
3. Word embedding
4. Pre-trained word embeddings
5. Using pre-trained word embeddings in a Keras model
6. Term frequency
7. Term frequency-inverse document frequency
8. Ordinal variable
9. Encoding categorical features

Question #93

Your data science team has requested a system that supports scheduled model retraining, Docker containers, and a service that supports autoscaling and monitoring for online prediction requests.

Which platform components should you choose for this system?

A . Vertex AI Pipelines and App Engine
B . Vertex AI Pipelines, Vertex AI Prediction, and Vertex AI Model Monitoring
C . Cloud Composer, BigQuery ML, and Vertex AI Prediction
D . Cloud Composer, Vertex AI Training with custom containers, and App Engine


Correct Answer: B

Explanation:

Option A is incorrect because Vertex AI Pipelines and App Engine do not meet all the requirements of the system. Vertex AI Pipelines is a service that allows you to create, run, and manage ML workflows built with the Kubeflow Pipelines SDK or TensorFlow Extended (TFX) [1]. App Engine is a service that allows you to build and deploy scalable web applications using the standard or flexible environment [2]. However, App Engine does not support Docker containers in the standard environment, and it does not provide a dedicated service for online prediction and monitoring of ML models [3].

Option B is correct because Vertex AI Pipelines, Vertex AI Prediction, and Vertex AI Model Monitoring together meet all the requirements of the system. Vertex AI Pipelines can run containerized retraining workflows on a schedule. Vertex AI Prediction is a service that allows you to deploy and serve ML models for online or batch prediction, with support for autoscaling and custom containers [4]. Vertex AI Model Monitoring is a service that monitors deployed models for training-serving skew and prediction drift, and sends alerts when it detects anomalies [5].

Option C is incorrect because Cloud Composer, BigQuery ML, and Vertex AI Prediction do not meet all the requirements of the system. Cloud Composer is a service that allows you to create, schedule, and manage workflows using Apache Airflow. BigQuery ML is a service that allows you to create and use ML models within BigQuery using SQL queries. However, BigQuery ML does not support custom containers, and Vertex AI Prediction does not support scheduled model retraining or model monitoring.

Option D is incorrect because Cloud Composer, Vertex AI Training with custom containers, and App Engine do not meet all the requirements of the system. Vertex AI Training is a service that allows you to train ML models using built-in algorithms or custom containers. However, Vertex AI Training does not support online prediction or model monitoring, and App Engine does not support Docker containers in the standard environment or online prediction and monitoring of ML models [3].

References:

1. Vertex AI Pipelines overview
2. App Engine overview
3. Choosing an App Engine environment
4. Vertex AI Prediction overview
5. Vertex AI Model Monitoring overview
6. Cloud Composer overview
7. BigQuery ML overview
8. BigQuery ML limitations
9. Vertex AI Training overview

Question #94

You are profiling the performance of your TensorFlow model training time and notice a performance issue caused by inefficiencies in the input data pipeline for a single 5 terabyte CSV file dataset on Cloud Storage. You need to optimize the input pipeline performance.

Which action should you try first to increase the efficiency of your pipeline?

A . Preprocess the input CSV file into a TFRecord file.
B . Randomly select a 10 gigabyte subset of the data to train your model.
C . Split into multiple CSV files and use a parallel interleave transformation.
D . Set the reshuffle_each_iteration parameter to true in the tf.data.Dataset.shuffle method.
