Which of the following are the operations the Specialist needs to perform to correctly sanitize and prepare the data in a repeatable manner?

exams MLS-C01 V2 MLS-C01 exam 0 Comments

A Machine Learning Specialist is creating a new natural language processing application that processes a dataset comprised of 1 million sentences. The aim is to then run Word2Vec to generate embeddings of the sentences and enable different types of predictions

Here is an example from the dataset

"The quck BROWN FOX jumps over the lazy dog "

Which of the following are the operations the Specialist needs to perform to correctly sanitize and prepare the data in a repeatable manner? (Select THREE)
A . Perform part-of-speech tagging and keep the action verb and the nouns only
B . Normalize all words by making the sentence lowercase
C . Remove stop words using an English stopword dictionary.
D . Correct the typography on "quck" to "quick."
E . One-hot encode all words in the sentence
F . Tokenize the sentence into words.

Answer: B, C, F

Explanation:

To prepare the data for Word2Vec, the Specialist needs to perform some preprocessing steps that can help reduce the noise and complexity of the data, as well as improve the quality of the embeddings.

Some of the common preprocessing steps for Word2Vec are:

Normalizing all words by making the sentence lowercase: This can help reduce the vocabulary size and treat words with different capitalizations as the same word. For example, “Fox” and “fox” should be considered as the same word, not two different words.

Removing stop words using an English stopword dictionary: Stop words are words that are very common and do not carry much semantic meaning, such as “the”, “a”, “and”, etc. Removing them can help focus on the words that are more relevant and informative for the task.

Tokenizing the sentence into words: Tokenization is the process of splitting a sentence into smaller units, such as words or subwords. This is necessary for Word2Vec, as it operates on the word level and requires a list of words as input.

The other options are not necessary or appropriate for Word2Vec:

Performing part-of-speech tagging and keeping the action verb and the nouns only: Part-of-speech tagging is the process of assigning a grammatical category to each word, such as noun, verb, adjective, etc. This can be useful for some natural language processing tasks, but not for Word2Vec, as it can lose some important information and context by discarding other words.

Correcting the typography on “quck” to “quick”: Typo correction can be helpful for some tasks, but not for Word2Vec, as it can introduce errors and inconsistencies in the data. For example, if the typo is intentional or part of a dialect, correcting it can change the meaning or style of the sentence. Moreover, Word2Vec can learn to handle typos and variations in spelling by learning similar embeddings for them.

One-hot encoding all words in the sentence: One-hot encoding is a way of representing words as vectors of 0s and 1s, where only one element is 1 and the rest are 0. The index of the 1 element corresponds to the word’s position in the vocabulary. For example, if the vocabulary is [“cat”, “dog”, “fox”], then “cat” can be encoded as [1, 0, 0], “dog” as [0, 1, 0], and “fox” as [0, 0, 1]. This can be useful for some machine learning models, but not for Word2Vec, as it does not capture the semantic similarity and relationship between words. Word2Vec aims to learn dense and low-dimensional embeddings for words, where similar words have similar vectors.