When working with textual data and trying to classify text into different languages, which approach to representing features makes the most sense?

A. Bag of words model with TF-IDF
B. Bag of bigrams (two-letter pairs)
C. Word2Vec algorithm
D. Clustering similar words and representing words by group membership

Answer: B

Explanation:

A bag of bigrams here means character bigrams: the text is represented by how often each pair of adjacent letters occurs. For example, the word “hello” is represented as {“he”: 1, “el”: 1, “ll”: 1, “lo”: 1}. These counts capture the spelling patterns of a text, which helps identify its language. Certain bigrams are far more frequent in some languages than in others, such as “th” in English or “ch” in German. The word-based options (A, C, and D) are a weaker fit because they rely on a vocabulary of known words, whereas letter pairs can be counted in any text without one.
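
The counting described in the explanation can be sketched in a few lines of Python. This is a minimal illustration, not a full language classifier: the function name `char_bigrams` and the two sample sentences are assumptions made for the example, and bigrams are counted within words only, after lowercasing and dropping non-letters.

```python
from collections import Counter


def char_bigrams(text: str) -> Counter:
    """Count adjacent letter pairs within each word of `text` (hypothetical helper)."""
    counts = Counter()
    for word in text.lower().split():
        # Keep letters only, so punctuation and digits do not form bigrams.
        letters = "".join(ch for ch in word if ch.isalpha())
        counts.update(letters[i:i + 2] for i in range(len(letters) - 1))
    return counts


# The example from the explanation.
print(dict(char_bigrams("hello")))  # {'he': 1, 'el': 1, 'll': 1, 'lo': 1}

# Made-up sample sentences, used only to show how the counts differ by language.
english = char_bigrams("the weather is nice there")
german = char_bigrams("ich mache noch nichts")
print(english["th"], english["ch"])  # 3 0
print(german["th"], german["ch"])    # 0 4
```

In practice these counts would be turned into a feature vector per text and passed to a classifier; that step is omitted here.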