A gaming company has launched an online game where people can start playing for free, but they need to pay to use certain features. The company needs to build an automated system to predict whether or not a new user will become a paid user within one year. The company has gathered a labeled dataset from 1 million users.

The training dataset consists of 1,000 positive samples (from users who ended up paying within one year) and 999,000 negative samples (from users who did not use any paid features). Each data sample consists of 200 features, including user age, device, location, and play patterns.

Using this dataset for training, the Data Science team trained a random forest model that converged with over 99% accuracy on the training set. However, the prediction results on a test dataset were not satisfactory.

Which of the following approaches should the Data Science team take to mitigate this issue? (Select TWO.)
A. Add more deep trees to the random forest to enable the model to learn more features.
B. Include a copy of the samples from the test dataset in the training dataset.
C. Generate more positive samples by duplicating the positive samples and adding a small amount of noise to the duplicated data.
D. Change the cost function so that false negatives have a higher impact on the cost value than false positives.
E. Change the cost function so that false positives have a higher impact on the cost value than false negatives.

Answer: C, D

Explanation:

The Data Science team is facing a problem of imbalanced data, where the positive class (paid users) is much less frequent than the negative class (non-paid users). This can cause the random forest model to be biased towards the majority class and have poor performance on the minority class. To mitigate this issue, the Data Science team can try the following approaches:

C) Generate more positive samples by duplicating the positive samples and adding a small amount of noise to the duplicated data. This is a form of oversampling with noise, a simple data augmentation technique that increases the size and diversity of the training data for the minority class. This helps the random forest model learn more patterns from the positive class and reduces the imbalance ratio.
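A minimal sketch of this oversampling idea, assuming NumPy feature matrices; the array sizes and noise scale here are illustrative, not from the question:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical imbalanced training data (shrunk from the question's
# 1,000 vs 999,000 so the sketch runs instantly): 10 positives, 5 features.
X_pos = rng.normal(size=(10, 5))

def augment_positives(X_pos, factor=10, noise_scale=0.05, rng=None):
    """Duplicate the positive samples `factor` times and add a small
    amount of Gaussian noise so the copies are not exact duplicates."""
    rng = rng or np.random.default_rng()
    copies = np.tile(X_pos, (factor, 1))          # stack `factor` copies
    noise = rng.normal(scale=noise_scale, size=copies.shape)
    return np.vstack([X_pos, copies + noise])     # originals + noisy copies

X_pos_aug = augment_positives(X_pos, factor=10, rng=rng)
print(X_pos_aug.shape)  # (110, 5): the 10 originals plus 100 noisy copies
```

The noise keeps the duplicated rows from being exact copies, which would otherwise just reweight the same points; in practice, a dedicated method such as SMOTE (from the `imbalanced-learn` package) is a common alternative.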

D) Change the cost function so that false negatives have a higher impact on the cost value than false positives. This is a technique called cost-sensitive learning, which can assign different weights or costs to different classes or errors. By assigning a higher cost to false negatives (predicting non-paid when the user is actually paid), the random forest model can be more sensitive to the minority class and try to minimize the misclassification of the positive class.

Reference:

- Bagging and Random Forest for Imbalanced Classification
- Surviving in a Random Forest with Imbalanced Datasets
- machine learning – random forest for imbalanced data? – Cross Validated
- Biased Random Forest for Dealing with the Class Imbalance Problem
