How should you distribute the training examples across the train-test-eval subsets while maintaining the 80-10-10 proportion?
Your team is working on an NLP research project to predict political affiliation of authors based on articles they have written.
You have a large training dataset that is structured like this:
You followed the standard 80%-10%-10% data distribution across the training, testing, and evaluation subsets.
How should you distribute the training examples across the train-test-eval subsets while maintaining the 80-10-10 proportion?
A)
B)
C)
D)
A . Option A
B . Option B
C . Option C
D . Option D
Answer: C
Explanation:
The best way to distribute the training examples across the train-test-eval subsets while maintaining the 80-10-10 proportion is to use option C. This option ensures that each subset contains a balanced and representative sample of the different classes (Democrat and Republican) and the different authors. This way, the model can learn from a diverse and comprehensive set of articles and avoid overfitting or underfitting.
Option C also avoids the problem of data leakage, which occurs when the same author appears in more than one subset, potentially biasing the model and inflating its performance. Therefore, option C is the most suitable technique for this use case.
Latest Professional Machine Learning Engineer Dumps Valid Version with 60 Q&As
Latest And Valid Q&A | Instant Download | Once Fail, Full Refund