You are working on a classification problem with time series data and achieved an area under the receiver operating characteristic curve (AUC ROC) value of 99% for training data after just a few experiments. You haven’t explored using any sophisticated algorithms or spent any time on hyperparameter tuning.
What should your next step be to identify and fix the problem?
A. Address the model overfitting by using a less complex algorithm.
B. Address data leakage by applying nested cross-validation during model training.
C. Address data leakage by removing features highly correlated with the target value.
D. Address the model overfitting by tuning the hyperparameters to reduce the AUC ROC value.
Answer: B
Explanation:
Data leakage occurs when information that would not be available at prediction time is used to build or evaluate a model, resulting in an overly optimistic or invalid estimate of model performance. A near-perfect AUC ROC after only a few experiments, with no sophisticated algorithms or hyperparameter tuning, is a common warning sign of leakage. In time series data, leakage can occur when the temporal order of the data is not preserved during data preparation or model evaluation. For example, leakage occurs if the data is shuffled before it is split into train and test sets, or if future values are used to impute missing values in past data.
One way to address data leakage in time series data is to apply nested cross-validation during model training. Nested cross-validation performs model selection and model evaluation in two separate loops. The outer loop splits the data into n folds and, in each iteration, holds one fold out for testing. The inner loop splits the remaining outer training data into k folds, trains each candidate configuration on part of that data, validates it on a held-out inner fold, and selects the configuration with the best average validation performance. The selected configuration is then retrained on the full outer training data and scored on the held-out outer fold. Repeating this for every outer fold yields the performance estimate. For time series, both loops must split the data in time order, so that a model is only ever validated or tested on data that comes after the data it was trained on. Standard shuffled k-fold splits do not preserve temporal order on their own.
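The procedure can be sketched in plain Python as follows. This is only an illustrative sketch, not a reference implementation: the helper names (expanding_window_splits, knn_accuracy, nested_time_series_cv), the toy 1-D k-nearest-neighbour classifier, the synthetic data, and the use of accuracy instead of AUC ROC are all assumptions made for the example. Samples are assumed to be stored in chronological order.

```python
def expanding_window_splits(n_samples, n_splits):
    """Yield (train_idx, test_idx) where every training index precedes every test index."""
    fold = n_samples // (n_splits + 1)
    if fold == 0:
        raise ValueError("not enough samples for the requested number of splits")
    for i in range(1, n_splits + 1):
        train_end = fold * i
        test_end = n_samples if i == n_splits else train_end + fold
        yield list(range(train_end)), list(range(train_end, test_end))


def knn_accuracy(X, y, train_idx, test_idx, k):
    """Accuracy of a 1-D k-nearest-neighbour vote that sees only train_idx."""
    correct = 0
    for t in test_idx:
        nearest = sorted(train_idx, key=lambda i: abs(X[i] - X[t]))[:k]
        votes = sum(y[i] for i in nearest)
        predicted = 1 if 2 * votes > len(nearest) else 0
        correct += predicted == y[t]
    return correct / len(test_idx)


def nested_time_series_cv(X, y, k_candidates, outer_splits=3, inner_splits=2):
    """Return one (selected_k, outer_test_accuracy) pair per outer fold."""
    results = []
    for outer_train, outer_test in expanding_window_splits(len(X), outer_splits):

        def inner_score(k):
            # Inner loop: model selection uses the outer training data only.
            scores = []
            for tr, va in expanding_window_splits(len(outer_train), inner_splits):
                inner_train = [outer_train[i] for i in tr]
                inner_val = [outer_train[i] for i in va]
                scores.append(knn_accuracy(X, y, inner_train, inner_val, k))
            return sum(scores) / len(scores)

        best_k = max(k_candidates, key=inner_score)
        # Outer loop: the held-out future fold is scored exactly once.
        results.append((best_k, knn_accuracy(X, y, outer_train, outer_test, best_k)))
    return results


# Synthetic, chronologically ordered toy data (an assumption for this sketch).
X = [(i * 7 % 20) / 20 for i in range(40)]
y = [int(x > 0.5) for x in X]

results = nested_time_series_cv(X, y, k_candidates=[1, 3, 5])
for fold, (k, acc) in enumerate(results, start=1):
    print(f"outer fold {fold}: selected k={k}, test accuracy={acc:.2f}")
```

The outer test fold is never passed to inner_score, so the choice of k cannot be influenced by the data that is later used to report performance.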
Nested cross-validation helps to avoid data leakage by ensuring that the model is trained and tested on non-overlapping data, and that the data used for evaluation is never seen during training or model selection. When the splits are made in time order, the evaluation data also always lies after the training data on the timeline. Nested cross-validation also provides a more reliable estimate of model performance than a single train-test split or simple cross-validation, because the test folds play no part in model selection, which removes the optimistic bias introduced by tuning on the evaluation data.
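A quick sanity check for any split is to confirm that the train and test indices do not overlap and that no training sample comes after the earliest test sample. The sketch below is illustrative only: the function names are hypothetical, the samples are assumed to be indexed in chronological order, and the interleaved split is a deterministic stand-in for what shuffling before splitting would do.

```python
def leaks_future_data(train_idx, test_idx):
    """True if any training sample is at or after the earliest test sample."""
    return max(train_idx) >= min(test_idx)


def overlaps(train_idx, test_idx):
    """True if the same sample appears in both sets."""
    return bool(set(train_idx) & set(test_idx))


n_samples = 20  # assumed to be indexed in chronological order

# Interleaved split: every fifth sample is held out, scattering test
# samples across the timeline as a shuffle would.
interleaved_test = [i for i in range(n_samples) if i % 5 == 0]
interleaved_train = [i for i in range(n_samples) if i % 5 != 0]

# Chronological split: the last fifth of the timeline is held out.
cutoff = n_samples * 4 // 5
chrono_train = list(range(cutoff))
chrono_test = list(range(cutoff, n_samples))

print(leaks_future_data(interleaved_train, interleaved_test))  # True
print(leaks_future_data(chrono_train, chrono_test))            # False
```

Neither split has overlapping samples, yet the interleaved one still trains on observations that come after some test observations, which is the form of leakage specific to time series.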
References:
Data Leakage in Machine Learning
How to Avoid Data Leakage When Performing Data Preparation
Classification on a single time series – prevent leakage between train and test