A machine learning engineer is trying to scale a machine learning pipeline that contains multiple feature engineering stages and a modeling stage.
As part of the cross-validation process, they are using the following code block:
A colleague suggests that the code block can be changed to speed up the tuning process by passing the model object to the estimator parameter and then placing the updated cv object as the final stage of the pipeline in place of the original model.
Which of the following is a negative consequence of the approach suggested by the colleague?
A. The model will take longer to train for each unique combination of hyperparameter values
B. The feature engineering stages will be computed using validation data
C. The cross-validation process will no longer be
D. The cross-validation process will no longer be reproducible
E. The model will be refit one more time per cross-validation fold
Answer: B
Explanation:
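If only the model object is passed to the estimator parameter of CrossValidator and the CrossValidator is placed as the final stage of the pipeline, the feature engineering stages are fit once, on the entire dataset passed to the pipeline, before CrossValidator splits the data into folds. This is why tuning gets faster: the stages are no longer refit for every fold and every hyperparameter combination. The cost is that any stage that learns something from the data (for example scaling statistics, imputation values, or category indices) has already seen the rows that later serve as validation folds. Information from the validation data therefore leaks into the training process, and the cross-validation metrics can be overly optimistic. When the whole pipeline is passed as the estimator instead, the feature engineering stages are refit on the training portion of each fold only, which avoids the leak.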
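The difference between the two arrangements can be illustrated with a small pure-Python sketch. It does not use Spark: `fit_centering` is a hypothetical stand-in for any feature engineering stage that learns a statistic from data, and `folds` is an illustrative contiguous k-fold splitter, not the splitting logic CrossValidator uses.

```python
from statistics import mean


def fit_centering(values):
    """Toy feature engineering stage: learns the mean of a column."""
    return mean(values)


def folds(data, k):
    """Yield (train, validation) pairs using contiguous folds (illustrative only)."""
    size = len(data) // k
    for i in range(k):
        validation = data[i * size:(i + 1) * size]
        train = data[:i * size] + data[(i + 1) * size:]
        yield train, validation


data = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]

# Original arrangement (whole pipeline is the estimator):
# the stage is refit on the training rows of each fold, so it never
# sees that fold's validation rows. Cost: one fit per fold.
means_per_fold = [fit_centering(train) for train, _ in folds(data, 3)]

# Colleague's arrangement (cross-validation is the last pipeline stage):
# the stage is fit once on all rows, including rows that will later be
# used as validation folds. Cheaper, but the statistic leaks.
mean_once = fit_centering(data)

print(means_per_fold)  # [4.5, 3.5, 2.5]
print(mean_once)       # 3.5
```

In the first arrangement the learned mean changes from fold to fold because each fit excludes the validation rows. In the second, a single mean computed from every row is reused for all folds, which is the leak described in answer B.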
Reference: Cross-validation and Pipeline Integration in MLlib (Avoiding Data Leakage in Pipelines).