Which of the following is a negative consequence of the approach suggested by the colleague?
A machine learning engineer is trying to scale a machine learning pipeline that contains multiple feature engineering stages and a modeling stage. As part of the cross-validation process, they are using the following code block: A colleague suggests that the code block can be changed to speed up the...
Which of the following lines of code can be used to complete the code block to successfully complete the task?
A data scientist has defined a Pandas UDF function predict to parallelize the inference process for a single-node model: They have written the following incomplete code block to use predict to score each record of Spark DataFrame spark_df: Which of the following lines of code can be used to complete...
Which of the following changes does the machine learning engineer need to make to complete the task?
A machine learning engineer would like to develop a linear regression model with Spark ML to predict the price of a hotel room. They are using the Spark DataFrame train_df to train the model. The Spark DataFrame train_df has the following schema: The machine learning engineer shares the following code...
Which of the following possible explanations for this difference is invalid?
A data scientist has created two linear regression models. The first model uses price as a label variable and the second model uses log(price) as a label variable. When evaluating the RMSE of each model by comparing the label predictions to the actual price values, the data scientist notices that...
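The scale issue behind this scenario can be illustrated with a minimal plain-Python sketch. The prices and predictions are made-up values and `rmse` is a hypothetical helper, not code from the question. The only point is that predictions from a model trained on log(price) sit on the log scale, so they need to be exponentiated before being compared with actual price values.

```python
import math

def rmse(predictions, actuals):
    """Root mean squared error between two equal-length sequences."""
    squared_errors = [(p - a) ** 2 for p, a in zip(predictions, actuals)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Made-up actual prices and made-up outputs of a model trained on log(price).
actual_prices = [100.0, 200.0, 400.0]
log_predictions = [math.log(110.0), math.log(190.0), math.log(380.0)]

# Comparing log-scale predictions directly with prices mixes two scales.
rmse_mixed_scales = rmse(log_predictions, actual_prices)

# Exponentiating first puts the predictions back on the price scale.
price_predictions = [math.exp(p) for p in log_predictions]
rmse_price_scale = rmse(price_predictions, actual_prices)
```

With these illustrative numbers, the mixed-scale RMSE is far larger than the price-scale RMSE, even though the underlying predictions are identical.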
Which of the following code blocks can be used to compute the root mean-squared-error of the model according to the data in preds_df and assign it to the rmse variable?
A data scientist has developed a linear regression model using Spark ML and computed the predictions in a Spark DataFrame preds_df with the following schema: prediction DOUBLE, actual DOUBLE. Which of the following code blocks can be used to compute the root mean-squared-error of the model according to the data...
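For reference, the quantity being asked for can be written out in terms of the two columns of preds_df. The symbol n, the number of rows, is introduced here for illustration and does not appear in the question.

```latex
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\mathrm{prediction}_i - \mathrm{actual}_i\right)^2}
```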
Which of the following changes can the data scientist make to accomplish the task?
A data scientist is attempting to tune a logistic regression model logistic using scikit-learn. They want to specify a search space for two hyperparameters and let the tuning process randomly select values for each evaluation. They attempt to run the following code block, but it does not accomplish the desired...
Which of the following lines of code can the data scientist run to accomplish the task?
A data scientist would like to explore summary statistics for Spark DataFrame spark_df. The data scientist wants to see the count, mean, standard deviation, minimum, maximum, and interquartile range (IQR) for each numerical feature. Which of the following lines of code can the data scientist run to accomplish the task? A...
Which of the following approaches can they take to include as much information as possible in the feature set?
A data scientist has replaced missing values in their feature set with each respective feature variable’s median value. A colleague suggests that the data scientist is throwing away valuable information by doing this. Which of the following approaches can they take to include as much information as possible in the...
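One common way to keep the information that a value was missing is to add a binary indicator alongside the imputed feature. The excerpt above is truncated, so this is only a sketch of that general technique and not necessarily the answer the question expects. The function name `impute_with_indicator` and the sample column are hypothetical.

```python
from statistics import median

def impute_with_indicator(values):
    """Replace None with the median of observed values and flag imputed rows."""
    observed = [v for v in values if v is not None]
    fill_value = median(observed)
    imputed = [fill_value if v is None else v for v in values]
    # 1 marks a row whose value was imputed, 0 marks an observed value.
    was_missing = [1 if v is None else 0 for v in values]
    return imputed, was_missing

# Made-up feature column with two missing entries.
room_size = [20.0, None, 35.0, 50.0, None]
room_size_imputed, room_size_was_missing = impute_with_indicator(room_size)
```

The indicator column lets a downstream model distinguish rows that truly had the median value from rows where the median was filled in.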
Which of the following tools can be used to parallelize the hyperparameter tuning process for single-node machine learning models using a Spark cluster?
A. MLflow Experiment Tracking
B. Spark ML
C. Autoscaling clusters
D. Autoscaling clusters
E. Delta Lake
Answer: B
Explanation: Spark ML (part of Apache Spark's MLlib) is...
Which of the Spark operations can be used to randomly split a Spark DataFrame into a training DataFrame and a test DataFrame for downstream use?
A. TrainValidationSplit
B. DataFrame.where
C. CrossValidator
D. TrainValidationSplitModel
E. DataFrame.randomSplit
Answer: E
Explanation: The correct method to randomly split a Spark DataFrame into training and...
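In PySpark, DataFrame.randomSplit takes a list of weights and an optional seed, for example spark_df.randomSplit([0.8, 0.2], seed=42). The idea of a weighted random split can be sketched in plain Python as below. This is an illustration of the concept under simplifying assumptions, not Spark's implementation, and `random_split` is a hypothetical helper.

```python
import random

def random_split(rows, weights, seed=None):
    """Assign each row to one split at random, in proportion to the weights."""
    rng = random.Random(seed)
    total = sum(weights)
    # Cumulative upper bounds of each split on the interval [0, 1).
    bounds = []
    running = 0.0
    for w in weights:
        running += w / total
        bounds.append(running)
    splits = [[] for _ in weights]
    for row in rows:
        draw = rng.random()
        for index, upper in enumerate(bounds):
            # The last split also catches any floating-point rounding gap.
            if draw < upper or index == len(bounds) - 1:
                splits[index].append(row)
                break
    return splits

# Made-up rows standing in for DataFrame records.
rows = list(range(1000))
train_rows, test_rows = random_split(rows, [0.8, 0.2], seed=42)
```

Because each row is assigned independently, the split sizes only approximate the requested proportions, and fixing the seed makes the split reproducible.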