Which of the following pieces of code can be used to fill in the above blank to complete the task?
A machine learning engineer wants to parallelize the training of group-specific models using the Pandas Function API. They have developed the train_model function, and they want to apply it to each group of DataFrame df. They have written the following incomplete code block: Which of the following pieces of code...
Which of the following blocks of code can the data scientist run to be able to use the pandas API on Spark?
A data scientist has been given an incomplete notebook from the data engineering team. The notebook uses a Spark DataFrame spark_df on which the data scientist needs to perform further feature engineering. Unfortunately, the data scientist has not yet learned the PySpark DataFrame API. Which of the following blocks of...
Which of the following describes why?
A data scientist wants to parallelize the training of trees in a gradient boosted tree to speed up the training process. A colleague suggests that parallelizing a boosted tree algorithm can be difficult. Which of the following describes why? A. Gradient boosting is not a linear algebra-based algorithm which is...
Which of the following feature engineering tasks will be the least efficient to distribute?
A machine learning engineer is trying to scale a machine learning pipeline by distributing its feature engineering process. Which of the following feature engineering tasks will be the least efficient to distribute? A. One-hot encoding categorical features B. Target encoding categorical features C. Imputing missing feature values with the mean D...
In which of the following situations is it preferable to impute missing feature values with their median value over the mean value?
In which of the following situations is it preferable to impute missing feature values with their median value over the mean value? A. When the features are of the categorical type B. When the features are of the boolean type C. When the features contain a lot of extreme outliers D...
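For readers who want to see the effect behind this question, here is a minimal sketch in plain Python (no Spark). The `impute` helper and the `incomes` values are made up for illustration and are not part of the original question. It only shows how a single extreme value pulls a mean-based fill value far from the typical observation while the median stays put.

```python
from statistics import mean, median

def impute(values, strategy):
    """Replace None entries with the mean or median of the observed values.

    Hypothetical helper for illustration only.
    """
    observed = [v for v in values if v is not None]
    fill = mean(observed) if strategy == "mean" else median(observed)
    return [fill if v is None else v for v in values]

# Made-up feature column with one missing value and one extreme outlier.
incomes = [30, 32, 35, None, 31, 1000]

mean_filled = impute(incomes, "mean")      # fill value is dragged up by 1000
median_filled = impute(incomes, "median")  # fill value stays near typical values

print("mean fill:  ", mean_filled[3])
print("median fill:", median_filled[3])
```

In this toy column the median-based fill stays close to the bulk of the data, which is the usual argument for preferring it when outliers are present.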
Which of the following approaches can the data scientist take to spend the least amount of time refactoring their notebook to scale with big data?
A data scientist has written a data cleaning notebook that utilizes the pandas library, but their colleague has suggested that they refactor their notebook to scale with big data. Which of the following approaches can the data scientist take to spend the least amount of time refactoring their notebook to...
Which of the following is a negative consequence of the approach suggested by the colleague?
A machine learning engineer is trying to scale a machine learning pipeline pipeline that contains multiple feature engineering stages and a modeling stage. As part of the cross-validation process, they are using the following code block: A colleague suggests that the code block can be changed to speed up the...
Which of the following lines of code can be used to complete the code block to successfully complete the task?
A data scientist has defined a Pandas UDF function predict to parallelize the inference process for a single-node model: They have written the following incomplete code block to use predict to score each record of Spark DataFrame spark_df: Which of the following lines of code can be used to complete...
Which of the following changes does the machine learning engineer need to make to complete the task?
A machine learning engineer would like to develop a linear regression model with Spark ML to predict the price of a hotel room. They are using the Spark DataFrame train_df to train the model. The Spark DataFrame train_df has the following schema: The machine learning engineer shares the following code...
Which of the following possible explanations for this difference is invalid?
A data scientist has created two linear regression models. The first model uses price as a label variable and the second model uses log(price) as a label variable. When evaluating the RMSE of each model by comparing the label predictions to the actual price values, the data scientist notices that...
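As a rough illustration of the scale issue this question touches on, the sketch below uses plain Python with made-up prices and predictions (none of these numbers come from the original question). It compares the RMSE obtained when log-scale predictions are measured directly against actual prices with the RMSE obtained after exponentiating them back to the price scale. It does not attempt to answer the truncated question.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error between two equally long sequences."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(actual))

# Made-up actual prices and hypothetical predictions from a log(price) model.
actual_price = [100.0, 150.0, 200.0]
log_predictions = [math.log(110.0), math.log(140.0), math.log(190.0)]

# Comparing log-scale predictions to raw prices mixes two different scales.
rmse_wrong_scale = rmse(actual_price, log_predictions)

# Exponentiating first puts the predictions back on the price scale.
rmse_price_scale = rmse(actual_price, [math.exp(p) for p in log_predictions])

print("RMSE, log predictions vs. price:", rmse_wrong_scale)
print("RMSE, exponentiated predictions:", rmse_price_scale)
```

The two numbers differ widely even though the underlying predictions are the same, so any comparison between the two models needs both sets of predictions on the same scale.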