Which of the following approaches can they take to include as much information as possible in the feature set?
A data scientist has replaced missing values in their feature set with each respective feature variable’s median value. A colleague suggests that the data scientist is throwing away valuable information by doing this. Which of the following approaches can they take to include as much information as possible in the...
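One common way to keep the information that a value was originally missing is to add a binary indicator next to the imputed feature. The sketch below is only an illustration under stated assumptions: the feature is a plain Python list, missing entries are represented as `None`, and the function name `impute_with_indicator` and the sample values are hypothetical rather than taken from the question.

```python
from statistics import median

def impute_with_indicator(values):
    """Return (imputed_values, was_missing_flags) for one numeric feature.

    Assumption: missing entries are represented as None.
    """
    observed = [v for v in values if v is not None]
    fill = median(observed)  # median of the observed values only
    imputed = [fill if v is None else v for v in values]
    # 1 marks a row whose value was imputed, 0 marks an observed value
    flags = [1 if v is None else 0 for v in values]
    return imputed, flags

# Illustrative data only
ages = [34, None, 29, 41, None]
imputed_ages, age_was_missing = impute_with_indicator(ages)
```

The indicator column lets a downstream model distinguish a genuinely observed median-like value from one that was filled in.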
Which of the following classification metrics should be used to evaluate the model?
A health organization is developing a classification model to determine whether or not a patient currently has a specific type of infection. The organization's leaders want to maximize the number of positive cases identified by the model. Which of the following classification metrics should be used to evaluate the model? A...
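Recall (the true positive rate) measures the share of actual positive cases that a classifier identifies, which lines up with the goal described in the question. A minimal sketch of the calculation, assuming binary labels encoded as 0/1; the label lists are made-up illustrative values and the helper name `recall` is hypothetical.

```python
def recall(y_true, y_pred, positive=1):
    """Recall = true positives / (true positives + false negatives)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp + fn == 0:
        # Assumption: report 0.0 when there are no actual positives
        return 0.0
    return tp / (tp + fn)

# Illustrative labels only: 1 = infected, 0 = not infected
y_true = [1, 1, 1, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 1]
infection_recall = recall(y_true, y_pred)
```

Here three of the four actual positives are identified, so the false positive at the fifth record does not affect the result.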
Which of the following tools can be used to parallelize the hyperparameter tuning process for single-node machine learning models using a Spark cluster?
A. MLflow Experiment Tracking
B. Spark ML
C. Autoscaling clusters
D. Autoscaling clusters
E. Delta Lake
Answer: B
Explanation: Spark ML (part of Apache Spark's MLlib) is...
What is the name of the method that transforms categorical features into a series of binary indicator feature variables?
A. Leave-one-out encoding
B. Target encoding
C. One-hot encoding
D. Categorical
E. String indexing
Answer: C
Explanation: The method that transforms categorical features into a series of binary indicator variables is...
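To make the idea of binary indicator variables concrete, here is a minimal pure-Python sketch of one-hot encoding. It is not how any particular library implements it; the helper name `one_hot_encode`, the alphabetical column ordering, and the sample values are assumptions made for illustration.

```python
def one_hot_encode(values):
    """Map each categorical value to a list of 0/1 indicators.

    Assumption: indicator columns are ordered alphabetically by category.
    """
    categories = sorted(set(values))
    encoded = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, encoded

# Illustrative data only
colors = ["red", "green", "blue", "green"]
categories, encoded = one_hot_encode(colors)
```

Each row contains exactly one 1, in the column that matches its category.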
Which of the following describes the relationship between native Spark DataFrames and pandas API on Spark DataFrames?
A. pandas API on Spark DataFrames are single-node versions of Spark DataFrames with additional metadata
B. pandas API on Spark DataFrames are more performant than Spark DataFrames
C. pandas API on Spark DataFrames...
Which of the following possible explanations for this difference is invalid?
A data scientist has created two linear regression models. The first model uses price as a label variable and the second model uses log(price) as a label variable. When evaluating the RMSE of each model by comparing the label predictions to the actual price values, the data scientist notices that...
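Because the second model predicts log(price), its raw outputs are on a different scale from the actual prices. The sketch below shows how much the RMSE changes depending on whether the log-scale predictions are exponentiated before the comparison. All numbers are made up for illustration, and the helper name `rmse` is hypothetical.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(actual))

# Illustrative values only
prices = [100.0, 200.0, 400.0]
# Hypothetical outputs of a model trained on log(price)
log_predictions = [math.log(110.0), math.log(190.0), math.log(420.0)]

# Comparing log-scale predictions directly to prices mixes two scales
rmse_unconverted = rmse(prices, log_predictions)

# Exponentiating first puts the predictions back on the price scale
rmse_converted = rmse(prices, [math.exp(p) for p in log_predictions])
```

In this toy case the unconverted RMSE is far larger, purely because of the scale mismatch rather than model quality.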
Which of the following changes do they need to make to the above code block in order to accomplish the task?
A data scientist wants to tune a set of hyperparameters for a machine learning model. They have wrapped a Spark ML model in the objective function objective_function and they have defined the search space search_space. As a result, they have the following code block: Which of the following changes do...
Which of the following lines of code can be used to complete the code block to successfully complete the task?
A data scientist has defined a Pandas UDF function predict to parallelize the inference process for a single-node model: They have written the following incomplete code block to use predict to score each record of Spark DataFrame spark_df: Which of the following lines of code can be used to complete...
Which of the following lines of code can the data scientist run to accomplish the task?
A data scientist wants to explore summary statistics for Spark DataFrame spark_df. The data scientist wants to see the count, mean, standard deviation, minimum, maximum, and interquartile range (IQR) for each numerical feature. Which of the following lines of code can the data scientist run to accomplish the task? A...
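For reference, the statistics named in the question can be sketched in plain Python as follows. This is not Spark code: in PySpark, `DataFrame.summary()` reports count, mean, stddev, min, the 25%/50%/75% percentiles, and max by default, and the IQR is the difference between the 75% and 25% values. The helper name `summarize`, the sample data, and the choice of the `inclusive` quantile method are assumptions, and Spark's percentiles are approximate, so values may differ slightly.

```python
from statistics import mean, quantiles, stdev

def summarize(values):
    """Summary statistics for one numeric feature held in a list."""
    # Assumption: 'inclusive' quartiles; other methods give other cut points
    q1, _median, q3 = quantiles(values, n=4, method="inclusive")
    return {
        "count": len(values),
        "mean": mean(values),
        "stddev": stdev(values),  # sample standard deviation
        "min": min(values),
        "25%": q1,
        "75%": q3,
        "iqr": q3 - q1,
        "max": max(values),
    }

# Illustrative data only
stats = summarize([1, 2, 3, 4, 5, 6, 7, 8, 9])
```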
Which of the following approaches can the data scientist take to spend the least amount of time refactoring their notebook to scale with big data?
A data scientist has written a data cleaning notebook that utilizes the pandas library, but their colleague has suggested that they refactor their notebook to scale with big data. Which of the following approaches can the data scientist take to spend the least amount of time refactoring their notebook to...