Which of the following lines of code can the data scientist run to accomplish the task?

A data scientist wants to explore summary statistics for the Spark DataFrame spark_df. The data scientist wants to see the count, mean, standard deviation, minimum, maximum, and interquartile range (IQR) for each numerical feature. Which of the following lines of code can the data scientist run to accomplish the task?
A...

March 6, 2025

Which of the following approaches can they take to include as much information as possible in the feature set?

A data scientist has replaced missing values in their feature set with each respective feature variable’s median value. A colleague suggests that the data scientist is throwing away valuable information by doing this. Which of the following approaches can they take to include as much information as possible in the...
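One such approach is to add a binary "missing" indicator alongside the median imputation, so the model retains the information about which rows were originally missing. A minimal pandas sketch (the column names and data are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"age": [25.0, None, 40.0, None, 31.0]})

# Preserve the "was missing" signal as a binary indicator feature...
df["age_missing"] = df["age"].isna().astype(int)
# ...then impute the median so the feature stays numeric and complete.
df["age"] = df["age"].fillna(df["age"].median())
```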

March 5, 2025

Which of the following tools can be used to parallelize the hyperparameter tuning process for single-node machine learning models using a Spark cluster?

Which of the following tools can be used to parallelize the hyperparameter tuning process for single-node machine learning models using a Spark cluster?
A. MLflow Experiment Tracking
B. Spark ML
C. Autoscaling clusters
D. Autoscaling clusters
E. Delta Lake
Answer: B
Explanation: Spark ML (part of Apache Spark's MLlib) is...

March 5, 2025

Which of the Spark operations can be used to randomly split a Spark DataFrame into a training DataFrame and a test DataFrame for downstream use?

Which of the Spark operations can be used to randomly split a Spark DataFrame into a training DataFrame and a test DataFrame for downstream use?
A. TrainValidationSplit
B. DataFrame.where
C. CrossValidator
D. TrainValidationSplitModel
E. DataFrame.randomSplit
Answer: E
Explanation: The correct method to randomly split a Spark DataFrame into training and...

March 4, 2025

Which of the following tools can be used to distribute large-scale feature engineering without the use of a UDF or pandas Function API for machine learning pipelines?

Which of the following tools can be used to distribute large-scale feature engineering without the use of a UDF or pandas Function API for machine learning pipelines?
A. Keras
B. pandas
C. PyTorch
D. Spark ML
E. Scikit-learn
Answer: D
Explanation: Spark ML (Machine Learning Library) is designed specifically for...

March 4, 2025

Which of the following changes does the data scientist need to make to their objective_function in order to produce a more accurate model?

A data scientist wants to efficiently tune the hyperparameters of a scikit-learn model. They elect to use the Hyperopt library's fmin operation to facilitate this process. Unfortunately, the final model is not very accurate. The data scientist suspects that there is an issue with the objective_function being passed as an...
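The explanation is truncated above, but the usual pitfall is that fmin *minimizes* its objective, so a score to be maximized (such as accuracy) must be returned negated. A pure-Python sketch of the sign convention (evaluate_model and the "C" parameter are hypothetical stand-ins for the actual training code):

```python
def evaluate_model(params):
    # Hypothetical stand-in for training and scoring a scikit-learn model;
    # in this toy, accuracy peaks at C == 1.0.
    return 1.0 - abs(params["C"] - 1.0) * 0.1

def objective_function(params):
    # fmin minimizes, so negate the metric you want to maximize.
    # Returning raw accuracy would steer the search toward the WORST model.
    return -evaluate_model(params)

# Grid-search stand-in for fmin: minimizing the objective maximizes accuracy.
candidates = [{"C": c} for c in (0.1, 1.0, 10.0)]
best = min(candidates, key=objective_function)
```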

February 28, 2025

What is the name of the method that transforms categorical features into a series of binary indicator feature variables?

What is the name of the method that transforms categorical features into a series of binary indicator feature variables?
A. Leave-one-out encoding
B. Target encoding
C. One-hot encoding
D. Categorical
E. String indexing
Answer: C
Explanation: The method that transforms categorical features into a series of binary indicator variables is...
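A minimal one-hot encoding sketch using pandas (the column and categories are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"color": ["red", "blue", "red"]})

# One-hot encoding: one binary indicator column per distinct category
encoded = pd.get_dummies(df, columns=["color"])
```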

February 26, 2025

Which of the following terms is used to describe this combination of models?

A data scientist has produced two models for a single machine learning problem. One of the models performs well when one of the features has a value of less than 5, and the other model performs well when the value of that feature is greater than or equal to 5....
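The answer is cut off above; combining specialized models into one predictor like this is a form of ensemble modeling. An illustrative pure-Python sketch (the component models and the threshold of 5 are made up for the example):

```python
def model_low(x):
    # Hypothetical model that performs well when the feature is < 5
    return x * 2

def model_high(x):
    # Hypothetical model that performs well when the feature is >= 5
    return x + 10

def combined_model(x):
    # Route each input to whichever component model covers its region
    return model_low(x) if x < 5 else model_high(x)
```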

February 24, 2025

Which of the following code blocks will accomplish this task?

A data scientist has a Spark DataFrame spark_df. They want to create a new Spark DataFrame that contains only the rows from spark_df where the value in column price is greater than 0. Which of the following code blocks will accomplish this task?
A. spark_df[spark_df["price"] > 0]
B. spark_df.filter(col("price") >...

February 18, 2025

Which of the following statements describes a Spark ML estimator?

Which of the following statements describes a Spark ML estimator?
A. An estimator is a hyperparameter grid that can be used to train a model
B. An estimator chains multiple algorithms together to specify an ML workflow
C. An estimator is a trained ML model which turns a DataFrame with...

February 9, 2025