Databricks Certified Machine Learning Associate Exam Online Training
The questions for Databricks Machine Learning Associate were last updated on Nov 19, 2024.
- Exam Code: Databricks Machine Learning Associate
- Exam Name: Databricks Certified Machine Learning Associate Exam
- Certification Provider: Databricks
- Latest update: Nov 19, 2024
A machine learning engineer has created a Feature Table new_table using Feature Store Client fs. When creating the table, they specified a metadata description with key information about the Feature Table. They now want to retrieve that metadata programmatically.
Which of the following lines of code will return the metadata description?
- A . There is no way to return the metadata description programmatically.
- B . fs.create_training_set("new_table")
- C . fs.get_table("new_table").description
- D . fs.get_table("new_table").load_df()
- E . fs.get_table("new_table")
A data scientist has a Spark DataFrame spark_df. They want to create a new Spark DataFrame that contains only the rows from spark_df where the value in column price is greater than 0.
Which of the following code blocks will accomplish this task?
- A . spark_df[spark_df["price"] > 0]
- B . spark_df.filter(col("price") > 0)
- C . SELECT * FROM spark_df WHERE price > 0
- D . spark_df.loc[spark_df["price"] > 0,:]
- E . spark_df.loc[:,spark_df["price"] > 0]
A health organization is developing a classification model to determine whether or not a patient currently has a specific type of infection. The organization’s leaders want to maximize the number of positive cases identified by the model.
Which of the following classification metrics should be used to evaluate the model?
- A . RMSE
- B . Precision
- C . Area under the residual operating curve
- D . Accuracy
- E . Recall
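The difference between precision and recall is easier to see from raw counts. The sketch below is a minimal plain-Python illustration, not part of the exam material; the labels and the helper name `precision_recall` are made up for the example.

```python
def precision_recall(y_true, y_pred):
    """Return (precision, recall) for binary labels where 1 is the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)  # false negatives
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical labels: 1 = infected, 0 = not infected.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0]

precision, recall = precision_recall(y_true, y_pred)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Recall is the share of actual positive cases the model found, while precision is the share of positive predictions that were correct.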
In which of the following situations is it preferable to impute missing feature values with their median value over the mean value?
- A . When the features are of the categorical type
- B . When the features are of the boolean type
- C . When the features contain a lot of extreme outliers
- D . When the features contain no outliers
- E . When the features contain no missing values
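To see how the two statistics react to extreme values, the following sketch compares them on a small made-up column using only Python's standard library. The numbers are illustrative assumptions, not exam data.

```python
import statistics

# Hypothetical feature values, then the same values plus one extreme outlier.
values = [10.0, 11.0, 12.0, 13.0, 14.0]
with_outlier = values + [1000.0]

mean_clean = statistics.mean(values)
median_clean = statistics.median(values)

mean_out = statistics.mean(with_outlier)      # pulled far toward the outlier
median_out = statistics.median(with_outlier)  # barely moves

print(mean_clean, median_clean)
print(mean_out, median_out)
```

A single extreme value shifts the mean by more than an order of magnitude here, while the median changes only slightly.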
A data scientist has replaced missing values in their feature set with each respective feature variable’s median value. A colleague suggests that the data scientist is throwing away valuable information by doing this.
Which of the following approaches can they take to include as much information as possible in the feature set?
- A . Impute the missing values using each respective feature variable’s mean value instead of the median value
- B . Refrain from imputing the missing values in favor of letting the machine learning algorithm determine how to handle them
- C . Remove all feature variables that originally contained missing values from the feature set
- D . Create a binary feature variable for each feature that contained missing values indicating whether each row’s value has been imputed
- E . Create a constant feature variable for each feature that contained missing values indicating the percentage of rows from the feature that was originally missing
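For reference, a "was this value imputed" flag of the kind mentioned in the options can be sketched in plain Python as follows. The column name `age` and the helper `impute_with_indicator` are hypothetical, and missing values are represented by `None` for the sake of the example.

```python
import statistics


def impute_with_indicator(column):
    """Median-impute a numeric column and return a parallel 0/1 'was missing' flag."""
    observed = [v for v in column if v is not None]
    fill = statistics.median(observed)
    imputed = [fill if v is None else v for v in column]
    was_missing = [1 if v is None else 0 for v in column]
    return imputed, was_missing


# Hypothetical feature with two missing entries.
age = [34.0, None, 51.0, 29.0, None]
age_imputed, age_was_missing = impute_with_indicator(age)

print(age_imputed)
print(age_was_missing)
```

The flag column keeps the fact that a row's value was originally missing, which the imputed column alone no longer shows.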
A data scientist wants to explore summary statistics for Spark DataFrame spark_df. The data scientist wants to see the count, mean, standard deviation, minimum, maximum, and interquartile range (IQR) for each numerical feature.
Which of the following lines of code can the data scientist run to accomplish the task?
- A . spark_df.summary()
- B . spark_df.stats()
- C . spark_df.describe().head()
- D . spark_df.printSchema()
- E . spark_df.toPandas()
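To make clear what each requested statistic means, the sketch below computes them for one made-up column in plain Python. This is not Spark code. The dictionary keys mirror the default rows of Spark's summary() output (count, mean, stddev, min, 25%, 50%, 75%, max); the `iqr` entry is added here for illustration, and the quartile method is an assumption of this example.

```python
import statistics


def summarize(values):
    """Compute basic summary statistics, including quartiles, for one numeric column."""
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "stddev": statistics.stdev(values),  # sample standard deviation
        "min": min(values),
        "25%": q1,
        "50%": q2,
        "75%": q3,
        "max": max(values),
        "iqr": q3 - q1,  # interquartile range = 75th minus 25th percentile
    }


stats = summarize([1.0, 2.0, 3.0, 4.0, 5.0])
for name, value in stats.items():
    print(name, value)
```

The interquartile range can only be read off an output that includes the 25th and 75th percentiles.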
An organization is developing a feature repository and is electing to one-hot encode all categorical feature variables. A data scientist suggests that the categorical feature variables should not be one-hot encoded within the feature repository.
Which of the following explanations justifies this suggestion?
- A . One-hot encoding is not supported by most machine learning libraries.
- B . One-hot encoding is dependent on the target variable’s values which differ for each application.
- C . One-hot encoding is computationally intensive and should only be performed on small samples of training sets for individual machine learning problems.
- D . One-hot encoding is not a common strategy for representing categorical feature variables numerically.
- E . One-hot encoding is a potentially problematic categorical variable strategy for some machine learning algorithms.
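As a reminder of what one-hot encoding does to a column, here is a minimal plain-Python sketch. The `one_hot` helper and the `colors` values are hypothetical; in practice a library encoder would be used.

```python
def one_hot(values):
    """Return (sorted categories, one row of 0/1 indicators per input value)."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows


# Hypothetical categorical feature.
colors = ["red", "green", "blue", "green"]
categories, encoded = one_hot(colors)

print(categories)
for row in encoded:
    print(row)
```

One categorical column becomes one indicator column per distinct category, so the encoded form is a modeling choice that different downstream algorithms may or may not want.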
A data scientist has created two linear regression models. The first model uses price as a label variable and the second model uses log(price) as a label variable. When evaluating the RMSE of each model by comparing the label predictions to the actual price values, the data scientist notices that the RMSE for the second model is much larger than the RMSE of the first model.
Which of the following possible explanations for this difference is invalid?
- A . The second model is much more accurate than the first model
- B . The data scientist failed to exponentiate the predictions in the second model prior to computing the RMSE
- C . The data scientist failed to take the log of the predictions in the first model prior to computing the RMSE
- D . The first model is much more accurate than the second model
- E . The RMSE is an invalid evaluation metric for regression problems
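The effect of comparing predictions and labels on different scales can be reproduced with a few made-up numbers. The sketch below is plain Python under the assumption that a model trained on log(price) outputs predictions in log units; all values are hypothetical.

```python
import math


def rmse(actual, predicted):
    """Root-mean-squared error between two equally long sequences."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(actual))


# Hypothetical actual prices and the log-scale outputs of a log(price) model.
price = [100.0, 200.0, 400.0]
log_predictions = [math.log(p) for p in [110.0, 190.0, 380.0]]

# Comparing log-scale predictions directly to prices mixes two scales.
rmse_wrong_scale = rmse(price, log_predictions)

# Exponentiating first puts the predictions back on the price scale.
rmse_price_scale = rmse(price, [math.exp(v) for v in log_predictions])

print(rmse_wrong_scale, rmse_price_scale)
```

The same predictions give a far larger RMSE when they are left in log units, so a large RMSE on its own does not show which model is more accurate.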
A data scientist uses 3-fold cross-validation when optimizing model hyperparameters for a regression problem.
The following root-mean-squared-error values are calculated on each of the validation folds:
• 10.0
• 12.0
• 17.0
Which of the following values represents the overall cross-validation root-mean-squared error?
- A . 13.0
- B . 17.0
- C . 12.0
- D . 39.0
- E . 10.0
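Assuming the per-fold scores are combined by a simple average, which is how Spark's CrossValidator aggregates its fold metrics, the arithmetic can be checked in a couple of lines of Python:

```python
# RMSE values from the three validation folds listed above.
fold_rmse = [10.0, 12.0, 17.0]

# Overall cross-validation score as the mean of the per-fold scores.
cv_rmse = sum(fold_rmse) / len(fold_rmse)
print(cv_rmse)
```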
A machine learning engineer is trying to scale a machine learning pipeline pipeline that contains multiple feature engineering stages and a modeling stage.
As part of the cross-validation process, they are using the following code block:
A colleague suggests that the code block can be changed to speed up the tuning process by passing the model object to the estimator parameter and then placing the updated cv object as the final stage of the pipeline in place of the original model.
Which of the following is a negative consequence of the approach suggested by the colleague?
- A . The model will take longer to train for each unique combination of hyperparameter values
- B . The feature engineering stages will be computed using validation data
- C . The cross-validation process will no longer be
- D . The cross-validation process will no longer be reproducible
- E . The model will be refit one more time per cross-validation fold
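The two arrangements differ in which rows the feature engineering stages are fitted on. The sketch below illustrates this in plain Python, not Spark: a hypothetical "centering" stage learns a mean, and the rows and fold split are made up for the example.

```python
import statistics

# Hypothetical single-feature dataset and one train/validation split of it.
rows = [1.0, 2.0, 3.0, 4.0, 5.0, 600.0]
train_fold, validation_fold = rows[:4], rows[4:]

# Whole pipeline inside the cross-validator: the centering stage is refit
# for each fold, using that fold's training rows only.
mean_inside_cv = statistics.mean(train_fold)

# Cross-validator as the last pipeline stage: the centering stage is fit once
# on all rows, so rows later used for validation shape the learned mean.
mean_outside_cv = statistics.mean(rows)

print(mean_inside_cv, mean_outside_cv)
```

Fitting the feature stages once avoids repeating that work for every fold and parameter combination, but the fitted statistics then already include information from the validation rows.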