Databricks Databricks Machine Learning Associate Databricks Certified Machine Learning Associate Exam Online Training
Databricks Databricks Machine Learning Associate Online Training
The questions for Databricks Machine Learning Associate were last updated at Apr 23,2025.
- Exam Code: Databricks Machine Learning Associate
- Exam Name: Databricks Certified Machine Learning Associate Exam
- Certification Provider: Databricks
- Latest update: Apr 23,2025
A machine learning engineer is trying to scale a machine learning pipeline by distributing its feature engineering process.
Which of the following feature engineering tasks will be the least efficient to distribute?
- A . One-hot encoding categorical features
- B . Target encoding categorical features
- C . Imputing missing feature values with the mean
- D . Imputing missing feature values with the true median
- E . Creating binary indicator features for missing values
Which of the following is a benefit of using vectorized pandas UDFs instead of standard PySpark UDFs?
- A . The vectorized pandas UDFs allow for the use of type hints
- B . The vectorized pandas UDFs process data in batches rather than one row at a time
- C . The vectorized pandas UDFs allow for pandas API use inside of the function
- D . The vectorized pandas UDFs work on distributed DataFrames
- E . The vectorized pandas UDFs process data in memory rather than spilling to disk
A data scientist wants to tune a set of hyperparameters for a machine learning model. They have wrapped a Spark ML model in the objective function objective_function and they have defined the search space search_space.
As a result, they have the following code block:
Which of the following changes do they need to make to the above code block in order to accomplish the task?
- A . Change SparkTrials() to Trials()
- B . Reduce num_evals to be less than 10
- C . Change fmin() to fmax()
- D . Remove the trials=trials argument
- E . Remove the algo=tpe.suggest argument
A machine learning engineer would like to develop a linear regression model with Spark ML to predict the price of a hotel room. They are using the Spark DataFrame train_df to train the model.
The Spark DataFrame train_df has the following schema:
The machine learning engineer shares the following code block:
Which of the following changes does the machine learning engineer need to make to complete the task?
- A . They need to call the transform method on train df
- B . They need to convert the features column to be a vector
- C . They do not need to make any changes
- D . They need to utilize a Pipeline to fit the model
- E . They need to split the features column out into one column for each feature
Which of the following tools can be used to distribute large-scale feature engineering without the use of a UDF or pandas Function API for machine learning pipelines?
- A . Keras
- B . pandas
- C . PvTorch
- D . Spark ML
- E . Scikit-learn
A data scientist has developed a linear regression model using Spark ML and computed the predictions in a Spark DataFrame preds_df with the following schema: prediction DOUBLE actual DOUBLE
Which of the following code blocks can be used to compute the root mean-squared-error of the model according to the data in preds_df and assign it to the rmse variable?
A)
B)
C)
D)
E)
- A . Option A
- B . Option B
- C . Option C
- D . Option D
- E . Option E
A machine learning engineer wants to parallelize the training of group-specific models using the Pandas Function API. They have developed the train_model function, and they want to apply it to each group of DataFrame df.
They have written the following incomplete code block:
Which of the following pieces of code can be used to fill in the above blank to complete the task?
- A . applyInPandas
- B . mapInPandas
- C . predict
- D . train_model
- E . groupedApplyIn
Which of the following statements describes a Spark ML estimator?
- A . An estimator is a hyperparameter arid that can be used to train a model
- B . An estimator chains multiple alqorithms toqether to specify an ML workflow
- C . An estimator is a trained ML model which turns a DataFrame with features into a DataFrame with predictions
- D . An estimator is an alqorithm which can be fit on a DataFrame to produce a Transformer
- E . An estimator is an evaluation tool to assess to the quality of a model
A data scientist has been given an incomplete notebook from the data engineering team. The notebook uses a Spark DataFrame spark_df on which the data scientist needs to perform further feature engineering. Unfortunately, the data scientist has not yet learned the PySpark DataFrame API.
Which of the following blocks of code can the data scientist run to be able to use the pandas API on Spark?
- A . import pyspark.pandas as ps
df = ps.DataFrame(spark_df) - B . import pyspark.pandas as ps
df = ps.to_pandas(spark_df) - C . spark_df.to_sql()
- D . import pandas as pd
df = pd.DataFrame(spark_df) - E . spark_df.to_pandas()
A data scientist has produced two models for a single machine learning problem. One of the models performs well when one of the features has a value of less than 5, and the other model performs well when the value of that feature is greater than or equal to 5. The data scientist decides to combine the two models into a single machine learning solution.
Which of the following terms is used to describe this combination of models?
- A . Bootstrap aggregation
- B . Support vector machines
- C . Bucketing
- D . Ensemble learning
- E . Stacking