Refer to the exhibit below:
Based on the output from the Data Exploration node shown in the exhibit, which variable has the most thin tails (most platykurtic distribution)?
- A . Logi_rfm4
- B . Logi_rfm6
- C . Logi_rfm8
- D . Logi_rfm12
Given the following properties for a neural network model, which statement is true regrading hidden units in the model? The following SAS program is submitted:
- A . There are no hidden units in the model.
- B . The number of hidden units is 1.
- C . The number of hidden units is 50.
- D . The number of hidden units is 26.
Which of the following is an example of a NoSQL database that is commonly used to store unstructured data?
- A . MySQL
- B . MongoDB
- C . Oracle Database
- D . Microsoft SQL Server
What is the primary goal of A/B testing in the context of model deployment?
- A . To evaluate the model’s accuracy
- B . To compare two different versions of a model or strategy to determine which performs better
- C . To assess data quality
- D . To create synthetic data
What does the term "bias" in machine learning refer to?
- A . A model’s inability to generalize to new data
- B . Systematic errors that cause a model to consistently underpredict or overpredict
- C . The simplicity of a model
- D . The overall accuracy of a model
What is the significance of the "bias-variance trade-off" in machine learning?
- A . It represents the trade-off between underfitting and overfitting.
- B . It indicates the trade-off between accuracy and precision.
- C . It refers to the trade-off between the number of features and the model’s complexity.
- D . It is not relevant in machine learning.
What is the purpose of cross-validation in model building and evaluation?
- A . Splitting the dataset into training and testing sets
- B . Reducing the dataset size
- C . Assessing the model’s generalization performance
- D . Generating synthetic data
What is the purpose of a "canary release" in the context of model deployment?
- A . To assess data quality
- B . To deploy a new model version to a small subset of users or systems for testing
- C . To create synthetic data
- D . To evaluate model accuracy
When deploying a machine learning model, what is "model drift"?
- A . A sudden increase in the model’s accuracy
- B . A change in the distribution of the input data or target variable over time
- C . The process of feature extraction
- D . A measure of feature importance
Which algorithm is commonly used for decision-making tasks in classification models?
- A . K-Means
- B . Decision Trees
- C . Principal Component Analysis (PCA)
- D . Linear Regression
What is the primary purpose of model documentation in the model deployment phase?
- A . To create synthetic data
- B . To assess data quality
- C . To provide information on the model’s development, architecture, and usage
- D . To evaluate the model’s accuracy
Which of the following best describes unstructured data?
- A . Data that is organized in rows and columns
- B . Data that is difficult to process and lacks a predefined structure
- C . Data stored in a relational database
- D . Data with a clear schema
What is the main advantage of using a RESTful API (Representational State Transfer) as a data source?
- A . Real-time data processing
- B . Support for complex data structures
- C . Simple and standardized communication
- D . High security features
Which statements are true for the F1 score?
(Choose 2.)
- A . F1 score is calculated based on a depth value.
- B . F1 score is calculated based on a cut off value.
- C . F1 score is applicable to a model with a binary target.
- D . F1 score is applicable to a model with an interval target.
Which type of model is commonly used for anomaly detection in datasets?
- A . Decision Trees
- B . Clustering Models
- C . Linear Regression
- D . Principal Component Analysis (PCA)
What does API stand for in the context of data sources?
- A . Application Programming Interface
- B . Advanced Programming Integration
- C . Automated Program Integration
- D . Application Program Interface
Which data source allows for real-time data streaming and processing?
- A . Data warehouses
- B . Cloud storage
- C . IoT devices
- D . Static data files
What is the main advantage of ensemble methods in model building?
- A . They require minimal data preprocessing
- B . They produce simple and interpretable models
- C . They combine multiple models to improve predictive performance
- D . They work well with high-dimensional data
Which machine learning technique is typically used for building a model to predict a numeric target variable?
- A . Classification
- B . Regression
- C . Clustering
- D . Dimensionality reduction
Refer to the treemap shown in the exhibit below:
Which statement is true about the tree map for a decision tree with a binary target?
- A . The top bar represents the node with the highest probability of event.
- B . The darker bars represent nodes with a lower probability of event.
- C . The top bar represents the node with the highest count.
- D . The wider bars represent nodes with a higher probability of event.
What is the primary purpose of model assessment in the context of data science and machine learning?
- A . Data preprocessing
- B . Model building
- C . Evaluating and selecting the best-performing model
- D . Data visualization
What is the main advantage of ensemble learning methods, such as Random Forest, in a machine learning pipeline?
- A . They are simple and easy to interpret.
- B . They are not suitable for large datasets.
- C . They combine multiple models to improve predictive performance.
- D . They require minimal data preprocessing.
When deploying a machine learning model, what is meant by "model latency"?
- A . The time it takes to build a model
- B . The time it takes to train a model
- C . The time it takes for the model to make predictions once deployed
- D . The time it takes to create synthetic data
What is the purpose of an ROC curve (Receiver Operating Characteristic) in model assessment?
- A . To evaluate regression models
- B . To visualize data distribution
- C . To compare a model’s true positive rate with the false positive rate
- D . To measure feature importance
Which data source typically provides access to real-time financial market data?
- A . Social media platforms
- B . Weather stations
- C . Stock market APIs
- D . Online news websites
In model assessment, what does "cross-validation" aim to address?
- A . Training a model
- B . Overfitting and generalization
- C . Data preprocessing
- D . Model deployment
Which metric is commonly used to evaluate the performance of a regression model?
- A . F1 Score
- B . Mean Absolute Error (MAE)
- C . Precision
- D . Confusion Matrix
What is "model reevaluation" in the model deployment phase?
- A . The process of data preprocessing
- B . The process of selecting features
- C . The periodic assessment of a deployed model’s performance and potential retraining
- D . The evaluation of data distribution
What is overfitting in machine learning, and how can it be addressed in a pipeline?
- A . Overfitting occurs when the model is too simple and underperforms.
- B . Overfitting occurs when the model fits the training data too closely and may not generalize well. It can be addressed by regularization techniques.
- C . Overfitting occurs when the model is too complex and overperforms.
- D . Overfitting is not a concern in machine learning pipelines.
What is a data lake?
- A . A data storage solution designed for high-speed data retrieval
- B . A centralized repository for storing all structured and unstructured data at any scale
- C . A specialized database for time-series data
- D . A backup system for relational databases