When anticipating additional data sources that might be relevant, what is a crucial factor to consider?
- A . The color scheme of the data visualization
- B . The data source’s popularity on social media
- C . The relevance of the data source to the business problem
- D . The graphical interface of the data source
A virtual assistant has been developed and deployed based on the Watson Assistant service. The assistant will support customers by answering FAQs (Frequently Asked Questions).
Which metric is a good indicator of the performance of the virtual assistant?
- A . The Area Under the Curve (AUC)
- B . Measure escalated calls using A/B testing
- C . The Root Mean Squared Error (RMSE) of words
- D . The F1 score of predicted intents in the Analytics tab
Which of the following is a critical first step in understanding a business problem for data science projects?
- A . Selecting the machine learning algorithm
- B . Defining the project scope
- C . Choosing the visualization tools
- D . Deploying the model
How can data splits be made reproducible in a machine learning experiment?
- A . By using a different random seed each time the data is split
- B . By partitioning the data manually
- C . By using a consistent random seed when splitting the data
- D . By splitting the data in a sequential manner without randomization
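
As a minimal sketch of the seeding idea in this question, the snippet below shuffles row indices with a fixed seed before splitting them. The function name `split_indices` and its parameters are hypothetical, chosen only for illustration; a library helper with a seed argument would serve the same purpose.

```python
import random

def split_indices(n_rows, test_fraction=0.25, seed=42):
    """Shuffle row indices with a fixed seed and split them into train/test."""
    rng = random.Random(seed)  # local generator, so global random state is untouched
    indices = list(range(n_rows))
    rng.shuffle(indices)
    n_test = int(n_rows * test_fraction)
    return indices[n_test:], indices[:n_test]

train_a, test_a = split_indices(20, seed=42)
train_b, test_b = split_indices(20, seed=42)
print(train_a == train_b and test_a == test_b)  # same seed, same split
```
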
What is the key difference between batch processing and streaming in data processing?
- A . Batch processing involves real-time data processing, whereas streaming does not process data
- B . Streaming is suitable for large, historical datasets, whereas batch processing is for real-time data analysis
- C . Batch processing processes data in large blocks at a time, whereas streaming processes data in real-time as it arrives
- D . Batch processing processes data in large blocks at a time, whereas streaming processes data in real-time as it arrives
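
The contrast between the two processing styles can be sketched with a toy aggregate. This is only an illustration under simplifying assumptions: `batch_mean` and `streaming_means` are hypothetical names, and a real streaming system would read from a message queue or socket rather than a list.

```python
def batch_mean(values):
    """Batch style: the whole block is available before processing starts."""
    return sum(values) / len(values)

def streaming_means(values):
    """Streaming style: update the result as each record arrives."""
    count, total = 0, 0.0
    for v in values:
        count += 1
        total += v
        yield total / count

readings = [10.0, 20.0, 30.0, 40.0]
print(batch_mean(readings))             # 25.0
print(list(streaming_means(readings)))  # [10.0, 15.0, 20.0, 25.0]
```
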
Which of the following is NOT a type of data source commonly integrated with Cloud Pak for Data?
- A . Social media feeds
- B . Proprietary in-memory databases
- C . Paper-based records
- D . Cloud storage services
When selecting a small number of algorithms based on model requirements, what factor should you primarily consider?
- A . The popularity of the algorithm in recent academic papers.
- B . Compatibility of the algorithm with the data characteristics and the predictive task.
- C . The algorithm that requires the least amount of data preprocessing.
- D . Choosing algorithms that are only based on supervised learning.
The first step in performing exploratory data analysis (EDA) typically involves:
- A . Choosing a color palette for data visualization
- B . Determining the hypothesis for the analysis
- C . Connecting to as many data sources as possible
- D . Selecting a random sample of data to analyze
In the context of deployment environments, understanding resources is crucial.
What does this typically involve?
- A . Choosing the most aesthetically pleasing user interface
- B . Determining the computational power and memory requirements for the deployed solution
- C . Selecting the programming language with the least number of keywords
- D . Focusing exclusively on the cost of storage
Which Python library is commonly used for data manipulation and analysis, and is available in Cloud Pak for Data?
- A . TensorFlow
- B . PyTorch
- C . Pandas
- D . Keras
When helping businesses articulate and define problems, what is an essential first step?
- A . Identifying potential data sources
- B . Defining key performance indicators (KPIs)
- C . Establishing a clear problem statement
- D . Selecting the analytical techniques
An e-retailer uses several important data sources, including web logs, which contain all of the information on how customers navigate the website. There are non-informative entries in the web logs that need to be removed.
During which phase of the CRISP-DM model should these non-informative entries be removed?
- A . Modeling
- B . Data Preparation
- C . Data Understanding
- D . Business Understanding
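
To make the web-log scenario concrete, here is a small sketch of what removing non-informative entries might look like. The log format, the sample lines, and the list of suffixes treated as non-informative are all assumptions made for illustration; a real log would need its own rules.

```python
# Hypothetical log lines in the form "<ip> <method> <path> <status>".
raw_log = [
    "203.0.113.5 GET /products/123 200",
    "203.0.113.5 GET /static/logo.png 200",
    "198.51.100.7 GET /robots.txt 200",
    "198.51.100.9 GET /cart 200",
]

# Assumed rule: static assets and crawler files say nothing about navigation.
NON_INFORMATIVE_SUFFIXES = (".png", ".css", ".js", "robots.txt")

def is_informative(line):
    path = line.split()[2]
    return not path.endswith(NON_INFORMATIVE_SUFFIXES)

clean_log = [line for line in raw_log if is_informative(line)]
print(len(raw_log), len(clean_log))  # 4 2
```
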
What is a key disadvantage of using Grid Search for hyperparameter tuning?
- A . It is too quick and may miss out on evaluating some hyperparameters
- B . It requires no prior knowledge of the hyperparameters
- C . It can be computationally expensive and time-consuming due to its exhaustive nature
- D . It is unable to handle discrete parameters
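
The cost of an exhaustive search can be estimated by counting the combinations it has to evaluate. The grid below is a hypothetical example (the parameter names and values are assumptions, not taken from any specific model), and the 5-fold cross-validation is likewise assumed.

```python
from itertools import product

# Hypothetical hyperparameter grid.
param_grid = {
    "max_depth": [3, 5, 7, 9],
    "learning_rate": [0.01, 0.05, 0.1],
    "n_estimators": [100, 200, 500],
}

# Grid search evaluates every combination of every listed value.
combinations = list(product(*param_grid.values()))
n_folds = 5  # assumed cross-validation folds per combination
total_fits = len(combinations) * n_folds
print(len(combinations), total_fits)  # 36 180
```

Adding one more parameter with four candidate values would multiply both numbers by four, which is why the search grows expensive quickly.
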
Which method for merging records in the SPSS Modeler Merge node allows you to specify a condition that must be satisfied for the merge to take place?
- A . Key
- B . Order
- C . Filter
- D . Condition
Why is it important to create data splits that are reproducible?
- A . To ensure that each model run can be exactly replicated for verification and comparison
- B . To guarantee that the model will perform with 100% accuracy on unseen data
- C . To use more data for testing than for training
- D . To allow for larger test sets for more comprehensive testing
In unsupervised learning, which algorithm is best suited for grouping customers based on their purchase history to target marketing efforts more effectively?
- A . Support Vector Machines
- B . K-Means Clustering
- C . Linear Regression
- D . Decision Trees
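
A bare-bones sketch of k-means on made-up customer data may help illustrate the grouping idea. The data, the two features, and the fixed starting centroids are assumptions chosen to keep the example deterministic; practical implementations usually pick starting centroids randomly or with a scheme such as k-means++.

```python
def kmeans(points, centroids, n_iter=10):
    """Plain k-means: assign points to the nearest centroid, then recompute centroids."""
    for _ in range(n_iter):
        clusters = [[] for _ in centroids]
        for p in points:
            distances = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[distances.index(min(distances))].append(p)
        # New centroid = mean of its cluster; keep the old one if the cluster is empty.
        centroids = [
            tuple(sum(dim) / len(cluster) for dim in zip(*cluster)) if cluster else c
            for cluster, c in zip(clusters, centroids)
        ]
    return centroids, clusters

# Hypothetical customers: (orders per year, average basket value).
customers = [(2, 20.0), (3, 25.0), (1, 15.0), (30, 200.0), (28, 220.0), (32, 210.0)]
centroids, clusters = kmeans(customers, centroids=[customers[0], customers[3]])
print(centroids)  # [(2.0, 20.0), (30.0, 210.0)]
```
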
In the context of avoiding underfitting and overfitting, what role does splitting the data into training, testing, and validation sets play?
- A . It ensures that the model is trained on the maximum amount of data possible
- B . It allows for the model to be validated and tested on different subsets of data to check its generalization ability
- C . It guarantees that the model will perform with 100% accuracy on unseen data
- D . It increases the computational complexity without improving model performance
Which of the following is true about the AUC measure in the context of classification models?
- A . It represents the degree of separability between classes.
- B . It is less useful when the classes are highly imbalanced.
- C . It indicates the number of false positives.
- D . It measures the model’s accuracy using a single threshold.
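
One way to build intuition for AUC is its ranking interpretation: the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below computes it by brute force over all positive/negative pairs; the labels and scores are invented, and this pairwise form is meant for illustration rather than for large datasets.

```python
def auc(labels, scores):
    """AUC as the fraction of positive/negative pairs ranked correctly (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(auc(labels, scores))  # 0.75
```

Because the value depends only on how the scores rank the two classes, no single classification threshold is involved.
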
What is the primary purpose of partitioning data into training and test sets?
- A . To ensure that the model gets exposed to all possible data scenarios during training
- B . To maximize the accuracy of the model by using all data for training
- C . To evaluate the model’s performance on unseen data
- D . To increase the computational efficiency of model training
Which analytic technique is NOT typically used to address business requirements?
- A . Regression analysis
- B . Clustering
- C . Decision trees
- D . Proofreading
Which two packages can be used to customize the software configuration of a Jupyter notebook environment in Cloud Pak for Data?
- A . vim
- B . pip
- C . sudo
- D . bash
- E . conda
Which statement describes bagging?
- A . Building models and using their output as features into a final model.
- B . Building models in parallel and aggregating their predictions to select the final prediction.
- C . Building models sequentially and evaluating the success of earlier models. It combines a set of weak learners into a strong learner.
- D . Building models with artificial neural networks based on the shared-weight architecture of the convolution kernels or filters.
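
The following is a rough sketch of bootstrap aggregation, assuming a deliberately simple weak learner (a one-dimensional threshold "stump"). All names (`fit_stump`, `bagging_fit`, `bagging_predict`) and the toy data are hypothetical. The models are fitted in a plain loop here, but each one depends only on its own bootstrap sample, so they could be trained in parallel.

```python
import random
from collections import Counter

def fit_stump(sample):
    """Hypothetical weak learner: pick the threshold with the fewest training errors."""
    best_threshold, best_errors = None, None
    for threshold, _ in sample:
        errors = sum((x >= threshold) != bool(y) for x, y in sample)
        if best_errors is None or errors < best_errors:
            best_threshold, best_errors = threshold, errors
    return best_threshold

def bagging_fit(data, n_models=25, seed=0):
    rng = random.Random(seed)
    # Each model sees its own bootstrap sample (drawn with replacement).
    return [fit_stump([rng.choice(data) for _ in data]) for _ in range(n_models)]

def bagging_predict(thresholds, x):
    # Aggregate the individual predictions by majority vote.
    votes = Counter(int(x >= t) for t in thresholds)
    return votes.most_common(1)[0][0]

data = [(1.0, 0), (2.0, 0), (3.0, 0), (4.0, 0), (6.0, 1), (7.0, 1), (8.0, 1), (9.0, 1)]
models = bagging_fit(data)
print(bagging_predict(models, 1.5), bagging_predict(models, 8.5))
```
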
Assessing the feasibility of a solution often requires evaluating:
- A . The color scheme of the user interface
- B . Market competition only
- C . Technical feasibility, cost, and time constraints
- D . Preferred communication channels of the project manager
Which statement best differentiates machine learning from deep learning?
- A . Machine learning algorithms perform better on structured data, while deep learning excels with unstructured data like images and text.
- B . Deep learning algorithms require less data to learn.
- C . Machine learning models are always transparent, whereas deep learning models cannot be interpreted.
- D . Deep learning algorithms are a subset of machine learning algorithms that do not require feature engineering.
Given a confusion matrix with true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN), which is the formula for specificity?
- A . TN/(TN + FP)
- B . TP/(FP + TP)
- C . TP/(FN + TP)
- D . (TP + TN)/(FN + FP + TN + TP)
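
For reference, several of the ratios listed in the options can be computed directly from the four confusion-matrix counts. The counts below are invented for illustration, and the function names are just descriptive labels.

```python
def specificity(tn, fp):
    """True negative rate: share of actual negatives predicted as negative."""
    return tn / (tn + fp)

def sensitivity(tp, fn):
    """True positive rate (recall): share of actual positives predicted as positive."""
    return tp / (tp + fn)

def accuracy(tp, tn, fp, fn):
    """Share of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)

# Hypothetical counts.
tp, fn, fp, tn = 40, 10, 5, 45
print(specificity(tn, fp))       # 0.9
print(sensitivity(tp, fn))       # 0.8
print(accuracy(tp, tn, fp, fn))  # 0.85
```
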
F1-score is particularly useful when:
- A . You need a balance between precision and recall.
- B . The dataset size is extremely large.
- C . Only the model’s accuracy matters.
- D . The data is completely balanced.
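
A small worked example, using invented labels, shows how accuracy and F1 can diverge on imbalanced data. The helper `confusion_counts` is a hypothetical name, and the example assumes at least one predicted positive and one actual positive so that precision and recall are defined.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, FP, FN, TN for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    return tp, fp, fn, tn

# Hypothetical imbalanced data: 5 positives, 95 negatives; the model finds 1 positive.
y_true = [1] * 5 + [0] * 95
y_pred = [1] + [0] * 4 + [0] * 95

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
print(accuracy, precision, recall, round(f1, 3))  # 0.96 1.0 0.2 0.333
```
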
Cloud Pak for Data’s integration with Spark allows users to:
- A . Perform complex computations on small datasets only
- B . Leverage distributed computing for processing large datasets efficiently
- C . Avoid using any form of data processing or analysis
- D . Use Spark exclusively for data visualization purposes
What is data leakage in the context of model training?
- A . When data from outside the training dataset is accidentally included in the training process
- B . A situation where the test data is not available
- C . Leakage of sensitive information due to poor data handling practices
- D . Loss of data during the splitting process
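
One common way leakage creeps in is through preprocessing: if scaling statistics are computed on the full dataset, information from the held-out rows reaches the training step. The sketch below contrasts that with fitting the statistics on the training rows only. The values are made up, and this shows just one form of leakage, not an exhaustive definition.

```python
from statistics import mean, pstdev

train = [1.0, 2.0, 3.0, 4.0]
test = [10.0, 12.0]  # held-out rows, hypothetical values

# Leaky: a statistic computed on train + test lets test information reach training.
leaky_mu = mean(train + test)

# Safer: fit the scaler on the training rows only, then reuse it on the test rows.
mu, sigma = mean(train), pstdev(train)

def standardize(values):
    return [(v - mu) / sigma for v in values]

train_scaled = standardize(train)
test_scaled = standardize(test)
print(mu, round(leaky_mu, 2))  # 2.5 5.33
```
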
Which statistical method reduces the number of attributes by lumping highly correlated attributes together?
- A . Binning
- B . Principal Component Analysis (PCA)
- C . Long Short Term Memory Network (LSTM)
- D . Synthetic Minority Over-sampling Technique (SMOTE)
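
To illustrate how two highly correlated attributes can collapse into a single component, here is a sketch of PCA for exactly two features, using the closed-form eigenvalues of the 2x2 covariance matrix. The data are invented, the function name `pca_2d` is hypothetical, and the code assumes the two features have non-zero covariance; general-purpose PCA would use a linear algebra library instead.

```python
import math

def pca_2d(xs, ys):
    """Return (explained variance ratio of PC1, unit direction of PC1) for two features."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    # Entries of the 2x2 sample covariance matrix [[a, b], [b, c]].
    a = sum((x - mx) ** 2 for x in xs) / (n - 1)
    c = sum((y - my) ** 2 for y in ys) / (n - 1)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)
    # Closed-form eigenvalues of a symmetric 2x2 matrix.
    mid = (a + c) / 2
    radius = math.sqrt(((a - c) / 2) ** 2 + b ** 2)
    lam1, lam2 = mid + radius, mid - radius
    # Eigenvector for lam1 (valid when b != 0), normalized to unit length.
    vx, vy = b, lam1 - a
    norm = math.hypot(vx, vy)
    return lam1 / (lam1 + lam2), (vx / norm, vy / norm)

# Hypothetical features where y is roughly 2 * x.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
ratio, direction = pca_2d(xs, ys)
print(ratio > 0.99)  # True: one component carries almost all of the variance
```
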
In classification models, which of the following metrics is NOT directly derived from the confusion matrix?
- A . Precision
- B . Recall
- C . Mean Absolute Error (MAE)
- D . F1-score