In a decision tree, what is an example of a pure node?
- A . 25 positives; 75 negatives
- B . 50 positives; 50 negatives
- C . 75 positives; 25 negatives
- D . 100 positives; 0 negatives
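One way to make "purity" concrete is to score each candidate node with Shannon entropy, where a value of 0 means every record in the node belongs to a single class. The sketch below is illustrative only: the `entropy` helper and the `nodes` dictionary are names introduced here, and entropy is just one of several impurity measures (Gini is another).

```python
import math

def entropy(positives, negatives):
    """Shannon entropy (in bits) of a two-class node."""
    total = positives + negatives
    result = 0.0
    for count in (positives, negatives):
        if count > 0:  # treat 0 * log2(0) as 0
            p = count / total
            result -= p * math.log2(p)
    return result

# The four class mixes listed in the options above.
nodes = {"A": (25, 75), "B": (50, 50), "C": (75, 25), "D": (100, 0)}
node_entropy = {label: entropy(*counts) for label, counts in nodes.items()}

for label, value in node_entropy.items():
    print(label, round(value, 3))
```

A 50/50 split gives the maximum entropy of 1 bit, and the score falls toward 0 as one class dominates.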
When would you prefer a Naive Bayes model to a logistic regression model for classification?
- A . When you are using several categorical input variables with over 1000 possible values each.
- B . When you need to estimate the probability of an outcome, not just which class it is in.
- C . When all the input variables are numerical.
- D . When some of the input variables might be correlated.
What is an appropriate assignment for a data scientist?
- A . Monitor key performance indicators
- B . Define an OLAP database schema
- C . Conduct customer surveys
- D . Develop predictive models
What is the output format from the Map function of MapReduce?
- A . Key-value pairs
- B . Binary representation of keys concatenated with structured data
- C . Compressed index
- D . Unique key record and separate records of all possible values
What does the R code z <- f[1:10, ] do?
- A . Assigns the first 10 rows of f to the vector z
- B . Assigns the 1st 10 columns of the 1st row of f to z
- C . Assigns a sequence of values from 1 to 10 to z
- D . Assigns the 1st 10 columns to z
What is a core deliverable at the end of the analytic project?
- A . An implemented database design
- B . A whitepaper describing the project and the implementation
- C . A presentation for project sponsors
- D . The training materials
Consider the following SQL statement:
SELECT employee_id, year, salary, avg(salary)
OVER
(PARTITION BY employee_id ORDER BY year ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) as result_1
FROM employee
ORDER BY employee_id, year
For each employee_id, what is returned as result_1?
- A . Three year rolling average salary
- B . Four year rolling average salary
- C . Average salary across all employee_id values
- D . Average employee_id
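The frame `ROWS BETWEEN 2 PRECEDING AND CURRENT ROW` can be mimicked in plain Python to see how many rows feed each average. This is a minimal sketch for a single employee; the `rolling_average` helper and the salary figures are hypothetical and do not come from the question.

```python
def rolling_average(values, preceding=2):
    """Average each value with up to `preceding` earlier values."""
    averages = []
    for i in range(len(values)):
        # Frame: from `preceding` rows back (clipped at the start) to the current row.
        window = values[max(0, i - preceding): i + 1]
        averages.append(sum(window) / len(window))
    return averages

# Hypothetical salaries for one employee_id, already ordered by year.
salaries_by_year = [50000, 52000, 54000, 60000]
result_1 = rolling_average(salaries_by_year)
print(result_1)
```

The first two rows average fewer values because the frame is clipped at the start of the partition; from the third row on, each result covers the current row plus the two before it.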
Which clause is mandatory when using window functions?
- A . OVER
- B . RANK
- C . PARTITION BY
- D . RANK BY
In a fitted ARIMA(1,2,3) model, how many differences are applied?
- A . 0
- B . 1
- C . 2
- D . 3
If R factors are categorical variables, to which data classification level are they most closely related?
- A . Nominal
- B . Ordinal
- C . Interval
- D . Ratio
Consider this SQL statement:
SELECT product, prod_cost, avg(prod_cost) OVER (PARTITION BY product)
FROM product_detail
The OVER clause makes this what type of function?
- A . Window function
- B . Aggregate function
- C . System function
- D . User-defined function
In a Student’s t-test, what is the meaning of the p-value?
- A . It is the area under the appropriate tails of the Student’s t-distribution
- B . It is the "power" of the Student’s t-test
- C . It is the mean of the distribution for the null hypothesis
- D . It is the mean of the distribution for the alternate hypothesis
Consider these itemsets:
(hat, scarf, coat)
(hat, scarf, coat, gloves)
(hat, scarf, gloves)
(hat, gloves)
(scarf, coat, gloves)
What is the confidence of the rule (gloves -> hat)?
- A . 75%
- B . 60%
- C . 66%
- D . 80%
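Confidence for a rule X -> Y is commonly defined as support(X and Y) divided by support(X). The sketch below applies that definition to the five itemsets listed in the question; the helper names `support_count` and `confidence` are introduced here for illustration.

```python
# The five itemsets from the question, one set per transaction.
transactions = [
    {"hat", "scarf", "coat"},
    {"hat", "scarf", "coat", "gloves"},
    {"hat", "scarf", "gloves"},
    {"hat", "gloves"},
    {"scarf", "coat", "gloves"},
]

def support_count(itemset):
    """Number of transactions that contain every item in `itemset`."""
    return sum(1 for t in transactions if itemset <= t)

def confidence(lhs, rhs):
    """Confidence of the rule lhs -> rhs."""
    return support_count(lhs | rhs) / support_count(lhs)

conf_gloves_hat = confidence({"gloves"}, {"hat"})
print(conf_gloves_hat)
```

Gloves appear in four transactions, and three of those also contain a hat.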
During the data preparation phase, you notice a high correlation between average spend on video games, age of players, and number of science fiction shows watched.
Which technique could you use to address the three correlated variables?
- A . Square the three variables to remove the correlation
- B . Combine the three variables into one new variable
- C . Drop the three variables to improve the model
- D . Use scaling to make the three variables equivalent in size
You are attempting to find the Euclidean distance between two centroids:
Centroid A’s coordinates: (X = 2, Y = 4)
Centroid B’s coordinates: (X = 8, Y = 10)
Which formula finds the correct Euclidean distance?
- A . SQRT((2-8)^2 + (4-10)^2) or 8.49
- B . SQRT(((2-8) x 2) + ((4-10) x 2)) or 12.17
- C . ((2-8)^2 + (4-10)^2) or 72
- D . ((2-8) x 2 + (4-10) x 2) or 148
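The Euclidean distance between two points is the square root of the sum of squared coordinate differences. A minimal sketch using the two centroids from the question (the variable names are illustrative):

```python
import math

centroid_a = (2, 4)
centroid_b = (8, 10)

# Square each coordinate difference, sum them, then take the square root.
squared_diffs = [(a - b) ** 2 for a, b in zip(centroid_a, centroid_b)]
distance = math.sqrt(sum(squared_diffs))

print(squared_diffs, round(distance, 2))
```

Squaring each difference (rather than multiplying it by 2) is what keeps the terms non-negative before the square root is taken.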
In linear regression modeling, which action can be taken to improve the linearity of the relationship between the dependent and independent variables?
- A . Apply a transformation to a variable
- B . Use a different statistical package
- C . Calculate the R-Squared value
- D . Change the units of measurement on the independent variable
Which chart type is intended to display correlations between sets of numeric data?
- A . Scatterplot
- B . Histogram
- C . Pie chart
- D . Line Chart
What does the Receiver Operating Characteristic (ROC) curve show?
- A . Relationship between p-value and true positive rate
- B . Relationship between p-value and true negative rate
- C . Relationship between true positive rate and false positive rate
- D . Relationship between true positive rate and true negative rate
A fair six-sided die is rolled. Let A denote the event that an odd number is rolled. Let C denote the event that a 1, 2, or 3 is rolled.
What is the value of the conditional probability, P(C|A)?
- A . 2/3
- B . 1/2
- C . 1/3
- D . 1/4
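For equally likely outcomes, P(C|A) is the number of outcomes in both C and A divided by the number of outcomes in A. The sketch below enumerates the die faces directly; the set names mirror the events in the question.

```python
from fractions import Fraction

event_a = {1, 3, 5}   # an odd number is rolled
event_c = {1, 2, 3}   # a 1, 2, or 3 is rolled

# With a fair die, conditioning on A restricts the sample space to A's outcomes.
p_c_given_a = Fraction(len(event_c & event_a), len(event_a))
print(p_c_given_a)
```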
Which word or phrase completes the statement? Business Intelligence is to ad-hoc reporting and dashboards as Data Science is to __________.
- A . Optimization and Predictive Modeling
- B . Alerts and Queries
- C . Structured Data and Data Sources
- D . Sales and profit reporting
Which method is used to solve for the coefficients b0, b1, ..., bn in your linear regression model Y = b0 + b1x1 + b2x2 + ... + bnxn?
- A . Ordinary Least squares
- B . Apriori Algorithm
- C . Ridge and Lasso
- D . Integer programming
What is one modeling or descriptive statistical function in MADlib that is typically not provided in a standard relational database?
- A . Linear regression
- B . Expected value
- C . Variance
- D . Quantiles
Refer to the exhibit.
You are using K-means clustering to classify customer behavior for a large retailer. You need to determine the optimum number of customer groups. You plot the within-sum-of-squares (WSS) data as shown in the exhibit.
How many customer groups should you specify?
- A . 2
- B . 3
- C . 4
- D . 8
Which activity is performed in the Operationalize phase of the Data Analytics Lifecycle?
- A . Define the process to maintain the model
- B . Try different analytical techniques
- C . Try different variables
- D . Transform existing variables
Which word or phrase completes the statement? “A theater actor is to ‘artistic and expressive’ as a data scientist is to __________.”
- A . Communicative and collaborative
- B . Introverted and technical
- C . Logical and steadfast
- D . Independent and intelligent
When is the GROUP BY ROLLUP clause used in an OLAP query?
- A . All subtotals and grand totals are to be included in the output
- B . Subtotals are only to be included in the output
- C . Grand totals are only to be included in the output
- D . Specific subtotals and grand totals for a combination of variables are only to be included in the output
You have run the association rules algorithm on your data set, and the two rules {banana, apple} => {grape} and {apple, orange}=> {grape} have been found to be relevant.
What else must be true?
- A . {grape, apple, orange} must be a frequent itemset.
- B . {banana, apple, grape, orange} must be a frequent itemset.
- C . {grape} => {banana, apple} must be a relevant rule.
- D . {banana, apple} => {orange} must be a relevant rule.
Which type of numeric value does a logistic regression model estimate?
- A . Probability
- B . A p-value
- C . Any integer
- D . Any real number
You are having a discussion with a business colleague. The colleague mentions that they want to perform K-means clustering on text file data stored in HDFS.
Which tool should be recommended?
- A . Mahout
- B . HBase
- C . Scribe
- D . Sqoop
In which phase of the data analytics lifecycle do Data Scientists spend the most time in a project?
- A . Discovery
- B . Data Preparation
- C . Model Building
- D . Communicate Results
Your colleague, who is new to Hadoop, approaches you with a question. They want to know how best to access their data. This colleague has a strong background in data flow languages and programming.
Which query interface would you recommend?
- A . Pig
- B . Hive
- C . Howl
- D . HBase
What is a consideration when building decision trees?
- A . Cannot handle variables that affect the outcome in a discontinuous way
- B . Short decision trees are likely subject to overfit
- C . Correlated variables can cause double-counting
- D . Tree structure is sensitive to small changes in the training data
You need to run a hypothesis test across three normally distributed populations.
Which technique should you use?
- A . Z-test
- B . Welch’s t-test
- C . ANOVA
- D . Wilcoxon rank sum test
The Marketing department of your company wishes to track opinion on a new product that was recently introduced. Marketing would like to know how many positive and negative reviews are appearing over a given period and potentially retrieve each review for more in-depth insight.
They have identified several popular product review blogs that historically have published thousands of user reviews of your company’s products. You have been asked to provide the desired analysis.
You examine the RSS feeds for each blog and determine which fields are relevant. You then craft a regular expression to match your new product’s name and extract the relevant text from each matching review.
What is the next step you should take?
- A . Convert the extracted text into a suitable document representation and index into a review corpus
- B . Use the extracted text and your regular expression to perform a sentiment analysis based on mentions of the new product
- C . Read the extracted text for each review and manually tabulate the results
- D . Group the reviews using Naïve Bayesian classification
Which process in text analysis can be used to reduce dimensionality?
- A . Stemming
- B . Parsing
- C . Digitizing
- D . Sorting
Which analytical method is considered unsupervised?
- A . K-means clustering
- B . Naïve Bayesian classifier
- C . Decision tree
- D . Linear regression
Refer to the exhibit.
Which type of data issue would you suspect based on the exhibit?
- A . "Saturated" data, indicating potential issues with data definitions
- B . Incomplete data, indicating potential issues with data transmission
- C . Mis-scaled data, indicating potential issues with data entry
- D . The exhibit does not raise any obvious concerns with the data.
You have created a Logistic Regression model to predict customer churn for your company. The company’s Marketing department wants to use your model to identify at-risk customers and offer incentives to keep them from leaving.
Using two different thresholds for the model provides the two confusion matrices shown in the graphic. Marketing understands the relative costs of missing at-risk customers versus offering incentives to customers who are not at risk. Therefore, you need their advice on how to set the appropriate threshold on the churn model.
You are meeting with the Marketing team. In the meeting, you plan to state: “Raising the threshold from 0.5 to 0.75 reduces the number of unnecessary incentives that can be offered, at the cost of missing more of the customers who churned.”
What is the most appropriate visual to reinforce this statement?
[Exhibit: four candidate visuals, labeled A) through D), not reproduced]
- A . Option A
- B . Option B
- C . Option C
- D . Option D
Your customer provided you with 2,000 unlabeled records and asked you to separate them into three groups.
What is the correct analytical method to use?
- A . K-means clustering
- B . Linear regression
- C . Naive Bayesian classification
- D . Logistic regression
How is dimensionality defined in a "bag of words" document representation?
- A . Average number of words per sentence in the document
- B . Total number of words in the document
- C . Number of unique terms in the document
- D . Frequency of repeated words in the document
You received 100,000 home loan records and want to quickly determine if there is any correlation between mortgage age and mortgage amount before conducting advanced analysis.
Which tool should be used for the preliminary analysis?
- A . Scatter plot
- B . Stacked Bar chart
- C . Box and Whisker plot
- D . Histogram
What is the output of the K-means clustering algorithm?
- A . Centroid positioning and entropy of each record in each cluster
- B . Center of each discovered cluster and mapping of each record to a cluster
- C . Two dimensional representation of the data and the clusters
- D . Intercept and coefficients for each input variable in the dataset
You are provided with the following list.
Which window function is missing?
cume_dist()
dense_rank()
rank()
percent_rank()
first_value()
last_value()
lag()
lead()
ntile()
- A . row_preceding()
- B . row_number()
- C . median()
- D . cumulative_sum()
In text analysis, what makes the corpus representation dynamic?
- A . Algorithms used to determine the classification or tagging
- B . Search and retrieval process for finding the document that meets the search criteria
- C . Inherent high dimensionality in the problem of text analysis
- D . Requirement to update index and corpus metrics continuously
How are window functions different from regular aggregate functions?
- A . Rows retain their separate identities and the window function can access more than the current row.
- B . Rows are grouped into an output row and the window function can access more than the current row.
- C . Rows retain their separate identities and the window function can only access the current row.
- D . Rows are grouped into an output row and the window function can only access the current row.
You have created a Linear Regression model to predict total sales based on variables M, N, P and Q as shown in the graphic. You originally expected all variables to have positive coefficients.
Which action would you take?
- A . Accept all variables and begin model validation steps against holdout data
- B . Accept only positive variables and investigate potential correlation with the dependent variable
- C . Accept only statistically significant variables and investigate correlated independent variables
- D . Accept none of the variables and investigate correlations between all variables
You have been assigned to do a study of the daily revenue effect of a pricing model of online transactions. All the data currently available to you has been loaded into your analytics database; revenue data, pricing data, and online transaction data.
You find that all the data comes in different levels of granularity. The transaction data has timestamps (day, hour, minutes, seconds), pricing is stored at the daily level, and revenue data is only reported monthly.
What is your next step?
- A . Report back to the business owner that the current data model does not support the business question.
- B . Interpolate a daily model for revenue from the monthly revenue data.
- C . Aggregate all data to the monthly level in order to create a monthly revenue model.
- D . Disregard revenue as a driver in the pricing model, and create a daily model based on pricing and transactions only.
Which key role for a successful analytic project can provide business domain expertise with a deep understanding of the data and key performance indicators?
- A . Business Intelligence Analyst
- B . Project Manager
- C . Project Sponsor
- D . Business User
A Data Scientist is assigned to build a model from a reporting data warehouse. The warehouse contains data collected from many sources and transformed through a complex, multi-stage ETL process.
What is a concern the data scientist should have about the data?
- A . It is too processed
- B . It is not structured
- C . It is not normalized
- D . It is too centralized
You have just completed the Discovery phase of a project and finished interviewing the main stakeholders. You have identified the necessary data feeds and are now beginning to set up the analytic sandbox.
What is the next step?
- A . Assess data quality
- B . Perform ELT / ETL
- C . Create data visualizations
- D . Run descriptive statistics for several data sets
In which lifecycle stage are appropriate analytical techniques determined?
- A . Model planning
- B . Model building
- C . Data preparation
- D . Discovery
What is holdout data?
- A . a subset of the provided data set selected at random and used to validate the model
- B . a subset of the provided data set selected at random and used to initially construct the model
- C . a subset of the provided data set that is removed by the data scientist because it contains data errors
- D . a subset of the provided data set that is removed by the data scientist because it contains outliers
In a t-test with unknown variance, what values are used to calculate the t-statistic?
- A . Sample mean, sample standard deviation, and sample size
- B . Mean, sample standard deviation, and population size
- C . Sample mean, standard deviation, and sample size
- D . Mean, standard deviation, and population size
Which participant in a data analytics project is typically responsible for assessing the validity of the model?
- A . Data scientist
- B . Business user
- C . Project sponsor
- D . Project manager
In a user-defined aggregate function, what is SFUNC?
- A . Window function
- B . State transition function
- C . Final calculation function
- D . Segment-level calculation function
What is required in a presentation for project sponsors?
- A . The "Big Picture" takeaways for executive level stakeholders
- B . Data warehouse design changes
- C . Line by line review of the developed code
- D . Detailed statistical basis for the modeling approach used in the project
Consider the following itemsets:
(hat, scarf, coat)
(hat, scarf, coat, gloves)
(hat, scarf, gloves)
(hat, gloves)
(scarf, coat, gloves)
If the minimum support is 50%, what represents the complete list of frequent 2-itemsets?
- A . (hat, scarf), (hat, gloves)
- B . (hat, scarf), (scarf, coat), (coat, gloves)
- C . (scarf, gloves), (scarf, coat), (hat, gloves)
- D . (hat, scarf), (hat, gloves), (scarf, gloves), (scarf, coat)
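Frequent 2-itemsets can be found by counting, for every pair of items, the share of transactions that contain both and keeping the pairs that meet the minimum support. The sketch below does this by brute force over the five itemsets in the question; it is not the Apriori pruning procedure itself, just a direct check of the definition.

```python
from itertools import combinations

transactions = [
    {"hat", "scarf", "coat"},
    {"hat", "scarf", "coat", "gloves"},
    {"hat", "scarf", "gloves"},
    {"hat", "gloves"},
    {"scarf", "coat", "gloves"},
]
min_support = 0.5

# Every distinct item, in a stable order, so pairs are generated predictably.
items = sorted(set().union(*transactions))

frequent_pairs = []
for pair in combinations(items, 2):
    support = sum(1 for t in transactions if set(pair) <= t) / len(transactions)
    if support >= min_support:
        frequent_pairs.append(pair)

print(frequent_pairs)
```

With five transactions, a 50% threshold means a pair must appear in at least three of them.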
Which activity is performed in the Operationalize phase of the data analytics lifecycle?
- A . Try different variables
- B . Try different analytical techniques
- C . Assess the benefits
- D . Transform existing variables
Which ROC curve represents a perfect model fit?
[Exhibit: four ROC curves, labeled A) through D), not reproduced]
- A . Exhibit A
- B . Exhibit B
- C . Exhibit C
- D . Exhibit D
Which Hadoop service is responsible for requesting resources for, and monitoring the completion of, MapReduce processes?
- A . Application Manager
- B . NameNode
- C . Application Master
- D . DataNode
To ensure a successful analytic project, which key role can consult and advise the project team on the value of end results and how these will be used on a daily basis?
- A . Business User
- B . Project Manager
- C . Data Scientist
- D . Business Intelligence Analyst
Which word or phrase completes the statement? Emphasis color is to standard color as _______.
- A . Main message is to context
- B . Main message is to key findings
- C . Frequent item set is to item
- D . Pie chart is to proportions
A data scientist is preparing a presentation for a meeting with the project’s business sponsors. The distribution of per-sale revenue is an important finding from the analysis. The graphics illustrate four ways to plot the per-sale revenue distribution.
Which graphic is most appropriate for the sponsor presentation?
- A . Figure A
- B . Figure B
- C . Figure C
- D . Figure D
You have been assigned to do a study of the daily revenue effect of a pricing model of online transactions. You have tested all the theoretical models in the previous model planning stage, and all tests have yielded statistically insignificant results.
What is your next step?
- A . Report that the results are insignificant, and reevaluate the original business question.
- B . Run all the models again against a larger sample, leveraging more historical data.
- C . Move forward on the model with the highest significance scores relative to the others.
- D . Modify samples used by the models and iterate until a significant result occurs.
A disk drive manufacturer has a defect rate of less than 1.0% with 98% confidence. A quality assurance team samples 1000 disk drives and finds 14 defective units.
Which action should the team recommend?
- A . The manufacturing process should be inspected for problems.
- B . A larger sample size should be taken to determine if the plant is functioning properly
- C . A smaller sample size should be taken to determine if the plant is functioning properly
- D . The manufacturing process is functioning properly and no further action is required.
Data visualization is used in the final presentation of an analytics project.
For what else is this technique commonly used?
- A . Assessing data quality
- B . Descriptive statistics
- C . ETLT
- D . Model selection
Refer to the exhibit.
The exhibit provides the decision tree for predicting whether someone is a good or bad credit risk.
What would be the assigned probability, p(good), of a single male with no known savings?
- A . 0.83
- B . 0
- C . 0.498
- D . 0.6
Which SQL OLAP extension provides all possible grouping combinations?
- A . CUBE
- B . ROLLUP
- C . UNION ALL
- D . CROSS JOIN
Assume you are performing an analysis for fraud detection on credit card usage. You need to ensure that higher-risk transactions, which may indicate fraudulent credit card activity, are retained in your data for analysis and not dropped as outliers during pre-processing.
What is the approach for loading data into the analytical sandbox for this analysis?
- A . ELT
- B . ETL
- C . EDW
- D . OLTP
What type of data is represented in the exhibit?
- A . Structured
- B . Unstructured
- C . Quasi-structured
- D . Semi-structured
When is a Wilcoxon Rank-Sum test used?
- A . When an assumption about the distribution of the populations cannot be made
- B . When the data can be easily sorted
- C . When the populations represent the sums of other values
- D . When the data cannot be easily sorted
Refer to the exhibit.
For effective visualization, what is the chart’s primary flaw?
- A . The use of 3 dimensions.
- B . The slanting of axis labels.
- C . The location of the legend.
- D . The order of the columns.
What requests resources from YARN during a MapReduce job?
- A . Map and reduce tasks
- B . ApplicationMaster
- C . ApplicationsManager
- D . DataNodes
Since R factors are categorical variables, they are most closely related to which data classification level?
- A . nominal
- B . ordinal
- C . interval
- D . ratio
What is a distinct property of Logistic Regression compared with Linear Regression?
- A . Logistic Regression handles missing values well
- B . Logistic Regression is robust with redundant or correlated variables
- C . Logistic Regression returns probability estimates of an event
- D . Logistic Regression works well with discrete variables that have many distinct values
You are building a logistic regression model to predict whether a tax filer will be audited within the next two years. Your training set population is 1000 filers. The audit rate in your training data is 4.2%.
What is the sum of the probabilities that the model assigns to all the filers in your training set that have been audited?
- A . 42.0
- B . 4.2
- C . 0.42
- D . 0.042
Consider the example of an analysis for fraud detection on credit card usage. You will need to ensure higher-risk transactions that may indicate fraudulent credit card activity are retained in your data for analysis, and not dropped as outliers during pre-processing.
What will be your approach for loading data into the analytical sandbox for this analysis?
- A . ELT
- B . ETL
- C . EDW
- D . OLTP
What is an appropriate data visualization to use in a presentation for an analyst audience?
- A . Pie chart
- B . Area chart
- C . Stacked bar chart
- D . ROC curve
How is HDFS defined?
- A . Large “web table” capable of holding millions of rows and millions of columns
- B . Row-column oriented datastore supporting redundancy and high availability
- C . Reliable, redundant distributed file system
- D . Reliable file system stored on a single extensible storage platform
Which word or phrase completes the statement? Structured data is to OLAP data as quasi-structured data is to __________.
- A . Clickstream data
- B . XML data
- C . Text documents
- D . Image files
You have been assigned to run a logistic regression model for each of 100 countries, and all the data is currently stored in a PostgreSQL database.
Which tool/library would you use to produce these models with the least effort?
- A . MADlib
- B . Mahout
- C . RStudio
- D . HBase
A data scientist plans to classify the sentiment polarity of 10,000 product reviews collected from the Internet.
Assuming labeled training data is available, what is the most appropriate model to use?
- A . Naïve Bayesian classifier
- B . Linear regression
- C . Logistic regression
- D . K-means clustering
What does the R code nv <- v[v < 1000] do?
- A . Selects the values in vector v that are less than 1000 and assigns them to the vector nv
- B . Sets nv to TRUE or FALSE depending on whether all elements of vector v are less than 1000
- C . Removes elements of vector v less than 1000 and assigns the elements >= 1000 to nv
- D . Selects values of vector v less than 1000, modifies v, and makes a copy to nv
You have run a Linear Regression model on the data shown in the graphic.
Which value is a reasonable guess for R-squared?
- A . -.8
- B . .8
- C . .25
- D . 1.25
You have created a scatterplot of two continuous variables for 2000 records. You want to add a line to the scatterplot to check linearity of the data.
Which function would best address this need?
- A . abline()
- B . glm()
- C . hist()
- D . lm()
Why do the Naïve Bayesian classifier implementations use the log of probability value rather than the pure probability value?
- A . To ensure the conditional independence of attribute values
- B . To avoid numerical underflow errors in high dimensional problems
- C . To obtain a more accurate estimate of the probabilities without the need for a Laplace smoothing
- D . To invalidate the variables that are continuous
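Multiplying many small conditional probabilities can fall below the smallest value a floating-point number can represent, while adding their logarithms stays well within range. The sketch below shows the effect with made-up numbers: 100 hypothetical per-attribute probabilities of 1e-5 each, which are assumptions for illustration and not values from any real classifier.

```python
import math

# Hypothetical per-attribute conditional probabilities (illustrative values only).
probabilities = [1e-5] * 100

# Direct product: the true value, 1e-500, is too small for a 64-bit float.
product = 1.0
for p in probabilities:
    product *= p

# Sum of logs: the same quantity on a log scale, comfortably representable.
log_sum = sum(math.log(p) for p in probabilities)

print(product, log_sum)
```

Because the logarithm is monotonic, comparing classes by summed log probabilities picks the same winner as comparing the raw products would.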
Consider the following SQL query:
SELECT product_id FROM supplier_A
UNION
SELECT product_id FROM supplier_B;
What is the expected result?
- A . All product_id values from both tables with duplicates or repeating rows
- B . All product_id values from supplier_A table but not from supplier_B table
- C . All product_id values from supplier_B table but not from supplier_A table
- D . All product_id values from both tables with no duplicates or repeating rows
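The difference between `UNION` and `UNION ALL` can be checked against an in-memory SQLite database using Python's standard `sqlite3` module. The table contents below are hypothetical sample rows chosen so that some `product_id` values repeat within and across the two tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE supplier_A (product_id INTEGER);
    CREATE TABLE supplier_B (product_id INTEGER);
    -- Hypothetical rows: 2 repeats within A, 3 appears in both tables.
    INSERT INTO supplier_A VALUES (1), (2), (2), (3);
    INSERT INTO supplier_B VALUES (3), (4);
""")

union_rows = sorted(row[0] for row in conn.execute(
    "SELECT product_id FROM supplier_A UNION SELECT product_id FROM supplier_B"
))
union_all_rows = sorted(row[0] for row in conn.execute(
    "SELECT product_id FROM supplier_A UNION ALL SELECT product_id FROM supplier_B"
))
conn.close()

print(union_rows)      # duplicates removed
print(union_all_rows)  # duplicates kept
```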
In data visualization, which type of chart is recommended to represent frequency data?
- A . Line chart
- B . Histogram
- C . Q-Q chart
- D . Scatterplot
Which word or phrase completes the statement? “Excessive emphasis color is to bar chart as __________.”
- A . Multicollinearity is to OLS
- B . Multicollinearity is to serial correlation
- C . Confidence is to leverage
- D . Confidence interval is to regression
You submit a MapReduce job to a Hadoop cluster. Although the job was successfully submitted, you notice that it is not completing.
What should be done?
- A . Ensure that a DataNode is running
- B . Ensure that the TaskTracker is running
- C . Ensure that the NameNode is running
- D . Ensure that the JobTracker is running
Trend, seasonal, and cyclical are components of a time series.
What is another component?
- A . Irregular
- B . Linear
- C . Quadratic
- D . Exponential
Variable D is not significantly impacting the dependent variable.
After seeing your findings, the majority of your team agreed that variable B should be positively impacting the dependent variable.
What is a possible reason the coefficient for variable B was negative and not positive?
- A . Variable B is interacting with another variable due to correlated inputs
- B . Variable B needs a quadratic transformation due to its relationship to the dependent variable
- C . The information gain from variable B is already provided by another variable
- D . Variable B needs a logarithmic transformation due to its relationship to the dependent variable
Refer to the exhibit.
You have run a linear regression model against your data, and have plotted true outcome versus predicted outcome. The R-squared of your model is 0.75.
What is your assessment of the model?
- A . The R-squared may be biased upwards by the extreme-valued outcomes. Remove them and refit to get a better idea of the model’s quality over typical data.
- B . The R-squared is good. The model should perform well.
- C . The extreme-valued outliers may negatively affect the model’s performance. Remove them to see if the R-squared improves over typical data.
- D . The observations seem to come from two different populations, but this model fits them both equally well.
If distributed Item-based Collaborative Filtering is an algorithm supported by Mahout, what is the use case category of the algorithm?
- A . Classification
- B . Recommenders
- C . Frequent Itemset
- D . Clustering
Your risk analysis team has access to new customer financial data. You want to use this data to improve your prediction of credit default. Previously, the team was using only credit bureau scores, loan size, and customer income to assess risk of default.
What is the null hypothesis that should be used to evaluate the model?
- A . New model predicts as well as the toss of a coin weighted by the average default rate
- B . New model predicts better than the toss of a coin weighted by the average default rate
- C . Model using the new financial data predicts the outcome just as well as the previous model
- D . Model using the new financial data predicts the outcome better than the previous model
Which assumption makes the Naïve Bayesian classifier different from the general Bayesian model?
- A . Number of features cannot be greater than the number of records
- B . Features of a class are conditionally independent of one another
- C . All variables need to be numeric
- D . Fewer features can be used with the Naïve Bayes classifier
Refer to the exhibit.
You have plotted the distribution of savings account sizes for your bank.
How would you proceed, based on this distribution?
- A . The data is extremely skewed. Replot the data on a logarithmic scale to get a better sense of it.
- B . The data is extremely skewed, but looks bimodal; replot the data in the range 2,500-10,000 to be sure.
- C . The accounts of size greater than 2500 are rare, and probably outliers. Eliminate them from your future analysis.
- D . The data is extremely skewed. Split your analysis into two cohorts: accounts less than 2500, and accounts greater than 2500
You have the following corpus of texts:
“The cat hit the dog.”
“The dog bit the mail carrier.”
“The mail carrier chased the truck.”
“The truck hit the wall while avoiding the dog that chased the cat.”
“The cat climbed the wall.”
If the tf-idf metric is used to score relevance for search and retrieval, which term has the highest discriminatory power?
- A . Dog
- B . Chased
- C . Bit
- D . Truck
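In tf-idf, a term's discriminatory power comes from the inverse document frequency part: the fewer documents a term appears in, the higher its idf. The sketch below counts document frequency for the four candidate terms over the five-sentence corpus, using idf = log(N / df), which is one common variant (other formulations add smoothing). The tokenizer and helper names are assumptions made for this example.

```python
import math
import re

corpus = [
    "The cat hit the dog.",
    "The dog bit the mail carrier.",
    "The mail carrier chased the truck.",
    "The truck hit the wall while avoiding the dog that chased the cat.",
    "The cat climbed the wall.",
]

# Lowercase each text and keep the set of distinct words per document.
docs = [set(re.findall(r"[a-z]+", text.lower())) for text in corpus]

def document_frequency(term):
    """Number of documents that contain the term at least once."""
    return sum(1 for d in docs if term in d)

def idf(term):
    """Inverse document frequency, log(N / df)."""
    return math.log(len(docs) / document_frequency(term))

candidates = ["dog", "chased", "bit", "truck"]
idf_scores = {term: idf(term) for term in candidates}

for term, score in idf_scores.items():
    print(term, document_frequency(term), round(score, 3))
```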
What is required in a presentation for business analysts?
- A . Budgetary considerations and requests
- B . Operational process changes
- C . Detailed statistical explanation of the applicable modeling theory
- D . The presentation author’s credentials
You are using the Apriori algorithm to determine the likelihood that a person who owns a home has a good credit score. You have determined that the confidence for the rules used in the algorithm is > 75%. You calculate lift = 1.011 for the rule, "People with good credit are homeowners".
What can you determine from the lift calculation?
- A . Support for the association is low
- B . Leverage of the rules is low
- C . The rule is coincidental
- D . The rule is true