DELL EMC D-DS-FN-23 Dell Data Science Foundations 2023 Online Training

Question #1

In a decision tree, what is an example of a pure node?

A . 25 positives; 75 negatives
B . 50 positives; 50 negatives
C . 75 positives; 25 negatives
D . 100 positives; 0 negatives

Correct Answer: D
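
The question does not name an impurity measure, so as an assumption the sketch below uses Shannon entropy, one common choice. A pure node contains a single class and therefore scores zero.

```python
import math

def entropy(pos, neg):
    """Shannon entropy (in bits) of a node with the given class counts."""
    total = pos + neg
    result = 0.0
    for count in (pos, neg):
        if count > 0:  # treat 0 * log2(0) as 0
            p = count / total
            result -= p * math.log2(p)
    return result

# Class counts taken from the four answer options.
options = {"A": (25, 75), "B": (50, 50), "C": (75, 25), "D": (100, 0)}
node_entropy = {label: entropy(*counts) for label, counts in options.items()}

for label, value in node_entropy.items():
    print(label, round(value, 3))
```

Only option D reaches an entropy of zero; the 50/50 split in option B is the least pure.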

Question #2

When would you prefer a Naive Bayes model to a logistic regression model for classification?

A . When you are using several categorical input variables with over 1000 possible values each.
B . When you need to estimate the probability of an outcome, not just which class it is in.
C . When all the input variables are numerical.
D . When some of the input variables might be correlated.

Correct Answer: A

Question #3

What is an appropriate assignment for a data scientist?

A . Monitor key performance indicators
B . Define an OLAP database schema
C . Conduct customer surveys
D . Develop predictive models

Correct Answer: D

Question #4

What is the output format from the Map function of MapReduce?

A . Key-value pairs
B . Binary representation of keys concatenated with structured data
C . Compressed index
D . Unique key record and separate records of all possible values

Correct Answer: A
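
A toy word count can illustrate the key-value output of a Map function. This is a plain-Python sketch, not Hadoop code; the function names `map_words` and `reduce_counts` are hypothetical.

```python
from collections import defaultdict

def map_words(line):
    """Map step: emit one (key, value) pair per word."""
    return [(word.lower(), 1) for word in line.split()]

def reduce_counts(pairs):
    """Reduce step: sum the values that share a key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

mapped = map_words("to be or not to be")
counts = reduce_counts(mapped)
print(mapped)
print(counts)
```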

Question #5

What does the R code z <- f[1:10, ] do?

A . Assigns the first 10 rows of f to the vector z
B . Assigns the 1st 10 columns of the 1st row of f to z
C . Assigns a sequence of values from 1 to 10 to z
D . Assigns the 1st 10 columns to z

Correct Answer: A

Question #6

What is a core deliverable at the end of the analytic project?

A . An implemented database design
B . A whitepaper describing the project and the implementation
C . A presentation for project sponsors
D . The training materials

Correct Answer: C

Question #7

Consider the following SQL statement:

SELECT employee_id, year, salary, avg(salary)
OVER (PARTITION BY employee_id ORDER BY year ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS result_1
FROM employee
ORDER BY employee_id, year

For each employee_id, what is returned as result_1?

A . Three year rolling average salary
B . Four year rolling average salary
C . Average salary across all employee_id values
D . Average employee_id

Correct Answer: A
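
The frame "2 PRECEDING AND CURRENT ROW" covers at most three rows per employee. The sketch below mimics that frame in plain Python, assuming one row per employee per year; the sample rows are made up for illustration.

```python
from collections import defaultdict

# Hypothetical rows: (employee_id, year, salary)
rows = [
    (1, 2019, 100.0),
    (1, 2020, 110.0),
    (1, 2021, 120.0),
    (1, 2022, 130.0),
    (2, 2021, 90.0),
    (2, 2022, 96.0),
]

def rolling_avg(rows, preceding=2):
    """Average salary over the current row and up to `preceding` earlier
    rows, computed separately for each employee_id and ordered by year."""
    partitions = defaultdict(list)
    for emp, year, salary in sorted(rows):
        partitions[emp].append((year, salary))
    out = []
    for emp in sorted(partitions):
        history = partitions[emp]
        for i, (year, salary) in enumerate(history):
            window = [s for _, s in history[max(0, i - preceding): i + 1]]
            out.append((emp, year, salary, sum(window) / len(window)))
    return out

result = rolling_avg(rows)
for row in result:
    print(row)
```

The first two rows of each partition average over fewer than three values because fewer preceding rows exist.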

Question #8

What is the mandatory clause that must be included when using window functions?

A . OVER
B . RANK
C . PARTITION BY
D . RANK BY

Correct Answer: A

Question #9

In a fitted ARIMA(1,2,3) model, how many differences are applied?

A . 0
B . 1
C . 2
D . 3

Correct Answer: C
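
In ARIMA(p, d, q) the middle term d is the number of times the series is differenced. A minimal sketch of differencing, using a made-up quadratic series:

```python
def difference(series, d=1):
    """Apply first-order differencing d times."""
    for _ in range(d):
        series = [b - a for a, b in zip(series, series[1:])]
    return series

series = [1, 4, 9, 16, 25, 36]   # illustrative data with a quadratic trend
once = difference(series, d=1)
twice = difference(series, d=2)  # d = 2, as in ARIMA(1,2,3)
print(once)
print(twice)
```

Each pass shortens the series by one value; after two passes the quadratic trend in this example is reduced to a constant.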

Question #10

If R factors are categorical variables, to which data classification level are they most closely related?

A . Nominal
B . Ordinal
C . Interval
D . Ratio

Correct Answer: A

Question #11

Consider this SQL statement:

SELECT product, prod_cost, avg(prod_cost) OVER (PARTITION BY product)
FROM product_detail

The OVER clause makes this what type of function?

A . Window function
B . Aggregate function
C . System function
D . User-defined function

Correct Answer: A

Question #12

In a Student’s t-test, what is the meaning of the p-value?

A . it is the area under the appropriate tails of the Student’s distribution
B . it is the "power" of the Student’s t-test
C . it is the mean of the distribution for the null hypothesis
D . it is the mean of the distribution for the alternate hypothesis

Correct Answer: A

Question #13

Consider these itemsets:

(hat, scarf, coat)

(hat, scarf, coat, gloves)

(hat, scarf, gloves)

(hat, gloves)

(scarf, coat, gloves)

What is the confidence of the rule (gloves -> hat)?

A . 75%
B . 60%
C . 66%
D . 80%

Correct Answer: A
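
The confidence of a rule X -> Y is the number of transactions containing both X and Y divided by the number containing X. The sketch below checks the arithmetic on the five itemsets from the question; the helper names are hypothetical.

```python
transactions = [
    {"hat", "scarf", "coat"},
    {"hat", "scarf", "coat", "gloves"},
    {"hat", "scarf", "gloves"},
    {"hat", "gloves"},
    {"scarf", "coat", "gloves"},
]

def support_count(itemset):
    """Number of transactions that contain every item in itemset."""
    return sum(1 for t in transactions if itemset <= t)

def confidence(lhs, rhs):
    """Confidence of the rule lhs -> rhs."""
    return support_count(lhs | rhs) / support_count(lhs)

conf = confidence({"gloves"}, {"hat"})
print(conf)
```

Gloves appear in four transactions and three of those also contain a hat, giving 3/4.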

Question #14

During the data preparation phase, you notice a high correlation between average spend on video games, age of players, and number of science fiction shows watched.

Which technique could you use to address the three correlated variables?

A . Square the three variables to remove the correlation
B . Combine the three variables into one new variable
C . Drop the three variables to improve the model
D . Use scaling to make the three variables equivalent in size

Correct Answer: B

Question #15

You are attempting to find the Euclidean distance between two centroids:

Centroid A’s coordinates: (X = 2, Y = 4)

Centroid B’s coordinates: (X = 8, Y = 10)

Which formula finds the correct Euclidean distance?

A . SQRT((2-8)^2 + (4-10)^2) or 8.49
B . SQRT(((2-8) x 2) + ((4-10) x 2)) or 12.17
C . ((2-8)^2 + (4-10)^2) or 72
D . ((2-8) x 2 + (4-10) x 2) or 148

Correct Answer: A
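
The arithmetic can be checked with a few lines of Python, using the two centroids from the question:

```python
import math

a = (2, 4)    # Centroid A
b = (8, 10)   # Centroid B

# Sum of squared coordinate differences, then the square root.
squared = (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2
distance = math.sqrt(squared)
print(squared, round(distance, 2))
```

The sum of squares is 72, and its square root rounds to 8.49.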

Question #16

In linear regression modeling, which action can be taken to improve the linearity of the relationship between the dependent and independent variables?

A . Apply a transformation to a variable
B . Use a different statistical package
C . Calculate the R-Squared value
D . Change the units of measurement on the independent variable

Correct Answer: A

Question #17

Which chart type is intended to display correlations between sets of numeric data?

A . Scatterplot
B . Histogram
C . Pie chart
D . Line Chart

Correct Answer: A

Question #18

What does the Receiver Operating Characteristic (ROC) curve show?

A . Relationship between p-value and true positive rate
B . Relationship between p-value and true negative rate
C . Relationship between true positive rate and false positive rate
D . Relationship between true positive rate and true negative rate

Correct Answer: C
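
As a rough sketch of how the points on an ROC curve arise, the code below sweeps a classification threshold over made-up scores and records the (false positive rate, true positive rate) pair at each step. The labels and scores are illustrative assumptions.

```python
labels = [1, 1, 0, 1, 0, 0]               # 1 = positive, 0 = negative
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]   # hypothetical model scores

def roc_points(labels, scores):
    """Return (FPR, TPR) pairs, one per distinct threshold."""
    pos = sum(labels)
    neg = len(labels) - pos
    points = [(0.0, 0.0)]  # threshold above every score
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
        points.append((fp / neg, tp / pos))
    return points

points = roc_points(labels, scores)
for fpr, tpr in points:
    print(round(fpr, 2), round(tpr, 2))
```

Lowering the threshold moves the curve from (0, 0) toward (1, 1).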

Question #19

A fair six-sided die is rolled. Let A denote the event that an odd number is rolled. Let C denote the event that a 1, 2, or 3 is rolled.

What is the value of the conditional probability, P(C|A)?

A . 2/3
B . 1/2
C . 1/3
D . 1/4

Correct Answer: A
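
Because every face of a fair die is equally likely, P(C|A) reduces to counting outcomes: the size of A and C together divided by the size of A. A small check:

```python
from fractions import Fraction

outcomes = {1, 2, 3, 4, 5, 6}
A = {n for n in outcomes if n % 2 == 1}   # odd number rolled
C = {1, 2, 3}                             # 1, 2, or 3 rolled

# P(C|A) = |A and C| / |A| for equally likely outcomes.
p_c_given_a = Fraction(len(A & C), len(A))
print(p_c_given_a)
```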

Question #20

Which word or phrase completes the statement? Business Intelligence is to ad-hoc reporting and dashboards as Data Science is to __________.

A . Optimization and Predictive Modeling
B . Alerts and Queries
C . Structured Data and Data Sources
D . Sales and profit reporting

Correct Answer: A

Question #21

Which method is used to solve for coefficients b0, b1, ..., bn in your linear regression model: Y = b0 + b1x1 + b2x2 + ... + bnxn?

A . Ordinary Least squares
B . Apriori Algorithm
C . Ridge and Lasso
D . Integer programming

Correct Answer: A
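
For intuition, here is a minimal sketch of ordinary least squares restricted to a single predictor (Y = b0 + b1x1), where the closed-form solution is short. The data are made up, and the general n-variable case would normally be solved with matrix methods instead.

```python
def ols_simple(x, y):
    """Closed-form least-squares fit of y = b0 + b1 * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    return b0, b1

x = [1, 2, 3, 4]   # illustrative data lying exactly on y = 1 + 2x
y = [3, 5, 7, 9]
b0, b1 = ols_simple(x, y)
print(b0, b1)
```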

Question #22

What is one modeling or descriptive statistical function in MADlib that is typically not provided in a standard relational database?

A . Linear regression
B . Expected value
C . Variance
D . Quantiles

Correct Answer: A

Question #23

Refer to the exhibit.

You are using K-means clustering to classify customer behavior for a large retailer. You need to determine the optimum number of customer groups. You plot the within-sum-of-squares (wss) data as shown in the exhibit.

How many customer groups should you specify?

A . 2
B . 3
C . 4
D . 8

Correct Answer: C

Question #24

Which activity is performed in the Operationalize phase of the Data Analytics Lifecycle?

A . Define the process to maintain the model
B . Try different analytical techniques
C . Try different variables
D . Transform existing variables

Correct Answer: A

Question #25

Which word or phrase completes the statement? “A theater actor is to ‘artistic and expressive’ as a data scientist is to __________.”

A . Communicative and collaborative
B . Introverted and technical
C . Logical and steadfast
D . Independent and intelligent

Correct Answer: A

Question #26

When is the GROUP BY ROLLUP clause used in an OLAP query?

A . All subtotals and grand totals are to be included in the output
B . Subtotals are only to be included in the output
C . Grand totals are only to be included in the output
D . Specific subtotals and grand totals for a combination of variables are only to be included in the output

Correct Answer: A

Question #27

You have run the association rules algorithm on your data set, and the two rules {banana, apple} => {grape} and {apple, orange}=> {grape} have been found to be relevant.

What else must be true?

A . {grape, apple, orange} must be a frequent itemset.
B . {banana, apple, grape, orange} must be a frequent itemset.
C . {grape} => {banana, apple} must be a relevant rule.
D . {banana, apple} => {orange} must be a relevant rule.

Correct Answer: A

Question #28

Which type of numeric value does a logistic regression model estimate?

A . Probability
B . A p-value
C . Any integer
D . Any real number

Correct Answer: A

Question #29

You are having a discussion with a business colleague. The colleague mentions that they want to perform K-means clustering on text file data stored in HDFS.

Which tool should be recommended?

A . Mahout
B . HBase
C . Scribe
D . Sqoop

Correct Answer: A

Question #30

In which phase of the data analytics lifecycle do Data Scientists spend the most time in a project?

A . Discovery
B . Data Preparation
C . Model Building
D . Communicate Results

Correct Answer: B

Question #31

Your colleague, who is new to Hadoop, approaches you with a question. They want to know how best to access their data. This colleague has a strong background in data flow languages and programming.

Which query interface would you recommend?

A . Pig
B . Hive
C . Howl
D . HBase

Correct Answer: A

Question #32

What is a consideration when building decision trees?

A . Cannot handle variables that affect the outcome in a discontinuous way
B . Short decision trees are likely subject to overfit
C . Correlated variables can cause double-counting
D . Tree structure is sensitive to small changes in the training data

Correct Answer: D

Question #33

You need to run a hypothesis test across three normally distributed populations.

Which technique should you use?

A . Z-test
B . Welch’s t-test
C . ANOVA
D . Wilcoxon rank sum test

Correct Answer: C

Question #34

The Marketing department of your company wishes to track opinion on a new product that was recently introduced. Marketing would like to know how many positive and negative reviews are appearing over a given period and potentially retrieve each review for more in-depth insight.

They have identified several popular product review blogs that historically have published thousands of user reviews of your company’s products. You have been asked to provide the desired analysis.

You examine the RSS feeds for each blog and determine which fields are relevant. You then craft a regular expression to match your new product’s name and extract the relevant text from each matching review.

What is the next step you should take?

A . Convert the extracted text into a suitable document representation and index into a review corpus
B . Use the extracted text and your regular expression to perform a sentiment analysis based on mentions of the new product
C . Read the extracted text for each review and manually tabulate the results
D . Group the reviews using Naïve Bayesian classification

Correct Answer: A

Question #35

Which process in text analysis can be used to reduce dimensionality?

A . Stemming
B . Parsing
C . Digitizing
D . Sorting

Correct Answer: A

Question #36

Which analytical method is considered unsupervised?

A . K-means clustering
B . Naïve Bayesian classifier
C . Decision tree
D . Linear regression

Correct Answer: A

Question #37

Refer to the exhibit.

Which type of data issue would you suspect based on the exhibit?

A . "Saturated" data, indicating potential issues with data definitions
B . Incomplete data, indicating potential issues with data transmission
C . Mis-scaled data, indicating potential issues with data entry
D . The exhibit does not raise any obvious concerns with the data.

Correct Answer: A

Question #38

You have created a Logistic Regression model to predict customer churn for your company. The company’s Marketing department wants to use your model to identify at-risk customers and offer incentives to keep them from leaving.

Using two different thresholds for the model provides the two confusion matrices shown in the graphic. Marketing understands the relative costs of missing at-risk customers versus offering incentives to customers who are not at risk. Therefore, you need their advice on how to set the appropriate threshold on the churn model.

You are meeting with the Marketing team. In the meeting, you plan to state: “Raising the threshold from 0.5 to 0.75 reduces the number of unnecessary incentives that can be offered, at the cost of missing more of the customers who churned.”

What is the most appropriate visual to reinforce this statement?

A . Option A
B . Option B
C . Option C
D . Option D

Correct Answer: B

Question #39

Your customer provided you with 2,000 unlabeled records and asked you to separate them into three groups.

What is the correct analytical method to use?

A . K-means clustering
B . Linear regression
C . Naive Bayesian classification
D . Logistic regression

Correct Answer: A

Question #40

How is dimensionality defined in a "bag of words" document representation?

A . Average number of words per sentence in the document
B . Total number of words in the document
C . Number of unique terms in the document
D . Frequency of repeated words in the document

Correct Answer: C

Question #41

You received 100,000 home loan records and want to quickly determine if there is any correlation between mortgage age and mortgage amount before conducting advanced analysis.

Which tool should be used for the preliminary analysis?

A . Scatter plot
B . Stacked Bar chart
C . Box and Whisker plot
D . Histogram

Correct Answer: A

Question #42

What is the output of the K-means clustering algorithm?

A . Centroid positioning and entropy of each record in each cluster
B . Center of each discovered cluster and mapping of each record to a cluster
C . Two dimensional representation of the data and the clusters
D . Intercept and coefficients for each input variable in the dataset

Correct Answer: B
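
To make the two outputs concrete, the sketch below runs a bare-bones K-means loop on six made-up 2-D points. The starting centroids are fixed by hand, an assumption made so that the run is deterministic; real implementations usually pick them randomly.

```python
import math

points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
          (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]

def kmeans(points, centroids, iterations=10):
    """Return (cluster centers, cluster index assigned to each point)."""
    assignment = []
    for _ in range(iterations):
        # Assignment step: nearest centroid for each point.
        assignment = [
            min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for k in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == k]
            if members:
                new_centroids.append(
                    tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centroids.append(centroids[k])  # empty cluster stays put
        if new_centroids == centroids:
            break  # converged
        centroids = new_centroids
    return centroids, assignment

centers, labels = kmeans(points, centroids=[points[0], points[3]])
print(centers)
print(labels)
```

The function returns exactly the two things named in the answer: the center of each cluster and the cluster each record maps to.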

Question #43

You are provided with the following list.

Which window function is missing?

cume_dist()

dense_rank()

rank()

percent_rank()

first_value()

last_value()

lag()

lead()

ntile()

A . row_preceding()
B . row_number()
C . median()
D . cumulative_sum()

Correct Answer: B

Question #44

In text analysis, what makes the corpus representation dynamic?

A . Algorithms used to determine the classification or tagging
B . Search and retrieval process for finding the document that meets the search criteria
C . Inherent high dimensionality in the problem of text analysis
D . Requirement to update index and corpus metrics continuously

Correct Answer: D

Question #45

How are window functions different from regular aggregate functions?

A . Rows retain their separate identities and the window function can access more than the current row.
B . Rows are grouped into an output row and the window function can access more than the current row.
C . Rows retain their separate identities and the window function can only access the current row.
D . Rows are grouped into an output row and the window function can only access the current row.

Correct Answer: A

Question #46

You have created a Linear Regression model to predict total sales based on variables M, N, P and Q as shown in the graphic. You originally expected all variables to have positive coefficients.

Which action would you take?

A . Accept all variables and begin model validation steps against holdout data
B . Accept only positive variables and investigate potential correlation with the dependent variable
C . Accept only statistically significant variables and investigate correlated independent variables
D . Accept none of the variables and investigate correlations between all variables

Correct Answer: D

Question #47

You have been assigned to do a study of the daily revenue effect of a pricing model of online transactions. All the data currently available to you has been loaded into your analytics database; revenue data, pricing data, and online transaction data.

You find that all the data comes in different levels of granularity. The transaction data has timestamps (day, hour, minutes, seconds), pricing is stored at the daily level, and revenue data is only reported monthly.

What is your next step?

A . Report back to the business owner that the current data model does not support the business question.
B . Interpolate a daily model for revenue from the monthly revenue data.
C . Aggregate all data to the monthly level in order to create a monthly revenue model.
D . Disregard revenue as a driver in the pricing model, and create a daily model based on pricing and transactions only.

Correct Answer: A

Question #48

Which key role for a successful analytic project can provide business domain expertise with a deep understanding of the data and key performance indicators?

A . Business Intelligence Analyst
B . Project Manager
C . Project Sponsor
D . Business User

Correct Answer: A

Question #49

A Data Scientist is assigned to build a model from a reporting data warehouse. The warehouse contains data collected from many sources and transformed through a complex, multi-stage ETL process.

What is a concern the data scientist should have about the data?

A . It is too processed
B . It is not structured
C . It is not normalized
D . It is too centralized

Correct Answer: A

Question #50

You have just completed the Discovery phase of a project and finished interviewing the main stakeholders. You have identified the necessary data feeds and are now beginning to set up the analytic sandbox.

What is the next step?

A . Assess data quality
B . Perform ELT / ETL
C . Create data visualizations
D . Run descriptive statistics for several data sets

Correct Answer: B

Question #51

In which lifecycle stage are appropriate analytical techniques determined?

A . Model planning
B . Model building
C . Data preparation
D . Discovery

Correct Answer: A

Question #52

What is holdout data?

A . a subset of the provided data set selected at random and used to validate the model
B . a subset of the provided data set selected at random and used to initially construct the model
C . a subset of the provided data set that is removed by the data scientist because it contains data errors
D . a subset of the provided data set that is removed by the data scientist because it contains outliers

Correct Answer: A

Question #53

In a t-test with unknown variance, what values are used to calculate the t-statistic?

A . Sample mean, sample standard deviation, and sample size
B . Mean, sample standard deviation, and population size
C . Sample mean, standard deviation, and sample size
D . Mean, standard deviation, and population size

Correct Answer: A
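
A sketch of the one-sample t-statistic, t = (sample mean - mu0) / (s / sqrt(n)), using only the three quantities in the answer. The sample values and the hypothesized mean `mu0` are illustrative assumptions.

```python
import math
import statistics

sample = [5.1, 4.9, 5.3, 5.2, 5.0]   # made-up measurements
mu0 = 5.0                            # hypothesized population mean

n = len(sample)                          # sample size
sample_mean = statistics.mean(sample)    # sample mean
sample_sd = statistics.stdev(sample)     # sample standard deviation (n - 1)

t_stat = (sample_mean - mu0) / (sample_sd / math.sqrt(n))
print(round(t_stat, 3))
```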

Question #54

Which participant in a data analytics project is typically responsible for assessing the validity of the model?

A . Data scientist
B . Business user
C . Project sponsor
D . Project manager

Correct Answer: A

Question #55

In a user-defined aggregate function, what is SFUNC?

A . Window function
B . State transition function
C . Final calculation function
D . Segment-level calculation function

Correct Answer: B

Question #56

What is required in a presentation for project sponsors?

A . The "Big Picture" takeaways for executive level stakeholders
B . Data warehouse design changes
C . Line by line review of the developed code
D . Detailed statistical basis for the modeling approach used in the project

Correct Answer: A

Question #57

Consider the following itemsets:

(hat, scarf, coat)

(hat, scarf, coat, gloves)

(hat, scarf, gloves)

(hat, gloves)

(scarf, coat, gloves)

If the minimum support is 50%, what represents the complete list of frequent 2-itemsets?

A . (hat, scarf), (hat, gloves)
B . (hat, scarf), (scarf, coat), (coat, gloves)
C . (scarf, gloves), (scarf, coat) (hat, gloves)
D . (hat, scarf), (hat, gloves), (scarf, gloves), (scarf, coat)

Correct Answer: D
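
With five transactions, a 50% minimum support means a pair must appear in at least three of them. The brute-force sketch below counts every 2-itemset; it is a direct enumeration for checking the answer, not the Apriori algorithm itself.

```python
from itertools import combinations

transactions = [
    {"hat", "scarf", "coat"},
    {"hat", "scarf", "coat", "gloves"},
    {"hat", "scarf", "gloves"},
    {"hat", "gloves"},
    {"scarf", "coat", "gloves"},
]
min_support = 0.5

items = sorted(set().union(*transactions))
frequent_pairs = []
for pair in combinations(items, 2):
    # Support = fraction of transactions containing both items.
    support = sum(1 for t in transactions if set(pair) <= t) / len(transactions)
    if support >= min_support:
        frequent_pairs.append(pair)

print(frequent_pairs)
```

The pairs (hat, coat) and (coat, gloves) each appear only twice, so they fall below the threshold.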

Question #58

Which activity is performed in the Operationalize phase of the data analytics lifecycle?

A . Try different variables
B . Try different analytical techniques
C . Assess the benefits
D . Transform existing variables

Correct Answer: C

Question #59

Which ROC curve represents a perfect model fit?

A . Exhibit A
B . Exhibit B
C . Exhibit C
D . Exhibit D

Correct Answer: A

Question #60

Which Hadoop service is responsible for requesting resources for, and monitoring the completion of, MapReduce processes?

A . Application Manager
B . NameNode
C . Application Master
D . DataNode

Correct Answer: C

Question #61

To ensure a successful analytic project, which key role can consult and advise the project team on the value of end results and how these will be used on a daily basis?

A . Business User
B . Project Manager
C . Data Scientist
D . Business Intelligence Analyst

Correct Answer: A

Question #62

Which word or phrase completes the statement? Emphasis color is to standard color as _______.

A . Main message is to context
B . Main message is to key findings
C . Frequent item set is to item
D . Pie chart is to proportions

Correct Answer: A

Question #63

A data scientist is preparing a presentation for a meeting with the project’s business sponsors. The distribution of per-sale revenue is an important finding from the analysis. The graphics illustrate four ways to plot the per-sale revenue distribution.

Which graphic is most appropriate for the sponsor presentation?

A . Figure A
B . Figure B
C . Figure C
D . Figure D

Correct Answer: B

Question #64

You have been assigned to do a study of the daily revenue effect of a pricing model of online transactions. You have tested all the theoretical models in the previous model planning stage, and all tests have yielded statistically insignificant results.

What is your next step?

A . Report that the results are insignificant, and reevaluate the original business question.
B . Run all the models again against a larger sample, leveraging more historical data.
C . Move forward on the model with the highest significance scores relative to the others.
D . Modify samples used by the models and iterate until a significant result occurs.

Correct Answer: A

Question #65

A disk drive manufacturer has a defect rate of less than 1.0% with 98% confidence. A quality assurance team samples 1000 disk drives and finds 14 defective units.

Which action should the team recommend?

A . The manufacturing process should be inspected for problems.
B . A larger sample size should be taken to determine if the plant is functioning properly
C . A smaller sample size should be taken to determine if the plant is functioning properly
D . The manufacturing process is functioning properly and no further action is required.

Correct Answer: A

Question #66

Data visualization is used in the final presentation of an analytics project.

For what else is this technique commonly used?

A . Assessing data quality
B . Descriptive statistics
C . ETLT
D . Model selection

Correct Answer: A

Question #67

Refer to the exhibit.

The exhibit provides the decision tree for predicting whether someone is a good or bad credit risk.

What would be the assigned probability, p(good), of a single male with no known savings?

A . 0.83
B . 0
C . 0.498
D . 0.6

Correct Answer: A

Question #68

Which SQL OLAP extension provides all possible grouping combinations?

A . CUBE
B . ROLLUP
C . UNION ALL
D . CROSS JOIN

Correct Answer: A
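
The difference between CUBE and ROLLUP can be seen by listing the grouping sets each one implies. The sketch below only enumerates column combinations and runs no SQL; the column names are hypothetical.

```python
from itertools import combinations

def cube_grouping_sets(columns):
    """Every subset of the columns, which is what CUBE groups by."""
    return [combo
            for size in range(len(columns), -1, -1)
            for combo in combinations(columns, size)]

def rollup_grouping_sets(columns):
    """Only the left-to-right prefixes, which is what ROLLUP groups by."""
    return [tuple(columns[:size]) for size in range(len(columns), -1, -1)]

columns = ["region", "product", "year"]   # hypothetical column names
cube_sets = cube_grouping_sets(columns)
rollup_sets = rollup_grouping_sets(columns)

print(len(cube_sets), cube_sets)
print(len(rollup_sets), rollup_sets)
```

For n columns CUBE yields 2^n grouping sets and ROLLUP yields n + 1; the empty tuple corresponds to the grand total.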

Question #69

Assume you are performing an analysis for fraud detection on credit card usage. You need to ensure that higher-risk transactions, which may indicate fraudulent credit card activity, are retained in your data for analysis and not dropped as outliers during pre-processing.

What is the approach for loading data into the analytical sandbox for this analysis?

A . ELT
B . ETL
C . EDW
D . OLTP

Correct Answer: A

Question #70

What type of data is represented in the exhibit?

A . Structured
B . Unstructured
C . Quasi-structured
D . Semi-structured

Correct Answer: A

Question #71

When is a Wilcoxon Rank-Sum test used?

A . When an assumption about the distribution of the populations cannot be made
B . When the data can be easily sorted
C . When the populations represent the sums of other values
D . When the data cannot be easily sorted

Correct Answer: A

Question #72

Refer to the exhibit.

For effective visualization, what is the chart’s primary flaw?

A . The use of 3 dimensions.
B . The slanting of axis labels.
C . The location of the legend.
D . The order of the columns.

Correct Answer: A

Question #73

What requests resources from YARN during a MapReduce job?

A . Map and reduce tasks
B . ApplicationMaster
C . ApplicationsManager
D . DataNodes

Correct Answer: B

Question #74

Since R factors are categorical variables, they are most closely related to which data classification level?

A . nominal
B . ordinal
C . interval
D . ratio

Correct Answer: A

Question #75

What is a distinct property of Logistic Regression compared with Linear Regression?

A . Logistic Regression handles missing values well
B . Logistic Regression is robust with redundant or correlated variables
C . Logistic Regression returns probability estimates of an event
D . Logistic Regression works well with discrete variables that have many distinct values

Correct Answer: C

Question #76

You are building a logistic regression model to predict whether a tax filer will be audited within the next two years. Your training set population is 1000 filers. The audit rate in your training data is 4.2%.

What is the sum of the probabilities that the model assigns to all the filers in your training set that have been audited?

A . 42.0
B . 4.2
C . 0.42
D . 0.042

Correct Answer: A

Question #77

Consider the example of an analysis for fraud detection on credit card usage. You will need to ensure higher-risk transactions that may indicate fraudulent credit card activity are retained in your data for analysis, and not dropped as outliers during pre-processing.

What will be your approach for loading data into the analytical sandbox for this analysis?

A . ELT
B . ETL
C . EDW
D . OLTP

Correct Answer: A

Question #78

What is an appropriate data visualization to use in a presentation for an analyst audience?

A . Pie chart
B . Area chart
C . Stacked bar chart
D . ROC curve

Correct Answer: D

Question #79

How is HDFS defined?

A . Large “web table” capable of holding millions of rows and millions of columns
B . Row-column oriented datastore supporting redundancy and high availability
C . Reliable, redundant distributed file system
D . Reliable file system stored on a single extensible storage platform

Correct Answer: C

Question #80

Which word or phrase completes the statement? Structured data is to OLAP data as quasi-structured data is to __________.

A . Clickstream data
B . XML data
C . Text documents
D . Image files

Correct Answer: A

Question #81

You have been assigned to run a logistic regression model for each of 100 countries, and all the data is currently stored in a PostgreSQL database.

Which tool/library would you use to produce these models with the least effort?

A . MADlib
B . Mahout
C . RStudio
D . HBase

Correct Answer: A

Question #82

A data scientist plans to classify the sentiment polarity of 10,000 product reviews collected from the Internet.

Assuming labeled training data is available, what is the most appropriate model to use?

A . Naïve Bayesian classifier
B . Linear regression
C . Logistic regression
D . K-means clustering

Correct Answer: A

Question #83

What does the R code nv <- v[v < 1000] do?

A . Selects the values in vector v that are less than 1000 and assigns them to the vector nv
B . Sets nv to TRUE or FALSE depending on whether all elements of vector v are less than 1000
C . Removes elements of vector v less than 1000 and assigns the elements >= 1000 to nv
D . Selects values of vector v less than 1000, modifies v, and makes a copy to nv

Correct Answer: A

Question #84

You have run a Linear Regression model on the data shown in the graphic.

Which value is a reasonable guess for R-squared?

A . -.8
B . .8
C . .25
D . 1.25

Correct Answer: B

Question #85

You have created a scatterplot of two continuous variables for 2000 records. You want to add a line to the scatterplot to check linearity of the data.

Which function would best address this need?

A . abline()
B . glm()
C . hist()
D . lm()

Correct Answer: A

Question #86

Why do the Naïve Bayesian classifier implementations use the log of probability value rather than the pure probability value?

A . To ensure the conditional independence of attribute values
B . To avoid numerical underflow errors in high dimensional problems
C . To obtain a more accurate estimate of the probabilities without the need for a Laplace smoothing
D . To invalidate the variables that are continuous

Correct Answer: B
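
The underflow problem is easy to reproduce. The sketch below multiplies 100 small probabilities (made-up values standing in for per-attribute likelihoods) and compares the result with the sum of their logarithms.

```python
import math

# 100 hypothetical conditional probabilities, each equal to 1e-5.
probs = [1e-5] * 100

# Direct product: the true value, 1e-500, is too small for a 64-bit float.
product = 1.0
for p in probs:
    product *= p

# Sum of logs: the same information, kept in a representable range.
log_sum = sum(math.log(p) for p in probs)

print(product)
print(log_sum)
```

The product collapses to 0.0, while the log-sum stays finite and can still be compared across classes.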

Question #87

Consider the following SQL query:

SELECT product_id FROM supplier_A
UNION
SELECT product_id FROM supplier_B;

What is the expected result?

A . All product_id values from both tables with duplicates or repeating rows
B . All product_id values from supplier_A table but not from supplier_B table
C . All product_id values from supplier_B table but not from supplier_A table
D . All product_id values from both tables with no duplicates or repeating rows

Correct Answer: D
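
The behavior can be tried in an in-memory SQLite database through Python's standard `sqlite3` module. The table contents are made up, and UNION ALL is included only for contrast.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE supplier_A (product_id INTEGER)")
conn.execute("CREATE TABLE supplier_B (product_id INTEGER)")
# Illustrative data: 2 repeats within supplier_A, 3 appears in both tables.
conn.executemany("INSERT INTO supplier_A VALUES (?)", [(1,), (2,), (2,), (3,)])
conn.executemany("INSERT INTO supplier_B VALUES (?)", [(3,), (4,)])

union_rows = sorted(row[0] for row in conn.execute(
    "SELECT product_id FROM supplier_A UNION SELECT product_id FROM supplier_B"))
union_all_rows = sorted(row[0] for row in conn.execute(
    "SELECT product_id FROM supplier_A UNION ALL SELECT product_id FROM supplier_B"))
conn.close()

print(union_rows)
print(union_all_rows)
```

UNION removes duplicates both within and across the two tables, whereas UNION ALL keeps every row.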

Question #88

In data visualization, which type of chart is recommended to represent frequency data?

A . Line chart
B . Histogram
C . Q-Q chart
D . Scatterplot

Correct Answer: B

Question #89

Which word or phrase completes the statement: “Excessive emphasis color is to bar chart as __________________”?

A . Multicollinearity is to OLS
B . Multicollinearity is to serial correlation
C . Confidence is to leverage
D . Confidence interval is to regression

Correct Answer: A

Question #90

You submit a MapReduce job to a Hadoop cluster. Although the job was successfully submitted, you notice that it is not completing.

What should be done?

A . Ensure that a DataNode is running
B . Ensure that the TaskTracker is running
C . Ensure that the NameNode is running
D . Ensure that the JobTracker is running

Correct Answer: B

Question #91

Trend, seasonal, and cyclical are components of a time series.

What is another component?

A . Irregular
B . Linear
C . Quadratic
D . Exponential

Correct Answer: A

Question #95

Variable D is not significantly impacting the dependent variable.

After seeing your findings, the majority of your team agreed that variable B should be positively impacting the dependent variable.

What is a possible reason the coefficient for variable B was negative and not positive?

A . Variable B is interacting with another variable due to correlated inputs
B . Variable B needs a quadratic transformation due to its relationship to the dependent variable
C . The information gain from variable B is already provided by another variable
D . Variable B needs a logarithmic transformation due to its relationship to the dependent variable

Correct Answer: A

Question #96

Refer to the exhibit.

You have run a linear regression model against your data, and have plotted true outcome versus predicted outcome. The R-squared of your model is 0.75.

What is your assessment of the model?

A . The R-squared may be biased upwards by the extreme-valued outcomes. Remove them and refit to get a better idea of the model’s quality over typical data.
B . The R-squared is good. The model should perform well.
C . The extreme-valued outliers may negatively affect the model’s performance. Remove them to see if the R-squared improves over typical data.
D . The observations seem to come from two different populations, but this model fits them both equally well.

Correct Answer: A

Question #97

If distributed Item-based Collaborative Filtering is an algorithm supported by Mahout, what is the use case category of the algorithm?

A . Classification
B . Recommenders
C . Frequent Itemset
D . Clustering

Correct Answer: B

Question #98

Your risk analysis team has access to new customer financial data. You want to use this data to improve your prediction of credit default. Previously, the team was using only credit bureau scores, loan size, and customer income to assess risk of default.

What is the null hypothesis that should be used to evaluate the model?

A . New model predicts as well as the toss of a coin weighted by the average default rate
B . New model predicts better than the toss of a coin weighted by the average default rate
C . Model using the new financial data predicts the outcome just as well as the previous model
D . Model using the new financial data predicts the outcome better than the previous model

Correct Answer: C

Question #99

Which assumption makes the Naïve Bayesian classifier different from the general Bayesian model?

A . Number of features cannot be greater than the number of records
B . Features of a class are conditionally independent of one another
C . All variables need to be numeric
D . Fewer features can be used with the Naïve Bayes classifier

Correct Answer: B

Question #100

Refer to the exhibit.

You have plotted the distribution of savings account sizes for your bank.

How would you proceed, based on this distribution?

A . The data is extremely skewed. Replot the data on a logarithmic scale to get a better sense of it.
B . The data is extremely skewed, but looks bimodal; replot the data in the range 2,500-10,000 to be sure.
C . The accounts of size greater than 2500 are rare, and probably outliers. Eliminate them from your future analysis.
D . The data is extremely skewed. Split your analysis into two cohorts: accounts less than 2500, and accounts greater than 2500

Correct Answer: A

Question #101

You have the following corpus of texts:

“The cat hit the dog.”

“The dog bit the mail carrier.”

“The mail carrier chased the truck.”

“The truck hit the wall while avoiding the dog that chased the cat.”

“The cat climbed the wall.”

If the tf-idf metric is used to score relevance for search and retrieval, which term has the highest discriminatory power?

A . Dog
B . Chased
C . Bit
D . Truck

Correct Answer: C
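
The answer can be checked by counting how many of the five texts contain each candidate term. The sketch below assumes the common inverse document frequency form idf = log(N / df); other tf-idf variants exist, but they rank these four terms the same way because the rarest term always scores highest.

```python
import math
import re

corpus = [
    "The cat hit the dog.",
    "The dog bit the mail carrier.",
    "The mail carrier chased the truck.",
    "The truck hit the wall while avoiding the dog that chased the cat.",
    "The cat climbed the wall.",
]

# Represent each document as the set of lowercase words it contains.
docs = [set(re.findall(r"[a-z]+", text.lower())) for text in corpus]


def idf(term):
    """Inverse document frequency, assuming idf = log(N / df)."""
    df = sum(term in doc for doc in docs)
    return math.log(len(docs) / df)


scores = {term: idf(term) for term in ["dog", "chased", "bit", "truck"]}
best = max(scores, key=scores.get)

print(scores)
print(best)
```

"Bit" occurs in only one of the five texts, while "dog" occurs in three and "chased" and "truck" in two each, so "bit" has the highest idf and discriminates best between documents.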

Question #102

What is required in a presentation for business analysts?

A . Budgetary considerations and requests
B . Operational process changes
C . Detailed statistical explanation of the applicable modeling theory
D . The presentation author’s credentials

Correct Answer: B

Question #103

You are using the Apriori algorithm to determine the likelihood that a person who owns a home has a good credit score. You have determined that the confidence for the rules used in the algorithm is > 75%. You calculate lift = 1.011 for the rule, "People with good credit are homeowners".

What can you determine from the lift calculation?

A . Support for the association is low
B . Leverage of the rules is low
C . The rule is coincidental
D . The rule is true

Correct Answer: C
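
A lift close to 1 means the two items occur together about as often as they would if they were independent, even when confidence is high. The sketch below uses made-up counts (1000 customers, 600 with good credit, 750 homeowners, 455 with both), chosen only so that the result lands near the values in the question; the function names are illustrative.

```python
def confidence(n_x, n_xy):
    """Confidence of the rule X -> Y: P(Y | X)."""
    return n_xy / n_x


def lift(n_total, n_x, n_y, n_xy):
    """Lift of the rule X -> Y: support(X and Y) / (support(X) * support(Y))."""
    support_x = n_x / n_total
    support_y = n_y / n_total
    support_xy = n_xy / n_total
    return support_xy / (support_x * support_y)


# Hypothetical counts: X = good credit, Y = homeowner.
n_total, n_good_credit, n_homeowner, n_both = 1000, 600, 750, 455

rule_conf = confidence(n_good_credit, n_both)
rule_lift = lift(n_total, n_good_credit, n_homeowner, n_both)

print(rule_conf)
print(rule_lift)
```

With these counts the confidence is above 75% mainly because most customers are homeowners anyway, and the lift of about 1.011 shows that good credit adds almost no information about home ownership.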