Exam4Training

CompTIA DA0-001 CompTIA Data+ Certification Online Training

Question #1

Refer to the exhibit.

A data analyst needs to calculate the mean for Q1 sales using the data set below:

Which of the following is the mean?

  • A . $2,466.18
  • B . $2,667.60
  • C . $3,082.72
  • D . $12,330.88

Correct Answer: C

Explanation:

The mean is the average of all the values in a data set. To calculate the mean, add up all the Q1 sales values in the exhibit and divide by the number of values, which gives $3,082.72.
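
As an illustration only, here is a minimal Python sketch of the calculation; the sales figures below are hypothetical placeholders, not the values from the exhibit:

    from statistics import mean

    # Hypothetical Q1 sales values; the real figures come from the exhibit
    q1_sales = [3100.25, 2950.50, 3200.10, 3081.03]

    average = sum(q1_sales) / len(q1_sales)      # sum of the values divided by their count
    print(f"Mean Q1 sales: ${average:,.2f}")
    print(f"Check with statistics.mean: ${mean(q1_sales):,.2f}")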

Reference: CompTIA Data+ Certification Exam Objectives, page 9

Question #2

A data analyst is creating a report that will provide information about various regions, products, and time periods.

Which of the following formats would be the MOST efficient way to deliver this report?

  • A . A workbook with multiple tabs for each region
  • B . A daily email with snapshots of regional summaries
  • C . A static report with a different page for every filtered view
  • D . A dashboard with filters at the top that the user can toggle

Correct Answer: D

Explanation:

A dashboard with filters at the top that the user can toggle would be the most efficient way to deliver this report, because it allows the user to customize the view and explore different combinations of regions, products, and time periods. A workbook with multiple tabs for each region would be cumbersome and repetitive. A daily email with snapshots of regional summaries would not provide enough detail or interactivity. A static report with a different page for every filtered view would be too long and hard to navigate.

Reference: CompTIA Data+ Certification Exam Objectives, page 14

Question #3

Refer to the exhibit.

A customer list from a financial services company is shown below:

A data analyst wants to create a likely-to-buy score on a scale from 0 to 100, based on an average of the three numerical variables: number of credit cards, age, and income.

Which of the following should the analyst do to the variables to ensure they all have the same weight in the score calculation?

  • A . Recode the variables.
  • B . Calculate the percentiles of the variables.
  • C . Calculate the standard deviations of the variables.
  • D . Normalize the variables.

Correct Answer: D

Explanation:

Normalizing the variables means scaling them to a common range, such as 0 to 1 or -1 to 1, so that they have the same weight in the score calculation. Recoding the variables means changing their values or categories, which would alter their meaning and distribution. Calculating the percentiles of the variables means ranking them relative to each other, which would not account for their actual magnitudes. Calculating the standard deviations of the variables means measuring their variability, which would not make them comparable.
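
A minimal Python sketch of min-max normalization, using hypothetical customer values rather than the data in the exhibit:

    def min_max_normalize(values):
        """Scale a list of numbers to the 0-1 range so every variable carries equal weight."""
        lo, hi = min(values), max(values)
        return [(v - lo) / (hi - lo) for v in values]

    # Hypothetical customer attributes (not the values from the exhibit)
    num_cards = [1, 3, 5, 2]
    ages      = [25, 40, 62, 31]
    incomes   = [30_000, 85_000, 120_000, 54_000]

    normalized = [min_max_normalize(col) for col in (num_cards, ages, incomes)]

    # Average the three normalized variables and rescale to 0-100 for the likely-to-buy score
    scores = [round(sum(vals) / 3 * 100, 1) for vals in zip(*normalized)]
    print(scores)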

Reference: CompTIA Data+ Certification Exam Objectives, page 10

Question #4

Which of the following actions should be taken when transmitting data to mitigate the chance of a data leak occurring? (Choose two.)

  • A . Data identification
  • B . Data processing
  • C . Data Reporting
  • D . Data encryption
  • E . Data masking
  • F . Data removal

Correct Answer: DE

Explanation:

Data encryption and data masking are two actions that can be taken when transmitting data to mitigate the chance of a data leak occurring. Data encryption means transforming data into an unreadable format that can only be decrypted with a key. Data masking means hiding or replacing sensitive data with fictitious or anonymized data. Both methods protect the confidentiality and integrity of the data in transit.
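
For illustration only (the exam objectives do not prescribe an implementation), a minimal Python sketch of data masking, using made-up values:

    def mask_email(email: str) -> str:
        """Replace most of the local part of an email address with asterisks."""
        local, _, domain = email.partition("@")
        return local[0] + "*" * (len(local) - 1) + "@" + domain

    def mask_card(card_number: str) -> str:
        """Keep only the last four digits of a card number."""
        return "*" * (len(card_number) - 4) + card_number[-4:]

    print(mask_email("jane.doe@example.com"))   # j*******@example.com
    print(mask_card("4111111111111111"))        # ************1111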

Reference: CompTIA Data+ Certification Exam Objectives, page 13

Question #5

A data analyst has been asked to organize the table below in the following ways:

By sales from high to low –

By state in alphabetic order –

Which of the following functions will allow the data analyst to organize the table in this manner?

  • A . Conditional formatting
  • B . Grouping
  • C . Filtering
  • D . Sorting

Correct Answer: D

Explanation:

Sorting is the function that will allow the data analyst to organize the table in the desired manner. Sorting means arranging the data in a specific order, such as ascending or descending, based on one or more criteria. Sorting can be applied to any column in the table, such as sales or state.
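
A minimal Python sketch of a two-level sort, using a hypothetical table rather than the one in the question:

    # Hypothetical rows; the real table comes from the question's exhibit
    rows = [
        {"state": "Texas", "sales": 1200},
        {"state": "Ohio",  "sales": 1800},
        {"state": "Iowa",  "sales": 1800},
        {"state": "Maine", "sales": 950},
    ]

    # Sort by sales from high to low, then by state in alphabetical order for ties
    rows.sort(key=lambda r: (-r["sales"], r["state"]))
    for row in rows:
        print(row["state"], row["sales"])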

Reference: CompTIA Data+ Certification Exam Objectives, page 11

Question #6

Which of the following BEST describes the issue in which character values are mixed with integer values in a data set column?

  • A . Duplicate data
  • B . Missing data
  • C . Data outliers
  • D . Invalid data type

Correct Answer: D

Explanation:

The invalid data type is the best description for the issue in which character values are mixed with integer values in a data set column. Invalid data type means that the data does not match the expected or required format or structure for a given variable or attribute. For example, if a column is supposed to store numerical values, but some rows contain text values, then those rows have an invalid data type.

Reference: CompTIA Data+ Certification Exam Objectives, page 10

Question #7

Which of the following is a process that is used during data integration to collect, blend, and load data?

  • A . MDM
  • B . ETL
  • C . OLTP
  • D . BI

Correct Answer: B

Explanation:

ETL is a process that is used during data integration to collect, blend, and load data. ETL stands for extract, transform, and load, which are the three main steps involved in moving data from different sources to a common destination, such as a data warehouse or a data lake. ETL helps to consolidate and standardize data for analysis and reporting purposes.
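
A minimal Python sketch of the extract, transform, and load steps; the file name, column names, and SQLite destination are hypothetical:

    import csv
    import sqlite3

    def extract(path):
        """Extract: read raw rows from a CSV source."""
        with open(path, newline="") as f:
            return list(csv.DictReader(f))

    def transform(rows):
        """Transform: standardize the region names and cast revenue to a number."""
        for row in rows:
            row["region"] = row["region"].strip().title()
            row["revenue"] = float(row["revenue"])
        return rows

    def load(rows, db_path="warehouse.db"):
        """Load: write the cleaned rows into a destination table."""
        con = sqlite3.connect(db_path)
        con.execute("CREATE TABLE IF NOT EXISTS sales (region TEXT, revenue REAL)")
        con.executemany(
            "INSERT INTO sales (region, revenue) VALUES (?, ?)",
            [(r["region"], r["revenue"]) for r in rows],
        )
        con.commit()
        con.close()

    # load(transform(extract("sales.csv")))   # hypothetical source file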

Reference: CompTIA Data+ Certification Exam Objectives, page 12

Question #8

An analyst has received the requirements for an internal user dashboard. The analyst confirms the data sources and then creates a wireframe.

Which of the following is the NEXT step the analyst should take in the dashboard creation process?

  • A . Optimize the dashboard.
  • B . Create subscriptions.
  • C . Get stakeholder approval.
  • D . Deploy to production.

Correct Answer: C

Explanation:

Getting stakeholder approval is the next step the analyst should take in the dashboard creation process, after confirming the data sources and creating a wireframe. Stakeholder approval means getting feedback and validation from the intended users or clients of the dashboard, to ensure that it meets their expectations and requirements. This step helps to avoid rework and ensure customer satisfaction.

Reference: CompTIA Data+ Certification Exam Objectives, page 14

Question #9

A data analyst has been asked to derive a new variable labeled “Promotion_flag” based on the total quantity sold by each salesperson.

Given the table below:

Which of the following functions would the analyst consider appropriate to flag “Yes” for every salesperson who has a number above 1,000,000 in the Quantity_sold column?

  • A . Date
  • B . Mathematical
  • C . Logical
  • D . Aggregate

Correct Answer: C

Explanation:

A logical function is a type of function that returns a value based on a condition or a set of conditions. For example, the IF function in Excel can be used to check if a certain condition is met, and then return one value if true, and another value if false. In this case, the data analyst can use a logical function to check if the Quantity_sold column is greater than 1,000,000, and then return “Yes” if true, and “No” if false. This would create a new variable called Promotion_flag that indicates whether the salesperson has sold more than 1,000,000 units or not.
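
A minimal pandas sketch of the same logical test, using hypothetical quantities rather than the table in the question:

    import pandas as pd

    # Hypothetical sales figures; the real values come from the question's table
    df = pd.DataFrame({
        "Salesperson": ["Ann", "Bob", "Cam"],
        "Quantity_sold": [1_250_000, 980_000, 1_040_000],
    })

    # Logical condition: flag "Yes" when Quantity_sold exceeds 1,000,000, otherwise "No"
    df["Promotion_flag"] = df["Quantity_sold"].apply(
        lambda qty: "Yes" if qty > 1_000_000 else "No"
    )
    print(df)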

Reference: CompTIA Data+ Certification Exam Objectives, Logical functions (reference)

Question #10

Refer to the exhibit.

Given the diagram below:

Which of the following data schemas is shown?

  • A . Key-value pairs
  • B . Online transactional processing
  • C . Data Lake
  • D . Relational database

Correct Answer: D

Explanation:

A relational database is a type of database that organizes data into tables, where each table has a fixed number of columns and a variable number of rows. Each row in a table represents a record or an entity, and each column represents an attribute or a property of that entity. The tables are linked by common fields, called keys, which enable the database to establish relationships between the data. A relational database schema is a diagram that shows the structure and organization of the tables, columns, keys, and constraints in a relational database. The diagram given in the question is an example of a relational database schema, as it shows two tables: “Runs” and “Experiments”, with their respective columns, data types, and primary keys. The “Runs” table also has a foreign key that references the “ExperimentId” column in the “Experiments” table, indicating a relationship between the two tables. Therefore, the correct answer is D.

Reference: What is a database schema? | IBM, Database Schema – Javatpoint

Question #11

A company’s marketing department wants to do a promotional campaign next month. A data analyst on the team has been asked to perform customer segmentation, looking at how recently a customer bought the product, at what frequency, and at what value.

Which of the following types of analysis would this practice be considered?

  • A . Prescriptive
  • B . Trend
  • C . Gap
  • D . Cluster

Correct Answer: D

Explanation:

Customer segmentation is a type of cluster analysis, which is a method of grouping data points based on their similarities or differences. Cluster analysis can help identify patterns and trends in the data, as well as target specific groups of customers for marketing purposes. One common technique for customer segmentation is RFM analysis, which stands for recency, frequency, and monetary value. This technique assigns a score to each customer based on how recently they bought the product, how often they buy the product, and how much they spend on the product. These scores can then be used to create clusters of customers with different characteristics and preferences. Therefore, the correct answer is D.

Reference: Cluster Analysis – Statistics Solutions, RFM Analysis: The Ultimate Guide for Customer Segmentation

Question #12

A publishing group has requested a dashboard to track submissions before publication. A key requirement is that all changes are tracked, as multiple users will be checking out documents and editing them before submissions are considered final.

Which of the following is the BEST way to meet this stakeholder requirement?

  • A . Display the version number next to each submission on the dashboard.
  • B . Present a data refresh date at the top of the dashboard.
  • C . Confirm the dashboard is adhering to the corporate style guide.
  • D . Use permissions to ensure users only see certain versions of the submissions.

Correct Answer: A

Explanation:

Displaying the version number next to each submission on the dashboard is the best way to meet this requirement. Because multiple users will be checking out and editing documents before a submission is considered final, a visible version number makes the change history transparent and lets stakeholders see which revision of each submission they are viewing. A data refresh date only shows when the underlying data was last loaded, not whether a document changed. Adhering to the corporate style guide addresses presentation, not change tracking. Using permissions to restrict which versions users can see would hide the history rather than track it.

Question #13

The number of phone calls that the call center receives in a day is an example of:

  • A . continuous data.
  • B . categorical data.
  • C . ordinal data.
  • D . discrete data.

Correct Answer: D

Explanation:

Discrete data is a type of data that can only take certain values, usually whole numbers or integers. Discrete data can be counted, but not measured. For example, the number of students in a class, the number of books in a library, or the number of phone calls that a call center receives in a day are all examples of discrete data. Discrete data is different from continuous data, which can take any value within a range, and can be measured with precision. For example, the height of a person, the weight of a fruit, or the temperature of a room are all examples of continuous data. Therefore, the correct answer is D.

Reference: [Discrete vs Continuous Data: Definition and Examples – Statistics How To], [Discrete Data – Definition and Examples | Math Goodies]

Question #14

A data analyst is asked to create a sales report for the second-quarter 2020 board meeting, which will include a review of the business’s performance through the second quarter. The board meeting will be held on July 15, 2020, after the numbers are finalized.

Which of the following report types should the data analyst create?

  • A . Static
  • B . Real-time
  • C . Self-service
  • D . Dynamic

Correct Answer: A

Explanation:

A static report is a type of report that shows a snapshot of data at a specific point in time. A static report does not change or update automatically unless the data source is refreshed or the report is regenerated, which makes it suitable when the data does not change frequently or when a finalized, historical view is needed. In this case, the data analyst is asked to create a sales report for the second-quarter 2020 board meeting, which will be held on July 15, 2020, after the numbers are finalized. The analyst therefore does not need real-time or dynamic data, but rather a fixed and accurate view of the sales data through the second quarter. A static report would be the best way to meet this stakeholder requirement. Therefore, the correct answer is A.

Reference: [What are Dynamic Reports? | Sisense], Static vs Dynamic Reports – What’s The Difference? | datapine

Question #15

Which of the following would be considered non-personally identifiable information?

  • A . Cell phone device name
  • B . Customer’s name
  • C . Government ID number
  • D . Telephone number

Correct Answer: A

Explanation:

Non-personally identifiable information (non-PII) is any data that cannot be used to identify, contact, or locate a specific individual, either alone or combined with other sources. Non-PII can include aggregated statistics, anonymous data, device identifiers, IP addresses, cookies, and other types of information that do not reveal the identity or location of a person. Cell phone device name is an example of non-PII, as it does not reveal any personal information about the owner or user of the device. Therefore, the correct answer is A.

Reference: What is Non-Personally Identifiable Information (Non-PII)? | Definition and Examples, What is Personally Identifiable Information (PII)? | Definition and Examples

Question #16

Which of the following is the correct data type for text?

  • A . Boolean
  • B . String
  • C . Integer
  • D . Float

Correct Answer: B

Explanation:

A string is a data type that represents a sequence of characters, such as text, symbols, numbers, or punctuation marks. Strings are enclosed in quotation marks, such as “Hello”, “123”, or “!@#”. Strings can be manipulated, concatenated, sliced, indexed, formatted, and searched using various methods and functions. A string is different from other data types, such as boolean, integer, or float, which represent logical values (true or false), whole numbers, or decimal numbers respectively. Therefore, the correct answer is B.

Reference: What is a String? | Definition and Examples, Python String Methods

Question #17

Which of the following should be accomplished NEXT after understanding a business requirement for a data analysis report?

  • A . Rephrase the business requirement.
  • B . Determine the data necessary for the analysis.
  • C . Build a mock dashboard/presentation layout.
  • D . Perform exploratory data analysis.

Correct Answer: B

Explanation:

Determining the data necessary for the analysis is the next step after understanding a business requirement for a data analysis report. This means identifying which data sources, fields, and time periods are needed to answer the business question and confirming that the data is available and accessible. Rephrasing the business requirement is part of understanding it rather than a separate next step, while exploratory data analysis and building a mock dashboard/presentation layout come later, once the required data has been identified and acquired. Therefore, the correct answer is B.

Reference: [What is Exploratory Data Analysis? | Definition and Examples], [Exploratory Data Analysis in Python]

Question #18

Which of the following is a common data analytics tool that is also used as an interpreted, high-level, general-purpose programming language?

  • A . SAS
  • B . Microsoft Power BI
  • C . IBM SPSS
  • D . Python

Correct Answer: D

Explanation:

Python is a common data analytics tool that is also used as an interpreted, high-level, general-purpose programming language. Python has a simple and expressive syntax that makes it easy to read and write code. Python also has a rich set of libraries and frameworks that support various tasks and applications in data analytics, such as data manipulation, visualization, machine learning, natural language processing, web scraping, and more. Some examples of popular Python libraries for data analytics are pandas, numpy, matplotlib, seaborn, scikit-learn, nltk, and beautifulsoup. Python is different from other data analytics tools that are not programming languages but rather software applications or platforms that provide graphical user interfaces (GUIs) for data analysis and visualization. Some examples of these tools are SAS, Microsoft Power BI, IBM SPSS. Therefore, the correct answer is D.

Reference: [What is Python? | Definition and Examples], [Python Libraries for Data Science]

Question #19

A data analyst needs to present the results of an online marketing campaign to the marketing manager. The manager wants to see the most important KPIs and measure the return on marketing investment.

Which of the following should the data analyst use to BEST communicate this information to the manager?

  • A . A real-time monitor that allows the manager to view performance the day the campaign was launched
  • B . A self-service dashboard that allows the manager to look at the company’s annual budget performance
  • C . A spreadsheet of the raw data from all marketing campaigns and channels
  • D . A summary with statistics, conclusions, and recommendations from the data analyst

Correct Answer: D

Explanation:

A summary with statistics, conclusions, and recommendations from the data analyst is the best way to communicate the results of an online marketing campaign to the marketing manager. A summary can provide a concise and clear overview of the most important KPIs and measure the return on marketing investment, as well as highlight the main findings and insights from the data analysis. A summary can also include actionable suggestions and best practices for improving the campaign performance and achieving the marketing objectives. A summary is different from other options, such as a real-time monitor, a self-service dashboard, or a spreadsheet of raw data, which may not provide enough context, interpretation, or guidance for the manager. Therefore, the correct answer is D.

Reference: How to Write a Data Analysis Report: 6 Essential Tips, How to Write a Marketing Report (with Pictures) – wikiHow

Question #20

A data analyst for a media company needs to determine the most popular movie genre. Given the table below:

Which of the following must be done to the Genre column before this task can be completed?

  • A . Append
  • B . Merge
  • C . Concatenate
  • D . Delimit

Correct Answer: D

Explanation:

Delimiting is the process of splitting a column of data into multiple columns based on a separator or delimiter character. Delimiting can help separate data that is combined or concatenated in one column into distinct values or categories. For example, if a column contains text values that are separated by commas, such as “Comedy, Suspense”, delimiting can split this column into two columns, one for “Comedy” and one for “Suspense”. Delimiting is different from other options, such as appending, merging, or concatenating, which are methods of combining or joining data from multiple columns or sources. In this case, the data analyst needs to determine the most popular movie genre based on the Genre column in the table. However, this column contains multiple genres for each movie, separated by commas. Therefore, the data analyst must delimit this column before this task can be completed. Therefore, the correct answer is D.
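
A minimal pandas sketch of delimiting a genre column on the comma, using hypothetical rows rather than the table in the question:

    import pandas as pd

    # Hypothetical rows; the real Genre values come from the question's table
    df = pd.DataFrame({
        "Title": ["Movie 1", "Movie 2", "Movie 3"],
        "Genre": ["Comedy, Suspense", "Drama", "Comedy, Drama"],
    })

    # Split (delimit) the Genre column on the comma, then count each individual genre
    genres = df.assign(Genre=df["Genre"].str.split(", ")).explode("Genre")
    print(genres["Genre"].value_counts())   # the most popular genre appears first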

Reference: Split text into different columns with functions – Office Support, How to Split Text in Excel (Using Formulas & Split Function)

Question #21

An e-commerce company recently tested a new website layout. The website was tested by a test group of customers, and an old website was presented to a control group.

The table below shows the percentage of users in each group who made purchases on the websites:

Which of the following conclusions is accurate at a 95% confidence interval?

  • A . In Germany, the increase in conversion from the new layout was not significant.
  • B . In France, the increase in conversion from the new layout was not significant.
  • C . In general, users who visit the new website are more likely to make a purchase.
  • D . The new layout has the lowest conversion rates in the United Kingdom.

Correct Answer: A

Explanation:

The p-value is a measure of how likely it is to observe a difference in conversion rates as large or larger than the one observed, assuming that there is no difference between the groups. A common threshold for statistical significance is 0.05, meaning that there is a 5% or less chance of observing such a difference by chance alone. The table shows the p-values for each country, and we can see that only Germany has a p-value above 0.05 (0.13). This means that we cannot reject the null hypothesis that there is no difference in conversion rates between the test and control groups in Germany. Therefore, the increase in conversion from the new layout was not significant in Germany. For the other countries, the p-values are below 0.05, indicating that the increase in conversion from the new layout was statistically significant. Option A is correct.

Option B is incorrect because the increase in conversion from the new layout was significant in France (p-value = 0.002).

Option C is incorrect because it does not account for the variation across countries. While the overall conversion rate for the test group (8.4%) is higher than the control group (6.8%), this difference may not be statistically significant when we consider the country-specific effects.

Option D is incorrect because the new layout has the highest conversion rate in the United Kingdom (9.6%), not the lowest.
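
For readers who want to reproduce this kind of significance check, a minimal Python sketch of a two-proportion z-test; the conversion counts below are hypothetical, not the figures from the exhibit:

    import math

    def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
        """Two-sided z-test for the difference between two conversion rates."""
        p_a, p_b = conv_a / n_a, conv_b / n_b
        p_pool = (conv_a + conv_b) / (n_a + n_b)            # pooled rate under H0
        se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
        z = (p_a - p_b) / se
        p_value = math.erfc(abs(z) / math.sqrt(2))          # two-sided p-value
        return z, p_value

    # Hypothetical counts for one country (purchasers out of visitors)
    z, p = two_proportion_z_test(conv_a=84, n_a=1000, conv_b=68, n_b=1000)
    print(f"z = {z:.2f}, p-value = {p:.3f}")                # significant if p < 0.05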

Reference:

P-value Calculator & Statistical Significance Calculator

p-value Calculator | Formula | Interpretation

How to obtain the P value from a confidence interval | The BMJ

Confidence Intervals & P-values for Percent Change / Relative Difference

Question #22

An analyst needs to provide a chart to identify the composition between the categories of the survey response data set:

Which of the following charts would be BEST to use?

  • A . Histogram
  • B . Pie
  • C . Line
  • D . Scatter plot
  • E . Waterfall

Correct Answer: B

Explanation:

A pie chart is the best choice to show the composition between the categories of the survey response data set. A pie chart represents the whole with a circle, divided by slices into parts. Each slice shows the relative size of each category as a percentage of the total. A pie chart is useful when the categories are mutually exclusive and add up to 100%. The table shows the favorite color and the number of responses for each color, which can be easily converted into percentages. A pie chart can show how each color contributes to the total number of responses.

Option A is incorrect because a histogram is used to show how data points are distributed along a numerical scale. The survey response data set is not numerical, but categorical.

Option C is incorrect because a line chart is used to show trends or changes over time. The survey response data set does not have a time dimension.

Option D is incorrect because a scatter plot is used to show the relationship between two numerical variables. The survey response data set does not have two numerical variables.

Option E is incorrect because a waterfall chart is used to show how an initial value is increased or decreased by a series of intermediate values. The survey response data set does not have an initial value or intermediate values.
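
A minimal matplotlib sketch of a composition (pie) chart, using hypothetical response counts rather than the table in the question:

    import matplotlib.pyplot as plt

    # Hypothetical survey responses; the real counts come from the question's table
    labels = ["Blue", "Green", "Red", "Yellow"]
    responses = [45, 30, 15, 10]

    # Each slice shows a category's share of the whole, i.e., the composition
    plt.pie(responses, labels=labels, autopct="%1.1f%%")
    plt.title("Favorite color: share of survey responses")
    plt.show()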

Reference:

How to Choose the Right Chart for Your Data – Infogram

How to Choose the Right Data Visualization | Tutorial by Chartio

Find the Best Visualizations for Your Metrics – The Data School

How to choose the best chart or graph for your data

Question #23

Five dogs have the following heights in millimeters: 300, 430, 170, 470, 600

Which of the following is the mean height for the five dogs?

  • A . 394mm
  • B . 405mm
  • C . 493mm
  • D . 504mm

Correct Answer: A

Explanation:

The mean height for the five dogs is calculated by adding up all the heights and dividing by the number of dogs.

The formula is: mean = (300 + 430 + 170 + 470 + 600) / 5 mean = 1970 / 5 mean = 394

Therefore, option A is correct.

Option B is incorrect because 405mm is neither the mean nor the median of the data set (the median, the middle value when the heights are arranged in ascending order, is 430mm).

Option C is incorrect because it is the mean height multiplied by 1.25.

Option D is incorrect because it is the mean height multiplied by 1.28.

Question #24

Which of the following are reasons to create and maintain a data dictionary? (Choose two.)

  • A . To improve data acquisition
  • B . To remember specifics about data fields
  • C . To specify user groups for databases
  • D . To provide continuity through personnel turnover
  • E . To confine breaches of PHI data
  • F . To reduce processing power requirements

Correct Answer: B, D

Explanation:

A data dictionary is a collection of metadata that describes the data elements in a database or dataset, such as their names, definitions, types, sizes, and relationships. Maintaining a data dictionary helps analysts remember specifics about data fields and provides continuity through personnel turnover, because new team members can rely on the documented definitions instead of institutional knowledge. Therefore, options B and D are correct.

Option A is incorrect because improving data acquisition is not the purpose of a data dictionary; a data dictionary documents data that has already been acquired.

Option C is incorrect because specifying user groups for databases is not a function of a data dictionary, but a function of a database management system or a security policy.

Option E is incorrect because confining breaches of PHI data is not a function of a data dictionary, but a function of a data protection or encryption system.

Option F is incorrect because reducing processing power requirements is not a function of a data dictionary, but a function of a data compression or optimization system.

Question #25

A recurring event is being stored in two databases that are housed in different geographical locations. A data analyst notices the event is being logged three hours earlier in one database than in the other database.

Which of the following is the MOST likely cause of the issue?

  • A . The data analyst is not querying the databases correctly.
  • B . The databases are recording different events.
  • C . The databases are recording the event in different time zones.
  • D . The second database is logging incorrectly.

Correct Answer: C

Explanation:

The most likely cause of the issue is that the databases are recording the event in different time zones. For example, if one database is in New York and the other database is in Los Angeles, there is a three-hour difference between them. Therefore, an event that occurs at 12:00 PM in New York would be recorded as 9:00 AM in Los Angeles. To avoid this issue, the databases should either use a common time zone or convert the timestamps to a standard format. Therefore, option C is correct.

Option A is incorrect because a query error would not produce a consistent three-hour offset; the analyst is querying correctly and is simply observing a discrepancy in the timestamps.

Option B is incorrect because the databases are recording the same event, but with different timestamps.

Option D is incorrect because the second database is not logging incorrectly, but rather using a different time zone.
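
A minimal Python sketch showing how the same instant appears three hours apart in two time zones (requires Python 3.9+ for zoneinfo); the timestamp is made up:

    from datetime import datetime
    from zoneinfo import ZoneInfo

    # One event, logged by databases configured for different time zones
    event_utc = datetime(2020, 4, 9, 16, 0, tzinfo=ZoneInfo("UTC"))

    new_york = event_utc.astimezone(ZoneInfo("America/New_York"))
    los_angeles = event_utc.astimezone(ZoneInfo("America/Los_Angeles"))

    print(new_york.strftime("%H:%M %Z"))      # 12:00 EDT
    print(los_angeles.strftime("%H:%M %Z"))   # 09:00 PDT, three hours earlier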

Question #26

Which of the following is an example of a flat file?

  • A . CSV file
  • B . PDF file
  • C . JSON file
  • D . JPEG file

Correct Answer: A

Explanation:

A flat file stores data as plain text in a single table, typically with one record per line and fields separated by a delimiter, and without any structural hierarchy. A CSV (comma-separated values) file is a classic example of a flat file, whereas PDF, JSON, and JPEG files use more complex or binary formats.

Question #27

Refer to the exhibit.

Given the following graph:

Which of the following summary statements upholds integrity in data reporting?

  • A . Sales are approximately equal for Product A and Product B across all strategies.
  • B . Strategy 4 provides the best sales in comparison to other strategies.
  • C . While Strategy 2 does not result in the highest sales of Product D, over all products it appears to be the most effective.
  • D . Product D should be promoted more than the other products in all strategies.

Correct Answer: B

Explanation:

Strategy 4 provides the best sales in comparison to other strategies. This is because the total sales for Strategy 4 are the highest among all the strategies, as shown by the black line. The other statements are not accurate or do not uphold integrity in data reporting.

Here is why:

Statement A is false because sales are not approximately equal for Product A and Product B across all strategies. For example, in Strategy 1, Product A has more sales than Product B, while in Strategy 3, Product B has more sales than Product A.

Statement C is misleading because it draws an overall conclusion that the data does not support. Even where Strategy 2 performs well for some products, Product D has very low sales in Strategy 2 compared to other strategies, so calling it the most effective over all products is not accurate.

Statement D is biased because it does not provide any evidence or justification for why Product D should be promoted more than the other products in all strategies. It also ignores the fact that Product D has the lowest sales among all products in most of the strategies.

Question #28

An analyst is required to run a text analysis of data that is found in articles from a digital news outlet.

Which of the following would be the BEST technique for the analyst to apply to acquire the data?

  • A . Web scraping
  • B . Sampling
  • C . Data wrangling
  • D . ETL

Correct Answer: A

Explanation:

This is because web scraping is a technique that allows the analyst to extract data from web pages, such as articles from a digital news outlet. Web scraping can be done using various tools and methods, such as Python libraries, browser extensions, or online services. The other techniques are not suitable for acquiring data from web pages.

Here is why:

Sampling is a technique that involves selecting a subset of data from a larger population, usually for statistical analysis or testing purposes. Sampling does not help the analyst to acquire data from web pages, but rather to reduce the amount of data to be analyzed.

Data wrangling is a technique that involves transforming and cleaning data to make it suitable for analysis or visualization. Data wrangling does not help the analyst to acquire data from web pages, but rather to improve the quality and usability of the data.

ETL stands for Extract, Transform, and Load, which is a process that involves moving data from one or more sources to a destination, such as a data warehouse or a database. ETL does not help the analyst to acquire data from web pages, but rather to store and organize the data.

Question #29

An analyst runs a report on a daily basis, and the number of datapoints must be validated before the data can be analyzed. The number of datapoints increases each day by approximately 20% of the total number from the day before. On a given day, the number of datapoints was 8,798.

Which of the following should be the total number of datapoints on the next day?

  • A . 7,038
  • B . 9,600
  • C . 10,600
  • D . 10,800

Correct Answer: C

Explanation:

This is because the number of datapoints increases each day by approximately 20% of the total number from the day before.

Therefore, to find the number of datapoints on the next day, multiply the current total by 1.20:

8,798 × 1.20 = 10,557.6

Because the 20% growth rate is approximate, the result is closest to answer choice C, 10,600.


Question #30

An analyst has been tracking company intranet usage and has been asked to create a chart to show the most-used/most-clicked portions of a homepage that contains more than 30 links.

Which of the following visualizations would BEST illustrate this information?

  • A . Scatter plot
  • B . Heat map
  • C . Pie chart
  • D . Infographic

Correct Answer: B

Explanation:

This is because a heat map is a visualization that uses colors to represent different values or intensities of a variable. A heat map can be used to show the most-used/most-clicked portions of a homepage that contains more than 30 links by assigning different colors to each link based on how frequently they are clicked by the users. For example, a link that is clicked very often can be colored red, while a link that is clicked rarely can be colored blue. A heat map can help the analyst to identify which links are more popular or important than others on the homepage. The other visualizations are not as effective as a heat map for this purpose.

Here is why:

A scatter plot is a visualization that uses dots or points to represent the relationship between two variables. A scatter plot cannot show the most-used/most-clicked portions of a homepage that contains more than 30 links because it does not have a clear way of mapping each link to a point on the graph.

A pie chart is a visualization that uses slices or sectors to represent the proportion of each category in a whole. A pie chart cannot show the most-used/most-clicked portions of a homepage that contains more than 30 links because it does not have enough space to display all the categories clearly and accurately.

An infographic is a visualization that uses images, icons, charts, and text to convey information or tell a story. An infographic cannot show the most-used/most-clicked portions of a homepage that contains more than 30 links because it does not have a consistent or standardized way of representing each link and its click frequency.

Question #31

An analyst has generated a report that includes the number of months in the first two quarters of 2019 when sales exceeded $50,000:

Which of the following functions did the analyst use to generate the data in the Sales_indicator column?

  • A . Aggregate
  • B . Logical
  • C . Date
  • D . Sort

Correct Answer: B

Explanation:

This is because a logical function is a type of function that returns a value based on a condition or a set of conditions. A logical function can be used to generate the data in the Sales_indicator column by comparing the values in the Sales column with a threshold of $50,000 and returning either “Exceeded $50,000” or “Not exceeded $50,000” accordingly.

For example, a logical function in Excel that can achieve this is (where B2 is a hypothetical cell holding the monthly sales value):

=IF(B2>50000, "Exceeded $50,000", "Not exceeded $50,000")

The other functions are not suitable for generating the data in the Sales_indicator column.

Here is why:

Aggregate is a type of function that performs a calculation on a group of values, such as sum, average, count, etc. An aggregate function cannot generate the data in the Sales_indicator column because it does not compare the values in the Sales column with a threshold or return a text value based on a condition.

Date is a type of function that manipulates or extracts information from dates, such as year, month, day, etc. A date function cannot generate the data in the Sales_indicator column because it does not use the values in the Sales column or return a text value based on a condition.

Sort is a type of function that arranges the values in a column or a range in ascending or descending order. A sort function cannot generate the data in the Sales_indicator column because it does not create a new column or return a text value based on a condition.


Question #32

While reviewing survey data, an analyst notices respondents entered “Jan,” “January,” and “01” as responses for the month of January.

Which of the following steps should be taken to ensure data consistency?

  • A . Delete any of the responses that do not have “January” written out.
  • B . Replace any of the responses that have “01”.
  • C . Filter on any of the responses that do not say “January” and update them to “January”.
  • D . Sort any of the responses that say “Jan” and update them to “01”.

Correct Answer: C

Explanation:

Filter on any of the responses that do not say “January” and update them to “January”. This is because filtering and updating are data cleansing techniques that can be used to ensure data consistency, which means that the data is uniform and follows a standard format. By filtering on any of the responses that do not say “January” and updating them to “January”, the analyst can make sure that all the responses for the month of January are written in the same way. The other steps are not appropriate for ensuring data consistency.

Here is why:

Deleting any of the responses that do not have “January” written out would result in data loss, which means that some information would be missing from the data set. This could affect the accuracy and reliability of the analysis.

Replacing any of the responses that have “01” would not solve the problem of data inconsistency, because there would still be two different ways of writing the month of January: “Jan” and “January”. This could cause confusion and errors in the analysis.

Sorting any of the responses that say “Jan” and updating them to “01” would also not solve the problem of data inconsistency, because there would still be two different ways of writing the month of January: “01” and “January”. This could also cause confusion and errors in the analysis.
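
A minimal pandas sketch of the filter-and-update approach, using made-up survey responses:

    import pandas as pd

    # Hypothetical survey responses with inconsistent values for the month of January
    months = pd.Series(["Jan", "January", "01", "January", "Jan"])

    # Filter on any response that does not say "January" and update it to "January"
    months[months != "January"] = "January"
    print(months.unique())   # ['January']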

Question #33

Which of the following data cleansing issues will be fixed when a DISTINCT function is applied?

  • A . Missing data
  • B . Duplicate data
  • C . Redundant data
  • D . Invalid data

Correct Answer: B

Explanation:

This is because duplicate data refers to data that is repeated or copied in a data set, which can affect the quality and validity of the analysis. A DISTINCT function is a type of function that removes duplicate values from a column or a table, leaving only unique values.

For example, a DISTINCT function in SQL that can achieve this is (using placeholder table and column names):

SELECT DISTINCT column_name FROM table_name;

The other data cleansing issues will not be fixed by applying a DISTINCT function.

Here is why:

Missing data refers to data that is absent or incomplete in a data set, which can affect the accuracy and reliability of the analysis. A DISTINCT function does not help with missing data, because it does not fill in or impute the missing values.

Redundant data refers to data that is unnecessary or irrelevant for the analysis, which can affect the efficiency and performance of the analysis. A DISTINCT function does not help with redundant data, because it does not remove or filter out the redundant values.

Invalid data refers to data that is incorrect or inaccurate in a data set, which can affect the validity and reliability of the analysis. A DISTINCT function does not help with invalid data, because it does not validate or correct the invalid values.


Question #34

A county in Illinois is conducting a survey to determine the mean annual income per household. The county is 427 sq mi (approximately 1,106 sq km).

Which of the following sampling methods would MOST likely result in a representative sample?

  • A . A stratified phone survey of 100 people that is conducted between 2:00 p.m. and 3:00 p.m.
  • B . A systematic survey that is sent to 100 single-family homes in the county
  • C . Surveys sent to ten randomly selected homes within 5mi (8km) of the county’s office
  • D . Surveys sent to 100 randomly selected homes that are reflective of the population

Correct Answer: D

Explanation:

Surveys sent to 100 randomly selected homes that are reflective of the population. This is because a random sample is a type of sample that is selected by using a random method, such as a lottery or a computer-generated number, which ensures that every element in the population has an equal chance of being selected. A random sample can result in a representative sample, which means that the sample reflects the characteristics and diversity of the population. By sending surveys to 100 randomly selected homes that are reflective of the population, the analyst can ensure that the sample is representative of the county’s households and their income levels. The other sampling methods are not likely to result in a representative sample.

Here is why:

A stratified phone survey of 100 people that is conducted between 2:00 p.m. and 3:00 p.m. would result in a biased sample, which means that the sample favors or excludes certain groups or elements in the population. By conducting the survey only between 2:00 p.m. and 3:00 p.m., the analyst would miss out on people who are not available or reachable at that time, such as those who are working or sleeping. This could affect the representativeness and generalizability of the sample.

A systematic survey that is sent to 100 single-family homes in the county would result in an unrepresentative sample, which means that the sample does not reflect the characteristics and diversity of the population. By sending surveys only to single-family homes, the analyst would ignore other types of households, such as apartments, condos, or mobile homes. This could affect the accuracy and reliability of the sample.

Surveys sent to ten randomly selected homes within 5mi (8km) of the county’s office would result in a small sample, which means that the sample size is too low to capture the variability and diversity of the population. By sending surveys only to ten homes within a limited area, the analyst would miss out on many households that are located in different parts of the county. This could affect the precision and confidence of the sample.
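
A minimal Python sketch of drawing a simple random sample of 100 households; the household list is hypothetical:

    import random

    # Hypothetical list of every household in the county
    households = [f"HH-{i:05d}" for i in range(1, 40_001)]

    random.seed(42)                             # reproducible selection
    sample = random.sample(households, k=100)   # every household has an equal chance
    print(sample[:5])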

Question #35

Which of the following statistical methods requires two or more categorical variables?

  • A . Simple linear regression
  • B . Chi-squared test
  • C . Z-test
  • D . Two-sample t-test

Correct Answer: B

Explanation:

This is because a chi-squared test is a type of statistical method that tests the association or independence between two or more categorical variables, such as gender, race, or occupation. A chi-squared test can be used to compare the observed frequencies of the categories with the expected frequencies under the null hypothesis of no association or independence. For example, a chi-squared test can be used to determine if there is a relationship between smoking and lung cancer. The other statistical methods do not require two or more categorical variables.

Here is why:

Simple linear regression is a type of statistical method that models the relationship between a continuous dependent variable and a single continuous independent variable, such as height, weight, or years of education. A simple linear regression can be used to estimate the slope and intercept of the best-fitting line that describes how the dependent variable changes with the independent variable. For example, a simple linear regression can be used to predict the weight of a person based on their height.

Z-test is a type of statistical method that tests the significance of the difference between a sample mean and a population mean, or between two sample means, when the population standard deviation or the sample sizes are large enough. A z-test can be used to compare the average scores of two groups of students on a standardized test.

Two-sample t-test is a type of statistical method that tests the significance of the difference between two sample means when the population standard deviation is unknown or the sample sizes are small. A two-sample t-test can be used to compare the average salaries of two groups of employees in different departments.
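
A minimal SciPy sketch of a chi-squared test of independence between two categorical variables; the contingency counts are made up:

    from scipy.stats import chi2_contingency

    # Hypothetical 2x2 contingency table: smoking status vs. lung-cancer diagnosis
    observed = [
        [90, 910],    # smokers:     cases, non-cases
        [30, 970],    # non-smokers: cases, non-cases
    ]

    chi2, p_value, dof, expected = chi2_contingency(observed)
    print(f"chi-squared = {chi2:.2f}, p-value = {p_value:.4f}, dof = {dof}")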

Question #36

Which of the following data manipulation techniques is an example of a logical function?

  • A . WHERE
  • B . AGGREGATE
  • C . BOOLEAN
  • D . IF

Correct Answer: D

Explanation:

This is because an IF function is a type of logical function that returns a value based on a condition or a set of conditions. An IF function can be used to manipulate data by applying different actions or calculations depending on whether the condition is true or false.

For example, an IF function in Excel that can achieve this is:

=IF (condition, value_if_true, value_if_false)

The other data manipulation techniques are not examples of logical functions. Here is why:

WHERE is a type of clause that filters data based on a condition or a set of conditions. A WHERE clause can be used to manipulate data by selecting only the rows that satisfy the condition(s).

For example, a WHERE clause in SQL that can achieve this is (using placeholder names):

SELECT column_names FROM table_name WHERE condition;

AGGREGATE is a type of function that performs a calculation on a group of values, such as sum, average, count, etc. An AGGREGATE function can be used to manipulate data by summarizing or aggregating the values in a column or a table.

For example, an AGGREGATE function in SQL that can achieve this is (using placeholder names):

SELECT SUM(column_name) FROM table_name;

BOOLEAN is a type of data type that represents two possible values: true or false. A BOOLEAN data type can be used to manipulate data by storing or returning logical values based on a condition or a set of conditions.

For example, a BOOLEAN data type in Python that can achieve this is (using placeholder variable names):

is_flagged = value > threshold   # evaluates to True or False



Question #37

A sales team wants visibility of current sales numbers, pipeline, and team performance. The team would also like to see calculations of individuals’ earned commissions and projected commissions based on sales, but they want that information to be kept confidential.

Which of the following would be the BEST way to provide this visibility?

  • A . Create a dashboard displaying a data refresh date so users know the current sales numbers and configure permissions to control access.
  • B . Create a dashboard for sales numbers, pipeline, and team and individual performance for the management team.
  • C . Create a dashboard with filters for the overall team, individuals, and management. Users can filter to see the data they want.
  • D . Create a dashboard with views for team, individuals, and management. Configure permissions to control access.

Correct Answer: D

Explanation:

Create a dashboard with views for team, individuals, and management. Configure permissions to control access. This is because a dashboard is a type of visualization that displays multiple charts or graphs on a single page, usually to provide an overview or summary of some data or information. A dashboard can be used to provide visibility of current sales numbers, pipeline, and team performance by showing different metrics and indicators related to these aspects. By creating a dashboard with views for team, individuals, and management, the analyst can customize the content and layout of the dashboard for different audiences and purposes. By configuring permissions to control access, the analyst can ensure that the confidential information, such as individuals’ earned commissions and projected commissions based on sales, is only visible to the authorized users. The other ways are not the best way to provide this visibility.

Here is why:

Creating a dashboard displaying a data refresh date so users know the current sales numbers and configuring permissions to control access would not be sufficient to provide visibility of pipeline and team performance, as well as individuals’ earned commissions and projected commissions based on sales. The dashboard would only show the current sales numbers and the date when the data was updated, which would not give a comprehensive or detailed view of the sales situation.

Creating a dashboard for sales numbers, pipeline, and team and individual performance for the management team would not be appropriate to provide visibility for the sales team, as they would not have access to the dashboard or the information they need. The dashboard would only be available for the management team, which would limit the transparency and collaboration among the sales team members.

Creating a dashboard with filters for the overall team, individuals, and management would not be secure to provide visibility of confidential information, such as individuals’ earned commissions and projected commissions based on sales. The dashboard would allow users to filter and see the data they want, which could expose sensitive or personal information to unauthorized users.

Question #38

Which of the following is a characteristic of a relational database?

  • A . It utilizes key-value pairs.
  • B . It has undefined fields.
  • C . It is structured in nature.
  • D . It uses minimal memory.

Correct Answer: C

Explanation:

It is structured in nature. This is because a relational database is a type of database that organizes data into tables, which consist of rows and columns. A relational database is structured in nature, which means that the data has a predefined schema or format, and follows certain rules and constraints, such as primary keys, foreign keys, or referential integrity. A relational database can be used to store, query, and manipulate data using a structured query language (SQL). The other characteristics are not true for a relational database.

Here is why:

It utilizes key-value pairs. This is not true for a relational database, because key-value pairs are a way of storing data that associates each value with a unique key, such as an identifier or a name. Key-value pairs are typically used in non-relational databases, such as NoSQL databases, which do not have tables, rows, or columns, but rather store data in various formats, such as documents, graphs, or columns.

It has undefined fields. This is not true for a relational database, because fields are another name for columns in a table, which define the attributes or properties of each row or record in the table. Fields have defined names, types, and lengths in a relational database, which specify the format and size of the data that can be stored in each field.

It uses minimal memory. This is not true for a relational database, because memory is the amount of space or storage that is used by a database to store and process data. Memory usage depends on various factors, such as the size, complexity, and number of tables and queries in a relational database. A relational database can use a lot of memory if it has many tables with many rows and columns, or if it performs complex or frequent queries on the data.

Question #39

A data analyst is asked on the morning of April 9, 2020, to create a sales report that identifies sales year to date. The daily sales data is current through the end of the day.

Which of the following date ranges should be on the report?

  • A . January 1, 2020 to April 1, 2020
  • B . January 1, 2020 to April 7, 2020
  • C . January 1, 2020 to April 8, 2020
  • D . January 1, 2020 to April 9, 2020

Correct Answer: D

Explanation:

This is because sales year to date refers to the sales that have occurred from the beginning of the current year until the current date. By creating a sales report that identifies sales year to date, the analyst can measure and compare the sales performance and progress of the current year. Since the analyst is asked to create the sales report on the morning of April 9, 2020, and the daily sales data is current through the end of the day, the date range that should be on the report is January 1, 2020 to April 9, 2020. The other date ranges are not correct for identifying sales year to date.

Here is why:

January 1, 2020 to April 1, 2020 would not include the sales that occurred from April 2 through April 9, which would underestimate the sales year to date.

January 1, 2020 to April 7, 2020 would not include the sales that occurred on April 8 and April 9, which would also underestimate the sales year to date.

January 1, 2020 to April 8, 2020 would not include the sales that occurred on April 9, which would also underestimate the sales year to date.

Question #40

Refer to the exhibit.

Given the following data tables:

Which of the following MDM processes needs to take place FIRST?

  • A . Creation of a data dictionary
  • B . Compliance with regulations
  • C . Standardization of data field names
  • D . Consolidation of multiple data fields

Correct Answer: A

Explanation:

This is because a data dictionary is a type of document that defines and describes the data elements, attributes, and relationships in a database or a data set. A data dictionary can be used to facilitate the MDM (Master Data Management) process, which is a process that aims to ensure the quality, consistency, and accuracy of the data across different sources and systems. By creating a data dictionary first, the analyst can establish a common understanding and standardization of the data field names, types, formats, and meanings, as well as identify any potential issues or conflicts in the data, such as missing values, duplicate values, or inconsistent values. The other MDM processes can take place after creating a data dictionary.

Here is why:

Compliance with regulations is a type of MDM process that ensures that the data meets the legal and ethical requirements and standards of the industry or the organization. Compliance with regulations can take place after creating a data dictionary, because the data dictionary can help the analyst to identify and apply the relevant rules and policies to the data, such as data privacy, security, or retention.

Standardization of data field names is a type of MDM process that ensures that the data field names are consistent and uniform across different sources and systems. Standardization of data field names can take place after creating a data dictionary, because the data dictionary can provide a reference and a guideline for naming and labeling the data fields, as well as resolving any discrepancies or ambiguities in the data field names.

Consolidation of multiple data fields is a type of MDM process that combines or merges the data fields from different sources or systems into a single source or system. Consolidation of multiple data fields can take place after creating a data dictionary, because the data dictionary can help the analyst to map and match the data fields from different sources or systems based on their definitions and descriptions, as well as eliminate any redundant or duplicate data fields.
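To make the idea concrete, a data dictionary can be as simple as a table or structure that records each field's name, type, and meaning. The sketch below is a minimal, hypothetical Python example; the field names are illustrative and not taken from the exhibit:

# A minimal, hypothetical data dictionary: one entry per field,
# recording its type and definition.
data_dictionary = {
    "customer_id": {"type": "integer", "description": "Unique customer identifier"},
    "first_name":  {"type": "text",    "description": "Customer's given name"},
    "order_date":  {"type": "date",    "description": "Date the order was placed (YYYY-MM-DD)"},
}

for field, meta in data_dictionary.items():
    print(f"{field}: {meta['type']} - {meta['description']}")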

Question #41

Which of the following is used for calculations and pivot tables?

  • A . IBM SPSS
  • B . SAS
  • C . Microsoft Excel
  • D . Domo

Correct Answer: C

Explanation:

This is because Microsoft Excel is a type of software application that allows users to create, edit, and analyze data in spreadsheets, which are composed of rows and columns of cells that can store various types of data, such as numbers, text, or formulas. Microsoft Excel can be used for calculations and pivot tables, which are two common features or functions in data analysis. Calculations are mathematical operations or expressions that can be performed on the data in the cells, such as addition, subtraction, multiplication, division, average, sum, etc. Pivot tables are interactive tables that can summarize and display the data in different ways, such as by grouping, filtering, sorting, or aggregating the data based on various criteria or categories. The other software applications are not used for calculations and pivot tables.

Here is why:

IBM SPSS is a software application for statistical analysis and modeling on data sets, such as regression, correlation, or ANOVA. It is organized around data views and variable views rather than general-purpose spreadsheet cells and formulas, so it is not the tool typically used for everyday calculations and pivot tables; its results are presented in output views and charts.

SAS is a type of software application that allows users to perform data management and analysis using a programming language that consists of statements and commands. SAS does not use spreadsheets or cells to store or manipulate data, but rather uses data sets or tables that are stored in libraries or folders. SAS does not have pivot tables as a feature or function, but rather has procedures or macros that can produce summary tables or reports based on the data.

Domo is a type of software application that allows users to create and share dashboards and visualizations that display data from various sources and systems, such as databases, cloud services, or web applications. Domo does not use spreadsheets or cells to store or manipulate data, but rather uses connectors or APIs to access and integrate the data from different sources. Domo does not have pivot tables as a feature or function, but rather has cards or widgets that can show different aspects or metrics of the data.
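For comparison, the sketch below shows the same two ideas from the explanation (a simple calculation and a pivot table) using the pandas library in Python rather than Excel itself; the column names and figures are made up for illustration:

import pandas as pd

sales = pd.DataFrame({
    "region":  ["East", "East", "West", "West"],
    "quarter": ["Q1", "Q2", "Q1", "Q2"],
    "revenue": [2000, 3000, 2500, 3500],
})

# A simple calculation: total revenue across all rows.
total = sales["revenue"].sum()

# A pivot table: revenue summarized by region (rows) and quarter (columns).
pivot = pd.pivot_table(sales, index="region", columns="quarter",
                       values="revenue", aggfunc="sum")

print(total)
print(pivot)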

Question #42

Refer to the exhibit.

Given the following report:

Which of the following components need to be added to ensure the report is point-in-time and static? (Choose two.)

  • A . A control group for the phrases
  • B . A summary of the KPIs
  • C . Filter buttons for the status
  • D . The date when the report was last accessed
  • E . The time period the report covers
  • F . The date on which the report was run

Correct Answer: E, F

Explanation:

The time period the report covers and the date on which the report was run are the two components that need to be added to ensure the report is point-in-time and static, which means that the report shows the data as it was at a specific moment or interval in time and does not change or update with new data. By adding these two components, the analyst can indicate when and for how long the data was collected and analyzed, as well as avoid any confusion or ambiguity about the currency or validity of the data. The other components do not need to be added to ensure the report is point-in-time and static.

Here is why:

A control group for the phrases is a type of group that serves as a baseline or a reference for comparison with another group that is exposed to some treatment or intervention, such as a target phrase in this case. A control group for the phrases does not need to be added to ensure the report is point-in-time and static, because it does not affect the time frame or the stability of the data. However, a control group for the phrases could be useful for evaluating the effectiveness or impact of the target phrases on customer satisfaction or retention.

A summary of the KPIs is a type of document that provides an overview or a highlight of the key performance indicators (KPIs), which are measurable values that indicate how well an organization or a process is achieving its goals or objectives. A summary of the KPIs does not need to be added to ensure the report is point-in-time and static, because it does not affect the time frame or the stability of the data. However, a summary of the KPIs could be useful for communicating or presenting the main findings or insights from the report.

Filter buttons for the status are a type of feature or function that allows users to select or deselect certain values or categories in a column or a table, such as ticket statuses in this case. Filter buttons for the status do not need to be added to ensure the report is point-in-time and static, because they do not affect the time frame or the stability of the data. However, filter buttons for the status could be useful for exploring or analyzing different aspects or segments of the data.

Question #43

An analyst has been asked to validate data quality.

Which of the following are the BEST reasons to validate data for quality control purposes? (Choose two.)

  • A . Retention
  • B . Integrity
  • C . Transmission
  • D . Consistency
  • E . Encryption
  • F . Deletion

Correct Answer: B, D

Explanation:

Integrity and consistency are two of the best reasons to validate data for quality control purposes, which means to check and ensure that the data is accurate, complete, reliable, and usable for the intended analysis or purpose. By validating data for integrity and consistency, the analyst can prevent or correct any errors or issues in the data that could affect the validity or reliability of the analysis or the results.

Here is what integrity and consistency mean in terms of data quality:

Integrity refers to the completeness and validity of the data, which means that the data has no missing, incomplete, or invalid values that could compromise its meaning or usefulness. For example, validating data for integrity could involve checking for null values, outliers, or incorrect data types in the data set.

Consistency refers to the uniformity and standardization of the data, which means that the data follows a common format, structure, or rule across different sources or systems. For example, validating data for consistency could involve checking for spelling, punctuation, or capitalization errors in the data set.

The other reasons are not the best reasons to validate data for quality control purposes.

Here is why:

Retention refers to the storage and preservation of the data, which means that the data is kept and maintained in a secure and accessible way for future use or reference. Retention does not need to be validated for quality control purposes, because it does not affect the accuracy or reliability of the data itself.

Transmission refers to the transfer and exchange of the data, which means that the data is moved or shared between different sources or systems in a fast and efficient way. Transmission does not need to be validated for quality control purposes, because it does not affect the completeness or validity of the data itself.

Encryption refers to the protection and security of the data, which means that the data is encoded or scrambled in a way that prevents unauthorized access or use. Encryption does not need to be validated for quality control purposes, because it does not affect the uniformity or standardization of the data itself.

Deletion refers to the removal and disposal of the data, which means that the data is erased or destroyed in a way that prevents recovery or retrieval. Deletion does not need to be validated for quality control purposes, because it does not affect the meaning or usefulness of the data itself.
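A minimal pandas sketch of the two checks described above, using a made-up data set (the column names and values are illustrative only):

import pandas as pd

df = pd.DataFrame({
    "customer_id": [1, 2, 2, None],
    "gender": ["M", "male", "F", "Female"],
})

# Integrity checks: missing values and duplicate identifiers.
print(df.isnull().sum())                      # count of missing values per column
print(df["customer_id"].duplicated().sum())   # count of duplicate identifiers

# Consistency check: does the gender field use a single standard coding?
print(df["gender"].str.strip().str.upper().str[0].value_counts())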

Question #44

A research analyst wants to determine whether the data being analyzed is connected to other datapoints.

Which of the following is the BEST type of analysis to conduct?

  • A . Trend analysis
  • B . Performance analysis
  • C . Link analysis
  • D . Exploratory analysis

Correct Answer: C

Explanation:

This is because link analysis is a type of analysis that determines whether the data being analyzed is connected to other datapoints, such as entities, events, or relationships. Link analysis can be used to identify and visualize the patterns, networks, or associations among the datapoints, as well as measure the strength, direction, or frequency of the connections. For example, link analysis can be used to determine if there is a connection between a customer’s purchase history and their loyalty program status. The other types of analysis are not the best types of analysis to conduct to determine whether the data being analyzed is connected to other datapoints.

Here is why:

Trend analysis is a type of analysis that determines whether the data being analyzed is changing over time, such as increasing, decreasing, or fluctuating. Trend analysis can be used to identify and visualize the patterns, cycles, or movements in the data points, as well as measure the rate, direction, or magnitude of the changes. For example, trend analysis can be used to determine if there is a change in a company’s sales revenue over a period of time.

Performance analysis is a type of analysis that determines whether the data being analyzed is meeting certain goals or objectives, such as targets, benchmarks, or standards. Performance analysis can be used to identify and visualize the gaps, deviations, or variations in the data points, as well as measure the efficiency, effectiveness, or quality of the outcomes. For example, performance analysis can be used to determine if there is a gap between a student’s test score and their expected score based on their previous performance.

Exploratory analysis is a type of analysis that determines whether there are any insights or discoveries in the data being analyzed, such as patterns, relationships, or anomalies. Exploratory analysis can be used to identify and visualize the characteristics, features, or behaviors of the data points, as well as measure their distribution, frequency, or correlation. For example, exploratory analysis can be used to determine if there are any outliers or unusual values in a dataset.
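As a toy illustration of link analysis, the sketch below builds a small graph of customers and purchased products using the third-party networkx library; the data is invented and only meant to show how connections between datapoints can be represented and measured:

import networkx as nx

G = nx.Graph()
# Each edge links a customer to a product they purchased.
G.add_edge("customer_1", "product_A")
G.add_edge("customer_1", "product_B")
G.add_edge("customer_2", "product_A")

# Degree shows how strongly each node is connected to the rest of the data.
print(dict(G.degree()))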

Question #45

Which of the following variable name formats would be problematic if used in the majority of data software programs?

  • A . First_Name_
  • B . FirstName
  • C . First_Name
  • D . First Name

Correct Answer: D

Explanation:

This is because First Name is a variable name format that would be problematic if used in most of the data software programs, such as Excel, SQL, or Python. This is because First Name contains a space between two words, which could cause confusion or errors in the data software programs, as they might interpret the space as a separator or a delimiter between two different variables or values, rather than as part of a single variable name. For example, in SQL, a space is used to separate keywords, clauses, or expressions in a statement, such as SELECT, FROM, WHERE, etc. Therefore, using First Name as a variable name in SQL could result in a syntax error or an unexpected result. The other variable name formats would not be problematic if used in most of the data software programs.

Here is why:

First_Name_ is a variable name format that uses an underscore (_) to separate two words, which is a common and acceptable practice in most data software programs, as it helps to improve the readability and clarity of the variable name. The trailing underscore is unusual stylistically but still produces a valid identifier. For example, in Python, the PEP 8 style guide recommends lowercase letters separated by underscores for multi-word variable names.

FirstName is a variable name format that uses camel case to join two words, which is another common and acceptable practice in most data software programs, as it helps to reduce the length and complexity of the variable name. For example, in VBA, the macro language behind Excel, mixed-case names such as FirstName are a common naming convention.

First_Name is a variable name format that also uses an underscore (_) to separate two words, which is likewise a common and acceptable practice in most data software programs. For example, in SQL, identifiers made of lowercase letters and underscores are a widely used convention for multi-word column and variable names.
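The sketch below shows, in Python and pandas, why a space in a name is problematic while an underscore is not; the data is invented:

import pandas as pd

# First_Name = "Ada"      # valid Python identifier
# First Name = "Ada"      # SyntaxError: the space splits this into two tokens

df = pd.DataFrame({"First Name": ["Ada"], "Last_Name": ["Lovelace"]})

print(df["First Name"])   # a column name with a space requires bracket notation
print(df.Last_Name)       # an underscore name also works as attribute access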

Question #46

Which of the following describes the method of sampling in which elements of data are selected randomly from each of the small subgroups within a population?

  • A . Simple random
  • B . Cluster
  • C . Systematic
  • D . Stratified

Correct Answer: D

Explanation:

This is because stratified is a type of sampling in which elements of data are selected randomly from each of the small subgroups within a population, such as age groups, gender groups, or income groups. Stratified sampling can be used to ensure that the sample is representative and proportional of the population, as well as reduce the sampling error or bias. For example, stratified sampling can be used to select a sample of voters from different political parties based on their proportion in the population. The other types of sampling are not the types of sampling in which elements of data are selected randomly from each of the small subgroups within a population.

Here is why:

Simple random is a type of sampling in which elements of data are selected randomly from the entire population, without dividing it into any subgroups. Simple random sampling can be used to ensure that every element in the population has an equal chance of being selected, as well as avoid any systematic error or bias. For example, simple random sampling can be used to select a sample of students from a school by using a lottery or a computer-generated number.

Cluster is a type of sampling in which elements of data are selected randomly from a few large subgroups within a population, such as regions, districts, or schools. Cluster sampling can be used to reduce the cost and complexity of sampling, as well as increase the feasibility and convenience of sampling. For example, cluster sampling can be used to select a sample of households from a few neighborhoods by using a map or a list.

Systematic is a type of sampling in which elements of data are selected at regular intervals from an ordered list or sequence within a population, such as every nth element or every kth element. Systematic sampling can be used to simplify and speed up the sampling process, as well as ensure that the sample covers the entire range or scope of the population. For example, systematic sampling can be used to select a sample of books from a library by using an alphabetical order or a numerical order.
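A brief pandas sketch of stratified sampling, drawing randomly from each subgroup; the subgroup column, values, and sampling fraction are illustrative, and GroupBy.sample requires pandas 1.1 or newer:

import pandas as pd

population = pd.DataFrame({
    "age_group": ["18-34"] * 6 + ["35-54"] * 6 + ["55+"] * 6,
    "income":    range(18),
})

# Stratified sample: take 50% at random from each age group,
# so every subgroup is represented in proportion.
sample = (population
          .groupby("age_group", group_keys=False)
          .sample(frac=0.5, random_state=42))

print(sample["age_group"].value_counts())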

Question #47

Given the following customer and order tables:

Which of the following describes the number of rows and columns of data that would be present after performing an INNER JOIN of the tables?

  • A . Five rows, eight columns
  • B . Seven rows, eight columns
  • C . Eight rows, seven columns
  • D . Nine rows, five columns

Correct Answer: B

Explanation:

This is because an INNER JOIN is a type of join that combines two tables based on a matching condition and returns only the rows that satisfy the condition. An INNER JOIN can be used to merge data from different tables that have a common column or a key, such as customer ID or order ID.

To perform an INNER JOIN of the customer and order tables, we can use a SQL statement that selects all the columns (*) from both tables and joins them on the customer ID column, which is the common column between them. The result of this statement is a new table with seven rows and eight columns.

The result table has seven rows and eight columns because:

There are seven rows because an INNER JOIN returns only the combinations of rows that satisfy the join condition. In the exhibit, the customer IDs in the order table match customer records seven times, so seven combined rows appear in the result. Customers without a matching order, and orders without a matching customer, are excluded entirely; such unmatched rows would only appear, padded with null values, in an OUTER JOIN.

There are eight columns because there are four columns in each of the original tables, and all of them are selected and joined in the result table. Therefore, the result table will have four columns from the customer table (customer ID, first name, last name, and email) and four columns from the order table (order ID, order date, product, and quantity).
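The same inner-join behavior can be sketched with pandas. The tables below are invented stand-ins rather than the exhibit's data, so the row and column counts differ from the exam's answer, but the key point is the same: only rows whose customer ID appears in both tables survive the join.

import pandas as pd

customers = pd.DataFrame({"customer_id": [1, 2, 3],
                          "first_name": ["Ann", "Bob", "Cy"]})
orders = pd.DataFrame({"order_id": [10, 11, 12],
                       "customer_id": [1, 1, 4],
                       "product": ["pen", "ink", "pad"]})

# INNER JOIN: keep only rows whose customer_id appears in both tables.
joined = customers.merge(orders, on="customer_id", how="inner")

print(joined)        # 2 rows: customer 1 matched twice; customers 2 and 3
                     # and the order for unknown customer 4 are dropped
print(joined.shape)  # (2, 4)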


Question #48

A development company is constructing a new unit in its apartment complex.

The complex has the following floor plans:

Using the average cost per square foot of the original floor plans, which of the following should be the price of the Rose unit?

  • A . $640,900
  • B . $690,000
  • C . $705,200
  • D . $702,500

Correct Answer: C

Explanation:

This is because the price of the Rose unit can be estimated using the average cost per square foot of the original floor plans, which are Jasmine, Orchid, Azalea, and Tulip.

To find the average cost per square foot of the original floor plans, divide each plan's price by its square footage and average the results across the four plans, using the values shown in the exhibit.

To find the price of the Rose unit, multiply that average cost per square foot by the square footage of the Rose unit.

Therefore, the price of the Rose unit should be $705,200, using the average cost per square foot of the original floor plans.
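A small Python sketch of the method. The floor-plan figures below are made up, so the result will not equal the exam's $705,200; the real values come from the exhibit.

# Hypothetical original floor plans: (price, square_feet)
plans = {
    "Jasmine": (500_000, 1_000),
    "Orchid":  (600_000, 1_200),
    "Azalea":  (650_000, 1_300),
    "Tulip":   (750_000, 1_500),
}

# Average cost per square foot across the original plans.
avg_cost_per_sqft = sum(price / sqft for price, sqft in plans.values()) / len(plans)

# Price of the new unit = average cost per square foot x its square footage.
rose_sqft = 1_400
rose_price = avg_cost_per_sqft * rose_sqft
print(round(rose_price, 2))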


Question #49

Which of the following is a control measure for preventing a data breach?

  • A . Data transmission
  • B . Data attribution
  • C . Data retention
  • D . Data encryption

Correct Answer: D

Explanation:

This is because data encryption is a type of control measure that prevents a data breach, which is an unauthorized or illegal access or use of data by an external or internal party. Data encryption can prevent a data breach by protecting and securing the data using a code or a key that scrambles or transforms the data into an unreadable or incomprehensible format, which can only be decoded or restored by authorized users who have the correct code or key. For example, data encryption can prevent a data breach by encrypting the data in transit or at rest, such as when the data is sent over a network or stored in a device. The other control measures are not used for preventing a data breach.

Here is why:

Data transmission is a type of process that transfers and exchanges data between different sources or systems, such as databases, cloud services, or web applications. Data transmission does not prevent a data breach, but rather exposes the data to potential risks or threats during the transfer or exchange. However, data transmission can be made more secure and less vulnerable to a data breach by using encryption or other methods, such as authentication or authorization.

Data attribution is a type of feature or function that assigns and tracks the ownership and origin of the data, such as the creator, modifier, or source of the data. Data attribution does not prevent a data breach but rather provides information and evidence about the data provenance and history. However, data attribution can be useful for detecting and responding to a data breach by using audit logs or metadata to identify and trace any unauthorized or illegal access or use of the data.

Data retention is a type of policy or standard that specifies and regulates the storage and preservation of the data, such as the duration, location, or format of the data. Data retention does not prevent a data breach, but rather affects the availability and accessibility of the data for future use or reference. However, data retention can be optimized and aligned with the legal and ethical requirements and standards of the industry or the organization to reduce the risk or impact of a data breach.
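A minimal sketch of encrypting data before it is transmitted or stored, using the third-party cryptography package's Fernet interface for symmetric encryption; key management is omitted for brevity:

from cryptography.fernet import Fernet

key = Fernet.generate_key()       # in practice, store and protect this key
cipher = Fernet(key)

token = cipher.encrypt(b"account_number=12345678")  # unreadable without the key
print(token)

original = cipher.decrypt(token)  # only holders of the key can recover the data
print(original)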

Question #50

A user receives a large custom report to track company sales across various date ranges. The user then completes a series of manual calculations for each date range.

Which of the following should an analyst suggest so the user has a dynamic, seamless experience?

  • A . Create multiple reports, one for each needed date range.
  • B . Build calculations into the report so they are done automatically.
  • C . Add macros to the report to speed up the filtering and calculations process.
  • D . Create a dashboard with a date range picker and calculations built in.

Correct Answer: D

Explanation:

Create a dashboard with a date range picker and calculations built in. This is because a dashboard is a type of visualization that displays multiple charts or graphs on a single page, usually to provide an overview or summary of some data or information. A dashboard can be used to track company sales across various date ranges by showing different metrics and indicators related to sales, such as revenue, volume, or growth. By creating a dashboard with a date range picker and calculations built in, the analyst can suggest a way for the user to have a dynamic, seamless experience, which means that the user can interact with and customize the dashboard according to their needs or preferences, as well as avoid any manual work or errors. For example, a date range picker is a type of feature or function that allows users to select or adjust the time period for which they want to see the data on the dashboard, such as daily, weekly, monthly, or quarterly. A date range picker can make the dashboard dynamic, as it can automatically update or refresh the dashboard with new data based on the selected time period. Calculations are mathematical operations or expressions that can be performed on the data on the dashboard, such as addition, subtraction, multiplication, division, average, sum, etc. Calculations can make the dashboard seamless, as they can eliminate the need for manual calculations for each date range, as well as ensure accuracy and consistency of the results. The other ways are not the best ways to provide a dynamic, seamless experience for the user.

Here is why:

Creating multiple reports, one for each needed date range would not provide a dynamic, seamless experience for the user, but rather create a static, cumbersome experience, which means that the user cannot interact with or customize the reports according to their needs or preferences, as well as have to deal with multiple files or pages. For example, creating multiple reports would make it difficult for the user to compare or contrast the sales across different date ranges, as well as increase the workload and complexity of managing and maintaining the reports.

Building calculations into the report so they are done automatically would not provide a dynamic, seamless experience for the user, but rather provide a partial, limited experience, which means that the user can only benefit from one aspect or feature of the report, but not from others. For example, building calculations into the report would help with avoiding manual work or errors, but it would not help with interacting with or customizing the report according to different date ranges.

Adding macros to the report to speed up the filtering and calculations process would not provide a dynamic, seamless experience for the user, but rather provide an advanced, complex experience, which means that the user would need to have some technical skills or knowledge to use or apply the macros, as well as face some potential risks or challenges. For example, adding macros to the report would require the user to know how to write or run the macros, which are a type of code or script that automates certain tasks or actions on the report, such as filtering or calculating the data. Adding macros to the report could also expose the user to some security or compatibility issues, such as viruses, malware, or errors.

Question #51

A table in a hospital database has a column for patient height in inches and a column for patient height in centimeters. This is an example of:

  • A . dependent data.
  • B . duplicate data.
  • C . invalid data
  • D . redundant data

Correct Answer: D

Explanation:

This is because redundant data is data that repeats information already captured elsewhere in the data set, so it adds no new information and can reduce the efficiency and performance of the analysis or process. Redundant data is often caused by multiple data fields storing the same or equivalent information, such as patient height in inches and patient height in centimeters in this case, where one column can be derived directly from the other. Redundant data can be eliminated or reduced by using data cleansing techniques, such as removing or merging the redundant data fields. The other types of data do not describe a field that simply restates information already stored in another field.

Here is what they mean in terms of data quality:

Dependent data is a type of data that relies on or is influenced by another data field or value, such as a formula or a calculation that uses other data fields or values as inputs or outputs. Dependent data can be useful or important for the analysis or purpose, as it can provide additional information or insights based on the existing data.

Duplicate data is a type of data that is repeated or copied in a data set, which can affect the quality and validity of the analysis or process. Duplicate data can be caused by having multiple records or rows that have the same or similar values for one or more data fields or columns, such as customer ID or order ID. Duplicate data can be eliminated or reduced by using data cleansing techniques, such as removing or filtering out the duplicate records or rows.

Invalid data is a type of data that is incorrect or inaccurate in a data set, which can affect the validity and reliability of the analysis or process. Invalid data can be caused by having values that do not match the expected format, type, range, or rule for a data field or column, such as an email address that does not have an @ symbol or a date that does not follow the YYYY-MM-DD format. Invalid data can be eliminated or reduced by using data cleansing techniques, such as validating or correcting the invalid values.

Question #52

While reviewing survey data, a research analyst notices data is missing from all the responses to a single question.

Which of the following methods would BEST address this issue?

  • A . Replace missing data.
  • B . Remove duplicate data.
  • C . Replace redundant data.
  • D . Remove invalid data.

Correct Answer: A

Explanation:

This is because missing data is a type of data quality issue that occurs when data is absent or incomplete in a data set, which can affect the accuracy and reliability of the analysis or process. Missing data can be caused by various factors, such as human error, system error, or non-response. Missing data can be addressed by using various methods, such as replacing missing data, which means filling in or imputing the missing values with some reasonable estimates, such as mean, median, mode, or regression. The other methods are not used to address missing data.

Here is why:

Remove duplicate data is a type of method that eliminates or reduces duplicate data, which is a type of data quality issue that occurs when data is repeated or copied in a data set. Removing duplicate data does not address missing data, but rather affects the quantity and validity of the data.

Replace redundant data is a type of method that eliminates or reduces redundant data, which is a type of data quality issue that occurs when data is unnecessary or irrelevant for the analysis or purpose. Replacing redundant data does not address missing data, but rather affects the efficiency and performance of the analysis or process.

Remove invalid data is a type of method that eliminates or reduces invalid data, which is a type of data quality issue that occurs when data is incorrect or inaccurate in a data set. Removing invalid data does not address missing data, but rather affects the validity and reliability of the analysis or process.
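A short pandas sketch of replacing missing data with a reasonable estimate, here the median of the answered responses; the column name and values are invented:

import pandas as pd

responses = pd.DataFrame({"satisfaction_score": [4, 5, None, 3, None, 4]})

# Replace (impute) the missing answers with the median of the observed ones.
median_score = responses["satisfaction_score"].median()
responses["satisfaction_score"] = responses["satisfaction_score"].fillna(median_score)

print(responses)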

Question #53

Which of the following BEST describes standard deviation?

  • A . A measure that is used to establish a relationship between two variables
  • B . A measure of how data is distributed
  • C . A measure of the amount of dispersion of a set of values
  • D . A measure that is used to find the significant difference between variables

Correct Answer: C

Explanation:

A measure of the amount of dispersion of a set of values. This is because standard deviation is a type of statistical measure that quantifies how much the values in a data set vary or deviate from the mean or the average of the data set. Standard deviation can be used to describe the spread or the distribution of the data, as well as to identify any outliers or extreme values in the data. For example, a low standard deviation indicates that the values are close to the mean, while a high standard deviation indicates that the values are far from the mean. The other options are not correct descriptions of standard deviation.

Here is why:

A measure that is used to establish a relationship between two variables is not a correct description of standard deviation, but rather a description of correlation or regression, which are types of statistical measures that quantify how two variables are related or associated with each other. Correlation or regression can be used to test or model the dependence or the influence of one variable on another variable, as well as to predict or estimate the value of one variable based on the value of another variable.

A measure of how data is distributed is not a correct description of standard deviation, but rather a description of frequency or probability, which are types of statistical measures that quantify how often or how likely a value or an event occurs in a data set. Frequency or probability can be used to describe the occurrence or the chance of the data, as well as to compare or contrast different categories or groups of the data.

A measure that is used to find the significant difference between variables is not a correct description of standard deviation, but rather a description of hypothesis testing or inferential statistics, which are types of statistical methods that use sample data to make generalizations or conclusions about a population or a parameter. Hypothesis testing or inferential statistics can be used to test or verify a claim or an assumption about the data, as well as to measure the confidence or the error of the estimation.
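A quick illustration with Python's standard library, using invented values, of how standard deviation captures dispersion:

import statistics

tight = [48, 49, 50, 51, 52]     # values close to the mean
spread = [10, 30, 50, 70, 90]    # values far from the mean

print(statistics.stdev(tight))   # small standard deviation (about 1.58)
print(statistics.stdev(spread))  # large standard deviation (about 31.62)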

Question #54

A data analyst was asked to create a chart that shows the relationship between study hours and exam scores for each student using the data sets in the table below:

Which of the following charts would BEST represent the relationship between the variables?

  • A . A histogram
  • B . A scatter plot
  • C . A heat map
  • D . A bar chart

Correct Answer: B

Explanation:

This is because a scatter plot is a type of chart that shows the relationship between two variables for each observation or unit in a data set, such as study hours and exam scores for each student in this case. A scatter plot can be used to display and analyze the correlation, trend, or pattern among the variables, as well as identify any outliers or clusters in the data. For example, a scatter plot can show if there is a positive, negative, or no correlation between study hours and exam scores, as well as show if there are any students who have unusually high or low exam scores compared to their study hours. The other charts are not the best charts to represent the relationship between the variables.

Here is why:

A histogram is a type of chart that shows the frequency or the count of values in a single variable for different intervals or bins, such as exam scores for different ranges in this case. A histogram can be used to display and analyze the distribution, shape, or spread of the variable, as well as identify any gaps, peaks, or skewness in the data. For example, a histogram can show if most students have high, low, or average exam scores, as well as show if there are any intervals that have no students at all.

A heat map is a type of chart that shows the intensity or the magnitude of values in two variables for different categories or groups, such as exam scores and study hours for different student names in this case. A heat map can be used to display and analyze the variation, contrast, or comparison among the categories or groups, as well as identify any hot spots, cold spots, or gradients in the data. For example, a heat map can show which students have higher or lower exam scores and study hours than others, as well as show if there is a color pattern that indicates a relationship between exam scores and study hours.

A bar chart is a type of chart that shows the value or the amount of a single variable for different categories or groups, such as exam scores for different student names in this case. A bar chart can be used to display and analyze the comparison, ranking, or proportion among the categories or groups, as well as identify any differences, similarities, or outliers in the data. For example, a bar chart can show which students have higher or lower exam scores than others, as well as show if there are any students who have exceptionally high or low exam scores.
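A minimal matplotlib sketch of the recommended scatter plot; the study-hours and score values are invented:

import matplotlib.pyplot as plt

study_hours = [1, 2, 3, 4, 5, 6]
exam_scores = [55, 60, 65, 72, 80, 88]

# One point per student: x = study hours, y = exam score.
plt.scatter(study_hours, exam_scores)
plt.xlabel("Study hours")
plt.ylabel("Exam score")
plt.title("Relationship between study hours and exam scores")
plt.show()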

Question #55

Refer to the exhibit.

Given the table below:

Which of the following variable types BEST describes the “Year” column?

  • A . Numeric
  • B . Date
  • C . Alphanumeric
  • D . Text

Correct Answer: B

Explanation:

This is because date is a type of variable that represents a specific point or period in time, such as a day, a month, or a year. Date variables can be used to store, manipulate, or analyze temporal data, such as transaction dates, birth dates, or expiration dates. For example, date variables can be used to calculate the duration or the difference between two dates, or to filter or sort the data by date. The other variable types are not correct descriptions of the “Year” column.

Here is why:

Numeric is a type of variable that represents a numerical value, such as an integer, a decimal, or a fraction. Numeric variables can be used to store, manipulate, or analyze quantitative data, such as amounts, prices, or scores. For example, numeric variables can be used to perform arithmetic operations or calculations on the data, or to measure the central tendency or the dispersion of the data.

Alphanumeric is a type of variable that represents a combination of alphabetic and numeric characters, such as letters, numbers, symbols, or spaces. Alphanumeric variables can be used to store, manipulate, or analyze textual data, such as names, addresses, or codes. For example, alphanumeric variables can be used to concatenate or split the data, or to search or match the data using patterns or expressions.

Text is a type of variable that represents a sequence of alphabetic characters, such as letters or words. Text variables can be used to store, manipulate, or analyze textual data, such as names, categories, or labels. For example, text variables can be used to change the case or the length of the data, or to compare or classify the data using criteria or rules.

Question #56

Refer to the exhibit.

Given the following data:

Which of the following BEST describes the data set?

  • A . There is data bias.
  • B . The data is incomplete.
  • C . The data is inconsistent.
  • D . The data is outliers.

Correct Answer: C

Explanation:

This is because inconsistency is a type of data quality issue that occurs when the data does not follow a common format, structure, or rule across different sources or systems, which can affect the efficiency and performance of the analysis or process. Inconsistency can be caused by having different spellings, punctuations, capitalizations, or abbreviations for the same or similar values in a data set, such as “M”, “m”, “Male”, or “male” for gender in this case. Inconsistency can be eliminated or reduced by using data cleansing techniques, such as standardizing or normalizing the data values. The other options are not correct descriptions of the data set.

Here is why:

Data bias is a type of data quality issue that occurs when the data is not representative or proportional of the population or the parameter, which can affect the validity and reliability of the analysis or process. Data bias can be caused by having a sample that is too small, too large, or too skewed for the population or the parameter, such as having only male customers for a product that targets both genders in this case. Data bias can be eliminated or reduced by using sampling techniques, such as stratified or cluster sampling.

Incomplete data is a data quality issue that occurs when values are absent or missing from a data set, which can affect the accuracy and reliability of the analysis or process. Incomplete data can be caused by various factors, such as human error, system error, or non-response, and can be addressed by replacing or imputing the missing values with reasonable estimates, such as the mean, median, mode, or a regression-based value.

Outliers are a data quality issue that occurs when a data set contains values that are unusually high or low compared with the rest of the data, which can affect the quality and validity of the analysis or process. Outliers can be caused by various factors, such as measurement error, natural variation, or extreme events, and can be addressed by removing or filtering them out, or by using robust statistics that are less sensitive to outliers, such as the median, the interquartile range, or a box plot.
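A short pandas sketch of cleaning up the kind of inconsistency described in the explanation; the values are invented:

import pandas as pd

df = pd.DataFrame({"gender": ["M", "m", "Male", "male", "F", "female"]})

# Standardize the inconsistent codes to a single format.
df["gender"] = (df["gender"]
                .str.strip()
                .str.lower()
                .map({"m": "Male", "male": "Male", "f": "Female", "female": "Female"}))

print(df["gender"].value_counts())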

Question #57

An analyst is building a monthly report for production and wants to ensure the audience is aware of its once-a-month cadence.

Which of the following is the MOST important to convey that information?

  • A . The date of the dashboard build
  • B . The data refresh date
  • C . A report summary
  • D . Frequently asked questions

Correct Answer: A

Explanation:

This is because the date of the dashboard build is the most important component to convey that information, which is the once-a-month cadence of the monthly report for production. The date of the dashboard build can convey that information by indicating when the dashboard was created or updated, as well as showing the frequency or interval of the dashboard creation or update. For example, the date of the dashboard build can convey that information by displaying a date format that includes the month and year, such as January 2020, February 2020, etc., or by displaying a text format that includes the word “monthly”, such as Monthly Report for Production – January 2020, Monthly Report for Production – February 2020, etc. The other components are not the most important components to convey that information.

Here is why:

The data refresh date is a component that indicates when the data on the dashboard was refreshed or retrieved from the source or system, such as a database, a cloud service, or a web application. The data refresh date does not convey that information, but rather conveys how current or up-to-date the data on the dashboard is.

A report summary is a component that provides an overview or a highlight of the main findings or insights from the dashboard, such as key metrics, indicators, or trends. A report summary does not convey that information, but rather conveys what the dashboard is about or what it shows.

Frequently asked questions is a component that provides answers or explanations to common or expected questions from the audience or users of the dashboard, such as how to use or interpret the dashboard, what are the assumptions or limitations of the dashboard, etc. Frequently asked questions does not convey that information, but rather conveys how to understand or interact with the dashboard.

Question #58

An analyst is working with the income data of suburban families in the United States. The data set has a lot of outliers, and the analyst needs to provide a measure that represents the typical income.

Which of the following would BEST fulfill the analyst’s goal?

  • A . Median
  • B . Mean
  • C . Mode
  • D . Standard deviation

Correct Answer: A

Explanation:

This is because the median is a statistical measure of central tendency that represents the typical value of a data set: it divides the data set into two equal halves, so that half of the values fall above it and half fall below it. The median is the best measure of the typical income of suburban families in the United States when the data set has a lot of outliers, because it is not affected or skewed by those extreme values; it depends only on the middle value (or middle two values) of the sorted data, regardless of how extreme or distant the outliers are. For example, the median identifies the income that splits the families into two equal groups, such that 50% of the families earn more and 50% earn less. The other statistical measures are not the best measures to represent the typical income of suburban families in the United States.

Here is why:

Mean is a type of statistical measure that represents the average value or central tendency of a data set, which means that it is the sum of all the values divided by the number of values. Mean is not a good measure to represent the typical income of suburban families in the United States, especially when the data set has a lot of outliers, because it is affected or skewed by the outliers, as it takes into account all the values in the data set, regardless of how extreme or distant they are. For example, mean can provide a measure that does not represent the typical income of suburban families in the United States, by finding the income value that is influenced by a few very high or very low incomes, which could make it higher or lower than most of the incomes in the data set.

Mode is a type of statistical measure that represents the most frequent value or mode of a data set, which means that it is the value that occurs most often in the data set. Mode is not a good measure to represent the typical income of suburban families in the United States, especially when the data set has a lot of outliers, because it is not representative or indicative of the central tendency or distribution of the data set, as it only depends on the count or occurrence of a single value or a few values in the data set, regardless of how common or rare they are. For example, mode can provide a measure that does not represent the typical income of suburban families in the United States, by finding the income value that is repeated more often than others, which could be an outlier or an anomaly in the data set.

Standard deviation is a type of statistical measure that represents the amount of dispersion or variation of a data set, which means that it quantifies how much the values in a data set vary or deviate from the mean or average of the data set. Standard deviation is not a measure that represents the typical income of suburban families in the United States, but rather a measure that describes the spread or distribution of their incomes, as well as identifies any outliers or extreme values in their incomes. For example, standard deviation can provide a measure that describes how diverse or homogeneous their incomes are, as well as how far their incomes are from their average income.
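A quick demonstration with Python's standard library of why the median resists outliers while the mean does not; the incomes are invented:

import statistics

incomes = [45_000, 52_000, 58_000, 61_000, 2_500_000]  # one extreme outlier

print(statistics.mean(incomes))    # 543,200 - pulled far upward by the outlier
print(statistics.median(incomes))  # 58,000  - still reflects a typical income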

Question #59

Which of the following would be used to store unstructured data from different sources?

  • A . A data lake
  • B . A database management system
  • C . A database
  • D . A data warehouse

Correct Answer: A

Explanation:

This is because a data lake is a type of storage system that stores unstructured data from different sources, such as text, images, audio, video, etc. A data lake can be used to store unstructured data from different sources by using a schema-on-read approach, which means that it does not impose any structure or format on the data when it is stored, but rather applies it when it is read or accessed. A data lake can also be used to store unstructured data from different sources by using a distributed file system, such as Hadoop, which means that it can store large volumes and varieties of data across multiple servers or nodes. The other storage systems are not used to store unstructured data from different sources.

Here is why:

A database management system is a type of software application that manages and controls databases, which are collections of structured or semi-structured data that are organized into tables, rows, and columns. A database management system is not used to store unstructured data from different sources, but rather to store structured or semi-structured data from specific sources by using a schema-on-write approach, which means that it imposes a structure or format on the data when it is stored, and requires it to follow certain rules and constraints, such as primary keys, foreign keys, or referential integrity.

A database is a type of storage system that stores structured or semi-structured data that are organized into tables, rows, and columns. A database is not used to store unstructured data from different sources, but rather to store structured or semi-structured data from specific sources by using a relational model, which means that it establishes and maintains relationships between different tables based on common columns or keys. A database can also be used to store structured or semi-structured data from specific sources by using a query language, such as SQL, which means that it can access and manipulate the data using statements or commands.

A data warehouse is a type of storage system that stores structured or semi-structured data that are integrated and aggregated from different sources or systems, such as databases, cloud services, or web applications. A data warehouse is not used to store unstructured data from different sources, but rather to store structured or semi-structured data from various sources by using an ETL process, which means that it extracts, transforms, and loads the data into a common format, structure, or schema. A data warehouse can also be used to store structured or semi-structured data from various sources by using an OLAP model, which means that it supports online analytical processing of the data using multidimensional cubes or queries.

Question #60

An analyst is designing a dashboard to determine which site has the highest percentage of new customers. The analyst must choose an appropriate chart to include in the dashboard.

The following data is available:

Which of the following types of charts should be considered to BEST display the data?

  • A . Include a bar chart using the site and the percentage of new customers data.
  • B . Include a line chart using the site and the percentage of new customers data.
  • C . Include a pie chart using the site and percentage of new customers data.
  • D . Include a scatter chart using the site and the percent of new customers data.

Correct Answer: A

Explanation:

This is because a bar chart is a type of chart that shows the value or the amount of a single variable for different categories or groups, such as the percentage of new customers for different sites in this case. A bar chart can be used to display and analyze the comparison, ranking, or proportion among the categories or groups, as well as identify any differences, similarities, or outliers in the data. For example, a bar chart can show which site has the highest or lowest percentage of new customers, as well as show how much each site contributes to the total percentage of new customers. The other types of charts are not the best charts to display the data.

Here is why:

A line chart is a type of chart that shows the change or the trend of a single variable over time, such as the percentage of new customers over months or years in this case. A line chart can be used to display and analyze the movement, cycle, or pattern of the variable, as well as identify any peaks, valleys, or fluctuations in the data. For example, a line chart can show how the percentage of new customers increases or decreases over time, as well as show if there are any seasonal or periodic variations in the data.

A pie chart is a type of chart that shows the proportion or the percentage of a single variable for different categories or groups, such as the percentage of new customers for different sites in this case. A pie chart can be used to display and analyze the composition, distribution, or share of the variable, as well as identify any segments, slices, or fractions in the data. For example, a pie chart can show how much each site represents of the total percentage of new customers, as well as show if there are any dominant or minor sites in the data.

A scatter chart is a type of chart that shows the relationship between two variables for each observation or unit in a data set, such as the percentage of new customers and another variable for each site in this case. A scatter chart can be used to display and analyze the correlation, trend, or pattern among the variables, as well as identify any outliers or clusters in the data. For example, a scatter chart can show if there is a positive, negative, or no correlation between the percentage of new customers and another variable, such as sales revenue or customer satisfaction.
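A minimal matplotlib sketch of the recommended bar chart; the site names and percentages are invented:

import matplotlib.pyplot as plt

sites = ["Site A", "Site B", "Site C"]
pct_new_customers = [22, 35, 18]

plt.bar(sites, pct_new_customers)
plt.ylabel("New customers (%)")
plt.title("Percentage of new customers by site")
plt.show()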

Question #61

A cereal manufacturer wants to determine whether the sugar content of its cereal has increased over the years.

Which of the following is the appropriate descriptive statistic to use?

  • A . Frequency
  • B . Percent change
  • C . Variance
  • D . Mean

Correct Answer: B

Explanation:

This is because percent change is a type of descriptive statistic that measures the relative change or difference of a variable over time, such as the sugar content of cereal over years in this case. Percent change can be used to determine whether the sugar content of cereal has increased over years by comparing the initial and final values of the sugar content, as well as calculating the ratio or proportion of the change. For example, percent change can be used to determine whether the sugar content of cereal has increased over years by finding out how much more (or less) sugar there is in cereal now than before, as well as expressing it as a fraction or a percentage of the original sugar content. The other descriptive statistics are not appropriate to use to determine whether the sugar content of cereal has increased over years.

Here is why:

Frequency is a type of descriptive statistic that measures how often or how likely a value or an event occurs in a data set, such as how many times a certain sugar content appears in cereal in this case. Frequency does not measure the relative change or difference of a variable over time, but rather measures the occurrence or chance of a variable at a given time.

Variance is a type of descriptive statistic that measures how much the values in a data set vary or deviate from the mean or average of the data set, such as how much variation there is in sugar content among different cereals in this case. Variance does not measure the relative change or difference of a variable over time, but rather measures the dispersion or spread of a variable at a given time.

Mean is a type of descriptive statistic that measures the average value or central tendency of a data set, such as what is the typical sugar content of cereal in this case. Mean does not measure the relative change or difference of a variable over time, but rather measures the summary or representation of a variable at a given time.
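The percent-change calculation itself is simple; a sketch with made-up sugar-content values:

sugar_2010 = 9.0   # grams per serving (hypothetical)
sugar_2020 = 12.0  # grams per serving (hypothetical)

percent_change = (sugar_2020 - sugar_2010) / sugar_2010 * 100
print(f"{percent_change:.1f}%")  # 33.3% increase over the period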

Question #62

The process of performing initial investigations on data to spot outliers, discover patterns, and test assumptions with statistical insight and graphical visualization is called:

  • A . a t-test.
  • B . a performance analysis.
  • C . an exploratory data analysis.
  • D . a link analysis.

Correct Answer: C

Explanation:

This is because exploratory data analysis is a type of process that performs initial investigations on data to spot outliers, discover patterns, and test assumptions with statistical insight and graphical visualization, such as box plots, histograms, scatter plots, etc. Exploratory data analysis can be used to understand and summarize the data, as well as to generate hypotheses or questions for further analysis or research. For example, exploratory data analysis can be used to identify and visualize the characteristics, features, or behaviors of the data, as well as to measure their distribution, frequency, or correlation. The other options are not types of processes that perform initial investigations on data to spot outliers, discover patterns, and test assumptions with statistical insight and graphical visualization.

Here is what they mean:

A t-test is a type of statistical method that tests whether there is a significant difference between the means of two groups or samples, such as whether there is a difference between the average exam scores of two classes in this case. A t-test can be used to test or verify a claim or an assumption about the data, as well as to measure the confidence or the error of the estimation.

A performance analysis is a type of process that measures whether the data meets certain goals or objectives, such as targets, benchmarks, or standards. A performance analysis can be used to identify and visualize the gaps, deviations, or variations in the data, as well as to measure the efficiency, effectiveness, or quality of the outcomes. For example, a performance analysis can be used to determine if there is a gap between a student’s test score and their expected score based on their previous performance.

A link analysis is a type of process that determines whether the data is connected to other datapoints, such as entities, events, or relationships. A link analysis can be used to identify and visualize the patterns, networks, or associations among the datapoints, as well as to measure the strength, direction, or frequency of the connections. For example, a link analysis can be used to determine if there is a connection between a customer’s purchase history and their loyalty program status.
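A few typical first steps of exploratory data analysis in pandas, using an invented data set (the plotting line also requires matplotlib to be installed):

import pandas as pd

df = pd.DataFrame({"study_hours": [1, 2, 3, 4, 5, 20],   # 20 looks like an outlier
                   "exam_score":  [55, 60, 65, 72, 80, 85]})

print(df.describe())   # summary statistics: spread, extremes, possible outliers
print(df.corr())       # test an assumed relationship between the variables
df.plot.scatter(x="study_hours", y="exam_score")  # graphical visualization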

Question #63

Different people manually type a series of handwritten surveys into an online database.

Which of the following issues will MOST likely arise with this data? (Choose two.)

  • A . Data accuracy
  • B . Data constraints
  • C . Data attribute limitations
  • D . Data bias
  • E . Data consistency
  • F . Data manipulation

Reveal Solution Hide Solution

Correct Answer: A, E
A, E

Explanation:

Data accuracy refers to the extent to which the data is correct, reliable, and free of errors. When different people manually type a series of handwritten surveys into an online database, there is a high chance of human error, such as typos, misinterpretations, omissions, or duplications. These errors can affect the quality and validity of the data and lead to incorrect or misleading analysis and decisions.

Data consistency refers to the extent to which the data is uniform and compatible across different sources, formats, and systems. When different people manually type a series of handwritten surveys into an online database, there is a high chance of inconsistency, such as different spellings, abbreviations, formats, or standards. These inconsistencies can affect the integration and comparison of the data and lead to confusion or conflicts.

Therefore, to ensure data quality, it is important to have clear and consistent rules and procedures for data entry, validation, and verification. It is also advisable to use automated tools or methods to reduce human error and inconsistency.

Question #64

Which of the following data sampling methods involves dividing a population into subgroups by similar characteristics?

  • A . Systematic
  • B . Simple random
  • C . Convenience
  • D . Stratified

Reveal Solution Hide Solution

Correct Answer: D
D

Explanation:

Stratified sampling is a data sampling method that involves dividing a population into subgroups by similar characteristics, such as age, gender, income, etc. Then, a simple random sample is drawn from each subgroup. This method ensures that each subgroup is adequately represented in the sample and reduces the sampling error.

Reference: CompTIA Data+ Certification Exam Objectives, page 11.
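
A minimal sketch of stratified sampling in Python with pandas (assuming a hypothetical population table with a gender column used as the stratifying characteristic) might look like this:

import pandas as pd

# Hypothetical population with a stratifying characteristic
population = pd.DataFrame({
    "customer_id": range(1, 11),
    "gender": ["F", "F", "F", "F", "F", "F", "M", "M", "M", "M"],
})

# Divide the population into subgroups (strata) by gender, then draw a
# simple random sample of 50% from each subgroup
sample = population.groupby("gender", group_keys=False).sample(frac=0.5, random_state=1)
print(sample)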

Question #65

A data analyst must separate the column shown below into multiple columns for each component of the name:

Which of the following data manipulation techniques should the analyst perform?

  • A . Imputing
  • B . Transposing
  • C . Parsing
  • D . Concatenating

Reveal Solution Hide Solution

Correct Answer: C
C

Explanation:

Parsing is the data manipulation technique that should be used to separate the column into multiple columns for each component of the name. Parsing is the process of breaking down a string of text into smaller units, such as words, symbols, or numbers. Parsing can be used to extract specific information from a text column, such as names, addresses, phone numbers, etc. Parsing can also be used to split a text column into multiple columns based on a delimiter, such as a comma, space, or dash. In this case, the analyst can use parsing to split the column by the comma delimiter and create three new columns: one for the last name, one for the first name, and one for the middle initial. This will make the data more organized and easier to analyze.
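
As a hedged illustration (the exact column name, name order, and delimiter depend on the exhibit, so the values below are made up), parsing a combined name column into separate columns with pandas could look like this:

import pandas as pd

# Hypothetical column holding "last name, first name, middle initial"
df = pd.DataFrame({"full_name": ["Smith, John, A", "Doe, Jane, B"]})

# Parse (split) the text on the comma delimiter into three new columns
parts = df["full_name"].str.split(",", expand=True)
df["last_name"] = parts[0].str.strip()
df["first_name"] = parts[1].str.strip()
df["middle_initial"] = parts[2].str.strip()
print(df)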

Question #66

Which of the following descriptive statistical methods are measures of central tendency? (Choose two.)

  • A . Mean
  • B . Minimum
  • C . Mode
  • D . Variance
  • E . Correlation
  • F . Maximum

Reveal Solution Hide Solution

Correct Answer: A, C
A, C

Explanation:

Mean and mode are measures of central tendency, which describe the typical or most common value in a distribution of data. Mean is the arithmetic average of all the values in a dataset, calculated by adding up all the values and dividing by the number of values. Mode is the most frequently occurring value in a dataset. Other measures of central tendency include median, which is the middle value when the data is sorted in ascending or descending order.
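
For a quick illustration, Python's built-in statistics module computes these measures of central tendency directly; the sample values below are made up.

import statistics

values = [2, 3, 3, 4, 5, 5, 5, 7]

print(statistics.mean(values))    # arithmetic average -> 4.25
print(statistics.mode(values))    # most frequent value -> 5
print(statistics.median(values))  # middle value of the sorted data -> 4.5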

Question #67

Which of the following will MOST likely be streamed live?

  • A . Machine data
  • B . Key-value pairs
  • C . Delimited rows
  • D . Flat files

Reveal Solution Hide Solution

Correct Answer: A
A

Explanation:

Machine data is the most likely type of data to be streamed live, as it refers to data generated by machines or devices, such as sensors, web servers, network devices, etc. Machine data is often produced continuously and in large volumes, requiring real-time processing and analysis. Other types of data, such as key-value pairs, delimited rows, and flat files, are more likely to be stored in databases or files and processed in batches.

Question #68

A database consists of one fact table that is composed of multiple dimensions. Depending on the dimension, each one can be represented by a denormalized table or multiple normalized tables.

This structure is an example of a:

  • A . transactional schema.
  • B . star schema.
  • C . non-relational schema.
  • D . snowflake schema.

Reveal Solution Hide Solution

Correct Answer: B
B

Explanation:

A star schema is a type of database schema that consists of one fact table joined to multiple dimension tables. A fact table contains quantitative measures, or facts, related to a specific event or transaction. A dimension table contains descriptive attributes, or dimensions, that provide context for the facts. The schema is called a star because it resembles one, with the fact table at the center and the dimension tables radiating from it.

A star schema is a type of dimensional schema, which is designed for data warehousing and analytical purposes. Other dimensional schemas include the snowflake schema and the galaxy schema. A snowflake schema is similar to a star schema, except that some or all of the dimension tables are normalized into multiple tables. A galaxy schema consists of multiple fact tables that share some common dimension tables.

A transactional schema is designed for operational purposes, such as recording day-to-day transactions and activities, and is usually normalized to reduce data redundancy and improve data integrity. A non-relational schema does not follow the relational model, which organizes data into tables with rows and columns; it can store data in other formats, such as documents, graphs, or key-value pairs.

Question #69

A data analyst is designing a dashboard that will provide a story of sales and determine which site is providing the highest sales volume per customer. The analyst must choose an appropriate chart to include in the dashboard.

The following data is available:

Which of the following types of charts should be considered?

  • A . Include a line chart using the site and average sales per customer.
  • B . Include a pie chart using the site and sales to average sales per customer.
  • C . Include a scatter chart using sales volume and average sales per customer.
  • D . Include a column chart using the site and sales to average sales per customer.

Reveal Solution Hide Solution

Correct Answer: C
C

Explanation:

A scatter chart using sales volume and average sales per customer is the best type of chart to include in the dashboard. A scatter chart displays the relationship between two numerical variables using dots or markers. It can show how one variable relates to another, how strong the correlation is between them, and how the data points are distributed.

In this case, a scatter chart can tell the story of sales and determine which site is providing the highest sales volume per customer by plotting sales volume on the x-axis and average sales per customer on the y-axis. Each dot represents a site, so the analyst can compare the sites based on their position on the chart. A site with high sales volume and high average sales per customer falls in the upper right quadrant, indicating high performance. A site with low sales volume and low average sales per customer falls in the lower left quadrant, indicating low performance. A site with high sales volume but low average sales per customer falls in the lower right quadrant (high volume but low value), and a site with low sales volume but high average sales per customer falls in the upper left quadrant (low volume but high value).

A scatter chart can also show whether there is a positive correlation, a negative correlation, or no correlation between the two variables. A positive correlation means that as one variable increases, so does the other; a negative correlation means that as one variable increases, the other decreases; no correlation means there is no relationship between them.

The other chart types are less suitable for this purpose. A line chart displays the change of one or more variables over time, which does not apply here because there is no time variable. A pie chart displays the proportion of each category in a whole, so it cannot show the relationship between two numerical variables. A column chart compares values across categories using vertical bars, so it could rank the sites by a single measure but could not show the relationship between sales volume and average sales per customer.
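
A minimal matplotlib sketch of such a scatter chart, using made-up site names and values rather than the exhibit data, might look like this:

import matplotlib.pyplot as plt

# Hypothetical values: sales volume and average sales per customer for four sites
sites = ["Site A", "Site B", "Site C", "Site D"]
sales_volume = [12000, 8500, 15000, 6000]
avg_sales_per_customer = [45.0, 62.5, 38.0, 70.0]

plt.scatter(sales_volume, avg_sales_per_customer)
for name, x, y in zip(sites, sales_volume, avg_sales_per_customer):
    plt.annotate(name, (x, y))  # label each point with its site

plt.xlabel("Sales volume")
plt.ylabel("Average sales per customer")
plt.title("Sales volume vs. average sales per customer by site")
plt.show()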

Question #70

An analyst needs to conduct a quick analysis.

Which of the following is the FIRST step the analyst should perform with the data?

  • A . Conduct an exploratory analysis and use descriptive statistics.
  • B . Conduct a trend analysis and use a scatter chart.
  • C . Conduct a link analysis and illustrate the connection points.
  • D . Conduct an initial analysis and use a Pareto chart.

Reveal Solution Hide Solution

Correct Answer: A
A

Explanation:

The first step the analyst should perform with the data is to conduct an exploratory analysis and use descriptive statistics. Exploratory analysis is a type of analysis that aims to summarize the main characteristics of the data, identify patterns, outliers, and relationships, and generate hypotheses for further investigation. Descriptive statistics are numerical measures that describe the central tendency, variability, and distribution of the data, such as mean, median, mode, standard deviation, range, quartiles, etc. Exploratory analysis and descriptive statistics can help the analyst gain a better understanding of the data and its quality, as well as prepare the data for further analysis.

Question #71

A data analyst has been asked to create a sales report that calculates the rolling 12-month average for sales.

If the report will be published on November 1, 2020, which of the following periods should the report cover?

  • A . October 1, 2019 to October 31, 2020
  • B . October 31, 2020 to November 1, 2021
  • C . November 1, 2019 to October 31, 2020
  • D . October 31, 2019 to October 31, 2020

Reveal Solution Hide Solution

Correct Answer: C
C

Explanation:

The report should cover the months from November 1, 2019 to October 31, 2020. A rolling 12-month average is a type of moving average that calculates the average of the most recent 12 months of data. It is useful for smoothing out seasonal fluctuations and identifying long-term trends. To calculate the rolling 12-month average for sales as of November 1, 2020, the analyst needs the sales data from the previous 12 complete months, starting on November 1, 2019 and ending on October 31, 2020. The other options either span more than 12 months or cover the wrong period.
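
As a sketch of the calculation (using a hypothetical monthly sales series rather than real data), a rolling 12-month average can be computed with pandas like this:

import pandas as pd
import numpy as np

# Hypothetical monthly sales from November 2019 through October 2020
months = pd.date_range("2019-11-01", "2020-10-01", freq="MS")
sales = pd.Series(np.random.default_rng(0).integers(800, 1200, len(months)), index=months)

# Rolling 12-month average: each value is the mean of the previous 12 months
rolling_avg = sales.rolling(window=12).mean()

# Only the final month (October 2020) has a full 12-month window in this series
print(rolling_avg)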

Question #72

A data analyst has been asked to merge the tables below, first performing an INNER JOIN and then a LEFT JOIN:

Customer Table –

In-store Transactions C

Which of the following describes the number of rows of data that can be expected after performing both joins in the order stated, considering the customer table as the main table?

  • A . INNER: 6 rows; LEFT: 9 rows
  • B . INNER: 9 rows; LEFT: 6 rows
  • C . INNER: 9 rows; LEFT: 15 rows
  • D . INNER: 15 rows; LEFT: 9 rows

Reveal Solution Hide Solution

Correct Answer: C
C

Explanation:

An INNER JOIN returns only the rows that match the join condition in both tables. A LEFT JOIN returns all the rows from the left table, and the matched rows from the right table, or NULL if there is no match. In this case, the customer table is the left table and the in-store transactions table is the right table. The join condition is based on the customer_id column, which is common in both tables.

To perform an INNER JOIN, we can use the following SQL query:

SELECT * FROM customer INNER JOIN in_store_transactions ON customer.customer_id = in_store_transactions.customer_id;

This query will return 9 rows of data, as shown below:

customer_id | name | lastname | gender | marital_status | transaction_id | amount | date
1 | MARC | TESCO | M | Y | 1 | 1000 | 2020-01-01
1 | MARC | TESCO | M | Y | 2 | 5000 | 2020-01-02
2 | ANNA | MARTIN | F | N | 3 | 2000 | 2020-01-03
2 | ANNA | MARTIN | F | N | 4 | 3000 | 2020-01-04
3 | EMMA | JOHNSON | F | Y | 5 | 4000 | 2020-01-05
4 | DARIO | PENTAL | M | N | 6 | 5000 | 2020-01-06
5 | ELENA | SIMSON | F | N | 7 | 6000 | 2020-01-07
6 | TIM | ROBITH | M | N | 8 | 7000 | 2020-01-08
7 | MILA | MORRIS | F | N | 9 | 8000 | 2020-01-09

To perform a LEFT JOIN, we can use the following SQL query:

SELECT * FROM customer LEFT JOIN in_store_transactions ON customer.customer_id = in_store_transactions.customer_id;

This query keeps every row from the customer table and fills the transaction columns with NULL where no matching transaction exists, as illustrated below:

customer_id | name | lastname | gender | marital_status | transaction_id | amount | date
1 | MARC | TESCO | M | Y | 1 | 1000 | 2020-01-01
1 | MARC | TESCO | M | Y | 2 | 5000 | 2020-01-02
2 | ANNA | MARTIN | F | N | 3 | 2000 | 2020-01-03
2 | ANNA | MARTIN | F | N | 4 | 3000 | 2020-01-04
3 | EMMA | JOHNSON | F | Y | 5 | 4000 | 2020-01-05
4 | DARIO | PENTAL | M | N | 6 | 5000 | 2020-01-06
5 | ELENA | SIMSON | F | N | 7 | 6000 | 2020-01-07
6 | TIM | ROBITH | M | N | 8 | 7000 | 2020-01-08
7 | MILA | MORRIS | F | N | 9 | 8000 | 2020-01-09
8 | JENNY | DWARTH | F | Y | NULL | NULL | NULL

As you can see, the customers who do not have any transactions (customer_id = 8) are still included in the result, but with NULL values for the transaction_id, amount, and date columns. Therefore, the correct answer is C: INNER: 9 rows; LEFT: 15 rows.

Reference: SQL Joins – W3Schools
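
The same comparison can be sketched outside of SQL with pandas merges; the frames below reuse the illustrative customer and transaction rows shown above, abbreviated to a few columns.

import pandas as pd

customers = pd.DataFrame({
    "customer_id": [1, 2, 3, 4, 5, 6, 7, 8],
    "name": ["MARC", "ANNA", "EMMA", "DARIO", "ELENA", "TIM", "MILA", "JENNY"],
})
transactions = pd.DataFrame({
    "transaction_id": range(1, 10),
    "customer_id": [1, 1, 2, 2, 3, 4, 5, 6, 7],
    "amount": [1000, 5000, 2000, 3000, 4000, 5000, 6000, 7000, 8000],
})

# INNER JOIN: only customers that have at least one matching transaction
inner = customers.merge(transactions, on="customer_id", how="inner")

# LEFT JOIN: every customer row; unmatched customers get NaN transaction values
left = customers.merge(transactions, on="customer_id", how="left")

print(len(inner), len(left))  # 9 matched rows; 10 rows including JENNY with no transactions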

Question #73

A data analyst needs to create a weekly recurring report on sales performance and distribute it to all sales managers.

Which of the following would be the BEST method to automate and ensure successful delivery for this task?

  • A . Use scheduled report delivery.
  • B . Implement subscription access delivery.
  • C . Print out a copy.
  • D . Upload the report to the server.

Reveal Solution Hide Solution

Correct Answer: A
A

Explanation:

Scheduled report delivery is a feature that allows a data analyst to automate the generation and distribution of a report at a specified time and frequency. This would be the best method to ensure that the sales managers receive the weekly report on sales performance without manual intervention. Subscription access delivery is a feature that allows users to subscribe to a report and access it on demand, but it does not automate the delivery. Printing out a copy or uploading the report to the server are manual methods that require more time and effort from the data analyst.

Reference: CertMaster Practice for Data+ Exam Prep – CompTIA

Question #74

Which of the following is an example of a discrete variable?

  • A . The temperature of a hot tub
  • B . The height of a horse
  • C . The time to complete a task
  • D . The number of people in an office

Reveal Solution Hide Solution

Correct Answer: D
D

Explanation:

A discrete variable is a variable that can only take on a finite number of values, such as integers or categories. The number of people in an office is an example of a discrete variable, as it can only be a whole number. The temperature of a hot tub, the height of a horse, and the time to complete a task are examples of continuous variables, as they can take on any value within a range.

Reference: CompTIA Data+ (DA0-001) Practice Certification Exams | Udemy

Question #75

Which of the following data types would a telephone number formatted as XXX-XXX-XXXX be considered?

  • A . Numeric
  • B . Date
  • C . Float
  • D . Text

Reveal Solution Hide Solution

Correct Answer: D
D

Explanation:

A telephone number formatted as XXX-XXX-XXXX would be considered a text data type, as it is composed of alphanumeric characters and symbols. A numeric data type is composed of only numbers, such as integers or decimals. A date data type is composed of values that represent dates or times, such as YYYY-MM-DD or HH:MM:SS. A float data type is composed of numbers with fractional parts, such as 3.14 or 0.5.

Reference: Guide to CompTIA Data+ and Practice Questions – Pass Your Cert

Question #76

The director of operations at a power company needs data to help identify where company resources should be allocated in order to monitor activity for outages and restoration of power in the entire state.

Specifically, the director wants to see the following:

* County outages

* Status

* Overall trend of outages

INSTRUCTIONS:

Please, select each visualization to fit the appropriate space on the dashboard and choose an appropriate color scheme. Once you have selected all visualizations, please, select the appropriate titles and labels, if applicable. Titles and labels may be used more than once.

If at any time you would like to bring back the initial state of the simulation, please click the Reset All button.

Reveal Solution Hide Solution

Correct Answer: This is a simulation question that requires you to create a dashboard with visualizations that meet the director’s needs.

Here are the steps to complete the task:

Drag and drop the visualization that shows the county outages on the top left space of the dashboard. This visualization is a map of the state with different colors indicating the number of outages in each county. You can choose any color scheme that suits your preference, but make sure that the colors are consistent and clear. For example, you can use a gradient of red to show the counties with more outages and green to show the counties with less outages.

Drag and drop the visualization that shows the status of the outages on the top right space of the dashboard. This visualization is a pie chart that shows the percentage of outages that are active, restored, or pending. You can choose any color scheme that suits your preference, but make sure that the colors are distinct and easy to identify. For example, you can use red for active, green for restored, and yellow for pending.

Drag and drop the visualization that shows the overall trend of outages on the bottom space of the dashboard. This visualization is a line graph that shows the number of outages over time. You can choose any color scheme that suits your preference, but make sure that the color is visible and contrasted with the background. For example, you can use blue for the line and white for the background.

Select appropriate titles and labels for each visualization. Titles and labels may be used more than once. For example, you can use “County Outages” as the title for the map, “Status” as the title for the pie chart, and “Trend” as the title for the line graph. You can also use “County”, “Number of Outages”, “Active”, “Restored”, “Pending”, “Time”, and “Number of Outages” as labels for the axes and legends of the visualizations.

Question #77

Q3 2020 has just ended, and now a data analyst needs to create an ad-hoc sales report that demonstrates how well the Q3 2020 promotion went versus last year’s Q3 promotion.

Which of the following date parameters should the analyst use?

  • A . 2019 vs. YTD 2020
  • B . Q3 2019 vs. Q3 2020
  • C . YTD 2019 vs. YTD 2020
  • D . Q4 2019 vs. Q3 2020

Reveal Solution Hide Solution

Correct Answer: B
B

Explanation:

The date parameters that the analyst should use are Q3 2019 vs. Q3 2020, as this will allow the analyst to compare the sales performance of the Q3 2020 promotion with the same period of last year. This will help to eliminate any seasonal or cyclical effects that might affect the sales data. The other options are not relevant for this purpose, as they either compare different quarters or different years.

Reference: CertMaster Practice for Data+ Exam Prep – CompTIA

Question #78

A data analyst has been asked to create an ad-hoc sales report for the Chief Executive Officer (CEO).

Which of the following should be included in the report?

  • A . The sales representatives’ home addresses.
  • B . Line-item SKU numbers.
  • C . YTD total sales.
  • D . The customers’ first and last names.

Reveal Solution Hide Solution

Correct Answer: C
C

Explanation:

The report for the CEO should include YTD total sales, as this will provide a high-level overview of the sales performance of the company and show how it is meeting its annual goals. The other options are not appropriate for the CEO, as they are either too detailed or irrelevant for the report. The sales representatives’ home addresses, line-item SKU numbers, and customers’ first and last names are not related to the sales performance and might compromise the privacy and security of the data.

Reference: CompTIA Data+ (DA0-001) Practice Certification Exams | Udemy

Question #79

Which of the following can be used to translate data into another form so it can only be read by a user who has a key or a password?

  • A . Data encryption.
  • B . Data transmission.
  • C . Data protection.
  • D . Data masking.

Reveal Solution Hide Solution

Correct Answer: A
A

Explanation:

Data encryption can be used to translate data into another form so it can only be read by a user who has a key or a password. Data encryption is a process of transforming data using an algorithm or a cipher to make it unreadable to anyone except those who have the key or the password to decrypt it. Data encryption is a common method of protecting data from unauthorized access, modification, or theft.

Reference: Guide to CompTIA Data+ and Practice Questions – Pass Your Cert
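
A small sketch of symmetric data encryption in Python, assuming the third-party cryptography package is installed, could look like this:

from cryptography.fernet import Fernet

# Generate a secret key; only holders of this key can decrypt the data
key = Fernet.generate_key()
cipher = Fernet(key)

# Translate the data into an unreadable form (ciphertext)
ciphertext = cipher.encrypt(b"Customer account number: 12345678")

# Anyone without the key sees only the ciphertext; the key holder can decrypt it
print(ciphertext)
print(cipher.decrypt(ciphertext))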

Question #80

Which of the following is an example of a discrete data type?

  • A . 8in (20cm)
  • B . 5 kids
  • C . 2.5mi (4km)
  • D . 10.7lbs (4.9kg)

Reveal Solution Hide Solution

Correct Answer: B
B

Explanation:

A discrete data type is a data type that can only take on a finite number of values, such as integers or categories. An example of a discrete data type is the number of kids, as it can only be a whole number. The other options are examples of continuous data types, as they can take on any value within a range. The length in inches or centimeters, the distance in miles or kilometers, and the weight in pounds or kilograms are all continuous data types.

Reference: CompTIA Data+ (DA0-001) Practice Certification Exams | Udemy

Question #81

Which of the following contains alphanumeric values?

  • A . 10.1E2
  • B . 13.6
  • C . 1347
  • D . A3J7

Reveal Solution Hide Solution

Correct Answer: D
D

Explanation:

Alphanumeric values are values that contain both letters and numbers, such as A3J7. The other options are numeric values, as they contain only numbers, such as 10.1E2, 13.6, and 1347.

Reference: Guide to CompTIA Data+ and Practice Questions – Pass Your Cert

Question #82

A junior web developer is developing a new application where users can upload short videos. The first task is to create a homepage that shows the headline "Upload Your Short Videos" and a clickable button that says "upload now".

Which of the following HTML commands would help the developer to complete the task successfully?

  • A . <span>Upload Your Short Videos</span><button>upload now</button>
  • B . <p>Upload Your Short Videos</p><p>upload now</p>
  • C . <h1>Upload Your Short Videos</h1><button>upload now</button>
  • D . <h1>Upload Your Short Videos</h1><h1>upload now</h1>

Reveal Solution Hide Solution

Correct Answer: C
C

Explanation:

The HTML that completes the task is <h1>Upload Your Short Videos</h1> followed by <button>upload now</button>. The <h1> tag defines a level-1 heading, the largest and most important heading on a webpage, which suits the headline. The <button> tag defines a clickable button that can trigger an action when clicked. The other options are unsuitable: the <span> tag in option A defines a generic inline section of text with no heading semantics, the <p> tags in option B only create paragraphs and no clickable button, and option D renders "upload now" as a second heading instead of a button.

Reference: HTML Tags – W3Schools

Question #83

A web developer wants to ensure that malicious users cannot type SQL statements when they are asked for input, such as their username or user ID.

Which of the following query optimization techniques would effectively prevent SQL Injection attacks?

  • A . Indexing.
  • B . Subset of records.
  • C . Temporary table in the query set.
  • D . Parametrization.

Reveal Solution Hide Solution

Correct Answer: D
D

Explanation:

The correct answer is D: Parametrization. Parameterized SQL queries place parameters (placeholders) in the SQL statement instead of concatenating user input into the query text. The parameter values are supplied only when the query is executed, so the database treats them strictly as data rather than as executable SQL, and the same query can be reused with different values. For example, the following query uses a placeholder for the course name: SELECT * FROM ExamsDigest WHERE coursename = ? ORDER BY tagname. SQL injection is best prevented through the use of parameterized (prepared) statements.
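
A minimal sketch with Python's built-in sqlite3 module shows the same idea: the user-supplied value is bound to a placeholder and never concatenated into the SQL string. The table and column names below are made up for illustration.

import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE users (userid INTEGER, username TEXT)")
cur.execute("INSERT INTO users VALUES (1, 'alice')")

# User input is bound to the ? placeholder, so it is treated as data, not SQL
user_input = "alice' OR '1'='1"
cur.execute("SELECT * FROM users WHERE username = ?", (user_input,))
print(cur.fetchall())  # returns no rows: the injection attempt is just a literal string

conn.close()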

Question #84

Consider the following dataset which contains information about houses that are for sale:

Which of the following string manipulation commands will combine the address and region name columns to create a full address?

full_address
85 Turner St, Northern Metropolitan
25 Bloomburg St, Northern Metropolitan
5 Charles St, Northern Metropolitan
40 Federation La, Northern Metropolitan
55a Park St, Northern Metropolitan

  • A . SELECT CONCAT(address, ' , ', regionname) AS full_address FROM melb LIMIT 5;
  • B . SELECT CONCAT(address, '-', regionname) AS full_address FROM melb LIMIT 5;
  • C . SELECT CONCAT(regionname, ' , ', address) AS full_address FROM melb LIMIT 5
  • D . SELECT CONCAT(regionname, '-', address) AS full_address FROM melb LIMIT 5;

Reveal Solution Hide Solution

Correct Answer: A
A

Explanation:

The correct answer is A: SELECT CONCAT(address, ' , ', regionname) AS full_address FROM melb LIMIT 5; String manipulation (or string handling) is the process of changing, parsing, splicing, pasting, or analyzing strings, and SQL is used for managing data in a relational database. The CONCAT() function joins two or more strings together; its syntax is CONCAT(string1, string2, ..., string_n), where string1 through string_n are the required strings to combine. Because the full address should begin with the street address followed by the region name, the address column must be listed first and the two values separated by a comma, which is what option A does.

Question #85

The ACME Corporation hired an analyst to detect data quality issues in their Excel documents.

Which of the following are the most common issues? (Select TWO)

  • A . Apostrophe.
  • B . Commas.
  • C . Symbols.
  • D . Duplicates.
  • E . Misspellings.

Reveal Solution Hide Solution

Correct Answer: D, E
D, E

Explanation:

Duplicates and misspellings are the most common data quality issues found in manually maintained Excel documents. Duplicate rows appear when the same record is entered more than once, and misspellings are introduced by manual typing; both distort counts, aggregations, and joins, so they should be identified and corrected before analysis.


Question #88

Consider this dataset showing the retirement age of 11 people, in whole years:

54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60

The table below shows a simple frequency distribution of the retirement age data. Which of the following values is the mode of this distribution?

  • A . 56
  • B . 55
  • C . 57
  • D . 54

Reveal Solution Hide Solution

Correct Answer: D
D

Explanation:

A measure of central tendency (also referred to as measures of centre or central location) is a summary measure that attempts to describe a whole set of data with a single value that represents the middle or centre of its distribution.

There are three main measures of central tendency: the mode, the median and the mean. Each of these measures describes a different indication of the typical or central value in the distribution.

What is the mode?

The mode is the most commonly occurring value in a distribution.

The most commonly occurring value is 54, therefore the mode of this distribution is 54 years.

Question #89

Which of the following values is the range, a measure of dispersion, of the scores of ten students on a test?

The scores of ten students in a test are 17, 23, 30, 36, 45, 51, 58, 66, 72, 77.

  • A . 90
  • B . 60
  • C . 70
  • D . 80

Reveal Solution Hide Solution

Correct Answer: B
B

Explanation:

The correct answer is: 60

Range is the interval between the highest and the lowest score.

Range is a measure of the variability or scatter of the observations among themselves; it does not describe how the observations are spread around a central value. Symbolically, R = Hs – Ls.

Where R = Range; Hs is the ‘Highest score’ and Ls is the Lowest Score.

The scores of ten students in a test are: 17, 23, 30, 36, 45, 51, 58, 66, 72, 77.

The highest score is 77 and the lowest score is 17.

So the range is the difference between these two scores: Range = 77 – 17 = 60.
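
The same calculation in Python, as a quick sketch:

scores = [17, 23, 30, 36, 45, 51, 58, 66, 72, 77]

# Range = highest score - lowest score
score_range = max(scores) - min(scores)
print(score_range)  # 60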

Question #90

A data scientist wants to see which products make the most money and which products attract the most customer purchasing interest in their company.

Which of the following data manipulation techniques would he use to obtain this information?

  • A . Data append
  • B . Data blending
  • C . Normalize data
  • D . Data merge

Reveal Solution Hide Solution

Correct Answer: B
B

Explanation:

The correct answer is B: Data blending.

Data blending is combining multiple data sources to create a single, new dataset, which can be presented visually in a dashboard or other visualization and can then be processed or analyzed. Enterprises get their data from a variety of sources, and users may want to temporarily bring together different datasets to compare data relationships or answer a specific question.

Data append is incorrect. Data append is a process that involves adding new data elements to an existing database. An example of a common data append would be the enhancement of a company’s customer files: a data append takes the information they have, matches it against a larger database of business data, and allows the desired missing data fields to be added.

Normalize data is incorrect. Data normalization is the process of structuring your relational customer database, following a series of normal forms. This improves the accuracy and integrity of your data while ensuring that your database is easier to navigate.

Data merge is incorrect. Data merging is the process of combining two or more data sets into a single data set.

Question #91

A data analyst wants to create "Income Categories" that would be calculated based on the existing variable "Income".

The "Income Categories" would be as follows:

Income category 1: less than $1.

Income category 2: more than $1 and less than $20,000.

Income category 3: more than $20,001 and less than $40,000.

Income category 4: more than $40,001.

Which of the following data manipulation techniques should the data analyst use to create "Income Categories"?

  • A . Data merge
  • B . Derived variables
  • C . Data blending
  • D . Data append

Reveal Solution Hide Solution

Correct Answer: B
B

Explanation:

The correct answer is B: Derived variables Derived variables are variables that you create by calculating or categorizing variables that already exist in your data set.

Data merge is incorrect. Data merging is the process of combining two or more data sets into a single data set. Data blending is incorrect.

Data blending involves pulling data from different sources and creating a single, unique, dataset for visualization and analysis.

Data append is incorrect. A data append is a process that involves adding new data elements to an existing database.
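
A sketch of how such a derived variable could be created with pandas; the bin edges below approximate the categories in the question, and the income values are made up.

import pandas as pd

df = pd.DataFrame({"Income": [0.50, 15000, 25000, 39000, 42000, 100000]})

# Derive "Income Categories" from the existing Income variable
df["Income Categories"] = pd.cut(
    df["Income"],
    bins=[float("-inf"), 1, 20000, 40000, float("inf")],
    labels=["Category 1", "Category 2", "Category 3", "Category 4"],
)
print(df)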

Question #92

Angela is aggregating data from CRM system with data from an employee system.

While performing an initial quality check, she realizes that her employee ID is not associated with her identifier in the CRM system.

What kind of issue is Angela facing? Choose the best answer.

  • A . ETL process.
  • B . Record linkage.
  • C . ELT process.
  • D . System integration.

Reveal Solution Hide Solution

Correct Answer: B
B

Explanation:

While this scenario describes a system integration challenge that can be solved with ETL or ELT, Angela is facing a Record linkage issue.

Question #93

Andy is a pricing analyst for a retailer. Using a hypothesis test, he wants to assess whether people who receive electronic coupons spend more on average.

What should Andy’s null hypothesis be?

  • A . People who receive electronic coupons spend more on average.
  • B . People who receive electronic coupons spend less on average.
  • C . People who receive electronic coupons do not spend more on average.
  • D . People who do not receive electronic coupons spend more on average.

Reveal Solution Hide Solution

Correct Answer: C
C

Explanation:

The null hypothesis presumes the status quo, that is, no effect. Andy is testing whether people who receive an electronic coupon spend more on average, so the null hypothesis states that people who receive the coupon do not spend more on average.

Question #94

Amanda needs to create a dashboard that will draw information from many other data sources and present it to business leaders.

Which one of the following tools is least likely to meet her needs?

  • A . QuickSight.
  • B . Tableau.
  • C . Power BI.
  • D . SPSS Modeler.

Reveal Solution Hide Solution

Correct Answer: D
D

Explanation:

SPSS Modeler.

QuickSight, Tableau, and Power BI are all powerful analytics and reporting tools that can pull data from a variety of sources. SPSS Modeler is a powerful predictive analytics platform that is designed to bring predictive intelligence to decisions made by individuals, groups, systems and your enterprise.

Question #95

Daniel is using the Structured Query Language (SQL) to work with data stored in a relational database.

He would like to add several new rows to a database table.

What command should he use?

  • A . SELECT.
  • B . ALTER.
  • C . INSERT.
  • D . UPDATE.

Reveal Solution Hide Solution

Correct Answer: C
C

Explanation:

INSERT

The INSERT command is used to add new records to a database table.

The SELECT command is used to retrieve information from a database. It’s the most commonly used command in SQL because it is used to pose queries to the database and retrieve the data that you’re interested in working with.

The UPDATE command is used to modify rows in the database.

The CREATE command is used to create a new table within your database or a new database on your server.
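
A brief sketch of adding several new rows with INSERT, here run through Python's built-in sqlite3 module against a hypothetical table:

import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE employees (id INTEGER, name TEXT)")

# INSERT adds new rows; executemany runs the statement once per row of data
new_rows = [(1, "Daniel"), (2, "Maria"), (3, "Ken")]
cur.executemany("INSERT INTO employees (id, name) VALUES (?, ?)", new_rows)
conn.commit()

cur.execute("SELECT * FROM employees")
print(cur.fetchall())
conn.close()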

Question #96

Jhon is working on an ELT process that sources data from six different source systems.

Looking at the source data, he finds that data about the same people exists in two of the six systems.

What does he have to make sure he checks for in his ELT process? Choose the best answer.

  • A . Duplicate Data.
  • B . Redundant Data.
  • C . Invalid Data.
  • D . Missing Data.

Reveal Solution Hide Solution

Correct Answer: A
A

Explanation:

Duplicate Data.

While invalid, redundant, or missing data are all valid concerns, data about the same people exists in two of the six systems. As such, Jhon needs to account for duplicate data issues.

Question #97

Samantha needs to share a list of her organization’s top 50 customers with the VP of sales. She would like to include the name of the customer, the business they represent, their contact information, and their total sales over the past year.

The VP does not have any specialized analytics skills or software but would like to make some personal notes on the dataset.

What would be the best tool for Samantha to use to share this information?

  • A . Power BI.
  • B . Microsoft Excel.
  • C . Minitab.
  • D . SAS.

Reveal Solution Hide Solution

Correct Answer: B
B

Explanation:

Microsoft Excel.

This scenario presents a very simple use case where the business leader needs a dataset in an easy-to-access form and will not be performing any detailed analysis.

A simple spreadsheet, such as Microsoft Excel, would be the best tool for this job.

There is no need to use a statistical analysis package, such as SAS or Minitab, as this would likely confuse the VP without adding any value. The same is true of an integrated analytics suite, such as Power BI.

Question #98

Alex wants to use data from his corporate sale, CRM, and shipping systems to try and predict future sales.

Which of the following systems is the most appropriate? Choose the best answer.

  • A . Data mart.
  • B . OLAP.
  • C . Data Warehouse.
  • D . OLTP.

Reveal Solution Hide Solution

Correct Answer: C
C

Explanation:

Correct answer: C. Data Warehouse.

Data warehouses bring together data from the multiple systems used by an organization.

A data mart is too narrow, as Alex needs data from across multiple divisions.

OLAP is a broad term for analytical processing, and OLTP systems are transactional and not ideal for this task.

Question #99

Analytics reports should follow corporate style guidelines.

  • A . True.
  • B . False.

Reveal Solution Hide Solution

Correct Answer: A
Question #100

Which one of the following is a measure of dispersion?

  • A . Variance.
  • B . Mode.
  • C . Median.
  • D . Mean.

Reveal Solution Hide Solution

Correct Answer: A

Question #101

Which one of the following is NOT a common data integration tool?

  • A . XSS
  • B . ELT
  • C . ETL
  • D . APIs

Reveal Solution Hide Solution

Correct Answer: A
A

Explanation:

Cross-site scripting (XSS) is a security vulnerability, usually found in websites and web applications that accept user input; it is not a data integration tool.

ELT, ETL, and APIs, by contrast, are all common ways to integrate and move data between systems.

XSS is a client-side vulnerability that targets other users of an application, while SQL injection is a server-side vulnerability that targets the application’s database.

Question #102

Which one of the following is a common data warehouse schema?

  • A . Snowflake.
  • B . Square.
  • C . Spiral.
  • D . Sphere.

Reveal Solution Hide Solution

Correct Answer: A
A

Explanation:

A snowflake schema is a common data warehouse schema. Like a star schema, it centers on a fact table linked to dimension tables, but some or all of the dimension tables are normalized into multiple related tables, giving the diagram a snowflake shape. Square, spiral, and sphere are not data warehouse schema types.
