Which statement describes how the Delta engine identifies which files to load?
A Delta table of weather records is partitioned by date and has the below schema:
date DATE, device_id INT, temp FLOAT, latitude FLOAT, longitude FLOAT
To find all the records from within the Arctic Circle, you execute a query with the below filter:
latitude > 66.3
Which statement describes how the Delta engine identifies which files to load?
A. All records are cached to an operational database and then the filter is applied
B. The Parquet file footers are scanned for min and max statistics for the latitude column
C. All records are cached to attached storage and then the filter is applied
D. The Delta log is scanned for min and max statistics for the latitude column
E. The Hive metastore is scanned for min and max statistics for the latitude column
Answer: D
Explanation:
Option D is correct because Delta Lake stores metadata about each table in its transaction log, including min and max statistics per data file for the columns on which statistics are collected (by default, the first 32 columns). The Delta engine uses this information to identify which files could match a filter condition, without scanning the entire table or reading the Parquet file footers. This is called data skipping, and it can improve query performance significantly. Because the table is partitioned by date rather than latitude, partition pruning does not help with this filter, so the file-level statistics are what narrow down the files to load. Verified Reference: [Databricks Certified Data Engineer Professional], under “Delta Lake” section; [Databricks Documentation], under “Optimizations – Data Skipping” section.
In the transaction log, Delta Lake captures statistics for each data file of the table.
These statistics indicate per file:
– Total number of records
– Minimum value for each of the first 32 columns of the table
– Maximum value for each of the first 32 columns of the table
– Null value count for each of the first 32 columns of the table
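As an illustration of where these statistics live, each `add` action in the log carries a `stats` field holding a JSON-encoded string with the record count, min values, max values, and null counts. The sketch below parses one such entry. The file path and all numbers are invented for this example, and `read_file_stats` is a hypothetical helper, not part of any Delta Lake API.

```python
import json

# A hypothetical "add" action as it might appear on one line of a Delta log
# commit file. The path and every value below are invented for illustration.
log_line = json.dumps({
    "add": {
        "path": "date=2024-01-15/part-00000.snappy.parquet",
        "partitionValues": {"date": "2024-01-15"},
        "stats": json.dumps({
            "numRecords": 1200,
            "minValues": {"device_id": 3, "temp": -41.5, "latitude": 58.1},
            "maxValues": {"device_id": 97, "temp": 2.0, "latitude": 71.9},
            "nullCount": {"device_id": 0, "temp": 4, "latitude": 0},
        }),
    }
})


def read_file_stats(line):
    """Return (path, stats dict) for an add action, or None for other actions."""
    action = json.loads(line)
    add = action.get("add")
    if add is None or not add.get("stats"):
        return None
    # The stats field is itself a JSON-encoded string, so it is decoded again.
    return add["path"], json.loads(add["stats"])


path, stats = read_file_stats(log_line)
print(path)
print(stats["numRecords"], stats["minValues"]["latitude"], stats["maxValues"]["latitude"])
```

Log lines that describe other actions, such as commit metadata, have no `add` key, so the helper returns `None` for them.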
When a query with a selective filter is executed against the table, the query optimizer uses these statistics to identify the data files that may contain records matching the filter condition, and skips the files that cannot.
For the query in the question, the transaction log is scanned for min and max statistics for the latitude column.
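The skipping decision for a filter such as `latitude > 66.3` can be sketched in a few lines of Python. This is a simplified model of the idea, not the engine's actual implementation: `files_to_scan` is a hypothetical function, and the file names and statistics are made up. A file is kept only if its maximum latitude exceeds the bound, or if it has no statistics and therefore cannot be ruled out.

```python
def files_to_scan(file_stats, column, lower_bound):
    """Keep files whose max value for `column` could satisfy column > lower_bound."""
    selected = []
    for path, stats in file_stats.items():
        max_value = stats.get("maxValues", {}).get(column)
        # Without a statistic the file cannot be ruled out, so it must be read.
        if max_value is None or max_value > lower_bound:
            selected.append(path)
    return selected


# Hypothetical per-file statistics, already parsed from the log.
file_stats = {
    "part-0.parquet": {"minValues": {"latitude": -10.0}, "maxValues": {"latitude": 35.2}},
    "part-1.parquet": {"minValues": {"latitude": 40.0}, "maxValues": {"latitude": 71.9}},
    "part-2.parquet": {"minValues": {"latitude": 67.0}, "maxValues": {"latitude": 89.0}},
    "part-3.parquet": {},  # no statistics collected for this file
}

print(files_to_scan(file_stats, "latitude", 66.3))
# ['part-1.parquet', 'part-2.parquet', 'part-3.parquet']
```

Here `part-0.parquet` is skipped because its maximum latitude of 35.2 is below the bound. The other files still have to be read and filtered row by row, since min and max values only show that matching records may exist.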