Which of the following is true of Delta Lake and the Lakehouse?

Which of the following is true of Delta Lake and the Lakehouse?
A. Because Parquet compresses data row by row, strings will only be compressed when a character is repeated multiple times.
B. Delta Lake automatically collects statistics on the first 32 columns of each table, which are leveraged in...
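
Option B refers to Delta Lake's built-in statistics collection. A minimal sketch of how that behavior is controlled, assuming a Databricks runtime with Delta Lake and a hypothetical table named example_events; delta.dataSkippingNumIndexedCols is the table property governing how many leading columns have statistics collected (32 by default):

# Sketch only: example_events is a hypothetical table name.
# Delta Lake collects min/max statistics on the first 32 columns by default;
# this table property controls how many leading columns are indexed.
spark.sql("""
    ALTER TABLE example_events
    SET TBLPROPERTIES ('delta.dataSkippingNumIndexedCols' = '32')
""")

# The collected statistics enable data skipping for selective queries such as:
spark.sql("SELECT * FROM example_events WHERE device_id = 42").show()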


Which code block should be used to create the date Python variable used in the above code block?

An upstream system has been configured to pass the date for a given batch of data to the Databricks Jobs API as a parameter. The notebook to be scheduled will use this parameter to load data with the following code: df = spark.read.format("parquet").load(f"/mnt/source/{date}") Which code block should be used to...
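
A minimal sketch of one way the notebook can pick up that parameter, assuming the Jobs API passes it as a notebook parameter named "date"; dbutils.widgets is the standard mechanism for reading notebook parameters:

# Sketch: the Jobs API parameter is assumed to be named "date".
# dbutils.widgets.get returns the parameter value as a string inside the notebook.
date = dbutils.widgets.get("date")

# The value is then interpolated into the source path with an f-string.
df = spark.read.format("parquet").load(f"/mnt/source/{date}")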


Which approach would simplify the identification of these changed records?

A table in the Lakehouse named customer_churn_params is used in churn prediction by the machine learning team. The table contains information about customers derived from a number of upstream sources. Currently, the data engineering team populates this table nightly by overwriting the table with the current valid values derived from...
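
One Delta Lake feature commonly discussed for this kind of scenario is Change Data Feed, which records row-level changes between table versions. A hedged sketch, using the customer_churn_params table from the question and an illustrative starting version:

# Sketch: enable Change Data Feed on the table (a one-time property change).
spark.sql("""
    ALTER TABLE customer_churn_params
    SET TBLPROPERTIES ('delta.enableChangeDataFeed' = 'true')
""")

# Row-level inserts, updates, and deletes can then be read between versions;
# the starting version below is illustrative.
changes = (spark.read.format("delta")
           .option("readChangeFeed", "true")
           .option("startingVersion", 5)
           .table("customer_churn_params"))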


If this alert raises notifications for 3 consecutive minutes and then stops, which statement must be true?

The data engineering team has configured a Databricks SQL query and alert to monitor the values in a Delta Lake table. The recent_sensor_recordings table contains an identifying sensor_id alongside the timestamp and temperature for the most recent 5 minutes of recordings. The below query is used to create the alert:...
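
The alert query itself is truncated above. Purely as a hypothetical illustration of the general shape of such an alert, an aggregate per sensor over the most recent readings might look like the following (the threshold value is invented):

# Hypothetical illustration only; the actual alert query is truncated above.
spark.sql("""
    SELECT sensor_id, MEAN(temperature) AS avg_temperature
    FROM recent_sensor_recordings
    GROUP BY sensor_id
    HAVING MEAN(temperature) > 120   -- invented threshold
""").show()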


Assuming all delete logic is correctly implemented, which statement correctly addresses this concern?

The data engineering team has configured a job to process customer requests to be forgotten (have their data deleted). All user data that needs to be deleted is stored in Delta Lake tables using default table settings. The team has decided to process all deletions from the previous week as...
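
With default Delta Lake table settings, a DELETE only removes rows from the current table version; the underlying data files remain reachable through time travel until VACUUM removes them after the retention period (7 days by default). A sketch with hypothetical table names:

# Sketch: user_messages and delete_requests are hypothetical table names.
# DELETE removes rows from the current version of the table only.
spark.sql("""
    DELETE FROM user_messages
    WHERE user_id IN (SELECT user_id FROM delete_requests)
""")

# VACUUM physically removes data files older than the retention threshold
# (168 hours = 7 days, the default with default table settings).
spark.sql("VACUUM user_messages RETAIN 168 HOURS")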


Which of the following accurately presents information about Delta Lake and Databricks that may impact their decision-making process?

A junior data engineer is working to implement logic for a Lakehouse table named silver_device_recordings. The source data contains 100 unique fields in a highly nested JSON structure. The silver_device_recordings table will be used downstream to power several production monitoring dashboards and a production model. At present, 45 of the...
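
A sketch of one way such a silver table is often populated from nested JSON; the source path and field names below are assumptions, since the question text is truncated:

# Sketch: source path and nested field names are hypothetical.
from pyspark.sql import functions as F

raw = spark.read.json("/mnt/raw/device_recordings/")  # schema inferred from nested JSON

# Promote a handful of nested fields to top-level columns for downstream use.
silver = raw.select(
    F.col("device.id").alias("device_id"),
    F.col("reading.temperature").alias("temperature"),
    F.col("reading.humidity").alias("humidity"),
)

silver.write.format("delta").mode("append").saveAsTable("silver_device_recordings")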


Which of the following likely explains these smaller file sizes?

A production workload incrementally applies updates from an external Change Data Capture feed to a Delta Lake table as an always-on Structured Stream job. When data was initially migrated for this table, OPTIMIZE was executed and most data files were resized to 1 GB. Auto Optimize and Auto Compaction were...
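
Auto Optimize and Auto Compaction write and compact toward a smaller target file size (on the order of 128 MB) than a manual OPTIMIZE, which defaults to roughly 1 GB files. A sketch of the table properties involved, using a hypothetical table name:

# Sketch: cdc_target is a hypothetical table name.
# These properties enable optimized writes and automatic compaction,
# which target smaller files than a manual OPTIMIZE.
spark.sql("""
    ALTER TABLE cdc_target
    SET TBLPROPERTIES (
        'delta.autoOptimize.optimizeWrite' = 'true',
        'delta.autoOptimize.autoCompact' = 'true'
    )
""")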


Which statement describes what will happen when the above code is executed?

The security team is exploring whether the Databricks secrets module can be leveraged for connecting to an external database. After testing the code with all Python variables defined as strings, they upload the password to the secrets module and configure the correct permissions for the currently active...
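
A minimal sketch of reading a secret inside a notebook, with a hypothetical scope and key name; values returned by the secrets module are redacted when displayed in notebook output:

# Sketch: scope and key names are hypothetical.
password = dbutils.secrets.get(scope="jdbc_creds", key="db_password")

# The value can be used in code (e.g. a JDBC connection), but printing it
# shows [REDACTED] in the notebook output.
print(password)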


Which configuration parameter directly affects the size of a spark-partition upon ingestion of data into Spark?

Which configuration parameter directly affects the size of a spark-partition upon ingestion of data into Spark?
A. spark.sql.files.maxPartitionBytes
B. spark.sql.autoBroadcastJoinThreshold
C. spark.sql.files.openCostInBytes
D. spark.sql.adaptive.coalescePartitions.minPartitionNum
E. spark.sql.adaptive.advisoryPartitionSizeInBytes

Answer: A

Explanation: This is the correct answer because spark.sql.files.maxPartitionBytes is a configuration parameter that directly affects the size of a spark-partition upon ingestion...
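
A sketch of how this parameter is set and observed; the value and source path below are illustrative:

# Sketch: spark.sql.files.maxPartitionBytes caps how many bytes of file data
# go into a single Spark partition when reading files (128 MB by default).
spark.conf.set("spark.sql.files.maxPartitionBytes", 134217728)  # illustrative value, in bytes

df = spark.read.format("parquet").load("/mnt/source/")  # hypothetical path
print(df.rdd.getNumPartitions())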


A junior data engineer has been asked to develop a streaming data pipeline with a grouped aggregation using DataFrame df. The pipeline needs to calculate the average humidity and average temperature for each non-overlapping five-minute interval. Events are recorded once per minute per device.

A junior data engineer has been asked to develop a streaming data pipeline with a grouped aggregation using DataFrame df. The pipeline needs to calculate the average humidity and average temperature for each non-overlapping five-minute interval. Events are recorded once per minute per device. Streaming DataFrame df has the following...
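
A sketch of a non-overlapping five-minute grouped aggregation; the schema of df is truncated above, so the column names time, device_id, temperature, and humidity are assumptions:

# Sketch: column names are assumed, since the DataFrame schema is truncated above.
from pyspark.sql import functions as F

agg = (df
       .withWatermark("time", "10 minutes")                  # illustrative watermark
       .groupBy(F.window("time", "5 minutes"), "device_id")  # tumbling 5-minute windows
       .agg(F.avg("temperature").alias("avg_temperature"),
            F.avg("humidity").alias("avg_humidity")))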
