Which of the following describes why the statement might not have copied any new records into the table?

A data engineer runs a statement every day to copy the previous day’s sales into the table transactions. Each day’s sales are in their own file in the location "/transactions/raw".

Today, the data engineer runs the following command to complete this task:

After running the command today, the data engineer notices that the number of records in table transactions has not changed.

Which of the following describes why the statement might not have copied any new records into the table?
A . The format of the files to be copied were not included with the FORMAT_OPTIONS keyword.
B . The names of the files to be copied were not included with the FILES keyword.
C . The previous day’s file has already been copied into the table.
D . The PARQUET file format does not support COPY INTO.
E . The COPY INTO statement requires the table to be refreshed to view the copied rows.

Answer: C

Explanation:

The COPY INTO statement is an idempotent operation, which means that it will skip any files that have already been loaded into the target table1. This ensures that the data is not duplicated or corrupted by multiple attempts to load the same file. Therefore, if the data engineer runs the same command every day without specifying the names of the files to be copied with the FILES keyword or a glob pattern with the PATTERN keyword, the statement will only copy the first file that matches the source location and ignore the rest. To avoid this problem, the data engineer should either use the FILES or PATTERN keywords to filter the files to be copied based on the date or some other criteria, or delete the files from the source location after they are copied into the table2.

Reference:

1: COPY INTO | Databricks on AWS

2: Get started using COPY INTO to load data | Databricks on AWS

Subscribe
Notify of
guest
0 Comments
Inline Feedbacks
View all comments