Which of the following tools can the data engineer use to solve this problem?

A data engineer is designing a data pipeline. The source system generates files in a shared directory that is also used by other processes. As a result, the files should be kept as is and will accumulate in the directory. The data engineer needs to identify which files are new since the previous run in the pipeline, and set up the pipeline to only ingest those new files with each run.

Which of the following tools can the data engineer use to solve this problem?
A. Unity Catalog
B. Delta Lake
C. Databricks SQL
D. Data Explorer
E. Auto Loader

Answer: E

Explanation:

Auto Loader incrementally and efficiently processes new data files as they arrive in cloud storage, without any additional setup. It provides a Structured Streaming source called cloudFiles, which detects new files in a given input directory path on cloud file storage and processes them. Auto Loader records the files it has discovered in the stream's checkpoint location, so each run picks up only the files that arrived since the previous run while the source files stay in place, which is exactly what this scenario requires. It also provides exactly-once guarantees when writing data into Delta Lake. Auto Loader can ingest JSON, CSV, XML, PARQUET, AVRO, ORC, TEXT, and BINARYFILE file formats. It is supported in both Python and SQL in Delta Live Tables, a declarative framework for building production-quality data pipelines on Databricks.
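As an illustration of the explanation above, here is a minimal sketch of an Auto Loader ingestion job in PySpark. It assumes a Databricks runtime, where the cloudFiles source is available and a SparkSession is passed in as spark. The function name ingest_new_files and the example paths and table name in the usage note are hypothetical, chosen only for this example.

```python
def ingest_new_files(spark, source_dir, checkpoint_dir, target_table, file_format="json"):
    """Start an Auto Loader stream that ingests files not processed by earlier runs.

    All argument values are supplied by the caller; nothing here is specific
    to the scenario in the question.
    """
    stream = (
        spark.readStream
        .format("cloudFiles")                                 # Auto Loader's Structured Streaming source
        .option("cloudFiles.format", file_format)             # format of the incoming files
        .option("cloudFiles.schemaLocation", checkpoint_dir)  # where the inferred schema is stored
        .load(source_dir)                                     # shared directory; files are left in place
    )
    return (
        stream.writeStream
        .option("checkpointLocation", checkpoint_dir)  # progress is tracked here between runs
        .trigger(availableNow=True)                    # process everything pending, then stop
        .toTable(target_table)                         # write to a table (Delta by default on Databricks)
    )
```

A scheduled job could call, for example, ingest_new_files(spark, "/mnt/shared/landing", "/mnt/checkpoints/landing", "bronze_events") on each run. Reusing the same checkpoint directory across runs is what lets the stream skip files it has already ingested.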

Reference: What is Auto Loader?, Get started with Databricks Auto Loader, Auto Loader in Delta Live Tables
