In order for Structured Streaming to reliably track the exact progress of the processing so that it can handle any kind of failure by restarting and/or reprocessing, which two approaches does Spark use to record the offset range of the data being processed in each trigger?

A. Checkpointing and Write-ahead Logs
B. Structured Streaming cannot record the offset range of the data being processed in each trigger.
C. Replayable Sources and Idempotent Sinks
D. Write-ahead Logs and Idempotent Sinks
E. Checkpointing and Idempotent Sinks

Answer: A

Explanation:

Structured Streaming uses checkpointing and write-ahead logs to record the offset range of the data processed in each trigger, which lets the engine track progress exactly and recover from any failure by restarting and/or reprocessing. Checkpointing saves the state of a streaming query to fault-tolerant storage (such as HDFS) so it can be recovered after a failure. The write-ahead log records the offset range for each trigger in the checkpoint location before processing starts; on restart, the engine reads these logs to restore the query state and resume from the last committed offset range.

Reference: Structured Streaming Programming Guide, Fault Tolerance Semantics
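As a minimal sketch of how this looks in practice (not from the source, and with illustrative paths and source/sink choices), the checkpointLocation option is what tells Structured Streaming where to keep its write-ahead log and query state:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("checkpoint-demo").getOrCreate()

# Replayable source: the built-in "rate" source generates rows continuously.
stream = spark.readStream.format("rate").option("rowsPerSecond", 10).load()

# The offset range for each trigger is written to the checkpoint directory
# before processing, so the query can resume from the last committed offsets
# after a failure. Both paths below are hypothetical.
query = (
    stream.writeStream
    .format("parquet")
    .option("path", "/tmp/checkpoint-demo/output")
    .option("checkpointLocation", "/tmp/checkpoint-demo/cp")
    .start()
)

query.awaitTermination()
```

Within the checkpoint directory, the offsets/ subdirectory holds the write-ahead log entries for each trigger and commits/ records the batches that completed, which is how the engine knows where to resume after a restart.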
