Exam4Training

A data scientist has written a data cleaning notebook that utilizes the pandas library, but their colleague has suggested that they refactor their notebook to scale with big data.

Which of the following approaches can the data scientist take to spend the least amount of time refactoring their notebook to scale with big data?
A. They can refactor their notebook to process the data in parallel.
B. They can refactor their notebook to use the PySpark DataFrame API.
C. They can refactor their notebook to use the Scala Dataset API.
D. They can refactor their notebook to use Spark SQL.
E. They can refactor their notebook to utilize the pandas API on Spark.

Answer: E

Explanation:

The data scientist can refactor the notebook to use the pandas API on Spark (formerly the Koalas project, available as the pyspark.pandas module since Apache Spark 3.2). This requires the fewest changes to the existing pandas-based code while scaling to big data through Spark’s distributed computing capabilities.

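The pandas API on Spark mirrors most of the pandas API, so much of the existing code can stay as written, often with little more than a changed import. Rewriting the notebook against the PySpark DataFrame API, the Scala Dataset API, or Spark SQL would instead mean re-expressing each cleaning step in a different API or language.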
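
As a rough illustration of why this option needs so little refactoring, the sketch below writes a cleaning function using only calls that exist in both pandas and pyspark.pandas. The column names, sample data, and the clean and clean_on_spark helpers are hypothetical and are not taken from the question. The pandas path runs as written; the Spark path assumes PySpark 3.2 or later and a working Spark environment.

```python
import pandas as pd


def clean(df):
    """Cleaning steps that use only calls shared by pandas and pyspark.pandas."""
    df = df.dropna(subset=["amount"])  # drop rows with a missing amount
    df = df.assign(city=df["city"].str.strip().str.lower())  # normalize text
    df = df.drop_duplicates()  # remove repeated rows
    return df


# Hypothetical sample data, small enough for plain pandas.
raw = {
    "city": [" Oslo", "oslo", "Bergen", "Bergen"],
    "amount": [10.0, 10.0, None, 7.5],
}

# Original notebook style: plain pandas on a single machine.
cleaned = clean(pd.DataFrame(raw))
print(cleaned)


def clean_on_spark(data):
    """Same logic on Spark; assumes pyspark >= 3.2 and a working Spark setup."""
    import pyspark.pandas as ps  # the import is the main change from pandas

    return clean(ps.DataFrame(data))
```

The clean function itself is untouched between the two paths; only the object passed in changes. Not every pandas method is available in pyspark.pandas, and row order is not guaranteed on Spark, so a real notebook may still need a few targeted adjustments.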

Reference: Databricks documentation on pandas API on Spark (formerly Koalas).
