A data scientist has been given an incomplete notebook from the data engineering team. The notebook uses a Spark DataFrame spark_df on which the data scientist needs to perform further feature engineering. Unfortunately, the data scientist has not yet learned the PySpark DataFrame API.
Which of the following blocks of code can the data scientist run to be able to use the pandas API on Spark?
A . import pyspark.pandas as ps
df = ps.DataFrame(spark_df)
B . import pyspark.pandas as ps
df = ps.to_pandas(spark_df)
C . spark_df.to_sql()
D . import pandas as pd
df = pd.DataFrame(spark_df)
E . spark_df.to_pandas()
Answer: A
Explanation:
The pandas API on Spark (the pyspark.pandas module, which grew out of the Koalas project and has been bundled with Spark since version 3.2) is designed to bridge the gap between the simplicity of pandas and the scalability of Spark. The correct approach is to import pyspark.pandas and wrap the existing Spark DataFrame in a pandas-on-Spark DataFrame, as option A does with ps.DataFrame(spark_df). This gives the data scientist the familiar pandas-style API while the data remains distributed across the Spark cluster. The other options fail: ps.to_pandas (B) and spark_df.to_pandas (E) are not real functions (Spark's toPandas() exists, but it collects everything into a local pandas DataFrame, sacrificing scalability), spark_df.to_sql() (C) is not a conversion to the pandas API, and pd.DataFrame(spark_df) (D) does not convert a Spark DataFrame into a usable plain-pandas DataFrame.
Reference
Pandas API on Spark Documentation:
https://spark.apache.org/docs/latest/api/python/user_guide/pandas_on_spark/index.html