A data scientist has a Spark DataFrame spark_df. They want to create a new Spark DataFrame that contains only the rows from spark_df where the value in column price is greater than 0.
Which of the following code blocks will accomplish this task?
A. spark_df[spark_df["price"] > 0]
B. spark_df.filter(col("price") > 0)
C. SELECT * FROM spark_df WHERE price > 0
D. spark_df.loc[spark_df["price"] > 0, :]
E. spark_df.loc[:, spark_df["price"] > 0]
Answer: B
Explanation:
To filter rows of a Spark DataFrame by a condition, use the filter method (or its alias, where) with a column expression. The correct PySpark syntax is spark_df.filter(col("price") > 0), which returns a new DataFrame containing only the rows where the value in the "price" column is greater than 0. The col function, imported from pyspark.sql.functions, builds the column expression. Option C is a bare SQL statement, which would only run via spark.sql() against a registered temporary view; options D and E use the pandas .loc accessor, which Spark DataFrames do not provide; and option A mimics pandas-style boolean indexing rather than the canonical PySpark filter API.
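For illustration, below is a minimal runnable sketch of answer B. The SparkSession setup, the sample rows, and the positive_prices variable name are assumptions added for this example; they are not part of the original question.

# A minimal sketch of answer B, assuming a local SparkSession and a
# hypothetical two-column dataset; names here are illustrative only.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("filter-example").getOrCreate()

# Hypothetical sample data standing in for spark_df.
spark_df = spark.createDataFrame(
    [("widget", 19.99), ("sample", 0.0), ("gadget", 4.50)],
    schema=["item", "price"],
)

# Keep only the rows where price is greater than 0 (answer B).
positive_prices = spark_df.filter(col("price") > 0)
positive_prices.show()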
Reference: PySpark DataFrame API documentation (Filtering DataFrames).