Which of the following feature engineering tasks will be the least efficient to distribute?

A machine learning engineer is trying to scale a machine learning pipeline by distributing its feature engineering process.

Which of the following feature engineering tasks will be the least efficient to distribute?
A. One-hot encoding categorical features
B. Target encoding categorical features
C. Imputing missing feature values with the mean
D. Imputing missing feature values with the true median
E. Creating binary indicator features for missing values

Answer: D

Explanation:

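Among the options listed, imputing missing values with the true median is the least efficient task to distribute. An exact median depends on the global ordering of every value in the column, so the data must be sorted or shuffled across partitions, and per-partition results cannot simply be merged into the final answer. The mean, by contrast, reduces to a sum and a count per partition, and those partial results combine exactly. The remaining options distribute well for similar reasons: one-hot encoding (once the set of categories is known) and missing-value indicators are row-wise transformations, and target encoding relies on per-category averages that aggregate the same way as the mean.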
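The difference between the mean and the true median can be illustrated with a minimal, single-machine sketch. It is not how any particular framework implements these operations. The `partitions` list and the function names are hypothetical, and plain Python lists stand in for the partitions a cluster would hold.

```python
from statistics import median

# Hypothetical partitions of one feature column (illustrative values only).
partitions = [
    [1.0, 2.0, 3.0],
    [4.0, 100.0],
    [5.0, 6.0, 7.0, 8.0],
]


def distributed_mean(parts):
    # Each partition emits a small (sum, count) pair; the pairs combine exactly.
    partials = [(sum(p), len(p)) for p in parts]
    total = sum(s for s, _ in partials)
    count = sum(n for _, n in partials)
    return total / count


def median_of_partition_medians(parts):
    # Naive attempt to merge per-partition medians.
    # In general this does NOT equal the true median.
    return median(median(p) for p in parts)


def true_median(parts):
    # Requires every value in one global order, which in a cluster
    # means a distributed sort or a full shuffle.
    return median(v for p in parts for v in p)


print("mean from partials:     ", distributed_mean(partitions))
print("median of medians:      ", median_of_partition_medians(partitions))
print("true median (all data): ", true_median(partitions))
```

With these values, the naive median of partition medians differs from the true median, while the mean built from partial sums and counts matches the mean of the full column. This is why exact median imputation forces data movement that mean imputation avoids.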

Reference

Challenges in parallel processing and distributed computing for data aggregation like median calculation: https://www.apache.org