Welcome to MLink Developer Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

python - LabelEncoder vs. onehot encoding in random forest regressor

I am trying to use RandomForestRegressor in Python. I know that numerical columns do not need to be scaled, since each split uses only the single column that yields the most information gain. However, it seems we still need to convert categorical values to numbers so that the model can handle them.
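The point about scaling can be checked empirically. Below is a minimal sketch on synthetic data; the data, the scaling factors, and all variable names are assumptions made up for illustration and have nothing to do with the air-quality dataset. It fits two forests with the same seed, one on raw features and one on rescaled features, and compares their predictions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic regression data (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X[:, 0] ** 2 + 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

# Rescale each column by a different positive factor. Powers of two are
# used so the rescaling is exact in floating point.
X_scaled = X * np.array([1024.0, 4.0, 8.0])

rf_raw = RandomForestRegressor(n_estimators=20, random_state=0).fit(X, y)
rf_scaled = RandomForestRegressor(n_estimators=20, random_state=0).fit(X_scaled, y)

pred_raw = rf_raw.predict(X)
pred_scaled = rf_scaled.predict(X_scaled)

# Splits depend only on the ordering of values within a column, so the two
# forests should partition the samples the same way.
max_diff = float(np.max(np.abs(pred_raw - pred_scaled)))
print(f"largest prediction difference: {max_diff}")
```

If the two forests agree, the largest difference should be zero or negligibly small, which is consistent with tree splits being driven by value ordering rather than by scale.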

I want to compare LabelEncoder and one-hot encoding, and to understand why one would be preferred over the other.

I am using the dataset from https://archive.ics.uci.edu/ml/datasets/Beijing+Multi-Site+Air-Quality+Data and trying to predict the PM2.5 value.

My dataframe looks like this:

   year  month  day  hour  PM2.5  PM10   SO2   NO2     CO    O3  TEMP    PRES   DEWP  RAIN   wd  WSPM       station
0  2013      3    1     0    4.0   4.0   4.0   7.0  300.0  77.0  -0.7  1023.0  -18.8   0.0  NNW   4.4  Aotizhongxin
1  2013      3    1     1    8.0   8.0   4.0   7.0  300.0  77.0  -1.1  1023.2  -18.2   0.0    N   4.7  Aotizhongxin
2  2013      3    1     2    7.0   7.0   5.0  10.0  300.0  73.0  -1.1  1023.5  -18.2   0.0  NNW   5.6  Aotizhongxin
3  2013      3    1     3    6.0   6.0  11.0  11.0  300.0  72.0  -1.4  1024.5  -19.4   0.0   NW   3.1  Aotizhongxin
4  2013      3    1     4    3.0   3.0  12.0  12.0  300.0  72.0  -2.0  1025.2  -19.5   0.0    N   2.0  Aotizhongxin

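First, I use one-hot encoding:

import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# One-hot encode the two categorical columns; numeric columns pass through.
ohe_df = pd.get_dummies(data=df, columns=["wd", "station"])

y = ohe_df["PM2.5"].values
X = ohe_df.drop(columns=["PM2.5"]).values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

rf_reg = RandomForestRegressor(n_estimators=100,
                               criterion="squared_error",  # named "mse" before scikit-learn 1.0
                               n_jobs=-1,
                               random_state=42)

rf_reg.fit(X_train, y_train)

train_pred_y = rf_reg.predict(X_train)
test_pred_y = rf_reg.predict(X_test)

print(f"train_MAE = {mean_absolute_error(y_train, train_pred_y)}")
print(f"test_MAE = {mean_absolute_error(y_test, test_pred_y)}")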

>>>train_MAE = 3.7268903322031877
>>>test_MAE = 10.108332295400825

Then, using the same rf_reg, I train and predict again after label encoding:

from sklearn.preprocessing import LabelEncoder

le_df = df.copy()
# Replace each categorical column with integer codes (a fresh encoder per column).
for col in ["wd", "station"]:
    le_df[col] = LabelEncoder().fit_transform(df[col])

y = le_df["PM2.5"].values
X = le_df.drop(columns=["PM2.5"]).values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Refit the same estimator on the label-encoded features.
rf_reg.fit(X_train, y_train)
train_pred_y = rf_reg.predict(X_train)
test_pred_y = rf_reg.predict(X_test)
print(f"train_MAE = {mean_absolute_error(y_train, train_pred_y)}")
print(f"test_MAE = {mean_absolute_error(y_test, test_pred_y)}")

>>>train_MAE = 3.765413599883373
>>>test_MAE = 10.189870188659498

From this comparison, one-hot encoding seems to perform better, but my question is: is this the right way to compare different encoding methods? And if so, why does label encoding perform worse (even if only slightly) than one-hot encoding?



1 Answer


The two classes, LabelEncoder and OneHotEncoder, serve different purposes and are not interchangeable.

From the OneHotEncoder docs:

"Encode categorical features as a one-hot numeric array."

From the LabelEncoder docs:

"Encode target labels with value between 0 and n_classes-1."

"This transformer should be used to encode target values, i.e. y, and not the input X."

So the correct approach for encoding the features here is OneHotEncoder. LabelEncoder additionally imposes an order on the categories (0 is less than 1, and so on), which is not meaningful for nominal features. This is not an issue for the labels in a classification problem, which is what LabelEncoder is designed for.
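To make the difference concrete, here is a minimal sketch, not the asker's actual pipeline. The toy rows are made up and only mimic the question's column names, and the step names "onehot", "preprocess", and "rf" are arbitrary choices. It first shows the integer codes LabelEncoder would assign to wd, then one-hot encodes wd and station inside a ColumnTransformer so that the encoding is fitted together with the model.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Made-up toy rows that only mimic the question's column names.
toy = pd.DataFrame({
    "TEMP": [-0.7, -1.1, -1.1, -1.4, -2.0, 3.0],
    "WSPM": [4.4, 4.7, 5.6, 3.1, 2.0, 1.2],
    "wd": ["NNW", "N", "NNW", "NW", "N", "NW"],
    "station": ["A", "A", "A", "B", "B", "B"],  # hypothetical station names
    "PM2.5": [4.0, 8.0, 7.0, 6.0, 3.0, 5.0],
})

# LabelEncoder sorts the classes and numbers them (N=0, NNW=1, NW=2),
# an ordering with no physical meaning for wind direction.
wd_codes = LabelEncoder().fit_transform(toy["wd"]).tolist()
print(wd_codes)

# One-hot encode the categorical columns; pass numeric columns through.
preprocess = ColumnTransformer(
    transformers=[("onehot", OneHotEncoder(handle_unknown="ignore"), ["wd", "station"])],
    remainder="passthrough",
    sparse_threshold=0,  # always return a dense array
)

model = Pipeline([
    ("preprocess", preprocess),
    ("rf", RandomForestRegressor(n_estimators=10, random_state=42)),
])

X = toy.drop(columns=["PM2.5"])
y = toy["PM2.5"]
model.fit(X, y)

# 3 wd indicators + 2 station indicators + 2 numeric columns.
encoded = model.named_steps["preprocess"].transform(X)
print(encoded.shape)
```

Keeping the encoder inside the pipeline means it is fitted only on whatever data is passed to fit, and handle_unknown="ignore" lets prediction proceed if a category unseen during fitting shows up later.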

If your features are ordinal (which does not seem to be the case here with wd and station), consider OrdinalEncoder instead of one-hot encoding.
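For a genuinely ordinal feature, a minimal sketch might look like the following. The column pollution_level and its three levels are hypothetical and do not exist in the dataset from the question. The order is passed explicitly through the categories parameter so that the integer codes follow the intended ranking rather than alphabetical order.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical ordinal column, used only for illustration.
levels = pd.DataFrame({"pollution_level": ["low", "high", "medium", "low"]})

# State the order explicitly: low < medium < high.
encoder = OrdinalEncoder(categories=[["low", "medium", "high"]])
codes = encoder.fit_transform(levels)

print(codes.ravel().tolist())  # [0.0, 2.0, 1.0, 0.0]
```

Without the explicit categories argument, the encoder would order the levels alphabetically (high, low, medium), which would defeat the purpose of an ordinal encoding here.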

