Dimension errors when using sklearn OneHotEncoder with min_frequency parameter #545

@dclaz

The documentation suggests that the sklearn OneHotEncoder should be a viable transformation when using the MimicExplainer, but I get errors when I use it with the min_frequency parameter set to collapse category levels with low counts.

If I set up my data preprocessor like this

(where I have ~7 categorical features, each with many levels)

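from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# numeric_features and categorical_features are lists of column names defined elsewhere.

# Define categorical transformer
# Note: in scikit-learn >= 1.2 the `sparse` argument is named `sparse_output`.
categorical_transformer = Pipeline(
    steps=[
        ("cat_impute", SimpleImputer(strategy="constant", fill_value="missing")),
        ("onehot", OneHotEncoder(drop=None, handle_unknown="infrequent_if_exist", sparse=False, min_frequency=0.01)),
    ]
)
# Define numeric transformer
numeric_transformer = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="median")),
        ("scaler", StandardScaler()),
    ]
)

# A single one-hot pipeline is applied to all categorical features at once
data_preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numeric_features),
        ("cat", categorical_transformer, categorical_features),
    ],
    remainder="drop",
)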

I get the following error:

[image: screenshot of the error]

However, if I set up a separate transformer for each categorical feature, the explainer works, albeit with a "Many to one/many maps found in input" warning, and it produces outputs that don't really make sense (half the features end up with very similar SHAP values).

# Define categorical transformer
categorical_transformer = Pipeline(
    steps=[
        ("cat_impute", SimpleImputer(strategy="constant", fill_value='missing')),
        ("onehot", OneHotEncoder(drop=None, handle_unknown="infrequent_if_exist", sparse=False, min_frequency=0.01)),
    ]
)
# Define numeric transformer
numeric_transformer = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="median")),
        ("scaler", StandardScaler()),
    ]
)

# Construct list of categorical transformers 
categorical_treatments_list = [(feature, categorical_transformer, [feature]) for feature in categorical_features]

# Construct the data preprocessor
data_preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numeric_features),
        *categorical_treatments_list
    ],
    remainder="drop",
)
