diff --git a/examples/README.md b/examples/README.md
index b66ecb7..645fb4c 100644
--- a/examples/README.md
+++ b/examples/README.md
@@ -190,7 +190,7 @@ weco run --source train.py \

 Optimize a tabular fraud-detection pipeline on real Vesta payment data.
 Reproduces Weco's
-[fraud-detection case study](https://weco.ai/blog/framing-the-problem)
+[fraud-detection case study](https://weco.ai/blog/framing-the-puzzle-for-autoresearch)
 (baseline AUC 0.914 → pooled 6-seed mean 0.9305 ± 0.0035 with full
 instructions at 200 steps).

diff --git a/examples/fraud-detection-loose/README.md b/examples/fraud-detection-loose/README.md
index c2d7b44..8f77a0f 100644
--- a/examples/fraud-detection-loose/README.md
+++ b/examples/fraud-detection-loose/README.md
@@ -7,10 +7,10 @@ feature engineering and the LightGBM configuration — to maximize AUC-ROC on a
 held-out, time-based validation split.

 This example reproduces the setup from Weco's fraud-detection case study
-([blog post](https://weco.ai/blog/framing-the-problem),
+([blog post](https://weco.ai/blog/framing-the-puzzle-for-autoresearch),
 [code](https://github.com/WecoAI/fraud-detection-case-study)). The example's
-baseline is **AUC ≈ 0.9102** (deterministic; verifiable via the SHA-256s
-in `prepare_data.py`). The case study reported 0.914, which used a slightly
+baseline is **AUC ≈ 0.9102** (deterministic; verify by running
+`python evaluate.py`, which should print `auc_roc: 0.910171`). The case study reported 0.914, which used a slightly
 leaky `build_features` (concat-then-groupby on train+val); this example's
 `train.py` fits all encoders on `train_df` only — no time-leakage. With the
 bundled `instructions.md` and 200 steps of `gemini-3.1-pro-preview`, expect
@@ -144,7 +144,7 @@ different count than non-fraud rows. The baseline `build_features` drops
 aggregations on a dataframe that still has the label.
 The case study walks through a real instance where this bug reported
 AUC 0.9591 that dropped to 0.9154 after a one-line fix — see
-.
+.

 **Time leakage** — validation-period statistics leak into train features. This
 is a time-based split; at serving time you don't have the val period.
diff --git a/examples/fraud-detection-loose/prepare_data.py b/examples/fraud-detection-loose/prepare_data.py
index d2edda5..8438f31 100644
--- a/examples/fraud-detection-loose/prepare_data.py
+++ b/examples/fraud-detection-loose/prepare_data.py
@@ -1,8 +1,11 @@
 """Download IEEE-CIS data and build the fixed train/val parquets used by train.py.

 Produces `data/base_train_small.parquet` (100K rows, stratified by fraud) and
-`data/base_val_small.parquet` (25K rows, time-later subsample). Identical SHA-256
-to the parquets used in the published case study.
+`data/base_val_small.parquet` (25K rows, time-later subsample). Logically
+identical to the parquets used in the published case study; verify by running
+`python evaluate.py`, which should reproduce the baseline AUC stated in this
+example's README. (Parquet byte hashes vary with pandas/pyarrow writer versions,
+so file hashes are not comparable across environments.)

 Usage:
     # 1. Put your Kaggle API token at ~/.kaggle/kaggle.json
@@ -13,7 +16,7 @@

 Runtime: ~2-3 minutes on a modern laptop. Produces ~150MB of parquet files.

-Pipeline (must stay byte-identical to the originals — see SHAs in the README):
+Pipeline (must stay faithful to the original case-study recipe):
   1. Merge `train_transaction.csv` + `train_identity.csv` on TransactionID.
   2. Time-based 80/20 split on TransactionDT (last 20% by time = validation).
   3. V-feature correlation pruning: sample 10_000 rows from the FULL merged df with
@@ -184,9 +187,9 @@ def main() -> None:
     print(f"[write] {train_out}")
     print(f"[write] {val_out}")
     print()
-    print("Expected SHA-256 (matches the published case study parquets):")
-    print(" train: a2d7a6740559975b8e6d89bd605f1e29791dd7d3fee8abc6449552bbc18d29ae")
-    print(" val: 8b426c8bf7fa845bc234dbce304b1107fd295143fac2398bab97b78805f50753")
+    print("Sanity check: `python evaluate.py` should print the baseline AUC from this")
+    print("example's README. (Parquet byte hashes vary with pandas/pyarrow writer")
+    print("versions, so don't compare file hashes across environments.)")


 if __name__ == "__main__":
diff --git a/examples/fraud-detection/README.md b/examples/fraud-detection/README.md
index a712260..0ff5a83 100644
--- a/examples/fraud-detection/README.md
+++ b/examples/fraud-detection/README.md
@@ -8,7 +8,7 @@ Kaggle dataset (real Vesta payment transactions). Weco rewrites two files —
 to maximize AUC-ROC on a held-out time-based validation split.

 This example reproduces Weco's fraud-detection case study
-([blog post](https://weco.ai/blog/framing-the-problem),
+([blog post](https://weco.ai/blog/framing-the-puzzle-for-autoresearch),
 [code](https://github.com/WecoAI/fraud-detection-case-study)) with an **API
 that makes train/val leakage impossible by construction** — see the "Why this
 design" section below.
diff --git a/examples/fraud-detection/prepare_data.py b/examples/fraud-detection/prepare_data.py
index d2edda5..8438f31 100644
--- a/examples/fraud-detection/prepare_data.py
+++ b/examples/fraud-detection/prepare_data.py
@@ -1,8 +1,11 @@
 """Download IEEE-CIS data and build the fixed train/val parquets used by train.py.

 Produces `data/base_train_small.parquet` (100K rows, stratified by fraud) and
-`data/base_val_small.parquet` (25K rows, time-later subsample). Identical SHA-256
-to the parquets used in the published case study.
+`data/base_val_small.parquet` (25K rows, time-later subsample). Logically
+identical to the parquets used in the published case study; verify by running
+`python evaluate.py`, which should reproduce the baseline AUC stated in this
+example's README. (Parquet byte hashes vary with pandas/pyarrow writer versions,
+so file hashes are not comparable across environments.)

 Usage:
     # 1. Put your Kaggle API token at ~/.kaggle/kaggle.json
@@ -13,7 +16,7 @@

 Runtime: ~2-3 minutes on a modern laptop. Produces ~150MB of parquet files.

-Pipeline (must stay byte-identical to the originals — see SHAs in the README):
+Pipeline (must stay faithful to the original case-study recipe):
   1. Merge `train_transaction.csv` + `train_identity.csv` on TransactionID.
   2. Time-based 80/20 split on TransactionDT (last 20% by time = validation).
   3. V-feature correlation pruning: sample 10_000 rows from the FULL merged df with
@@ -184,9 +187,9 @@ def main() -> None:
     print(f"[write] {train_out}")
     print(f"[write] {val_out}")
     print()
-    print("Expected SHA-256 (matches the published case study parquets):")
-    print(" train: a2d7a6740559975b8e6d89bd605f1e29791dd7d3fee8abc6449552bbc18d29ae")
-    print(" val: 8b426c8bf7fa845bc234dbce304b1107fd295143fac2398bab97b78805f50753")
+    print("Sanity check: `python evaluate.py` should print the baseline AUC from this")
+    print("example's README. (Parquet byte hashes vary with pandas/pyarrow writer")
+    print("versions, so don't compare file hashes across environments.)")


 if __name__ == "__main__":
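
The reason for dropping the SHA-256 check in favor of a logical check can be illustrated with a minimal sketch. It uses JSON as a stand-in for Parquet so that it runs on the standard library alone; `byte_hash`, `logical_hash`, and the sample rows are hypothetical and are not part of this change. The two encodings play the role of two writer versions: the same rows, serialized to different bytes.

```python
import hashlib
import json


def byte_hash(blob: bytes) -> str:
    """SHA-256 of the raw file bytes (sensitive to how the writer encodes them)."""
    return hashlib.sha256(blob).hexdigest()


def logical_hash(rows: list) -> str:
    """SHA-256 over a canonical per-row serialization (depends only on the data)."""
    digest = hashlib.sha256()
    for row in rows:
        canonical = json.dumps(row, sort_keys=True, separators=(",", ":"))
        digest.update(canonical.encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()


# Hypothetical sample rows; the column names mirror the dataset, the values are made up.
rows = [
    {"TransactionID": 1, "TransactionDT": 86400, "isFraud": 0},
    {"TransactionID": 2, "TransactionDT": 86401, "isFraud": 1},
]

# Two stand-in "writers" that encode the same rows with different byte layouts.
file_a = json.dumps(rows).encode("utf-8")
file_b = json.dumps(rows, indent=2).encode("utf-8")

same_bytes = byte_hash(file_a) == byte_hash(file_b)
same_content = logical_hash(json.loads(file_a)) == logical_hash(json.loads(file_b))
print(f"byte hashes match: {same_bytes}, logical hashes match: {same_content}")
```

The patch itself takes a simpler route and checks the baseline AUC printed by `python evaluate.py`, which exercises the data end to end rather than fingerprinting it.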