2 changes: 1 addition & 1 deletion examples/README.md
@@ -190,7 +190,7 @@ weco run --source train.py \

Optimize a tabular fraud-detection pipeline on real Vesta payment data.
Reproduces Weco's
[fraud-detection case study](https://weco.ai/blog/framing-the-problem)
[fraud-detection case study](https://weco.ai/blog/framing-the-puzzle-for-autoresearch)
(baseline AUC 0.914 → pooled 6-seed mean 0.9305 ± 0.0035 with full
instructions at 200 steps).

8 changes: 4 additions & 4 deletions examples/fraud-detection-loose/README.md
@@ -7,10 +7,10 @@ feature engineering and the LightGBM configuration — to maximize AUC-ROC on a
held-out, time-based validation split.

This example reproduces the setup from Weco's fraud-detection case study
([blog post](https://weco.ai/blog/framing-the-problem),
([blog post](https://weco.ai/blog/framing-the-puzzle-for-autoresearch),
[code](https://github.com/WecoAI/fraud-detection-case-study)). The example's
baseline is **AUC ≈ 0.9102** (deterministic; verifiable via the SHA-256s
in `prepare_data.py`). The case study reported 0.914, which used a slightly
baseline is **AUC ≈ 0.9102** (deterministic; verify by running
`python evaluate.py`, which should print `auc_roc: 0.910171`). The case study reported 0.914, which used a slightly
leaky `build_features` (concat-then-groupby on train+val); this example's
`train.py` fits all encoders on `train_df` only — no time-leakage. With the
bundled `instructions.md` and 200 steps of `gemini-3.1-pro-preview`, expect
@@ -144,7 +144,7 @@ different count than non-fraud rows. The baseline `build_features` drops
aggregations on a dataframe that still has the label. The case study walks
through a real instance where this bug reported AUC 0.9591 that dropped to
0.9154 after a one-line fix — see
<https://weco.ai/blog/framing-the-problem>.
<https://weco.ai/blog/framing-the-puzzle-for-autoresearch>.

**Time leakage** — validation-period statistics leak into train features.
This is a time-based split; at serving time you don't have the val period.
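Note on the hunk above: the difference between the leaky "concat-then-groupby on train+val" recipe and fitting encoders on `train_df` only can be illustrated with a toy count encoder. This is a minimal, pandas-free sketch and not the example's actual `build_features`. The helper names (`fit_count_encoder`, `apply_count_encoder`) and the `card` column are hypothetical.

```python
from collections import Counter


def fit_count_encoder(rows, key):
    """Learn a value -> count mapping from the given rows only."""
    return Counter(row[key] for row in rows)


def apply_count_encoder(rows, key, counts):
    """Add a `<key>_count` feature; values never seen at fit time get 0."""
    return [{**row, f"{key}_count": counts.get(row[key], 0)} for row in rows]


# Hypothetical toy data: `train` is the earlier period, `val` the later one.
train = [{"card": "A"}, {"card": "A"}, {"card": "B"}]
val = [{"card": "A"}, {"card": "C"}, {"card": "C"}]

# Leak-free: statistics come from the training period only, then are
# applied unchanged to both splits.
counts = fit_count_encoder(train, "card")
train_feat = apply_count_encoder(train, "card", counts)
val_feat = apply_count_encoder(val, "card", counts)

# Leaky variant, for contrast: counts are computed over train + val, so
# training features already reflect validation-period rows.
leaky_counts = fit_count_encoder(train + val, "card")
leaky_train_feat = apply_count_encoder(train, "card", leaky_counts)
```

In the leak-free version, card `A` has count 2 in the training features. In the leaky version it has count 3, because the validation-period row was counted as well.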
15 changes: 9 additions & 6 deletions examples/fraud-detection-loose/prepare_data.py
@@ -1,8 +1,11 @@
"""Download IEEE-CIS data and build the fixed train/val parquets used by train.py.

Produces `data/base_train_small.parquet` (100K rows, stratified by fraud) and
`data/base_val_small.parquet` (25K rows, time-later subsample). Identical SHA-256
to the parquets used in the published case study.
`data/base_val_small.parquet` (25K rows, time-later subsample). Logically
identical to the parquets used in the published case study; verify by running
`python evaluate.py`, which should reproduce the baseline AUC stated in this
example's README. (Parquet byte hashes vary with pandas/pyarrow writer versions,
so file hashes are not comparable across environments.)

Usage:
# 1. Put your Kaggle API token at ~/.kaggle/kaggle.json
@@ -13,7 +16,7 @@

Runtime: ~2-3 minutes on a modern laptop. Produces ~150MB of parquet files.

Pipeline (must stay byte-identical to the originals — see SHAs in the README):
Pipeline (must stay faithful to the original case-study recipe):
1. Merge `train_transaction.csv` + `train_identity.csv` on TransactionID.
2. Time-based 80/20 split on TransactionDT (last 20% by time = validation).
3. V-feature correlation pruning: sample 10_000 rows from the FULL merged df with
@@ -184,9 +187,9 @@ def main() -> None:
print(f"[write] {train_out}")
print(f"[write] {val_out}")
print()
print("Expected SHA-256 (matches the published case study parquets):")
print(" train: a2d7a6740559975b8e6d89bd605f1e29791dd7d3fee8abc6449552bbc18d29ae")
print(" val: 8b426c8bf7fa845bc234dbce304b1107fd295143fac2398bab97b78805f50753")
print("Sanity check: `python evaluate.py` should print the baseline AUC from this")
print("example's README. (Parquet byte hashes vary with pandas/pyarrow writer")
print("versions, so don't compare file hashes across environments.)")


if __name__ == "__main__":
2 changes: 1 addition & 1 deletion examples/fraud-detection/README.md
@@ -8,7 +8,7 @@ Kaggle dataset (real Vesta payment transactions). Weco rewrites two files —
to maximize AUC-ROC on a held-out time-based validation split.

This example reproduces Weco's fraud-detection case study
([blog post](https://weco.ai/blog/framing-the-problem),
([blog post](https://weco.ai/blog/framing-the-puzzle-for-autoresearch),
[code](https://github.com/WecoAI/fraud-detection-case-study)) with an
**API that makes train/val leakage impossible by construction** — see the
"Why this design" section below.
15 changes: 9 additions & 6 deletions examples/fraud-detection/prepare_data.py
@@ -1,8 +1,11 @@
"""Download IEEE-CIS data and build the fixed train/val parquets used by train.py.

Produces `data/base_train_small.parquet` (100K rows, stratified by fraud) and
`data/base_val_small.parquet` (25K rows, time-later subsample). Identical SHA-256
to the parquets used in the published case study.
`data/base_val_small.parquet` (25K rows, time-later subsample). Logically
identical to the parquets used in the published case study; verify by running
`python evaluate.py`, which should reproduce the baseline AUC stated in this
example's README. (Parquet byte hashes vary with pandas/pyarrow writer versions,
so file hashes are not comparable across environments.)

Usage:
# 1. Put your Kaggle API token at ~/.kaggle/kaggle.json
@@ -13,7 +16,7 @@

Runtime: ~2-3 minutes on a modern laptop. Produces ~150MB of parquet files.

Pipeline (must stay byte-identical to the originals — see SHAs in the README):
Pipeline (must stay faithful to the original case-study recipe):
1. Merge `train_transaction.csv` + `train_identity.csv` on TransactionID.
2. Time-based 80/20 split on TransactionDT (last 20% by time = validation).
3. V-feature correlation pruning: sample 10_000 rows from the FULL merged df with
@@ -184,9 +187,9 @@ def main() -> None:
print(f"[write] {train_out}")
print(f"[write] {val_out}")
print()
print("Expected SHA-256 (matches the published case study parquets):")
print(" train: a2d7a6740559975b8e6d89bd605f1e29791dd7d3fee8abc6449552bbc18d29ae")
print(" val: 8b426c8bf7fa845bc234dbce304b1107fd295143fac2398bab97b78805f50753")
print("Sanity check: `python evaluate.py` should print the baseline AUC from this")
print("example's README. (Parquet byte hashes vary with pandas/pyarrow writer")
print("versions, so don't compare file hashes across environments.)")


if __name__ == "__main__":
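
Note on the `prepare_data.py` changes: the reason for dropping the SHA-256 check is that a byte hash covers the file's serialization, not its logical content. The sketch below shows the same effect using JSON as a stand-in for parquet. Two "writers" emit identical rows with different layouts, so the byte hashes differ while a hash over a canonical form agrees. The function names and the `label` column are hypothetical; this is an illustration under those assumptions, not part of the PR.

```python
import hashlib
import json


def byte_hash(payload: bytes) -> str:
    """SHA-256 of the raw serialized bytes (sensitive to writer layout)."""
    return hashlib.sha256(payload).hexdigest()


def logical_hash(rows) -> str:
    """SHA-256 of a canonical form: sorted keys, fixed separators.

    Row order is preserved, so reordered rows still hash differently.
    """
    canonical = json.dumps(rows, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Hypothetical rows; the column names are illustrative only.
rows = [{"TransactionID": 1, "label": 0}, {"TransactionID": 2, "label": 1}]

# Two "writer versions" producing the same content with different bytes.
writer_a = json.dumps(rows).encode("utf-8")
writer_b = json.dumps(rows, indent=2, sort_keys=True).encode("utf-8")

same_bytes = byte_hash(writer_a) == byte_hash(writer_b)
same_content = logical_hash(json.loads(writer_a)) == logical_hash(json.loads(writer_b))
```

Here `same_bytes` is `False` and `same_content` is `True`. This is the situation the new docstring describes for parquet files written by different pandas/pyarrow versions, and it is why the PR points readers at the `evaluate.py` baseline AUC instead.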