WecoAI · ZhengyaoJiang · Jun 12, 2026
diff --git a/examples/README.md b/examples/README.md
@@ -190,7 +190,7 @@ weco run --source train.py \
 
 Optimize a tabular fraud-detection pipeline on real Vesta payment data.
 Reproduces Weco's
-[fraud-detection case study](https://weco.ai/blog/framing-the-problem)
+[fraud-detection case study](https://weco.ai/blog/framing-the-puzzle-for-autoresearch)
 (baseline AUC 0.914 → pooled 6-seed mean 0.9305 ± 0.0035 with full
 instructions at 200 steps).
 

diff --git a/examples/fraud-detection-loose/README.md b/examples/fraud-detection-loose/README.md
@@ -7,10 +7,10 @@ feature engineering and the LightGBM configuration — to maximize AUC-ROC on a
 held-out, time-based validation split.
 
 This example reproduces the setup from Weco's fraud-detection case study
-([blog post](https://weco.ai/blog/framing-the-problem),
+([blog post](https://weco.ai/blog/framing-the-puzzle-for-autoresearch),
 [code](https://github.com/WecoAI/fraud-detection-case-study)). The example's
-baseline is **AUC ≈ 0.9102** (deterministic; verifiable via the SHA-256s
-in `prepare_data.py`). The case study reported 0.914, which used a slightly
+baseline is **AUC ≈ 0.9102** (deterministic; verify by running
+`python evaluate.py`, which should print `auc_roc: 0.910171`). The case study reported 0.914, which used a slightly
 leaky `build_features` (concat-then-groupby on train+val); this example's
 `train.py` fits all encoders on `train_df` only — no time-leakage. With the
 bundled `instructions.md` and 200 steps of `gemini-3.1-pro-preview`, expect
@@ -144,7 +144,7 @@ different count than non-fraud rows. The baseline `build_features` drops
 aggregations on a dataframe that still has the label. The case study walks
 through a real instance where this bug reported AUC 0.9591 that dropped to
 0.9154 after a one-line fix — see
-<https://weco.ai/blog/framing-the-problem>.
+<https://weco.ai/blog/framing-the-puzzle-for-autoresearch>.
 
 **Time leakage** — validation-period statistics leak into train features.
 This is a time-based split; at serving time you don't have the val period.
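
The time-leakage point in the README text above can be illustrated with a minimal sketch. The toy frames, the `card_id` column, and the `count_encode` helper are hypothetical and are not taken from this example's `train.py`. The sketch only contrasts fitting a count encoder on train+val (the concat-then-groupby pattern) with fitting it on the training period alone.

```python
import pandas as pd

# Hypothetical toy data; column names are illustrative only.
train_df = pd.DataFrame({"card_id": ["a", "a", "b"], "isFraud": [0, 1, 0]})
val_df = pd.DataFrame({"card_id": ["a", "c", "c"], "isFraud": [0, 1, 1]})


def count_encode(fit_df: pd.DataFrame, apply_df: pd.DataFrame, col: str) -> pd.Series:
    """Map each value of `col` in apply_df to its frequency in fit_df (0 if unseen)."""
    counts = fit_df[col].value_counts()
    return apply_df[col].map(counts).fillna(0).astype(int)


# Leaky: counts are computed on train+val, so validation-period rows shape the feature.
both = pd.concat([train_df, val_df], ignore_index=True)
leaky_val = count_encode(both, val_df, "card_id")

# Leak-free: counts come from the training period only.
clean_val = count_encode(train_df, val_df, "card_id")

print(leaky_val.tolist())
print(clean_val.tolist())
```

In the leaky variant the card seen only in the validation period receives a nonzero count, which is information that would not exist at serving time. In the train-only variant it falls back to 0.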

diff --git a/examples/fraud-detection-loose/prepare_data.py b/examples/fraud-detection-loose/prepare_data.py
@@ -1,8 +1,11 @@
 """Download IEEE-CIS data and build the fixed train/val parquets used by train.py.
 
 Produces `data/base_train_small.parquet` (100K rows, stratified by fraud) and
-`data/base_val_small.parquet` (25K rows, time-later subsample). Identical SHA-256
-to the parquets used in the published case study.
+`data/base_val_small.parquet` (25K rows, time-later subsample). Logically
+identical to the parquets used in the published case study; verify by running
+`python evaluate.py`, which should reproduce the baseline AUC stated in this
+example's README. (Parquet byte hashes vary with pandas/pyarrow writer versions,
+so file hashes are not comparable across environments.)
 
 Usage:
     # 1. Put your Kaggle API token at ~/.kaggle/kaggle.json
@@ -13,7 +16,7 @@
 
 Runtime: ~2-3 minutes on a modern laptop. Produces ~150MB of parquet files.
 
-Pipeline (must stay byte-identical to the originals — see SHAs in the README):
+Pipeline (must stay faithful to the original case-study recipe):
 1. Merge `train_transaction.csv` + `train_identity.csv` on TransactionID.
 2. Time-based 80/20 split on TransactionDT (last 20% by time = validation).
 3. V-feature correlation pruning: sample 10_000 rows from the FULL merged df with
@@ -184,9 +187,9 @@ def main() -> None:
     print(f"[write] {train_out}")
     print(f"[write] {val_out}")
     print()
-    print("Expected SHA-256 (matches the published case study parquets):")
-    print("  train: a2d7a6740559975b8e6d89bd605f1e29791dd7d3fee8abc6449552bbc18d29ae")
-    print("  val:   8b426c8bf7fa845bc234dbce304b1107fd295143fac2398bab97b78805f50753")
+    print("Sanity check: `python evaluate.py` should print the baseline AUC from this")
+    print("example's README. (Parquet byte hashes vary with pandas/pyarrow writer")
+    print("versions, so don't compare file hashes across environments.)")
 
 
 if __name__ == "__main__":

diff --git a/examples/fraud-detection/README.md b/examples/fraud-detection/README.md
@@ -8,7 +8,7 @@ Kaggle dataset (real Vesta payment transactions). Weco rewrites two files —
 to maximize AUC-ROC on a held-out time-based validation split.
 
 This example reproduces Weco's fraud-detection case study
-([blog post](https://weco.ai/blog/framing-the-problem),
+([blog post](https://weco.ai/blog/framing-the-puzzle-for-autoresearch),
 [code](https://github.com/WecoAI/fraud-detection-case-study)) with an
 **API that makes train/val leakage impossible by construction** — see the
 "Why this design" section below.

diff --git a/examples/fraud-detection/prepare_data.py b/examples/fraud-detection/prepare_data.py
@@ -1,8 +1,11 @@
 """Download IEEE-CIS data and build the fixed train/val parquets used by train.py.
 
 Produces `data/base_train_small.parquet` (100K rows, stratified by fraud) and
-`data/base_val_small.parquet` (25K rows, time-later subsample). Identical SHA-256
-to the parquets used in the published case study.
+`data/base_val_small.parquet` (25K rows, time-later subsample). Logically
+identical to the parquets used in the published case study; verify by running
+`python evaluate.py`, which should reproduce the baseline AUC stated in this
+example's README. (Parquet byte hashes vary with pandas/pyarrow writer versions,
+so file hashes are not comparable across environments.)
 
 Usage:
     # 1. Put your Kaggle API token at ~/.kaggle/kaggle.json
@@ -13,7 +16,7 @@
 
 Runtime: ~2-3 minutes on a modern laptop. Produces ~150MB of parquet files.
 
-Pipeline (must stay byte-identical to the originals — see SHAs in the README):
+Pipeline (must stay faithful to the original case-study recipe):
 1. Merge `train_transaction.csv` + `train_identity.csv` on TransactionID.
 2. Time-based 80/20 split on TransactionDT (last 20% by time = validation).
 3. V-feature correlation pruning: sample 10_000 rows from the FULL merged df with
@@ -184,9 +187,9 @@ def main() -> None:
     print(f"[write] {train_out}")
     print(f"[write] {val_out}")
     print()
-    print("Expected SHA-256 (matches the published case study parquets):")
-    print("  train: a2d7a6740559975b8e6d89bd605f1e29791dd7d3fee8abc6449552bbc18d29ae")
-    print("  val:   8b426c8bf7fa845bc234dbce304b1107fd295143fac2398bab97b78805f50753")
+    print("Sanity check: `python evaluate.py` should print the baseline AUC from this")
+    print("example's README. (Parquet byte hashes vary with pandas/pyarrow writer")
+    print("versions, so don't compare file hashes across environments.)")
 
 
 if __name__ == "__main__":
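
The distinction this change draws between byte-identical and logically identical parquets can be sketched as follows. This is not what the PR does (it relies on `python evaluate.py` reproducing the baseline AUC); it is only an illustration, under assumptions, of fingerprinting a frame's values rather than its file bytes. The `content_fingerprint` helper and the toy frames are hypothetical, and the sketch does not establish that `pd.util.hash_pandas_object` output is stable across pandas versions.

```python
import hashlib

import pandas as pd


def content_fingerprint(df: pd.DataFrame) -> str:
    """SHA-256 over column names, dtypes, and per-row value hashes.

    Depends on the in-memory values only, not on how a file writer
    laid the bytes out on disk.
    """
    # One uint64 hash per row, computed from the row's values (index ignored).
    row_hashes = pd.util.hash_pandas_object(df, index=False)
    h = hashlib.sha256()
    h.update(repr([(c, str(t)) for c, t in df.dtypes.items()]).encode("utf-8"))
    h.update(row_hashes.to_numpy().tobytes())
    return h.hexdigest()


# Hypothetical toy frames standing in for the real parquets.
a = pd.DataFrame({"TransactionID": [1, 2, 3], "isFraud": [0, 1, 0]})
b = a.copy()                      # same logical content
c = a.assign(isFraud=[0, 0, 0])   # one label changed

same = content_fingerprint(a) == content_fingerprint(b)
different = content_fingerprint(a) != content_fingerprint(c)
print(same, different)
```

A check of this kind would apply to frames loaded with `pd.read_parquet`, so it compares what the files contain rather than how they were serialized.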