Isolation Forest on transactional features. STL+ESD anomalies on daily amounts/volumes by branch/product (a minimal sketch follows this timeline).
Weeks 7–8 – Supervised Prototype
If labels exist (historic exceptions/frauds), train a LightGBM classifier and calibrate its scores with isotonic regression (a calibration sketch follows the Section 12 prototype).
Weeks 9–10 – UI & Workflow
Deliver dashboard with alert queue, drill‑downs, evidence bundle export (PDF/CSV), assignments and SLA timers.
Weeks 11–12 – Hardening
Backtesting, threshold tuning, monitoring dashboards, playbooks for incidents, and go‑live checklist.
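The STL+ESD step above can be prototyped directly on the daily aggregates. A minimal sketch, assuming statsmodels is installed and that daily is a series of daily totals for one branch/product; the function name and defaults (weekly period, at most 10 outliers, 5% significance) are illustrative choices, not part of the plan:

import numpy as np
import pandas as pd
from scipy import stats
from statsmodels.tsa.seasonal import STL

def stl_esd_anomalies(daily: pd.Series, period: int = 7, max_outliers: int = 10, alpha: float = 0.05) -> pd.Series:
    # Remove trend and weekly seasonality, then run a generalized ESD test on the remainder
    resid = STL(daily, period=period, robust=True).fit().resid
    x, n, results = resid.copy(), len(resid), []
    for i in range(1, max_outliers + 1):
        z = (x - x.mean()).abs() / x.std(ddof=1)
        idx = z.idxmax()                                   # most extreme remaining day
        p = 1 - alpha / (2 * (n - i + 1))
        t = stats.t.ppf(p, n - i - 1)
        lam = (n - i) * t / np.sqrt((n - i - 1 + t ** 2) * (n - i + 1))
        results.append((idx, z.loc[idx] > lam))
        x = x.drop(idx)
    # Outliers = every point removed up to the last test statistic that exceeded its critical value
    flagged = [d for k, (d, _) in enumerate(results) if any(ok for _, ok in results[k:])]
    return daily.loc[flagged]

# Example usage (illustrative): daily posted amounts for one branch
# daily = txns[txns.branch_id == 'B001'].groupby(txns.posted_at.dt.date)['amount'].sum()
# print(stl_esd_anomalies(daily))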
10) Sample Data Model (Warehouse Views)
-- Fact table: transactions
CREATE VIEW vw_transactions AS
SELECT
t.txn_id,
t.member_id,
t.account_id,
t.product_type,
t.amount,
t.currency,
t.posted_at,
t.channel,
t.branch_id,
t.user_id,
t.is_reversal,
v.invoice_no,
v.vendor_id
FROM raw.transactions t
LEFT JOIN raw.vendor_invoices v USING (txn_id);
-- Dimensions
CREATE VIEW dim_member AS SELECT member_id, join_date, kyc_risk, branch_id FROM raw.members;
CREATE VIEW dim_staff AS SELECT user_id, role, department, is_maker, is_checker, allowed_start, allowed_end FROM raw.staff; -- allowed_start/allowed_end assumed present in raw.staff; required by rule R3 below
CREATE VIEW dim_vendor AS SELECT vendor_id, tax_pin, created_at, is_related_party FROM raw.vendors;
-- Feature view example: rolling stats by member
CREATE VIEW feat_member_txn_30d AS
SELECT
member_id,
COUNT(*) AS txn_30d,
SUM(amount) AS amt_30d,
AVG(amount) AS avg_amt_30d,
STDDEV(amount) AS sd_amt_30d,
MAX(posted_at) AS last_txn_at
FROM raw.transactions
WHERE posted_at >= CURRENT_DATE - INTERVAL '30 day'
GROUP BY 1;
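As a bridge to the Python prototype in Section 12, these views can be pulled into the engineered-feature file along these lines. A minimal sketch, assuming a Postgres warehouse reachable via SQLAlchemy; the connection string and the exact column selection are placeholders:

import pandas as pd
from sqlalchemy import create_engine

engine = create_engine('postgresql+psycopg2://user:pass@warehouse:5432/sacco')  # placeholder DSN

# Join the transaction fact view to the trailing 30-day member aggregates to form model features
features = pd.read_sql(
    """
    SELECT t.txn_id, t.amount, t.is_reversal,
           f.txn_30d, f.amt_30d, f.avg_amt_30d, f.sd_amt_30d
    FROM vw_transactions t
    LEFT JOIN feat_member_txn_30d f USING (member_id)
    """,
    engine,
)
features.to_parquet('features/txn_features.parquet', index=False)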
11) Rule Examples (Deterministic CCM)
-- R1: Split purchases just under approval threshold (per vendor per day)
SELECT vendor_id, DATE(posted_at) AS d, COUNT(*) AS n, SUM(amount) AS total
FROM raw.vendor_payments
WHERE amount >= 0.9 * ${approval_threshold} AND amount < ${approval_threshold}
GROUP BY 1, 2
HAVING COUNT(*) >= 3;
-- R2: Duplicate invoices (same vendor, invoice_no, amount)
SELECT vendor_id, invoice_no, amount, COUNT(*) AS n
FROM raw.vendor_invoices
GROUP BY 1, 2, 3
HAVING COUNT(*) > 1;
-- R3: After-hours postings by staff outside their allowed window (assumes the window does not span midnight)
SELECT t.txn_id, t.user_id, t.posted_at
FROM raw.transactions t
JOIN dim_staff s USING (user_id)
WHERE EXTRACT(HOUR FROM t.posted_at) NOT BETWEEN s.allowed_start AND s.allowed_end;
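One way to operationalise these rules is a small runner that substitutes parameters, executes each query, and stamps every hit with its rule ID and the parameters used, so hits can feed the evidence bundle in Section 13. A sketch only; the DSN, the threshold value, the SQL file paths and the output table are illustrative:

import pandas as pd
from sqlalchemy import create_engine

engine = create_engine('postgresql+psycopg2://user:pass@warehouse:5432/sacco')  # placeholder DSN
PARAMS = {'approval_threshold': 100000}  # illustrative threshold

RULES = {
    'R1': 'Split purchases just under approval threshold',
    'R2': 'Duplicate invoices',
    'R3': 'After-hours postings',
}

def run_rule(rule_id: str, sql_template: str) -> pd.DataFrame:
    # Substitute ${...} parameters, execute the rule, and tag each hit with its rule ID and parameters
    sql = sql_template
    for k, v in PARAMS.items():
        sql = sql.replace('${%s}' % k, str(v))
    hits = pd.read_sql(sql, engine)
    hits['rule_id'] = rule_id
    hits['rule_desc'] = RULES[rule_id]
    hits['params'] = str(PARAMS)
    return hits

# hits = run_rule('R1', open('rules/r1_split_purchases.sql').read())
# hits.to_sql('rule_hits', engine, schema='ccm', if_exists='append', index=False)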
12) Python Prototype (Training & Scoring)
import pandas as pd
from sklearn.ensemble import IsolationForest
from sklearn.model_selection import train_test_split
from lightgbm import LGBMClassifier
from sklearn.metrics import average_precision_score

# Load your engineered features (joined views); txn_id becomes the index so it is not fed to the models
X = pd.read_parquet('features/txn_features.parquet').set_index('txn_id')

# ---- Unsupervised: Isolation Forest ----
iso = IsolationForest(n_estimators=300, contamination=0.01, random_state=42)
iso.fit(X)
unsup_score = pd.Series(-iso.decision_function(X), index=X.index)  # higher => more anomalous

# ---- Supervised (if labels exist) ----
labels = pd.read_parquet('labels/txn_labels.parquet')  # columns: txn_id, y
XY = X.join(labels.set_index('txn_id')).dropna(subset=['y'])
X_train, X_valid, y_train, y_valid = train_test_split(
    XY.drop(columns=['y']), XY['y'], stratify=XY['y'], test_size=0.2, random_state=42)
clf = LGBMClassifier(n_estimators=600, learning_rate=0.03, max_depth=-1,
                     subsample=0.8, colsample_bytree=0.8)
clf.fit(X_train, y_train)
proba = pd.Series(clf.predict_proba(X_valid)[:, 1], index=X_valid.index)
pr_auc = average_precision_score(y_valid, proba)
print(f'Validation PR-AUC: {pr_auc:.3f}')

# Blend example (weights to be tuned); rank-normalize the anomaly score so both inputs live on [0, 1]
unsup_rank = unsup_score.loc[X_valid.index].rank(pct=True)
blend = 0.5 * unsup_rank + 0.5 * proba

# Threshold at top-K
K = 1000
topk_idxs = blend.sort_values(ascending=False).head(K).index
alerts = X_valid.loc[topk_idxs].copy()
alerts['risk_score'] = blend.loc[topk_idxs]
alerts.to_parquet('outputs/alerts_topk.parquet')
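The isotonic-regression calibration called for in Weeks 7–8 can be bolted onto the classifier above. A minimal sketch using scikit-learn's CalibratedClassifierCV; the 5-fold cross-validated calibration is an illustrative choice:

from sklearn.calibration import CalibratedClassifierCV
from sklearn.metrics import brier_score_loss

# Wrap a fresh LightGBM model so each fold's isotonic mapping is fit on held-out data
calibrated = CalibratedClassifierCV(
    LGBMClassifier(n_estimators=600, learning_rate=0.03,
                   subsample=0.8, colsample_bytree=0.8),
    method='isotonic', cv=5)
calibrated.fit(X_train, y_train)
calibrated_proba = pd.Series(calibrated.predict_proba(X_valid)[:, 1], index=X_valid.index)
print('Calibrated Brier score:', brier_score_loss(y_valid, calibrated_proba))

Calibrated probabilities make the blend weights and alert thresholds easier to reason about, since the supervised component then reads as an exception probability rather than a raw score.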
13) Explainability & Evidence Bundle
For each alert:
Rule hits (IDs and descriptions) + parameters used.
Feature snapshot (values and global percentiles).
Model version, training data timestamp, SHAP top contributors (for supervised models).
Raw data links (transaction, member, loan, vendor records) with immutable IDs.
Decision log (who viewed, actions taken, comments) for full traceability.
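A minimal sketch of assembling such a bundle for one alert, assuming the objects from the Section 12 prototype (X, X_valid, blend, clf, topk_idxs) are still in scope; model_version and the top-5 cutoff are placeholders, and rule hits plus the decision log would be appended from their own stores:

import json

def evidence_bundle(txn_id, top_n=5, model_version='lgbm-0.1'):
    # Assemble a JSON-serializable evidence bundle for one alerted transaction
    row = X_valid.loc[[txn_id]]
    # SHAP-style contributions straight from LightGBM; the last column is the expected value, so drop it
    contrib = clf.booster_.predict(row, pred_contrib=True)[0][:-1]
    top = sorted(zip(row.columns, contrib), key=lambda kv: abs(kv[1]), reverse=True)[:top_n]
    return {
        'txn_id': str(txn_id),
        'risk_score': float(blend.loc[txn_id]),
        'model_version': model_version,
        'feature_snapshot': {
            c: {'value': float(row.iloc[0][c]),
                'global_pctile': float((X[c] <= row.iloc[0][c]).mean())}
            for c in row.columns
        },
        'top_contributors': [{'feature': f, 'shap': float(v)} for f, v in top],
        # rule hits and decision-log entries would be appended here from their own stores
    }

print(json.dumps(evidence_bundle(topk_idxs[0]), indent=2))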