SACCO Internal Auditor – AI Blueprint

A concrete end‑to‑end plan to design, build, and deploy an AI assistant for a SACCO’s Internal Audit function.
Version 1.0 (HTML)

1) Objectives & Target Outcomes

Primary goals

Measurable outcomes (first 90–120 days)

2) High‑Value Audit Use Cases

  1. Transactions & Payments
    • Duplicate/round‑sum payments, weekend/after‑hours postings, rapid reversals, manual overrides, off‑cycle journals.
    • Benford’s Law deviations across GL and vendor payments.
  2. Loans & Credit
    • Ghost members, related‑party loans, unusual restructuring, delinquency risk, staging anomalies (IFRS 9 context).
  3. Savings & Member Accounts
    • Structuring/smurfing patterns, linked accounts, abnormal cash‑in/cash‑out velocity.
  4. Vendors & Procurement
    • Shell vendors, split purchases below approval thresholds, outdated KYC/CRB, conflicts of interest.
  5. Channels & Ops
    • Teller overrides, mobile/app anomalies, device/location mismatch, insider risk signals.
  6. Policy Compliance (NLP)
    • Scan narratives/emails/tickets for policy breach indicators and missing documentation.

3) Data Sources & Minimum Viable Dataset (MVD)

Data window: 24–36 months history where possible; minimum 12 months for seasonality.

PII handling: Tokenize member IDs; keep reversible mapping in a secure vault for investigations.

4) Features & Red‑Flag Indicators (Examples)

Transaction‑level

Benford & Ratio tests

Behavioral & Network

Credit risk

NLP signals

5) Model Portfolio (Complementary, not either/or)

  1. Rules & Controls (Deterministic)
    • Encodes policy and regulatory thresholds; fast and explainable; seeds labels for ML.
  2. Unsupervised anomaly detection
    • Isolation Forest / One‑Class SVM for transaction outliers.
    • Time‑series: Seasonal‑Hybrid ESD, STL residual anomalies on volumes/amounts.
  3. Supervised risk scoring (where historical cases exist)
    • Gradient Boosted Trees (XGBoost/LightGBM) or Logistic Regression with monotonic constraints for stability.
  4. Graph analytics
    • Community detection, link prediction for collusion/related parties.
  5. NLP
    • Domain keyword patterns + lightweight transformer (e.g., MiniLM) for policy‑breach classification; zero‑shot for cold‑start.

Blended score: risk_score = w1*rules + w2*unsupervised + w3*supervised + w4*graph + w5*NLP with calibrated weights to meet precision/recall targets.

6) System Architecture (Target)

Ingestion/Storage

Processing & Feature Store

Modeling

Serving

Explainability & Evidence

Audit UI

Security

7) Evaluation & Monitoring

Offline

Online

Human‑in‑the‑loop

8) Governance & Compliance

9) Starter Implementation Plan (12 Weeks)

Weeks 1–2 – Discovery & Data

Weeks 3–4 – Feature Engineering & Rules

Weeks 5–6 – Unsupervised & Time‑Series

Weeks 7–8 – Supervised Prototype

Weeks 9–10 – UI & Workflow

Weeks 11–12 – Hardening

10) Sample Data Model (Warehouse Views)

-- Fact table: transactions
CREATE VIEW vw_transactions AS
SELECT
  t.txn_id,
  t.member_id,
  t.account_id,
  t.product_type,
  t.amount,
  t.currency,
  t.posted_at,
  t.channel,
  t.branch_id,
  t.user_id,
  t.is_reversal,
  v.invoice_no,
  v.vendor_id
FROM raw.transactions t
LEFT JOIN raw.vendor_invoices v USING (txn_id);

-- Dimensions
CREATE VIEW dim_member AS SELECT member_id, join_date, kyc_risk, branch_id FROM raw.members;
CREATE VIEW dim_staff  AS SELECT user_id, role, department, is_maker, is_checker FROM raw.staff;
CREATE VIEW dim_vendor AS SELECT vendor_id, tax_pin, created_at, is_related_party FROM raw.vendors;

-- Feature view example: rolling stats by member
CREATE VIEW feat_member_txn_30d AS
SELECT
  member_id,
  COUNT(*) AS txn_30d,
  SUM(amount) AS amt_30d,
  AVG(amount) AS avg_amt_30d,
  STDDEV(amount) AS sd_amt_30d,
  MAX(posted_at) AS last_txn_at
FROM raw.transactions
WHERE posted_at >= CURRENT_DATE - INTERVAL '30 day'
GROUP BY 1;

11) Rule Examples (Deterministic CCM)

-- R1: Split purchases just under approval threshold (per vendor per day)
SELECT vendor_id, DATE(posted_at) d, COUNT(*) n, SUM(amount) total
FROM raw.vendor_payments
WHERE amount BETWEEN 0.9*${approval_threshold} AND ${approval_threshold}
GROUP BY 1,2 HAVING n >= 3;

-- R2: Duplicate invoices (same vendor, invoice_no, amount)
SELECT vendor_id, invoice_no, amount, COUNT(*) n
FROM raw.vendor_invoices
GROUP BY 1,2,3 HAVING n > 1;

-- R3: After‑hours postings by staff outside allowed window
SELECT txn_id, user_id, posted_at
FROM raw.transactions t JOIN dim_staff s USING (user_id)
WHERE EXTRACT(HOUR FROM posted_at) NOT BETWEEN s.allowed_start AND s.allowed_end;

12) Python Prototype (Training & Scoring)

import pandas as pd
from sklearn.ensemble import IsolationForest
from sklearn.model_selection import train_test_split
from lightgbm import LGBMClassifier
from sklearn.metrics import average_precision_score, precision_recall_curve

# Load your engineered features (joined views)
X = pd.read_parquet('features/txn_features.parquet')

# ---- Unsupervised: Isolation Forest ----
iso = IsolationForest(n_estimators=300, contamination=0.01, random_state=42)
iso.fit(X)
unsup_score = -iso.decision_function(X)  # higher => more anomalous

# ---- Supervised (if labels exist) ----
labels = pd.read_parquet('labels/txn_labels.parquet')  # columns: txn_id, y
XY = X.join(labels.set_index('txn_id'), on='txn_id').dropna(subset=['y'])
X_train, X_valid, y_train, y_valid = train_test_split(XY.drop(columns=['y']), XY['y'], stratify=XY['y'], test_size=0.2, random_state=42)

clf = LGBMClassifier(n_estimators=600, learning_rate=0.03, max_depth=-1, subsample=0.8, colsample_bytree=0.8)
clf.fit(X_train, y_train)
proba = clf.predict_proba(X_valid)[:,1]
pr_auc = average_precision_score(y_valid, proba)

# Blend example (weights to be tuned)
blend = 0.5*(unsup_score.loc[X_valid.index]) + 0.5*proba

# Threshold at top‑K
K = 1000
topk_idxs = blend.sort_values(ascending=False).head(K).index
alerts = X_valid.loc[topk_idxs]
alerts['risk_score'] = blend.loc[topk_idxs]
alerts.to_parquet('outputs/alerts_topk.parquet')

13) Explainability & Evidence Bundle

For each alert:

14) Dashboard & Workflow (MVP requirements)

15) Security & Access Controls

16) Risks & Mitigations

17) Resourcing & Tools (suggested)

18) Next Steps Checklist