An AI Agent That Learned
to Do a Human's Job.

Built from 21,000 real task records and nine years of WhatsApp conversations — a production agentic system before the industry had a name for the architecture.

Daniel Manzela · Oct 2020 — Dec 2023
The Problem

A Human Dispatcher in Every Conversation

Hasherut ran a task-fulfillment service entirely over WhatsApp. Every client request still required a person to read it, classify it, price it, and assign it. The institutional knowledge lived in chat history.

No Structured Intake

Clients messaged the company on WhatsApp; taskers fulfilled. No ticketing system. No intake form. Every request passed through a dispatcher reading messages in real time.

Classification by Memory

Hundreds of active client threads spanning 15 service categories — bureaucratic errands, vehicle work, insurance filings, repairs, deliveries. Categorization, pricing, and routing all sat in one person’s head.

Knowledge Trapped in Text

Nine years of pricing logic, escalation rules, and edge-case handling existed only as informal Hebrew chat — no database, no knowledge base, no way to onboard a new dispatcher.

Engineering objective. Convert nine years of unstructured Hebrew/English WhatsApp chat into a structured corpus, fine-tune a language model on it, and put that model behind the same WhatsApp number — replacing the dispatcher loop entirely.

How It Was Built

Four Layers: Classify, Retrieve, Execute, Verify

The same architecture now called “agentic AI with tool use” — designed and shipped in 2020–2023, before the terminology existed.

LAYER 01

Understanding · Intent Classification

Now called Intent Classification + Slot Filling
21,102
Labeled Tasks
153
Unique Clients
1,561
Action-Verb Patterns
30
Metadata Fields / Task

From 21M+ WhatsApp messages, the agent learned to recognize 1,561 unique action-verb intents.

Deep dive

Every inbound WhatsApp message routes through a classifier that places it into one of 15 service categories, auto-completes any missing fields it can infer, and asks the user to confirm before proceeding. The structural backbone the model learned to recognize was extracted from the company’s nine years of chat history.
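A minimal sketch of that routing step, assuming a keyword-scoring approach: each category carries a keyword set, the top-scoring category wins, and a weak match triggers a confirmation question back to the user. The category names and keyword lists here are illustrative English stand-ins, not the production Hebrew sets.

```javascript
// Hypothetical sketch of the Layer 1 routing step: keyword-driven category
// scoring over a normalized message, with a confirmation prompt when the
// top score is weak. Categories and keywords are illustrative stand-ins.
const CATEGORY_KEYWORDS = {
  Vehicle: ["garage", "car", "license renewal", "tire"],
  Insurance: ["claim", "policy", "insurer"],
  Delivery: ["deliver", "pick up", "package"],
};

function classify(message) {
  const text = message.toLowerCase();
  let best = { category: "Misc", score: 0 };
  for (const [category, keywords] of Object.entries(CATEGORY_KEYWORDS)) {
    const score = keywords.filter((k) => text.includes(k)).length;
    if (score > best.score) best = { category, score };
  }
  // A weak or empty match asks the user to confirm before proceeding.
  return best.score > 0
    ? { category: best.category, confirm: best.score < 2 }
    : { category: "Misc", confirm: true };
}
```

In production the scorer was a fine-tuned model rather than keyword counts, but the contract is the same: one of 15 categories out, plus a confirm-with-user flag.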

Labeled tasks per category (21,102 total):

Bureaucratic · 4,018
Insurance · 3,761
Secretarial · 3,433
Errands · 1,602
Vehicle · 1,490
Computer / IT · 1,436
Secretarial A.M. · 941
Purchasing · 876
Repairs · 764
Delivery · 718
Technician · 672
Professionals · 671
Escort · 393
Transportation · 266
Misc · 61
How the training data was built
01 · EXTRACT
WhatsApp text exports
Programmatic export of client threads spanning 2011–2020.
02 · TAG ROLES
Client vs. tasker per message
Speaker classification across mixed Hebrew / English threads.
03 · TAXONOMY
15 categories, 1,561 verbs
Hebrew keyword sets aligned to each service category.
04 · Q&A PAIRS
SQuAD-format records
VBA + Excel macros generated context / question / answer triples.
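The Q&A-pair step can be re-sketched in Node.js (the original used VBA and Excel macros): a role-tagged thread becomes SQuAD-style context / question / answer records by pairing each client message with the tasker reply that follows it. The field names and pairing rule here are assumptions about the record shape, not the production macro logic.

```javascript
// Illustrative re-sketch of step 04 in Node.js: turn a role-tagged thread
// into SQuAD-style context / question / answer records. A client message
// followed by a tasker reply becomes one Q&A pair; the full thread text
// serves as the context field.
function toSquadRecords(thread) {
  const records = [];
  for (let i = 0; i < thread.length - 1; i++) {
    const msg = thread[i];
    const reply = thread[i + 1];
    if (msg.role === "client" && reply.role === "tasker") {
      records.push({
        context: thread.map((m) => m.text).join("\n"),
        question: msg.text,
        answer: reply.text,
      });
    }
  }
  return records;
}
```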
The Hebrew NLP problem. In 2021 there was no off-the-shelf open-source toolchain for production Hebrew: tokenizers, NER, and labeled corpora were all sparse. The pipeline above is what getting it to work in production looked like.
LAYER 02

Knowledge · Retrieval-Augmented Generation

Now called Retrieval-Augmented Generation (RAG)

Fine-tuned weights handled style.
The RAG corpus handled truth.

145
Procedures
67
Formal Protocols
97
Unique Workflows
21
Procedure Categories

Constrain the model to the company’s own rules. Phrase, but don’t invent.

Deep dive

Once an intent is classified, the agent retrieves a matching procedure from an authored corpus and injects it into the prompt at request time — constraining replies to the company’s own rules instead of letting the model invent policy.
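The retrieve-and-inject step reduces to a small function. This is a minimal sketch under assumed names: the procedure store, the lookup key, and the prompt wording are stand-ins, not the production corpus or prompt.

```javascript
// Minimal sketch of Layer 2: look up the authored procedure for the
// classified intent and inject it into the prompt as grounding text.
// The store contents and prompt phrasing are illustrative assumptions.
const PROCEDURES = {
  "vehicle-service":
    "Confirm the vehicle plate, offer the authorized garage, book a slot.",
};

function buildPrompt(intent, userMessage) {
  const procedure =
    PROCEDURES[intent] ?? "No matching procedure; escalate to a human.";
  return [
    "You answer only according to the company procedure below.",
    `PROCEDURE: ${procedure}`,
    `CLIENT MESSAGE: ${userMessage}`,
    "Phrase the reply in the company's voice. Do not invent policy.",
  ].join("\n");
}
```

The division of labor is the point: the fine-tuned weights decide how the reply sounds, the injected procedure decides what it is allowed to say.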

Before “RAG” was a label. Retrieve grounding text, inject it into the prompt at request time, let the model phrase but not invent: that pattern was the working solution for keeping a fine-tuned Hebrew model honest in 2021–2022, years before the field had a name for it.
LAYER 03

Execution · Tool Use & Multi-Step Workflows

Now called Tool Use / Function Calling

Purchasing protocol

Quote → approval → fulfillment → receipt.

Vehicle service protocol

Schedules an appointment with an authorized garage, threading vehicle & client metadata.

Insurance claim filing

Collects evidence and stages the government / insurer form packet.

Monthly billing

Closes the open task ledger, reconciles, generates the period invoice.

Dissatisfied client (5 steps)

Detect → triage → remediate → escalate → close-the-loop.

Follow-up tracking

Scheduled nudges to external parties holding up an open task.

The agent runs the workflow. It doesn’t just retrieve text.

Deep dive

Each procedure is a multi-step workflow the agent runs — not a paragraph it reads back. API calls, sequential coordination, asking the user for missing information. Every variant of what to ask and when was mapped and trained.
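A procedure-as-workflow can be sketched as an ordered list of steps, each either an API call or a question that fills a missing slot. This is a hedged sketch: the step shape, slot names, and `ask` / `callApi` callbacks are assumptions, not the production workflow engine.

```javascript
// Hedged sketch of Layer 3: run a procedure as a sequence of steps.
// A step that needs a missing slot pauses to ask the user; a step with
// an api field makes the external call with the slots gathered so far.
async function runWorkflow(steps, slots, ask, callApi) {
  for (const step of steps) {
    if (step.needs && slots[step.needs] == null) {
      // Missing information: ask the user, then resume the sequence.
      slots[step.needs] = await ask(step.question);
    }
    if (step.api) {
      await callApi(step.api, slots);
    }
  }
  return slots;
}
```

For example, the vehicle-service protocol would gather the plate number before calling a hypothetical `bookGarage` API with the completed slots.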

Model evolution. Fine-tuning ran across two platforms over 12+ months: OpenAI (Babbage → Davinci → GPT-3.5 as each became available) and HF AutoTrain. Hundreds of cycles tightened Hebrew fidelity, PII redaction, conversational state, and procedure adherence, not parameter counts.
LAYER 04

Safety Gate · Human-in-the-Loop Override

Now called Human-in-the-Loop / Guardrails

Probabilistic output never escaped without a deterministic decision behind it.

Five minutes for a human to override. Otherwise, a deterministic fallback.

Deep dive

After the agent prepares a reply, a 5-minute human-override window opens. A human operator watches the session through a Debug Dashboard and can edit, reclassify, or approve. If no one intervenes, the system falls back to a deterministic message: “We’ll get back to you soon.”
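The gate itself is a race between a human decision and a timer. A minimal sketch, assuming the human decision arrives as a promise that resolves with an edited reply, with null (plain approval), or never (no one intervened); the timings and fallback text follow the description above.

```javascript
// Sketch of the Layer 4 gate: the staged reply waits out a human-override
// window; if the window expires with no intervention, a deterministic
// fallback message is sent instead.
const OVERRIDE_WINDOW_MS = 5 * 60 * 1000;
const FALLBACK = "We'll get back to you soon.";

function gate(stagedReply, humanDecision, windowMs = OVERRIDE_WINDOW_MS) {
  const windowExpired = new Promise((resolve) =>
    setTimeout(() => resolve(FALLBACK), windowMs)
  );
  return Promise.race([
    // Edited text wins; a null decision approves the staged reply as-is.
    humanDecision.then((edited) => edited ?? stagedReply),
    windowExpired,
  ]);
}
```

Whichever branch wins, the output is deterministic: an approved reply, an edited reply, or the fallback. Nothing probabilistic reaches the client unreviewed.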

REAL-TIME VIEW

Live session inspector

Inbound message, detected intent, retrieved procedure, and staged reply — all synchronized.

EDIT

Editable response field

Correct the auto-completion in place; the corrected text replaces what would have been sent.

RECLASSIFY

Intent override dropdown

If Layer 1 misjudged, pick the correct category and the workflow re-routes from there.

DEBUG

Code logs in-line

Per-session decision trace — every retrieval, every branch, every API call.

Deterministic-gate-first, in production. In 2021, most production AI shipped raw model output straight to users. Putting every probabilistic decision behind a deterministic check, with a human-readable trace and an explicit fallback, was the design choice that later scaled into the multi-agent DAGs at the Pipeline Observatory.
System Architecture

The Production Flow

WhatsApp in, four layers in sequence, then either a task created in Monday.com or a deterministic fallback message back to the client — served by AWS Lambda over Twilio.

PRODUCTION FLOW

01 · WhatsApp message in · Hebrew, informal, multimodal references
02 · LAYER 01 · Understanding: classify into 1 of 15 categories, auto-complete, ask user to confirm (21,102 examples · intent detection · auto-complete · confirm with user)
03 · LAYER 02 · Knowledge: retrieve the matching procedure, inject into prompt context (145 procedures · 67 protocols · 97 workflows · style ≠ truth)
04 · LAYER 03 · Execution: run the multi-step workflow, call APIs, gather missing info from the user (tool use · plan steps · API calls · sequential / async tasks)
05 · LAYER 04 · Safety Gate: 5-minute human override window, default fallback if no intervention (Debug Dashboard: live session view · edit / reclassify: human override · send or fallback: deterministic gate)
06 · Task created in Monday.com (type, urgency, client, attachments · via API) and a WhatsApp completion message sent through the same client number

PRODUCTION ACTIVE · OCT 2020 — DEC 2023

Serverless Compute

AWS Lambda (Node.js) handles every Twilio webhook — orchestrates retrieval, calls the model, posts the reply. Pay-per-invocation matched the conversational workload.

Inference Layer

Fine-tuned OpenAI model for in-domain replies, with retrieval against the 145-procedure corpus injected into prompt context at request time.

Operational Integration

Monday.com for task creation, Wix as the client-facing CRM surface, Connecteam for tasker management.

Stack
WhatsApp / Twilio · channel
AWS Lambda · compute
Node.js · Serverless · runtime
OpenAI Fine-Tuning · model
HF AutoTrain · alt training
Monday.com API · tasks / CRM
Wix · customer surface
Connecteam · operations
Architecture note · why serverless

Lambda + Twilio kept the production surface a thin proxy — no servers to maintain, no idle cost between conversations, and a clean re-deploy path each time a new fine-tuned model checkpoint replaced its predecessor. The agent could be updated independently of the messaging surface, the retrieval layer, or the operational integrations.
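The thin-proxy shape can be sketched as a single Lambda handler that runs the four layers in sequence. The layer functions below are trivial stand-ins so the handler runs end-to-end; the production system swapped in the real classifier, RAG prompt builder, fine-tuned model call, and override gate. The TwiML response shape follows Twilio's webhook convention; everything else here is an assumed sketch.

```javascript
// Assumed shape of the thin Lambda proxy: a Twilio webhook handler that
// runs Layers 1-4 in order. Each layer function is a trivial stand-in.
const classify = (msg) => "vehicle-service";                  // Layer 1 stub
const buildPrompt = (intent, msg) => `${intent}: ${msg}`;     // Layer 2 stub
const callModel = async (p) => `Staged reply for ${p}`;       // Layer 3 stub
const gate = async (staged) => staged;                        // Layer 4 stub
const twiml = (reply) =>
  `<?xml version="1.0"?><Response><Message>${reply}</Message></Response>`;

const handler = async (event) => {
  const params = new URLSearchParams(event.body); // Twilio posts form-encoded
  const message = params.get("Body") ?? "";
  const intent = classify(message);
  const prompt = buildPrompt(intent, message);
  const staged = await callModel(prompt);
  const reply = await gate(staged);
  return {
    statusCode: 200,
    headers: { "Content-Type": "text/xml" },
    body: twiml(reply),
  };
};
exports.handler = handler;
```

Because each layer is a plain function behind the handler, any one of them (a new model checkpoint, a revised corpus) can be redeployed without touching the messaging surface.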

Reflection

What Three Years Taught

Three architectures evaluated. One went to production.

Each approach was judged on three questions: fine-tunable on Hebrew? encodes the 145 procedures? training and inference unified on one platform?

APPROACH 01 · Dante AI (no-code chatbot builder · native Twilio)
Fine-tunable on Hebrew? No · black-box platform.
Encodes 145 procedures? No · no surface to inject dispatcher logic.
Training ↔ inference unified? N/A · no training surface.
Verdict: Ruled out. Fastest pilot in week one; generalized at the cost of every specialization that mattered.

APPROACH 02 · AutoTrain + ChatGPT (HF AutoTrain → ChatGPT inference)
Fine-tunable on Hebrew? Yes · iterated dataset versions and snapshots.
Encodes 145 procedures? Yes · via prompt context at inference.
Training ↔ inference unified? No · two platforms; manual loop end-to-end.
Verdict: Ruled out. No automated quality loop; cycle friction killed iteration speed.

APPROACH 03 · AWS Lambda + OpenAI (Lambda · Twilio · fine-tuned OpenAI · RAG) · PRODUCTION
Fine-tunable on Hebrew? Yes · OpenAI fine-tuning API.
Encodes 145 procedures? Yes · RAG corpus injected per request.
Training ↔ inference unified? Yes · one platform for both; cycle friction eliminated.
Verdict: Shipped. Production Oct 2020 – Dec 2023.

Ruled-out approaches were never wasted: the lessons accumulated on the AutoTrain branch fed directly into the production stack.

Three lessons that outlasted the chatbot
01
Data quality eats model quality. The bottleneck was never the model. It was the SQuAD pipeline producing clean training pairs from informal multilingual chat — and a Hebrew NLP toolchain that didn’t exist off-the-shelf in 2021.
02
Style and truth are different problems. Fine-tuning shapes how the agent talks; retrieval determines what it’s allowed to say. Conflate them and operational policy gets violated in ways the team only discovers after a client complaint.
03
The safety gate is the architecture. A 5-minute override isn’t a feature you bolt on — it’s the gate that lets a probabilistic system run in production without a brand-trust failure. The deterministic fallback was what made the model safe to ship, not the model itself.

The agent handled live client requests over WhatsApp through December 2023: more than three years in production, replacing the full-time human dispatcher loop.

This four-layer pattern — classify, retrieve, execute, verify — became the foundation for the deterministic-gate-first architecture later applied at scale in the Pipeline Observatory and the multi-agent DAGs of agent-dag-pipeline.