An AI Agent That Learned
to Do a Human's Job.

Built from 21,000 real task records and nine years of WhatsApp conversations — a production agentic system before the industry had a name for the architecture.

Daniel Manzela · Oct 2020 — Dec 2023
The Problem

A Human Dispatcher in Every Conversation

Hasherut ran a task-fulfillment service entirely over WhatsApp. Every client request still required a person to read it, classify it, price it, and assign it. The institutional knowledge lived in chat history.

No Structured Intake

Clients messaged the company on WhatsApp; taskers fulfilled. No ticketing system. No intake form. Every request passed through a dispatcher reading messages in real time.

Classification by Memory

Hundreds of active client threads spanning 15 service categories — bureaucratic errands, vehicle work, insurance filings, repairs, deliveries. Categorization, pricing, and routing all sat in one person’s head.

Knowledge Trapped in Text

Nine years of pricing logic, escalation rules, and edge-case handling existed only as informal Hebrew chat — no database, no knowledge base, no way to onboard a new dispatcher.

Engineering objective. Convert nine years of unstructured Hebrew/English WhatsApp chat into a structured corpus, fine-tune a language model on it, and put that model behind the same WhatsApp number — replacing the dispatcher loop entirely.

How It Was Built

Four Layers: Classify, Retrieve, Execute, Verify

The same architecture now called “agentic AI with tool use” — designed and shipped in 2020–2023, before the terminology existed.

LAYER 01

Understanding · Intent Classification

Now called Intent Classification + Slot Filling
21,102
Labeled Tasks
153
Unique Clients
1,561
Action-Verb Patterns
30
Metadata Fields / Task

From 21M+ WhatsApp messages, the agent learned to recognize 1,561 unique action-verb intents.

Deep dive

Every inbound WhatsApp message routes through a classifier that places it into one of 15 service categories, auto-completes any missing fields it can infer, and asks the user to confirm before proceeding. The structural backbone the model learned to recognize was extracted from the company’s nine years of chat history.
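A minimal sketch of that routing step, assuming a keyword-scoring approach: each category carries a keyword set, the top-scoring category wins, and a weak match triggers a confirmation question back to the user. The category names and keyword lists here are illustrative English stand-ins, not the production Hebrew sets.

```javascript
// Hypothetical sketch of the Layer 1 routing step: keyword-driven category
// scoring over a normalized message, with a confirmation prompt when the
// top score is weak. Categories and keywords are illustrative stand-ins.
const CATEGORY_KEYWORDS = {
  Vehicle: ["garage", "car", "license renewal", "tire"],
  Insurance: ["claim", "policy", "insurer"],
  Delivery: ["deliver", "pick up", "package"],
};

function classify(message) {
  const text = message.toLowerCase();
  let best = { category: "Misc", score: 0 };
  for (const [category, keywords] of Object.entries(CATEGORY_KEYWORDS)) {
    const score = keywords.filter((k) => text.includes(k)).length;
    if (score > best.score) best = { category, score };
  }
  // A weak or empty match asks the user to confirm before proceeding.
  return best.score > 0
    ? { category: best.category, confirm: best.score < 2 }
    : { category: "Misc", confirm: true };
}
```

In production the scorer was a fine-tuned model rather than keyword counts, but the contract is the same: one of 15 categories out, plus a confirm-with-user flag.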

Labeled tasks per category (21,102 total):

Bureaucratic · 4,018
Insurance · 3,761
Secretarial · 3,433
Errands · 1,602
Vehicle · 1,490
Computer / IT · 1,436
Secretarial A.M. · 941
Purchasing · 876
Repairs · 764
Delivery · 718
Technician · 672
Professionals · 671
Escort · 393
Transportation · 266
Misc · 61
How the training data was built
01 · EXTRACT
WhatsApp text exports
Programmatic export of client threads spanning 2011–2020.
02 · TAG ROLES
Client vs. tasker per message
Speaker classification across mixed Hebrew / English threads.
03 · TAXONOMY
15 categories, 1,561 verbs
Hebrew keyword sets aligned to each service category.
04 · Q&A PAIRS
SQuAD-format records
VBA + Excel macros generated context / question / answer triples.
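The Q&A-pair step can be re-sketched in Node.js (the original used VBA and Excel macros): a role-tagged thread becomes SQuAD-style context / question / answer records by pairing each client message with the tasker reply that follows it. The field names and pairing rule here are assumptions about the record shape, not the production macro logic.

```javascript
// Illustrative re-sketch of step 04 in Node.js: turn a role-tagged thread
// into SQuAD-style context / question / answer records. A client message
// followed by a tasker reply becomes one Q&A pair; the full thread text
// serves as the context field.
function toSquadRecords(thread) {
  const records = [];
  for (let i = 0; i < thread.length - 1; i++) {
    const msg = thread[i];
    const reply = thread[i + 1];
    if (msg.role === "client" && reply.role === "tasker") {
      records.push({
        context: thread.map((m) => m.text).join("\n"),
        question: msg.text,
        answer: reply.text,
      });
    }
  }
  return records;
}
```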
The Hebrew NLP problem. In 2021 there was no off-the-shelf open-source toolchain for production Hebrew: tokenizers, NER, and labeled corpora were all sparse. The pipeline above is what getting it to work in production looked like.
LAYER 02

Knowledge · Retrieval-Augmented Generation

Now called Retrieval-Augmented Generation (RAG)

Fine-tuned weights handled style.
The RAG corpus handled truth.

145
Procedures
67
Formal Protocols
97
Unique Workflows
21
Procedure Categories

Constrain the model to the company’s own rules. Phrase, but don’t invent.

Deep dive

Once an intent is classified, the agent retrieves a matching procedure from an authored corpus and injects it into the prompt at request time — constraining replies to the company’s own rules instead of letting the model invent policy.
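The retrieve-and-inject step reduces to a small function. This is a minimal sketch under assumed names: the procedure store, the lookup key, and the prompt wording are stand-ins, not the production corpus or prompt.

```javascript
// Minimal sketch of Layer 2: look up the authored procedure for the
// classified intent and inject it into the prompt as grounding text.
// The store contents and prompt phrasing are illustrative assumptions.
const PROCEDURES = {
  "vehicle-service":
    "Confirm the vehicle plate, offer the authorized garage, book a slot.",
};

function buildPrompt(intent, userMessage) {
  const procedure =
    PROCEDURES[intent] ?? "No matching procedure; escalate to a human.";
  return [
    "You answer only according to the company procedure below.",
    `PROCEDURE: ${procedure}`,
    `CLIENT MESSAGE: ${userMessage}`,
    "Phrase the reply in the company's voice. Do not invent policy.",
  ].join("\n");
}
```

The division of labor is the point: the fine-tuned weights decide how the reply sounds, the injected procedure decides what it is allowed to say.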

Before “RAG” was a label. Retrieve grounding text, inject it into the prompt at request time, let the model phrase but not invent: that pattern was the working solution for keeping a fine-tuned Hebrew model honest in 2021–2022, years before the field had a name for it.
LAYER 03

Execution · Tool Use & Multi-Step Workflows

Now called Tool Use / Function Calling

Purchasing protocol

Quote → approval → fulfillment → receipt.

Vehicle service protocol

Schedules an appointment with an authorized garage, threading vehicle & client metadata.

Insurance claim filing

Collects evidence and stages the government / insurer form packet.

Monthly billing

Closes the open task ledger, reconciles, generates the period invoice.

Dissatisfied client (5 steps)

Detect → triage → remediate → escalate → close-the-loop.

Follow-up tracking

Scheduled nudges to external parties holding up an open task.

The agent runs the workflow. It doesn’t just retrieve text.

Deep dive

Each procedure is a multi-step workflow the agent runs — not a paragraph it reads back. API calls, sequential coordination, asking the user for missing information. Every variant of what to ask and when was mapped and trained.
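A procedure-as-workflow can be sketched as an ordered list of steps, each either an API call or a question that fills a missing slot. This is a hedged sketch: the step shape, slot names, and `ask` / `callApi` callbacks are assumptions, not the production workflow engine.

```javascript
// Hedged sketch of Layer 3: run a procedure as a sequence of steps.
// A step that needs a missing slot pauses to ask the user; a step with
// an api field makes the external call with the slots gathered so far.
async function runWorkflow(steps, slots, ask, callApi) {
  for (const step of steps) {
    if (step.needs && slots[step.needs] == null) {
      // Missing information: ask the user, then resume the sequence.
      slots[step.needs] = await ask(step.question);
    }
    if (step.api) {
      await callApi(step.api, slots);
    }
  }
  return slots;
}
```

For example, the vehicle-service protocol would gather the plate number before calling a hypothetical `bookGarage` API with the completed slots.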

Model evolution. Fine-tuning ran across two platforms over 12+ months: OpenAI (Babbage → Davinci → GPT-3.5 as each became available) and HF AutoTrain. Hundreds of cycles tightened Hebrew fidelity, PII redaction, conversational state, and procedure adherence, not parameter counts.
LAYER 04

Safety Gate · Human-in-the-Loop Override

Now called Human-in-the-Loop / Guardrails

Probabilistic output never escaped without a deterministic decision behind it.

Five minutes for a human to override. Otherwise, a deterministic fallback.

Deep dive

After the agent prepares a reply, a 5-minute human-override window opens. A human operator watches the session through a Debug Dashboard and can edit, reclassify, or approve. If no one intervenes, the system falls back to a deterministic message: “We’ll get back to you soon.”
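The gate itself is a race between a human decision and a timer. A minimal sketch, assuming the human decision arrives as a promise that resolves with an edited reply, with null (plain approval), or never (no one intervened); the timings and fallback text follow the description above.

```javascript
// Sketch of the Layer 4 gate: the staged reply waits out a human-override
// window; if the window expires with no intervention, a deterministic
// fallback message is sent instead.
const OVERRIDE_WINDOW_MS = 5 * 60 * 1000;
const FALLBACK = "We'll get back to you soon.";

function gate(stagedReply, humanDecision, windowMs = OVERRIDE_WINDOW_MS) {
  const windowExpired = new Promise((resolve) =>
    setTimeout(() => resolve(FALLBACK), windowMs)
  );
  return Promise.race([
    // Edited text wins; a null decision approves the staged reply as-is.
    humanDecision.then((edited) => edited ?? stagedReply),
    windowExpired,
  ]);
}
```

Whichever branch wins, the output is deterministic: an approved reply, an edited reply, or the fallback. Nothing probabilistic reaches the client unreviewed.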

REAL-TIME VIEW

Live session inspector

Inbound message, detected intent, retrieved procedure, and staged reply — all synchronized.

EDIT

Editable response field

Correct the auto-completion in place; the corrected text replaces what would have been sent.

RECLASSIFY

Intent override dropdown

If Layer 1 misjudged, pick the correct category and the workflow re-routes from there.

DEBUG

Code logs in-line

Per-session decision trace — every retrieval, every branch, every API call.

Deterministic-gate-first, in production. In 2021, most production AI shipped raw model output straight to users. Putting every probabilistic decision behind a deterministic check, with a human-readable trace and an explicit fallback, was the design choice that later scaled into the multi-agent DAGs at the Pipeline Observatory.
System Architecture

The Production Flow

WhatsApp in, four layers in sequence, then either a task created in Monday.com or a deterministic fallback message back to the client — served by AWS Lambda over Twilio.

PRODUCTION FLOW

01 · WhatsApp message in · Hebrew, informal, multimodal references
02 · LAYER 01 · Understanding: classify into 1 of 15 categories, auto-complete, ask user to confirm (21,102 examples · intent detection · auto-complete · confirm with user)
03 · LAYER 02 · Knowledge: retrieve the matching procedure, inject into prompt context (145 procedures · 67 protocols · 97 workflows · style ≠ truth)
04 · LAYER 03 · Execution: run the multi-step workflow, call APIs, gather missing info from the user (tool use · plan steps · API calls · sequential / async tasks)
05 · LAYER 04 · Safety Gate: 5-minute human override window, default fallback if no intervention (Debug Dashboard: live session view · edit / reclassify: human override · send or fallback: deterministic gate)
06 · Task created in Monday.com (type, urgency, client, attachments · via API) and a WhatsApp completion message sent through the same client number

PRODUCTION ACTIVE · OCT 2020 — DEC 2023

Serverless Compute

AWS Lambda (Node.js) handles every Twilio webhook — orchestrates retrieval, calls the model, posts the reply. Pay-per-invocation matched the conversational workload.

Inference Layer

Fine-tuned OpenAI model for in-domain replies, with retrieval against the 145-procedure corpus injected into prompt context at request time.

Operational Integration

Monday.com for task creation, Wix as the client-facing CRM surface, Connecteam for tasker management.

Stack
WhatsApp / Twilio · channel
AWS Lambda · compute
Node.js · Serverless · runtime
OpenAI Fine-Tuning · model
HF AutoTrain · alt training
Monday.com API · tasks / CRM
Wix · customer surface
Connecteam · operations
Architecture note · why serverless

Lambda + Twilio kept the production surface a thin proxy — no servers to maintain, no idle cost between conversations, and a clean re-deploy path each time a new fine-tuned model checkpoint replaced its predecessor. The agent could be updated independently of the messaging surface, the retrieval layer, or the operational integrations.
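The thin-proxy shape can be sketched as a single Lambda handler that runs the four layers in sequence. The layer functions below are trivial stand-ins so the handler runs end-to-end; the production system swapped in the real classifier, RAG prompt builder, fine-tuned model call, and override gate. The TwiML response shape follows Twilio's webhook convention; everything else here is an assumed sketch.

```javascript
// Assumed shape of the thin Lambda proxy: a Twilio webhook handler that
// runs Layers 1-4 in order. Each layer function is a trivial stand-in.
const classify = (msg) => "vehicle-service";                  // Layer 1 stub
const buildPrompt = (intent, msg) => `${intent}: ${msg}`;     // Layer 2 stub
const callModel = async (p) => `Staged reply for ${p}`;       // Layer 3 stub
const gate = async (staged) => staged;                        // Layer 4 stub
const twiml = (reply) =>
  `<?xml version="1.0"?><Response><Message>${reply}</Message></Response>`;

const handler = async (event) => {
  const params = new URLSearchParams(event.body); // Twilio posts form-encoded
  const message = params.get("Body") ?? "";
  const intent = classify(message);
  const prompt = buildPrompt(intent, message);
  const staged = await callModel(prompt);
  const reply = await gate(staged);
  return {
    statusCode: 200,
    headers: { "Content-Type": "text/xml" },
    body: twiml(reply),
  };
};
exports.handler = handler;
```

Because each layer is a plain function behind the handler, any one of them (a new model checkpoint, a revised corpus) can be redeployed without touching the messaging surface.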

Reflection

What Three Years Taught

Three architectures evaluated. One went to production.

Each approach was judged on three questions: fine-tunable on Hebrew? encodes the 145 procedures? training and inference unified on one platform?

APPROACH 01 · Dante AI (no-code chatbot builder · native Twilio)
Fine-tunable on Hebrew? No · black-box platform.
Encodes 145 procedures? No · no surface to inject dispatcher logic.
Training ↔ inference unified? N/A · no training surface.
Verdict: Ruled out. Fastest pilot in week one; generalized at the cost of every specialization that mattered.

APPROACH 02 · AutoTrain + ChatGPT (HF AutoTrain → ChatGPT inference)
Fine-tunable on Hebrew? Yes · iterated dataset versions and snapshots.
Encodes 145 procedures? Yes · via prompt context at inference.
Training ↔ inference unified? No · two platforms; manual loop end-to-end.
Verdict: Ruled out. No automated quality loop; cycle friction killed iteration speed.

APPROACH 03 · AWS Lambda + OpenAI (Lambda · Twilio · fine-tuned OpenAI · RAG) · PRODUCTION
Fine-tunable on Hebrew? Yes · OpenAI fine-tuning API.
Encodes 145 procedures? Yes · RAG corpus injected per request.
Training ↔ inference unified? Yes · one platform for both; cycle friction eliminated.
Verdict: Shipped. Production Oct 2020 – Dec 2023.

Ruled-out approaches were never wasted: the lessons accumulated on the AutoTrain branch fed directly into the production stack.

Three lessons that outlasted the chatbot
01
Data quality eats model quality. The bottleneck was never the model. It was the SQuAD pipeline producing clean training pairs from informal multilingual chat — and a Hebrew NLP toolchain that didn’t exist off-the-shelf in 2021.
02
Style and truth are different problems. Fine-tuning shapes how the agent talks; retrieval determines what it’s allowed to say. Conflate them and operational policy gets violated in ways the team only discovers after a client complaint.
03
The safety gate is the architecture. A 5-minute override isn’t a feature you bolt on — it’s the gate that lets a probabilistic system run in production without a brand-trust failure. The deterministic fallback was what made the model safe to ship, not the model itself.

The agent handled live client requests over WhatsApp through December 2023: more than three years in production, replacing the full-time human dispatcher loop.

This four-layer pattern — classify, retrieve, execute, verify — became the foundation for the deterministic-gate-first architecture later applied at scale in the Pipeline Observatory and the multi-agent DAGs of agent-dag-pipeline.