All articles
Voice Matching

How ForthWrite Learns Your Email Voice: RAG, Edit-Distance Scoring, and a Phrasing Miner

A technical walkthrough of the feedback loop ForthWrite uses to learn your email voice: pgvector RAG with MMR re-ranking, edit-distance scoring, and a deterministic phrasing miner.

5 min read·

I have been chasing a specific problem for about a year: email assistants write a professional email, not your email. The tool produces something correct, polished, and obviously not from you. You edit it. You send it. You edit the next one. Nothing improves.

The gap is not the underlying model. It is the feedback loop. Most tools do not have one.

Here is how ForthWrite closes it.

The core problem with "learns your voice"

Every AI email tool claims to learn your voice. The claim describes very different things depending on the tool.

Prompt engineering means you write a system prompt describing your style. "Direct tone. No em-dashes. Under 100 words." The model follows instructions but does not update. You hit the same ceiling on every draft.

Fine-tuning actually updates model weights on your data. It could work, but it costs significant compute per user and does not update dynamically as your writing evolves. Not viable at consumer scale.

RAG (retrieval-augmented generation) is what ForthWrite uses: pull real examples of your writing into every prompt at generation time, so the model is imitating your actual sent email rather than a description of it.

RAG alone is not enough either. The retrieval has to be good, and "good" retrieval changes as your habits change. That requires a feedback mechanism.

How ForthWrite builds your corpus

On first connect, we batch-import your Gmail Sent folder. Each message is normalized (HTML stripped, forwarded sections and signatures removed), embedded with text-embedding-3-small, and stored in Postgres with pgvector.

This gives the model hundreds of real examples of your voice before you send a single AI-assisted reply. Most tools ask you to paste a few writing samples manually. We do it automatically, and we do it at scale.

RAG retrieval on every draft

When you hit Draft, ForthWrite:

  1. Embeds the incoming email thread
  2. Queries pgvector for a candidate pool of cosine-similar matches from your sent history
  3. Re-ranks using MMR (Maximal Marginal Relevance) to avoid returning near-duplicate examples that would add redundant signal

The final output is a small block of few-shot examples injected into the system prompt. We enforce a minimum confidence threshold before injecting anything. If no match clears the bar, we inject nothing rather than risk the model imitating an irrelevant old reply.

The MMR step is important. Cosine similarity alone tends to cluster: the top five results are often slight variations of the same email. MMR trades off similarity against diversity, so the few-shot block covers more of your stylistic range rather than over-representing one thread type.

Edit-distance scoring: the feedback loop

Every time you edit an AI draft and hit Send, ForthWrite captures both versions.

We compute a normalized Levenshtein edit-distance score between the generated draft and what you actually sent:

  • Score near 1.0: you sent it nearly verbatim. The draft was right.
  • Score near 0: you rewrote most of it. The draft missed.

These (generated, sent) pairs accumulate in a table. Low-scoring pairs (heavy edits) feed an async optimizer that analyzes the gap and suggests system-prompt updates. High-scoring pairs (near-verbatim sends) go into the retrieval corpus as positive few-shot signals, so future RAG pulls lean toward what already worked.

The phrasing miner

RAG handles situational context well. But there is a different signal hiding in near-miss drafts, the ones where you kept 95% of the text and only swapped two words.

A cosine diff on two emails that differ by "no worries at all" versus "no worries" looks identical. Those pairs never surface in a "find bad drafts" pass. Standard edit-distance bucketing misses them too, because the overall score is high (most of the draft was fine).

So we run a separate miner: a word-level LCS (longest common subsequence) alignment between the AI draft and your sent version. We collect all deletions and substitutions, count which ones recur across your corpus, and measure consistency.

Of the 12 times the AI wrote "circle back," how many did you remove? Of the 8 times it wrote "I hope this finds you well," how many did you delete? Anything above a frequency and consistency threshold gets written to your system prompt as an explicit avoid or prefer rule.

No LLM involved in this step. Pure deterministic diff. It catches the small, recurring patterns that you would never think to describe but always clean up by hand.

The convergence problem

This creates a non-obvious dynamic: as the model gets better, you make fewer edits, which produces fewer correction signals. The optimizer starves itself on success.

We handle it by running phrasing mining on a time-based cadence independent of edit-distance score. Users in a steady state (high scores, few edits) still get periodic refinement passes over recent sent email. The feedback loop does not collapse just because it is working.

It is an open problem in a broader sense. If you are doing something similar with RLHF or preference modeling on sparse positive feedback, I would be curious how you have approached it.

Stack

  • Chrome extension (esbuild)
  • Next.js 16 API backend
  • Supabase (Postgres + pgvector)
  • Anthropic Claude as primary model with prompt caching on the system block for cost efficiency
  • OpenAI text-embedding-3-small for retrieval

The prompt caching piece matters more than it sounds. The system block (your persona, phrasing rules, few-shot examples) is the same across multiple drafts in a session. Anthropic's cache TTL means we pay for that block once per hour rather than on every call. At scale, this is the difference between an affordable product and one that costs more to run than it earns.

Try it

ForthWrite is a Chrome extension for Gmail. 14-day free trial, no card required to start.

forthwrite.ai

Happy to answer questions about the retrieval architecture, the phrasing miner, or the caching design.

More in Voice Matching

Free tool

Ready to stop sounding like everyone else?

Build a first-person persona prompt that captures your voice in under 5 minutes. No account required.

Generate my prompt