tech journal

AI contract playbooks 101

I got Claude (with my input based on my own experiences) to help me write an explainer on how AI-assisted contract playbooks actually work. Here is the post, below.

How AI contract playbooks actually work

You've probably seen the demos. Upload a contract, point the AI at your playbook, and it tells you what's wrong. Maybe it even redlines the document for you. Clean, fast, impressive.

What you probably haven't seen is an honest explanation of what's happening underneath. Most of the material out there is either vendor marketing or academic papers. This post is neither. It's a plain-language walkthrough of the engineering problem that every AI contract review tool has to solve, the constraints it works under, and why some approaches produce better results than others.

If you're evaluating these tools, building a business case for one, or just tired of nodding along in demos without understanding what you're buying, this should help.

The constraint that shapes everything

Every AI contract review tool is built around one limitation: the context window.

A context window is the amount of text an AI model can "see" at one time. Think of it as a desk. Whatever's on the desk, the AI can read and reason about. Whatever's not on the desk doesn't exist.

Modern models advertise context windows of 128,000 or 200,000 tokens. A token is roughly three-quarters of a word, so 200,000 tokens works out to about 150,000 words, or 300-400 pages. Some models now claim context windows of 1 million tokens or more. That's north of 700,000 words. You could fit several novels in there. More than enough for any contract and playbook, surely?

No.

200,000 tokens of attention is not 200,000 tokens of comprehension

Here's the part the marketing leaves out. A model can technically accept 200,000 tokens of input. Or a million. But its ability to pay equal attention to all of that text degrades as you stuff more in. The term for this is "attention dilution."

The analogy is reading. Give someone a 10-page memo and quiz them on it. They'll do well. Give them a 350-page document and quiz them the same way. They'll remember the beginning. They'll remember the end, because it was recent. The middle? Hazy at best. Now imagine handing them three novels and asking about a specific paragraph in chapter 14 of the second one. Good luck.

AI models behave the same way. Research consistently shows that models perform best on material near the start and end of their context window, with a degraded zone in the middle (sometimes called the "lost in the middle" problem). A 1-million-token context window doesn't fix this. It just makes the middle bigger. You can fit more text in, but the model's ability to attend to any specific part of that text doesn't scale at the same rate. It's a bigger desk, but the person sitting at it still has the same pair of eyes.

So the effective window of strong recall is much smaller than what's advertised. Exactly how much smaller depends on the model and the task, and newer models are getting better at this. But the principle holds: the more you stuff in, the less reliably the model attends to any given part of it. Even with the best current models, you wouldn't want to bet on perfect recall across 300 pages of dense legal text. Let alone 1,500.

A serious commercial contract is 60-100 pages. A thorough playbook might be another 30-50. Together, they push well past the point where you can trust the model to catch everything, even if the text technically fits within the context window.

The other constraint: LLMs have no memory

There's a second limitation that's easy to miss if you've only used ChatGPT casually.

Large language models don't remember anything between calls. Each time you send the model a request, it starts completely fresh. It has no recollection of what you asked it five seconds ago unless you explicitly include that prior exchange in the new request.

If you use ChatGPT or Claude and it seems to "remember" your earlier messages, that's because the application is quietly re-sending your entire conversation history with each new message. The model itself has no persistent memory. It's reading the whole conversation from scratch every time.
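To make that concrete, here's a minimal sketch of what a chat application does behind the scenes. The call_model function is a hypothetical stand-in for whatever API the model provider exposes; the shape of the messages list is the point.

```python
# Minimal sketch of why a chat app appears to "remember": it re-sends the
# whole conversation on every call. call_model() is a hypothetical stand-in
# for a real model API.
def call_model(messages: list[dict]) -> str:
    return f"(reply generated after reading all {len(messages)} messages)"

history = [{"role": "system", "content": "You are a contract review assistant."}]

def ask(question: str) -> str:
    history.append({"role": "user", "content": question})
    reply = call_model(history)   # the model sees the ENTIRE history, every time
    history.append({"role": "assistant", "content": reply})
    return reply

ask("Summarise clause 12.")
# The follow-up only "works" because the first exchange is re-sent along with it.
ask("Is that consistent with the indemnity in clause 22?")
```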

For contract review, this creates a specific problem. If you ask the model to review clause 47 of a contract, it doesn't inherently "know" what was in clauses 1 through 46 unless you include them. And if your playbook says "the limitation of liability clause must be consistent with the indemnity clause," the model can't check that unless both clauses are in front of it at the same time.

Everything the model needs to know for a given task must be placed on the desk for that specific call. Nothing carries over.

So what's the actual problem?

You have a playbook (potentially long). You have a contract (potentially long). The AI can only look at a limited amount of text at once with full attention, and it forgets everything between calls.

Getting reliable, thorough review out of this setup is an engineering problem, not an intelligence problem. The AI is smart enough to compare a clause against a playbook instruction. The hard part is getting the right pieces of text in front of it at the right time, within the constraints of what it can reliably process.

Every AI contract review tool is solving some version of this problem. They differ in how.

Before any AI gets involved: parsing the document

There's a step that comes before any of the approaches below, and it's easy to overlook because it sounds mundane: the system has to break the contract into its component clauses.

This is harder than it sounds. Contracts don't follow a standard format. One agreement numbers its clauses 1, 2, 3. Another uses 1.1, 1.1.1, 1.1.2. A third buries operative provisions in its schedules and appendices. Some clauses are a single sentence. Others run for two pages with nested sub-clauses, provisos, and carve-outs.

The system needs to decide: what counts as a "unit" to send to the AI? Too granular (individual sentences) and the AI loses the thread of what a clause is actually saying. Too coarse (whole sections) and you're back to stuffing too much into the context window.

Getting this wrong has downstream consequences. If the parser splits a clause in the wrong place, the AI reviews an incomplete thought. If it merges two separate clauses into one chunk, the AI might miss that one of them has a playbook issue.

Enterprise tools spend serious engineering effort on this parsing step. Some use rule-based approaches (looking for numbering patterns, heading styles, indentation). Some use separate AI models trained specifically on document structure. Many use a combination.
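To give a flavour of the rule-based approach, here's a deliberately simplified sketch that splits on decimal clause numbering. Real parsers also use heading styles, indentation, and trained models, and cope with far messier inputs than this.

```python
import re

# Very simplified rule-based clause splitter: treat a line that starts with a
# numbering pattern ("1.", "3.2", "10.4.1") as the start of a new clause.
# Real systems also use heading styles, indentation, and trained models.
CLAUSE_START = re.compile(r"^\s*(\d+(?:\.\d+)*)[.)]?\s+")

def split_into_clauses(contract_text: str) -> list[dict]:
    clauses, current = [], None
    for line in contract_text.splitlines():
        match = CLAUSE_START.match(line)
        if match:
            if current:
                clauses.append(current)
            current = {"number": match.group(1), "text": line.strip()}
        elif current:
            # Continuation of the current clause: append the line to it.
            current["text"] += " " + line.strip()
    if current:
        clauses.append(current)
    return clauses
```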

The docx problem

But there's a deeper issue that most people don't think about at all: the file format itself.

Contracts almost always arrive as .docx files. A .docx file looks like a neatly formatted Word document to a human. Underneath, it's a zip archive containing XML files. The structure of that XML often bears little resemblance to what you see on screen. A paragraph that looks like a single clause to your eyes might be split across multiple XML nodes for reasons that have nothing to do with legal structure and everything to do with formatting history, tracked changes, or how Word chose to serialise the content.

The specification that governs .docx files is called OOXML (Office Open XML), maintained as ECMA-376. It runs to about 6,000 pages. That's not a typo. Six thousand pages of specification for a document format. And Microsoft's own implementation doesn't fully conform to the standardised version, which means that the actual behaviour of a Word document in practice can differ from what the spec says it should do.

Why does this matter for AI contract review? Because the AI never sees the .docx file directly. It works with text. So someone (or something) has to extract the text from the XML, figure out what the structure is, and present it to the AI in a way that preserves the logical structure of the contract. That extraction step is where things go wrong more often than people realise.
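You can see this for yourself with nothing more than the standard library: a .docx really is a zip archive, and the visible text lives in word/document.xml as WordprocessingML markup. The sketch below pulls out paragraph text and nothing else; it deliberately ignores tables, tracked changes, footnotes, and numbering, which is exactly where naive extraction goes wrong.

```python
import zipfile
import xml.etree.ElementTree as ET

# A .docx file is a zip archive; the body text lives in word/document.xml.
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"

def extract_paragraphs(path: str) -> list[str]:
    with zipfile.ZipFile(path) as docx:
        xml_bytes = docx.read("word/document.xml")
    root = ET.fromstring(xml_bytes)
    paragraphs = []
    for p in root.iter(f"{W}p"):                         # each w:p is a paragraph
        runs = [t.text or "" for t in p.iter(f"{W}t")]   # w:t elements hold the text
        text = "".join(runs).strip()
        if text:
            paragraphs.append(text)
    return paragraphs
```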

The second problem: writing changes back

Parsing the document is only half the challenge. Once the AI has identified issues and suggested amendments, those changes need to be written back into the .docx file. And this is where things get really difficult.

The AI thinks in text. It can tell you "this indemnity clause should say X instead of Y." But to turn that into a redlined Word document with tracked changes, the system needs to find the exact location in the underlying XML, modify the right nodes, insert the tracked-change markup in the correct format, and do all of this without breaking the rest of the document's formatting, styles, numbering, or cross-references.

This requires an intermediate representation, or IR. The system needs to build an internal model of the document that it can both read from (for the AI to analyse) and write to (to apply the AI's suggestions). Building a good IR for .docx files is genuinely hard engineering. Get it wrong and you end up with documents that look corrupted when opened in Word, or tracked changes that don't display properly, or formatting that subtly shifts in ways that make lawyers nervous.
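What the IR looks like varies from tool to tool, and the sketch below is purely illustrative. The essential idea is that each logical clause keeps a pointer back to the XML elements it came from, so a suggestion made at the clause level can later be written back to the right place.

```python
from dataclasses import dataclass, field

# Hypothetical, stripped-down intermediate representation: each logical clause
# remembers which paragraph elements in document.xml it came from, so an
# AI-suggested edit can be written back to the right XML nodes.
@dataclass
class ClauseNode:
    clause_id: str                      # e.g. "12.3"
    text: str                           # plain text the AI reads and reasons about
    xml_paragraph_ids: list[int]        # indices of the w:p elements behind it
    suggested_edit: str | None = None   # filled in after AI review

@dataclass
class ContractIR:
    source_path: str
    clauses: list[ClauseNode] = field(default_factory=list)

    def pending_edits(self) -> list[ClauseNode]:
        """Clauses that still need to be written back as tracked changes."""
        return [c for c in self.clauses if c.suggested_edit is not None]
```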

This is a problem that tools like SuperDoc (an open-source document editing library) are specifically designed to solve. SuperDoc works with the .docx format natively rather than converting to HTML and back, which means it can open a contract, let an AI make changes programmatically, and export a redlined .docx with proper tracked changes, without mangling the underlying document structure. It runs headless (no user interface required), so an AI agent can use it as part of an automated pipeline: open contract, apply AI suggestions as tracked changes, export the result for a lawyer to review in Word.

The point isn't to promote any particular tool. It's to make visible a layer of complexity that most discussions about AI contract review skip entirely. The gap between "the AI identified an issue" and "here's a properly redlined Word document you can send to the counterparty" is wider than it appears, and the quality of the engineering at this layer directly affects whether the output is something a lawyer can actually use.

Approach 1: put everything in at once

The simplest approach is brute force. Take the entire playbook and the entire contract, put them both in the context window, and ask the model to review the contract against the playbook.

If your playbook is 5 pages and your contract is 15 pages, this can work. The combined text fits comfortably within the reliable attention zone. One call, one pass, done.
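In code, the brute-force version is barely more than string concatenation, which is a big part of its appeal. The call_model function below is, again, a hypothetical stand-in for a real model API.

```python
# Approach 1 in miniature: one prompt, everything included.
# call_model() is a hypothetical stand-in for a real model API.
def call_model(prompt: str) -> str:
    return "(model output would appear here)"

def review_whole_contract(playbook: str, contract: str) -> str:
    prompt = (
        "You are reviewing a contract against a playbook.\n\n"
        f"PLAYBOOK:\n{playbook}\n\n"
        f"CONTRACT:\n{contract}\n\n"
        "List every clause that departs from the playbook and suggest amendments."
    )
    # Works for short documents; reliability degrades as the combined length grows.
    return call_model(prompt)
```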

In practice, this almost never works for real commercial contracts with real playbooks. The documents are too long. You end up in the "lost in the middle" zone, and the model starts missing things. It might catch the issues in the first few clauses and the last few clauses, but quietly skip problems in the middle of the document.

It's worth knowing this approach exists because it's the baseline. If someone tells you their tool "just sends the whole document to the AI," now you know why that's a problem for longer documents.

Approach 2: clause by clause, with the full playbook

If the playbook alone fits within the reliable attention zone (say, under 30 pages), you can take a different approach. Instead of sending everything at once, you walk through the contract one clause at a time.

Each call looks something like this: here is the full playbook, and here is clause 7 of the contract. Does this clause comply with the playbook? If yes, note that and move on. If no, flag the issue and suggest amendments. If the clause isn't relevant to any playbook item, skip it.

You do this for every clause (or batch of related clauses) in the contract. At the end, you check: are there playbook items that nothing matched? Those are your missing clauses. The playbook says the contract should have a force majeure clause, but nothing in the contract triggered that playbook item. That's a gap.

The results from each call get written to a running record (engineers call this a "state file"). When all clauses have been processed, you have a complete picture.
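Sketched out, with a hypothetical call_model stand-in and the clause list from the parsing step, the loop looks roughly like this. The "state file" is nothing more exotic than a record that accumulates as the loop runs.

```python
import json

# Approach 2 in miniature: full playbook + one clause per call, results
# accumulated into a running record ("state file"). call_model() is a
# hypothetical stand-in for a real model API.
def call_model(prompt: str) -> str:
    return json.dumps({"matched_playbook_items": [], "issues": []})

def review_clause_by_clause(playbook: str, clauses: list[dict]) -> list[dict]:
    state = []
    for clause in clauses:
        prompt = (
            f"PLAYBOOK:\n{playbook}\n\n"
            f"CLAUSE {clause['number']}:\n{clause['text']}\n\n"
            "Which playbook items does this clause relate to, and does it comply? "
            "Reply as JSON with 'matched_playbook_items' and 'issues'."
        )
        state.append({"clause": clause["number"],
                      "result": json.loads(call_model(prompt))})
    return state

def unmatched_playbook_items(state: list[dict], all_items: list[str]) -> list[str]:
    """Playbook items that nothing in the contract matched -- the missing clauses."""
    matched = {i for entry in state for i in entry["result"]["matched_playbook_items"]}
    return [item for item in all_items if item not in matched]
```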

This approach is thorough. The model sees the full playbook every time, so it has good context for what to look for. And because it only sees one clause (or a small batch) at a time, attention dilution is minimal.

The downside is cost and speed. A 60-page contract with 80 clauses means 80 separate AI calls, each one carrying the full playbook. That's a lot of processing time and a lot of API spend. For a single review it's manageable. For a team running hundreds of contracts through the system, it adds up.

Approach 3: smarter matching

When the brute-force clause-by-clause approach is too slow or expensive, or when the playbook itself is too long to fit in the reliable window, you need the system to be more selective about what it compares.

The goal is to avoid checking every clause against every playbook item. Instead, you pre-match: figure out which playbook items are probably relevant to which clauses before you involve the main AI model. Then you only send the AI the pairings that matter.

Several techniques exist for this, and a well-built system typically uses more than one:

Taxonomy-based matching. You categorise both your playbook items and contract clauses by type: indemnity, limitation of liability, termination, confidentiality, and so on. Then you only compare items in the same category. An indemnity playbook rule only gets checked against clauses that look like indemnity clauses. This is crude but effective for well-structured documents.

Semantic similarity. This uses a separate, smaller AI model (called an "embedder") to measure how close two pieces of text are in meaning. The embedder converts text into a numerical representation, and you can then calculate how "similar" any two texts are, the way you might plot two points on a map and measure the distance between them. If a playbook item about "limitation of liability" scores highly against clause 12 but poorly against clause 30, you know to pair the playbook item with clause 12 and skip clause 30. This is more flexible than taxonomy matching because it doesn't depend on rigid categories. It can catch cases where a limitation of liability concept shows up inside a clause that's technically labelled as something else.

Term overlap. The simplest version: count how many important words two texts share. If your playbook item mentions "indemnify," "hold harmless," and "third party claims," and a contract clause contains all three phrases, there's probably a match worth investigating. Engineers sometimes use a measure called "Jaccard similarity" for this, which is just a ratio of shared terms to total terms. It's fast, doesn't require any AI, and works as a good first-pass filter.
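For the curious, Jaccard similarity is a near one-liner: the overlap between two term sets divided by their union.

```python
# Jaccard similarity between two texts: shared terms / total distinct terms.
def jaccard(text_a: str, text_b: str) -> float:
    terms_a = set(text_a.lower().split())
    terms_b = set(text_b.lower().split())
    if not terms_a or not terms_b:
        return 0.0
    return len(terms_a & terms_b) / len(terms_a | terms_b)

# e.g. jaccard(playbook_item_text, clause_text) > 0.1 as a cheap first-pass filter
```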

In practice, these techniques get stacked. A fast term-overlap check eliminates obvious non-matches. Semantic similarity ranks the remaining candidates. The main AI model does the actual substantive review on the top matches only. Each layer filters more noise, so the expensive AI calls happen only where they're genuinely needed.
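Stitched together, the stack might look something like the sketch below, reusing the jaccard function from above. The embed function is a hypothetical stand-in for whatever embedding model a tool actually uses, and the thresholds are illustrative rather than recommendations.

```python
import math

# Hypothetical stand-in for an embedding model: returns a numeric vector.
def embed(text: str) -> list[float]:
    raise NotImplementedError("replace with a real embedding model")

def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def candidate_pairs(playbook_items: list[str], clauses: list[str]) -> list[tuple[int, int]]:
    pairs = []
    for i, item in enumerate(playbook_items):
        # Layer 1: cheap term-overlap filter (jaccard() from the earlier sketch).
        survivors = [j for j, clause in enumerate(clauses) if jaccard(item, clause) > 0.05]
        if not survivors:
            continue
        # Layer 2: rank the survivors by semantic similarity, keep the top few.
        item_vec = embed(item)
        ranked = sorted(survivors,
                        key=lambda j: cosine_similarity(item_vec, embed(clauses[j])),
                        reverse=True)
        pairs.extend((i, j) for j in ranked[:3])
    # Layer 3 (not shown): only these pairs go to the main model for substantive review.
    return pairs
```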

Approach 4: the orchestrator pattern

For the most complex scenarios (very long playbooks, very long contracts, lots of cross-references and defined terms), some systems use a hierarchical approach.

Instead of a single AI doing everything, you have a structure like this: one "orchestrator" model that understands the full document at a high level, and multiple smaller "sub-agent" models that handle specific tasks.

The orchestrator's job is to maintain the big picture. It knows the contract's overall structure, where the key definitions are, how clauses cross-reference each other, and what the playbook requires at a high level. It breaks the review into tasks and assigns them to sub-agents.

Each sub-agent gets a narrow assignment: "Review clauses 12-15 against playbook items 3 and 7. Note: 'Supplier' is defined in clause 1 as [definition]. The indemnity cap is addressed separately in clause 22." The sub-agent does its work within a small, focused context window where attention is high and recall is reliable.

The orchestrator then collects the sub-agents' outputs, reconciles any conflicts, and handles document-level concerns. Did two sub-agents interpret the same defined term differently? Did one sub-agent flag an issue that's actually resolved by a clause that a different sub-agent reviewed? The orchestrator catches these.
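Here's a heavily simplified sketch of the shape of this, with hypothetical task and sub-agent stand-ins. In a real system, the orchestrator's hard work is deciding what goes into each assignment and reconciling the results afterwards.

```python
from dataclasses import dataclass

# Hypothetical, heavily simplified orchestrator shape.
@dataclass
class SubAgentTask:
    clause_range: tuple[int, int]   # e.g. clauses 12-15
    playbook_item_ids: list[str]    # e.g. ["item-3", "item-7"]
    context_notes: list[str]        # relevant definitions, cross-references

def run_subagent(task: SubAgentTask, contract_clauses: list[str]) -> dict:
    """Stand-in for a focused model call over a small, high-attention context."""
    return {"task": task, "findings": []}

def orchestrate(tasks: list[SubAgentTask], contract_clauses: list[str]) -> list[dict]:
    findings = [run_subagent(task, contract_clauses) for task in tasks]
    # Reconciliation pass: the orchestrator resolves conflicts between sub-agents,
    # e.g. two agents reading the same defined term differently, or an issue one
    # agent flagged that a clause reviewed by another agent actually resolves.
    return findings
```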

This is the most complex approach to build. It's also the one that best handles the messiest, most cross-referenced documents. The analogy to a senior lawyer managing junior associates isn't accidental: the architecture mirrors how large-scale document review already works in practice, just faster.

The thing nobody talks about: your playbook matters more than the AI

All of the above assumes something that usually goes unsaid: the playbook is clear, consistent, and well-structured.

Most playbooks aren't.

Corporate playbooks tend to accumulate over years. Different lawyers add different items. Positions shift but old entries don't get removed. You end up with a playbook that contradicts itself in places. Maybe one section says the liability cap should be 100% of contract value while another section, added two years later for a different deal type, says 200%. Or a playbook instruction says "ensure the indemnity is reasonable" without defining what "reasonable" means in context.

An AI model won't flag these inconsistencies. It will just pick an interpretation, apply it, and move on. It doesn't know your playbook contradicts itself. It doesn't know that "reasonable" means something different to your M&A team than to your procurement team. It will confidently apply whatever reading seems most plausible given the text. You'll never know it made a choice unless you catch it in review.

None of the engineering approaches above fix this. You can have the most elegant orchestrator architecture in the world, and the results will still be unreliable if the playbook is a mess.

If you're serious about deploying AI contract review, the highest-value activity is one that has nothing to do with AI: cleaning up your playbook. Make instructions specific. Remove contradictions. Define your terms. Structure it so categories are clear. It isn't exciting work, but it determines whether the AI produces useful output or expensive noise.

What to take away from all this

You don't need to become an engineer to evaluate these tools. But understanding the constraints makes you a better buyer and a better user.

When a vendor tells you their tool "uses the latest AI model with a 200,000-token context window," you now know that context window size alone says very little about review quality. What matters is how the tool manages attention within that window.

When a demo processes a 5-page NDA flawlessly, you know that's the easy case. Ask what happens with a 90-page outsourcing agreement against a 40-page playbook. Ask how it handles documents that exceed the reliable attention zone. Ask whether it checks for missing clauses or only reviews what's present. Ask how it manages defined terms that appear in one clause but are relevant to a dozen others.

And take an honest look at your playbook. Most organisations that struggle with AI contract review assume the problem is the technology. Usually, it's the playbook.