Test Driven Writing (or Test Driven Documentation)
8 minute read | 1542 words | by Ruben Berenguel

During the writing (and rewriting) of my last two posts I had this idea: what if we could leverage LLMs for a closed-loop rewriting cycle?
Or more precisely: What if we could use LLMs as the evaluator of a test-driven cycle, for writing?
Test-driven development (TDD) is a development technique where you first write a test and then write the code that fulfills that test. For a non-code, real-life example: writing the test is like measuring the space you have for a cabinet, fixing a set of requirements. Then you go to IKEA. If you find a cabinet that fills the space perfectly, you are done and the test passes. In TDD, you keep refining the tests. Continuing the example: you actually want a beige cabinet, so finding one that fits is not enough.
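For readers who have never seen the red-green loop in code, here is a minimal, hypothetical example in Python (the slugify function and its test are invented for illustration):

```python
# Step 1 (red): write the test first. It fails, because slugify does not exist yet.
def test_slugify():
    assert slugify("Test Driven Writing") == "test-driven-writing"

# Step 2 (green): write just enough code to make the test pass.
def slugify(title: str) -> str:
    return title.lower().replace(" ", "-")

# Step 3 (refine): tighten the test (e.g. require punctuation to be dropped),
# watch it fail again, and improve slugify until it passes.
```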
If we extend the concept to writing, we could think of the test(s) as, at a minimum, knowing what we will write about and what points we want to cover. Then, we need to write something that talks about that while covering those points. And refine.
If this doesn’t sound like a new idea, that’s because it really is not. In technical writing, it is a relatively common recommendation to state your target user(s), your writing goal, and potentially a list of points to cover at the beginning of a document, and then to compare your writing against that. Even in essay writing, starting with an outline is traditional. What I had in mind was pushing this further.
It looks like I’m not the first one to think about test-driven writing. When I had the thoughts above, I went searching and found several articles where people mention something similar:
- Test-Driven Writing (training slides by M. Herman)
- Test-Driven Writing: TDD appliqué à l’écriture (roughly, “TDD applied to writing”; by X. Pigeon, in French)
- Publish Iteratively (by C. Tietze)
How can we test-drive writing?
The basics
The following is the process I would follow if I had infinite time.
- Write a first draft, enough that I have an outline and an idea of where I want to go.
- At the beginning of the document, write a document-level test, like:
This post should explain how we could do something similar to test-driven development for writing. It should have some examples and give enough ideas to readers to let them start with this process. The key idea the post should end up with is that there are ways to systematize writing.
- For each section, add a section-level test, like:
This section should cover the “how” of the TDW (test-driven writing) process. It should cover all steps from document idea to publishing.
- Rewrite each section until it passes the test. If needed, rewrite or refine the test to make sure new material is added, like:
(…) This section should also cover how we can automate part of this process.
- To speed the process up, use an LLM like Gemini to evaluate your tests against each section, and against the full document (a sketch of this loop follows this list).
- Remember that you are the writer: tone, structure, and fact-checking still rely on you.
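Here is that sketch, a minimal Python version of the loop. It assumes the google-generativeai package and an illustrative convention where each section starts with `## ` and carries its test on a `Test:` line; both the parsing and the prompt wording are placeholders, not exactly what I use.

```python
import os
import re

import google.generativeai as genai  # pip install google-generativeai

genai.configure(api_key=os.environ["GEMINI_API_KEY"])
model = genai.GenerativeModel("gemini-1.5-flash")  # any capable model works

def evaluate(test: str, section: str) -> str:
    """Ask the LLM whether the section satisfies its test."""
    prompt = (
        "You are a test evaluator for a piece of writing.\n"
        f"Test: {test}\n\nSection:\n{section}\n\n"
        "Answer PASS or FAIL, with a one-line reason and suggested fixes."
    )
    return model.generate_content(prompt).text

document = open("draft.md").read()
# Split into sections on '## ' headers; evaluate any section with a Test: line.
for chunk in re.split(r"(?m)^## ", document)[1:]:
    header, _, body = chunk.partition("\n")
    test = re.search(r"Test: (.+)", body)
    if test:
        print(f"## {header}\n{evaluate(test.group(1), body)}\n")
```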
How can we write “good” tests?
To be fair, this is hard. I’m still not sure what is easier: writing tests for LLMs or for humans. Here are some rules of thumb I have found so far that apply to both cases:
- Tests for sections should start loose and gain detail later. For example, for a document on eggs and a section on frying, you would go from a test saying “This section should cover how to fry an egg” to one with “This section should cover how to best crack an egg, how to estimate the temperature of a pan, and how much oil to use”.
- Tests for the whole document can be left more open. Aim to have “something the document needs to say” covered in it. In the example above, you could have something similar to “This document should cover the basic methods of cooking an egg, including at the very least frying, poaching, and boiling”.
- Try to think in reverse. If your cat started deleting text from a section, you want the test to let you know something has been removed. Note: the cat might be you in a future rewriting of your document.
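The last rule can even be checked deterministically before any LLM gets involved: reduce a refined test to the literal points it demands and flag whichever ones disappear. A toy sketch (the points list is made up from the egg example above):

```python
# Points a refined section test demands, reduced to literal phrases.
REQUIRED_POINTS = ["crack an egg", "temperature of a pan", "how much oil"]

def missing_points(section_text: str) -> list[str]:
    """Return the required points a (possibly cat-edited) section no longer mentions."""
    lowered = section_text.lower()
    return [point for point in REQUIRED_POINTS if point not in lowered]
```

Anything fuzzier than literal phrase matching is where an LLM evaluator earns its keep.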
Trying the process for a README, with full LLM help
In my previous post I covered how I created a Chrome extension to annotate screenshots, Goita. I have rapidly added a lot of features, and keeping the README up to date has been a struggle. Could I leverage TDW to keep it more or less consistent?
This particular scenario is relatively simple:
- Documentation is just a README and a single JavaScript file constructing a dynamic help popup.
- The codebase is small, at less than 50k tokens.
- A few years ago I wrote a tool to convert full repositories to Markdown documents (motllo).
My first thought was to use Motllo to merge all the files in the repository, feed that to Gemini together with the README, and iterate.
On second thought, a lot of the files there are irrelevant. It is easier to list the ones that are useful for documenting:
- Main code and interaction files.
- Styling files.
- Test files.
I quickly assembled a very small Go tool I called Gotllo (Motllo + Go; for now it is available in this gist) that can create a single Markdown file with path and specific-file exclusions. For the Goita repository, it would be invoked like this:
gotllo -exclude=white.js,mocha.js,chai.js,show.js,testImage.js,tdw README.md src/*.js tests/*.js memes/*.js src/style.css tests/index.html
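Gotllo itself is written in Go and lives in the gist above; as a rough illustration of what it does, a stripped-down Python equivalent could look like this (exclusion by file name only, everything else simplified):

```python
import sys
from glob import glob
from pathlib import Path

def repo_to_markdown(patterns: list[str], exclude: set[str]) -> str:
    """Concatenate matching files into one Markdown document, one fenced block per file."""
    fence = "`" * 3  # built dynamically to avoid a literal fence inside this example
    chunks = []
    for pattern in patterns:
        for name in sorted(glob(pattern)):
            path = Path(name)
            if path.name in exclude or not path.is_file():
                continue
            chunks.append(f"## {path}\n\n{fence}\n{path.read_text()}\n{fence}\n")
    return "\n".join(chunks)

if __name__ == "__main__":
    # Mirrors the invocation above: include patterns as arguments, excludes hardcoded.
    print(repo_to_markdown(sys.argv[1:], {"white.js", "mocha.js", "chai.js", "show.js", "testImage.js"}))
```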
Then I added a main prompt for the README. I’m still not sure this is the best possible prompt; for now, it is a prompt.
This is a test driven writing README. Each section has some tests that this document needs to pass. You are the test evaluator.
Each comment with “Test:” before a section defines an individual test for that section.
For each section with a “Test:” comment, determine if the content of that section satisfies the test, given the provided code.
The provided code starts when this README ends, after an HTML comment line with the content “README ends here”. Anything after that is code, and should not be treated as part of this README, only as additional code context.
You should cross-reference this README file with the provided code snippets after it to ensure consistency and accuracy.
When identifying discrepancies, provide specific examples from the code and this document to justify your findings, and make sure they are real.
The test definitions are the minimum to pass; going into detail or adding more information is also a passing grade.
Important: Do not make bogus suggestions that are already present in the README.
Important: Do not be pedantic, and assume common sense from the reader and the writer. Minor deviations from the wording are fine, as is providing more details. You should aim for not reducing details unless extremely necessary.
Important: You should provide no suggestions about sections that have no test comment after their header.
Provide a summary report in a markdown format that looks like the following. Lines in C-style comments are directives for you.
// For the failing tests, write in the following format. If there are any failures:
// As a header:
# Failures
// As a subheader:
## header of the failing test section.
// The following, as a list
- Reason: REASON // Failure reason
- Fix by: CHANGES // List of specific changes needed
// For passing tests:
- Suggestions, if any; otherwise this must be omitted.
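For illustration, a run where a single test fails might produce a report along these lines (the failure itself is made up):

```
# Failures

## Functionality

- Reason: the README never mentions the “ellipse” tool, while shortcuts.js binds a key for it.
- Fix by: adding the ellipse tool and its shortcut to the functionality list.
```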
Then, I added tests to most sections, like the following one for the section about functionality, where I detail how to trigger each tool and what it does:
<!-- Test: This section should mention all functionality provided by the individual elements and what is available in shortcuts.js. Note that ctrl + wheel is explicitly stated here for resizing images, as well as ctrl + . and , and /, and pasting via Cmd/Ctrl + v -->
As you can see, this is not ideal. Gemini kept having issues with the ctrl + wheel (and ctrl + whatever) shortcuts, regardless of what text was available in the README. On the flip side, if I removed any tool from the document, the test as “run” by Gemini would catch the missing functionality (for example, when I removed the shortcut for “ellipse”).
I think this is not there yet, but given the current evolution of LLMs, we might get there in a couple of days or weeks.
Conclusion
The concept of Test Driven Writing looks useful, particularly in scenarios where more than one person may edit the text, or when many revisions might happen. In those scenarios it can easily prevent content from being lost.
For relatively small contexts (like a document or a few documents), LLMs can already provide a great experience by evaluating document- or section-level tests, suggesting improvements, and highlighting where we humans might have missed the mark.
For larger contexts (like the README example above), it looks like there is some more work to do, but we are close to the point where checking the document against the code can be automated. Imagine a day when the documentation matches the code almost exactly!