Synthetic Coding LLMs
LLMs consume data during training. DeepSeek v3 used about 15 trillion tokens [1]. Future models may demand more tokens than exist. This article discusses ways to generate tokens from a programming language's definition rather than from human-written text.
Modeling a dataset and sampling from the model to generate an improved dataset permanently loses any information the model failed to capture. A Lower Approximator contains less entropy than its source: its limit object is a constant. An Upper Approximator contains more entropy than its source: its limit object is a uniformly random sequence. An approximator trained to convergence on a single datapoint is a constant, so models that clone fixed datasets are lower approximators. They are practical and predictable, but lack coverage; adding more data from the target distribution increases coverage. Alternatively, one can begin with the distribution that makes the fewest assumptions (uniformly random) and add assumptions until it becomes useful. This has coverage, but lacks predictability; adding more rules increases predictability.
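The two limit objects can be made concrete with a toy entropy calculation: a model collapsed onto a single datapoint (the lower-approximator limit) is a point mass with zero entropy, while uniformly random symbols (the upper limit) carry maximal entropy. A minimal sketch, assuming a 256-symbol alphabet for illustration:

```python
import math

def entropy(p):
    """Shannon entropy (bits) of a discrete distribution."""
    return -sum(x * math.log2(x) for x in p if x > 0)

VOCAB = 256  # illustrative alphabet size, e.g. raw bytes

# Lower-approximator limit: all probability mass on one outcome.
point_mass = [1.0] + [0.0] * (VOCAB - 1)

# Upper-approximator limit: every outcome equally likely.
uniform = [1.0 / VOCAB] * VOCAB

# The point mass has 0 bits of entropy; the uniform distribution
# has log2(256) = 8 bits per symbol. Every real dataset sits between.
```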
Rules from Programming Languages
A programming language is a set of strings together with an interpreter. Modeling the language requires modeling the set and the interpreter. The limit distribution of "all possible strings, equally likely" is too unwieldy to be practical, but a language can have a Grammar that acts as both generator and recognizer. Starting from the grammar, one can rejection-sample using any additional rule.
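The "grammar as generator, plus rejection sampling against a rule" idea can be sketched with a toy expression grammar. Everything here is an illustrative assumption: the grammar, the depth cutoff, and the evenness rule stand in for a real language and a real validator.

```python
import random

# Toy context-free grammar: nonterminals map to alternative right-hand sides.
GRAMMAR = {
    "expr": [["term"], ["term", "+", "expr"]],
    "term": [["num"], ["(", "expr", ")"]],
    "num":  [["0"], ["1"], ["2"]],
}

def sample(symbol="expr", depth=0, max_depth=8):
    """Expand a nonterminal into a random string of the language.

    Near the depth limit, always take the first (shortest) alternative
    so that generation terminates.
    """
    if symbol not in GRAMMAR:
        return symbol  # terminal
    alts = GRAMMAR[symbol]
    rhs = alts[0] if depth >= max_depth else random.choice(alts)
    return "".join(sample(s, depth + 1, max_depth) for s in rhs)

def accept(program):
    """A stand-in rejection rule: keep only programs that evaluate
    to an even number. eval is safe here on this closed grammar."""
    return eval(program) % 2 == 0

# Rejection-sample: draw from the grammar, keep strings passing the rule.
random.seed(0)
samples = [s for s in (sample() for _ in range(100)) if accept(s)]
```

Any of the validators below can be swapped in for `accept` without touching the generator.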
Validators associated with a programming language include:

* Grammar
* Type system
* Variable lifetimes
* Borrow checker
* Trace execution to observe unwanted behaviors
* Tests written for specific programs
* Heuristics that approximate any of the above
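One way to organize these validators is as a chain of predicates that every candidate program must pass. The checks below are toy stand-ins, not real tools: balanced parentheses substitute for a parser, and evaluation-without-errors substitutes for trace execution.

```python
def grammar_check(src: str) -> bool:
    """Stand-in for a real parser: accept only balanced parentheses."""
    depth = 0
    for ch in src:
        depth += {"(": 1, ")": -1}.get(ch, 0)
        if depth < 0:
            return False
    return depth == 0

def trace_check(src: str) -> bool:
    """Stand-in for tracing execution: run and reject on any error."""
    try:
        eval(src, {"__builtins__": {}})
        return True
    except Exception:
        return False

# A real pipeline would append a type checker, a lifetime analysis, etc.
VALIDATORS = [grammar_check, trace_check]

def valid(src: str) -> bool:
    return all(check(src) for check in VALIDATORS)

assert valid("(1+2)*3")
assert not valid("(1+2))")
```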
In principle, one could train a code-focused base LLM using only the grammar and constraints. This is interesting, but such a model would not understand natural language, and real programs contain names and comments.
This project proposes to train a language model on a corpus as usual, but to make most of the training synthetic, generated and filtered by the validators above.
C is chosen as the first language because it’s a simple language with a long history. It should be easy to generate random valid C programs to test our approach.
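As a sketch of what generating random valid C programs might look like at its simplest, the snippet below fills a fixed template with random statements and, when a C compiler happens to be on PATH, keeps only candidates that pass `cc -fsyntax-only`. The template and statement pool are assumptions for illustration; a real generator would sample from the full C grammar.

```python
import os
import random
import shutil
import subprocess
import tempfile

# Illustrative statement pool; {n} is filled with a random constant.
STATEMENTS = [
    "x = x + {n};",
    "x = x * {n};",
    "if (x > {n}) {{ x = x - {n}; }}",
]

def random_program(rng):
    """Fill a fixed main() template with 1-4 random statements."""
    body = "\n    ".join(
        rng.choice(STATEMENTS).format(n=rng.randrange(1, 10))
        for _ in range(rng.randrange(1, 5))
    )
    return ("int main(void) {\n    int x = 0;\n    "
            + body + "\n    return x;\n}\n")

def compiles(src):
    """Validate with a real compiler, if one is on PATH; otherwise skip."""
    cc = shutil.which("cc") or shutil.which("gcc")
    if cc is None:
        return True  # no compiler available; this validator is a no-op
    with tempfile.NamedTemporaryFile(suffix=".c", mode="w",
                                     delete=False) as f:
        f.write(src)
        path = f.name
    try:
        result = subprocess.run([cc, "-fsyntax-only", path],
                                capture_output=True)
        return result.returncode == 0
    finally:
        os.remove(path)

rng = random.Random(0)
programs = [p for p in (random_program(rng) for _ in range(10))
            if compiles(p)]
```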
Our second test will likely be Bend or Kind by the Higher Order Company. They are relatively simple languages with small grammars, yet expressive: a positive result would mean we can train quality models for every other programming language in existence.
Join our Discord if you’re interested in following along.