A small controlled language for CodeAct
Why I picked a custom DSL over Python as the action language for an agent's REPL.
When I started building lash a few months ago, I planned to use Python as the language for CodeAct (Wang et al., Executable Code Actions Elicit Better LLM Agents, 2024; for industry framing, see Apple’s writeup and Cloudflare’s CodeMode docs). This article explains why I ended up using a custom DSL instead.
Python looked ideal. LLMs are great at it; it is expressive, compact, and interpreted. What’s not to like? It is also a common choice among the (admittedly few) other implementations (Khattab et al., DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines, 2023; more recently, Recursive Language Models, Zhang et al., 2025).
So I started there. I first ran an actual Python subprocess, then switched to an embedded CPython with PyO3 for better portability. So now, forgoing the horrors of packaging, the model can write Python, and it can even reference variables in a persistent REPL. That is great. But then you realize that the model can now WRITE PYTHON.
This is honestly great for a use case such as a local coding agent, which, let’s be honest, we all run in dangerous mode anyway. But for lash I wanted control over what the agent could do and where it could run. So basically, your only choice is to sandbox it. And that is what most implementations do.
So now you have this clunky setup where you have an agent in a sandbox running Python that can do whatever, but it is in the sandbox so it can’t damage anything. It can still freely confuse itself when it gets an angry error trying to look around with pathlib. And all of it is inefficient: all we usually want is for the LLM to string a few tool calls together with some control flow. Running Python in a sandbox to get that is overkill. So what is the alternative?
Well, that is where I came across Monty. I thought this was it: no more hacky Python installations, no need for sandboxes: just a crate and a subset of Python that I can fully control. And most of that is true — it is a better direction than the previous setups.
But that is where the next wrinkle appeared. The agents know Python very well. So what happens when you want them to, for example, do some stats? Or if you want them to do a web search? Well, unsurprisingly, they gravitate towards parts of the standard library, and even third-party libraries, that Monty doesn’t support (and probably shouldn’t) and that you definitely don’t want running on the host.
So I found myself negative-space prompting: “You ONLY have the standard library + asyncio. Do not use pandas. Use the web tool, not requests.” You get the idea.
What’s the alternative? The one thing you are not supposed to do — a DSL. That is how lashlang was born.
The case for a DSL
No more negative-space prompting: the agent can only do what the language lets it do, so there is nothing to forbid. You shape the language around the use case: small, portable, with agent primitives as first-class statement forms rather than retrofitted library calls. And it is customizable on the fly: if you don’t tell the model about a part of the stdlib, it doesn’t exist.
And the runtime is yours to optimize. A small in-process interpreter — around 22 µs to parse, compile, and execute a small script, ~5 µs once compiled — and cheap snapshotting. These are properties you design in when you own the language, and properties an embedded Python won’t give you cleanly. The agent workload exercises every one of them repeatedly.
The trade-off: the language docs have to ride in the system prompt every turn — about 2.5k tokens for lashlang’s execution section — and the model has to write the DSL correctly from in-context learning alone. In practice, medium-to-large models like Sonnet, Opus, and GPT 5.4+ do not seem to struggle with lashlang at all; benchmarks to follow.
Designing lashlang
Assume the case is made. How do we do it?
Well, it needs to be small. It needs to be familiar enough syntax-wise that the models don’t struggle. But also not so familiar that they start thinking it is something else and hallucinating. Other than that, we need all the normal basics: variables, control flow, loops, etc. A thin stdlib with math and text-parsing primitives. And some simple async / parallel functionality is key. Oh, and types! So we can give the model nice types for the tools and variables that it can reason with. What about those “agent primitives”?
Concurrent tool calls and a typed final value, in five lines:
parallel {
  files = call glob { pattern: "src/**/*.rs" }
  cargo = call read_file { path: "Cargo.toml" }
}
submit { file_count: len(files), cargo_chars: len(cargo) }
The first agent primitive is submit: how the agent ends a turn. Make it typed, and the schema for a turn’s output becomes part of the contract. A malformed submission turns into a typed error the agent can retry against, not free-text confusion. print is the same shape running the other direction: the agent uses it to commit only the slices of data it actually needs back into its own context.
doc = (call read_file { path: "notes.md" })?
print { intro: slice(doc, 0, 400) }
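Host-side, the typed-submit contract reduces to schema checking plus structured errors. Here is a minimal sketch in Python of what a host might do; `TypedError` and `check_submission` are invented names for illustration, not lashlang’s actual runtime API:

```python
# Hypothetical host-side sketch: validate a submit payload against the
# turn's declared output schema, and turn mismatches into structured
# errors the agent can retry against, rather than free-text confusion.
from dataclasses import dataclass

@dataclass
class TypedError:
    field: str
    expected: str
    got: str

    def render(self) -> str:
        return f"submit error: field '{self.field}' expected {self.expected}, got {self.got}"

def check_submission(payload: dict, schema: dict) -> list[TypedError]:
    errors = []
    for field, expected_type in schema.items():
        if field not in payload:
            errors.append(TypedError(field, expected_type.__name__, "missing"))
        elif not isinstance(payload[field], expected_type):
            errors.append(TypedError(field, expected_type.__name__, type(payload[field]).__name__))
    return errors

schema = {"file_count": int, "cargo_chars": int}
assert check_submission({"file_count": 3, "cargo_chars": 120}, schema) == []
# a malformed submission yields typed errors, not a stack trace
for err in check_submission({"file_count": "3"}, schema):
    print(err.render())
```

The point is the shape of the failure: the agent gets back a field name, an expected type, and what it actually sent, which is something it can mechanically correct on the next attempt.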
The host has primitives too. Projected bindings let you expose a variable that only materializes when the agent references it, which is useful when an upstream fetch is expensive or variables are large. Typed images carry multimodal payloads as first-class values, which the host can disable on text-only models without the agent’s code changing.
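The projected-binding idea is just deferred materialization. A minimal sketch in Python terms, assuming a host that registers a thunk per binding (the `Projected` class is illustrative, not lashlang’s API):

```python
# Sketch of a projected binding: the host registers a thunk, and the
# value only materializes the first time the agent's code reads it.
class Projected:
    def __init__(self, fetch):
        self._fetch = fetch          # expensive producer, e.g. a network fetch
        self._value = None
        self.materialized = False

    def get(self):
        if not self.materialized:    # pay the cost only on first reference
            self._value = self._fetch()
            self.materialized = True
        return self._value           # cached on later references

calls = []
doc = Projected(lambda: calls.append("fetch") or "big payload")
assert doc.materialized is False     # binding exists, nothing fetched yet
assert doc.get() == "big payload"    # first reference triggers the fetch
doc.get()                            # second reference hits the cache
assert calls == ["fetch"]
```

If the agent’s code never touches the binding, the upstream fetch never happens, which is the whole point when variables are large or expensive to produce.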
Could you bolt these onto Python? parallel could be asyncio.gather, typed submit could be a Pydantic-validated return, projected bindings could be a custom __repr__. There are two ways to get there with Python.
The Monty route — a Python subset with a runtime you own — lets you define the semantics, but signs you up to mirror Python forever. Every library the model has seen, every standard-library edge case, every version: the facade has to keep matching what the model expects, or you slip into negative-space prompting by another door.
Hacking the primitives into stock Python is worse. The model sees familiar syntax, expects the familiar behavior, and the runtime quietly does something else. Foreign syntax tells the model to read the docs; almost-Python tells the model to guess. The mix of familiar and unfamiliar costs more than a clean DSL does.
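For concreteness, here is a hedged sketch of that bolt-on route in stock Python, mirroring the five-line lashlang example; `glob_tool` and `read_file_tool` are invented stand-ins for host tools, stubbed so the snippet runs on its own:

```python
# Bolt-on sketch: asyncio.gather stands in for `parallel`, and a
# manually checked dict stands in for a typed `submit`.
import asyncio

async def glob_tool(pattern: str) -> list[str]:
    return ["src/main.rs", "src/lib.rs"]      # stubbed tool result

async def read_file_tool(path: str) -> str:
    return "[package]\nname = \"demo\"\n"     # stubbed tool result

async def turn() -> dict:
    # the "parallel" block becomes a gather the model must remember to write
    files, cargo = await asyncio.gather(
        glob_tool("src/**/*.rs"),
        read_file_tool("Cargo.toml"),
    )
    result = {"file_count": len(files), "cargo_chars": len(cargo)}
    # the "typed submit" becomes an ad-hoc check the model never sees
    assert all(isinstance(v, int) for v in result.values())
    return result

print(asyncio.run(turn()))
```

Nothing in this syntax signals that the tool calls are mediated by a host or that the return value is a contract; it reads as ordinary Python, so the model will treat it as ordinary Python — which is exactly the guessing problem.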
For the full grammar, runtime, and benchmark numbers see the lashlang post.