CoT-ICL Lab

A synthetic framework for studying chain-of-thought learning from in-context demonstrations.

The CoT-ICL Lab pipeline. $K$ in-context examples are sampled from the generator (the causal DAG $G$, the embedding matrix $E$, and the token processor $H$, a small MLP) and fed to a transformer. Its attention map sharpens from a diffuse pattern to a structured one that recovers the directed edges of $G$ (the same graph drawn in the generator): this is how the model learns the causal structure in context before predicting the held-out answer.

Chain-of-thought (CoT) prompting works, but it is hard to study why in the wild: real datasets tangle together reasoning structure, surface language, task difficulty, and pretraining priors. CoT-ICL Lab is a synthetic test bed that pulls those threads apart, so we can pose precise questions about how transformers learn to reason from in-context demonstrations, and run controlled experiments to answer them.

This page is an interactive tour of the framework behind two papers. The first [1] introduces the lab and asks: can transformers learn chain-of-thought from in-context examples, and what governs it? The second [2] uses the lab to surface a counterintuitive result: meta-training on too many CoT examples can quietly hurt downstream reasoning. It then proposes a fix, the CoT-Recipe.

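The core idea is a clean separation of concerns. Every in-context example is generated in two stages: first, a causal DAG $G$ fixes which earlier tokens each chain token depends on; second, a token processor $H$ computes the value of each chain token from its parents.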

Because $G$ and $H$ are independent knobs, the lab gives fine-grained control over the difficulty of in-context examples without changing the surface task. The model never sees $G$, $H$, or the embedding table behind the tokens; it must infer them from the demonstrations alone.

It helps to keep one concrete example in mind as we go, at two levels of abstraction. For training transformers from scratch, tokens are opaque integer ids and $H$ is a small random network over a fixed embedding matrix — a maximally controllable setup whose difficulty is set by the vocabulary size and activation. To check that the findings carry over to real models, the same DAG is re-rendered with made-up words and a human-readable string rule, so a pretrained LLM (Qwen2.5) can be asked to solve it. It is one and the same construction, shown two ways, and we build up both representations below.

1 The causal structure

Start with the skeleton. Given $N$ input tokens and a chain of length $C$, the DAG has $N + C$ nodes. The first $N$ are inputs $x_1,\dots,x_N$; the next $C$ are the chain ("thought") tokens $y_1,\dots,y_C$. Each chain token $y_i$ draws $M = n_{\text{parents}}$ distinct parents from everything generated so far (the inputs and all earlier chain tokens):

$$\text{pa}(y_i) \subseteq \{x_1,\dots,x_N,\; y_1,\dots,y_{i-1}\}, \qquad |\text{pa}(y_i)| = M.$$

Because parents always come from earlier nodes, dependencies point strictly left-to-right and the graph stays acyclic. Even a single random graph already produces thoughts that range from shallow (built from one or two inputs) to deep (integrating most of the graph through several hops). This causal structure is a faithful port of RandomDAG.generate_adj_list; rather than show it in isolation, we let you reshape it directly in the next figure, which realizes the DAG with embeddings and a token processor and exposes sliders for $N$, $M$, and $C$.
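The sampling rule above can be sketched in a few lines of Python. This is an illustration only, not the repository's RandomDAG.generate_adj_list: the function name `sample_parent_lists`, the 0-based node numbering, and the fallback used when fewer than $M$ candidates exist are all assumptions made here.

```python
import random


def sample_parent_lists(n_inputs, n_parents, chain_length, seed=0):
    """Return one sorted parent list per chain token (hypothetical helper).

    Nodes 0..n_inputs-1 are the inputs x_1..x_N; nodes n_inputs.. onward
    are the chain tokens y_1..y_C, in generation order.
    """
    rng = random.Random(seed)
    parent_lists = []
    for i in range(chain_length):
        # Candidates: every input plus every chain token generated so far.
        candidates = list(range(n_inputs + i))
        # Assumption: if fewer than n_parents candidates exist, use them all.
        k = min(n_parents, len(candidates))
        parent_lists.append(sorted(rng.sample(candidates, k)))
    return parent_lists


parents = sample_parent_lists(n_inputs=4, n_parents=2, chain_length=3)
for i, pa in enumerate(parents):
    print(f"y_{i + 1} <- nodes {pa}")
```

Because each chain token only samples from nodes with a smaller index, every edge points forward and no cycle can form, whatever the seed.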

2 Realizing the tokens

The DAG says who depends on whom. The token processor $H$ says what value each chain token takes. Let us make this concrete: first as the abstract tokens a model trains on, then as the same chain rendered in plain words.

2.1 Abstract tokens: what the model trains on

In the tokenized pipeline, tokens are just integer ids with no inherent meaning. To produce chain token $y_i$, the lab looks up the embeddings of its parents, pushes each through a small random fully-connected network $H$, averages, applies an activation $\sigma$, and projects back to the vocabulary by taking an inner product with the (tied) embedding matrix:

$$h_i = \sigma\!\Big(\tfrac{1}{M}\textstyle\sum_{p \in \text{pa}(y_i)} H\big(E[t_p]\big)\Big), \qquad y_i = \arg\max_{v}\; \langle h_i,\; E_v\rangle.$$

The result is deliberately opaque: the produced id carries no surface meaning. A model only ever sees token sequences, never the machinery behind them, so to predict the next token it must jointly infer three hidden pieces of ground truth from the in-context demonstrations: the causal structure $G$, the embedding matrix $E$, and the transformation $H$. The widget below makes these explicit: embeddings propagate along the DAG edges ($G$), are transformed and combined ($H$), and the resulting vector is decoded back to a token through $E^{\top}$. Difficulty is tuned by the vocabulary size and the activation, as quantified by the paper's TokenCoverage metric, which measures how often chain tokens collide. Step through one example below, and reshape the DAG itself with the $N$, $M$, $C$ sliders.

Figure 2.1. Abstract token construction, porting get_output_token. The embedding matrix $E$ and the FCN $H$ are fixed Gaussians; only the inputs change per example. The model sees only token sequences and must infer $G$, $E$ and $H$ together; the n_inputs, n_parents, and chain_length sliders reshape the underlying DAG ($G$). This is the data the custom Llama models in the paper actually meta-train on.
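For readers who prefer code to widgets, here is a minimal NumPy sketch of the formula for $h_i$ and $y_i$ in this section. It is not the repository's get_output_token: the helper name `next_chain_token`, the two-layer shape of $H$, the weight scaling, and the toy sizes are assumptions chosen only to keep the example small.

```python
import numpy as np


def leaky_relu(z, slope=0.01):
    return np.where(z > 0, z, slope * z)


def next_chain_token(parent_ids, E, W1, W2, activation=leaky_relu):
    """h = sigma(mean_p H(E[t_p])), then y = argmax_v <h, E_v>."""
    parent_emb = E[parent_ids]                       # (M, d) parent embeddings
    # Assumption: H is a two-layer network with fixed Gaussian weights.
    transformed = activation(parent_emb @ W1) @ W2   # (M, d)
    h = activation(transformed.mean(axis=0))         # average, then sigma
    logits = E @ h                                   # tied embeddings: (V,)
    return int(np.argmax(logits))


rng = np.random.default_rng(0)
V, d = 64, 16                                        # toy vocabulary and width
E = rng.standard_normal((V, d))
W1 = rng.standard_normal((d, d)) / np.sqrt(d)
W2 = rng.standard_normal((d, d)) / np.sqrt(d)

tokens = rng.integers(0, V, size=4).tolist()         # N = 4 input token ids
parent_lists = [[0, 2], [1, 4], [3, 5]]              # example DAG, M = 2, C = 3
for pa in parent_lists:
    parent_ids = [tokens[p] for p in pa]
    tokens.append(next_chain_token(parent_ids, E, W1, W2))
print(tokens)                                        # 4 inputs then 3 chain tokens
```

Only `tokens` would ever reach the model; `E`, `W1`, `W2`, and `parent_lists` play the role of the hidden ground truth it has to infer.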

2.2 The same chain, in words: what a pretrained LLM can read

To validate findings on real models, the lab re-instantiates the very same DAG with made-up words — random strings of letters. Each chain word is built deterministically: take the second half of every parent word, concatenate, then Caesar-shift every character by a fixed char_offset.

$$\text{chain word} = \text{shift}_{\text{offset}}\big(\;\text{concat}_{p \in \text{pa}(y_i)} \text{secondHalf}(w_p)\;\big).$$

Now the chain is human-readable, so a pretrained model (the paper uses Qwen2.5) can infer the transformation rule and apply it. One subtlety: this word version sorts each parent list, so the rule is order-independent. The right panel shows the actual chat-template framing.

Figure 2.2. The word construction, porting create_icl_example and the chat framing in SymbolicDataModule. It is exactly the chain from §2.1, now legible.
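The string rule is simple enough to sketch directly. The snippet below is an illustration rather than the repository's create_icl_example: the helper names are hypothetical, and it assumes lowercase ASCII words, a shift that wraps around after "z", and that the middle character of an odd-length word belongs to the second half.

```python
def second_half(word):
    # Assumption: for odd lengths, the middle character goes to the second half.
    return word[len(word) // 2:]


def caesar_shift(text, offset):
    # Assumption: lowercase ASCII letters only, wrapping around after 'z'.
    return "".join(
        chr((ord(c) - ord("a") + offset) % 26 + ord("a")) for c in text
    )


def chain_word(words, parent_ids, char_offset):
    """Build one chain word from the words stored at its parent nodes."""
    ordered = sorted(parent_ids)  # sorted parents make the rule order-independent
    joined = "".join(second_half(words[p]) for p in ordered)
    return caesar_shift(joined, char_offset)


words = {0: "abcd", 1: "wxyz"}
print(chain_word(words, [1, 0], char_offset=1))  # prints "deza"
```

Here "cd" and "yz" are concatenated into "cdyz" and each letter moves one step forward, with "z" wrapping to "a".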

3 Framing the prompt

An in-context prompt is a sequence of $K$ such examples. Each example can be shown with its chain of thought or as a direct answer, and the prompt can express this in two ways. One framing wraps every field in dedicated integer special tokens; the other uses a chat template with think / final answer markers and a \boxed{} answer at the end.

The hybrid_special_token strategy decides per example, by a coin flip with probability cot_example_prob, whether to include the chain. Drag the slider to change the mix, then flip the framing to watch that very same mixture appear in either representation, unchanged.

Figure 3. A single $K$-example prompt. The counters mirror num_cot_examples / num_standard_examples. At cot_example_prob = 1 every example thinks; at $0$ none do. Why mix at all? At test time CoT may be scarce, so a model that has only ever seen fully-worked examples can struggle to answer directly, which is exactly the issue that §4 goes on to examine.
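The per-example coin flip can be sketched as follows. This is a simplified stand-in for the hybrid_special_token strategy, not its actual implementation: `build_prompt` and the dictionary layout are hypothetical, and the sketch assumes the last chain token is the answer, so a direct-answer example keeps only that token.

```python
import random


def build_prompt(examples, cot_example_prob, seed=0):
    """Decide per example whether to show its chain (hypothetical helper)."""
    rng = random.Random(seed)
    prompt, num_cot_examples, num_standard_examples = [], 0, 0
    for inputs, chain in examples:
        if rng.random() < cot_example_prob:
            # CoT example: show every chain token, answer last.
            prompt.append({"inputs": inputs, "outputs": chain})
            num_cot_examples += 1
        else:
            # Standard example: show the answer only (assumed to be chain[-1]).
            prompt.append({"inputs": inputs, "outputs": chain[-1:]})
            num_standard_examples += 1
    return prompt, num_cot_examples, num_standard_examples


# K = 8 toy examples, each with 4 inputs and a chain of length 3.
examples = [([k, k + 1, k + 2, k + 3], [10 + k, 20 + k, 30 + k]) for k in range(8)]
prompt, n_cot, n_std = build_prompt(examples, cot_example_prob=0.5)
print(n_cot, n_std)
```

Since `rng.random()` returns a value in [0, 1), a probability of 1 always includes the chain and a probability of 0 never does, matching the two extremes described in the caption.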

4 To think or not to think: the CoT-Recipe

Here is the surprise. Intuitively, more chain-of-thought supervision should only help. But when you meta-train across many tasks, flooding the prompts with CoT examples can degrade performance on novel tasks, especially when little or no CoT is available in-context at test time. Thinking, it turns out, has a hidden cost.

The fix is not to abandon CoT but to ration it. The CoT-Recipe assigns each prompt a cot_example_prob from its index with a power law:

$$p_{\text{CoT}}(\text{idx}) = \min\!\big(\text{scale}\cdot x^{\alpha} + p_0,\; p_1\big), \qquad x = \frac{\text{idx}}{N_{\text{prompts}}}.$$

One subtlety matters for reading this correctly: after these probabilities are assigned, all prompts are shuffled before training. So despite the index in the formula, there is no early-to-late curriculum that the model moves through. What the recipe actually controls is the overall fraction of CoT supervision in the dataset. With $\text{scale} = 1$, $p_0 = 0$, and $p_1 = 1$, that fraction averages $\tfrac{1}{\alpha + 1}$: at $\alpha = 0$ every prompt is CoT (the "always think" baseline), while large $\alpha$ leaves only a sprinkling, all in random order. Build and shuffle the dataset below to see how $\alpha$ reshapes the mix.
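To see where the average comes from, take the special case $\text{scale} = 1$, $p_0 = 0$, $p_1 = 1$ (the values in the recipe card of §5). Since $x \le 1$, the cap never binds, and for a large number of prompts the mean over indices is well approximated by an integral:

$$\frac{1}{N_{\text{prompts}}}\sum_{\text{idx}} p_{\text{CoT}}(\text{idx}) \;=\; \frac{1}{N_{\text{prompts}}}\sum_{\text{idx}} \Big(\frac{\text{idx}}{N_{\text{prompts}}}\Big)^{\alpha} \;\approx\; \int_0^1 x^{\alpha}\,dx \;=\; \frac{1}{\alpha + 1}.$$

For example, $\alpha = 2$ gives an average CoT probability of about $\tfrac{1}{3}$. For other choices of scale, $p_0$, or $p_1$ the average shifts accordingly and this closed form no longer applies as written.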

Figure 4. The CoT-Recipe, porting PowerLawRecipe. Each prompt's CoT probability is assigned from its index (the curve), then all prompts are shuffled (the second row), so $\alpha$ controls the overall fraction of CoT supervision rather than a training-time schedule. The same recipe governs both the tokenized and the word-based data.
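A small numerical check makes the same point as the figure. The sketch below is not the repository's PowerLawRecipe: the function `power_law_probs` is hypothetical, and it assumes indices run from 0 to $N_{\text{prompts}} - 1$.

```python
import random


def power_law_probs(n_prompts, alpha, scale=1.0, initial_prob=0.0, final_prob=1.0):
    """p(idx) = min(scale * x**alpha + initial_prob, final_prob), x = idx / n_prompts."""
    probs = []
    for idx in range(n_prompts):  # assumption: 0-based prompt indices
        x = idx / n_prompts
        probs.append(min(scale * x ** alpha + initial_prob, final_prob))
    return probs


probs = power_law_probs(n_prompts=10_000, alpha=2)
random.Random(0).shuffle(probs)  # shuffling removes any index-based ordering
mean_prob = sum(probs) / len(probs)
print(f"mean CoT probability: {mean_prob:.3f}")  # close to 1 / (alpha + 1)
```

Shuffling changes the order in which prompts are seen but not the mean, which is why $\alpha$ acts as a knob on the overall CoT fraction rather than as a schedule.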

How much does it matter? In the controlled synthetic setting, careful modulation via the CoT-Recipe raises transformer accuracy on novel tasks by up to 300%, even when there are no CoT examples available in-context. The effect transfers: on symbolic reasoning with pretrained Qwen2.5 models, the recipe yields gains of up to 130%. The lesson is not "don't think," but "don't over-train on thinking."

5 Using the framework

Everything above is driven by a single Args dataclass. A dataset is generated lazily, on the fly, so training never hits an I/O bottleneck:

from tokenized_cot_icl.core.args import Args
from tokenized_cot_icl.core.data import TokenizedDataset

args = Args(
    vocab_size=1024,          # size of the token vocabulary
    n_inputs=4,               # input tokens per example (N)
    n_parents=2,              # parents per chain token (M)
    chain_length=3,           # length of the reasoning chain (C)
    n_examples=1,             # examples per training sequence
    enable_cot=True,          # include the chain-of-thought tokens
    prompt_strategy="cot",    # how each prompt is framed
    activation="leaky_relu",  # nonlinearity inside H
    n_tasks=10,               # number of distinct tasks
)
dataset = TokenizedDataset(args=args)
print(dataset[0])

Experiments are collated into a TASK_CARD and launched with torch DDP; the CoT-Recipe sweep is just another card:

cot_example_prob_recipe_info = {
    "type": "power_law",   # recipe family
    "initial_prob": 0.0,   # p0: floor probability
    "final_prob": 1.0,     # p1: cap probability
    "alpha": 2,            # exponent; CoT fraction ~ 1/(alpha+1)
    "scale": 1.0,          # multiplier on x^alpha
}

Trained checkpoints can be evaluated with the HuggingFace generation loop during training, or at scale with vLLM / SGLang, including the inference-time technique of dropping CoT spans from in-context examples to probe robustness. See the repository for the full task cards, model registry, and evaluation harness.

Citations

@inproceedings{Kothapalli2025CoTICLLAB,
  title={CoT-ICL Lab: A Synthetic Framework for Studying Chain-of-Thought
         Learning from In-Context Demonstrations},
  author={Kothapalli, Vignesh and Firooz, Hamed and Sanjabi, Maziar},
  booktitle={Annual Meeting of the Association for Computational Linguistics},
  year={2025}
}

@article{kothapalli2025think,
  title={To Think or Not to Think: The Hidden Cost of Meta-Training
         with Excessive CoT Examples},
  author={Kothapalli, Vignesh and Fatahibaarzi, Ata and Firooz, Hamed
          and Sanjabi, Maziar},
  journal={arXiv preprint arXiv:2512.05318},
  year={2025}
}