Meet OpenMythos: An Open-Source PyTorch Reconstruction of Claude Mythos Where 770M Parameters Match a 1.3B Transformer

Anthropic has never published a technical paper on Claude Mythos. That has not stopped the research community from theorizing. A new open-source project called OpenMythos, released on GitHub by K Gomez, attempts something ambitious: a first-principles theoretical reconstruction of what the Claude Mythos architecture might actually look like, built entirely in PyTorch and grounded in peer-reviewed research.

The project is not a leaked model, a fine-tune, or a distillation. It is a hypothesis expressed in code, and the hypothesis is specific enough to be falsifiable, which is what makes it interesting.

The Key Claim: Claude Mythos Is a Recurrent Depth Transformer

OpenMythos proposes that Claude Mythos belongs to a class of architectures called recurrent depth transformers (RDTs), also referred to in the literature as looped transformers. This design differs significantly from the standard transformer stack.

In a traditional transformer such as GPT, LLaMA, or Mistral, the model passes the input through a series of distinct layers, one after another, each with its own independent weights. More capability generally means more layers and more parameters. In a recurrent depth transformer, a fixed set of weights is applied repeatedly over T loop steps within a single forward pass. The same weights are reused many times: the effective depth of reasoning depends not on the number of stored parameters, but on the number of iterations run at inference time.

Think of it less like reading a book and more like refining a draft: the model returns to the same computational block over and over again, improving its internal representation with each pass.
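The contrast can be sketched in a few lines of PyTorch. This is an illustrative toy (a plain linear layer stands in for a full transformer block), not the OpenMythos source:

```python
import torch
import torch.nn as nn

# Toy illustration: one weight-tied block applied T times, versus a stack
# of T independent layers. A plain Linear stands in for a transformer block.
class LoopedBlock(nn.Module):
    def __init__(self, d_model, T=16):
        super().__init__()
        self.block = nn.Linear(d_model, d_model)  # the ONE set of reused weights
        self.T = T

    def forward(self, h):
        for _ in range(self.T):            # depth = iteration count, not layer count
            h = torch.tanh(self.block(h))  # same weights at every step
        return h

looped = LoopedBlock(d_model=64, T=16)
stacked = nn.Sequential(*[nn.Linear(64, 64) for _ in range(16)])  # unique weights per layer

n_looped = sum(p.numel() for p in looped.parameters())
n_stacked = sum(p.numel() for p in stacked.parameters())
print(n_looped, n_stacked)  # 4160 vs 66560: same depth, 1/16 of the stored weights
```

The looped model reaches the same computational depth with a sixteenth of the stored parameters, which is the core of the RDT efficiency argument.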

How Is the Architecture Organized?

OpenMythos builds this as a three-part structure: Prelude → Recurrent Block → Coda. The Prelude and Coda are standard transformer layers that run exactly once. The recurrent block is the computational core, iterated up to T = 16 times.

At each loop step t, the hidden state is updated using the following rule:

h_{t+1} = A·h_t + B·e + Transformer(h_t, e)

Here, h_t is the hidden state after loop iteration t, and e is the encoded input from the prelude, which is re-injected at every step. The re-injection is intentional: without it, the hidden state would drift away from the original input signal over deep loops. The learned matrices A and B control how much of the previous hidden state and the encoded input persists at each step.
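The update rule can be read directly into PyTorch. This is a minimal sketch of the equation, with illustrative names; a small MLP stands in for the inner transformer block, and the initial hidden state is assumed to be zero:

```python
import torch
import torch.nn as nn

# Minimal sketch of h_{t+1} = A·h_t + B·e + Transformer(h_t, e).
# Module names are illustrative; a small MLP stands in for the inner
# transformer block, and the initial hidden state is assumed to be zero.
class RecurrentCore(nn.Module):
    def __init__(self, d):
        super().__init__()
        self.A = nn.Linear(d, d, bias=False)  # carries the previous hidden state
        self.B = nn.Linear(d, d, bias=False)  # re-injects the prelude's encoding e
        self.f = nn.Sequential(nn.Linear(2 * d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))

    def forward(self, e, T=16):
        h = torch.zeros_like(e)
        for _ in range(T):  # e is re-injected at every step so h cannot drift from it
            h = self.A(h) + self.B(e) + self.f(torch.cat([h, e], dim=-1))
        return h

torch.manual_seed(0)
core = RecurrentCore(d=32)
e = torch.randn(2, 8, 32)  # (batch, seq, d): the encoding produced by the prelude
h = core(e, T=16)
```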

The FFN inside the recurrent block is not a standard feedforward layer. OpenMythos replaces it with a mixture-of-experts (MoE) layer following the design introduced in DeepSeekMoE: a large pool of fine-grained routed experts, of which a sparse top-K subset is activated per token, alongside a small set of always-active shared experts that absorb patterns common across domains. Crucially, the router selects different expert subsets at each loop depth, which means each iteration is computationally distinct despite sharing the same base weights. The MoE provides breadth of knowledge; the loop provides depth of reasoning.
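A compact sketch of a DeepSeekMoE-style layer as described above: fine-grained routed experts with top-K selection, plus always-active shared experts. All sizes and names here are illustrative:

```python
import torch
import torch.nn as nn

# Sketch of a DeepSeekMoE-style FFN: many small routed experts, a sparse
# top-k subset active per token, plus always-on shared experts.
# All sizes and names are illustrative.
class MoEFFN(nn.Module):
    def __init__(self, d, n_routed=16, n_shared=2, k=2):
        super().__init__()
        expert = lambda: nn.Sequential(nn.Linear(d, 2 * d), nn.GELU(), nn.Linear(2 * d, d))
        self.routed = nn.ModuleList(expert() for _ in range(n_routed))
        self.shared = nn.ModuleList(expert() for _ in range(n_shared))
        self.router = nn.Linear(d, n_routed, bias=False)
        self.k = k

    def forward(self, x):
        tokens = x.reshape(-1, x.shape[-1])       # flatten to (n_tokens, d)
        scores = self.router(tokens).softmax(dim=-1)
        topv, topi = scores.topk(self.k, dim=-1)  # sparse expert choice per token
        out = sum(s(tokens) for s in self.shared) # shared experts: always active
        for j in range(self.k):
            for e_idx in topi[:, j].unique():
                mask = topi[:, j] == e_idx        # tokens routed to this expert
                out[mask] = out[mask] + topv[mask, j, None] * self.routed[int(e_idx)](tokens[mask])
        return out.reshape_as(x)

torch.manual_seed(0)
moe = MoEFFN(d=32)
y = moe(torch.randn(4, 8, 32))
```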

Standard attention is replaced with the Multi-head Latent Attention (MLA) of DeepSeek-V2, which caches low-rank latent KV tensors instead of full key/value tensors, yielding a 10–20× reduction in KV-cache memory at production scale.
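The saving is easy to see with back-of-the-envelope KV-cache arithmetic. The sizes below are illustrative, not OpenMythos's actual configuration:

```python
# Back-of-the-envelope KV-cache arithmetic for latent-KV attention.
# The sizes are illustrative, not OpenMythos's actual configuration.
n_layers, n_heads, d_head = 32, 32, 128
d_latent = 512                                    # low-rank latent replacing K and V

per_token_full = n_layers * n_heads * d_head * 2  # full K and V caches per layer
per_token_latent = n_layers * d_latent            # one latent vector per layer
reduction = per_token_full / per_token_latent
print(reduction)  # 16.0x, squarely in the claimed 10-20x range
```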

Reasoning in Continuous Latent Space

One of the most important properties of this architecture is that reasoning happens entirely in continuous latent space. There is no intermediate symbolic emission between loop steps: the model does not produce text mid-thought and then re-read it. This is structurally different from chain-of-thought prompting, where reasoning is externalized into token sequences, and it has been formally analyzed in both Saunshi et al. (2025) and Coconut (2024).

Saunshi et al. (2025) show formally that each loop iteration in an RDT is functionally equivalent to one chain-of-thought step, but operates on real-valued vectors rather than discrete tokens. A continuous latent thought can also encode several alternative next steps at once, allowing something akin to a breadth-first search in thought space within a single forward pass.

This also explains the extrapolation advantage. A standard transformer trained on 5-hop reasoning chains fails when tested on 10-hop chains at inference time; it has no mechanism to extend its depth beyond what it saw in training. A recurrent depth transformer handles this naturally: running more loops at inference time extends the reasoning chain without any retraining. Harder problems receive more computation; simpler ones exit early.
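The mechanism is simple to demonstrate: the same weight-tied block can be unrolled deeper at test time than it ever was in training. A toy sketch, not from the OpenMythos source:

```python
import torch
import torch.nn as nn

# Toy sketch: the same weight-tied block unrolled deeper at test time than
# it ever was in training. No retraining, no new parameters.
torch.manual_seed(0)
block = nn.Linear(16, 16)

def reason(h, T):
    for _ in range(T):
        h = torch.tanh(block(h))  # identical weights at every depth
    return h

h0 = torch.randn(1, 16)
out_5hop = reason(h0, T=5)    # the depth seen during training
out_10hop = reason(h0, T=10)  # a harder problem: just loop longer
```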

Solving the Stability Problem

Training looped models has historically been fragile. The hidden state h_t can grow without bound across iterations, a failure mode called residual explosion. OpenMythos handles this with a linear time-invariant (LTI) injection constraint borrowed from the Parky architecture (Prairie et al., 2026): the spectral radius of A, denoted ρ(A), is forced below 1 by construction, ensuring stability regardless of learning rate or gradient noise.
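One way to force ρ(A) < 1 by construction is to rescale A by its largest singular value, which upper-bounds the spectral radius. This is an illustrative mechanism, not necessarily the exact one Parky or OpenMythos uses:

```python
import torch
import torch.nn as nn

# One way to force rho(A) < 1 by construction: rescale A by its largest
# singular value, which upper-bounds the spectral radius. Illustrative,
# not necessarily the exact mechanism used by OpenMythos.
class ContractiveLinear(nn.Module):
    def __init__(self, d, max_gain=0.95):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(d, d) / d ** 0.5)
        self.max_gain = max_gain

    def forward(self, h):
        sigma = torch.linalg.matrix_norm(self.weight, ord=2)  # largest singular value
        A = self.weight * (self.max_gain / sigma.clamp(min=self.max_gain))
        return h @ A.T  # ||A h|| <= max_gain * ||h||, so iteration cannot explode

torch.manual_seed(0)
lin = ContractiveLinear(d=32)
h = torch.randn(4, 32)
for _ in range(100):  # even 100 loop steps stay bounded: no residual explosion
    h = lin(h)
```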

There is also a second failure mode at the other end: beyond a certain loop depth, extra iterations degrade predictions as the hidden state drifts past the solution into noise. This is the "overthinking" problem. Adaptive Computation Time (ACT) halting handles it with a learned per-position scalar that dynamically decides when to stop iterating. Hard positions receive more compute; tokens that have already converged stop early.
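An ACT-style halting loop can be sketched as follows: a learned scalar accumulates a halting probability per position, and positions that cross a threshold stop updating. Names and stopping-rule details here are illustrative:

```python
import torch
import torch.nn as nn

# Sketch of ACT-style per-position halting: a learned scalar accumulates a
# halting probability for each position; positions that cross the threshold
# stop updating. Names and details are illustrative.
class ACTLoop(nn.Module):
    def __init__(self, d, max_T=16, threshold=0.99):
        super().__init__()
        self.block = nn.Linear(d, d)
        self.halt = nn.Linear(d, 1)  # per-position halting scalar
        self.max_T, self.threshold = max_T, threshold

    def forward(self, h):
        cum = torch.zeros(*h.shape[:-1], 1)    # cumulative halting probability
        steps = torch.zeros(*h.shape[:-1], 1)  # loop steps actually used
        for _ in range(self.max_T):
            active = (cum < self.threshold).float()      # positions still "thinking"
            h = h + active * torch.tanh(self.block(h))   # halted positions are frozen
            cum = cum + active * torch.sigmoid(self.halt(h))
            steps = steps + active
        return h, steps  # steps varies per position: hard ones loop longer

torch.manual_seed(0)
act = ACTLoop(d=32)
h, steps = act(torch.randn(2, 8, 32))
```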

Finally, depth-wise LoRA adapters introduce a small rank-r adapter matrix at each iteration depth, giving each loop step slightly distinct behavior without adding substantial parameters, bridging the gap between pure weight tying and fully distinct layers.
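A minimal sketch of the idea: a shared weight W plus a rank-r correction B_t·A_t that differs at each loop depth t. Names and initialization choices are illustrative:

```python
import torch
import torch.nn as nn

# Sketch of depth-wise LoRA: a shared weight W plus a rank-r correction
# (B_t A_t) that differs at each loop depth t. Names are illustrative.
class DepthwiseLoRA(nn.Module):
    def __init__(self, d, T=16, r=4):
        super().__init__()
        self.W = nn.Linear(d, d)                            # tied across all depths
        self.A = nn.Parameter(torch.randn(T, r, d) * 0.02)  # per-depth down-projection
        self.B = nn.Parameter(torch.zeros(T, d, r))         # zero-init: starts as pure tying

    def forward(self, h, t):
        delta = h @ self.A[t].T @ self.B[t].T  # rank-r correction for depth t
        return self.W(h) + delta

torch.manual_seed(0)
layer = DepthwiseLoRA(d=64, T=16, r=4)
h = torch.randn(2, 64)
out = layer(h, t=3)  # at init, B == 0, so every depth behaves identically
```

Because B is zero-initialized, training starts from exact weight tying, and each depth only gradually acquires its own behavior.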

Why parameter efficiency matters

The Parky paper (Prairie et al., 2026) provides the empirical basis for the efficiency claim. At 770M parameters, an RDT matches a standard 1.3B transformer trained on identical data: roughly half the parameters for equivalent downstream quality. Both the optimal loop count and the optimal parameter count follow power laws with consistent exponents across scales, yielding the first predictive scaling laws for looped training.

The implication is significant: reasoning depth scales with inference-time compute, not with stored parameter count. This reframes one of the dominant assumptions in the scaling debate. The relevant axis may not be parameter count at training time, but loop depth at inference time.

What OpenMythos Contributes

OpenMythos contributes four concrete research tools: a fully configurable PyTorch implementation of the RDT hypothesis with an MoE FFN and Multi-head Latent Attention; stable LTI injection integrated as a first-class training primitive; depth-wise LoRA adapters that give each iteration distinct behavior; and a reproducible research baseline for studying looped-transformer dynamics and inference-time reasoning depth.

Whether Mythos is actually an RDT or not, OpenMythos gives the research community something tangible and runnable: an implementation of an architecture class that the literature increasingly suggests is underexplored, and one that may represent a radically different path to capable AI than simply training ever-larger models.
