TL;DR
I'm Francisco, a 17-year-old independent AI researcher in Brazil. I built Mixture of Collaborators (MoC), a sparse LLM architecture. In a matched 498.8M-parameter head-to-head trained on 9.83B tokens, MoC outperformed a standard MoE baseline on every metric I measured.
Results:
- Validation perplexity: 20.40 (MoC) vs 20.70 (MoE), a 0.30-point (1.4% relative) improvement.
- Max routing Gini coefficient: 0.153 vs 0.283. 45.9% reduction in worst-layer routing imbalance.
- Zero dead experts in either model across all 10 layers.
- Total compute cost: ~$200 on AMD MI300X (donated credits).
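For readers who want to sanity-check the load-balancing numbers above: a Gini coefficient over per-expert token counts is 0 for perfectly balanced routing and approaches 1 as tokens collapse onto one expert. This is a minimal pure-Python sketch of one standard way to compute it (the exact formula used in the paper may differ in detail):

```python
def routing_gini(token_counts):
    """Gini coefficient of per-expert token loads.

    0.0 = perfectly balanced routing; values near 1.0 mean almost
    all tokens are routed to a single expert (routing collapse).
    """
    counts = sorted(token_counts)
    n = len(counts)
    total = sum(counts)
    if total == 0:
        return 0.0
    # Standard closed form: G = 2 * sum_i(i * x_i) / (n * sum(x)) - (n + 1) / n,
    # with x sorted ascending and i running from 1 to n.
    weighted = sum((i + 1) * x for i, x in enumerate(counts))
    return (2 * weighted) / (n * total) - (n + 1) / n

routing_gini([100] * 8)                                # -> 0.0 (balanced)
routing_gini([350, 50, 50, 50, 50, 50, 50, 50])        # -> 0.375 (skewed)
```

A max-layer Gini of 0.153 therefore sits much closer to the balanced end than the MoE baseline's 0.283.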
Everything is public.
Paper: https://github.com/Auren-Research/lunaris/blob/main/paper/main.pdf
Code: https://github.com/Auren-Research/lunaris
Experiment logs: https://wandb.ai/smeryylle-moon-cloud-services-/lunaris-moc-validation
I'm requesting $5K minimum, $15K goal to continue this research for 6 months. Without funding I have about 2 months of runway left and will not be able to continue full-time.
---
What is MoC?
Standard Mixture-of-Experts routing treats selected experts as conditionally independent paths whose outputs are merged only at the end. MoC adds three mechanisms on top of standard sparse routing.
1. Mediator-based collaboration. After top-k routing, selected experts exchange information through a shared learned mediator state. Per-token cost stays linear in top-k experts (O(K)), not quadratic in total experts (O(E²)). This is the key architectural contribution.
2. Iterative Reasoning Layers. Each expert performs weight-shared multi-step refinement, giving deep computation without parameter growth.
3. Adaptive compute gates. Per-token learned gates choose how many reasoning steps and collaboration rounds to run. Validated as a behavioral result at small scale; full inference benchmarks are future work.
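To make the collaboration mechanism concrete, here is a deliberately simplified pure-Python sketch of one token passing through a toy MoC-style block: top-k routing, independent expert computation, a single collaboration round through a shared mediator, then the usual weighted merge. The expert callables, the mean-based mediator, and the 0.5 mixing factor are all illustrative stand-ins; the real architecture uses learned cross-attention and multiple rounds. Note the cost is linear in k, not quadratic in the expert count:

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def moc_token_step(x, experts, gate_scores, k=2):
    """One token through a toy MoC block.

    experts: list of callables vec -> vec (stand-ins for expert FFNs).
    gate_scores: router logits, one per expert.
    """
    # 1. Standard top-k sparse routing, as in vanilla MoE.
    top = sorted(range(len(experts)), key=lambda i: gate_scores[i], reverse=True)[:k]
    weights = softmax([gate_scores[i] for i in top])

    # 2. Each selected expert computes its output independently.
    outs = [experts[i](x) for i in top]

    # 3. Collaboration round: a shared mediator state aggregates the k
    #    expert outputs, then each output is refined against it.
    #    (The real architecture uses learned cross-attention here.)
    dim = len(x)
    mediator = [sum(o[d] for o in outs) / k for d in range(dim)]
    outs = [[0.5 * o[d] + 0.5 * mediator[d] for d in range(dim)] for o in outs]

    # 4. Weighted merge of the (now collaboration-refined) expert outputs.
    return [sum(w * o[d] for w, o in zip(weights, outs)) for d in range(dim)]
```

Steps 2–3 touch only the k selected experts, which is why the per-token cost stays O(K) rather than O(E²).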
At 500M scale, MoC retains the full MoE efficiency story (215.6M active params per token, same as baseline) while both lowering perplexity and substantially improving load balancing. The current implementation pays a ~2× wall-clock penalty, which is a solvable systems bottleneck (Python expert loop, no grouped GEMM), not an architectural one.
---
Why this matters for AI safety
Sparse architectures are now standard in frontier open models (Mixtral, DeepSeek, Qwen-MoE). They are also one of the hardest parts of a modern LLM to study mechanistically. Expert routing is an opaque gating function, and real-world MoE models suffer from dead experts and routing collapse that corrupt any interpretability analysis you try to run on top of them.
MoC's mediator cross-attention is an explicit, inspectable communication channel between selected experts. This channel is absent in vanilla MoE, where experts never directly interact. That gives interpretability research a concrete structural affordance:
- Routing stability (Gini 0.153, zero dead experts) makes per-expert mechanistic analysis tractable where vanilla MoE routing collapse typically blocks it.
- The mediator attention matrix is a first-class object. You can probe it, visualize it, edit it.
- Entropy regularization on the gate gives a controllable knob on route determinism vs exploration.
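As a sketch of what that last knob looks like in practice (the sign convention and beta value here are illustrative assumptions, not the paper's exact loss): an entropy term on the gate's probability distribution can be added to the training loss, where a larger beta rewards higher-entropy, more exploratory routing and beta = 0 recovers the plain router.

```python
import math

def gate_entropy(probs, eps=1e-12):
    """Shannon entropy of a router's probability distribution over experts."""
    return -sum(p * math.log(p + eps) for p in probs)

def entropy_regularized_loss(task_loss, gate_probs, beta=0.01):
    """Add an entropy bonus to the task loss.

    beta > 0 pushes routing toward exploration (higher entropy);
    beta = 0 leaves the router unregularized. Illustrative only.
    """
    return task_loss - beta * gate_entropy(gate_probs)
```

Sweeping beta (one of the Month 3–4 ablations below) then directly trades off route determinism against exploration.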
I do not claim MoC "solves interpretability for sparse models." I claim it is a strictly more probeable substrate for the interpretability work that is already happening on MoE.
Capability risk from this specific work is bounded. A ~1.4% relative PPL improvement at sub-1B scale does not move the frontier. Sparse architectures are already deployed at scale; MoC is an interpretability-favorable substitute for one of their blocks, not a new capability class.
---
What this grant funds (6 months)
Month 1–2: Scale MoC to 1B parameters. Reproduce the 500M wins at larger scale. Multi-seed trials (3 seeds) on the 500M MoC vs MoE comparison to give the first confidence intervals.
Month 3–4: Architectural ablations — mediator depth, expert count (8 → 16 → 32), entropy-regularization beta sweep. Systems optimization: replace the Python expert loop with grouped GEMM or fused kernels to fix the 2× wall-time penalty.
Month 5–6: Release MoC-1B weights, training recipe, and an interpretability toolkit (mediator attention probes, routing trajectory visualization, expert specialization across layers). Downstream evaluation on MMLU, HumanEval, GSM8K. Public technical report and code release.
Everything released open-source under the existing repo. No proprietary holdback.
---
Budget (total $15K goal)
- Living costs (Pirapora, Brazil): $1,200/mo × 6 months = $7,200
- Complementary compute beyond donated AMD credits: ~$4,000 (H100 hours for multi-seed 500M runs, 1B scaling, downstream eval)
- Tooling, storage, misc (W&B team plan, HuggingFace storage, domain renewals, ~$150/mo): $900
- Buffer for unexpected systems costs / kernel debugging on non-AMD hardware: ~$2,900
At $5K minimum, I cover ~3-4 months of living costs and continue MoC-1B scaling on existing donated AMD credits only. No additional hardware, no systems optimization work, no downstream eval.
At $15K goal, the full 6-month plan including multi-seed runs, grouped-GEMM implementation, and downstream benchmarks becomes feasible.
---
About me and track record
I started AI research at 15. No university, no institutional affiliation, no advisor. Working alone, I built the MoC architecture, the training infrastructure, a 1T-token data pipeline, and custom Triton kernels for AMD MI300X.
External validation signals:
- OVHcloud Startup Program: accepted at 15 based on working code alone, no pitch deck, no credentials. ~€10K in cloud credits, long since spent.
- Lambda Cloud Research Program: $1K compute grant.
- LTFF: application currently under review with Caleb Parikh.
- Z Fellows: interviewed; feedback was "really impressed"; final decision pending after a 2-month check-in.
- Emergent Ventures: had a prior call with Tyler Cowen; follow-up re-engagement in progress.
Public technical artifacts:
- Working 500M-parameter training infrastructure on AMD MI300X with FSDP, W&B integration, and MFU estimation, built from scratch.
- Custom Triton kernels (fused RMSNorm+residual, SwiGLU, weighted residual update) with correct autograd backwards and AMD/NVIDIA autotuning.
- ~1T-token pretraining data pipeline with manifest-based resume, adaptive memory management, validated on 16GB hardware.
Links:
- Project: https://github.com/Auren-Research/lunaris
- Personal: https://github.com/MeryylleA
- LinkedIn: https://www.linkedin.com/in/francisco-antonio-0434aa284/
- HuggingFace: https://huggingface.co/meryyllebr543
---
Limitations and honest risks
Of the research itself:
- The 500M comparison is single-seed (seed 1337). I have not yet run confidence intervals. That is explicitly Month 1 of this grant.
- Evaluation is currently perplexity-only. No downstream benchmarks yet. Perplexity wins don't always translate cleanly to task performance.
- Current implementation has a 2× wall-time penalty vs MoE. Systems problem, not architectural, but not yet fixed.
- Long-context behavior (>512 tokens) is untested.
Of this grant specifically:
- MoC-1B might fail to reproduce the 500M wins; my honest prior on that is ~15%. The 500M result survived 30× more tokens than the 64M pilot, which is a decent robustness signal, but another doubling of scale is not free.
- Systems optimization (grouped GEMM for experts) might take longer than planned. If it does, I will reallocate the efficiency work to a later grant and focus remaining time on multi-seed and downstream-eval deliverables.
- My runway is ~2 months. If this grant and the LTFF both fall through, I will need to take contract work and research cadence will drop.
---
Why I'm raising on Manifund
Institutional grants (LTFF, Foresight, NLnet) have decision cycles of 1–3+ months. My runway is 2 months. Manifund regrantors can decide in days when the case is clear, and I believe this case is clear: results are public, reproducible, and paired against a fair baseline.
I will continue this work regardless of funding. The code is open, the paper will get published either way. The question is whether I do it full-time from Brazil on a concentrated 6-month sprint, or part-time over 2+ years around contract work.
Happy to answer technical questions about MoC, the 500M run, the systems bottleneck, or anything else. Comments welcome.