p.enthalabs

GitHub - NohWai-Software/RayTention: Raytention is a new attention mechanism that solves the bloated k-v cache VRAM problem. Raytention utilizes 7 signals from

RayTention — Zero-KV-Cache Attention via Geometric Signal Extraction

[](https://github.com/NohWai-Software/RayTention#raytention--zero-kv-cache-attention-via-geometric-signal-extraction)

**PATENT PENDING** The architecture, L2-distance signal extraction methods, and zero-cache routing mechanisms described and implemented in this repository are the subject of a pending U.S. Patent Application (App. No. 64/102,801). All rights reserved by NohWai Software.

![Image 1: License](https://github.com/NohWai-Software/RayTention/blob/main/LICENSE)![Image 2: Python](https://www.python.org/)![Image 3: PyTorch](https://pytorch.org/)

**Raytention is a new attention mechanism that solves the bloated KV cache VRAM problem. Raytention utilizes 7 signals from the context to provide the model with the attention it needs at a much lower VRAM cost.**

- * *

Glossary

[](https://github.com/NohWai-Software/RayTention#glossary) | Term | What It Means | | --- | --- | | **Attention** | How a transformer decides which previous tokens are relevant to the current token. Standard attention computes a weighted average over all past tokens. | | **KV Cache** | The Key and Value tensors stored for every past token. Standard attention needs these to attend over the full history. At 1M tokens with a small model, this is 4.4 GB — and it grows with every token generated. | | **L2 Distance** | Euclidean (straight-line) distance between two vectors. RayTention uses this instead of dot-product to measure token similarity. | | **Softmax** | Turns a list of scores into probabilities that sum to 1.0. Used to decide how much attention each past token gets. | | **Signal** | A fixed-size summary computed from the attention weights and keys. RayTention extracts 7 signals (642 floats total) instead of storing the full KV cache. | | **AttnFFN** | A small feedforward network that processes the 7 signals into the layer's output. Replaces the weighted-sum step in standard attention. | | **Flash Attention** | A highly optimized CUDA kernel that makes standard attention use less memory during training. Does not eliminate the KV cache at inference. | | **Context / Context Window** | The sequence of previous tokens the model can "see" when predicting the next token. Longer contexts enable better understanding but cost more memory. | | **CE Loss** | Cross-entropy loss — measures how well the model predicts the next token. Lower is better. Random guessing on 16K vocab = ~9.7. |

- * *

Standard transformer attention stores every key and value for every token ever seen — the **KV cache**. At 1 million tokens with a 4-layer model, that's **4.4 GB** of memory that grows with every generated token. For production models (d_model=4096, 32 layers, bf16), 1M tokens requires over 500 GB.

RayTention compresses the entire context window into **7 geometric signals** — a fixed 642-float vector that never grows:

``` Standard: ctx=1K → 4 MB | ctx=1M → 4.4 GB (grows forever) RayTention: ctx=1K → 2.6 KB | ctx=1M → 2.6 KB (fixed) ```

The result: you can serve **160× more concurrent requests** on the same GPU at long context lengths.

How It Works

[](https://github.com/NohWai-Software/RayTention#how-it-works) Instead of dot-product attention (`Q·K/√d`), RayTention computes **L2 distances** between the query and every context key, then extracts 7 interpretable geometric features:

``` Query ──→ L2(Query, Key_i) for all i ──→ softmax weights ──→ 7 signals ──→ FFN Embeddings ──────────────────────────────────────────────────────┘ ```

The 7 Signals

[](https://github.com/NohWai-Software/RayTention#the-7-signals) | # | Signal | What It Captures | Dim | | --- | --- | --- | --- | | 1 | **Centroid** | Softmax-weighted average of all context keys — "where is the context?" | 128 | | 2 | **Temporal Centroid** | Weighted average with γ=0.7 decay toward recent tokens — "what's recent?" | 128 | | 3 | **Predecessor** | The immediately previous token's key — "what just happened?" | 128 | | 4 | **Top-1 Key** | Closest key by L2 distance — "what's most relevant?" | 128 | | 5 | **Top-2 Key** | Second-closest key — "what else matters?" | 128 | | 6 | **Spread** | Weighted average L2 distance — "how dispersed is attention?" | 1 | | 7 | **Entropy** | Shannon entropy of softmax — "how focused or confused?" | 1 |

**Total: 642 floats** — regardless of whether the context is 10 tokens or 10 million.

These signals feed into a small FFN (the AttnFFN) that produces the layer output, exactly like standard attention — but without ever storing or re-reading the full KV cache.

Architecture

[](https://github.com/NohWai-Software/RayTention#architecture)

``` Input → LN → L2 distances → 7 Signals → AttnFFN → + → LN → FFN → + → Output ↑ | └────────── Token Embeddings (shared, no KV cache) ─────────┘ ```

Both the standard baseline and RayTention use:

- **4 layers**, D=128

- **4,205,824 parameters** (matched)

- Same FFN structure, same LayerNorm, same training loop

The only difference: attention mechanism.

Benchmark Results

[](https://github.com/NohWai-Software/RayTention#benchmark-results) All numbers from real measurements on an RTX 5080 (16.6 GB). Reproduce with `python3 definitive_bench.py`.

Quality (Cross-Entropy on FineWeb 16K)

[](https://github.com/NohWai-Software/RayTention#quality-cross-entropy-on-fineweb-16k) | Step | Standard CE | RayTention CE | | --- | --- | --- | | 0 | 9.52 | 9.75 | | 1000 | 3.82 | 4.07 | | 1999 | 8.74 | 8.32 |

**Final: Standard 7.88 vs RayTention 7.85 — equal quality.** ✅

Inference VRAM

[](https://github.com/NohWai-Software/RayTention#inference-vram) | Context | Full MHA | GQA-4 | MLA | RayTention | vs Best | | --- | --- | --- | --- | --- | --- | | 16K | 170 MB | 120 MB | 119 MB | **102 MB** | 1.2× | | 65K | 371 MB | 170 MB | 169 MB | **102 MB** | 1.7× | | 131K | 640 MB | 237 MB | 236 MB | **102 MB** | **2.3×** | | 262K | 1,177 MB | 371 MB | 371 MB | **102 MB** | **3.6×** |

Compared against **three** industry attention mechanisms: full multi-head, GQA-4 (Llama 3 8B), and MLA (DeepSeek V2/V3). All three are O(ctx) — their KV cache grows linearly with context. RayTention is O(1). At 1M tokens, RayTention is estimated 14× smaller than the best alternative.

RayTention flatlines at 102 MB — just the model weights. The 642-float signal buffer (2.6 KB) is invisible at this scale.

Speed

[](https://github.com/NohWai-Software/RayTention#speed) | Context | Standard tok/s | RayTention tok/s | | --- | --- | --- | | 64 | 1,726 | 691 | | 4K | 1,746 | 660 | | 131K | 1,163 | 243 |

Standard is faster in PyTorch because it benefits from flash attention, cuBLAS matmul, and fused LayerNorm — decades of industry optimization. RayTention is a hand-rolled Python prototype. A native CUDA implementation would close this gap. See Future Work →

Quick Start

[](https://github.com/NohWai-Software/RayTention#quick-start)

Requirements: Python 3.10+, CUDA GPU, PyTorch

pip install -r requirements.txt

Run the benchmark (auto-generates synthetic data if .tok.gz is missing)

python3 definitive_bench.py

Or with real FineWeb data (place fineweb_16k_slice.tok.gz alongside)

FINEWEB_TOK_PATH=/path/to/fineweb.tok python3 definitive_bench.py

The benchmark prints live progress during training and produces four sections:

1. **CE Convergence** — trains both models for 2,000 steps, shows loss curves 2. **Inference VRAM** — measures peak CUDA memory at context lengths from 1K to 1M 3. **Per-token Speed** — measures latency at target context lengths with pre-filled caches

Important: The Optimization Gap

[](https://github.com/NohWai-Software/RayTention#important-the-optimization-gap) The speed comparison is not apples-to-apples at the implementation level:

**Standard Transformer benefits from:**

- Flash Attention (fused kernel from NVIDIA/Meta/PyTorch teams)

- cuBLAS matmul (decades of tuning)

- Fused LayerNorm and automatic kernel fusion

**RayTention (PyTorch) has:**

- No fused kernels — L2 distance runs as separate ops

- No flash attention equivalent (doesn't need one — O(1) output)

- Pure Python signal assembly

The native Rust + CUDA implementation already hits 18,816 tok/s in training mode. The PyTorch version is a research prototype for correctness comparison.

Where RayTention Wins

[](https://github.com/NohWai-Software/RayTention#where-raytention-wins) | Scenario | Advantage | | --- | --- | | **Long-context inference** | No KV cache — 42× less VRAM at 1M tokens | | **Multi-tenant serving** | 160× more concurrent users on one GPU | | **Edge deployment** | Runs on hardware where KV cache is prohibitive | | **Retrieval / RAG** | Compute signals once, reuse across queries (O(1) per query) | | **Interpretability** | Centroid, spread, entropy are human-readable features | | **Simplicity** | 5 kernel types vs dozens for standard transformers |

Where Standard Wins

[](https://github.com/NohWai-Software/RayTention#where-standard-wins) | Scenario | Reason | | --- | --- | | **Short contexts (<1K)** | KV cache is negligible; optimized matmul is faster | | **Training throughput** | Flash attention + fused kernels are highly tuned | | **Ecosystem maturity** | PyTorch/HF integration, tooling, proven at scale |

Future Work

[](https://github.com/NohWai-Software/RayTention#future-work)

- **RT Core acceleration** — RayTention's L2 distance scoring is fundamentally a ray tracing operation. NVIDIA RT cores (hardware-accelerated ray-triangle intersection) could compute L2 distances for thousands of keys in parallel with zero CUDA core utilization, making the scoring step effectively free.

- **Rust/CUDA inference binary** — fuse L2+topk+signals into one kernel; eliminate Python overhead

- **Incremental signals** — update centroid, spread, top-k in O(1) per new token instead of O(ctx)

- **Signal reuse across layers** — compute signals once, feed to all deeper layers

- **Longer training** — 100K+ steps on full FineWeb, downstream perplexity evaluation

- **Scaling laws** — how does quality hold at D=512, 768, 4096?

Files

[](https://github.com/NohWai-Software/RayTention#files) | File | Purpose | | --- | --- | | `definitive_bench.py` | Self-contained benchmark (PyTorch) | | `COMPARISON.md` | Full comparison report with all measurements | | `fineweb_16k_slice.tok.gz` | Compressed FineWeb 16K tokenized sample (optional) | | `requirements.txt` | Python dependencies |

Citation

[](https://github.com/NohWai-Software/RayTention#citation) If you use RayTention in your research:

@software{raytention2026, title = {RayTention: Zero-KV-Cache Attention via Geometric Signal Extraction}, year = {2026}, url = {https://github.com/NohWai-Software/RayTention} }

License

[](https://github.com/NohWai-Software/RayTention#license) AGPL-3.0 — see LICENSE