
GitHub - Kanchisaw03/axiom

Linux can preempt transformer inference mid-layer, and it backs tensor state with general-purpose 4 KB pages. The result is cache disruption and memory behavior that is poorly matched to inference on constrained machines.

AXIOM is a bootable Rust no_std kernel built to evaluate a different approach: make inference-critical OS primitives first-class and remove general-purpose abstractions from the hot path.

AXIOM is not a general-purpose operating system. It is an inference substrate.

Research Claim

AXIOM demonstrates that inference-specific OS primitives can be implemented in a bootable bare-metal kernel:

- tensor-native memory allocation

- layer-boundary scheduling

- double-buffered weight streaming

In current measurements, streaming overhead dropped from approximately 1.4 seconds per layer to approximately 42 microseconds per layer after prefetch-path corrections. Current end-to-end throughput is still bottlenecked by compute and by VM-based storage emulation. The intended evaluation target is memory-constrained 7B-class inference on bare metal NVMe.

Problem Statement

Transformer inference has runtime structure that is predictable at layer granularity and tensor-shape granularity. The default Linux stack is optimized for broad multiprogrammed workloads, not this structure.

Two failure modes motivate AXIOM:

1. Scheduler granularity mismatch. Linux CFS timer preemption is typically on the order of 4 ms. Attention and FFN segments in a single layer can span similar or longer intervals. Mid-layer preemption can evict the active working set, so the cache has to be re-warmed each time the interrupted layer resumes.

2. Memory abstraction mismatch. Linux buddy allocation and page-level management are generic. KV cache and layer weights are not generic: they are large, shape-stable tensors with predictable access. Under memory pressure (for example, 7B on 4 GB class systems), swap fallback dominates latency and makes interactive inference impractical.

AXIOM addresses both issues at kernel level, not in userspace.
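The memory-pressure case can be made concrete with back-of-envelope arithmetic. The sketch below is illustrative only: the function names are hypothetical, and the bits-per-weight figure is an assumption (4.0 for pure 4-bit weights, somewhat more once per-block scales are included), not AXIOM's actual Q4 layout.

```rust
/// Approximate weight footprint in bytes for a quantized model.
/// `bits_per_weight_x10` is in tenths of a bit (40 = 4.0 bits, 45 = 4.5 bits)
/// so the calculation stays in integer math. Hypothetical helper.
fn weight_bytes(params: u64, bits_per_weight_x10: u64) -> u64 {
    // bits = params * (x10 / 10); bytes = bits / 8  =>  params * x10 / 80
    params * bits_per_weight_x10 / 80
}

/// Bytes left for KV cache, activations, and the kernel once all weights
/// are resident. Returns None when the weights alone exceed RAM.
fn headroom_bytes(ram_bytes: u64, weights: u64) -> Option<u64> {
    ram_bytes.checked_sub(weights)
}
```

Under these assumptions, 7 billion parameters at 4.0 bits need about 3.5 GB, which leaves well under 1 GiB of a 4 GiB machine for everything else. That is the regime where full residency stops being practical.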

Scope

AXIOM currently boots, initializes memory/interrupt/scheduler/streaming/runtime subsystems, runs quantized transformer inference, emits per-layer timing and benchmark telemetry, and halts.

AXIOM intentionally excludes:

- general userspace

- filesystem services

- networking stack

- process model beyond inference task control

Architecture


```
+------------------------ AXIOM Boot Sequence ------------------------+
| main.rs                                                             |
|   serial::init()                                                    |
|   -> memory::init()                                                 |
|   -> interrupts::init()                                             |
|   -> scheduler::init()                                              |
|   -> streaming::init()                                              |
|   -> TransformerRuntime::new()                                      |
+---------------------------------------------------------------------+

Per-token Execution
+---------------------------------------------------------------------+
| forward_one_token()                                                 |
|   1) streamer.begin_layer(L)                                        |
|      - compute uses buffer A                                        |
|      - prefetch loads L+1 into buffer B                             |
|   2) scheduler.begin_layer() -> LayerLock acquire                   |
|      - timer preemption deferred until layer boundary               |
|   3) compute path                                                   |
|      - rmsnorm -> attention -> residual                             |
|      - rmsnorm -> ffn (proj/act/down) -> residual                   |
|   4) LayerLock release -> yield_at_layer_boundary()                 |
|   5) streamer.end_layer(L)                                          |
+---------------------------------------------------------------------+

Physically Contiguous Tensor Region (Pre-reserved)
+------------------------+-----------------------+--------------------+
| KVCachePool (~45%)     | WeightPool (~45%)     | Activation (~10%)  |
| fixed slabs            | slot state machine    | triple ring buffer |
| no hot-path alloc      | U->S->R->C->R/U       | index = pos % 3    |
+------------------------+-----------------------+--------------------+

Legend: U=Unloaded, S=Streaming, R=Resident, C=Computing
```

```
Virtual Memory Map (current QEMU VM configuration, not to scale)

+------------------------------+--------------------------------------+
| Address Range                | Region                               |
+------------------------------+--------------------------------------+
| 0x444444440000-0x444444540000| Kernel heap (1 MiB)                  |
| 0x20000000                   | Tensor region end                    |
| 0x1e666000-0x20000000        | ActivationPool (~10%, triple ring)   |
| 0x17333000-0x1e666000        | WeightPool (~45%, layer slots)       |
| 0x10000000-0x17333000        | KVCachePool (~45%, fixed slabs)      |
| 0x10000000                   | Tensor region base (256 MiB window)  |
+------------------------------+--------------------------------------+
```
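The pool boundaries in the map are consistent with a 45/45/10 split of the 256 MiB window in which the two 45% shares are rounded down to a 4 KiB boundary and the activation pool takes the remainder. The sketch below reproduces the addresses under that assumption. The names and the rounding rule are illustrative and are not taken from memory/tensor.rs.

```rust
/// Assumed alignment for pool boundaries (4 KiB).
const PAGE_SIZE: u64 = 4096;

/// Half-open [start, end) address ranges for the three pools.
#[derive(Debug, PartialEq, Eq)]
struct PoolLayout {
    kv_cache: (u64, u64),
    weights: (u64, u64),
    activations: (u64, u64),
}

/// Round `value` down to a multiple of `align` (align must be a power of two).
fn align_down(value: u64, align: u64) -> u64 {
    value & !(align - 1)
}

/// Split a contiguous region into ~45% KV cache, ~45% weights, and the rest
/// for activations. Hypothetical helper; the split rule is an assumption.
fn split_tensor_region(base: u64, size: u64) -> PoolLayout {
    let share = align_down(size * 45 / 100, PAGE_SIZE);
    let kv_end = base + share;
    let weights_end = kv_end + share;
    PoolLayout {
        kv_cache: (base, kv_end),
        weights: (kv_end, weights_end),
        activations: (weights_end, base + size),
    }
}
```

Calling `split_tensor_region(0x1000_0000, 0x1000_0000)` yields the three ranges listed in the map.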

1) Tensor-Native Allocator (replaces generic page-centric allocation)

Linux behavior replaced:

- generic buddy allocator over 4 KB pages

- no tensor-shape awareness

- fragmentation and page-level indirection for long-lived inference tensors

AXIOM behavior:

- pre-reserved physically contiguous tensor region

- three dedicated pools:

  - KVCachePool: fixed slabs, no hot-path relocation, no allocator churn

  - WeightPool: per-layer slots with explicit state machine (Unloaded, Streaming, Resident, Computing)

  - ActivationPool: triple ring buffer, zero allocation in forward hot path
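The WeightPool slot lifecycle (U->S->R->C->R/U in the architecture diagram) can be sketched as an explicit transition function. The event names below are hypothetical, and only the transitions shown in the diagram are accepted. Whether AXIOM permits others, such as evicting a Resident slot directly, is not stated in this README.

```rust
/// Slot states as named in the README legend.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum SlotState {
    Unloaded,
    Streaming,
    Resident,
    Computing,
}

/// Events that drive a slot; these names are illustrative assumptions.
#[derive(Clone, Copy, Debug)]
enum SlotEvent {
    StartLoad,
    LoadDone,
    BeginCompute,
    EndCompute,
    Evict,
}

/// Returns the next state, or None if the transition is not in the diagram.
fn transition(state: SlotState, event: SlotEvent) -> Option<SlotState> {
    use SlotEvent::*;
    use SlotState::*;
    match (state, event) {
        (Unloaded, StartLoad) => Some(Streaming),
        (Streaming, LoadDone) => Some(Resident),
        (Resident, BeginCompute) => Some(Computing),
        // After compute the slot either stays resident or is released.
        (Computing, EndCompute) => Some(Resident),
        (Computing, Evict) => Some(Unloaded),
        _ => None,
    }
}
```

Rejecting every other pair makes misuse visible, for example starting compute on a slot whose load has not completed.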

Why this is better for inference:

- tensor lifetimes and shapes are known

- allocator work is moved out of the token path

- memory layout matches access pattern instead of process-centric abstractions
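As an illustration of "zero allocation in the forward hot path", here is a minimal sketch of a triple ring over statically sized buffers, indexed by `pos % 3` as in the architecture diagram. The type, its element type, and the meaning of `pos` (taken here as a monotonically increasing step counter) are assumptions rather than AXIOM's actual ActivationPool.

```rust
/// Number of ring slots, matching "index = pos % 3" in the diagram.
const RING_SLOTS: usize = 3;

/// Fixed-size activation ring; all storage is reserved up front,
/// so selecting a slot never allocates. Hypothetical type.
struct ActivationRing<const N: usize> {
    slots: [[f32; N]; RING_SLOTS],
}

impl<const N: usize> ActivationRing<N> {
    const fn new() -> Self {
        Self { slots: [[0.0; N]; RING_SLOTS] }
    }

    /// Map a position counter to its ring slot.
    fn slot_index(pos: usize) -> usize {
        pos % RING_SLOTS
    }

    /// Borrow the slot for `pos` for writing.
    fn slot_mut(&mut self, pos: usize) -> &mut [f32; N] {
        &mut self.slots[Self::slot_index(pos)]
    }

    /// Borrow the slot for `pos` for reading.
    fn slot(&self, pos: usize) -> &[f32; N] {
        &self.slots[Self::slot_index(pos)]
    }
}
```

Positions 0 and 3 map to the same slot, so a buffer is reused three steps after it was written.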

2) LayerLock Scheduler (replaces time-slice preemption during layer compute)

Linux behavior replaced:

- periodic preemption independent of transformer layer boundaries

AXIOM behavior:

- explicit suppression of preemption while one layer executes

- scheduling decisions made only at yield_at_layer_boundary()

Why this is better for inference:

- preserves layer-local cache residency

- removes scheduler-induced mid-layer interruption

- makes timing behavior easier to reason about at model-layer granularity
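One way to defer timer preemption to a layer boundary is a pair of atomic flags: the runtime marks the layer as active, the timer path records a pending request instead of switching, and the boundary hook consumes it. This is a sketch of that general technique under assumed names. It is not the code in scheduler/layer_lock.rs, and it ignores multi-core concerns.

```rust
use core::sync::atomic::{AtomicBool, Ordering};

/// Hypothetical layer lock: suppresses task switches while a layer runs.
struct LayerLock {
    in_layer: AtomicBool,
    preempt_pending: AtomicBool,
}

impl LayerLock {
    const fn new() -> Self {
        Self {
            in_layer: AtomicBool::new(false),
            preempt_pending: AtomicBool::new(false),
        }
    }

    /// Called by the runtime before a layer's compute starts.
    fn begin_layer(&self) {
        self.in_layer.store(true, Ordering::SeqCst);
    }

    /// Called from the timer interrupt path.
    /// Returns true if the scheduler may switch tasks immediately;
    /// inside a layer the request is only recorded.
    fn on_timer_tick(&self) -> bool {
        if self.in_layer.load(Ordering::SeqCst) {
            self.preempt_pending.store(true, Ordering::SeqCst);
            false
        } else {
            true
        }
    }

    /// Called when the layer finishes. Returns true if a tick was deferred,
    /// meaning the caller should yield at this layer boundary.
    fn end_layer(&self) -> bool {
        self.in_layer.store(false, Ordering::SeqCst);
        self.preempt_pending.swap(false, Ordering::SeqCst)
    }
}
```

With a timer period near 4 ms and layers that take tens of milliseconds, several ticks may arrive during one layer. In this sketch they collapse into a single deferred yield.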

3) Double-Buffered Weight Streaming (replaces blocking layer fetch)

Linux/application behavior replaced:

- userspace-driven weight loading with limited control over kernel I/O scheduling overlap

- frequent compute stalls waiting on next-layer weights

AXIOM behavior:

- layer N computes from buffer A while layer N+1 is loaded into buffer B

- explicit begin_layer/end_layer integration between runtime and streamer

Why this is better for inference:

- directly overlaps I/O and compute at the layer boundary

- creates a controlled pipeline for models that do not fully fit in memory
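The buffer rotation can be sketched as follows. This is a simplified, synchronous model with hypothetical names: `load` stands in for the storage read, and the "prefetch" runs inline, whereas a real streamer would submit the request and let it complete while the current layer computes.

```rust
/// Two fixed buffers; `loaded[i]` records which layer buffer i holds.
struct DoubleBuffer<const N: usize> {
    bufs: [[u8; N]; 2],
    loaded: [Option<usize>; 2],
}

impl<const N: usize> DoubleBuffer<N> {
    fn new() -> Self {
        Self { bufs: [[0u8; N]; 2], loaded: [None; 2] }
    }

    /// Returns the weights for `layer` and stages `layer + 1` in the other
    /// buffer. `load(layer, buf)` fills `buf` with that layer's weights.
    fn begin_layer(
        &mut self,
        layer: usize,
        num_layers: usize,
        load: &mut impl FnMut(usize, &mut [u8]),
    ) -> &[u8] {
        let cur = layer % 2;
        if self.loaded[cur] != Some(layer) {
            // Cold start or prefetch miss: this load would block compute.
            load(layer, &mut self.bufs[cur]);
            self.loaded[cur] = Some(layer);
        }
        let next = layer + 1;
        if next < num_layers {
            // Stage the next layer in the buffer not used for compute.
            let other = next % 2;
            load(next, &mut self.bufs[other]);
            self.loaded[other] = Some(next);
        }
        &self.bufs[cur]
    }
}
```

In this model only layer 0 takes the blocking path. Every later layer finds its weights already staged by the previous call, which is the overlap the design relies on.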

Repository Layout


```
axiom/
|- Cargo.toml
|- x86_64-axiom.json
|- .cargo/config.toml
|- tools/
|  |- pack_weights.py
|  |- run_qemu.ps1
|  |- run_qemu_wsl_safe.sh
|- benchmarks/
|  |- compare.py
|  |- results.csv
|- src/
   |- main.rs
   |- serial.rs
   |- interrupts.rs
   |- memory/
   |  |- frame.rs
   |  |- paging.rs
   |  |- heap.rs
   |  |- tensor.rs
   |  |- mod.rs
   |- scheduler/
   |  |- layer_lock.rs
   |  |- mod.rs
   |- streaming/
   |  |- virtio.rs
   |  |- prefetch.rs
   |  |- mod.rs
   |- runtime/
      |- matmul.rs
      |- attention.rs
      |- rope.rs
      |- rmsnorm.rs
      |- ffn.rs
      |- sampling.rs
      |- tokenizer.rs
      |- mod.rs
```

Build Environment


- Rust nightly

- target: x86_64-unknown-none (no_std)

- bootloader: 0.9.34

- target features: +avx2, +fma, +sse2 (with SSE/AVX state enabled in kernel before inference)

- tested models: SmolLM2-135M Q4, TinyLlama-1.1B Q4

Typical flow:

    rustup toolchain install nightly
    rustup component add rust-src --toolchain nightly
    cargo install bootimage
    python tools/pack_weights.py --model TinyLlama/TinyLlama-1.1B-Chat-v1.0 --output axiom_weights.img --quant q4
    cargo +nightly run --release

Benchmark Progression (Single Optimization Session)

Environment for this progression:

- KVM on WSL2

- SmolLM2-135M Q4

- iterative kernel/runtime optimizations in one session

This table is the central empirical story: not one final number, but cumulative impact of architectural and kernel-path changes.

| Stage | TPS | TTFT (ms) | Mean Layer Time (us) |
| --- | --- | --- | --- |
| Baseline TCG | 0.023 | 43120 | 1445125 |
| + KVM enabled | 0.049 | 19855 | 681083 |
| + Prefetch fix | 0.185 | 5005 | 180583 |
| + Cold start fix | 0.238 | 3795 | 139791 |
| + FFN rewrite | 0.289 | 3300 | 115041 |
| + LUT optimization | 0.325 | 3355 | 102208 |
| + AVX2 Q4 unpack | ~0.340 | 2860 | ~98000 |

Overall improvement across this sequence is approximately 14.8x TPS (0.023 to ~0.340), with major gains coming from removing I/O stalls and reducing FFN projection cost.
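The overall and per-stage factors can be recomputed from the TPS column. The helper names below are illustrative, and the values are copied from the table, with the approximate ~0.340 entry treated as 0.340.

```rust
/// (stage, tokens per second) pairs copied from the benchmark table.
const STAGES: [(&str, f64); 7] = [
    ("Baseline TCG", 0.023),
    ("+ KVM enabled", 0.049),
    ("+ Prefetch fix", 0.185),
    ("+ Cold start fix", 0.238),
    ("+ FFN rewrite", 0.289),
    ("+ LUT optimization", 0.325),
    ("+ AVX2 Q4 unpack", 0.340),
];

/// Ratio of the last stage's TPS to the first stage's TPS.
fn overall_speedup(stages: &[(&str, f64)]) -> Option<f64> {
    let first = stages.first()?.1;
    let last = stages.last()?.1;
    Some(last / first)
}

/// Speedup of each stage relative to the stage before it.
fn step_speedups<'a>(stages: &[(&'a str, f64)]) -> Vec<(&'a str, f64)> {
    stages
        .windows(2)
        .map(|w| (w[1].0, w[1].1 / w[0].1))
        .collect()
}
```

On these numbers the largest single step is the prefetch fix (roughly 3.8x), followed by enabling KVM (roughly 2.1x). The later compute changes each contribute less than 1.3x.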

Primary Profiling Result

Before the prefetch-path correction:

- stream_us was approximately 1200000 to 1700000 us per layer

- stream accounted for approximately 99.5% of layer time

After the prefetch-path correction:

- stream_us is approximately 42 us per layer in trace logs

- streaming overhead is no longer the dominant term

Interpretation:

- the double-buffer design is validated as an architectural direction

- once streaming overhead collapsed, compute kernels became the limiting factor

Current Bottleneck Breakdown (0.340 TPS State)

Representative per-layer timing (current state):

- stream_us: ~42 us

- attn_us: ~9500 us

- ffn_proj_us: ~29000 us

- ffn_down_us: ~13000 us

- ffn_act_us: ~250 us

- rms_us: ~25 us

- layer_us total: ~52000 us

Dominant bottleneck is FFN projection/down-projection compute, not streaming.
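The shares behind that statement follow directly from the representative timings above. The sketch below sums the listed components and computes each group's fraction of the total. The helper names are illustrative, and the inputs are the approximate values from the list.

```rust
/// Representative per-layer timings in microseconds, copied from the list.
const LAYER_COMPONENTS_US: [(&str, u64); 6] = [
    ("stream_us", 42),
    ("attn_us", 9_500),
    ("ffn_proj_us", 29_000),
    ("ffn_down_us", 13_000),
    ("ffn_act_us", 250),
    ("rms_us", 25),
];

/// Sum of all listed components.
fn total_us(parts: &[(&str, u64)]) -> u64 {
    parts.iter().map(|p| p.1).sum()
}

/// Fraction of the total spent in the named components.
fn share_of(parts: &[(&str, u64)], names: &[&str]) -> f64 {
    let selected: u64 = parts
        .iter()
        .filter(|p| names.contains(&p.0))
        .map(|p| p.1)
        .sum();
    selected as f64 / total_us(parts) as f64
}
```

The components sum to 51,817 us, in line with the ~52,000 us total. FFN projection plus down-projection account for about 81% of that, and streaming for under 0.1%.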

Limitations

These are current facts, not caveats to hide.

- Benchmarks are on KVM with emulated virtio-blk storage, not bare metal NVMe.

- SmolLM2-135M fits in RAM; at this size, streaming overhead is not the target benefit.

- VM noise is material; two similar runs have shown 0.278 vs 0.323 TPS. Report medians over repeated runs, not single samples.

- AVX2 intrinsic path was attempted and regressed relative to scalar LUT in this environment; scalar Q4 LUT path remains default.

- No llama.cpp comparison is presented here because AXIOM targets memory-constrained workloads where full in-memory baseline assumptions do not hold.

- Current runtime still contains a simplified final logit projection path in kernel code; this is functional for bring-up but not a final-quality output head implementation.
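For the run-to-run noise noted above, a median and a simple spread measure over repeated TPS samples might look like this. It is a minimal sketch with hypothetical helper names. It is not the logic in benchmarks/compare.py, and two samples are far too few for a real protocol.

```rust
/// Median of the samples, or None for an empty slice.
fn median(samples: &[f64]) -> Option<f64> {
    if samples.is_empty() {
        return None;
    }
    let mut sorted = samples.to_vec();
    sorted.sort_by(|a, b| a.total_cmp(b));
    let mid = sorted.len() / 2;
    Some(if sorted.len() % 2 == 0 {
        // Even count: average the two middle values.
        (sorted[mid - 1] + sorted[mid]) / 2.0
    } else {
        sorted[mid]
    })
}

/// (max - min) / median: a crude dispersion figure for small sample counts.
fn relative_range(samples: &[f64]) -> Option<f64> {
    let med = median(samples)?;
    let min = samples.iter().copied().fold(f64::INFINITY, f64::min);
    let max = samples.iter().copied().fold(f64::NEG_INFINITY, f64::max);
    Some((max - min) / med)
}
```

Applied to the two runs quoted above (0.278 and 0.323 TPS), the median is about 0.30 and the relative range about 15%. That is larger than several of the individual stage-to-stage gains in the progression table.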

Intended Evaluation Target

The evaluation target is not small models that fit comfortably in memory. The target is:

- 7B-class models on constrained RAM footprints (for example 4 GB class systems)

- conditions where full model residency is impossible

- where overlap of streaming and compute determines feasibility

This is where OS-level control over scheduling, memory layout, and I/O pipeline is expected to matter most. This remains a hypothesis until it is measured on the target hardware.

Current Status and Future Work

Done now:

- bootable Rust no_std kernel with end-to-end inference path

- tensor-native allocator integrated with runtime

- layer-boundary scheduling primitive integrated with timer path

- double-buffered streaming integrated with layer execution

- measured reduction of stream overhead from second-scale to microsecond-scale per layer

- repeatable fast-benchmark runs around ~0.34 TPS in current VM setup

Pending next:

- bare-metal NVMe evaluation on memory-constrained 7B workloads

- stronger statistical protocol (median, dispersion, repeated trials per configuration)

- compute kernel optimization focused on FFN projection/down-projection

- resolution of AVX2 path regressions with architecture-aware dispatch/tuning

- replacement of simplified output-head path with full production logits path

What bare metal changes:

- removes virtio emulation overhead and a major source of VM jitter

- gives direct visibility into storage and cache behavior under realistic constraints

- provides the first valid setting to evaluate AXIOM's central workload claim: feasible low-memory 7B inference with bounded latency