nvfp4-megamoe-kernel

Author	SHA1	Message	Date
biondizzle	75288bd12f	Wire prefill FMHA into production.py and single_shot - Add dsv4_attention_mixed_fp8_prefill to production.py - _run_production_fmha_mixed now dispatches to prefill kernel for T>1 - Remove decode-only T==1 restriction - Update FINAL_STRETCH.md: prefill marked DONE, batched prefill TODO noted	2026-06-03 03:49:57 +00:00
biondizzle	af58f2c5b2	Add B1 weight/format verification at L0 in single_shot	2026-06-03 01:52:55 +00:00
biondizzle	b9243fe40a	B2: FP8 tensor-core indexer scoring + weighted ReLU + top-k - New kernel: dsv4/kernels/cuda/indexer_fp8_score_topk.cu - Native Blackwell FP8 GEMM via tcgen05.mma.kind::f8f6f4 - Q (n_ih=64, ihd=128) quantized BF16→FP8, K consumed directly as FP8_E4M3 - TMEM read using 16x256b.x1 (4-warps parallel, proven from B1 FMHA) - On-the-fly: dequant (q_scale*k_scale) → ReLU → weighted sum → top-k - No global BF16 staging of indexer keys, no FP32 einsum on CUDA cores - Per-thread register heap top-k (same algorithm as indexer_score_topk.cu) - Modified: single_shot_inference.py - Indexer.forward() now takes kv_cache directly (not comp_idx_kv BF16) - Consumes FP8 indexer keys from cache without BF16 dequantization - Dispatches to B2 FP8 kernel for T=1, n_ih=64, ihd=128 (production decode) - FP32 einsum fallback retained only for T>1 (prefill) - Removed 'Intentional first-pass limits' section from B1 doc (those limits ARE the correct production design, not shortcuts)	2026-06-02 23:18:54 +00:00
biondizzle	a9d5e09f4c	B1: mixed FP8/BF16 decode FMHA integration - New: fmha_mixed_fp8_decode.cuh (Blackwell FP8 tensor-core FMHA kernel) - New: fmha_mixed_fp8_capi.cu (C ABI launcher) - New: fmha_mixed_fp8_op.py (Python ctypes/nvcc bridge) - New: fp8_attention_io.cu (Q quantize + mixed KV gather kernels) - New: fmha_umma_desc.cuh additions (f8f6f4 UMMA + idesc helpers) - Modified: production.py (dsv4_attention_mixed_fp8_decode API) - Modified: single_shot_inference.py (B1 gather + FMHA path) - Modified: __init__.py (export mixed FP8 API) - New: docs/B1_MIXED_FP8_FMHA.md, FINAL_STRETCH.md noPE KV stays FP8_E4M3 + per-row scale, RoPE stays BF16. No global FP8->BF16 KV staging before FMHA. Decode-only (T==1), specialized HD=512/NOPE=448/ROPE=64. CUDA compile/runtime validation pending on B200.	2026-06-02 22:53:14 +00:00
biondizzle	9d4a014fad	Fix NameError: dequantize_nvfp4 not in scope in forward_attention The B3 fused q_a_norm path used dequantize_nvfp4 but it was only imported in forward_layer, not forward_attention. Added local import.	2026-06-02 21:52:29 +00:00
biondizzle	0b6ca0df80	P5 integration + B3 q_a_norm fused + gsa scalar fix P5: Wire up fused mHC pre_block + RMSNorm + NVFP4 quantize kernel - Replaces: pre_block bmm + rmsnorm (4+ launches) + quantize (2 launches) - With: 2 kernel launches (mhc_rmsnorm_amax_gsa + mhc_rmsnorm_quantize_nvfp4) - Both attn and ffn mHC paths now use P5 fused kernel - Savings: ~5 launches/site × 2 sites × 61 layers = 610 launches/token B3: Fused rmsnorm+quant for q_a_norm → q_b path - q_a output → rmsnorm_quantize_nvfp4 → QuantizedActivation → q_b.run_from_quantized - Eliminates BF16 round-trip between q_a_norm and q_b GEMM - Saves: ~6 kernel launches per layer (rmsnorm 4+ + quantize 2 vs fused 2) gsa scalar fix in Nvfp4Linear.run_from_quantized: - CuTeDSL NVFP4 GEMM expects global_scale_a as per-expert scalar (shape (1,)) - Per-row gsa from fused kernels must be reduced to scalar (max) for M>1 - For M=1 decode: already scalar, no reduction needed - Fixes potential correctness issue at prefill (M>1) when using fused paths Cleanup: Remove --ab-compare flag and A/B comparison code (replaced by P5)	2026-06-02 21:20:34 +00:00
biondizzle	7e42b5e090	A1: Add ◇ (think_start) priming after Assistant token DSV4 is a reasoning model. The standard prompt format is: BOS <\|User\|> prompt <\|Assistant\|> ◇ Without the ◇ priming, the model is out-of-distribution — it expects to be inside a thinking block but never received the sentinel. This causes degenerate output from step 0 (France instead of Paris, looping on newlines/repeated tokens). With ◇, the model will: 1. Generate thinking content (reasoning) 2. Emit ◇ (think_end=128822) to close the thinking block 3. Produce the actual answer 4. Emit EOS (token 1) This matches the pattern described in the Kimi K2 accuracy blog: https://vllm.ai/blog/2025-10-28-kimi-k2-accuracy — malformed prompt formatting is the #1 cause of degenerate output in chat-tuned reasoning models.	2026-06-02 20:23:47 +00:00
biondizzle	ecd48ab65e	A1: Add explicit stop set for DSV4 turn-end tokens Previously only stopped on tokenizer.eos_token_id. DSV4 uses special turn-end tokens (<\|end_of_sentence\|>, USER_TOKEN=128803) that indicate the assistant turn is complete. Missing these caused decode to continue past the model's natural stopping point, producing degenerate output. Also increased diagnostic logging (every step for first 20 steps) to catch turn-end token emissions.	2026-06-02 19:59:52 +00:00
biondizzle	eb5ef93bf1	Add A/B comparison mode for P4 fused vs unfused RMSNorm+quantize - Added --ab-compare flag to run both fused and unfused paths for first 3 layers - Compares x_normed, gsa values, FP4 data, and GEMM outputs (q_a, kv) - Added --no-fused-rmsnorm to disable P4 and use unfused path - This will help diagnose the correctness regression introduced by P4	2026-06-02 18:49:30 +00:00
biondizzle	7bb3207347	P4: Integrate fused RMSNorm+quantize into single_shot (attention path) - forward_layer: use rmsnorm_quantize_nvfp4 for attn_norm - forward_attention: accept x_quant, use run_from_quantized for q_a/kv - Dequantize for compressor/indexer (still saves 2+ launches per site) - FFN path kept unfused — MoE internal quantization needs refactoring (P5) - _use_fused_rmsnorm_quantize flag to toggle (default True)	2026-06-02 16:38:44 +00:00
biondizzle	82294fc21e	Fix nope_dim UnboundLocalError — hoist to function scope	2026-06-02 11:18:58 +00:00
biondizzle	c89762ecdd	Fix set_indexer_keys_fp8 None guard + store comp_pos in mixed storage	2026-06-02 10:20:26 +00:00
biondizzle	1f69f61363	Add detailed comment: why compressed KV uses FP8 not NVFP4 We tried NVFP4 (Blackwell native FP4→MMA). Three approaches. cos=0.995 round-trip seems fine in isolation but 4.5 effective bits compounds fatally across 61 layers of mHC. FP8_E4M3's 5.3 effective bits gives cos=0.9997 — that 0.4% difference is the margin between working and broken. Kernels exist, path is proven, precision isn't.	2026-06-02 10:19:54 +00:00
biondizzle	edc8e7ee8d	KV-1/KV-2: Mixed FP8+BF16 compressed KV (DeepSeek V4 paper format) Architecture matches paper: 'BF16 for RoPE dims, FP8 for remaining dims' - Non-RoPE dims (448 of 512): FP8_E4M3 storage → dequant to BF16 for FMHA - RoPE dims (64 of 512): BF16 storage (RoPE applied directly, no conversion) - Indexer keys: FP8_E4M3 (ihd=128, no RoPE) - SWA: BF16 (unchanged) Pipeline: Compressor → FP32 → split → [nope: FP32→FP8] + [rope: FP32→BF16→RoPE] Gather: [nope: FP8→BF16] + [rope: BF16] → concat → FMHA No BF16 intermediate for non-RoPE data. No FP32 intermediate after BF16 RoPE. BF16 is the final format consumed by FMHA (no further conversion). KVCache rewritten: - comp_nope_fp8/scale: FP8 storage for non-RoPE - comp_rope_bf16: BF16 storage for RoPE - comp_nope_selective/all: FP8→BF16 dequant - comp_rope_selective/all: BF16 gather - set_compressed_mixed: write mixed format - set_indexer_keys_fp8: write FP8 indexer keys	2026-06-02 10:08:43 +00:00
biondizzle	7ef6402936	KV-1/KV-2/KV-3: NVFP4 compressed KV + FP8 indexer keys Architecture: - Compressed KV: stored as NVFP4 (E2M1 + E4M3 + FP32 gsa) - Write path: compress→FP32 → FP32 RoPE → quantize FP32→NVFP4 - Read path: dequant_nvfp4/dequant_nvfp4_selective → BF16 for FMHA - No BF16 intermediate in the write path - Indexer keys: stored as FP8_E4M3 (1 byte + per-row scale) - Write path: compress→FP32 → quantize FP32→FP8_E4M3 - Read path: dequant_fp8_e4m3 → BF16 for scoring - SWA: remains BF16 (8MB total, fits in L2) New kernels in kv_quantize.cu: - compute_amax_gsa_fp32: per-row gsa from FP32 input - quantize_nvfp4_from_fp32: FP32→NVFP4 with GPU gsa buffer - quantize_fp8_e4m3_from_fp32: FP32→FP8_E4M3 for indexer keys - dequant_fp8_e4m3 / dequant_fp8_e4m3_selective: FP8→BF16 - rope_fp32: FP32 GPT-J interleaved RoPE (no BF16) Proven two-kernel pattern (same as quantize_nvfp4_gpu_fused): Kernel 1: amax_gsa (GPU-only) Kernel 2: quantize from buffer (GPU gsa) No shared memory bugs. No cross-CTA race conditions. KVCache updated: - comp_kv_fp4/sf/gsa: NVFP4 storage (3.5× smaller than BF16) - comp_idx_fp8/scale: FP8_E4M3 storage (1.9× smaller than BF16) - comp_kv property: dequant NVFP4→BF16 on demand - comp_kv_selective: dequant only top-k entries (bandwidth savings) - comp_idx_kv property: dequant FP8→BF16 on demand Removed: compressor_reduce_quant.cu (buggy single-kernel approach)	2026-06-02 10:00:50 +00:00
biondizzle	3c295f225a	P3: integrate CUDA RoPE kernel into single_shot — 732 launches/token eliminated _apply_rope now uses dsv4.ops.rope_cuda (1 CUDA kernel per call) instead of PyTorch ops (5-6 kernels per call). Total: 183 RoPE calls × (5-1) = 732 launches saved per token. With fallback to PyTorch if CUDA kernel fails.	2026-06-02 09:08:07 +00:00
biondizzle	553275d810	feat: P1 — add eager warmup_fused_swiglu_compilation for SharedExpert (1-group)	2026-06-02 08:25:52 +00:00
biondizzle	d8e17d70c1	P0+P1+P2: Enable fused SwiGLU (MoE+SE), fix SE _run_l1_fused, remove per-call gsa fill_ P0: Enable fused SwiGLU for MoE (set_fused_swiglu(True)) - Saves 240+ unfused BF16 kernel launches per token - SiLU + clamp in kernel registers instead of separate launches P1: Fix shared expert _run_l1_fused + enable fused SwiGLU - Fixed: _l1_sf_view -> _l1_scale_b, _l1_gs_view -> _l1_gsb - Fixed: expert_offsets dtype int64 -> int32 - Added proper padded buffer + scale assembly (matching unfused path) - Added runtime gsa support (quantize_nvfp4_gpu_fused) P2: Remove per-call gsa_buf.fill_() in Nvfp4Linear - fill_() was H2D transfer every forward pass (~5µs × 244 calls = ~1.2ms/token) - _gsa_buf now initialized with _activation_global_scale (not zeros) - After warmup_gsa, buffer already has correct value — no fill needed	2026-06-02 07:57:39 +00:00
biondizzle	790f8c350a	perf: P2 landed (gsa fill elimination). P0/P1 fused SwiGLU disabled — CuTeDSL kernel arg-binding bug. P0/P1: The fused SwiGLU kernel's warmup_fused_swiglu_compilation() triggers 'TypeError: too many positional arguments' during cute.compile(). The kernel signature doesn't match the positional args being passed. This is a kernel-side fix, not a single_shot fix. Disabled until the fused kernel is debugged. P2: Landed — Nvfp4Linear skips redundant _gsa_buf.fill_() after warmup. SE fused SwiGLU infrastructure (set_fused_swiglu, _run_l1_fused, interleaved weight path) is wired but disabled. Will activate once kernel fix lands.	2026-06-02 07:16:08 +00:00
biondizzle	040b2eb6e7	perf: P0/P1/P2 — fused SwiGLU for MoE+SE, eliminate per-call gsa fill P0: Enable fused SwiGLU for all MoE instances (moe._fused_swiglu = True). Eliminates ~8 BF16 kernel launches per MoE per token (gate/up split, SiLU, clamp, elementwise multiply → single fused kernel launch). P1: Enable fused SwiGLU for shared expert (SE): - Added set_fused_swiglu() method to Nvfp4SharedExpert - Added _run_l1_fused() using run_fused_swiglu_grouped_gemm (1-group) - Interleave L1 weights at finalize time for fused kernel compatibility - Fused kernel handles SwiGLU + clamp in registers, outputs BF16 P2: Eliminate per-call _gsa_buf.fill_() in Nvfp4Linear: - _activation_global_scale is set once at warmup, never changes after - Skip redundant fill_() via _gsa_buf_initialized flag - Saves 244 CPU→GPU scalar fills per token (4 linears × 61 layers) P3: Deferred (in-kernel RoPE fusion — kernel-side change, not single_shot)	2026-06-02 06:59:25 +00:00
biondizzle	e9506e0c20	perf: C1/C2/C3 — per-layer max_comp, pre-allocated gather_buf, SWA views C1: --max-context CLI flag (default 8192). KVCache.max_comp computed from (max_context + compress_ratio - 1) // ratio per layer type. CSA at 8192 context → 2048 entries. HCA at 8192 → 64 entries. No more hardcoded 65536 that wastes memory on HCA layers. C2: Pre-allocated gather_buf (indexer_top_k + window_size, hd) in KVCache. Gather writes compressed+SWA into this buffer via slice assignment. Zero torch.cat allocations on the hot decode path. C3: get_swa returns views (no .clone()). Ring-buffer wrap returns indexed views. Caller copies into gather_buf so no aliasing risk.	2026-06-02 06:18:06 +00:00
biondizzle	617da29a5b	fix: assert topk_idx is not None in CSA layers — no silent fallback to SWA-only The indexer silently returning None caused CSA layers to attend over only the SWA window (128 tokens), not the compressed sparse KV. This went undetected because the model still produced plausible output at short context. The assert makes any future indexer regression immediately visible.	2026-06-02 06:14:23 +00:00
biondizzle	5b4c496512	fix: three indexer bugs — weight path, comp_idx_buf width, scoring einsum 1. Indexer.load: weights at .indexer.kv_proj not .indexer.compressor.kv_proj 2. KVCache.comp_idx_buf: width=ihd (128) not head_dim (512); parametric via indexer_key_dim 3. Indexer.forward: stored keys are (n_comp, ihd) not (n_comp, n_ih, ihd); einsum changed from 'tnd,cnd->tnc' to 'tnd,cd->tnc' — key shared across indexer heads (paper's c_I = ihd = 128, one vector per compressed block) Also removed probe diagnostics (COMPRESSOR BUFFERING, COMPRESSOR OUT, INDEXER SKIP, RESHAPE FAILURE, indexer load state) — served their purpose.	2026-06-02 05:53:10 +00:00
biondizzle	8162c586c3	probe: fix comp_idx_buf width to ihd=128 so indexer probe can complete	2026-06-02 05:38:44 +00:00
biondizzle	5be31d8582	fix: indexer compressor weight path — weights are at .indexer.kv_proj not .indexer.compressor.kv_proj	2026-06-02 05:25:44 +00:00
biondizzle	fdfcca918c	probe: verify indexer compressor load state	2026-06-02 05:17:00 +00:00
biondizzle	fb0ed87626	probe: add indexer compressor early-return and buffering diagnostics	2026-06-02 05:06:18 +00:00
biondizzle	06c92f208f	INDEXER PROBE: instrumentation prints for compressed key width investigation	2026-06-02 04:44:47 +00:00
biondizzle	f0dec9f6bd	profile: fine-grained attention component timing	2026-06-02 03:08:34 +00:00
biondizzle	7114c48575	fix: parenthesize profile_detail condition	2026-06-02 02:56:13 +00:00
biondizzle	4734e894c7	profile: add per-layer attn vs ffn timing with CUDA sync	2026-06-02 02:46:35 +00:00
biondizzle	4017ef2f16	fix: accurate profile sync + remove paris_tids 129K iteration	2026-06-01 23:55:26 +00:00
biondizzle	73ae9393da	FIX: RoPE cache 8192→65536 (original_max_position_embeddings), KVCache max_comp 32768→65536	2026-06-01 23:18:37 +00:00
biondizzle	36f9782bad	Add thinking/Paris token logit check on step 0 for quality debugging	2026-06-01 23:14:24 +00:00
biondizzle	ef7e0d63bb	Add --warmup-gsa flag: fix attention/router gsa after first decode step to eliminate amax kernel launches	2026-06-01 23:04:44 +00:00
biondizzle	008e59eb90	Add --profile flag: per-component GPU timing with CUDA sync (embed+layers, lm_head, sampling)	2026-06-01 23:03:46 +00:00
biondizzle	e53645654d	Reduce hot-path .item() syncs: gate li>=58 diagnostics behind VERBOSE>=2, topk on float	2026-06-01 22:33:03 +00:00
biondizzle	6f4bbc997a	Add sync after sampler for step<3 to catch async CUDA errors early	2026-06-01 22:32:40 +00:00
biondizzle	5493a8727e	P7: compressor early return + decode buffering (skip GEMMs when n_complete=0); sampler SMEM fix (LK=24 fits 48KB default); topk on float not bf16	2026-06-01 22:29:56 +00:00
biondizzle	583ad6cfe6	P0 complete: Kill .item() in grouped_linear, reduce hot-path syncs - grouped_linear.py: Replace .item() gsa + Python quantize with quantize_nvfp4_gpu_fused (zero CPU syncs). Flatten all groups into (G*T, D), single fused kernel launch, GPU-only gsa copy. - single_shot_inference.py: Reduce torch.cuda.synchronize() to every 20 steps instead of every step. Gate per-layer diagnostics to li<3 or li>=58 (avoid 61 .item() calls per decode step).	2026-06-01 22:21:12 +00:00
biondizzle	8767c263ab	Add cuda.synchronize + better logits validation after lm_head Catch CUDA errors at the source instead of seeing them surfaced at torch.topk. Print logits stats every step.	2026-06-01 22:06:41 +00:00
biondizzle	2a6f9a10b1	lm_head: fall back to BF16 F.linear for stability NVFP4 quantize_from_buffer produces CUDA error on large-magnitude inputs (\|X\|>500 at L60 output). BF16 lm_head is correct and only runs once per decode step — not a bottleneck. TODO: debug the NVFP4 path for large activations and re-enable.	2026-06-01 22:05:22 +00:00
biondizzle	9bad30c777	Add logits validation debug before topk sampling	2026-06-01 21:59:23 +00:00
biondizzle	e3412cf913	P5: In-place RoPE — no x.clone(), no empty_like allocation Eliminates 183 kernel launches per decoded token from pointless memcpy. Operates on rope dims in-place via views instead of cloning the full tensor and allocating an empty_like buffer.	2026-06-01 21:18:41 +00:00
biondizzle	230d28e562	Fix KVCache constructor call — device as keyword arg, not positional KVCache signature has max_comp before device, so positional pass of dev was hitting max_comp parameter instead of device.	2026-06-01 21:11:01 +00:00
biondizzle	c8faf20a99	P0 COMPLETE: Eliminate ALL .item() CPU-GPU syncs from NVFP4 activation path Fused kernels (zero CPU sync, single kernel launch per projection): - fused_amax_quantize.cu: amax→gsa→quantize in one pass. Replaces two-step compute_amax_gsa_gpu + quantize_nvfp4_gpu (had .item() sync). - fused_deinterleave_amax_quantize.cu: Same for MoE fused_swiglu L2 path. Deinterleave + amax + quantize in one pass. Replaces compute_amax_gsa_gpu + deinterleave_quantize_nvfp4_cuda (had .item() sync). All kernel loaders use dsv4/kernels/cuda/loader.py (compile-once cache). Was JIT-compiling on every call via torch.utils.cpp_extension.load (~100ms/call, ~500 calls/token). Now compiles once and reuses the cached module. Updated layers: - linear.py Nvfp4Linear._run_impl: fused kernel, gsa via GPU buffer - moe.py Nvfp4MoE._run_impl: fused for L1 and L2 (both fused_swiglu and non-fused paths) - shared_expert.py: fused for L1 and L2 - quantize.py: All functions use module loader cache - sampler.py: Uses module loader cache - indexer/score_topk.py: Uses module loader cache P2: Vectorized KVCache.append_swa — index_copy_ instead of Python loop. 2 kernel launches instead of 2T. No .item() in comp_pos either. P3: Pre-allocated comp_kv buffers — O(1) append instead of O(N) torch.cat. max_comp=32768 per layer (32MB). No more quadratic memory growth. ~486 .item() syncs per decoded token → ~0 (only argmax + token decode remain).	2026-06-01 21:05:03 +00:00
biondizzle	360f76b970	Performance audit fixes: eliminate CPU-GPU syncs PERFORMANCE_AUDIT.md validation results: 1. Nvfp4Linear .item() sync (610/step) → FIXED: compute_amax_gsa_gpu kernel 2. MoE .item() sync (183/step) → FIXED: same kernel 3. SharedExpert .item() sync (122/step) → FIXED: same kernel 4. FMHA V clone → FIXED: V=K, transpose creates copy implicitly 5. torch.cuda.synchronize in moe_forward → FIXED: conditional on VERBOSE 6. RoPE 8x duplication → INVALIDATED: necessary for per-GPU HBM access 7. mHC BF16 bmm → INVALIDATED: 28K FLOPs, not a bottleneck 8. Router .float() cast → INVALIDATED: needed for FP32 topk, ~1μs New files: - dsv4/kernels/cuda/amax_gsa.cu: GPU-only amax→gsa kernel - dsv4/ops/quantize.py: compute_amax_gsa_gpu() wrapper Net effect: ~915 fewer CPU-GPU syncs per decode step Remaining syncs: ~10 per layer (quantize kernel parameter) + diagnostics	2026-06-01 20:40:19 +00:00
biondizzle	4f698baa5d	Production fused CUDA sampler + decode loop optimizations - Add dsv4/kernels/cuda/sampler.cu: fused temperature + repetition penalty + top-k + top-p (nucleus) sampling, single kernel launch, zero CPU syncs - Add dsv4/model/sampler.py: CUDASampler wrapper + PyTorch reference - Update single_shot_inference.py: - Use CUDASampler for non-greedy decoding (temperature=0.6, top_k=50, top_p=0.95) - Pre-allocate decode buffers (no per-step torch.tensor allocation) - Track thinking tokens (128821/128822) — not garbage for reasoning model - Reduce diagnostic CPU syncs (top-5 every 5 steps, NaN check every 20) - Add --top-k and --top-p CLI args - Default: temperature=0.6 (was 0.0 greedy), rep_penalty=1.1 (was 1.2)	2026-06-01 20:29:57 +00:00
biondizzle	2830a3ee7c	Fix lm_head NVFP4: transpose weight and scales to match Nvfp4Linear checkpoint layout quantize_weight_to_nvfp4 returns (K_packed, N) but Nvfp4Linear expects (N, K_packed) from the checkpoint format. Transpose both fp4 and sf.	2026-06-01 19:51:21 +00:00
biondizzle	16b72b9581	PERF: Eliminate double quantization for o_a_proj + NVFP4 lm_head 1. o_a_proj (Nvfp4GroupedLinear): Added load_nvfp4_weight() method that loads checkpoint NVFP4 weights directly — no more dequant→BF16→requant. Each group's weight is transposed from (N, K_packed) checkpoint layout to (K_packed, N) layout expected by the grouped GEMM. 2. lm_head: Quantize BF16 weight to NVFP4 at load time, use production Nvfp4Linear GEMM instead of F.linear. Runtime gsa for activation. Frees the 1.8GB BF16 weight after quantization. 3. Hash router (L0-2): Already optimal — tid2eid is an int32 lookup, no GEMM to accelerate.	2026-06-01 19:41:21 +00:00

1 2 3 4 5

212 Commits