nvfp4-megamoe-kernel

Author	SHA1	Message	Date
biondizzle	f259d63930	CRITICAL FIX: SE swizzled buffers were allocated then overwritten with None — graph capture would fall through to broken Python path	2026-06-06 07:01:52 +00:00
biondizzle	ae26f6b83c	Fix dense router BF16 dispatch: use torch.matmul instead of F.linear - F.linear(x, W) computes x @ W.T which caused shape mismatch when W_gate was pre-transposed to [E, H] - Use torch.matmul(x, W_gate) instead — computes x @ W directly, no transpose needed, no FP32 conversion, fully graph-capturable - W_gate stays as [H, E] (original checkpoint shape)	2026-06-04 05:58:24 +00:00
biondizzle	e46b615873	Fix dense router BF16 dispatch for CUDA graph capture - Run GEMM in BF16 (not FP32) during graph capture — Blackwell tensor cores handle BF16 natively; FP32 GEMM triggers cudaErrorStreamCaptureUnsupported - Pre-transpose W_gate to [E, H] at load time — avoids .T view during capture - Convert only logits output to FP32 for sqrt(softplus) numerical stability - This fixes the graph capture failure at layer 0 Graph B	2026-06-04 05:50:13 +00:00
biondizzle	119e6d471e	Add safety check for swizzled buffers: fall through to Python path if None	2026-06-04 04:32:00 +00:00
biondizzle	5487a58df4	Fix NameError: add rows/cols variables to MoE swizzle	2026-06-04 03:14:27 +00:00
biondizzle	a434545d12	Blackwell swizzle CUDA kernel for CUDA graph capture Python view operations (reshape, transpose, permute) are not graph-capturable — they cause cudaErrorStreamCaptureUnsupported. Added: - dsv4/kernels/cuda/blackwell_swizzle.cu: custom CUDA kernel for 32_4_4 swizzle - to_blocked(): detects graph capture, uses CUDA kernel instead of Python views - MoE _assemble_scales_cudagraph_safe: same treatment - Shared expert _assemble_scales_single_group: same treatment - Linear _assemble_scales_single_group: same treatment - Pre-allocated swizzled output buffers for all layers (avoids torch.empty_like) The CUDA kernel writes to a pre-allocated buffer — no per-step allocations. Eager path unchanged (still uses fast Python view operations).	2026-06-04 03:03:02 +00:00
biondizzle	e7766254b7	Pre-allocate ALL GEMM output buffers for CUDA graph capture Every run_nvfp4_grouped_gemm call must pass out= with a pre-allocated buffer. During CUDA graph capture, torch.zeros() allocations are forbidden — they cause 'cudaErrorStreamCaptureUnsupported' errors. Added: - shared_expert: _l2_out_buf for L2 GEMM - shared_expert: pass out= for both L1 and L2 GEMM calls - moe: _l2_out_buf for L2 GEMM - moe: pass out= for unfused L1 GEMM (fused L1 already had it) - moe: pass out= for L2 GEMM - linear: _gemm_out_buf for all GEMM calls - linear: pass out= for both run() and run_from_quantized() paths grouped_linear already had _output_buf_padded — no changes needed.	2026-06-04 02:41:59 +00:00
biondizzle	676a0448c0	CRITICAL FIX: _l1_out_buf was 2x too narrow — caused GPU memory corruption The L1 GEMM produces gate+up combined output with 2*intermediate_size BF16 columns, but _l1_out_buf was only allocated with intermediate_size columns. The GEMM wrote past the buffer boundary, corrupting GPU memory and causing cudaErrorInvalidValue on subsequent operations. This was the root cause of ALL the cudaErrorInvalidValue errors in the shared expert and MoE L2 paths — the corrupted memory from the L1 buffer overflow propagated downstream. Fix: _l1_out_buf shape (max_rows, 2*intermediate_size) instead of (max_rows, intermediate_size). Applied to both shared_expert.py and moe.py. Also removed all DEBUG sync/print statements from quantize.py and shared_expert.py — the bug was not in the quantize kernels, it was the buffer overflow.	2026-06-04 02:06:18 +00:00
biondizzle	0890e578f4	DEBUG: print l1_out shape before gate/up split	2026-06-04 01:49:12 +00:00
biondizzle	8546ed725f	DEBUG: check SE input magnitude	2026-06-04 01:38:24 +00:00
biondizzle	26ecf96328	DEBUG: check intermediate magnitude before SE L2	2026-06-04 01:30:29 +00:00
biondizzle	5303d6a82f	DEBUG: test copy_ with contiguous slice vs scalar assign for gsa	2026-06-04 01:27:25 +00:00
biondizzle	ccbc713658	DEBUG: check gsa values and pinpoint exact failing operation	2026-06-04 01:16:37 +00:00
biondizzle	e77455c3ba	DEBUG: add sync inside quantize_nvfp4_gpu_fused to catch async errors	2026-06-04 01:05:47 +00:00
biondizzle	55def5eef9	Restore A/B split + gsa scalar fix (error is pre-existing, not regression)	2026-06-04 01:03:36 +00:00
biondizzle	59eccd04ab	REVERT: test if cudaErrorInvalidValue is pre-existing or regression	2026-06-04 00:53:09 +00:00
biondizzle	5e3ced0b60	DEBUG: isolate which kernel causes cudaErrorInvalidValue in SE L2 path	2026-06-04 00:41:28 +00:00
biondizzle	b314fde9b7	Fix gsa copy_ cudaErrorInvalidValue: replace view-based copy_ with scalar assignment The view-based copy_ pattern causes cudaErrorInvalidValue when gsa_gpu is a non-contiguous expanded view (e.g., shape (9,) from quantize_nvfp4_gpu_fused during prefill with M>1). Root cause: copy_() from an expanded/reshaped view can fail when the source tensor has non-standard strides. The expand() operation creates a view with stride-0 dimensions that copy_() may not handle correctly on all CUDA versions. Fix: Replace all gsa copy_ patterns with scalar assignment: self._gsa_buf[0] = gsa_gpu[0] # scalar GPU→GPU, graph-capturable This is simpler, avoids view issues, and is CUDA-graph-compatible. Applied to: shared_expert.py, moe.py, linear.py, grouped_linear.py	2026-06-04 00:30:21 +00:00
biondizzle	993bb345d1	DEBUG: fix VERBOSE reference in shared_expert, always print L2 gsa debug	2026-06-04 00:15:38 +00:00
biondizzle	f0f87df906	DEBUG: add sync + shape prints to shared_expert L2 gsa copy	2026-06-04 00:05:08 +00:00
biondizzle	a468f72a0e	CUDA graph: Pre-allocate L1 GEMM output buffers in MoE and SharedExpert Pass out= parameter to run_fused_swiglu_grouped_gemm to avoid per-step torch.zeros() allocation during CUDA graph capture.	2026-06-03 23:17:43 +00:00
biondizzle	f57de06eb5	Fix grouped_linear GEMM output buffer shape and extraction - _output_buf_padded: (max_tokens * n_groups, o_lora_rank) — matches GEMM output - Extraction: groups are stacked vertically, not horizontally - Each group's output is (padded_rows, o_lora_rank) with o_lora_rank columns	2026-06-03 22:26:40 +00:00
biondizzle	b32713c302	grouped_linear: Pre-allocate output buffer for grouped GEMM (CUDA graph capture) Add _output_buf_padded for the flat GEMM output, pass as out= parameter to run_nvfp4_grouped_gemm to avoid per-step torch.zeros() allocation.	2026-06-03 22:02:01 +00:00
biondizzle	518a1d3f95	CUDA graph: Fix MoE scatter_add_ index dtype + fix second bincount 1. scatter_add_ requires int64 indices — ensure sorted_ids is .long() 2. Fixed the SECOND torch.bincount call (line 590) — same scatter_add_ pattern 3. Both code paths now use pre-allocated _tokens_per_expert_buf	2026-06-03 17:53:40 +00:00
biondizzle	f13a81d48b	CUDA graph: Fix per-call allocations in grouped_linear and quantize 1. grouped_linear.py: Pre-allocate _scale_a_buf for swizzle - Same fix as linear.py — avoids torch.zeros per call - Uses correctly-sized view for pad_and_swizzle_single 2. quantize.py: Replace torch.zeros_like with scalar 0.0 - torch.zeros_like allocates a full tensor every call - torch.where(cond, 0.0, x) broadcasts scalar — no allocation	2026-06-03 17:39:20 +00:00
biondizzle	84655d066a	CUDA graph: Fix MoE bincount and per-call allocations (Hazard #4) 1. Replace torch.bincount with scatter_add_ into pre-allocated buffer - bincount produces data-dependent shapes → breaks graph capture - scatter_add_ with pre-allocated _tokens_per_expert_buf (fixed shape) - Pre-allocated _ones_buf to avoid per-call torch.ones() 2. Replace torch.full for l1_gsa with pre-allocated buffer + fill_ - torch.full allocates every call → breaks graph capture - Use self._l1_gsa_buf.fill_(l1_gs) instead	2026-06-03 17:37:03 +00:00
biondizzle	df05289d6f	CUDA graph: Fix remaining sync violations from B200 detector run 2 1. grouped_linear.py: Remove conditional host read of GPU tensor - 'if group_offsets[0] != 0' reads GPU value on host → sync - Fix: unconditionally update offsets every call (GPU-only multiply) 2. test_cuda_graph_readiness.py: Use pinned CPU buffers for token transfer - dec_tid_buf[0] = python_int → CPU→GPU sync - Fix: write to pinned CPU buffer, then copy_ (async, graph-capturable) 3. Add dsv4/decode/cuda_graph_decoder.py (skeleton)	2026-06-03 17:20:34 +00:00
biondizzle	e07d79868f	CUDA graph: Fix _assemble_scales_single_group swizzle size The pre-allocated buffer is max-sized, but pad_and_swizzle_single operates on the full buffer dimensions. Fix: pass a correctly-sized view (buf[:padded_rows, :padded_cols]) so the swizzle produces the right output size. Same fix applied to both linear.py and shared_expert.py.	2026-06-03 17:02:34 +00:00
biondizzle	0ca7bed0e1	CUDA graph: Fix sync violations found by B200 detector Fixes from running Section A detector on B200: 1. single_shot_inference.py: Use pinned CPU buffers for token/position transfer - dec_tid_buf[0] = python_int causes CPU→GPU sync - Fixed: write to pinned CPU buffer, then copy_ (async, graph-capturable) 2. grouped_linear.py: Fix expert_offsets Python loop - expert_offsets[g] = python_int * padded_rows → CPU→GPU sync per iteration - Fixed: element-wise multiply with pre-allocated range tensor (GPU-only) 3. grouped_linear.py: Vectorized output extraction for T=1 decode - Python loop z[:, g, :] = out[...] → CPU sync for each slice - Fixed: GPU gather with pre-computed indices for T=1 4. grouped_linear.py: Pre-allocate output buffer - torch.empty() per call → allocation inside graph - Fixed: use self._output_buf (pre-allocated at max size) 5. grouped_linear.py: Pre-allocate expert_offsets_range_buf - torch.arange() per call → allocation inside graph - Fixed: compute once at init, reuse via element-wise multiply	2026-06-03 16:52:19 +00:00
biondizzle	46a3a51832	CUDA graph: Fix per-step allocations in decode loop 1. mHCLayer.init_state: Add out_buf parameter for in-place write - Pre-allocated dec_X_buf (1, 4, 7168) on cuda:0 - Eliminates .unsqueeze().expand().clone() allocation each step 2. single_shot_inference.py: Pre-allocate dec_embed_buf - Placeholder for embedding output (graph capture will use this) 3. Note: Cross-GPU X.to() transfers still allocate per step - This requires per-GPU X buffers (part of graph capture architecture)	2026-06-03 16:38:35 +00:00
biondizzle	a9ea30353c	CUDA graph: Fix sync violations (Category 1-2) 1. mhc.py: Remove .item() from post_block (122 syncs/step eliminated) - The X_next.abs().max().item() was syncing EVERY layer's post_block - Diagnostics moved to caller (outside graph region) 2. linear.py: Pre-allocate _scale_a_buf in _ensure_buffer_size - _assemble_scales_single_group now uses pre-allocated buffer - Eliminates per-call torch.zeros() allocation (graph capture killer) 3. shared_expert.py: Same fix — use pre-allocated padded_x_sf_buf - _assemble_scales_single_group no longer allocates 4. quantize.py: Remove .contiguous() from gsa expand - expand() creates stride-0 view, CUDA kernel reads correctly - No allocation on the hot path 5. Add CUDA_GRAPH_SYNC_INVENTORY.md with full violation catalog	2026-06-03 16:37:20 +00:00
biondizzle	5e09be08af	Fix non-contiguous tensor in quantize_nvfp4_gpu_fused (T>1 prefill) The intermediate tensor from fused SwiGLU deinterleave is a column slice (non-contiguous). When T>1, quantize_nvfp4_gpu_fused receives this and the CUDA kernel crashes with 'input must be contiguous'. Fix: add is_contiguous() check + .contiguous() in quantize_nvfp4_gpu_fused and in SharedExpert._run_l2. This is the root cause, not a workaround — CUDA kernels legitimately require contiguous memory.	2026-06-03 07:56:19 +00:00
biondizzle	0b6ca0df80	P5 integration + B3 q_a_norm fused + gsa scalar fix P5: Wire up fused mHC pre_block + RMSNorm + NVFP4 quantize kernel - Replaces: pre_block bmm + rmsnorm (4+ launches) + quantize (2 launches) - With: 2 kernel launches (mhc_rmsnorm_amax_gsa + mhc_rmsnorm_quantize_nvfp4) - Both attn and ffn mHC paths now use P5 fused kernel - Savings: ~5 launches/site × 2 sites × 61 layers = 610 launches/token B3: Fused rmsnorm+quant for q_a_norm → q_b path - q_a output → rmsnorm_quantize_nvfp4 → QuantizedActivation → q_b.run_from_quantized - Eliminates BF16 round-trip between q_a_norm and q_b GEMM - Saves: ~6 kernel launches per layer (rmsnorm 4+ + quantize 2 vs fused 2) gsa scalar fix in Nvfp4Linear.run_from_quantized: - CuTeDSL NVFP4 GEMM expects global_scale_a as per-expert scalar (shape (1,)) - Per-row gsa from fused kernels must be reduced to scalar (max) for M>1 - For M=1 decode: already scalar, no reduction needed - Fixes potential correctness issue at prefill (M>1) when using fused paths Cleanup: Remove --ab-compare flag and A/B comparison code (replaced by P5)	2026-06-02 21:20:34 +00:00
biondizzle	f3b551956d	Cleanup Step 2: Archive Lineage P code, fix broken imports - Move dead dsv4/ modules to dsv4/_archive/ (52 files) - model/{dsv4,mtp,layer,layer_schedule} - layers/{embedding,attention,ffn,norm} (kept linear,mhc,router,moe,shared_expert,grouped_linear - live) - cache/, kernels/cache/, kernels/indexer/{csa_indexer,score_topk,compute_valid_lens} - kernels/router/{nvfp4_fused_router,dense_router_decode_kernel,dense_router_prefill} - ops/{topk,topk_select,rope,router}, loader/{hf_checkpoint,layout_convert} - reference/{attention,compressor,csa_attention,moe_pipeline} - kernels/compressor/{compress_tail,csa_hca} - Restore dsv4/ops/{router,custom_ops}.py (needed by live layers) - Fix dsv4/kernels/{indexer,compressor,attention}/__init__.py (removed broken imports) - Remove preload_all() from loader.py (dead, referenced nonexistent .cu file) - Fix loader.py docstring (fused_amax_quantize_nvfp4 → quantize_nvfp4_from_buffer) - Move broken tests to tests/e2e_archive/ - test_fused_router, production_values_test, e2e/{one_layer,model_construction,csa_hca} - vLLM has 0 imports of dsv4 (Step 0 confirmed)	2026-06-02 19:27:07 +00:00
biondizzle	0d1cd1e216	P4: Add QuantizedActivation + Nvfp4Linear.run_from_quantized - QuantizedActivation: carries (x_fp4, x_sf, gsa) for skip-quantize path - Nvfp4Linear.run_from_quantized(): runs GEMM with pre-quantized input - Enables fused RMSNorm+quantize to feed directly into all downstream linears (q_a, kv, o_proj, etc.) without re-quantizing	2026-06-02 16:37:38 +00:00
biondizzle	6cb5078821	Fix mHC Sinkhorn kernel: remove VLA, remove Python fallback Root cause: float row_max[n] is a VLA — not allowed in CUDA device code. Fix: use shared memory with MHC_MAX_N=16 fixed-size slots. Also: REMOVED the Python fallback in sinkhorn_knopp(). If the CUDA kernel fails, the pipeline DIES. No soft landing. This is the correct behavior — silent fallback to broken precision is worse than a loud crash. The residual growth |X|→500-700 at L60 was likely caused by the Python fallback running a DIFFERENT numerical path (BF16 accumulation in torch ops vs FP32 in the CUDA kernel). With the fixed kernel, Sinkhorn should produce properly doubly-stochastic B_l, bounding the residual.	2026-06-02 10:44:53 +00:00
biondizzle	f01d3f3eac	wip: SE fused SwiGLU deinterleave fix	2026-06-02 08:41:00 +00:00
biondizzle	1726cb64a9	fix: interleave_l1_weights granularity_bf16 (not granularity) in SE	2026-06-02 08:29:03 +00:00
biondizzle	7904cf05c4	Add set_fused_swiglu() method to Nvfp4MoE	2026-06-02 07:59:57 +00:00
biondizzle	d8e17d70c1	P0+P1+P2: Enable fused SwiGLU (MoE+SE), fix SE _run_l1_fused, remove per-call gsa fill_ P0: Enable fused SwiGLU for MoE (set_fused_swiglu(True)) - Saves 240+ unfused BF16 kernel launches per token - SiLU + clamp in kernel registers instead of separate launches P1: Fix shared expert _run_l1_fused + enable fused SwiGLU - Fixed: _l1_sf_view -> _l1_scale_b, _l1_gs_view -> _l1_gsb - Fixed: expert_offsets dtype int64 -> int32 - Added proper padded buffer + scale assembly (matching unfused path) - Added runtime gsa support (quantize_nvfp4_gpu_fused) P2: Remove per-call gsa_buf.fill_() in Nvfp4Linear - fill_() was H2D transfer every forward pass (~5µs × 244 calls = ~1.2ms/token) - _gsa_buf now initialized with _activation_global_scale (not zeros) - After warmup_gsa, buffer already has correct value — no fill needed	2026-06-02 07:57:39 +00:00
biondizzle	61d5e7ba53	revert: P2 gsa fill elimination — revert to proven path for e2e stability The fill_() is a CPU→GPU scalar write (tiny cost). The optimization was marginal and the output quality regression (CJK tokens) needs investigation separately. P2 can re-land after the regression is confirmed to be sampling-related (not gsa-related). P0/P1 (fused SwiGLU) still disabled — kernel arg-binding bug unfixed.	2026-06-02 07:32:10 +00:00
biondizzle	790f8c350a	perf: P2 landed (gsa fill elimination). P0/P1 fused SwiGLU disabled — CuTeDSL kernel arg-binding bug. P0/P1: The fused SwiGLU kernel's warmup_fused_swiglu_compilation() triggers 'TypeError: too many positional arguments' during cute.compile(). The kernel signature doesn't match the positional args being passed. This is a kernel-side fix, not a single_shot fix. Disabled until the fused kernel is debugged. P2: Landed — Nvfp4Linear skips redundant _gsa_buf.fill_() after warmup. SE fused SwiGLU infrastructure (set_fused_swiglu, _run_l1_fused, interleaved weight path) is wired but disabled. Will activate once kernel fix lands.	2026-06-02 07:16:08 +00:00
biondizzle	040b2eb6e7	perf: P0/P1/P2 — fused SwiGLU for MoE+SE, eliminate per-call gsa fill P0: Enable fused SwiGLU for all MoE instances (moe._fused_swiglu = True). Eliminates ~8 BF16 kernel launches per MoE per token (gate/up split, SiLU, clamp, elementwise multiply → single fused kernel launch). P1: Enable fused SwiGLU for shared expert (SE): - Added set_fused_swiglu() method to Nvfp4SharedExpert - Added _run_l1_fused() using run_fused_swiglu_grouped_gemm (1-group) - Interleave L1 weights at finalize time for fused kernel compatibility - Fused kernel handles SwiGLU + clamp in registers, outputs BF16 P2: Eliminate per-call _gsa_buf.fill_() in Nvfp4Linear: - _activation_global_scale is set once at warmup, never changes after - Skip redundant fill_() via _gsa_buf_initialized flag - Saves 244 CPU→GPU scalar fills per token (4 linears × 61 layers) P3: Deferred (in-kernel RoPE fusion — kernel-side change, not single_shot)	2026-06-02 06:59:25 +00:00
biondizzle	7e3fb5f4d0	fix: add missing import for quantize_nvfp4_gpu in linear.py fixed-gsa path	2026-06-02 04:28:29 +00:00
biondizzle	668a42e71a	debug: print mhc_sinkhorn CUDA kernel compile errors	2026-06-02 04:02:34 +00:00
biondizzle	7b82d31330	perf: fused mHC Sinkhorn CUDA kernel (1 launch vs 38)	2026-06-02 03:50:57 +00:00
biondizzle	583ad6cfe6	P0 complete: Kill .item() in grouped_linear, reduce hot-path syncs - grouped_linear.py: Replace .item() gsa + Python quantize with quantize_nvfp4_gpu_fused (zero CPU syncs). Flatten all groups into (G*T, D), single fused kernel launch, GPU-only gsa copy. - single_shot_inference.py: Reduce torch.cuda.synchronize() to every 20 steps instead of every step. Gate per-layer diagnostics to li<3 or li>=58 (avoid 61 .item() calls per decode step).	2026-06-01 22:21:12 +00:00
biondizzle	c8faf20a99	P0 COMPLETE: Eliminate ALL .item() CPU-GPU syncs from NVFP4 activation path Fused kernels (zero CPU sync, single kernel launch per projection): - fused_amax_quantize.cu: amax→gsa→quantize in one pass. Replaces two-step compute_amax_gsa_gpu + quantize_nvfp4_gpu (had .item() sync). - fused_deinterleave_amax_quantize.cu: Same for MoE fused_swiglu L2 path. Deinterleave + amax + quantize in one pass. Replaces compute_amax_gsa_gpu + deinterleave_quantize_nvfp4_cuda (had .item() sync). All kernel loaders use dsv4/kernels/cuda/loader.py (compile-once cache). Was JIT-compiling on every call via torch.utils.cpp_extension.load (~100ms/call, ~500 calls/token). Now compiles once and reuses the cached module. Updated layers: - linear.py Nvfp4Linear._run_impl: fused kernel, gsa via GPU buffer - moe.py Nvfp4MoE._run_impl: fused for L1 and L2 (both fused_swiglu and non-fused paths) - shared_expert.py: fused for L1 and L2 - quantize.py: All functions use module loader cache - sampler.py: Uses module loader cache - indexer/score_topk.py: Uses module loader cache P2: Vectorized KVCache.append_swa — index_copy_ instead of Python loop. 2 kernel launches instead of 2T. No .item() in comp_pos either. P3: Pre-allocated comp_kv buffers — O(1) append instead of O(N) torch.cat. max_comp=32768 per layer (32MB). No more quadratic memory growth. ~486 .item() syncs per decoded token → ~0 (only argmax + token decode remain).	2026-06-01 21:05:03 +00:00
biondizzle	360f76b970	Performance audit fixes: eliminate CPU-GPU syncs PERFORMANCE_AUDIT.md validation results: 1. Nvfp4Linear .item() sync (610/step) → FIXED: compute_amax_gsa_gpu kernel 2. MoE .item() sync (183/step) → FIXED: same kernel 3. SharedExpert .item() sync (122/step) → FIXED: same kernel 4. FMHA V clone → FIXED: V=K, transpose creates copy implicitly 5. torch.cuda.synchronize in moe_forward → FIXED: conditional on VERBOSE 6. RoPE 8x duplication → INVALIDATED: necessary for per-GPU HBM access 7. mHC BF16 bmm → INVALIDATED: 28K FLOPs, not a bottleneck 8. Router .float() cast → INVALIDATED: needed for FP32 topk, ~1μs New files: - dsv4/kernels/cuda/amax_gsa.cu: GPU-only amax→gsa kernel - dsv4/ops/quantize.py: compute_amax_gsa_gpu() wrapper Net effect: ~915 fewer CPU-GPU syncs per decode step Remaining syncs: ~10 per layer (quantize kernel parameter) + diagnostics	2026-06-01 20:40:19 +00:00
biondizzle	16b72b9581	PERF: Eliminate double quantization for o_a_proj + NVFP4 lm_head 1. o_a_proj (Nvfp4GroupedLinear): Added load_nvfp4_weight() method that loads checkpoint NVFP4 weights directly — no more dequant→BF16→requant. Each group's weight is transposed from (N, K_packed) checkpoint layout to (K_packed, N) layout expected by the grouped GEMM. 2. lm_head: Quantize BF16 weight to NVFP4 at load time, use production Nvfp4Linear GEMM instead of F.linear. Runtime gsa for activation. Frees the 1.8GB BF16 weight after quantization. 3. Hash router (L0-2): Already optimal — tid2eid is an int32 lookup, no GEMM to accelerate.	2026-06-01 19:41:21 +00:00
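
Note on commit 84655d066a (bincount replaced by scatter_add_ into a pre-allocated buffer): the shape issue it describes can be illustrated with a small, framework-free sketch. This is not the repository's code. The names `bincount_like` and `TokensPerExpertCounter` are hypothetical, plain Python lists stand in for GPU tensors, and the only claim illustrated is that a bincount without a minimum length yields an output whose size depends on the largest id seen, while counting into a buffer allocated once keeps the shape fixed.

```python
# Hypothetical illustration only; names and structure are assumptions,
# not taken from the nvfp4-megamoe-kernel repository.

def bincount_like(expert_ids):
    """Count occurrences; output length is max(id) + 1, so it depends on the data."""
    size = max(expert_ids) + 1 if expert_ids else 0
    counts = [0] * size  # fresh allocation on every call
    for e in expert_ids:
        counts[e] += 1
    return counts


class TokensPerExpertCounter:
    """Count into a buffer allocated once, sized by an assumed num_experts."""

    def __init__(self, num_experts):
        self.buf = [0] * num_experts  # allocated once, reused on every step

    def count(self, expert_ids):
        # Reset in place, then accumulate (the scatter-add analogue).
        for i in range(len(self.buf)):
            self.buf[i] = 0
        for e in expert_ids:
            self.buf[e] += 1
        return self.buf


counter = TokensPerExpertCounter(num_experts=8)
step_a = [1, 1, 5]  # routed expert ids for one step
step_b = [0, 2]     # a different step with a smaller maximum id

# Data-dependent lengths: 6 for step_a, 3 for step_b.
lengths = (len(bincount_like(step_a)), len(bincount_like(step_b)))

# Fixed length 8 for both steps, same underlying buffer object.
fixed_a = list(counter.count(step_a))
fixed_b = list(counter.count(step_b))
```

In this sketch the fixed-size path returns the same buffer object on every call, which mirrors the "no per-step allocations" requirement the commit messages give for CUDA graph capture.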

69 Commits