nvfp4-megamoe-kernel

Author	SHA1	Message	Date
biondizzle	6650f06121	CRITICAL FIX: Use explicit per-device streams for CUDA graph capture/replay on multi-GPU — fixes zero-output bug	2026-06-06 08:18:18 +00:00
biondizzle	86275851d4	Add minimal CUDA graph test per GPU during capture to isolate multi-GPU graph issue	2026-06-06 08:02:35 +00:00
biondizzle	2cbf7a43e9	Add sync after cross-GPU copy before graph replay; remove misleading zero-input verification	2026-06-06 07:51:22 +00:00
biondizzle	2bb52c7cae	Add per-layer graph capture verification — replay immediately and check for zeros	2026-06-06 07:40:19 +00:00
biondizzle	5a98cc6d90	Store pre-cached norm weights on self to prevent GC during graph replay — root cause of all-zeros replay bug	2026-06-06 07:29:33 +00:00
biondizzle	dcb2495a5b	Add graph replay debug prints for first 3 steps/layers	2026-06-06 07:19:07 +00:00
biondizzle	16b9a4def2	Fix CUDA graph replay: set device to cuda:0 before lm_head graph replay	2026-06-06 07:18:49 +00:00
biondizzle	32902d1036	CUDA graph capture: derive q_a_dim from config, pre-cache norm weights, add buffer verification, use direct dict access for routers/moe/se	2026-06-06 07:01:12 +00:00
biondizzle	64f547058e	Fix graph replay: pass q_a from Graph A output to forward_attention - q_a is needed by the indexer in CSA layers - When q_heads/kv_3d are provided (graph replay), the projection code is skipped so q_a is never computed - Fix: add q_a_bufs to CUDAGraphDecoder, write q_a during Graph A capture, pass q_a as kwarg to forward_attention during graph replay - Also: forward_attention now accepts q_a kwarg (default None)	2026-06-04 08:09:30 +00:00
biondizzle	26da6d33af	Fix graph replay: remove extra token_id arg from forward_attention call The forward_attention() signature has no token_id parameter, but the graph replay path was passing dec_tid32_per_gpu[gpu] between positions and compressor — causing the int tensor to be interpreted as compressor and triggering AttributeError: 'int' object has no attribute 'ratio'	2026-06-04 06:10:02 +00:00
biondizzle	55def5eef9	Restore A/B split + gsa scalar fix (error is pre-existing, not regression)	2026-06-04 01:03:36 +00:00
biondizzle	59eccd04ab	REVERT: test if cudaErrorInvalidValue is pre-existing or regression	2026-06-04 00:53:09 +00:00
biondizzle	1d6610c46d	CUDA graph A/B split: eager-break-at-attention architecture CUDAGraphDecoder now splits each layer into two graph-captured regions with eager attention in between: Graph A (pre-attention): mHC pre_block + fused RMSNorm + quantize + q_a/q_b/kv projections → writes intermediates to pre-allocated buffers Eager (attention): Compressor → Indexer → FMHA → o_proj → dynamic shapes, data-dependent control flow Graph B (post-attention): mHC post_block + FFN + Router + MoE + SE → writes X_next to pre-allocated output buffer The attention path has dynamic shapes (FMHA seq_len grows, compressor returns None) and cannot be captured. The compute path has fixed shapes for T=1 decode and CAN be captured. Changes: - CUDAGraphDecoder: 2 graphs per layer (A/B) + lm_head graph - Pre-allocated intermediate buffers for graph A → eager → graph B boundary - forward_attention: accepts optional q_heads/kv_3d to skip projections - Replay loop: graph A → eager attention → graph B per layer This replaces the single-graph-per-layer approach which failed at L1+ because the attention path contains data-dependent control flow and dynamic shapes that cannot be captured.	2026-06-03 23:53:08 +00:00
biondizzle	56b816a54f	CUDA graph: Use per-GPU position/token buffers for graph capture Cross-GPU .to() calls inside graph capture cause 'dependency on uncaptured work in another stream'. Fix: pass dec_pos_per_gpu/dec_tid32_per_gpu to capture() so each layer's graph uses buffers on its own GPU.	2026-06-03 22:56:20 +00:00
biondizzle	92225b07e7	CUDA graph: Simplify to single-graph-per-layer capture (revert A/B split) The A/B split approach was too complex: it required splitting forward_layer, handling the eager FMHA section, and fixing per-GPU buffer issues. The simpler approach captures the entire forward_layer as one graph per layer, just like the detector test did for L0. This works because: - FMHA pads KV to 128 → fixed shape for graph capture - Compressor returns None on non-boundary steps → graph captures the path taken during warmup (typically the None path for HCA r=128) - All sync violations were already fixed in previous commits The capture still uses dec_pos_buf/dec_tid32_buf on cuda:0 (forward_layer handles device transfer internally).	2026-06-03 22:04:18 +00:00
biondizzle	2661cebe9a	Fix warmup_gsa: handle multi-element _gsa_buf (Nvfp4GroupedLinear per-group gsa)	2026-06-03 19:49:54 +00:00
biondizzle	486f74d900	CUDA graph: Implement eager-break-at-attention decoder with sub-graph A/B split Architecture: - Sub-graph A (per layer): mHC pre + fused rmsnorm/quantize + Q/KV projections + RoPE - Eager section: KV append + Compressor + Indexer + KV gather + FMHA + Inverse RoPE - Sub-graph B (per layer): o_proj + mHC post(attn) + mHC pre(FFN) + fused rmsnorm/quantize + Router + MoE + SE + mHC post(FFN) - lm_head graph on cuda:0 Key features: - Per-GPU token/position buffers (avoids cross-device .to() inside graphs) - Pre-allocated I/O buffers with fixed addresses for graph capture - Uses fused P5 rmsnorm+quantize path inside graphs (production path) - Captures after step 0 warmup (after CuTeDSL compile + gsa fix) - Eager path unchanged for warmup and --no-cuda-graph runs - eager_attention() extracted from forward_attention() for graph replay path Wires --cuda-graph flag into main() decode loop.	2026-06-03 19:24:26 +00:00
biondizzle	0ca7bed0e1	CUDA graph: Fix sync violations found by B200 detector Fixes from running Section A detector on B200: 1. single_shot_inference.py: Use pinned CPU buffers for token/position transfer - dec_tid_buf[0] = python_int causes CPU→GPU sync - Fixed: write to pinned CPU buffer, then copy_ (async, graph-capturable) 2. grouped_linear.py: Fix expert_offsets Python loop - expert_offsets[g] = python_int * padded_rows → CPU→GPU sync per iteration - Fixed: element-wise multiply with pre-allocated range tensor (GPU-only) 3. grouped_linear.py: Vectorized output extraction for T=1 decode - Python loop z[:, g, :] = out[...] → CPU sync for each slice - Fixed: GPU gather with pre-computed indices for T=1 4. grouped_linear.py: Pre-allocate output buffer - torch.empty() per call → allocation inside graph - Fixed: use self._output_buf (pre-allocated at max size) 5. grouped_linear.py: Pre-allocate expert_offsets_range_buf - torch.arange() per call → allocation inside graph - Fixed: compute once at init, reuse via element-wise multiply	2026-06-03 16:52:19 +00:00
biondizzle	46a3a51832	CUDA graph: Fix per-step allocations in decode loop 1. mHCLayer.init_state: Add out_buf parameter for in-place write - Pre-allocated dec_X_buf (1, 4, 7168) on cuda:0 - Eliminates .unsqueeze().expand().clone() allocation each step 2. single_shot_inference.py: Pre-allocate dec_embed_buf - Placeholder for embedding output (graph capture will use this) 3. Note: Cross-GPU X.to() transfers still allocate per step - This requires per-GPU X buffers (part of graph capture architecture)	2026-06-03 16:38:35 +00:00
biondizzle	f577ed97f4	Fix: Use PyTorch dequant_nvfp4 for weight dequantization (compressor/indexer/router gate) The CUDA dequantize_nvfp4 (dsv4/ops/quantize.py) was designed for activations/KV and assumes row-major (M, N/16) scale layout. Using it for weight dequantization caused async illegal memory access because weight scales don't match the kernel's expected layout. The kernel only validates row count, not width or contiguity. All 4 call sites now use the PyTorch dequant_nvfp4 (defined in single_shot_inference.py) which handles weight_scale_2 and input_scale correctly and cannot cause OOB access: - Compressor.load: kv_proj, gate_proj - Indexer.load: weights_proj - Router gate dequantization in main()	2026-06-03 14:57:40 +00:00
biondizzle	1121cd7b47	Add CUDA_LAUNCH_BLOCKING=1 to catch async errors	2026-06-03 14:48:51 +00:00
biondizzle	f3bb0ca08c	Fix dequant gsa: use ws2 only, NOT input_scale * ws2 For weight dequantization, gsa should be weight_scale_2 only. input_scale is the activation global scale — it belongs on the GEMM's activation side, not the weight side. Using input_scale * ws2 gave gsa = 6e-8 (essentially zero), making dequantized weights ~0. The GEMM formula is y = (x * scale_a * gsa) @ (w * scale_b * gsb) where gsb = input_scale * ws2. But dequantize_nvfp4 is just the weight half: w_bf16 = lut[w] * block_scale * ws2.	2026-06-03 14:38:24 +00:00
biondizzle	470e65fb19	Fix dequant gsb: input_scale * ws2, not 1.0 * ws2 The NVFP4 dequantize formula is w = lut[w_packed] * scale * ws2, and in the GEMM the global_scale_b = input_scale * ws2. Was incorrectly using gsb = 1.0 * ws2 (missing input_scale). This would produce wrongly-scaled BF16 weights from dequantize_nvfp4.	2026-06-03 14:26:59 +00:00
biondizzle	2dd16d5789	Switch compressor + indexer weights_proj to BF16 F.linear Only the CSA indexer QK path (q_b_proj) is explicitly FP4-QATed. The rest of the compressor/indexer projections are NOT, so use BF16: - Compressor kv_proj, gate_proj: dequantize NVFP4 → BF16, F.linear - Indexer weights_proj: dequantize NVFP4 → BF16, F.linear - Indexer q_b_proj: KEEP as NVFP4 (this IS the FP4-QATed path) - Indexer compressor: inherits Compressor's BF16 path	2026-06-03 14:19:41 +00:00
biondizzle	95e45a87e3	Add explicit .to(dev) on W_gate after transpose — belt and suspenders	2026-06-03 14:17:02 +00:00
biondizzle	ef94c48957	Simplify router gate: dequant NVFP4 → BF16, F.linear (no FP8 middleman) Same as what worked before. The checkpoint stores NVFP4 weights, so we dequantize once at load time and use cuBLAS F.linear. No FP8 re-quantize step needed — that was just adding noise on top of the NVFP4 dequant.	2026-06-03 14:14:10 +00:00
biondizzle	715602c87c	Switch lm_head to BF16 + router gate to FP8_E4M3 lm_head: BF16 F.linear (checkpoint weight is BF16, no quantization) Router gate: FP8_E4M3 quantize→dequantize round-trip, then F.linear - Dequantize NVFP4 checkpoint weights to BF16 first - Quantize to FP8_E4M3 (scale = amax/448) - Dequantize back to BF16 for F.linear - Uses BF16 dispatch path in dense_router_dispatch - Simpler scale wiring than NVFP4 (single per-tensor scale)	2026-06-03 14:10:28 +00:00
biondizzle	8cfc1cae58	Canonical encoding: derive special token IDs from official encoding module + tokenizer - Remove hardcoded THINK_START/THINK_END/USER_TOKEN/ASSISTANT_TOKEN IDs - Import token strings from encoding.deepseek_v4_encoding (official source) - Resolve IDs via tokenizer.convert_tokens_to_ids() at runtime - Use parse_message_from_completion_text() for structured output parsing - No more hand-rolled prompt construction or hardcoded token IDs - Clean up TEMP: replace old deepseek_v4_ref with dsv4thing.zip reference	2026-06-03 10:23:02 +00:00
biondizzle	a86d6d90a5	Replace hand-rolled prompt with official DSV4 encoder (canonical path) - Copied deepseek_v4_encoding.py from vLLM tree to encoding/ - Replaced hand-rolled prompt construction with encode_messages() - --chat-mode → --thinking-mode (thinking\|chat) - The official encoder handles: BOS, User/Assistant tokens, thinking mode, tool calls, and all special token placement. It can't drift. - This is the same code path inference engines will use.	2026-06-03 09:59:05 +00:00
biondizzle	284fc9ca86	Fix: thread comp_rope_cos/comp_rope_sin through forward_attention Previous commit added params to forward_layer but forward_attention (where compressed RoPE is applied) didn't receive them, causing NameError. Also confirmed from B200 test output: compress_rope_theta=160000 vs rope_theta=10000 — a 16x difference. The separate cache is essential.	2026-06-03 09:30:57 +00:00
biondizzle	6a3374da18	Cross-check 2 complete: block-aligned comp_pos + compress_rope_theta wired through - Fixed comp_pos: (bir) block-aligned instead of ((bi+1)r-1) last-position - compress_rope_theta: separate rope cache for compressed KV entries - comp_rope_cos/comp_rope_sin wired to all forward_layer call sites (prefill chunk loop, decode loop, CUDAGraphDecoder capture) - forward_layer uses comp_rope caches for compressed RoPE, falls back to normal - Only single_shot_inference.py modified, no kernel code touched	2026-06-03 09:19:11 +00:00
biondizzle	5003e756e2	WIP: cross-check 2 fix — block-aligned compressed RoPE positions + compress_rope_theta support - CRITICAL BUG FIX: comp_pos was using LAST position of each block (((bi+1)r-1)) instead of FIRST position (bir). Off by r-1: 3 for CSA, 127 for HCA. vLLM uses (position // ratio) * ratio = block-aligned first position. - Added compress_rope_theta config support (vLLM uses separate theta for compressed) - Added comp_rope_cos/comp_rope_sin param to forward_layer (not yet wired through) Only single_shot_inference.py changed — no kernel code touched. Base commit: `572bdd2`	2026-06-03 09:17:54 +00:00
biondizzle	1e77dfcaa0	Fix prompt encoding: remove \n\n before content per official DSV4 spec; add --chat-mode	2026-06-03 08:19:33 +00:00
biondizzle	019a3a34b7	Clean up L0 B1 verify noise (gate on VERBOSE), update FINAL_STRETCH.md Batched prefill + T>128 chunking now complete. All dangling items in FINAL_STRETCH.md are marked done.	2026-06-03 08:12:54 +00:00
biondizzle	60309ef124	Batched prefill: replace T=1 token-by-token with chunked T≤128 batch processing - Process prefill tokens in chunks of up to 128 (FMHA T≤128 constraint) - Each chunk goes through ALL 61 layers before the next chunk - KV cache append_swa, compressor, indexer all already support T>1 - FMHA dispatches to dsv4_attention_mixed_fp8_prefill for T>1 - For T>128: splits into multiple launches automatically - mHC, Router, MoE, Nvfp4Linear all handle M>1 natively - Eliminates ~N_prefill * 61 per-token overhead from the old loop	2026-06-03 07:39:37 +00:00
biondizzle	75288bd12f	Wire prefill FMHA into production.py and single_shot - Add dsv4_attention_mixed_fp8_prefill to production.py - _run_production_fmha_mixed now dispatches to prefill kernel for T>1 - Remove decode-only T==1 restriction - Update FINAL_STRETCH.md: prefill marked DONE, batched prefill TODO noted	2026-06-03 03:49:57 +00:00
biondizzle	af58f2c5b2	Add B1 weight/format verification at L0 in single_shot	2026-06-03 01:52:55 +00:00
biondizzle	b9243fe40a	B2: FP8 tensor-core indexer scoring + weighted ReLU + top-k - New kernel: dsv4/kernels/cuda/indexer_fp8_score_topk.cu - Native Blackwell FP8 GEMM via tcgen05.mma.kind::f8f6f4 - Q (n_ih=64, ihd=128) quantized BF16→FP8, K consumed directly as FP8_E4M3 - TMEM read using 16x256b.x1 (4-warps parallel, proven from B1 FMHA) - On-the-fly: dequant (q_scale*k_scale) → ReLU → weighted sum → top-k - No global BF16 staging of indexer keys, no FP32 einsum on CUDA cores - Per-thread register heap top-k (same algorithm as indexer_score_topk.cu) - Modified: single_shot_inference.py - Indexer.forward() now takes kv_cache directly (not comp_idx_kv BF16) - Consumes FP8 indexer keys from cache without BF16 dequantization - Dispatches to B2 FP8 kernel for T=1, n_ih=64, ihd=128 (production decode) - FP32 einsum fallback retained only for T>1 (prefill) - Removed 'Intentional first-pass limits' section from B1 doc (those limits ARE the correct production design, not shortcuts)	2026-06-02 23:18:54 +00:00
biondizzle	a9d5e09f4c	B1: mixed FP8/BF16 decode FMHA integration - New: fmha_mixed_fp8_decode.cuh (Blackwell FP8 tensor-core FMHA kernel) - New: fmha_mixed_fp8_capi.cu (C ABI launcher) - New: fmha_mixed_fp8_op.py (Python ctypes/nvcc bridge) - New: fp8_attention_io.cu (Q quantize + mixed KV gather kernels) - New: fmha_umma_desc.cuh additions (f8f6f4 UMMA + idesc helpers) - Modified: production.py (dsv4_attention_mixed_fp8_decode API) - Modified: single_shot_inference.py (B1 gather + FMHA path) - Modified: __init__.py (export mixed FP8 API) - New: docs/B1_MIXED_FP8_FMHA.md, FINAL_STRETCH.md noPE KV stays FP8_E4M3 + per-row scale, RoPE stays BF16. No global FP8->BF16 KV staging before FMHA. Decode-only (T==1), specialized HD=512/NOPE=448/ROPE=64. CUDA compile/runtime validation pending on B200.	2026-06-02 22:53:14 +00:00
biondizzle	9d4a014fad	Fix NameError: dequantize_nvfp4 not in scope in forward_attention The B3 fused q_a_norm path used dequantize_nvfp4 but it was only imported in forward_layer, not forward_attention. Added local import.	2026-06-02 21:52:29 +00:00
biondizzle	0b6ca0df80	P5 integration + B3 q_a_norm fused + gsa scalar fix P5: Wire up fused mHC pre_block + RMSNorm + NVFP4 quantize kernel - Replaces: pre_block bmm + rmsnorm (4+ launches) + quantize (2 launches) - With: 2 kernel launches (mhc_rmsnorm_amax_gsa + mhc_rmsnorm_quantize_nvfp4) - Both attn and ffn mHC paths now use P5 fused kernel - Savings: ~5 launches/site × 2 sites × 61 layers = 610 launches/token B3: Fused rmsnorm+quant for q_a_norm → q_b path - q_a output → rmsnorm_quantize_nvfp4 → QuantizedActivation → q_b.run_from_quantized - Eliminates BF16 round-trip between q_a_norm and q_b GEMM - Saves: ~6 kernel launches per layer (rmsnorm 4+ + quantize 2 vs fused 2) gsa scalar fix in Nvfp4Linear.run_from_quantized: - CuTeDSL NVFP4 GEMM expects global_scale_a as per-expert scalar (shape (1,)) - Per-row gsa from fused kernels must be reduced to scalar (max) for M>1 - For M=1 decode: already scalar, no reduction needed - Fixes potential correctness issue at prefill (M>1) when using fused paths Cleanup: Remove --ab-compare flag and A/B comparison code (replaced by P5)	2026-06-02 21:20:34 +00:00
biondizzle	7e42b5e090	A1: Add ◇ (think_start) priming after Assistant token DSV4 is a reasoning model. The standard prompt format is: BOS <\|User\|> prompt <\|Assistant\|> ◇ Without the ◇ priming, the model is out-of-distribution — it expects to be inside a thinking block but never received the sentinel. This causes degenerate output from step 0 (France instead of Paris, looping on newlines/repeated tokens). With ◇, the model will: 1. Generate thinking content (reasoning) 2. Emit ◇ (think_end=128822) to close the thinking block 3. Produce the actual answer 4. Emit EOS (token 1) This matches the pattern described in the Kimi K2 accuracy blog: https://vllm.ai/blog/2025-10-28-kimi-k2-accuracy — malformed prompt formatting is the #1 cause of degenerate output in chat-tuned reasoning models.	2026-06-02 20:23:47 +00:00
biondizzle	ecd48ab65e	A1: Add explicit stop set for DSV4 turn-end tokens Previously only stopped on tokenizer.eos_token_id. DSV4 uses special turn-end tokens (<\|end_of_sentence\|>, USER_TOKEN=128803) that indicate the assistant turn is complete. Missing these caused decode to continue past the model's natural stopping point, producing degenerate output. Also increased diagnostic logging (every step for first 20 steps) to catch turn-end token emissions.	2026-06-02 19:59:52 +00:00
biondizzle	eb5ef93bf1	Add A/B comparison mode for P4 fused vs unfused RMSNorm+quantize - Added --ab-compare flag to run both fused and unfused paths for first 3 layers - Compares x_normed, gsa values, FP4 data, and GEMM outputs (q_a, kv) - Added --no-fused-rmsnorm to disable P4 and use unfused path - This will help diagnose the correctness regression introduced by P4	2026-06-02 18:49:30 +00:00
biondizzle	7bb3207347	P4: Integrate fused RMSNorm+quantize into single_shot (attention path) - forward_layer: use rmsnorm_quantize_nvfp4 for attn_norm - forward_attention: accept x_quant, use run_from_quantized for q_a/kv - Dequantize for compressor/indexer (still saves 2+ launches per site) - FFN path kept unfused — MoE internal quantization needs refactoring (P5) - _use_fused_rmsnorm_quantize flag to toggle (default True)	2026-06-02 16:38:44 +00:00
biondizzle	82294fc21e	Fix nope_dim UnboundLocalError — hoist to function scope	2026-06-02 11:18:58 +00:00
biondizzle	c89762ecdd	Fix set_indexer_keys_fp8 None guard + store comp_pos in mixed storage	2026-06-02 10:20:26 +00:00
biondizzle	1f69f61363	Add detailed comment: why compressed KV uses FP8 not NVFP4 We tried NVFP4 (Blackwell native FP4→MMA). Three approaches. cos=0.995 round-trip seems fine in isolation but 4.5 effective bits compounds fatally across 61 layers of mHC. FP8_E4M3's 5.3 effective bits gives cos=0.9997 — that 0.4% difference is the margin between working and broken. Kernels exist, path is proven, precision isn't.	2026-06-02 10:19:54 +00:00
biondizzle	edc8e7ee8d	KV-1/KV-2: Mixed FP8+BF16 compressed KV (DeepSeek V4 paper format) Architecture matches paper: 'BF16 for RoPE dims, FP8 for remaining dims' - Non-RoPE dims (448 of 512): FP8_E4M3 storage → dequant to BF16 for FMHA - RoPE dims (64 of 512): BF16 storage (RoPE applied directly, no conversion) - Indexer keys: FP8_E4M3 (ihd=128, no RoPE) - SWA: BF16 (unchanged) Pipeline: Compressor → FP32 → split → [nope: FP32→FP8] + [rope: FP32→BF16→RoPE] Gather: [nope: FP8→BF16] + [rope: BF16] → concat → FMHA No BF16 intermediate for non-RoPE data. No FP32 intermediate after BF16 RoPE. BF16 is the final format consumed by FMHA (no further conversion). KVCache rewritten: - comp_nope_fp8/scale: FP8 storage for non-RoPE - comp_rope_bf16: BF16 storage for RoPE - comp_nope_selective/all: FP8→BF16 dequant - comp_rope_selective/all: BF16 gather - set_compressed_mixed: write mixed format - set_indexer_keys_fp8: write FP8 indexer keys	2026-06-02 10:08:43 +00:00
biondizzle	7ef6402936	KV-1/KV-2/KV-3: NVFP4 compressed KV + FP8 indexer keys Architecture: - Compressed KV: stored as NVFP4 (E2M1 + E4M3 + FP32 gsa) - Write path: compress→FP32 → FP32 RoPE → quantize FP32→NVFP4 - Read path: dequant_nvfp4/dequant_nvfp4_selective → BF16 for FMHA - No BF16 intermediate in the write path - Indexer keys: stored as FP8_E4M3 (1 byte + per-row scale) - Write path: compress→FP32 → quantize FP32→FP8_E4M3 - Read path: dequant_fp8_e4m3 → BF16 for scoring - SWA: remains BF16 (8MB total, fits in L2) New kernels in kv_quantize.cu: - compute_amax_gsa_fp32: per-row gsa from FP32 input - quantize_nvfp4_from_fp32: FP32→NVFP4 with GPU gsa buffer - quantize_fp8_e4m3_from_fp32: FP32→FP8_E4M3 for indexer keys - dequant_fp8_e4m3 / dequant_fp8_e4m3_selective: FP8→BF16 - rope_fp32: FP32 GPT-J interleaved RoPE (no BF16) Proven two-kernel pattern (same as quantize_nvfp4_gpu_fused): Kernel 1: amax_gsa (GPU-only) Kernel 2: quantize from buffer (GPU gsa) No shared memory bugs. No cross-CTA race conditions. KVCache updated: - comp_kv_fp4/sf/gsa: NVFP4 storage (3.5× smaller than BF16) - comp_idx_fp8/scale: FP8_E4M3 storage (1.9× smaller than BF16) - comp_kv property: dequant NVFP4→BF16 on demand - comp_kv_selective: dequant only top-k entries (bandwidth savings) - comp_idx_kv property: dequant FP8→BF16 on demand Removed: compressor_reduce_quant.cu (buggy single-kernel approach)	2026-06-02 10:00:50 +00:00

1 2 3 4 5

247 Commits