nvfp4-megamoe-kernel

Author	SHA1	Message	Date
biondizzle	8cfc1cae58	Canonical encoding: derive special token IDs from official encoding module + tokenizer - Remove hardcoded THINK_START/THINK_END/USER_TOKEN/ASSISTANT_TOKEN IDs - Import token strings from encoding.deepseek_v4_encoding (official source) - Resolve IDs via tokenizer.convert_tokens_to_ids() at runtime - Use parse_message_from_completion_text() for structured output parsing - No more hand-rolled prompt construction or hardcoded token IDs - Clean up TEMP: replace old deepseek_v4_ref with dsv4thing.zip reference	2026-06-03 10:23:02 +00:00
biondizzle	a86d6d90a5	Replace hand-rolled prompt with official DSV4 encoder (canonical path) - Copied deepseek_v4_encoding.py from vLLM tree to encoding/ - Replaced hand-rolled prompt construction with encode_messages() - --chat-mode → --thinking-mode (thinking\|chat) - The official encoder handles: BOS, User/Assistant tokens, thinking mode, tool calls, and all special token placement. It can't drift. - This is the same code path inference engines will use.	2026-06-03 09:59:05 +00:00
biondizzle	284fc9ca86	Fix: thread comp_rope_cos/comp_rope_sin through forward_attention Previous commit added params to forward_layer but forward_attention (where compressed RoPE is applied) didn't receive them, causing NameError. Also confirmed from B200 test output: compress_rope_theta=160000 vs rope_theta=10000 — a 16x difference. The separate cache is essential.	2026-06-03 09:30:57 +00:00
biondizzle	6a3374da18	Cross-check 2 complete: block-aligned comp_pos + compress_rope_theta wired through - Fixed comp_pos: (bi*r) block-aligned instead of ((bi+1)*r-1) last-position - compress_rope_theta: separate rope cache for compressed KV entries - comp_rope_cos/comp_rope_sin wired to all forward_layer call sites (prefill chunk loop, decode loop, CUDAGraphDecoder capture) - forward_layer uses comp_rope caches for compressed RoPE, falls back to normal - Only single_shot_inference.py modified, no kernel code touched	2026-06-03 09:19:11 +00:00
biondizzle	5003e756e2	WIP: cross-check 2 fix — block-aligned compressed RoPE positions + compress_rope_theta support - CRITICAL BUG FIX: comp_pos was using LAST position of each block (((bi+1)*r-1)) instead of FIRST position (bi*r). Off by r-1: 3 for CSA, 127 for HCA. vLLM uses (position // ratio) * ratio = block-aligned first position. - Added compress_rope_theta config support (vLLM uses separate theta for compressed) - Added comp_rope_cos/comp_rope_sin param to forward_layer (not yet wired through) Only single_shot_inference.py changed — no kernel code touched. Base commit: `572bdd2`	2026-06-03 09:17:54 +00:00
biondizzle	1e77dfcaa0	Fix prompt encoding: remove \n\n before content per official DSV4 spec; add --chat-mode	2026-06-03 08:19:33 +00:00
biondizzle	019a3a34b7	Clean up L0 B1 verify noise (gate on VERBOSE), update FINAL_STRETCH.md Batched prefill + T>128 chunking now complete. All dangling items in FINAL_STRETCH.md are marked done.	2026-06-03 08:12:54 +00:00
biondizzle	60309ef124	Batched prefill: replace T=1 token-by-token with chunked T≤128 batch processing - Process prefill tokens in chunks of up to 128 (FMHA T≤128 constraint) - Each chunk goes through ALL 61 layers before the next chunk - KV cache append_swa, compressor, indexer all already support T>1 - FMHA dispatches to dsv4_attention_mixed_fp8_prefill for T>1 - For T>128: splits into multiple launches automatically - mHC, Router, MoE, Nvfp4Linear all handle M>1 natively - Eliminates ~N_prefill * 61 per-token overhead from the old loop	2026-06-03 07:39:37 +00:00
biondizzle	75288bd12f	Wire prefill FMHA into production.py and single_shot - Add dsv4_attention_mixed_fp8_prefill to production.py - _run_production_fmha_mixed now dispatches to prefill kernel for T>1 - Remove decode-only T==1 restriction - Update FINAL_STRETCH.md: prefill marked DONE, batched prefill TODO noted	2026-06-03 03:49:57 +00:00
biondizzle	af58f2c5b2	Add B1 weight/format verification at L0 in single_shot	2026-06-03 01:52:55 +00:00
biondizzle	b9243fe40a	B2: FP8 tensor-core indexer scoring + weighted ReLU + top-k - New kernel: dsv4/kernels/cuda/indexer_fp8_score_topk.cu - Native Blackwell FP8 GEMM via tcgen05.mma.kind::f8f6f4 - Q (n_ih=64, ihd=128) quantized BF16→FP8, K consumed directly as FP8_E4M3 - TMEM read using 16x256b.x1 (4-warps parallel, proven from B1 FMHA) - On-the-fly: dequant (q_scale*k_scale) → ReLU → weighted sum → top-k - No global BF16 staging of indexer keys, no FP32 einsum on CUDA cores - Per-thread register heap top-k (same algorithm as indexer_score_topk.cu) - Modified: single_shot_inference.py - Indexer.forward() now takes kv_cache directly (not comp_idx_kv BF16) - Consumes FP8 indexer keys from cache without BF16 dequantization - Dispatches to B2 FP8 kernel for T=1, n_ih=64, ihd=128 (production decode) - FP32 einsum fallback retained only for T>1 (prefill) - Removed 'Intentional first-pass limits' section from B1 doc (those limits ARE the correct production design, not shortcuts)	2026-06-02 23:18:54 +00:00
biondizzle	a9d5e09f4c	B1: mixed FP8/BF16 decode FMHA integration - New: fmha_mixed_fp8_decode.cuh (Blackwell FP8 tensor-core FMHA kernel) - New: fmha_mixed_fp8_capi.cu (C ABI launcher) - New: fmha_mixed_fp8_op.py (Python ctypes/nvcc bridge) - New: fp8_attention_io.cu (Q quantize + mixed KV gather kernels) - New: fmha_umma_desc.cuh additions (f8f6f4 UMMA + idesc helpers) - Modified: production.py (dsv4_attention_mixed_fp8_decode API) - Modified: single_shot_inference.py (B1 gather + FMHA path) - Modified: __init__.py (export mixed FP8 API) - New: docs/B1_MIXED_FP8_FMHA.md, FINAL_STRETCH.md noPE KV stays FP8_E4M3 + per-row scale, RoPE stays BF16. No global FP8->BF16 KV staging before FMHA. Decode-only (T==1), specialized HD=512/NOPE=448/ROPE=64. CUDA compile/runtime validation pending on B200.	2026-06-02 22:53:14 +00:00
biondizzle	9d4a014fad	Fix NameError: dequantize_nvfp4 not in scope in forward_attention The B3 fused q_a_norm path used dequantize_nvfp4 but it was only imported in forward_layer, not forward_attention. Added local import.	2026-06-02 21:52:29 +00:00
biondizzle	0b6ca0df80	P5 integration + B3 q_a_norm fused + gsa scalar fix P5: Wire up fused mHC pre_block + RMSNorm + NVFP4 quantize kernel - Replaces: pre_block bmm + rmsnorm (4+ launches) + quantize (2 launches) - With: 2 kernel launches (mhc_rmsnorm_amax_gsa + mhc_rmsnorm_quantize_nvfp4) - Both attn and ffn mHC paths now use P5 fused kernel - Savings: ~5 launches/site × 2 sites × 61 layers = 610 launches/token B3: Fused rmsnorm+quant for q_a_norm → q_b path - q_a output → rmsnorm_quantize_nvfp4 → QuantizedActivation → q_b.run_from_quantized - Eliminates BF16 round-trip between q_a_norm and q_b GEMM - Saves: ~6 kernel launches per layer (rmsnorm 4+ + quantize 2 vs fused 2) gsa scalar fix in Nvfp4Linear.run_from_quantized: - CuTeDSL NVFP4 GEMM expects global_scale_a as per-expert scalar (shape (1,)) - Per-row gsa from fused kernels must be reduced to scalar (max) for M>1 - For M=1 decode: already scalar, no reduction needed - Fixes potential correctness issue at prefill (M>1) when using fused paths Cleanup: Remove --ab-compare flag and A/B comparison code (replaced by P5)	2026-06-02 21:20:34 +00:00
biondizzle	7e42b5e090	A1: Add ◇ (think_start) priming after Assistant token DSV4 is a reasoning model. The standard prompt format is: BOS <\|User\|> prompt <\|Assistant\|> ◇ Without the ◇ priming, the model is out-of-distribution — it expects to be inside a thinking block but never received the sentinel. This causes degenerate output from step 0 (France instead of Paris, looping on newlines/repeated tokens). With ◇, the model will: 1. Generate thinking content (reasoning) 2. Emit ◇ (think_end=128822) to close the thinking block 3. Produce the actual answer 4. Emit EOS (token 1) This matches the pattern described in the Kimi K2 accuracy blog: https://vllm.ai/blog/2025-10-28-kimi-k2-accuracy — malformed prompt formatting is the #1 cause of degenerate output in chat-tuned reasoning models.	2026-06-02 20:23:47 +00:00
biondizzle	ecd48ab65e	A1: Add explicit stop set for DSV4 turn-end tokens Previously only stopped on tokenizer.eos_token_id. DSV4 uses special turn-end tokens (<\|end_of_sentence\|>, USER_TOKEN=128803) that indicate the assistant turn is complete. Missing these caused decode to continue past the model's natural stopping point, producing degenerate output. Also increased diagnostic logging (every step for first 20 steps) to catch turn-end token emissions.	2026-06-02 19:59:52 +00:00
biondizzle	eb5ef93bf1	Add A/B comparison mode for P4 fused vs unfused RMSNorm+quantize - Added --ab-compare flag to run both fused and unfused paths for first 3 layers - Compares x_normed, gsa values, FP4 data, and GEMM outputs (q_a, kv) - Added --no-fused-rmsnorm to disable P4 and use unfused path - This will help diagnose the correctness regression introduced by P4	2026-06-02 18:49:30 +00:00
biondizzle	7bb3207347	P4: Integrate fused RMSNorm+quantize into single_shot (attention path) - forward_layer: use rmsnorm_quantize_nvfp4 for attn_norm - forward_attention: accept x_quant, use run_from_quantized for q_a/kv - Dequantize for compressor/indexer (still saves 2+ launches per site) - FFN path kept unfused — MoE internal quantization needs refactoring (P5) - _use_fused_rmsnorm_quantize flag to toggle (default True)	2026-06-02 16:38:44 +00:00
biondizzle	82294fc21e	Fix nope_dim UnboundLocalError — hoist to function scope	2026-06-02 11:18:58 +00:00
biondizzle	c89762ecdd	Fix set_indexer_keys_fp8 None guard + store comp_pos in mixed storage	2026-06-02 10:20:26 +00:00
biondizzle	1f69f61363	Add detailed comment: why compressed KV uses FP8 not NVFP4 We tried NVFP4 (Blackwell native FP4→MMA). Three approaches. cos=0.995 round-trip seems fine in isolation but 4.5 effective bits compounds fatally across 61 layers of mHC. FP8_E4M3's 5.3 effective bits gives cos=0.9997 — that 0.4% difference is the margin between working and broken. Kernels exist, path is proven, precision isn't.	2026-06-02 10:19:54 +00:00
biondizzle	edc8e7ee8d	KV-1/KV-2: Mixed FP8+BF16 compressed KV (DeepSeek V4 paper format) Architecture matches paper: 'BF16 for RoPE dims, FP8 for remaining dims' - Non-RoPE dims (448 of 512): FP8_E4M3 storage → dequant to BF16 for FMHA - RoPE dims (64 of 512): BF16 storage (RoPE applied directly, no conversion) - Indexer keys: FP8_E4M3 (ihd=128, no RoPE) - SWA: BF16 (unchanged) Pipeline: Compressor → FP32 → split → [nope: FP32→FP8] + [rope: FP32→BF16→RoPE] Gather: [nope: FP8→BF16] + [rope: BF16] → concat → FMHA No BF16 intermediate for non-RoPE data. No FP32 intermediate after BF16 RoPE. BF16 is the final format consumed by FMHA (no further conversion). KVCache rewritten: - comp_nope_fp8/scale: FP8 storage for non-RoPE - comp_rope_bf16: BF16 storage for RoPE - comp_nope_selective/all: FP8→BF16 dequant - comp_rope_selective/all: BF16 gather - set_compressed_mixed: write mixed format - set_indexer_keys_fp8: write FP8 indexer keys	2026-06-02 10:08:43 +00:00
biondizzle	7ef6402936	KV-1/KV-2/KV-3: NVFP4 compressed KV + FP8 indexer keys Architecture: - Compressed KV: stored as NVFP4 (E2M1 + E4M3 + FP32 gsa) - Write path: compress→FP32 → FP32 RoPE → quantize FP32→NVFP4 - Read path: dequant_nvfp4/dequant_nvfp4_selective → BF16 for FMHA - No BF16 intermediate in the write path - Indexer keys: stored as FP8_E4M3 (1 byte + per-row scale) - Write path: compress→FP32 → quantize FP32→FP8_E4M3 - Read path: dequant_fp8_e4m3 → BF16 for scoring - SWA: remains BF16 (8MB total, fits in L2) New kernels in kv_quantize.cu: - compute_amax_gsa_fp32: per-row gsa from FP32 input - quantize_nvfp4_from_fp32: FP32→NVFP4 with GPU gsa buffer - quantize_fp8_e4m3_from_fp32: FP32→FP8_E4M3 for indexer keys - dequant_fp8_e4m3 / dequant_fp8_e4m3_selective: FP8→BF16 - rope_fp32: FP32 GPT-J interleaved RoPE (no BF16) Proven two-kernel pattern (same as quantize_nvfp4_gpu_fused): Kernel 1: amax_gsa (GPU-only) Kernel 2: quantize from buffer (GPU gsa) No shared memory bugs. No cross-CTA race conditions. KVCache updated: - comp_kv_fp4/sf/gsa: NVFP4 storage (3.5× smaller than BF16) - comp_idx_fp8/scale: FP8_E4M3 storage (1.9× smaller than BF16) - comp_kv property: dequant NVFP4→BF16 on demand - comp_kv_selective: dequant only top-k entries (bandwidth savings) - comp_idx_kv property: dequant FP8→BF16 on demand Removed: compressor_reduce_quant.cu (buggy single-kernel approach)	2026-06-02 10:00:50 +00:00
biondizzle	3c295f225a	P3: integrate CUDA RoPE kernel into single_shot — 732 launches/token eliminated _apply_rope now uses dsv4.ops.rope_cuda (1 CUDA kernel per call) instead of PyTorch ops (5-6 kernels per call). Total: 183 RoPE calls × (5-1) = 732 launches saved per token. With fallback to PyTorch if CUDA kernel fails.	2026-06-02 09:08:07 +00:00
biondizzle	553275d810	feat: P1 — add eager warmup_fused_swiglu_compilation for SharedExpert (1-group)	2026-06-02 08:25:52 +00:00
biondizzle	d8e17d70c1	P0+P1+P2: Enable fused SwiGLU (MoE+SE), fix SE _run_l1_fused, remove per-call gsa fill_ P0: Enable fused SwiGLU for MoE (set_fused_swiglu(True)) - Saves 240+ unfused BF16 kernel launches per token - SiLU + clamp in kernel registers instead of separate launches P1: Fix shared expert _run_l1_fused + enable fused SwiGLU - Fixed: _l1_sf_view -> _l1_scale_b, _l1_gs_view -> _l1_gsb - Fixed: expert_offsets dtype int64 -> int32 - Added proper padded buffer + scale assembly (matching unfused path) - Added runtime gsa support (quantize_nvfp4_gpu_fused) P2: Remove per-call gsa_buf.fill_() in Nvfp4Linear - fill_() was H2D transfer every forward pass (~5µs × 244 calls = ~1.2ms/token) - _gsa_buf now initialized with _activation_global_scale (not zeros) - After warmup_gsa, buffer already has correct value — no fill needed	2026-06-02 07:57:39 +00:00
biondizzle	790f8c350a	perf: P2 landed (gsa fill elimination). P0/P1 fused SwiGLU disabled — CuTeDSL kernel arg-binding bug. P0/P1: The fused SwiGLU kernel's warmup_fused_swiglu_compilation() triggers 'TypeError: too many positional arguments' during cute.compile(). The kernel signature doesn't match the positional args being passed. This is a kernel-side fix, not a single_shot fix. Disabled until the fused kernel is debugged. P2: Landed — Nvfp4Linear skips redundant _gsa_buf.fill_() after warmup. SE fused SwiGLU infrastructure (set_fused_swiglu, _run_l1_fused, interleaved weight path) is wired but disabled. Will activate once kernel fix lands.	2026-06-02 07:16:08 +00:00
biondizzle	040b2eb6e7	perf: P0/P1/P2 — fused SwiGLU for MoE+SE, eliminate per-call gsa fill P0: Enable fused SwiGLU for all MoE instances (moe._fused_swiglu = True). Eliminates ~8 BF16 kernel launches per MoE per token (gate/up split, SiLU, clamp, elementwise multiply → single fused kernel launch). P1: Enable fused SwiGLU for shared expert (SE): - Added set_fused_swiglu() method to Nvfp4SharedExpert - Added _run_l1_fused() using run_fused_swiglu_grouped_gemm (1-group) - Interleave L1 weights at finalize time for fused kernel compatibility - Fused kernel handles SwiGLU + clamp in registers, outputs BF16 P2: Eliminate per-call _gsa_buf.fill_() in Nvfp4Linear: - _activation_global_scale is set once at warmup, never changes after - Skip redundant fill_() via _gsa_buf_initialized flag - Saves 244 CPU→GPU scalar fills per token (4 linears × 61 layers) P3: Deferred (in-kernel RoPE fusion — kernel-side change, not single_shot)	2026-06-02 06:59:25 +00:00
biondizzle	e9506e0c20	perf: C1/C2/C3 — per-layer max_comp, pre-allocated gather_buf, SWA views C1: --max-context CLI flag (default 8192). KVCache.max_comp computed from (max_context + compress_ratio - 1) // ratio per layer type. CSA at 8192 context → 2048 entries. HCA at 8192 → 64 entries. No more hardcoded 65536 that wastes memory on HCA layers. C2: Pre-allocated gather_buf (indexer_top_k + window_size, hd) in KVCache. Gather writes compressed+SWA into this buffer via slice assignment. Zero torch.cat allocations on the hot decode path. C3: get_swa returns views (no .clone()). Ring-buffer wrap returns indexed views. Caller copies into gather_buf so no aliasing risk.	2026-06-02 06:18:06 +00:00
biondizzle	617da29a5b	fix: assert topk_idx is not None in CSA layers — no silent fallback to SWA-only The indexer silently returning None caused CSA layers to attend over only the SWA window (128 tokens), not the compressed sparse KV. This went undetected because the model still produced plausible output at short context. The assert makes any future indexer regression immediately visible.	2026-06-02 06:14:23 +00:00
biondizzle	5b4c496512	fix: three indexer bugs — weight path, comp_idx_buf width, scoring einsum 1. Indexer.load: weights at .indexer.kv_proj not .indexer.compressor.kv_proj 2. KVCache.comp_idx_buf: width=ihd (128) not head_dim (512); parametric via indexer_key_dim 3. Indexer.forward: stored keys are (n_comp, ihd) not (n_comp, n_ih, ihd); einsum changed from 'tnd,cnd->tnc' to 'tnd,cd->tnc' — key shared across indexer heads (paper's c_I = ihd = 128, one vector per compressed block) Also removed probe diagnostics (COMPRESSOR BUFFERING, COMPRESSOR OUT, INDEXER SKIP, RESHAPE FAILURE, indexer load state) — served their purpose.	2026-06-02 05:53:10 +00:00
biondizzle	8162c586c3	probe: fix comp_idx_buf width to ihd=128 so indexer probe can complete	2026-06-02 05:38:44 +00:00
biondizzle	5be31d8582	fix: indexer compressor weight path — weights are at .indexer.kv_proj not .indexer.compressor.kv_proj	2026-06-02 05:25:44 +00:00
biondizzle	fdfcca918c	probe: verify indexer compressor load state	2026-06-02 05:17:00 +00:00
biondizzle	fb0ed87626	probe: add indexer compressor early-return and buffering diagnostics	2026-06-02 05:06:18 +00:00
biondizzle	06c92f208f	INDEXER PROBE: instrumentation prints for compressed key width investigation	2026-06-02 04:44:47 +00:00
biondizzle	f0dec9f6bd	profile: fine-grained attention component timing	2026-06-02 03:08:34 +00:00
biondizzle	7114c48575	fix: parenthesize profile_detail condition	2026-06-02 02:56:13 +00:00
biondizzle	4734e894c7	profile: add per-layer attn vs ffn timing with CUDA sync	2026-06-02 02:46:35 +00:00
biondizzle	4017ef2f16	fix: accurate profile sync + remove paris_tids 129K iteration	2026-06-01 23:55:26 +00:00
biondizzle	73ae9393da	FIX: RoPE cache 8192→65536 (original_max_position_embeddings), KVCache max_comp 32768→65536	2026-06-01 23:18:37 +00:00
biondizzle	36f9782bad	Add thinking/Paris token logit check on step 0 for quality debugging	2026-06-01 23:14:24 +00:00
biondizzle	ef7e0d63bb	Add --warmup-gsa flag: fix attention/router gsa after first decode step to eliminate amax kernel launches	2026-06-01 23:04:44 +00:00
biondizzle	008e59eb90	Add --profile flag: per-component GPU timing with CUDA sync (embed+layers, lm_head, sampling)	2026-06-01 23:03:46 +00:00
biondizzle	e53645654d	Reduce hot-path .item() syncs: gate li>=58 diagnostics behind VERBOSE>=2, topk on float	2026-06-01 22:33:03 +00:00
biondizzle	6f4bbc997a	Add sync after sampler for step<3 to catch async CUDA errors early	2026-06-01 22:32:40 +00:00
biondizzle	5493a8727e	P7: compressor early return + decode buffering (skip GEMMs when n_complete=0); sampler SMEM fix (LK=24 fits 48KB default); topk on float not bf16	2026-06-01 22:29:56 +00:00
biondizzle	583ad6cfe6	P0 complete: Kill .item() in grouped_linear, reduce hot-path syncs - grouped_linear.py: Replace .item() gsa + Python quantize with quantize_nvfp4_gpu_fused (zero CPU syncs). Flatten all groups into (G*T, D), single fused kernel launch, GPU-only gsa copy. - single_shot_inference.py: Reduce torch.cuda.synchronize() to every 20 steps instead of every step. Gate per-layer diagnostics to li<3 or li>=58 (avoid 61 .item() calls per decode step).	2026-06-01 22:21:12 +00:00
biondizzle	8767c263ab	Add cuda.synchronize + better logits validation after lm_head Catch CUDA errors at the source instead of seeing them surfaced at torch.topk. Print logits stats every step.	2026-06-01 22:06:41 +00:00
biondizzle	2a6f9a10b1	lm_head: fall back to BF16 F.linear for stability NVFP4 quantize_from_buffer produces CUDA error on large-magnitude inputs (\|X\|>500 at L60 output). BF16 lm_head is correct and only runs once per decode step — not a bottleneck. TODO: debug the NVFP4 path for large activations and re-enable.	2026-06-01 22:05:22 +00:00


220 Commits
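
The compressed-KV index arithmetic quoted in commits 5003e756e2 and e9506e0c20 can be checked with a small standalone sketch. The function names below are hypothetical and do not come from the repository. The ratios 4 (CSA) and 128 (HCA) are assumptions inferred from the figures in those messages ("Off by r-1: 3 for CSA, 127 for HCA" and "CSA at 8192 context → 2048 entries. HCA at 8192 → 64 entries"), not values read from the model config.

```python
# Hypothetical helpers illustrating the arithmetic in the commit messages.
# None of these names exist in the repository.

CSA_RATIO = 4    # assumption: inferred from "off by r-1 = 3" and 8192 -> 2048
HCA_RATIO = 128  # assumption: inferred from "off by r-1 = 127" and 8192 -> 64


def first_pos_of_block(block_index: int, ratio: int) -> int:
    """Block-aligned position (bi * r): the behaviour after the fix."""
    return block_index * ratio


def last_pos_of_block(block_index: int, ratio: int) -> int:
    """Last position of the block ((bi + 1) * r - 1): the behaviour before the fix."""
    return (block_index + 1) * ratio - 1


def block_aligned_pos(position: int, ratio: int) -> int:
    """The form the commit attributes to vLLM: (position // ratio) * ratio."""
    return (position // ratio) * ratio


def max_comp_entries(max_context: int, ratio: int) -> int:
    """Ceiling division used for the per-layer compressed-entry count."""
    return (max_context + ratio - 1) // ratio


if __name__ == "__main__":
    for name, ratio in (("CSA", CSA_RATIO), ("HCA", HCA_RATIO)):
        block = 3  # arbitrary block index for illustration
        old = last_pos_of_block(block, ratio)
        new = first_pos_of_block(block, ratio)
        print(f"{name}: ratio={ratio} old_pos={old} new_pos={new} "
              f"offset={old - new} max_comp@8192={max_comp_entries(8192, ratio)}")
```

Every token position inside a block maps to the same block-aligned value under `block_aligned_pos`, which equals `first_pos_of_block` for that block. That is why the old last-position choice differed from it by a constant r-1.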