nvfp4-megamoe-kernel

Author	SHA1	Message	Date
biondizzle	bdb25ee5cd	Add production-value unit tests for kv_quantize kernels	2026-06-02 10:01:07 +00:00
biondizzle	7ef6402936	KV-1/KV-2/KV-3: NVFP4 compressed KV + FP8 indexer keys Architecture: - Compressed KV: stored as NVFP4 (E2M1 + E4M3 + FP32 gsa) - Write path: compress→FP32 → FP32 RoPE → quantize FP32→NVFP4 - Read path: dequant_nvfp4/dequant_nvfp4_selective → BF16 for FMHA - No BF16 intermediate in the write path - Indexer keys: stored as FP8_E4M3 (1 byte + per-row scale) - Write path: compress→FP32 → quantize FP32→FP8_E4M3 - Read path: dequant_fp8_e4m3 → BF16 for scoring - SWA: remains BF16 (8MB total, fits in L2) New kernels in kv_quantize.cu: - compute_amax_gsa_fp32: per-row gsa from FP32 input - quantize_nvfp4_from_fp32: FP32→NVFP4 with GPU gsa buffer - quantize_fp8_e4m3_from_fp32: FP32→FP8_E4M3 for indexer keys - dequant_fp8_e4m3 / dequant_fp8_e4m3_selective: FP8→BF16 - rope_fp32: FP32 GPT-J interleaved RoPE (no BF16) Proven two-kernel pattern (same as quantize_nvfp4_gpu_fused): Kernel 1: amax_gsa (GPU-only) Kernel 2: quantize from buffer (GPU gsa) No shared memory bugs. No cross-CTA race conditions. KVCache updated: - comp_kv_fp4/sf/gsa: NVFP4 storage (3.5× smaller than BF16) - comp_idx_fp8/scale: FP8_E4M3 storage (1.9× smaller than BF16) - comp_kv property: dequant NVFP4→BF16 on demand - comp_kv_selective: dequant only top-k entries (bandwidth savings) - comp_idx_kv property: dequant FP8→BF16 on demand Removed: compressor_reduce_quant.cu (buggy single-kernel approach)	2026-06-02 10:00:50 +00:00
biondizzle	40dd56eac2	KV-1: Fix shared memory corruption in block_reduce. block_reduce_sum/max write to smem[0..n_warps-1], but we passed &s_amax (single float). For 128 threads / 4 warps, this wrote 4 floats starting at &s_amax, corrupting adjacent shared variables (s_inv_rms, s_vals). Fix: use s_scratch[8] array (4 for sum, 4 for max) with proper sizing.	2026-06-02 09:49:12 +00:00
biondizzle	0fefadedd4	KV-1: Fix FP8 round-trip mismatch in fused quantize. CRITICAL: quantize must use the FP8-round-tripped block scale, not the raw pre-FP8 value. The dequant reads the FP8 bytes back, so the quantize must match exactly. Same pattern as quantize_nvfp4.cu. This was the root cause of cos=0.925 (should be ~0.995).	2026-06-02 09:46:32 +00:00
biondizzle	d74ff5768d	KV diag test	2026-06-02 09:43:45 +00:00
biondizzle	c2664281c3	KV-1/KV-2: Fix quantize kernel — each thread handles 16-elem blocks independently Previous version used __shfl_down_sync for group-level amax reduction, but shuffles operate at warp level and crossed group boundaries. Fix: each thread independently quantizes its assigned 16-element blocks from shared memory. Simpler and correct.	2026-06-02 09:41:15 +00:00
biondizzle	f23320b5b2	KV-1/KV-2: Fused compress+NVFP4 quantize kernels + dequant - compressor_reduce_quant.cu: Single-kernel CSA/HCA compress + RMSNorm + NVFP4 quantize. No intermediate BF16. FP32 → E2M1 + E4M3 + FP32 gsa in one kernel. Shared memory: ~2.5KB per CTA (FP32 staging + nibble buffer). - dequant_nvfp4.cu: NVFP4 → BF16 dequantization kernels. Full dequant (HCA dense gather) and selective dequant (CSA top-k gather). Single kernel launch per gather operation. - production_compress.py: Added csa_compress_production_nvfp4() and hca_compress_production_nvfp4() — production path for KV-1/KV-2. - loader.py: Preload dequant_nvfp4 and compressor_reduce_quant modules. - test_kv_compress_quant.py: Unit tests verifying cos >= 0.999 between BF16 reference and NVFP4 round-trip path.	2026-06-02 09:37:53 +00:00
biondizzle	107d62dd76	docs: update PERFORMANCE_AUDIT.md — Part 1 (P0-P3) landed, Part 2 KV cache next	2026-06-02 09:30:06 +00:00
biondizzle	3c295f225a	P3: integrate CUDA RoPE kernel into single_shot — 732 launches/token eliminated _apply_rope now uses dsv4.ops.rope_cuda (1 CUDA kernel per call) instead of PyTorch ops (5-6 kernels per call). Total: 183 RoPE calls × (5-1) = 732 launches saved per token. With fallback to PyTorch if CUDA kernel fails. v-p0p1p2p3-fused-swiglu-cuda-rope-20260602	2026-06-02 09:08:07 +00:00
biondizzle	54a9b6961b	fix: rope_cuda path — kernels/cuda not ops/cuda	2026-06-02 09:06:36 +00:00
biondizzle	2bbbead984	P3: CUDA RoPE kernel — single launch per call (vs 5-6 PyTorch ops) New files: - dsv4/kernels/cuda/rope_cuda.cu: GPT-J interleaved RoPE kernel (forward+inverse) - dsv4/ops/rope_cuda.py: Python bridge with ctypes loading - tests/unit/test_rope_cuda.py: correctness test (cos >= 0.999998) Savings: ~915 launches/token → 183 launches/token	2026-06-02 09:05:22 +00:00
biondizzle	851ec9b4d5	P3 WIP: fused RMSNorm + quantize kernel skeleton (not yet integrated)	2026-06-02 09:02:52 +00:00
biondizzle	b13c1057f5	test: verify GEMM shape with production weight format	2026-06-02 08:43:40 +00:00
biondizzle	40fb49d670	test: verify GEMM output shape	2026-06-02 08:41:22 +00:00
biondizzle	f01d3f3eac	wip: SE fused SwiGLU deinterleave fix	2026-06-02 08:41:00 +00:00
biondizzle	1726cb64a9	fix: interleave_l1_weights granularity_bf16 (not granularity) in SE	2026-06-02 08:29:03 +00:00
biondizzle	553275d810	feat: P1 — add eager warmup_fused_swiglu_compilation for SharedExpert (1-group)	2026-06-02 08:25:52 +00:00
biondizzle	5ed4c86137	fix: expert_offsets for 4-expert fused SwiGLU test	2026-06-02 08:24:32 +00:00
biondizzle	53362d2579	test: isolate fused SwiGLU — test no-clamp first	2026-06-02 08:23:28 +00:00
biondizzle	ae4506d722	fix: w_gs is scalar not iterable	2026-06-02 08:22:29 +00:00
biondizzle	b0c71b947e	test: fused SwiGLU — smoke test + correctness comparison with graceful degradation	2026-06-02 08:21:33 +00:00
biondizzle	2cfca36095	fix: compute correct gs from data in fused SwiGLU test	2026-06-02 08:20:27 +00:00
biondizzle	4a05a40cf0	fix: fused SwiGLU test — proper weight quant + 128-token alignment	2026-06-02 08:19:31 +00:00
biondizzle	fa769b6214	fix: pad activation as uint8 view for float4 dtype	2026-06-02 08:18:26 +00:00
biondizzle	024be1a60b	fix: test weight quantization dtype for fused SwiGLU test	2026-06-02 08:17:35 +00:00
biondizzle	19afa52e80	fix: use cute.where() directly for clamp in fused SwiGLU. (silu_result > limit).float() doesn't work on TensorSSA. cute.where(cond, true_val, false_val) is the correct TensorSSA API.	2026-06-02 08:16:41 +00:00
biondizzle	5c746bbdf2	fix: TensorSSA-compatible clamp in fused SwiGLU kernel. cute.arch.fmin/fmax take scalar Float32, not TensorSSA. Replace with cute.where() and arithmetic for TensorSSA compatibility. Also changed subtile loop to unroll=1 for cute.where() compatibility.	2026-06-02 08:15:46 +00:00
biondizzle	3a30f35c68	fix: cute.math.fmin/fmax → cute.arch.fmin/fmax in fused SwiGLU kernel. cute.math has no fmin/fmax. cute.arch does (register-level ops). README constraint #4: use cute.arch.fmax inside plain range(), not vectorize=True.	2026-06-02 08:12:55 +00:00
biondizzle	fca72427ea	fix: add fp4_out/sf_out/l2_global_scale params to fused_swiglu kernel() signature. The __call__ method passes these 3 Optional params to self.kernel(), but kernel() didn't accept them, causing TypeError: too many positional arguments during cute.compile(). This was the CuTeDSL 'arg-binding bug' blocking P0/P1.	2026-06-02 08:11:18 +00:00
biondizzle	55ea109cca	test: fused SwiGLU kernel compilation + correctness (P0/P1 gate)	2026-06-02 08:09:57 +00:00
biondizzle	7904cf05c4	Add set_fused_swiglu() method to Nvfp4MoE	2026-06-02 07:59:57 +00:00
biondizzle	d8e17d70c1	P0+P1+P2: Enable fused SwiGLU (MoE+SE), fix SE _run_l1_fused, remove per-call gsa fill_ P0: Enable fused SwiGLU for MoE (set_fused_swiglu(True)) - Saves 240+ unfused BF16 kernel launches per token - SiLU + clamp in kernel registers instead of separate launches P1: Fix shared expert _run_l1_fused + enable fused SwiGLU - Fixed: _l1_sf_view -> _l1_scale_b, _l1_gs_view -> _l1_gsb - Fixed: expert_offsets dtype int64 -> int32 - Added proper padded buffer + scale assembly (matching unfused path) - Added runtime gsa support (quantize_nvfp4_gpu_fused) P2: Remove per-call gsa_buf.fill_() in Nvfp4Linear - fill_() was H2D transfer every forward pass (~5µs × 244 calls = ~1.2ms/token) - _gsa_buf now initialized with _activation_global_scale (not zeros) - After warmup_gsa, buffer already has correct value — no fill needed	2026-06-02 07:57:39 +00:00
biondizzle	61d5e7ba53	revert: P2 gsa fill elimination — revert to proven path for e2e stability The fill_() is a CPU→GPU scalar write (tiny cost). The optimization was marginal and the output quality regression (CJK tokens) needs investigation separately. P2 can re-land after the regression is confirmed to be sampling-related (not gsa-related). P0/P1 (fused SwiGLU) still disabled — kernel arg-binding bug unfixed. v-perf-part1-p2-reverted-20260602	2026-06-02 07:32:10 +00:00
biondizzle	790f8c350a	perf: P2 landed (gsa fill elimination). P0/P1 fused SwiGLU disabled — CuTeDSL kernel arg-binding bug. P0/P1: The fused SwiGLU kernel's warmup_fused_swiglu_compilation() triggers 'TypeError: too many positional arguments' during cute.compile(). The kernel signature doesn't match the positional args being passed. This is a kernel-side fix, not a single_shot fix. Disabled until the fused kernel is debugged. P2: Landed — Nvfp4Linear skips redundant _gsa_buf.fill_() after warmup. SE fused SwiGLU infrastructure (set_fused_swiglu, _run_l1_fused, interleaved weight path) is wired but disabled. Will activate once kernel fix lands.	2026-06-02 07:16:08 +00:00
biondizzle	040b2eb6e7	perf: P0/P1/P2 — fused SwiGLU for MoE+SE, eliminate per-call gsa fill P0: Enable fused SwiGLU for all MoE instances (moe._fused_swiglu = True). Eliminates ~8 BF16 kernel launches per MoE per token (gate/up split, SiLU, clamp, elementwise multiply → single fused kernel launch). P1: Enable fused SwiGLU for shared expert (SE): - Added set_fused_swiglu() method to Nvfp4SharedExpert - Added _run_l1_fused() using run_fused_swiglu_grouped_gemm (1-group) - Interleave L1 weights at finalize time for fused kernel compatibility - Fused kernel handles SwiGLU + clamp in registers, outputs BF16 P2: Eliminate per-call _gsa_buf.fill_() in Nvfp4Linear: - _activation_global_scale is set once at warmup, never changes after - Skip redundant fill_() via _gsa_buf_initialized flag - Saves 244 CPU→GPU scalar fills per token (4 linears × 61 layers) P3: Deferred (in-kernel RoPE fusion — kernel-side change, not single_shot)	2026-06-02 06:59:25 +00:00
biondizzle	e9506e0c20	perf: C1/C2/C3 — per-layer max_comp, pre-allocated gather_buf, SWA views C1: --max-context CLI flag (default 8192). KVCache.max_comp computed from (max_context + compress_ratio - 1) // ratio per layer type. CSA at 8192 context → 2048 entries. HCA at 8192 → 64 entries. No more hardcoded 65536 that wastes memory on HCA layers. C2: Pre-allocated gather_buf (indexer_top_k + window_size, hd) in KVCache. Gather writes compressed+SWA into this buffer via slice assignment. Zero torch.cat allocations on the hot decode path. C3: get_swa returns views (no .clone()). Ring-buffer wrap returns indexed views. Caller copies into gather_buf so no aliasing risk. v-c1-c2-c3-20260602 v-post-indexer-c-fixes-20260602	2026-06-02 06:18:06 +00:00
biondizzle	617da29a5b	fix: assert topk_idx is not None in CSA layers — no silent fallback to SWA-only The indexer silently returning None caused CSA layers to attend over only the SWA window (128 tokens), not the compressed sparse KV. This went undetected because the model still produced plausible output at short context. The assert makes any future indexer regression immediately visible.	2026-06-02 06:14:23 +00:00
biondizzle	5b4c496512	fix: three indexer bugs — weight path, comp_idx_buf width, scoring einsum 1. Indexer.load: weights at .indexer.kv_proj not .indexer.compressor.kv_proj 2. KVCache.comp_idx_buf: width=ihd (128) not head_dim (512); parametric via indexer_key_dim 3. Indexer.forward: stored keys are (n_comp, ihd) not (n_comp, n_ih, ihd); einsum changed from 'tnd,cnd->tnc' to 'tnd,cd->tnc' — key shared across indexer heads (paper's c_I = ihd = 128, one vector per compressed block) Also removed probe diagnostics (COMPRESSOR BUFFERING, COMPRESSOR OUT, INDEXER SKIP, RESHAPE FAILURE, indexer load state) — served their purpose. v-indexer-fix-20260602	2026-06-02 05:53:10 +00:00
biondizzle	0fbf28dd54	doc: INDEXER_PROBE_RESULTS_20260602 — compressed key width is ihd=128, not n_ih*ihd=8192	2026-06-02 05:51:24 +00:00
biondizzle	8162c586c3	probe: fix comp_idx_buf width to ihd=128 so indexer probe can complete	2026-06-02 05:38:44 +00:00
biondizzle	5be31d8582	fix: indexer compressor weight path — weights are at .indexer.kv_proj not .indexer.compressor.kv_proj	2026-06-02 05:25:44 +00:00
biondizzle	fdfcca918c	probe: verify indexer compressor load state	2026-06-02 05:17:00 +00:00
biondizzle	fb0ed87626	probe: add indexer compressor early-return and buffering diagnostics	2026-06-02 05:06:18 +00:00
biondizzle	06c92f208f	INDEXER PROBE: instrumentation prints for compressed key width investigation	2026-06-02 04:44:47 +00:00
biondizzle	510eaf4a26	probe: HF indexer architecture from B200	2026-06-02 04:38:24 +00:00
biondizzle	938e9079ce	probe: indexer and compressor weight shapes from checkpoint	2026-06-02 04:36:35 +00:00
biondizzle	9254cb0b0d	test: NVFP4 runtime gsa accuracy vs PyTorch reference	2026-06-02 04:31:18 +00:00
biondizzle	7e3fb5f4d0	fix: add missing import for quantize_nvfp4_gpu in linear.py fixed-gsa path	2026-06-02 04:28:29 +00:00
biondizzle	f52eedbdce	Add production-value tests: ALL tests use Pro config (61L, HD=512, 384 experts, HCA=128, 1M context) Previous unit tests used toy values (HD=64-256, T=16, small N). These tests validate the actual production configuration: - FMHA: HD=512, 128 Q heads, N=128/2048/8192 - Compression: CSA T=4096, HCA T=16384, full 1M context - NVFP4: production weight shapes (q_a, kv, wo_a, gate) - MoE: 384 experts, top-6, 3072 intermediate - mHC: 4 streams, 61 layers, residual bounded, doubly-stochastic - Router: 384 experts hash + noaux-TC - Memory budget: 1M context KV pool, 8-GPU weight distribution	2026-06-02 04:10:39 +00:00
biondizzle	668a42e71a	debug: print mhc_sinkhorn CUDA kernel compile errors	2026-06-02 04:02:34 +00:00
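
Several of the commit messages above quote small pieces of arithmetic: the per-layer max_comp formula in e9506e0c20, the RoPE launch counts in 2bbbead984 and 3c295f225a, and the gsa fill cost in d8e17d70c1. The sketch below re-derives those figures so they can be checked at a glance. It is not code from the repository: the function and constant names are hypothetical, the CSA compress ratio of 4 is inferred from "8192 context → 2048 entries", the HCA ratio of 128 is read from "HCA=128" in f52eedbdce, and 5 PyTorch ops per RoPE call is the lower end of the "5-6" the messages give.

```python
# Sanity checks for figures quoted in the commit log.
# All names here are illustrative, not taken from the repository.

def max_comp(max_context: int, compress_ratio: int) -> int:
    """Ceiling division, as written in commit e9506e0c20."""
    return (max_context + compress_ratio - 1) // compress_ratio

CSA_RATIO = 4    # assumption: inferred from 8192 -> 2048 entries
HCA_RATIO = 128  # assumption: "HCA=128" read as the compress ratio

csa_entries = max_comp(8192, CSA_RATIO)
hca_entries = max_comp(8192, HCA_RATIO)

# RoPE launches per token (commits 2bbbead984 / 3c295f225a).
ROPE_CALLS_PER_TOKEN = 183
PYTORCH_OPS_PER_CALL = 5  # assumption: lower end of "5-6"
launches_before = ROPE_CALLS_PER_TOKEN * PYTORCH_OPS_PER_CALL
launches_after = ROPE_CALLS_PER_TOKEN * 1  # one CUDA kernel per call
launches_saved = launches_before - launches_after

# Per-call gsa fill cost (commit d8e17d70c1): 4 linears x 61 layers.
gsa_fills_per_token = 4 * 61
fill_cost_ms = gsa_fills_per_token * 5e-3  # ~5 us per fill, in ms

print(f"CSA max_comp @ 8192: {csa_entries}")
print(f"HCA max_comp @ 8192: {hca_entries}")
print(f"RoPE launches: {launches_before} -> {launches_after} "
      f"(saved {launches_saved})")
print(f"gsa fills/token: {gsa_fills_per_token}, ~{fill_cost_ms:.2f} ms")
```

Under these assumptions the numbers agree with the commit messages: 2048 and 64 compressed entries, 915 versus 183 launches (732 saved), and 244 fills costing roughly 1.2 ms per token.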

2202 Commits