nvfp4-megamoe-kernel

Author	SHA1	Message	Date
biondizzle	5c746bbdf2	fix: TensorSSA-compatible clamp in fused SwiGLU kernel. cute.arch.fmin/fmax take scalar Float32, not TensorSSA. Replace with cute.where() and arithmetic for TensorSSA compatibility. Also changed subtile loop to unroll=1 for cute.where() compatibility.	2026-06-02 08:15:46 +00:00
biondizzle	3a30f35c68	fix: cute.math.fmin/fmax → cute.arch.fmin/fmax in fused SwiGLU kernel. cute.math has no fmin/fmax; cute.arch does (register-level ops). README constraint #4: use cute.arch.fmax inside plain range(), not vectorize=True.	2026-06-02 08:12:55 +00:00
biondizzle	fca72427ea	fix: add fp4_out/sf_out/l2_global_scale params to fused_swiglu kernel() signature. The __call__ method passes these 3 Optional params to self.kernel(), but kernel() didn't accept them, causing 'TypeError: too many positional arguments' during cute.compile(). This was the CuTeDSL 'arg-binding bug' blocking P0/P1.	2026-06-02 08:11:18 +00:00
biondizzle	55ea109cca	test: fused SwiGLU kernel compilation + correctness (P0/P1 gate)	2026-06-02 08:09:57 +00:00
biondizzle	7904cf05c4	Add set_fused_swiglu() method to Nvfp4MoE	2026-06-02 07:59:57 +00:00
biondizzle	d8e17d70c1	P0+P1+P2: Enable fused SwiGLU (MoE+SE), fix SE _run_l1_fused, remove per-call gsa fill_(). P0: Enable fused SwiGLU for MoE (set_fused_swiglu(True)): saves 240+ unfused BF16 kernel launches per token; SiLU + clamp in kernel registers instead of separate launches. P1: Fix shared expert _run_l1_fused + enable fused SwiGLU: fixed _l1_sf_view -> _l1_scale_b and _l1_gs_view -> _l1_gsb; fixed expert_offsets dtype int64 -> int32; added proper padded buffer + scale assembly (matching unfused path); added runtime gsa support (quantize_nvfp4_gpu_fused). P2: Remove per-call gsa_buf.fill_() in Nvfp4Linear: fill_() was an H2D transfer every forward pass (~5µs × 244 calls = ~1.2ms/token); _gsa_buf now initialized with _activation_global_scale (not zeros); after warmup_gsa, buffer already has the correct value, so no fill is needed.	2026-06-02 07:57:39 +00:00
biondizzle	61d5e7ba53	revert: P2 gsa fill elimination; revert to proven path for e2e stability. The fill_() is a CPU→GPU scalar write (tiny cost). The optimization was marginal and the output quality regression (CJK tokens) needs investigation separately. P2 can re-land after the regression is confirmed to be sampling-related (not gsa-related). P0/P1 (fused SwiGLU) still disabled: kernel arg-binding bug unfixed. v-perf-part1-p2-reverted-20260602	2026-06-02 07:32:10 +00:00
biondizzle	790f8c350a	perf: P2 landed (gsa fill elimination). P0/P1 fused SwiGLU disabled — CuTeDSL kernel arg-binding bug. P0/P1: The fused SwiGLU kernel's warmup_fused_swiglu_compilation() triggers 'TypeError: too many positional arguments' during cute.compile(). The kernel signature doesn't match the positional args being passed. This is a kernel-side fix, not a single_shot fix. Disabled until the fused kernel is debugged. P2: Landed — Nvfp4Linear skips redundant _gsa_buf.fill_() after warmup. SE fused SwiGLU infrastructure (set_fused_swiglu, _run_l1_fused, interleaved weight path) is wired but disabled. Will activate once kernel fix lands.	2026-06-02 07:16:08 +00:00
biondizzle	040b2eb6e7	perf: P0/P1/P2 - fused SwiGLU for MoE+SE, eliminate per-call gsa fill. P0: Enable fused SwiGLU for all MoE instances (moe._fused_swiglu = True). Eliminates ~8 BF16 kernel launches per MoE per token (gate/up split, SiLU, clamp, elementwise multiply → single fused kernel launch). P1: Enable fused SwiGLU for shared expert (SE): added set_fused_swiglu() method to Nvfp4SharedExpert; added _run_l1_fused() using run_fused_swiglu_grouped_gemm (1-group); interleave L1 weights at finalize time for fused kernel compatibility; fused kernel handles SwiGLU + clamp in registers, outputs BF16. P2: Eliminate per-call _gsa_buf.fill_() in Nvfp4Linear: _activation_global_scale is set once at warmup and never changes after; skip redundant fill_() via _gsa_buf_initialized flag; saves 244 CPU→GPU scalar fills per token (4 linears × 61 layers). P3: Deferred (in-kernel RoPE fusion is a kernel-side change, not single_shot).	2026-06-02 06:59:25 +00:00
biondizzle	e9506e0c20	perf: C1/C2/C3 - per-layer max_comp, pre-allocated gather_buf, SWA views. C1: --max-context CLI flag (default 8192). KVCache.max_comp computed from (max_context + compress_ratio - 1) // compress_ratio per layer type. CSA at 8192 context → 2048 entries. HCA at 8192 → 64 entries. No more hardcoded 65536 that wastes memory on HCA layers. C2: Pre-allocated gather_buf (indexer_top_k + window_size, hd) in KVCache. Gather writes compressed+SWA into this buffer via slice assignment. Zero torch.cat allocations on the hot decode path. C3: get_swa returns views (no .clone()). Ring-buffer wrap returns indexed views. Caller copies into gather_buf so no aliasing risk. v-c1-c2-c3-20260602 v-post-indexer-c-fixes-20260602	2026-06-02 06:18:06 +00:00
biondizzle	617da29a5b	fix: assert topk_idx is not None in CSA layers, no silent fallback to SWA-only. The indexer silently returning None caused CSA layers to attend over only the SWA window (128 tokens), not the compressed sparse KV. This went undetected because the model still produced plausible output at short context. The assert makes any future indexer regression immediately visible.	2026-06-02 06:14:23 +00:00
biondizzle	5b4c496512	fix: three indexer bugs - weight path, comp_idx_buf width, scoring einsum. 1. Indexer.load: weights at .indexer.kv_proj not .indexer.compressor.kv_proj. 2. KVCache.comp_idx_buf: width=ihd (128) not head_dim (512); parametric via indexer_key_dim. 3. Indexer.forward: stored keys are (n_comp, ihd) not (n_comp, n_ih, ihd); einsum changed from 'tnd,cnd->tnc' to 'tnd,cd->tnc', since the key is shared across indexer heads (paper's c_I = ihd = 128, one vector per compressed block). Also removed probe diagnostics (COMPRESSOR BUFFERING, COMPRESSOR OUT, INDEXER SKIP, RESHAPE FAILURE, indexer load state), which served their purpose. v-indexer-fix-20260602	2026-06-02 05:53:10 +00:00
biondizzle	0fbf28dd54	doc: INDEXER_PROBE_RESULTS_20260602 — compressed key width is ihd=128, not n_ih*ihd=8192	2026-06-02 05:51:24 +00:00
biondizzle	8162c586c3	probe: fix comp_idx_buf width to ihd=128 so indexer probe can complete	2026-06-02 05:38:44 +00:00
biondizzle	5be31d8582	fix: indexer compressor weight path — weights are at .indexer.kv_proj not .indexer.compressor.kv_proj	2026-06-02 05:25:44 +00:00
biondizzle	fdfcca918c	probe: verify indexer compressor load state	2026-06-02 05:17:00 +00:00
biondizzle	fb0ed87626	probe: add indexer compressor early-return and buffering diagnostics	2026-06-02 05:06:18 +00:00
biondizzle	06c92f208f	INDEXER PROBE: instrumentation prints for compressed key width investigation	2026-06-02 04:44:47 +00:00
biondizzle	510eaf4a26	probe: HF indexer architecture from B200	2026-06-02 04:38:24 +00:00
biondizzle	938e9079ce	probe: indexer and compressor weight shapes from checkpoint	2026-06-02 04:36:35 +00:00
biondizzle	9254cb0b0d	test: NVFP4 runtime gsa accuracy vs PyTorch reference	2026-06-02 04:31:18 +00:00
biondizzle	7e3fb5f4d0	fix: add missing import for quantize_nvfp4_gpu in linear.py fixed-gsa path	2026-06-02 04:28:29 +00:00
biondizzle	f52eedbdce	Add production-value tests: ALL tests use Pro config (61L, HD=512, 384 experts, HCA=128, 1M context). Previous unit tests used toy values (HD=64-256, T=16, small N). These tests validate the actual production configuration. FMHA: HD=512, 128 Q heads, N=128/2048/8192. Compression: CSA T=4096, HCA T=16384, full 1M context. NVFP4: production weight shapes (q_a, kv, wo_a, gate). MoE: 384 experts, top-6, 3072 intermediate. mHC: 4 streams, 61 layers, residual bounded, doubly-stochastic. Router: 384 experts hash + noaux-TC. Memory budget: 1M context KV pool, 8-GPU weight distribution.	2026-06-02 04:10:39 +00:00
biondizzle	668a42e71a	debug: print mhc_sinkhorn CUDA kernel compile errors	2026-06-02 04:02:34 +00:00
biondizzle	ca53bdb8e1	perf: skip MQA GQA expansion in FMHA (stride=0, no 128x K/V copy)	2026-06-02 03:54:03 +00:00
biondizzle	7b82d31330	perf: fused mHC Sinkhorn CUDA kernel (1 launch vs 38)	2026-06-02 03:50:57 +00:00
biondizzle	f0dec9f6bd	profile: fine-grained attention component timing	2026-06-02 03:08:34 +00:00
biondizzle	7114c48575	fix: parenthesize profile_detail condition	2026-06-02 02:56:13 +00:00
biondizzle	4734e894c7	profile: add per-layer attn vs ffn timing with CUDA sync	2026-06-02 02:46:35 +00:00
biondizzle	4017ef2f16	fix: accurate profile sync + remove paris_tids 129K iteration	2026-06-01 23:55:26 +00:00
biondizzle	73ae9393da	FIX: RoPE cache 8192→65536 (original_max_position_embeddings), KVCache max_comp 32768→65536	2026-06-01 23:18:37 +00:00
biondizzle	36f9782bad	Add thinking/Paris token logit check on step 0 for quality debugging	2026-06-01 23:14:24 +00:00
biondizzle	ef7e0d63bb	Add --warmup-gsa flag: fix attention/router gsa after first decode step to eliminate amax kernel launches	2026-06-01 23:04:44 +00:00
biondizzle	008e59eb90	Add --profile flag: per-component GPU timing with CUDA sync (embed+layers, lm_head, sampling)	2026-06-01 23:03:46 +00:00
biondizzle	106f42c93c	auto: pre-test commit	2026-06-01 23:01:34 +00:00
biondizzle	e53645654d	Reduce hot-path .item() syncs: gate li>=58 diagnostics behind VERBOSE>=2, topk on float	2026-06-01 22:33:03 +00:00
biondizzle	6f4bbc997a	Add sync after sampler for step<3 to catch async CUDA errors early	2026-06-01 22:32:40 +00:00
biondizzle	5493a8727e	P7: compressor early return + decode buffering (skip GEMMs when n_complete=0); sampler SMEM fix (LK=24 fits 48KB default); topk on float not bf16	2026-06-01 22:29:56 +00:00
biondizzle	828ba73dff	Update PERFORMANCE_AUDIT.md: P0 complete, P2/P3/P5 done	2026-06-01 22:21:31 +00:00
biondizzle	583ad6cfe6	P0 complete: Kill .item() in grouped_linear, reduce hot-path syncs. grouped_linear.py: Replace .item() gsa + Python quantize with quantize_nvfp4_gpu_fused (zero CPU syncs). Flatten all groups into (G*T, D), single fused kernel launch, GPU-only gsa copy. single_shot_inference.py: Reduce torch.cuda.synchronize() to every 20 steps instead of every step. Gate per-layer diagnostics to li<3 or li>=58 (avoid 61 .item() calls per decode step).	2026-06-01 22:21:12 +00:00
biondizzle	8767c263ab	Add cuda.synchronize + better logits validation after lm_head. Catch CUDA errors at the source instead of seeing them surfaced at torch.topk. Print logits stats every step.	2026-06-01 22:06:41 +00:00
biondizzle	2a6f9a10b1	lm_head: fall back to BF16 F.linear for stability. NVFP4 quantize_from_buffer produces CUDA error on large-magnitude inputs (|X|>500 at L60 output). BF16 lm_head is correct and only runs once per decode step, so it is not a bottleneck. TODO: debug the NVFP4 path for large activations and re-enable.	2026-06-01 22:05:22 +00:00
biondizzle	9bad30c777	Add logits validation debug before topk sampling	2026-06-01 21:59:23 +00:00
biondizzle	9fec7d609e	Fix gsa_buffer shape mismatch for MoE (M>1 rows). compute_amax_gsa returns a scalar, but quantize_from_buffer expects (M,). Broadcast the scalar gsa to (M,): all rows use the same gsa (global max).	2026-06-01 21:33:59 +00:00
biondizzle	cacf64232e	CRITICAL FIX: fused_amax_quantize cross-CTA race condition. The single-kernel approach used __syncthreads() for cross-CTA amax reduction, but __syncthreads() only syncs within a CTA (same blockIdx). CTA 0 reading s_amax[1] before CTA 1 writes = race condition = garbage gsa. Result: residual |X| exploded to 10^37 by L0. F_attn and F_ffn were 0.0. Fix: Two-kernel approach (correct, zero CPU syncs). Kernel 1: amax_gsa.cu computes gsa on GPU, returns GPU tensor. Kernel 2: quantize_nvfp4_from_buffer reads gsa from GPU buffer. The fused_amax_quantize.cu now exports quantize_nvfp4_from_buffer and deinterleave_quantize_from_buffer (gsa from GPU buffer, not kernel param). Same P0 win: zero .item() syncs. Two kernel launches instead of one, but correctness > shaving one launch.	2026-06-01 21:26:51 +00:00
biondizzle	e3412cf913	P5: In-place RoPE, no x.clone(), no empty_like allocation. Eliminates 183 kernel launches per decoded token from pointless memcpy. Operates on rope dims in-place via views instead of cloning the full tensor and allocating an empty_like buffer.	2026-06-01 21:18:41 +00:00
biondizzle	00746c2d2b	Fix module path: move loader code from __init__.py to loader.py. quantize.py and others import from dsv4.kernels.cuda.loader, so the module must be a separate file, not just __init__.py.	2026-06-01 21:18:29 +00:00
biondizzle	230d28e562	Fix KVCache constructor call: device as keyword arg, not positional. KVCache signature has max_comp before device, so positional pass of dev was hitting max_comp parameter instead of device.	2026-06-01 21:11:01 +00:00
biondizzle	c9b92cd840	Remove P1 from audit: multi-GPU layout is correct for the reference script. The single_shot is a reference for vLLM/SGLang integration. The layer-pipeline sharding (gpu = li % NUM_GPUS) is the right pattern for this reference. EP/TP sharding belongs in the actual vLLM integration, not here.	2026-06-01 21:07:59 +00:00
biondizzle	c8faf20a99	P0 COMPLETE: Eliminate ALL .item() CPU-GPU syncs from NVFP4 activation path. Fused kernels (zero CPU sync, single kernel launch per projection): (1) fused_amax_quantize.cu: amax→gsa→quantize in one pass. Replaces two-step compute_amax_gsa_gpu + quantize_nvfp4_gpu (had .item() sync). (2) fused_deinterleave_amax_quantize.cu: Same for MoE fused_swiglu L2 path. Deinterleave + amax + quantize in one pass. Replaces compute_amax_gsa_gpu + deinterleave_quantize_nvfp4_cuda (had .item() sync). All kernel loaders use dsv4/kernels/cuda/loader.py (compile-once cache). Was JIT-compiling on every call via torch.utils.cpp_extension.load (~100ms/call, ~500 calls/token). Now compiles once and reuses the cached module. Updated layers: linear.py Nvfp4Linear._run_impl (fused kernel, gsa via GPU buffer); moe.py Nvfp4MoE._run_impl (fused for L1 and L2, both fused_swiglu and non-fused paths); shared_expert.py (fused for L1 and L2); quantize.py (all functions use module loader cache); sampler.py (uses module loader cache); indexer/score_topk.py (uses module loader cache). P2: Vectorized KVCache.append_swa: index_copy_ instead of Python loop. 2 kernel launches instead of 2T. No .item() in comp_pos either. P3: Pre-allocated comp_kv buffers: O(1) append instead of O(N) torch.cat. max_comp=32768 per layer (32MB). No more quadratic memory growth. ~486 .item() syncs per decoded token → ~0 (only argmax + token decode remain).	2026-06-01 21:05:03 +00:00
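
Note on e9506e0c20 (C1): the per-layer max_comp sizing described in that commit is a ceiling division of the context length by the layer's compression ratio. The sketch below reproduces only the arithmetic quoted in the commit message. The function name and the two ratio constants are assumptions for illustration: the ratios 4 (CSA) and 128 (HCA) are inferred from the quoted results (8192 → 2048 and 8192 → 64), not read from the repository.

```python
def max_comp_entries(max_context: int, compress_ratio: int) -> int:
    """Number of compressed KV slots needed to cover max_context tokens.

    Ceiling division, as in the commit message:
    (max_context + compress_ratio - 1) // compress_ratio
    """
    if max_context < 0 or compress_ratio <= 0:
        raise ValueError("max_context must be >= 0 and compress_ratio > 0")
    return (max_context + compress_ratio - 1) // compress_ratio


# Assumed ratios, inferred from the numbers in the commit message.
CSA_RATIO = 4
HCA_RATIO = 128

csa_entries = max_comp_entries(8192, CSA_RATIO)  # 2048
hca_entries = max_comp_entries(8192, HCA_RATIO)  # 64

# A context that is not a multiple of the ratio still gets a slot
# for the trailing partial block.
partial = max_comp_entries(8193, CSA_RATIO)      # 2049
```

Sizing each layer type this way, rather than with one hardcoded 65536, is what keeps HCA layers from reserving slots they can never fill at the configured context length.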
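
Note on 230d28e562: the KVCache constructor bug is the usual positional-binding hazard when a parameter is added ahead of an existing one. The sketch below is a hypothetical stand-in, not the repository's KVCache: the class names, the head_dim parameter, and the default values are assumptions. It shows how a device passed positionally lands in max_comp without any error, and how making the parameters keyword-only turns that mistake into an immediate TypeError.

```python
class KVCacheSketch:
    """Hypothetical signature where max_comp precedes device."""

    def __init__(self, head_dim, max_comp=65536, device="cpu"):
        self.head_dim = head_dim
        self.max_comp = max_comp
        self.device = device


# Positional pass: the device string silently binds to max_comp.
wrong = KVCacheSketch(512, "cuda:0")
# Keyword pass: device goes where it was meant to go.
right = KVCacheSketch(512, device="cuda:0")


class StrictKVCacheSketch:
    """Same fields, but everything after head_dim is keyword-only."""

    def __init__(self, head_dim, *, max_comp=65536, device="cpu"):
        self.head_dim = head_dim
        self.max_comp = max_comp
        self.device = device


try:
    StrictKVCacheSketch(512, "cuda:0")
    strict_rejected = False
except TypeError:
    # The misplaced positional argument is rejected at the call site.
    strict_rejected = True
```

The commit fixed the call site by passing device as a keyword; the keyword-only variant is one possible way to guard against the same slip recurring.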

2176 Commits