nvfp4-megamoe-kernel

Author	SHA1	Message	Date
biondizzle	7e3fb5f4d0	fix: add missing import for quantize_nvfp4_gpu in linear.py fixed-gsa path	2026-06-02 04:28:29 +00:00
biondizzle	f52eedbdce	Add production-value tests: ALL tests use Pro config (61L, HD=512, 384 experts, HCA=128, 1M context) Previous unit tests used toy values (HD=64-256, T=16, small N). These tests validate the actual production configuration: - FMHA: HD=512, 128 Q heads, N=128/2048/8192 - Compression: CSA T=4096, HCA T=16384, full 1M context - NVFP4: production weight shapes (q_a, kv, wo_a, gate) - MoE: 384 experts, top-6, 3072 intermediate - mHC: 4 streams, 61 layers, residual bounded, doubly-stochastic - Router: 384 experts hash + noaux-TC - Memory budget: 1M context KV pool, 8-GPU weight distribution	2026-06-02 04:10:39 +00:00
biondizzle	668a42e71a	debug: print mhc_sinkhorn CUDA kernel compile errors	2026-06-02 04:02:34 +00:00
biondizzle	ca53bdb8e1	perf: skip MQA GQA expansion in FMHA (stride=0, no 128x K/V copy)	2026-06-02 03:54:03 +00:00
biondizzle	7b82d31330	perf: fused mHC Sinkhorn CUDA kernel (1 launch vs 38)	2026-06-02 03:50:57 +00:00
biondizzle	f0dec9f6bd	profile: fine-grained attention component timing	2026-06-02 03:08:34 +00:00
biondizzle	7114c48575	fix: parenthesize profile_detail condition	2026-06-02 02:56:13 +00:00
biondizzle	4734e894c7	profile: add per-layer attn vs ffn timing with CUDA sync	2026-06-02 02:46:35 +00:00
biondizzle	4017ef2f16	fix: accurate profile sync + remove paris_tids 129K iteration	2026-06-01 23:55:26 +00:00
biondizzle	73ae9393da	FIX: RoPE cache 8192→65536 (original_max_position_embeddings), KVCache max_comp 32768→65536	2026-06-01 23:18:37 +00:00
biondizzle	36f9782bad	Add thinking/Paris token logit check on step 0 for quality debugging	2026-06-01 23:14:24 +00:00
biondizzle	ef7e0d63bb	Add --warmup-gsa flag: fix attention/router gsa after first decode step to eliminate amax kernel launches	2026-06-01 23:04:44 +00:00
biondizzle	008e59eb90	Add --profile flag: per-component GPU timing with CUDA sync (embed+layers, lm_head, sampling)	2026-06-01 23:03:46 +00:00
biondizzle	106f42c93c	auto: pre-test commit	2026-06-01 23:01:34 +00:00
biondizzle	e53645654d	Reduce hot-path .item() syncs: gate li>=58 diagnostics behind VERBOSE>=2, topk on float	2026-06-01 22:33:03 +00:00
biondizzle	6f4bbc997a	Add sync after sampler for step<3 to catch async CUDA errors early	2026-06-01 22:32:40 +00:00
biondizzle	5493a8727e	P7: compressor early return + decode buffering (skip GEMMs when n_complete=0); sampler SMEM fix (LK=24 fits 48KB default); topk on float not bf16	2026-06-01 22:29:56 +00:00
biondizzle	828ba73dff	Update PERFORMANCE_AUDIT.md: P0 complete, P2/P3/P5 done	2026-06-01 22:21:31 +00:00
biondizzle	583ad6cfe6	P0 complete: Kill .item() in grouped_linear, reduce hot-path syncs - grouped_linear.py: Replace .item() gsa + Python quantize with quantize_nvfp4_gpu_fused (zero CPU syncs). Flatten all groups into (G*T, D), single fused kernel launch, GPU-only gsa copy. - single_shot_inference.py: Reduce torch.cuda.synchronize() to every 20 steps instead of every step. Gate per-layer diagnostics to li<3 or li>=58 (avoid 61 .item() calls per decode step).	2026-06-01 22:21:12 +00:00
biondizzle	8767c263ab	Add cuda.synchronize + better logits validation after lm_head Catch CUDA errors at the source instead of seeing them surfaced at torch.topk. Print logits stats every step.	2026-06-01 22:06:41 +00:00
biondizzle	2a6f9a10b1	lm_head: fall back to BF16 F.linear for stability NVFP4 quantize_from_buffer produces CUDA error on large-magnitude inputs (|X|>500 at L60 output). BF16 lm_head is correct and only runs once per decode step — not a bottleneck. TODO: debug the NVFP4 path for large activations and re-enable.	2026-06-01 22:05:22 +00:00
biondizzle	9bad30c777	Add logits validation debug before topk sampling	2026-06-01 21:59:23 +00:00
biondizzle	9fec7d609e	Fix gsa_buffer shape mismatch for MoE (M>1 rows) compute_amax_gsa returns a scalar, but quantize_from_buffer expects (M,). Broadcast the scalar gsa to (M,) — all rows use the same gsa (global max).	2026-06-01 21:33:59 +00:00
biondizzle	cacf64232e	CRITICAL FIX: fused_amax_quantize cross-CTA race condition The single-kernel approach used __syncthreads() for cross-CTA amax reduction, but __syncthreads() only syncs within a CTA (same blockIdx). CTA 0 reading s_amax[1] before CTA 1 writes = race condition = garbage gsa. Result: residual |X| exploded to 10^37 by L0. F_attn and F_ffn were 0.0. Fix: Two-kernel approach (correct, zero CPU syncs): Kernel 1: amax_gsa.cu — computes gsa on GPU, returns GPU tensor Kernel 2: quantize_nvfp4_from_buffer — reads gsa from GPU buffer The fused_amax_quantize.cu now exports quantize_nvfp4_from_buffer and deinterleave_quantize_from_buffer (gsa from GPU buffer, not kernel param). Same P0 win: zero .item() syncs. Two kernel launches instead of one, but correctness > shaving one launch.	2026-06-01 21:26:51 +00:00
biondizzle	e3412cf913	P5: In-place RoPE — no x.clone(), no empty_like allocation Eliminates 183 kernel launches per decoded token from pointless memcpy. Operates on rope dims in-place via views instead of cloning the full tensor and allocating an empty_like buffer.	2026-06-01 21:18:41 +00:00
biondizzle	00746c2d2b	Fix module path: move loader code from __init__.py to loader.py quantize.py and others import from dsv4.kernels.cuda.loader — the module must be a separate file, not just __init__.py.	2026-06-01 21:18:29 +00:00
biondizzle	230d28e562	Fix KVCache constructor call — device as keyword arg, not positional KVCache signature has max_comp before device, so positional pass of dev was hitting max_comp parameter instead of device.	2026-06-01 21:11:01 +00:00
biondizzle	c9b92cd840	Remove P1 from audit — multi-GPU layout is correct for the reference script The single_shot is a reference for vLLM/SGLang integration. The layer-pipeline sharding (gpu = li % NUM_GPUS) is the right pattern for this reference. EP/TP sharding belongs in the actual vLLM integration, not here.	2026-06-01 21:07:59 +00:00
biondizzle	c8faf20a99	P0 COMPLETE: Eliminate ALL .item() CPU-GPU syncs from NVFP4 activation path Fused kernels (zero CPU sync, single kernel launch per projection): - fused_amax_quantize.cu: amax→gsa→quantize in one pass. Replaces two-step compute_amax_gsa_gpu + quantize_nvfp4_gpu (had .item() sync). - fused_deinterleave_amax_quantize.cu: Same for MoE fused_swiglu L2 path. Deinterleave + amax + quantize in one pass. Replaces compute_amax_gsa_gpu + deinterleave_quantize_nvfp4_cuda (had .item() sync). All kernel loaders use dsv4/kernels/cuda/loader.py (compile-once cache). Was JIT-compiling on every call via torch.utils.cpp_extension.load (~100ms/call, ~500 calls/token). Now compiles once and reuses the cached module. Updated layers: - linear.py Nvfp4Linear._run_impl: fused kernel, gsa via GPU buffer - moe.py Nvfp4MoE._run_impl: fused for L1 and L2 (both fused_swiglu and non-fused paths) - shared_expert.py: fused for L1 and L2 - quantize.py: All functions use module loader cache - sampler.py: Uses module loader cache - indexer/score_topk.py: Uses module loader cache P2: Vectorized KVCache.append_swa — index_copy_ instead of Python loop. 2 kernel launches instead of 2T. No .item() in comp_pos either. P3: Pre-allocated comp_kv buffers — O(1) append instead of O(N) torch.cat. max_comp=32768 per layer (32MB). No more quadratic memory growth. ~486 .item() syncs per decoded token → ~0 (only argmax + token decode remain).	2026-06-01 21:05:03 +00:00
biondizzle	e0607c9e2f	P0: Add fused_amax_quantize.cu kernel + CUDA module loader with compile-once caching - fused_amax_quantize.cu: Single kernel launch computes amax → gsa → NVFP4 quantize Zero CPU-GPU syncs. gsa written to GPU buffer for downstream GEMM global_scale_a. - dsv4/kernels/cuda/__init__.py: Module loader that compiles .cu once and caches. Eliminates JIT recompilation overhead (was ~100ms per call, ~500x per token). - P1 audit corrected: layer-pipe at batch=1 is wrong, but single-GPU doesn't fit (800GB weights vs 192GB HBM). Correct fix is EP=8 for MoE + TP/replicate for dense.	2026-06-01 21:02:03 +00:00
biondizzle	d279965db4	Update PERFORMANCE_AUDIT.md: remove invalidated items, add WIP status - Removed: RoPE 8x duplication (INVALIDATED), mHC BF16 bmm (INVALIDATED), Router .float() cast (INVALIDATED) - Added: WIP section documenting current session's work and status - Added: Cardinal rule violation warning (must use test harness) - Added: Compilation issues found (c10::, x.options()) - P0 marked PARTIAL: amax_gsa kernel written, GEMM path sync-free, quantize kernel still needs .item() - P4 marked DONE - All other items NOT STARTED or DEFERRED	2026-06-01 20:55:44 +00:00
biondizzle	60715f89bc	Fix CUDA kernel compilation: use c10::cuda::getCurrentCUDAStream - amax_gsa.cu: fix at::cuda::getCurrentCUDAStream → c10:: - amax_gsa.cu: fix torch::TensorOptions().device() → x.options() - sampler.cu: same fixes for compilation on B200 - Both kernels now compile cleanly with torch.utils.cpp_extension.load	2026-06-01 20:49:55 +00:00
biondizzle	2dc5b4ec19	Fix sampler kernel stack overflow: reduce MAX_K from 256 to 128 128 * (sizeof(float) + sizeof(int)) = 1KB — within CUDA default stack limit. 256 * 8 = 2KB would overflow.	2026-06-01 20:42:53 +00:00
biondizzle	360f76b970	Performance audit fixes: eliminate CPU-GPU syncs PERFORMANCE_AUDIT.md validation results: 1. Nvfp4Linear .item() sync (610/step) → FIXED: compute_amax_gsa_gpu kernel 2. MoE .item() sync (183/step) → FIXED: same kernel 3. SharedExpert .item() sync (122/step) → FIXED: same kernel 4. FMHA V clone → FIXED: V=K, transpose creates copy implicitly 5. torch.cuda.synchronize in moe_forward → FIXED: conditional on VERBOSE 6. RoPE 8x duplication → INVALIDATED: necessary for per-GPU HBM access 7. mHC BF16 bmm → INVALIDATED: 28K FLOPs, not a bottleneck 8. Router .float() cast → INVALIDATED: needed for FP32 topk, ~1μs New files: - dsv4/kernels/cuda/amax_gsa.cu: GPU-only amax→gsa kernel - dsv4/ops/quantize.py: compute_amax_gsa_gpu() wrapper Net effect: ~915 fewer CPU-GPU syncs per decode step Remaining syncs: ~10 per layer (quantize kernel parameter) + diagnostics	2026-06-01 20:40:19 +00:00
biondizzle	4f698baa5d	Production fused CUDA sampler + decode loop optimizations - Add dsv4/kernels/cuda/sampler.cu: fused temperature + repetition penalty + top-k + top-p (nucleus) sampling, single kernel launch, zero CPU syncs - Add dsv4/model/sampler.py: CUDASampler wrapper + PyTorch reference - Update single_shot_inference.py: - Use CUDASampler for non-greedy decoding (temperature=0.6, top_k=50, top_p=0.95) - Pre-allocate decode buffers (no per-step torch.tensor allocation) - Track thinking tokens (128821/128822) — not garbage for reasoning model - Reduce diagnostic CPU syncs (top-5 every 5 steps, NaN check every 20) - Add --top-k and --top-p CLI args - Default: temperature=0.6 (was 0.0 greedy), rep_penalty=1.1 (was 1.2)	2026-06-01 20:29:57 +00:00
biondizzle	2830a3ee7c	Fix lm_head NVFP4: transpose weight and scales to match Nvfp4Linear checkpoint layout quantize_weight_to_nvfp4 returns (K_packed, N) but Nvfp4Linear expects (N, K_packed) from the checkpoint format. Transpose both fp4 and sf. v-e2e-nvfp4-all-projections	2026-06-01 19:51:21 +00:00
biondizzle	16b72b9581	PERF: Eliminate double quantization for o_a_proj + NVFP4 lm_head 1. o_a_proj (Nvfp4GroupedLinear): Added load_nvfp4_weight() method that loads checkpoint NVFP4 weights directly — no more dequant→BF16→requant. Each group's weight is transposed from (N, K_packed) checkpoint layout to (K_packed, N) layout expected by the grouped GEMM. 2. lm_head: Quantize BF16 weight to NVFP4 at load time, use production Nvfp4Linear GEMM instead of F.linear. Runtime gsa for activation. Frees the 1.8GB BF16 weight after quantization. 3. Hash router (L0-2): Already optimal — tid2eid is an int32 lookup, no GEMM to accelerate.	2026-06-01 19:41:21 +00:00
biondizzle	9a3bb43f20	Set default max-tokens=512 for reasoning model	2026-06-01 17:27:01 +00:00
biondizzle	db6e3545da	Fix: add _use_runtime_gsa=True to router gate GEMM in single_shot The checkpoint-path gate was using the checkpoint's input_scale as gsa — the same E4M3 overflow bug we fixed in Nvfp4Linear/Nvfp4MoE/etc. The runtime-quantized BF16 path was using 1/(6*448) as a fixed gsa. Both now compute gsa from actual activation magnitude at runtime.	2026-06-01 17:25:04 +00:00
biondizzle	9d57b0453b	auto: pre-test commit	2026-06-01 15:04:46 +00:00
biondizzle	1a6d9ee29b	Reset to greedy decoding (temperature=0)	2026-06-01 15:04:02 +00:00
biondizzle	038fe81c68	Fix MoE non-fused L2 runtime gsa + update test harness for extra args	2026-06-01 15:03:54 +00:00
biondizzle	a48d6e14ae	Default temperature=0.7 with rep penalty	2026-06-01 14:55:43 +00:00
biondizzle	1d64b863ca	Add temperature sampling + repetition penalty to fix degenerate repetition With --temperature 0.7 --repetition-penalty 1.2, the model should generate more diverse text instead of repeating 'France' endlessly.	2026-06-01 14:54:49 +00:00
biondizzle	6cca16f97a	Set max-tokens=128 default, clean up for final verification	2026-06-01 14:43:48 +00:00
biondizzle	a0e758ec3b	Set default max-tokens=30 for faster iteration	2026-06-01 14:33:55 +00:00
biondizzle	2b1fca6dae	CRITICAL FIX: runtime activation global scale to prevent E4M3 overflow The checkpoint's input_scale was designed for training-time FP8 quantization, not NVFP4 activation quantization. Using it as gsa causes the block scales derived from x/gsa to exceed the E4M3 maximum (448), leading to systematic magnitude loss in every projection. This accumulates over 61 layers, compressing the logit range and producing garbage tokens. Fix: compute gsa at runtime from actual activation magnitude: gsa = max(|x|) / (6.0 * 448.0) This ensures |x|/gsa ≤ 2688 = 6 × 448 (the FP4 E2M1 maximum times the E4M3 block-scale maximum), so every block scale stays ≤ 448. Applied to: Nvfp4Linear, Nvfp4GroupedLinear, Nvfp4MoE, Nvfp4SharedExpert, Router gate	2026-06-01 14:21:16 +00:00
biondizzle	3b2714410f	Add NVFP4 linear accuracy test: prod vs ref with all-ones input	2026-06-01 14:15:27 +00:00
biondizzle	3e47d5f20a	Add prod vs ref GEMM comparison test + gate logits diagnostic	2026-06-01 14:11:37 +00:00
biondizzle	ad143afe37	Add L58-60 diagnostic: mHC A/B/C, MoE routed/shared, topk	2026-06-01 13:55:55 +00:00
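
The runtime global activation scale described in commit 2b1fca6dae (gsa = max(|x|) / (6.0 * 448.0)) can be illustrated with a small pure-Python sketch. This is not the repository's kernel code: the function names `runtime_gsa` and `block_scales`, the zero-input guard, and the 16-element block size are assumptions made for illustration. It only shows why a runtime gsa keeps every per-block scale within the E4M3 range, while a fixed gsa such as 1/(6*448) may not.

```python
FP4_MAX = 6.0     # largest magnitude representable in FP4 (E2M1)
E4M3_MAX = 448.0  # largest finite magnitude in FP8 E4M3


def runtime_gsa(x):
    """Global activation scale computed from the actual activation magnitude."""
    amax = max(abs(v) for v in x)
    if amax == 0.0:
        return 1.0  # hypothetical guard against division by zero
    return amax / (FP4_MAX * E4M3_MAX)


def block_scales(x, gsa, block=16):
    """Per-block scale factors that would have to be stored in E4M3.

    Assumes 16-element blocks; each block's scale maps its largest
    element onto the FP4 maximum after dividing by gsa.
    """
    scales = []
    for i in range(0, len(x), block):
        block_amax = max(abs(v) for v in x[i:i + block])
        scales.append(block_amax / (FP4_MAX * gsa))
    return scales


# Activations with one large outlier, as in a late-layer residual stream.
x = [0.5, -700.0, 3.0, 12.0] * 8

good = block_scales(x, runtime_gsa(x))
bad = block_scales(x, 1.0 / (FP4_MAX * E4M3_MAX))  # fixed gsa, ignores |x|

print(max(good) <= E4M3_MAX + 1e-9)  # runtime gsa: scales fit in E4M3
print(max(bad) > E4M3_MAX)           # fixed gsa: scales overflow E4M3
```

With the runtime gsa, the block containing the global maximum gets a scale of 448 and all other blocks fall below it. With the fixed gsa, the same block would need a scale far above 448, which is the overflow the commit describes.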

2155 Commits