nvfp4-megamoe-kernel

Author	SHA1	Message	Date
biondizzle	cec505ce14	add CUDA test runner script (screen-based, follows harness pattern)	2026-05-28 07:31:41 +00:00
biondizzle	2eb44a00bf	fix(tmem): warp-collective TMEM ops + one-way correction epilogue Key fixes for fmha_epilogue_sm100.cuh hang: - tcgen05.ld/st are WARP-COLLECTIVE: ALL 32 lanes must execute - Old code guarded TMEM ops with if(tid==0) = warp divergence = HANG - tmem_dealloc now uses tmem_base (value from alloc), not SMEM pointer - Compute attention in SMEM, then do one-way TMEM pipeline: SMEM → TMEM (warp-collective store) → regs (warp-collective load) → normalize in regs → BF16 cast → GMEM - This proves the MoE-style one-way correction epilogue on FMHA Also: enable TMEM kernel test + hd=128 in standalone test	2026-05-28 07:27:25 +00:00
biondizzle	bd16e8fa85	fix: use tcgen05.wait::st/ld instead of nonexistent tcgen05.fence ROOT CAUSE of TMET hang: tcgen05.fence.cta_group::1.sync.aligned is NOT a valid PTX instruction. The correct TMEM ordering primitives are: - tcgen05.wait::st.sync.aligned (wait for TMEM stores to complete) - tcgen05.wait::ld.sync.aligned (wait for TMEM loads to complete) Found in cutlass/arch/barrier.h fence_view_async_tmem_store/load.	2026-05-28 07:12:26 +00:00
biondizzle	ba1e81f2dc	test: minimal TMEM isolation test (alloc, store, load, dealloc)	2026-05-28 07:09:06 +00:00
biondizzle	d46ae8b967	test: disable TMEM test (hanging), verify reference still works	2026-05-28 06:46:27 +00:00
biondizzle	e58980f80e	fix: increase test timeout for TMEM kernel	2026-05-28 06:41:59 +00:00
biondizzle	73d1e38129	fix: last HD→HD_val	2026-05-28 06:32:55 +00:00
biondizzle	e940786fd5	fix: HD_val variable name in test	2026-05-28 06:32:01 +00:00
biondizzle	e173295a3a	FMHA SM100: Refactor into common + reference + TMEM epilogue headers - fmha_common.cuh: BF16, TMEM ops, warp reductions (shared) - fmha_sm100.cuh: Phase 1 reference (SMEM-based, cos 0.999999) - fmha_epilogue_sm100.cuh: Phase 2 TMEM+correction epilogue (Priority 2) - Test both kernels at hd=64 and hd=128	2026-05-28 06:31:05 +00:00
biondizzle	a73fb689f9	fix: dispatch template HD at compile time	2026-05-28 06:29:10 +00:00
biondizzle	bcc5d0b6cb	FMHA SM100: Add TMEM+correction epilogue kernel (Priority 2) New file: fmha_epilogue_sm100.cuh - TMEM alloc/dealloc/load/store via tcgen05 PTX - One-way correction epilogue: TMEM→regs→normalize→BF16→GMEM - D1.5 fix: O rescale in REGISTERS (TMEM→regs→multiply→TMEM) - Same pattern as MoE epilogue but with normalize instead of SwiGLU - Unblocks D2 multi-CTA and NVFP4-1.2 (register slot for FP4 pack) Test: hd=64 + hd=128, reference vs TMEM kernels	2026-05-28 06:27:56 +00:00
biondizzle	7fb838913f	fix: include path for standalone test	2026-05-28 05:31:39 +00:00
biondizzle	99b35eb2de	test: standalone CUDA test for FMHA SM100 (no PyTorch needed)	2026-05-28 05:31:03 +00:00
biondizzle	97df02ea07	fix: -Xcompiler -fPIC for nvcc shared library	2026-05-28 05:22:15 +00:00
biondizzle	4dfb71bc20	test: nvcc direct compilation test (avoid torch JIT __bf16 ICE)	2026-05-28 05:21:41 +00:00
biondizzle	09dfd4a41f	fix: rename .cpp to .cu for CUDA compilation	2026-05-28 05:16:41 +00:00
biondizzle	4c194b7254	fix: add CUDA include path for host compiler	2026-05-28 05:15:48 +00:00
biondizzle	f0660d0bd7	fix: use C++20 for cuda_bf16.h compat	2026-05-28 05:13:18 +00:00
biondizzle	6bd3356582	fix: include cuda_bf16.h unconditionally, add --expt-relaxed-constexpr	2026-05-28 05:13:01 +00:00
biondizzle	3eb432d064	fix: CUTLASS path /root/cutlass	2026-05-28 05:06:48 +00:00
biondizzle	66d9f5c60f	fix: --x cu for .cuh compilation	2026-05-28 05:06:13 +00:00
biondizzle	4dcd80ea0d	fix: use full nvcc path	2026-05-28 05:05:55 +00:00
biondizzle	fac7275f2b	test: nvcc compilation test for FMHA SM100 kernel	2026-05-28 05:05:31 +00:00
biondizzle	230c350c77	FMHA SM100: Raw CUDA C++ decode kernel — initial skeleton 6-warp specialization using CUTLASS C++ atoms directly: - tcgen05.mma for QK (SMEM→SMEM→TMEM) and PV (TMEM→SMEM→TMEM) - TMEM accumulator with one-way correction epilogue (TMEM→regs→SMEM→GMEM) - In-kernel O rescale via registers (fixes D1.5 TMEM round-trip!) - D3/D4/D5c masks, NVFP4 quantize helpers, FP8 E4M3 encode - PyTorch binding with head_dim template dispatch This bypasses all CuTeDSL limitations: float→int, TMEM round-trip, multi-CTA, hd=512 MLIR compilation hang.	2026-05-28 05:04:44 +00:00
biondizzle	b2d0417a46	NVFP4-1.1: Mark fp4_quant.py as toolchain-blocked, clean up test files CuTeDSL MLIR pipeline cannot lower any float→int op. All approaches fail: arith.fptosi, llvm.inline_asm, nvvm.inline_ptx, llvm.bitcast. Production path: dsv4/kernels/cuda/quantize_nvfp4.cu (raw CUDA, works). For NVFP4-1.1 fusion, use post-epilogue CUDA kernel approach. Removed dead test files (test_ptx_, test_fp4_isolate, test_minimal_cmp*, test_dtype_store, test_threshold_round).	2026-05-28 04:59:01 +00:00
biondizzle	650bcdcccf	test: f32 vs i32 GMEM store	2026-05-28 04:57:45 +00:00
biondizzle	cc37ce6dbf	test: absolute minimum CuTeDSL int store + float cmp	2026-05-28 04:56:16 +00:00
biondizzle	c4fdfc7789	test: isolate which fp4_quant function causes LLVM ERROR	2026-05-28 04:55:23 +00:00
biondizzle	71ee1485ea	test: constraints runner	2026-05-28 04:50:14 +00:00
biondizzle	c55c237fcd	test: different constraint strings + bitcast approach	2026-05-28 04:50:09 +00:00
biondizzle	4806e9ba11	test: llvm.inline_asm with Int32._mlir_type matching cvt_i8_bf16 pattern	2026-05-28 04:49:02 +00:00
biondizzle	ade49d964d	fix: test_ptx_runner path	2026-05-28 04:47:04 +00:00
biondizzle	dc9596c6bc	test: sub-process isolation for each f32→i32 approach	2026-05-28 04:46:45 +00:00
biondizzle	136a89f4e3	test: compare nvvm.inline_ptx approaches + arith.fptosi	2026-05-28 04:46:06 +00:00
biondizzle	eebf33b97d	test: clean minimal nvvm.inline_ptx test	2026-05-28 04:45:21 +00:00
biondizzle	882d48588b	test: debug nvvm.inline_ptx with CUTLASS_LOG_LEVEL=DEBUG	2026-05-28 04:44:35 +00:00
biondizzle	3ffb3b807a	test: minimal nvvm.inline_ptx isolation test	2026-05-28 04:43:18 +00:00
biondizzle	1cbb3cf752	NVFP4-1.1: Replace threshold rounding with inline PTX cvt.rni/rz/rmi - Add f32_to_i32_rni (cvt.rni.s32.f32) for round-to-nearest-even - Add f32_to_i32_rz (cvt.rzi.s32.f32) for round-toward-zero - Add f32_to_i32_rmi (cvt.rmi.s32.f32) for round-to-minus-infinity - Replace round_rne_u0_8 and abs_scaled_to_e2m1_idx threshold hacks with proper PTX hardware rounding in fp8_e4m3_from_float32 - quantize_e2m1_nibble now uses f32_to_i32_rni + LUT logic for half_step - Add test_ptx_convert.py for inline PTX conversion verification - This is the CORRECT approach per NVFP4-1.1_INLINE_PTX_APPROACH.md option 1	2026-05-28 04:40:17 +00:00
biondizzle	2777ebfe8e	NVFP4-1.1: ultra-minimal test — Float32 comparison + Int32 select	2026-05-28 04:35:06 +00:00
biondizzle	2087eaef49	NVFP4-1.1: minimal threshold rounding test	2026-05-28 04:33:38 +00:00
biondizzle	1828a71cde	NVFP4-1.1: test kernel uses Float32 input (avoids BF16 scalar load issue)	2026-05-28 04:32:08 +00:00
biondizzle	accc66741d	NVFP4-1.1: update test kernel with threshold rounding API	2026-05-28 04:27:29 +00:00
biondizzle	c3d5a7b82f	NVFP4-1.1: try .to(Int32) for float-to-int conversion	2026-05-28 04:02:45 +00:00
biondizzle	dc35d29811	NVFP4-1.1: fix cute.arch.store signature - store(ptr, val) not store(ptr, val, dtype)	2026-05-28 04:01:38 +00:00
biondizzle	a05a76bb6b	NVFP4-1.1: add Int32 cast diagnostic test	2026-05-28 03:59:01 +00:00
biondizzle	6f94925491	NVFP4-1.1: fix cute.math.fmax -> cute.arch.fmax (correct CuTeDSL API)	2026-05-28 03:48:51 +00:00
biondizzle	60790564f0	NVFP4-1.1: fix test - two-pass kernel, cute.arch.store confirmed on B200	2026-05-28 03:46:45 +00:00
biondizzle	a41de129cb	NVFP4-1.1: fix test kernel - use cute.copy instead of cute.arch.store	2026-05-28 03:42:24 +00:00
biondizzle	3a78bdf570	NVFP4-1.1: add CuTeDSL kernel test for FP4 quantization	2026-05-28 03:40:54 +00:00
biondizzle	80b6b79f9e	NVFP4-1.1: FP4 quantization primitives for CuTeDSL kernels - fp8_e4m3_from_float32: manual FP8 E4M3 cast (bias=7, exp 0-15 valid, NaN guard for exp=15/mant=7, mantissa overflow handling) - fp8_e4m3_to_float32: dequantize FP8 E4M3 bit pattern back to Float32 - half_step_to_e2m1_idx: E2M1 step mapping (0-12 → 0-7) - quantize_e2m1_nibble: per-element E2M1 quantize + sign + pack - Verified 0/500 trial failures against Python reference - Key fixes discovered during validation: 1. FP8 E4M3 bias is 7, NOT 8 2. Exponent range is 0-15 (exp=15/mant=7 is NaN; others valid) 3. Subnormal formula: val = m * 2^(-9) = m/512 (NOT m/1024) 4. Round-to-nearest-even (not round-half-up) for half_step and mantissa 5. Mantissa overflow (round to 8) must increment exponent	2026-05-28 03:39:55 +00:00

1 2 3 4 5 ...

607 Commits