nvfp4-megamoe-kernel

Author	SHA1	Message	Date
biondizzle	97df02ea07	fix: -Xcompiler -fPIC for nvcc shared library	2026-05-28 05:22:15 +00:00
biondizzle	4dfb71bc20	test: nvcc direct compilation test (avoid torch JIT __bf16 ICE)	2026-05-28 05:21:41 +00:00
biondizzle	373900fa08	FMHA SM100: Fix launch wrapper to match new kernel API	2026-05-28 05:20:31 +00:00
biondizzle	a30ebfb197	FMHA SM100: Full kernel with TMET PTX, UMMA descriptors, softmax loop - TMEM alloc/dealloc/load/store via inline PTX (tcgen05.*) - UMMA SMEM descriptor construction (make_umma_desc) - QK GEMM via tcgen05.mma.kind::f16 inline asm - Online softmax with D3/D4/D5c masks - O rescale in REGISTERS (D1.5 fix — no TMEM round-trip!) - FP4 quantize helpers (hs2e2m1, fp8_e4m3_encode) - Still needs: PV GEMM, proper P staging, TMEM O load/store	2026-05-28 05:19:34 +00:00
biondizzle	09dfd4a41f	fix: rename .cpp to .cu for CUDA compilation	2026-05-28 05:16:41 +00:00
biondizzle	4c194b7254	fix: add CUDA include path for host compiler	2026-05-28 05:15:48 +00:00
biondizzle	48baea7728	FMHA SM100: Remove CUTLASS includes, write raw PTX inline asm CUTLASS headers transitively include cuda_bf16.h which has a CUDA 13.2 in_place_from bug. Writing tcgen05 PTX directly via inline asm instead. No dependencies on CUTLASS C++ — pure PTX + CUDA runtime.	2026-05-28 05:15:07 +00:00
biondizzle	88d5995ec9	fix: define bf16_t using __bf16 built-in, avoid cuda_bf16.h bug	2026-05-28 05:14:01 +00:00
biondizzle	f0660d0bd7	fix: use C++20 for cuda_bf16.h compat	2026-05-28 05:13:18 +00:00
biondizzle	6bd3356582	fix: include cuda_bf16.h unconditionally, add --expt-relaxed-constexpr	2026-05-28 05:13:01 +00:00
biondizzle	c1266b5275	fix: include cuda_bf16.h only in device code	2026-05-28 05:12:30 +00:00
biondizzle	a64e55665b	fix: avoid cuda_bf16.h, use inline PTX for BF16 conversion	2026-05-28 05:12:08 +00:00
biondizzle	1734d13f60	fix: restore cuda_bf16.h include	2026-05-28 05:11:39 +00:00
biondizzle	8783a25deb	fix: guard cuda_bf16.h with __CUDA_ARCH__	2026-05-28 05:11:11 +00:00
biondizzle	5e389b5ed9	fix: remove duplicate desc declaration	2026-05-28 05:10:43 +00:00
biondizzle	7ac2499266	fix: defer UMMA descriptor — use placeholder for now	2026-05-28 05:10:15 +00:00
biondizzle	db17d8db9a	fix: cvta.to.shared PTX for SMEM address	2026-05-28 05:09:50 +00:00
biondizzle	e12a81ae36	fix: include cstdint	2026-05-28 05:09:28 +00:00
biondizzle	0c73a024ba	fix: guard CUTLASS includes with __CUDA_ARCH__ for host compilation	2026-05-28 05:09:07 +00:00
biondizzle	41e59a2423	FMHA SM100: Add SMEM descriptor construction for tcgen05.mma	2026-05-28 05:08:25 +00:00
biondizzle	3eb432d064	fix: CUTLASS path /root/cutlass	2026-05-28 05:06:48 +00:00
biondizzle	66d9f5c60f	fix: --x cu for .cuh compilation	2026-05-28 05:06:13 +00:00
biondizzle	4dcd80ea0d	fix: use full nvcc path	2026-05-28 05:05:55 +00:00
biondizzle	fac7275f2b	test: nvcc compilation test for FMHA SM100 kernel	2026-05-28 05:05:31 +00:00
biondizzle	230c350c77	FMHA SM100: Raw CUDA C++ decode kernel — initial skeleton 6-warp specialization using CUTLASS C++ atoms directly: - tcgen05.mma for QK (SMEM→SMEM→TMEM) and PV (TMEM→SMEM→TMEM) - TMEM accumulator with one-way correction epilogue (TMEM→regs→SMEM→GMEM) - In-kernel O rescale via registers (fixes D1.5 TMEM round-trip!) - D3/D4/D5c masks, NVFP4 quantize helpers, FP8 E4M3 encode - PyTorch binding with head_dim template dispatch This bypasses all CuTeDSL limitations: float→int, TMEM round-trip, multi-CTA, hd=512 MLIR compilation hang.	2026-05-28 05:04:44 +00:00
biondizzle	b2d0417a46	NVFP4-1.1: Mark fp4_quant.py as toolchain-blocked, clean up test files CuTeDSL MLIR pipeline cannot lower any float→int op. All approaches fail: arith.fptosi, llvm.inline_asm, nvvm.inline_ptx, llvm.bitcast. Production path: dsv4/kernels/cuda/quantize_nvfp4.cu (raw CUDA, works). For NVFP4-1.1 fusion, use post-epilogue CUDA kernel approach. Removed dead test files (test_ptx_, test_fp4_isolate, test_minimal_cmp*, test_dtype_store, test_threshold_round).	2026-05-28 04:59:01 +00:00
biondizzle	650bcdcccf	test: f32 vs i32 GMEM store	2026-05-28 04:57:45 +00:00
biondizzle	cc37ce6dbf	test: absolute minimum CuTeDSL int store + float cmp	2026-05-28 04:56:16 +00:00
biondizzle	c4fdfc7789	test: isolate which fp4_quant function causes LLVM ERROR	2026-05-28 04:55:23 +00:00
biondizzle	b3eb46d4ec	NVFP4-1.1: Restore threshold RNE approach — inline PTX blocked by toolchain CuTeDSL MLIR pipeline cannot lower any float→int conversion: arith.fptosi, llvm.inline_asm, nvvm.inline_ptx, llvm.bitcast — all fail with 'LLVM ERROR: unsupported operation'. The pipeline has no path from Float32 to Int32 MLIR types. Threshold RNE is the mathematically correct software implementation: - Float32 comparisons select Int32 constants (no arith.fptosi) - > vs >= at .5 boundaries implements round-to-nearest-even - Equivalent to PTX cvt.rni.s32.f32 for bounded ranges	2026-05-28 04:54:27 +00:00
biondizzle	71ee1485ea	test: constraints runner	2026-05-28 04:50:14 +00:00
biondizzle	c55c237fcd	test: different constraint strings + bitcast approach	2026-05-28 04:50:09 +00:00
biondizzle	4806e9ba11	test: llvm.inline_asm with Int32._mlir_type matching cvt_i8_bf16 pattern	2026-05-28 04:49:02 +00:00
biondizzle	ade49d964d	fix: test_ptx_runner path	2026-05-28 04:47:04 +00:00
biondizzle	dc9596c6bc	test: sub-process isolation for each f32→i32 approach	2026-05-28 04:46:45 +00:00
biondizzle	136a89f4e3	test: compare nvvm.inline_ptx approaches + arith.fptosi	2026-05-28 04:46:06 +00:00
biondizzle	eebf33b97d	test: clean minimal nvvm.inline_ptx test	2026-05-28 04:45:21 +00:00
biondizzle	882d48588b	test: debug nvvm.inline_ptx with CUTLASS_LOG_LEVEL=DEBUG	2026-05-28 04:44:35 +00:00
biondizzle	3ffb3b807a	test: minimal nvvm.inline_ptx isolation test	2026-05-28 04:43:18 +00:00
biondizzle	e33c48e44c	NVFP4-1.1: Use nvvm.inline_ptx instead of llvm.inline_asm for f32→i32 llvm.inline_asm fails with 'LLVM ERROR: unsupported operation' in CuTeDSL lowering pipeline. Switch to nvvm.inline_ptx which is native to the NVVM dialect and lowers correctly. - f32_to_i32_rni: cvt.rni.s32.f32 via nvvm.inline_ptx - f32_to_i32_rz: cvt.rzi.s32.f32 via nvvm.inline_ptx - f32_to_i32_rmi: cvt.rmi.s32.f32 via nvvm.inline_ptx	2026-05-28 04:42:33 +00:00
biondizzle	74dba6ab9d	auto: pre-test commit	2026-05-28 04:40:20 +00:00
biondizzle	1cbb3cf752	NVFP4-1.1: Replace threshold rounding with inline PTX cvt.rni/rz/rmi - Add f32_to_i32_rni (cvt.rni.s32.f32) for round-to-nearest-even - Add f32_to_i32_rz (cvt.rzi.s32.f32) for round-toward-zero - Add f32_to_i32_rmi (cvt.rmi.s32.f32) for round-to-minus-infinity - Replace round_rne_u0_8 and abs_scaled_to_e2m1_idx threshold hacks with proper PTX hardware rounding in fp8_e4m3_from_float32 - quantize_e2m1_nibble now uses f32_to_i32_rni + LUT logic for half_step - Add test_ptx_convert.py for inline PTX conversion verification - This is the CORRECT approach per NVFP4-1.1_INLINE_PTX_APPROACH.md option 1	2026-05-28 04:40:17 +00:00
biondizzle	2777ebfe8e	NVFP4-1.1: ultra-minimal test — Float32 comparison + Int32 select	2026-05-28 04:35:06 +00:00
biondizzle	2087eaef49	NVFP4-1.1: minimal threshold rounding test	2026-05-28 04:33:38 +00:00
biondizzle	1828a71cde	NVFP4-1.1: test kernel uses Float32 input (avoids BF16 scalar load issue)	2026-05-28 04:32:08 +00:00
biondizzle	d2aa93aad7	NVFP4-1.1: fix Int32 clamping — use comparisons instead of fmin/fmax (float-only ops)	2026-05-28 04:30:06 +00:00
biondizzle	accc66741d	NVFP4-1.1: update test kernel with threshold rounding API	2026-05-28 04:27:29 +00:00
biondizzle	dabcc415a8	NVFP4-1.1: threshold rounding for float-to-int — avoids CuTeDSL limitation All float-to-int conversions replaced with threshold comparisons: - round_rne_u0_8: mantissa rounding via Float32 comparisons → Int32 constants - abs_scaled_to_e2m1_idx: direct \|scaled\| → E2M1 index (no half_step needed) - Verified 0/500 trial failures against Python reference Key thresholds (RNE boundaries): - 0.25, 0.75, 1.25, 1.75, 2.75, 3.75, 5.25 with > vs >= for RNE tie-breaking - Fixed: 2.75 must use >= (not >) to match round(5.5)=6 RNE	2026-05-28 04:26:40 +00:00
biondizzle	acf46c494c	NVFP4-1.1: update approach doc and fp4_quant with CuTeDSL API fixes	2026-05-28 04:09:58 +00:00
biondizzle	f3a2b37d70	NVFP4-1.1: document CuTeDSL float-to-int limitation, revise approach to compact SwiGLU output	2026-05-28 04:06:27 +00:00

1 2 3 4 5 ...

1315 Commits