nvfp4-megamoe-kernel

Author	SHA1	Message	Date
biondizzle	a983a8fb41	WIP: TMEM vector for per-row row_sum (not yet working) Key finding: the root cause is that each epilogue thread owns MULTIPLE rows in the QK C-fragment, so scalar row_max/row_sum are wrong (global across all rows, not per-row). The V=ones diagnostic confirmed: all 128 threads use the same row_sum (from row 114). Tried: TMEM vector store+load of row_sum (composition(tStS, (128,2))). This is a no-op because both write and read use the SAME QK partition with a scalar row_sum. The vector approach only helps when different partitions are used for write vs read, or when per-row values are stored. Next steps: 1. Need PER-ROW row_max and row_sum, not per-thread scalar 2. The CUTLASS FMHA works because each thread owns exactly 1 row 3. Options: restructure thread layout, or compute per-row values differently 4. The vector must store ALL 128 per-row values, then read per-row in C9	2026-05-21 18:45:30 +00:00
biondizzle	331d9e95f3	WIP: Stage C softmax - partial progress Key finding: cute.size(v, mode=[0]) in @cute.jit produces wrong code. Hardcoding s_k=128 (matching Stage B) fixes the base pipeline. Current status: kernel produces non-zero output but softmax math is still wrong. Applied fixes: pv_done_bar, acc_scale with scale, fastmath=True Need to debug row_sum computation and C9 normalization.	2026-05-21 18:04:21 +00:00
biondizzle	84cd636ba9	Stage C fixes: pv_done_bar sync, acc_scale with scale, fastmath=True - Add pv_done_bar (barrier_id=4): MMA signals PV complete, epilogue waits before O rescale (C6) and final normalization (C9) - Fix acc_scale: exp2(scale * (old_max - new_max)) includes the scale_softmax_log2 factor matching CUTLASS FMHA reference - fastmath=True for both exp2 calls (P computation + rescale) - No *0.5 (our scalar row_sum pattern initializes (0,0) not (sum,sum))	2026-05-21 17:58:04 +00:00
biondizzle	52b46a2dee	Stage C: add validation harness with real softmax reference (C1)	2026-05-21 17:49:26 +00:00
biondizzle	3fb3c925af	Restructure: cutedsl/ -> dsv4/ with proper layering - Split bridge.py -> ops/quantize.py, ops/layouts.py, ops/gemm_runner.py - Renamed classes: CuTeDSLNvfp4Linear -> Nvfp4Linear, etc. - Moved kernel code to dsv4/kernels/ (gemm, attention, compressor, decode, cuda) - Moved PyTorch bridges to dsv4/ops/ - Moved nn.Module layers to dsv4layers/ - Moved reference implementations to dsv4/reference/ - Moved vendored CUTLASS code to vendored/ - Archived ~190 debug tests to tests/archive/ - Kept ~15 canonical tests in tests/unit/ - Updated all import paths - Added stubs for future components (model/, cache/, loader/) - Updated pyproject.toml: dsv4-inference package name	2026-05-21 17:30:44 +00:00

5 Commits