nvfp4-megamoe-kernel

Files

biondizzle a1f08f9488 🚀 MULTI-TILE SOFTMAX + O RESCALE WORKING: n=128 cos 0.999998, n=256 cos 0.80

Fixed all loops to use self.n_kv_tiles (a Python int) instead of
cute.size(gK, mode=[3]), which returned 1 for every value of n.

Results:
  n=128: cos 0.999998 ✅ PASS (single tile, full softmax + normalize)
  n=256: cos 0.801156 (2 tiles, O rescale partially working)
  n=512: CUDA launch failure (pipeline can't cycle past kv_stage=2)

The n=256 improvement (0.71 → 0.80) confirms:
  1. TMA fix (None,0,None,0) loads both KV tiles correctly
  2. Softmax processes both tiles with online row_max/row_sum tracking
  3. O rescale (O *= acc_scale for kt > 0) is partially working
  4. Final normalize (O *= 1/row_sum) works correctly
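
The four points above follow the usual online-softmax recurrence. Below is a minimal pure-Python sketch for a single query row. It assumes a KV tile of 128 keys, which is inferred from n=128 being a single tile. The names `attend_reference`, `attend_tiled`, `cosine`, and `TILE_N` are hypothetical and do not come from the kernel.

```python
import math
import random

TILE_N = 128  # assumed KV tile size: n=128 -> 1 tile, n=256 -> 2 tiles


def attend_reference(q, keys, values):
    """Plain softmax(q . K^T) @ V for one query row, no tiling."""
    scores = [sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(dim)]


def attend_tiled(q, keys, values, tile_n=TILE_N):
    """Same math, one KV tile at a time with online row_max / row_sum."""
    n_kv_tiles = (len(keys) + tile_n - 1) // tile_n  # plain Python int
    dim = len(values[0])
    row_max, row_sum = -math.inf, 0.0
    out = [0.0] * dim
    for kt in range(n_kv_tiles):
        k_tile = keys[kt * tile_n:(kt + 1) * tile_n]
        v_tile = values[kt * tile_n:(kt + 1) * tile_n]
        scores = [sum(a * b for a, b in zip(q, k)) for k in k_tile]
        new_max = max(row_max, max(scores))
        # Rescale earlier partial results only from the second tile onward.
        acc_scale = math.exp(row_max - new_max) if kt > 0 else 1.0
        probs = [math.exp(s - new_max) for s in scores]
        row_sum = row_sum * acc_scale + sum(probs)
        out = [o * acc_scale + sum(p * v[d] for p, v in zip(probs, v_tile))
               for d, o in enumerate(out)]
        row_max = new_max
    # Final normalize: O *= 1 / row_sum
    return [o / row_sum for o in out]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))
```

When the rescale is applied to both O and row_sum with the same acc_scale, the tiled result matches the untiled reference to floating-point rounding. A cosine well below 1 at two tiles would therefore be consistent with one of those two updates being skipped or using a stale row_max.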

Remaining:
  - n=256: raise cos from 0.80 to 0.9999 (O rescale precision issue)
  - n≥384: pipeline cycling (kv_stage=2 can only hold 2 tiles)
  - Need to increase kv_stage or fix pipeline state cycling
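
The pipeline limit in the list above can be pictured as a ring of stage slots. The sketch below assumes a producer/consumer scheme in which slots are reused modulo kv_stage and a phase bit flips on every wrap-around. `PipelineState` and `schedule` are hypothetical stand-ins written for this note, not the CuTe DSL classes.

```python
from dataclasses import dataclass


@dataclass
class PipelineState:
    """Hypothetical ring-buffer cursor: slot index plus a phase bit."""
    stages: int
    index: int = 0
    phase: int = 0

    def advance(self):
        self.index += 1
        if self.index == self.stages:
            # Wrapped around: reuse slot 0 and flip the phase so that
            # waiters can tell the new fill from the previous one.
            self.index = 0
            self.phase ^= 1


def schedule(n_kv_tiles, kv_stage):
    """Return the (slot, phase) pair each KV tile would occupy."""
    state = PipelineState(kv_stage)
    slots = []
    for _ in range(n_kv_tiles):
        slots.append((state.index, state.phase))
        state.advance()
    return slots
```

Under these assumptions, kv_stage=2 handles two tiles without any wrap, while a third tile (n=384) reuses slot 0 with the phase flipped. That reuse is only safe if the consumer has released the slot and both sides track the phase, which is why the options are either a larger kv_stage or correct state cycling.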

2026-05-23 00:35:42 +00:00

__init__.py

Restructure: cutedsl/ -> dsv4/ with proper layering

2026-05-21 17:30:44 +00:00

cudagraph_test.py

Restructure: cutedsl/ -> dsv4/ with proper layering

2026-05-21 17:30:44 +00:00

layertest.py

Restructure: cutedsl/ -> dsv4/ with proper layering

2026-05-21 17:30:44 +00:00

test_cutedsl.py

Restructure: cutedsl/ -> dsv4/ with proper layering