nvfp4-megamoe-kernel/= at ec250eccd6a67683dcd3b19032e6d25d2c4e2f79

biondizzle/nvfp4-megamoe-kernel


biondizzle c97661994e WIP: correction warp group architecture - compiles, illegal address at runtime

4 softmax warps (0-3), 4 correction warps (4-7), 1 MMA warp (8), 1 TMA warp (9).
10 warps × 32 threads = 320 threads total.
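
The warp layout above can be sketched as a small role table. This is only an illustration in plain Python, not the kernel's actual dispatch code: the names `WARP_ROLES`, `role_of_warp`, and `role_of_thread` are hypothetical, and the only facts taken from the commit message are the warp index ranges and the 320-thread total.

```python
WARP_SIZE = 32  # threads per warp on NVIDIA GPUs

# Hypothetical role table. The warp index ranges come from the commit
# message; the dictionary itself is an illustration, not the kernel's code.
WARP_ROLES = {
    "softmax": range(0, 4),     # warps 0-3
    "correction": range(4, 8),  # warps 4-7
    "mma": range(8, 9),         # warp 8
    "tma": range(9, 10),        # warp 9
}

NUM_WARPS = sum(len(warps) for warps in WARP_ROLES.values())
THREADS_PER_BLOCK = NUM_WARPS * WARP_SIZE


def role_of_warp(warp_idx: int) -> str:
    """Return the role name for a warp index, or raise if it is out of range."""
    for role, warps in WARP_ROLES.items():
        if warp_idx in warps:
            return role
    raise ValueError(f"warp index {warp_idx} is outside 0..{NUM_WARPS - 1}")


def role_of_thread(thread_idx: int) -> str:
    """Map a linear thread index within the block to its warp's role."""
    return role_of_warp(thread_idx // WARP_SIZE)


if __name__ == "__main__":
    print(THREADS_PER_BLOCK)
    print(role_of_thread(0), role_of_thread(128), role_of_thread(319))
```

A table like this makes it easy to check that the launch configuration and the per-role branches agree on which threads belong to which warp group.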

Softmax: QK→softmax, write P, write row metadata to TMEM vector.
Correction: read vector via QK partition, rescale O (C6), normalize O (C9).
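
As a reference for what the correction warps are expected to compute, here is a minimal pure-Python sketch of the standard online-softmax update for a single query row. It assumes that the "row metadata" is the running row maximum and row sum, and that C6 and C9 refer to the rescale and final normalization steps; the commit message does not spell either of these out. All function names are hypothetical and nothing here models TMEM.

```python
import math


def attention_row_online(scores, values, block=2):
    """Online softmax over one row, processing `block` keys at a time.

    scores: list of floats (one QK score per key)
    values: list of equal-length lists (one V row per key)
    Assumes at least one finite score in the first block.
    """
    d = len(values[0])
    m = -math.inf    # running row max (assumed part of the row metadata)
    l = 0.0          # running row sum (assumed part of the row metadata)
    o = [0.0] * d    # unnormalized output accumulator

    for start in range(0, len(scores), block):
        s_blk = scores[start:start + block]
        v_blk = values[start:start + block]

        m_new = max(m, max(s_blk))
        scale = math.exp(m - m_new)  # 0.0 on the first block, since m = -inf

        # Softmax side: unnormalized probabilities P for this block.
        p = [math.exp(s - m_new) for s in s_blk]

        # Correction side, rescale step: shrink old O and l to the new max,
        # then accumulate this block's contribution.
        l = l * scale + sum(p)
        o = [
            o_j * scale + sum(p_i * v[j] for p_i, v in zip(p, v_blk))
            for j, o_j in enumerate(o)
        ]
        m = m_new

    # Correction side, normalize step: divide by the final row sum.
    return [o_j / l for o_j in o]


def attention_row_reference(scores, values):
    """Direct softmax(scores) @ values, for comparison."""
    m = max(scores)
    p = [math.exp(s - m) for s in scores]
    total = sum(p)
    d = len(values[0])
    return [sum(p_i * v[j] for p_i, v in zip(p, values)) / total for j in range(d)]
```

Comparing the two functions on small inputs gives a host-side oracle that a kernel's output can be checked against once the illegal-address fault is resolved.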

Compiles successfully but hits CUDA_ERROR_ILLEGAL_ADDRESS at runtime.
Likely cause: the vector TMEM offsets or the correction warps' TMEM access layout is wrong.

Key files:
- tests/unit/test_fmha_v3_correction.py (new correction architecture)
- tests/unit/test_fmha_v3_softmax.py (working n=128, cosine 0.993)
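
The cosine figure quoted for the softmax test is presumably a cosine similarity between the kernel output and a reference output. A minimal sketch of that metric, assuming flattened lists of floats, follows; the function name is hypothetical and this is not the code from the test files.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length sequences of floats."""
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)
```

A value near 1.0 means the two outputs point in almost the same direction. The metric ignores overall scale, so it is usually paired with an absolute or relative error check.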

2026-05-21 21:20:39 +00:00

0 lines

0 B

Plaintext


The file is empty.