🚀 MULTI-TILE SOFTMAX + O RESCALE WORKING: n=128 cos 0.999998, n=256 cos 0.80
Fixed ALL loops to use self.n_kv_tiles (Python int) instead of
cute.size(gK, mode=[3]) which returned 1 for all n values.
Results:
n=128: cos 0.999998 ✅ PASS (single tile, full softmax + normalize)
n=256: cos 0.801156 (2 tiles, O rescale partially working)
n=512: CUDA launch failure (pipeline can't cycle past kv_stage=2)
The n=256 improvement (0.71 → 0.80) confirms:
1. TMA fix (None,0,None,0) loads both KV tiles correctly
2. Softmax processes both tiles with online row_max/row_sum tracking
3. O rescale (O *= acc_scale for kt > 0) is partially working
4. Final normalize (O *= 1/row_sum) works correctly
Remaining:
- n=256 cos 0.80 → 0.9999: O rescale precision issue
- n≥384: pipeline cycling (kv_stage=2 can only hold 2 tiles)
- Need to increase kv_stage or fix pipeline state cycling
This commit is contained in:
@@ -323,8 +323,6 @@ class FmhaV3StageCMulti:
|
||||
# the missing rescale shows as accuracy drift.
|
||||
for kt in range(self.n_kv_tiles):
|
||||
si_handle = s_cons.wait_and_advance()
|
||||
if kt == 0:
|
||||
cute.printf("SOFTMAX self.n_kv_tiles=%d\n", Int32(self.n_kv_tiles))
|
||||
|
||||
# Load S[kt]
|
||||
tTMEM_LOADrS = cute.make_rmem_tensor(tTMEM_LOADcS.shape, self.qk_acc_dtype)
|
||||
@@ -403,7 +401,6 @@ class FmhaV3StageCMulti:
|
||||
|
||||
# === Final O normalization: O *= 1/row_sum ===
|
||||
inv_row_sum = Float32(1.0) / row_sum
|
||||
cute.printf("FINAL row_sum=%f inv_row_sum=%f\n", row_sum, inv_row_sum)
|
||||
|
||||
tTMrO = cute.make_rmem_tensor(
|
||||
(tTMEM_LOADcO.shape, 128 // corr_tile_size), self.acc_dtype
|
||||
|
||||
Reference in New Issue
Block a user