auto: pre-test commit

2026-05-23 19:36:58 +00:00
parent 4f8559ae2e
commit e0aa7ccd19

@@ -117,11 +117,31 @@ else:
- Zeroed sP causes PV MMA to read zeros → garbage output
- TMEM-P path works for hd=64 but fails for hd>64 due to TMEM layout mismatch
## Blocked Until
1. Answer from CUTLASS LLM on coordinate mapping
2. Or alternative approach suggested
## Progress Update (2026-05-23 19:35 UTC)
**CUTLASS LLM responded!** Got a complete solution:
## Time Pressure
- Been debugging for ~20 minutes
- Manual addressing is proving much harder than anticipated
- Without the mapping formula, guessing offsets has a low chance of success
### Key Solution:
1. Use `make_identity_tensor(tStS0.shape)` for coordinate tensor
2. Partition coordinate tensor same way as data tensor
3. Mapping formula: QK `((m, n), 0, 0)` → PV `((m, n % 16), 0, ((n // 16) % 4, n // 64), 0)`
4. Use tensor indexing `sP[dst_coord] = value`, not manual offsets
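The mapping formula in item 3 can be sanity-checked in plain Python before it goes into the kernel. The sketch below is only an assumption about what `qk_to_pv_coord` might look like; the actual implementation is not shown in these notes, and the nested-tuple layout simply mirrors the formula as written above. No CuTe DSL calls are used.

```python
def qk_to_pv_coord(qk_coord):
    """Map a QK coordinate ((m, n), 0, 0) to a PV coordinate.

    Follows the formula from the notes:
    ((m, n), 0, 0) -> ((m, n % 16), 0, ((n // 16) % 4, n // 64), 0)
    """
    (m, n), _, _ = qk_coord
    return ((m, n % 16), 0, ((n // 16) % 4, n // 64), 0)


# Example: row 5, column 37 of the QK tile.
print(qk_to_pv_coord(((5, 37), 0, 0)))  # ((5, 5), 0, (2, 0), 0)
```

Column 37 splits into offset 5 within a 16-wide group, group 2 of 4, and 64-column block 0, which is what the three derived indices encode.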
### Implementation Progress:
- ✅ Coordinate mapping function implemented and works (`qk_to_pv_coord`)
- ✅ Tensor indexing with coordinate works (`sP[test_coord] = value`)
- ❌ Need to implement the full 128-value mapping per thread
- ❌ Need to get the QK coordinate for each of the thread's 128 P values
### Next Steps:
1. Create coordinate tensor `cP_qk = cute.make_identity_tensor(tStS0.shape)`
2. Partition it same way as `rP_bf16` (through `tTMEM_LOADcP`)
3. In softmax loop, for each fragment j and element k:
   - Get P value from `rP_bf16_frg` or directly from `tTMEM_LOADrS_frg`
   - Get coordinate from partitioned coordinate tensor
   - Map to PV coordinate using `qk_to_pv_coord`
   - Write to SMEM: `sP[dst_coord] = value`
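The loop in step 3 can be rehearsed on the host before touching the kernel. This is a rough pure-Python sketch, not the CuTe DSL code: a dict stands in for `sP`, `itertools.product` stands in for the identity coordinate tensor, the 128x128 tile shape is an assumption not stated in these notes, and `scatter_p` is a hypothetical helper name.

```python
from itertools import product

TILE_M, TILE_N = 128, 128  # assumed tile shape, not taken from the notes


def qk_to_pv_coord(qk_coord):
    # Formula from the notes; see "Key Solution" item 3.
    (m, n), _, _ = qk_coord
    return ((m, n % 16), 0, ((n // 16) % 4, n // 64), 0)


def scatter_p(p_values):
    """Write every P value to its PV coordinate (dict stands in for sP)."""
    sP = {}
    for m, n in product(range(TILE_M), range(TILE_N)):
        dst_coord = qk_to_pv_coord(((m, n), 0, 0))
        # A collision would mean two P values land in the same slot.
        assert dst_coord not in sP
        sP[dst_coord] = p_values[(m, n)]
    return sP


# Tag each P value with its linear QK index so writes can be traced back.
p = {(m, n): m * TILE_N + n for m, n in product(range(TILE_M), range(TILE_N))}
sP = scatter_p(p)
print(len(sP))  # 16384
```

If the collision assert never fires and the slot count equals `TILE_M * TILE_N`, the mapping is one-to-one on the assumed tile, which is a cheap check to run before debugging the per-thread partitioning on the GPU.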
### Time Pressure:
- Got working coordinate mapping
- Need to implement full mapping (~15-30 min)
- Then test and debug