auto: pre-test commit
@@ -117,11 +117,31 @@ else:
- Zeroed sP causes PV MMA to read zeros → garbage output
- TMEM-P path works for hd=64 but fails for hd>64 due to TMEM layout mismatch

## Blocked Until
1. Answer from CUTLASS LLM on coordinate mapping
2. Or alternative approach suggested
## Progress Update (2026-05-23 19:35 UTC)
**CUTLASS LLM responded!** Got complete solution:

## Time Pressure
- Been debugging for ~20 minutes
- Manual addressing proving much harder than anticipated
- Without mapping formula, guessing is low-probability

### Key Solution:
1. Use `make_identity_tensor(tStS0.shape)` for coordinate tensor
2. Partition coordinate tensor same way as data tensor
3. Mapping formula: QK `((m, n), 0, 0)` → PV `((m, n % 16), 0, ((n // 16) % 4, n // 64), 0)`
4. Use tensor indexing `sP[dst_coord] = value`, not manual offsets
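
The mapping formula in item 3 can be checked in plain Python, without CUTLASS. The sketch below is an illustration only: the name `qk_to_pv_coord` comes from these notes, but its signature is assumed, and `pv_to_n` plus the 128x128 tile size are hypothetical helpers added for the check.

```python
# Plain-Python model of the QK -> PV coordinate mapping from the notes.
# Signature and tile size are assumptions; no CUTLASS import is required.

def qk_to_pv_coord(qk_coord):
    """Map a QK coordinate ((m, n), 0, 0) to the PV coordinate."""
    (m, n), _, _ = qk_coord
    return ((m, n % 16), 0, ((n // 16) % 4, n // 64), 0)


def pv_to_n(pv_coord):
    """Hypothetical inverse: recover the column n from a PV coordinate."""
    (_, n_lo), _, (n_mid, n_hi), _ = pv_coord
    return n_lo + 16 * n_mid + 64 * n_hi


# Column 100 splits into 4 (within 16), 2 (16-wide block), 1 (64-wide block).
example = qk_to_pv_coord(((7, 100), 0, 0))
print(example)

# Every (m, n) in an assumed 128x128 tile should land on a distinct PV coordinate.
all_dst = {qk_to_pv_coord(((m, n), 0, 0)) for m in range(128) for n in range(128)}
print(len(all_dst) == 128 * 128)
```

The decomposition is lossless because `n == n % 16 + 16 * ((n // 16) % 4) + 64 * (n // 64)` for non-negative `n`, so no two QK coordinates collide in sP.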

### Implementation Progress:
- ✅ Coordinate mapping function implemented and works (`qk_to_pv_coord`)
- ✅ Tensor indexing with coordinate works (`sP[test_coord] = value`)
- ❌ Need to implement full 128-value mapping per thread
- ❌ Need to get QK coordinates for each of thread's 128 P values

### Next Steps:
1. Create coordinate tensor `cP_qk = cute.make_identity_tensor(tStS0.shape)`
2. Partition it same way as `rP_bf16` (through `tTMEM_LOADcP`)
3. In softmax loop, for each fragment j and element k:
   - Get P value from `rP_bf16_frg` or directly from `tTMEM_LOADrS_frg`
   - Get coordinate from partitioned coordinate tensor
   - Map to PV coordinate using `qk_to_pv_coord`
   - Write to SMEM: `sP[dst_coord] = value`
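
The loop in step 3 can be modelled in plain Python before wiring it into the kernel. This is a sketch under stated assumptions, not the kernel code: the 128x128 tile, the split of each thread's 128 values into 4 fragments of 32, and the "thread `tid` owns row `tid`" partition are all hypothetical stand-ins for what `tTMEM_LOADcP` would actually provide, and a dict stands in for `sP`.

```python
# Pure-Python model of the per-thread scatter loop; no CUTLASS import is needed.
TILE_M, TILE_N = 128, 128            # assumed QK tile shape
ELEMS_PER_FRAG = 32                  # assumed fragment width
NUM_FRAGS = TILE_N // ELEMS_PER_FRAG # 4 fragments x 32 elements = 128 values


def qk_to_pv_coord(qk_coord):
    """Mapping formula from the notes: QK ((m, n), 0, 0) -> PV coordinate."""
    (m, n), _, _ = qk_coord
    return ((m, n % 16), 0, ((n // 16) % 4, n // 64), 0)


def thread_qk_coords(tid):
    """Stand-in for the partitioned identity tensor.

    Yields (j, k, qk_coord) for each value held by thread `tid`,
    assuming (hypothetically) that the thread owns row `tid`.
    """
    for j in range(NUM_FRAGS):
        for k in range(ELEMS_PER_FRAG):
            yield j, k, ((tid, j * ELEMS_PER_FRAG + k), 0, 0)


def scatter_p(p_values):
    """Write p_values[tid][j][k] into a dict that stands in for sP."""
    sP = {}
    for tid in range(TILE_M):
        for j, k, qk_coord in thread_qk_coords(tid):
            dst_coord = qk_to_pv_coord(qk_coord)
            sP[dst_coord] = p_values[tid][j][k]
    return sP


# Usage: each P value encodes its own (row, column) so placement can be verified.
p = [[[tid * TILE_N + j * ELEMS_PER_FRAG + k
       for k in range(ELEMS_PER_FRAG)]
      for j in range(NUM_FRAGS)]
     for tid in range(TILE_M)]
sP = scatter_p(p)
```

Encoding the source position in each value makes a wrong partition assumption visible: any PV coordinate that holds an unexpected value points at the thread and fragment that wrote it.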

### Time Pressure:
- Got working coordinate mapping
- Need to implement full mapping (~15-30 min)
- Then test and debug