From e0aa7ccd19348a2d71966f7d6ff5b2d5ac3464f1 Mon Sep 17 00:00:00 2001
From: biondizzle
Date: Sat, 23 May 2026 19:36:58 +0000
Subject: [PATCH] auto: pre-test commit

---
 STAGE_D1.3.md | 34 +++++++++++++++++++++++++++-------
 1 file changed, 27 insertions(+), 7 deletions(-)

diff --git a/STAGE_D1.3.md b/STAGE_D1.3.md
index a4c1bad9..c719c5c5 100644
--- a/STAGE_D1.3.md
+++ b/STAGE_D1.3.md
@@ -117,11 +117,31 @@ else:
 - Zeroed sP causes PV MMA to read zeros → garbage output
 - TMEM-P path works for hd=64 but fails for hd>64 due to TMEM layout mismatch
 
-## Blocked Until
-1. Answer from CUTLASS LLM on coordinate mapping
-2. Or alternative approach suggested
+## Progress Update (2026-05-23 19:35 UTC)
+**CUTLASS LLM responded!** Got complete solution:
 
-## Time Pressure
-- Been debugging for ~20 minutes
-- Manual addressing proving much harder than anticipated
-- Without mapping formula, guessing is low-probability
\ No newline at end of file
+### Key Solution:
+1. Use `make_identity_tensor(tStS0.shape)` for coordinate tensor
+2. Partition coordinate tensor same way as data tensor
+3. Mapping formula: QK `((m, n), 0, 0)` → PV `((m, n % 16), 0, ((n // 16) % 4, n // 64), 0)`
+4. Use tensor indexing `sP[dst_coord] = value`, not manual offsets
+
+### Implementation Progress:
+- ✅ Coordinate mapping function implemented and works (`qk_to_pv_coord`)
+- ✅ Tensor indexing with coordinate works (`sP[test_coord] = value`)
+- ❌ Need to implement full 128-value mapping per thread
+- ❌ Need to get QK coordinates for each of thread's 128 P values
+
+### Next Steps:
+1. Create coordinate tensor `cP_qk = cute.make_identity_tensor(tStS0.shape)`
+2. Partition it same way as `rP_bf16` (through `tTMEM_LOADcP`)
+3. In softmax loop, for each fragment j and element k:
+   - Get P value from `rP_bf16_frg` or directly from `tTMEM_LOADrS_frg`
+   - Get coordinate from partitioned coordinate tensor
+   - Map to PV coordinate using `qk_to_pv_coord`
+   - Write to SMEM: `sP[dst_coord] = value`
+
+### Time Pressure:
+- Got working coordinate mapping
+- Need to implement full mapping (~15-30 min)
+- Then test and debug
\ No newline at end of file
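The mapping formula in the notes can be sanity-checked on the host before it is wired into the softmax loop. Below is a minimal plain-Python sketch, not the kernel code: `qk_to_pv_coord` follows the formula as written in the notes, while `pv_to_qk_coord`, `check_round_trip`, and the 128x128 tile extents are hypothetical additions for illustration. It uses nested tuples in place of CuTe coordinates and does not touch `sP` or any CUTLASS API.

```python
def qk_to_pv_coord(m, n):
    """Map QK coordinate ((m, n), 0, 0) to the PV coordinate from the notes.

    The column index n is split into three digits:
    n % 16, (n // 16) % 4, and n // 64.
    """
    return ((m, n % 16), 0, ((n // 16) % 4, n // 64), 0)


def pv_to_qk_coord(pv_coord):
    """Hypothetical inverse, used only to check the mapping loses no information."""
    (m, n_lo), _, (n_mid, n_hi), _ = pv_coord
    return m, n_lo + 16 * n_mid + 64 * n_hi


def check_round_trip(rows=128, cols=128):
    """Return True if every (m, n) maps to a distinct PV coordinate and back.

    The 128x128 extents are an assumption, not taken from the notes.
    """
    seen = set()
    for m in range(rows):
        for n in range(cols):
            dst = qk_to_pv_coord(m, n)
            if dst in seen or pv_to_qk_coord(dst) != (m, n):
                return False
            seen.add(dst)
    return True
```

If `check_round_trip()` returns `True`, no two P values would be written to the same `sP` coordinate under this formula, which rules out one class of bug before debugging on the GPU.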