Update STAGE_D.md: manual SMEM addressing blocked on layout mapping

This commit is contained in:
2026-05-23 19:22:28 +00:00
parent 060cea5d0f
commit 5b6a4fbef9


@@ -778,15 +778,22 @@ The following are real potential wins but go beyond what the V4 paper explicitly
**Decision:** Manual SMEM addressing it is. Abandon `make_tiled_copy_C` entirely.
**Approach:**
1. Get thread's position in QK C-fragment partition
2. Compute which P values this thread owns (range in QK C-fragment space)
3. For each P value, compute destination SMEM address in PV A-operand layout
4. Write P values to computed SMEM addresses
**Status:** STUCK. Manual addressing is harder than expected because of CuTeDSL JIT constraints.
**Implementation Plan:**
- Use `cute.coord` to get thread's logical coordinates in QK C-fragment partition
- Compute mapping: (thread_coord, element_idx) → SMEM_offset
- Write via `sP[smem_offset] = p_value`
**Problems Encountered:**
1. `cute.coord` doesn't exist, so there is no direct way to get the thread's logical coordinates
2. Array indexing requires compile-time constants or vectorized loops
3. Layouts are completely different:
   - TMEM P layout: `((128,128),1,1):((65536,1),0,0)`
   - SMEM P layout: `((128,16),1,(4,2),1):((64,1),0,(16,8192),0)`
4. No clear mapping from TMEM coordinates to SMEM coordinates
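
One candidate mapping can be read directly off the two layout strings above. The sketch below is plain Python, not CuTeDSL, and rests on several assumptions that are not confirmed here: P is a 128×128 tile indexed by (row, col); the three K-modes of the SMEM layout (sizes 16, 4, 2) split the column index as `col = k0 + 16*k1 + 64*k2`; the TMEM stride of 65536 puts the row in the upper 16 bits of the offset; and no swizzle is applied on top of the SMEM layout. All function names are hypothetical.

```python
ROWS, COLS = 128, 128


def tmem_offset(row, col):
    """Offset under ((128,128),1,1):((65536,1),0,0)."""
    return row * 65536 + col


def tmem_to_coord(offset):
    """Inverse of tmem_offset: recover (row, col) from a TMEM offset."""
    return offset >> 16, offset & 0xFFFF


def smem_offset(row, col):
    """Offset under ((128,16),1,(4,2),1):((64,1),0,(16,8192),0).

    ASSUMPTION: col = k0 + 16*k1 + 64*k2 and no swizzle is applied.
    """
    k0 = col % 16          # size-16 mode, stride 1
    k1 = (col // 16) % 4   # size-4 mode, stride 16
    k2 = col // 64         # size-2 mode, stride 8192
    return row * 64 + k0 * 1 + k1 * 16 + k2 * 8192


def tmem_to_smem(offset):
    """Map a TMEM offset of P to the SMEM offset of the same element."""
    row, col = tmem_to_coord(offset)
    return smem_offset(row, col)


def is_bijection():
    """Check that every SMEM slot in [0, 128*128) is hit exactly once."""
    seen = {smem_offset(r, c) for r in range(ROWS) for c in range(COLS)}
    return seen == set(range(ROWS * COLS))
```

Under these assumptions the SMEM side collapses to `row*64 + col % 64 + 8192*(col // 64)`, that is, two row-major 128×64 halves placed 8192 elements apart. Which (row, col) values a given thread owns still depends on the QK C-fragment partition and is not modelled here.
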
**Expected Complexity:** A few hours. Requires understanding the coordinate systems of the QK C-fragment layout and the PV A-operand SMEM layout.
**Root Issue:** Manual layout conversion in CuTeDSL requires understanding coordinate systems and offset computation, which is complex without proper documentation/examples.
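
For reference, the offset computation follows the general CuTe rule: a layout `shape:stride` turns a linear index into a coordinate with the leftmost mode varying fastest, then takes the inner product of that coordinate with the strides. A minimal pure-Python sketch of that rule is given below as a way to check a hand-derived formula against the printed layout strings outside the JIT. `flatten`, `layout_offset`, and `offset_at` are hypothetical helper names, not CuTeDSL APIs; swizzles are not modelled, and treating the linear index as `row + 128*col` is an assumption about how the P tile is indexed.

```python
def flatten(t):
    """Flatten a nested tuple of ints into a flat list, left to right."""
    if isinstance(t, tuple):
        return [x for sub in t for x in flatten(sub)]
    return [t]


def layout_offset(index, shape, stride):
    """Evaluate a CuTe-style layout at an in-range linear index.

    The index is split into per-mode coordinates (leftmost mode fastest),
    and each coordinate is multiplied by the matching stride.
    """
    offset = 0
    for extent, step in zip(flatten(shape), flatten(stride)):
        offset += (index % extent) * step
        index //= extent
    return offset


# The two layouts quoted above, written as (shape, stride) pairs.
TMEM_P = (((128, 128), 1, 1), ((65536, 1), 0, 0))
SMEM_P = (((128, 16), 1, (4, 2), 1), ((64, 1), 0, (16, 8192), 0))


def offset_at(layout, row, col, rows=128):
    """Offset of element (row, col), ASSUMING linear index = row + rows*col."""
    shape, stride = layout
    return layout_offset(row + rows * col, shape, stride)
```

Evaluating both layouts at the same (row, col) gives the source and destination offsets of one P element, which is the pair the manual copy has to connect.
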
**Options:**
1. Continue trying to implement the manual conversion (high risk, time-consuming)
2. Find an existing example of layout conversion in the codebase
3. Ask for more specific guidance on the coordinate mapping
4. Try a different approach: make PV read from TMEM with a different layout
**Blocked:** Need a coordinate mapping formula or a worked example.