auto: pre-test commit

2026-05-23 19:20:42 +00:00
parent 9b72411ca7
commit 841db091f7
1 changed files with 34 additions and 19 deletions
--- a/STAGE_D.md
+++ b/STAGE_D.md
@@ -748,30 +748,45 @@ The following are real potential wins but go beyond what the V4 paper explicitly

 ---

-## ⚡ CURRENT ACTION (2026-05-23 19:20 UTC)
+## ⚡ CURRENT ACTION (2026-05-23 19:25 UTC)

-**D1.3 SMEM-P Rank Mismatch Issue — DEBUGGING IN PROGRESS:**
+**D1.3 SMEM-P — MANUAL COPY ATTEMPT FAILED:**

-**Problem:** SMEM-P copy fails with rank mismatch: source rank 4, destination rank 5.
- Source `tSMEM_CPYrP_qk`: `((64, 8), 128, 1, 1)` rank 4
- Destination `tSMEM_CPYsP`: `(((128, 16), 8), 16, 1, (4, 2), 1)` rank 5
- Error: "Expected source and destination tensors to have the same rank"
+**Problem:** `make_tiled_copy_C` creates incompatible partitions:
+- Source partition `tSMEM_CPYrP_qk`: size=65536 elements (rank 4)
+- Destination partition `tSMEM_CPYsP`: size=2097152 elements (rank 5) — 32× larger!
+- Manual copy attempted but size mismatch prevents element-wise mapping.

 **Debug Findings:**
-1. `sP` shape: `((128, 16), 1, (4, 2), 1)` (4 modes) — PV A-operand SMEM layout
-2. `rP_bf16_qk` shape: `((128, 128), 1, 1)` (3 modes) — QK C-fragment layout
-3. Using `make_tiled_copy_C(smem_copy_atom, qk_mma)` with `tcgen05.copy.St32x32bOp` BF16 atom
-4. `partition_S` produces rank 4 from rank 3 input (adds a mode)
-5. `partition_D` produces rank 5 from rank 4 input (adds a mode)
+1. `make_tiled_copy_C(smem_copy_atom, qk_mma)` partitions threads by QK C-fragment layout
+2. But `sP` has PV A-operand SMEM layout — incompatible tiling structure
+3. `partition_S` and `partition_D` produce tensors with different element counts (65536 vs 2M)
+4. This confirms "helpers are a trap" — they assume compatible layouts

-**Root Cause Hypothesis:**
-The tiled copy created by `make_tiled_copy_C` with `qk_mma` expects source/destination tensors with specific tiling structure. `sP` (PV A-operand layout) doesn't match that structure.
+**Root Cause:** QK C-fragment tiling and PV A-operand tiling are fundamentally different. A tiled copy operation expects source and destination to have same tiling pattern.

-**Possible Solutions (from STAGE_D.md):**
-1. Try different `group_modes` combinations
-2. Implement manual SMEM copy without helpers (bypass `make_tiled_copy_C`)
-3. Compute SMEM addresses manually from QK partition → PV A-operand mapping
+**Realization:** We need SMEM as rendezvous point with manual addressing, not automatic tiled copy.

-**Blocked:** Need to understand correct approach for SMEM-P copy given "helpers are a trap" warning.
+**Possible Paths Forward:**
+1. **Manual SMEM addressing:** Compute SMEM addresses directly from QK C-fragment coordinates
+2. **Change sP layout:** Make `sP` have QK C-fragment layout (not PV A-operand)
+3. **Abandon helpers entirely:** Implement complete manual copy without `make_tiled_copy_C`

-**Next:** Investigate manual copy approach or try `group_modes(sP, 0, 4)` to flatten all modes.
+**Blocked:** Need to decide on correct approach. Manual addressing seems most aligned with "helpers are a trap" warning.
+
+**Mike says:** "Youre gonna need to do manual SMEM addressing. It may take you a few hours, but I trust you can do it."
+
+**Decision:** Manual SMEM addressing it is. Abandon `make_tiled_copy_C` entirely.
+
+**Approach:**
+1. Get thread's position in QK C-fragment partition
+2. Compute which P values this thread owns (range in QK C-fragment space)
+3. For each P value, compute destination SMEM address in PV A-operand layout
+4. Write P values to computed SMEM addresses
+
+**Implementation Plan:**
+- Use `cute.coord` to get thread's logical coordinates in QK C-fragment partition
+- Compute mapping: (thread_coord, element_idx) → SMEM_offset
+- Write via `sP[smem_offset] = p_value`
+
+**Expected Complexity:** Few hours. Need to understand QK C-fragment layout and PV A-operand SMEM layout coordinate systems.