auto: pre-test commit

2026-05-23 19:14:02 +00:00
parent 341527977c
commit 4d6acaeef0
1 changed files with 23 additions and 11 deletions
--- a/STAGE_D.md
+++ b/STAGE_D.md
@@ -748,18 +748,30 @@ The following are real potential wins but go beyond what the V4 paper explicitly

 ---

-## ⚡ CURRENT ACTION (2026-05-23 19:10 UTC)
+## ⚡ CURRENT ACTION (2026-05-23 19:20 UTC)

-**Starting NVFP4-0 verification and D1.3 validation on B200:**
+**D1.3 SMEM-P Rank Mismatch Issue — DEBUGGING IN PROGRESS:**

-1. **NVFP4-0.1**: Check sf_dtype in gemm_runner.py - appears to be `torch.float8_e4m3fn` from quantize.py ✅
-2. **NVFP4-0.2**: SF TMEM layout - need to verify packing matches UE4M3 (NVFP4) not UE8M0 (MXFP4)
-3. **NVFP4-0.3**: FP4 TMA element type - check `float4_e2m1fn_x2` in FMHA TMA
-4. **NVFP4-0.4**: MMA kind verification - confirm tcgen05 infers NVFP4 not MXFP4
+**Problem:** SMEM-P copy fails with rank mismatch: source rank 4, destination rank 5.
+- Source `tSMEM_CPYrP_qk`: `((64, 8), 128, 1, 1)` rank 4
+- Destination `tSMEM_CPYsP`: `(((128, 16), 8), 16, 1, (4, 2), 1)` rank 5
+- Error: "Expected source and destination tensors to have the same rank"

-**D1.3 Verification:**
- Running `test_fmha_v3_stage_d1.py` on B200 via `fire_b200_test`
- Checking debug prints from SMEM-P path in fmha.py
- Verifying hd=128,256,512 produce correct cosine similarity
+**Debug Findings:**
+1. `sP` shape: `((128, 16), 1, (4, 2), 1)` (4 modes) — PV A-operand SMEM layout
+2. `rP_bf16_qk` shape: `((128, 128), 1, 1)` (3 modes) — QK C-fragment layout
+3. Using `make_tiled_copy_C(smem_copy_atom, qk_mma)` with `tcgen05.copy.St32x32bOp` BF16 atom
+4. `partition_S` produces rank 4 from rank 3 input (adds a mode)
+5. `partition_D` produces rank 5 from rank 4 input (adds a mode)

-**Git workflow:** Edit locally → commit → push → test on B200
+**Root Cause Hypothesis:**
+The tiled copy created by `make_tiled_copy_C` with `qk_mma` expects source/destination tensors with specific tiling structure. `sP` (PV A-operand layout) doesn't match that structure.
+
+**Possible Solutions (from STAGE_D.md):**
+1. Try different `group_modes` combinations
+2. Implement manual SMEM copy without helpers (bypass `make_tiled_copy_C`)
+3. Compute SMEM addresses manually from QK partition → PV A-operand mapping
+
+**Blocked:** Need to understand correct approach for SMEM-P copy given "helpers are a trap" warning.
+
+**Next:** Investigate manual copy approach or try `group_modes(sP, 0, 4)` to flatten all modes.