The 8-mode indexing (tBgK[None,None,None,None,kt,None,None,None]) fails at
JIT compilation with 'coord and shape are weakly congruent' error. The actual
MLIR tensor shape is (((64,128),1),?,?,?) — 4 modes, not 8.
The working fix from commit 845ad98 on the B200 used 4-mode indexing all along:
tBgK[(None, None, kt, 0)] — mode 2 = GMEM tile dim
tVgV[(None, 0, kt, 0)] — mode 2 = GMEM tile dim
Updated all files: example10, test_fmha_v3_stage_c, README, docstrings.