biondizzle
e678afcde0
Stage B: two MMAs + identity softmax — crash fixed, softmax output still wrong
Key fixes:
- PipelineUmmaAsync consumer group: sized in threads, 32*4 = 128 (not as 4 warps)
- TMEM offsets computed from find_tmem_tensor_col_offset (not hardcoded)
- P fragment from p_tmem_s.outer + make_fragment_A (matching fmha.py)
- V SMEM aliasing via recast_ptr
Status:
- Stage A: cosine 0.999999 ✅
- Stage B: runs without crash, identity softmax cosine -0.02 ❌
- Diagnostics: TMEM layout inspection, bisection results
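
For context on the cosine figures above, here is a minimal sketch of how such a check might look on the host side. It assumes "identity softmax" means the softmax between the two MMAs is replaced by a pass-through, so the reference is O = (Q K^T) V; that reading, the helper names, and the toy data are all illustrative assumptions and are not taken from the test harness in this repo.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    """Swap rows and columns of a list-of-rows matrix."""
    return [list(col) for col in zip(*m)]

def cosine(a, b):
    """Cosine similarity of two same-shaped matrices, flattened to vectors."""
    fa = [x for row in a for x in row]
    fb = [x for row in b for x in row]
    dot = sum(x * y for x, y in zip(fa, fb))
    norm = math.sqrt(sum(x * x for x in fa)) * math.sqrt(sum(y * y for y in fb))
    return dot / norm if norm else 0.0

def reference_identity_softmax(q, k, v):
    """Assumed Stage B reference: S = Q K^T, P = S (no softmax), O = P V."""
    s = matmul(q, transpose(k))
    return matmul(s, v)

# Hypothetical toy inputs; a real check would use the kernel's Q, K, V tiles.
q = [[1.0, 2.0], [0.5, -1.0]]
k = [[0.0, 1.0], [2.0, 1.0]]
v = [[1.0, 0.0], [3.0, 2.0]]
ref = reference_identity_softmax(q, k, v)
```

A kernel output that matches the reference gives a cosine close to 1 (as in Stage A), while a value near 0, such as the -0.02 reported for Stage B, suggests the output is essentially uncorrelated with the reference rather than merely scaled or slightly off.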
2026-05-20 20:26:25 +00:00