Commit Graph

  • 7cb133c5bf test: exact copy of working TS test (verify it still passes) biondizzle 2026-05-28 13:49:04 +00:00
  • 0dcaa648b3 test: properly aligned V SMEM buffer biondizzle 2026-05-28 13:47:47 +00:00
  • 1c14ada386 test: write O to tb (overwriting P), same as isolated test biondizzle 2026-05-28 13:46:04 +00:00
  • a944f90040 test: match isolated TS test exactly (V=all-1, BLOCK_MN=16) biondizzle 2026-05-28 13:44:36 +00:00
  • 482328160a test: single PV K-tile debug biondizzle 2026-05-28 13:43:24 +00:00
  • 3a40ed6d69 test: skip QK+softmax, write P directly to TMEM for PV debug biondizzle 2026-05-28 13:41:50 +00:00
  • f24bc583dc test: zero O TMEM before PV GEMM biondizzle 2026-05-28 13:40:30 +00:00
  • 2885b3f2ed test: full FMHA HD=16 with PV GEMM via tcgen05.mma TS biondizzle 2026-05-28 13:39:34 +00:00
  • dc2130cb12 test: cleanup TS MMA test biondizzle 2026-05-28 13:38:07 +00:00
  • a767e90a12 test: B=2.0 to understand TS MMA scale factor biondizzle 2026-05-28 13:36:30 +00:00
  • b7c6971720 test: use 32x32b.x8 for A write (avoids 16x256b misalign) biondizzle 2026-05-28 13:34:50 +00:00
  • a7c81d66ba test: step-by-step TMEM write/read debug for TS MMA biondizzle 2026-05-28 13:33:36 +00:00
  • c05cc1ac93 test: separate TMEM regions for A and C in TS MMA biondizzle 2026-05-28 13:32:22 +00:00
  • 37a502e476 test: minimal tcgen05.mma TS debug (PV GEMM) biondizzle 2026-05-28 13:31:18 +00:00
  • efa03f53d4 docs: update CURRENT_ISSUE and MEMORY — full FMHA HD=64 pipeline working biondizzle 2026-05-28 13:11:32 +00:00
  • 654a2ae7f4 test: merge softmax+PV into single warp0 block (s_vals scope fix) biondizzle 2026-05-28 13:10:02 +00:00
  • 5c9e3c41af test: full FMHA HD=64 — QK+softmax+PV(register math) biondizzle 2026-05-28 13:09:20 +00:00
  • 0672373e51 test: debug — just QK+softmax+P read (no PV) biondizzle 2026-05-28 13:08:06 +00:00
  • 5d75decd57 test: full FMHA HD=16 — PV via register math (decode T=1) biondizzle 2026-05-28 13:06:52 +00:00
  • f62772992b test: full FMHA HD=16 with PV GEMM (separate TMEM for P and O) biondizzle 2026-05-28 13:05:27 +00:00
  • bd15bce853 test: HD=16 QK+softmax (no PV) biondizzle 2026-05-28 13:04:10 +00:00
  • 38d7bcd776 test: HD=16 FMHA softmax only (skip PV for now) biondizzle 2026-05-28 13:03:06 +00:00
  • 834d682443 test: full FMHA HD=16 pipeline (QK→softmax→PV→epilogue) biondizzle 2026-05-28 13:02:00 +00:00
  • 3b8be4b2db test: FMHA softmax (QK→read S→softmax→write P→read P→verify) biondizzle 2026-05-28 13:00:37 +00:00
  • c936940428 test: separate (128,16) SMEM per K-tile with correct source stride biondizzle 2026-05-28 12:57:38 +00:00
  • f244c4fdd2 test: single-thread MMA (tid==0) for Layout D biondizzle 2026-05-28 12:56:39 +00:00
  • ba2e390e1e test: debug single K-tile from full (128,64) SMEM biondizzle 2026-05-28 12:55:52 +00:00
  • a7e8b483cd test: HD=64 multi-K-tile with correct source stride in SMEM writes biondizzle 2026-05-28 12:54:57 +00:00
  • 926ae5d7bf test: fix K source stride mismatch in manual SMEM write biondizzle 2026-05-28 12:54:03 +00:00
  • 7d16a30cb6 test: exact HD=16 pattern with HD=64 data biondizzle 2026-05-28 12:53:13 +00:00
  • db4f661843 test: debug with (128,16) SMEM matching HD=16 exactly biondizzle 2026-05-28 12:52:19 +00:00
  • b703dc0a50 test: debug single K-tile with offset descriptor biondizzle 2026-05-28 12:51:33 +00:00
  • 435ca037cf test: use accumulate=false for first K-tile, skip TMEM zero biondizzle 2026-05-28 12:50:44 +00:00
  • e8ac2120ad test: HD=64 QK with contiguous SMEM + offset descriptors biondizzle 2026-05-28 12:50:07 +00:00
  • 1c01e8e412 test: fix inline asm line continuation for nvcc biondizzle 2026-05-28 12:48:45 +00:00
  • 71c774027c test: fix HD=64 QK — zero TMEM, fence after MMA, single-thread MMA call biondizzle 2026-05-28 12:47:51 +00:00
  • 1bf76388c8 test: always accumulate, separate SMEM per K-tile, TMEM starts at 0 biondizzle 2026-05-28 12:23:47 +00:00
  • 8707f555c2 test: add extra syncwarp + syncthreads for MMA safety biondizzle 2026-05-28 12:20:01 +00:00
  • 5a65d46c26 test: HD=64 with separate SMEM per K-tile — no offset descriptors needed biondizzle 2026-05-28 12:18:06 +00:00
  • 526fafb808 test: revert volatile, fix wid==0, full 4 K-tiles biondizzle 2026-05-28 12:16:09 +00:00
  • de879342dd test: 1 K-tile, volatile writes, verify SMEM biondizzle 2026-05-28 12:13:23 +00:00
  • bd6440fd83 test: volatile SMEM writes + 2 K-tiles biondizzle 2026-05-28 12:11:47 +00:00
  • c2e41a858e test: force 2 K-tiles for debug biondizzle 2026-05-28 12:09:45 +00:00
  • 8b2200a6d3 test: HD=64 full 4 K-tile accumulate + full-HD scalar reference biondizzle 2026-05-28 12:07:50 +00:00
  • afb18caf2d test: clean HD=64, 1 K-tile only, verify SMEM writes + compare vs scalar biondizzle 2026-05-28 12:04:54 +00:00
  • e587e26b06 test: log canonical indices we write Q to biondizzle 2026-05-28 12:01:28 +00:00
  • facd509c3c test: remove sanity check (zeroing loop overwrites), fix verify offsets biondizzle 2026-05-28 11:59:08 +00:00
  • 20ae390d32 test: fix compile error biondizzle 2026-05-28 11:57:08 +00:00
  • 7b16eceb91 test: more detailed SMEM sanity check biondizzle 2026-05-28 11:56:07 +00:00
  • eb0ca18e23 test: sanity check sQ[0] write+read biondizzle 2026-05-28 11:54:13 +00:00
  • 8936a2dec7 test: clean SMEM write loops for HD=64 biondizzle 2026-05-28 11:52:51 +00:00
  • 2ffbfda47d test: print SMEM verify data biondizzle 2026-05-28 11:51:08 +00:00
  • 4fd41365de test: add SMEM verify for HD=64 K-tile offsets biondizzle 2026-05-28 11:49:44 +00:00
  • 4483539f01 test: HD=64 random data, 4 K-tiles, accumulate biondizzle 2026-05-28 11:47:56 +00:00
  • 73bd21ce01 test: force 1 K-tile for HD=64 debug biondizzle 2026-05-28 11:46:12 +00:00
  • abe1870429 test: HD=64 all-ones, expected S[0,j]=64 (unscaled) or 8.0 scaled biondizzle 2026-05-28 11:44:31 +00:00
  • 73f9ff98c9 test: UMMA QK HD=64 (4 K-tiles, accumulate) — multi-K-tile test biondizzle 2026-05-28 11:42:29 +00:00
  • df34cae9c6 UMMA QK GEMM WORKING! Update docs — 4x was scale factor, not bug biondizzle 2026-05-28 11:41:19 +00:00
  • 1874a70a6d test: fix var ref biondizzle 2026-05-28 11:39:15 +00:00
  • 8426d13285 test: fix comparison — row 0 is S[0,c], rows 1-127 should be zero biondizzle 2026-05-28 11:38:22 +00:00
  • 6f40fafa91 test: verify ALL 128 rows × 8 cols match scalar reference biondizzle 2026-05-28 11:36:46 +00:00
  • 3c7d9d9303 test: apply 1/sqrt(HD) scale to MMA output — 4x was the scale factor, not a bug! biondizzle 2026-05-28 11:34:45 +00:00
  • 013f370046 test: all-ones data, expected S[0,j]=16.0 for every j biondizzle 2026-05-28 11:32:56 +00:00
  • f5a0966afc test: 4 warp leaders (lane==0) call MMA simultaneously biondizzle 2026-05-28 11:30:19 +00:00
  • c01d6fddf4 test: gau-nernst pattern — fence::after_thread_sync, 4 warps, 128 threads, 32x32b.x8 loop biondizzle 2026-05-28 11:28:47 +00:00
  • a048b56886 test: single-thread MMA + 0.25 scaling for 4× factor biondizzle 2026-05-28 10:23:06 +00:00
  • 57d67e6b51 test: revert to 64-bit descriptors, 4 warp leaders, 32x32b read biondizzle 2026-05-28 10:21:06 +00:00
  • 32f7fa7bce Update CURRENT_ISSUE.md and MEMORY.md with UMMA 4× bug details biondizzle 2026-05-28 10:15:14 +00:00
  • 3f95f1c5d4 test: try LBO with block_mn=32 (1/4 of M=128) biondizzle 2026-05-28 10:11:38 +00:00
  • d03e353972 test: 4 warp leaders call MMA (Layout D requires 4 warps) biondizzle 2026-05-28 10:10:07 +00:00
  • 8059ed15ad test: explicitly zero padding between Q and K biondizzle 2026-05-28 10:08:35 +00:00
  • 9e98c067ab test: Layout D TMEM read using 32x32b.x8 format, 4 warps biondizzle 2026-05-28 10:07:15 +00:00
  • 68d1a7920c test: M=64 in both desc and idesc biondizzle 2026-05-28 10:04:17 +00:00
  • 0f51fda0da test: try N=8 in idesc biondizzle 2026-05-28 10:02:52 +00:00
  • 4f7c9649fd test: clean UMMA QK test, debug 4x factor, 8KB padding, 128 TMEM cols biondizzle 2026-05-28 10:01:39 +00:00
  • ac65ece33b test: TMEM 2-store with fence outside wid guard, 64 threads biondizzle 2026-05-28 09:59:43 +00:00
  • 2c89eea6be test: fence+sync between 2 tmem_stores biondizzle 2026-05-28 09:58:51 +00:00
  • 24c5afe1dc test: 64 threads, 2 stores to col 0 biondizzle 2026-05-28 09:57:53 +00:00
  • 987f2c8917 test: 2 tmem_stores to SAME column 0 biondizzle 2026-05-28 09:57:07 +00:00
  • 494149f034 test: 32 threads (1 warp), no guards, all participate biondizzle 2026-05-28 09:56:17 +00:00
  • f0cb71da5c test: TMEM 2-col with fence+sync between stores, separate wid==0 blocks biondizzle 2026-05-28 09:54:19 +00:00
  • b69a538ab1 test: add fence+sync between 2 tmem_stores biondizzle 2026-05-28 09:53:10 +00:00
  • 7a21fa4bd8 test: add 2nd tmem_store to column 1 biondizzle 2026-05-28 09:52:05 +00:00
  • 4b129c146e test: add 1 tmem_load back biondizzle 2026-05-28 09:51:21 +00:00
  • 61f19ce891 test: skip tmem_load, only store+dealloc biondizzle 2026-05-28 09:50:48 +00:00
  • 2513e1a692 test: use 64 threads, fence outside warp guard, 1 store biondizzle 2026-05-28 09:50:09 +00:00
  • abfe9dbaa1 test: only 1 tmem_store to verify single column works biondizzle 2026-05-28 09:49:21 +00:00
  • 5795589abc test: TMEM 4 columns, individual store calls + loop load biondizzle 2026-05-28 09:48:27 +00:00
  • 8a428f6127 test: TMEM column addressing test (128 cols, store+load) biondizzle 2026-05-28 09:46:49 +00:00
  • ee3fe6d6b2 test: tmem_load column 1 only biondizzle 2026-05-28 09:45:34 +00:00
  • 6c38c6e442 test: read 8 TMEM columns individually (no loop) biondizzle 2026-05-28 09:44:30 +00:00
  • bcc6ed114d test: add 8KB padding after sQ to prevent MMA read overrun biondizzle 2026-05-28 09:43:17 +00:00
  • 764ed01d6f test: try M=64 in descriptor + idesc to debug 4x factor biondizzle 2026-05-28 09:41:50 +00:00
  • 4cb656e583 test: try idesc=0 (same as gau-nernst) biondizzle 2026-05-28 09:40:19 +00:00
  • cfba8484da test: try idesc with N=128 (full extent) + 128 TMEM cols biondizzle 2026-05-28 09:39:19 +00:00
  • 30f0056b11 test: clean rewrite with SMEM Q/K verification and dot product check biondizzle 2026-05-28 09:38:26 +00:00
  • 7eb85a71fc test: add Q SMEM verification output + bf16_to_f32_host biondizzle 2026-05-28 09:37:07 +00:00
  • 8f23c2aaf6 test: verify SMEM Q layout by reading back canonical data biondizzle 2026-05-28 09:35:58 +00:00
  • 004046a6a8 test: read only 1 TMEM column after MMA biondizzle 2026-05-28 09:35:02 +00:00
  • 41128122e3 test: clean rewrite, 32 TMEM cols, MMA N=32, tmem_load loop biondizzle 2026-05-28 09:33:45 +00:00
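
Note on the "4x" factor mentioned in commits 013f370046, 3c7d9d9303 and abe1870429: the expected values quoted there (S[0,j]=16.0 for HD=16 all-ones data, and 64 unscaled or 8.0 scaled for HD=64) follow from the usual 1/sqrt(HD) attention scale. The sketch below is a host-side scalar check of that arithmetic only. It is not taken from the repository, and the function name all_ones_score is hypothetical.

```python
import math


def all_ones_score(head_dim: int, scaled: bool) -> float:
    """Dot product of two all-ones vectors of length head_dim.

    With scaled=True the result is multiplied by 1/sqrt(head_dim),
    the softmax scale the commit messages refer to.
    """
    q = [1.0] * head_dim
    k = [1.0] * head_dim
    dot = sum(a * b for a, b in zip(q, k))
    return dot / math.sqrt(head_dim) if scaled else dot


for hd in (16, 64):
    raw = all_ones_score(hd, scaled=False)   # what an unscaled QK GEMM yields
    ref = all_ones_score(hd, scaled=True)    # what a scaled reference yields
    print(f"HD={hd}: unscaled={raw}, scaled={ref}, ratio={raw / ref}")
```

For HD=16 the ratio between the unscaled and scaled values is sqrt(16) = 4, which matches the commit message stating that the 4x discrepancy was the scale factor rather than a bug. For HD=64 the same ratio is 8.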