-
58be79957d
test: 32 TMEM cols, add MMA call with N=32, read S from TMEM
biondizzle
2026-05-28 09:32:33 +00:00
-
22fb861447
test: 2 tmem_stores with syncwarp between
biondizzle
2026-05-28 09:30:37 +00:00
-
a87f20a4ae
test: just 1 tmem_store, no fence, no loop
biondizzle
2026-05-28 09:29:46 +00:00
-
2b57f28968
test: zero 128 TMEM columns, skip fence
biondizzle
2026-05-28 09:29:14 +00:00
-
25c9b70591
test: zero 2 TMEM columns
biondizzle
2026-05-28 09:28:31 +00:00
-
01c4097ccc
test: zero 32 TMEM columns
biondizzle
2026-05-28 09:27:59 +00:00
-
3694f63ba4
test: re-enable full TMEM zeroing (128 columns)
biondizzle
2026-05-28 09:27:25 +00:00
-
c3b6c3a5e6
test: minimal tmem_store debug (1 column + sentinels)
biondizzle
2026-05-28 09:26:52 +00:00
-
f1aaa50326
test: re-enable TMEM zeroing with tmem_base debug
biondizzle
2026-05-28 09:26:16 +00:00
-
a7f81331f8
test: skip TMEM zeroing again, alloc+dealloc only
biondizzle
2026-05-28 09:25:37 +00:00
-
3f5dcd481e
test: zero only 32 TMEM columns
biondizzle
2026-05-28 09:25:05 +00:00
-
2b1c8ce7df
test: re-enable all TMEM ops (alloc, zero, dealloc)
biondizzle
2026-05-28 09:24:28 +00:00
-
acc7424a48
test: skip TMEM zeroing, just alloc+dealloc
biondizzle
2026-05-28 09:23:48 +00:00
-
ca419c52f3
test: re-enable TMEM alloc + zero
biondizzle
2026-05-28 09:23:10 +00:00
-
09e8ea5933
test: fix compile error, skip TMEM read
biondizzle
2026-05-28 09:22:17 +00:00
-
69bbc21300
test: skip all TMEM ops, just test SMEM layout + descriptor
biondizzle
2026-05-28 09:21:52 +00:00
-
a6c0ce51a2
test: skip MMA, just test descriptor values
biondizzle
2026-05-28 09:20:59 +00:00
-
ea6b42e649
test_umma_qk: add descriptor debug output
biondizzle
2026-05-28 09:20:12 +00:00
-
0f6907b001
UMMA: fix descriptor + idesc — use gau-nernst tutorial values
biondizzle
2026-05-28 09:18:45 +00:00
-
9b458d2a6c
test_umma_qk: clean rewrite, hardcoded HD=16, explicit core-matrix layout writes
biondizzle
2026-05-28 09:16:37 +00:00
-
427410d94a
UMMA: Rewrite fmha_umma_desc.cuh with correct K-major core-matrix layout + minimal QK GEMM test
biondizzle
2026-05-28 09:15:40 +00:00
-
68b4151d21
dump SMEM layout info
biondizzle
2026-05-28 08:59:19 +00:00
-
fe0588d906
fix: simplify UMMA dump script
biondizzle
2026-05-28 08:57:49 +00:00
-
948a3f8a7a
add UMMA descriptor dump script
biondizzle
2026-05-28 08:55:43 +00:00
-
e5ba0ca119
debug: clean QK verify with scalar sanity + MMA result
biondizzle
2026-05-28 08:53:35 +00:00
-
a04d794979
debug: skip TMEM alloc — test SMEM loads only
biondizzle
2026-05-28 08:49:37 +00:00
-
72c97f2546
debug: minimal UMMA descriptor (just start_addr + version)
biondizzle
2026-05-28 08:48:01 +00:00
-
9a51bfa578
fix: align SMEM layout properly (128B aligned tmem + Q)
biondizzle
2026-05-28 08:46:56 +00:00
-
2a765be715
fix: correct SMEM size for row-major (not swizzled)
biondizzle
2026-05-28 08:44:55 +00:00
-
c64bd7b875
debug: read Q/K directly from SMEM
biondizzle
2026-05-28 08:43:39 +00:00
-
58b610c96c
fix: proper early return for SMEM load test
biondizzle
2026-05-28 08:41:30 +00:00
-
82bc2c4a49
debug: verify SMEM loads + scalar QK sanity check
biondizzle
2026-05-28 08:40:16 +00:00
-
53139d24bf
debug: verify TMEM r/w works before MMA
biondizzle
2026-05-28 08:39:12 +00:00
-
a9d71ff6ab
debug: print TMEM values after MMA
biondizzle
2026-05-28 08:38:08 +00:00
-
bfb1e177ce
debug: try all-lane MMA + print tmem_base
biondizzle
2026-05-28 08:37:02 +00:00
-
d3510980e4
feat: SWIZZLE_NONE UMMA descriptors with row-major SMEM
biondizzle
2026-05-28 08:35:30 +00:00
-
8c67c31497
add CuTe descriptor printing script
biondizzle
2026-05-28 08:23:34 +00:00
-
d29d6b575f
add UMMA descriptor diagnostic script
biondizzle
2026-05-28 08:20:56 +00:00
-
ab84ad0f86
feat: implement canonical UMMA SMEM layout with SWIZZLE_128B
biondizzle
2026-05-28 08:18:47 +00:00
-
ecbc75255c
fix: correct UMMA descriptor format from CUTLASS source
biondizzle
2026-05-28 08:07:52 +00:00
-
fe7d561143
debug: print UMMA descriptor values for diagnosis
biondizzle
2026-05-28 08:03:53 +00:00
-
c5f7a9a15c
fix: align SMEM buffers to 16 bytes for UMMA descriptors
biondizzle
2026-05-28 08:02:53 +00:00
-
3549a2388b
fix: constexpr HD for template param
biondizzle
2026-05-28 08:01:18 +00:00
-
7436315309
feat: add tcgen05.mma QK GEMM verification kernel + test
biondizzle
2026-05-28 08:00:42 +00:00
-
6fb3d54c02
docs: update here-docs with CuTeDSL rationale for NVIDIA
biondizzle
2026-05-28 07:54:01 +00:00
-
9524b674ab
test: enable both reference + TMEM epilogue tests at hd=64/128
biondizzle
2026-05-28 07:49:48 +00:00
-
446a0ca9fd
refactor(tmem): clean rewrite of TMEM epilogue kernel
biondizzle
2026-05-28 07:49:03 +00:00
-
c989dc78d9
debug: print sPvBuf[32] value
biondizzle
2026-05-28 07:47:37 +00:00
-
146e4f0282
debug: print NaN positions in test
biondizzle
2026-05-28 07:46:57 +00:00
-
b50f6a8512
debug: add TMEM read diagnostic
biondizzle
2026-05-28 07:46:15 +00:00
-
a12607b0bd
test: add NaN counter to FMHA test
biondizzle
2026-05-28 07:45:32 +00:00
-
53c676c8a6
test: add max_abs_diff to FMHA test output
biondizzle
2026-05-28 07:44:45 +00:00
-
579dd061cd
fix: remove duplicate TMEM_COLS_NEEDED declarations
biondizzle
2026-05-28 07:43:54 +00:00
-
278f1b34af
fix(tmem): correct lane-to-position mapping for tcgen05.ld/st
biondizzle
2026-05-28 07:43:40 +00:00
-
593bc25afa
test: add TMEM lane mapping diagnostics
biondizzle
2026-05-28 07:42:16 +00:00
-
33cedbee0a
fix(tmem): TMEM ld/st are warp-collective — ALL 32 lanes must call them
biondizzle
2026-05-28 07:41:16 +00:00
-
cea02fe407
fix: add cstdio for printf in TMEM debug
biondizzle
2026-05-28 07:40:04 +00:00
-
0ddcc6bafd
debug: add printf to TMEM kernel to find hang point
biondizzle
2026-05-28 07:39:53 +00:00
-
44fb04fa1f
test: disable tmem epilogue test (debugging reference hang)
biondizzle
2026-05-28 07:38:47 +00:00
-
224d7e24c6
harness: add fire_b200_cuda_test + check_b200_cuda, update README
biondizzle
2026-05-28 07:36:10 +00:00
-
cec505ce14
add CUDA test runner script (screen-based, follows harness pattern)
biondizzle
2026-05-28 07:31:41 +00:00
-
2eb44a00bf
fix(tmem): warp-collective TMEM ops + one-way correction epilogue
biondizzle
2026-05-28 07:27:25 +00:00
-
bd16e8fa85
fix: use tcgen05.wait::st/ld instead of nonexistent tcgen05.fence
biondizzle
2026-05-28 07:12:26 +00:00
-
ba1e81f2dc
test: minimal TMEM isolation test (alloc, store, load, dealloc)
biondizzle
2026-05-28 07:09:06 +00:00
-
4fe9bbab48
add back in the archived code
biondizzle
2026-05-28 07:04:59 +00:00
-
4336de9372
attention/: Clean up folder, archive backups, add detailed status headers
biondizzle
2026-05-28 07:01:33 +00:00
-
d46ae8b967
test: disable TMEM test (hanging), verify reference still works
biondizzle
2026-05-28 06:46:27 +00:00
-
e58980f80e
fix: increase test timeout for TMEM kernel
biondizzle
2026-05-28 06:41:59 +00:00
-
a391615f60
fix: uint64_t for SMEM pointer
biondizzle
2026-05-28 06:39:19 +00:00
-
b4779e3f48
fix: cvta.to.shared.u64 for 64-bit SMEM pointers
biondizzle
2026-05-28 06:37:52 +00:00
-
cf264bd0e2
fix: cvta.shared.u32 (not cvta.to.shared)
biondizzle
2026-05-28 06:36:50 +00:00
-
771799e112
FMHA SM100: Fix TMEM operations — uint32_t registers, correct PTX syntax
biondizzle
2026-05-28 06:35:50 +00:00
-
73d1e38129
fix: last HD→HD_val
biondizzle
2026-05-28 06:32:55 +00:00
-
e940786fd5
fix: HD_val variable name in test
biondizzle
2026-05-28 06:32:01 +00:00
-
e173295a3a
FMHA SM100: Refactor into common + reference + TMEM epilogue headers
biondizzle
2026-05-28 06:31:05 +00:00
-
a73fb689f9
fix: dispatch template HD at compile time
biondizzle
2026-05-28 06:29:10 +00:00
-
bcc5d0b6cb
FMHA SM100: Add TMEM+correction epilogue kernel (Priority 2)
biondizzle
2026-05-28 06:27:56 +00:00
-
8eb735618f
fix: use expf for softmax (not exp2f with scale)
biondizzle
2026-05-28 05:34:03 +00:00
-
3cb339129b
FMHA SM100: Fix Phase 1 — single-thread reference for correctness
biondizzle
2026-05-28 05:32:47 +00:00
-
7fb838913f
fix: include path for standalone test
biondizzle
2026-05-28 05:31:39 +00:00
-
99b35eb2de
test: standalone CUDA test for FMHA SM100 (no PyTorch needed)
biondizzle
2026-05-28 05:31:03 +00:00
-
77fa34a9a6
fix: update launch wrapper for fmha_decode_ref
biondizzle
2026-05-28 05:28:49 +00:00
-
00ac46c9d3
FMHA SM100: Phase 1 — reference scalar implementation
biondizzle
2026-05-28 05:27:36 +00:00
-
6f7449ce71
FMHA SM100: Fix tcgen05.mma PTX syntax — correct register constraints
biondizzle
2026-05-28 05:25:59 +00:00
-
a11a245307
fix: use unsigned short for BF16 storage, inline PTX for conversions
biondizzle
2026-05-28 05:24:32 +00:00
-
2d4e2c57e0
auto: pre-test commit
biondizzle
2026-05-28 05:22:23 +00:00
-
97df02ea07
fix: -Xcompiler -fPIC for nvcc shared library
biondizzle
2026-05-28 05:22:15 +00:00
-
4dfb71bc20
test: nvcc direct compilation test (avoid torch JIT __bf16 ICE)
biondizzle
2026-05-28 05:21:41 +00:00
-
373900fa08
FMHA SM100: Fix launch wrapper to match new kernel API
biondizzle
2026-05-28 05:20:31 +00:00
-
a30ebfb197
FMHA SM100: Full kernel with TMEM PTX, UMMA descriptors, softmax loop
biondizzle
2026-05-28 05:19:34 +00:00
-
09dfd4a41f
fix: rename .cpp to .cu for CUDA compilation
biondizzle
2026-05-28 05:16:41 +00:00
-
4c194b7254
fix: add CUDA include path for host compiler
biondizzle
2026-05-28 05:15:48 +00:00
-
48baea7728
FMHA SM100: Remove CUTLASS includes, write raw PTX inline asm
biondizzle
2026-05-28 05:15:07 +00:00
-
88d5995ec9
fix: define bf16_t using __bf16 built-in, avoid cuda_bf16.h bug
biondizzle
2026-05-28 05:14:01 +00:00
-
f0660d0bd7
fix: use C++20 for cuda_bf16.h compat
biondizzle
2026-05-28 05:13:18 +00:00
-
6bd3356582
fix: include cuda_bf16.h unconditionally, add --expt-relaxed-constexpr
biondizzle
2026-05-28 05:13:01 +00:00
-
c1266b5275
fix: include cuda_bf16.h only in device code
biondizzle
2026-05-28 05:12:30 +00:00
-
a64e55665b
fix: avoid cuda_bf16.h, use inline PTX for BF16 conversion
biondizzle
2026-05-28 05:12:08 +00:00
-
1734d13f60
fix: restore cuda_bf16.h include
biondizzle
2026-05-28 05:11:39 +00:00
-
8783a25deb
fix: guard cuda_bf16.h with __CUDA_ARCH__
biondizzle
2026-05-28 05:11:11 +00:00