biondizzle
  • Joined on 2025-12-10
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-28 07:12:27 +00:00
bd16e8fa85 fix: use tcgen05.wait::st/ld instead of nonexistent tcgen05.fence
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-28 07:09:08 +00:00
ba1e81f2dc test: minimal TMEM isolation test (alloc, store, load, dealloc)
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-28 07:05:10 +00:00
4fe9bbab48 add back in the archived code
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-28 07:01:38 +00:00
4336de9372 attention/: Clean up folder, archive backups, add detailed status headers
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-28 06:46:29 +00:00
d46ae8b967 test: disable TMEM test (hanging), verify reference still works
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-28 06:42:02 +00:00
e58980f80e fix: increase test timeout for TMEM kernel
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-28 06:39:21 +00:00
a391615f60 fix: uint64_t for SMEM pointer
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-28 06:37:56 +00:00
b4779e3f48 fix: cvta.to.shared.u64 for 64-bit SMEM pointers
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-28 06:36:50 +00:00
cf264bd0e2 fix: cvta.shared.u32 (not cvta.to.shared)
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-28 06:35:51 +00:00
771799e112 FMHA SM100: Fix TMEM operations — uint32_t registers, correct PTX syntax
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-28 06:32:56 +00:00
73d1e38129 fix: last HD→HD_val
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-28 06:32:05 +00:00
e940786fd5 fix: HD_val variable name in test
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-28 06:31:06 +00:00
e173295a3a FMHA SM100: Refactor into common + reference + TMEM epilogue headers
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-28 06:29:13 +00:00
a73fb689f9 fix: dispatch template HD at compile time
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-28 06:27:57 +00:00
bcc5d0b6cb FMHA SM100: Add TMEM+correction epilogue kernel (Priority 2)
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-28 05:34:05 +00:00
8eb735618f fix: use expf for softmax (not exp2f with scale)
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-28 05:32:50 +00:00
3cb339129b FMHA SM100: Fix Phase 1 — single-thread reference for correctness
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-28 05:31:40 +00:00
7fb838913f fix: include path for standalone test
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-28 05:31:05 +00:00
99b35eb2de test: standalone CUDA test for FMHA SM100 (no PyTorch needed)
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-28 05:28:57 +00:00
77fa34a9a6 fix: update launch wrapper for fmha_decode_ref