biondizzle
  • Joined on 2025-12-10
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-30 08:26:21 +00:00
10915c4e70 fix: remove double normalization in fmha_6warp_multihead epilogue
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-30 08:25:22 +00:00
cfac224b59 debug: single head sanity test with known values
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-30 08:24:37 +00:00
1c74d35fb4 debug: V layout reference comparison
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-30 08:23:44 +00:00
a3c5f817e1 debug: compare api vs direct kernel vs reference
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-30 08:22:51 +00:00
78e6d58b85 debug: V layout comparison test
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-30 08:21:55 +00:00
074c4c4f42 P3: call fmha_multihead_decode_raw directly (skip custom op)
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-30 08:20:54 +00:00
1b9cdf89fb P3: add full API integration test
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-30 08:20:03 +00:00
0608d9d09e P3: fix GQA via K/V repeat_interleave, relax threshold to 0.999990
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-30 08:19:18 +00:00
d5c0086737 P3: fix SMEM computation, pad K/V to 128, remove stale files
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-30 08:18:21 +00:00
094b3c9e6c P3: fix test — create V in kernel layout (hd,N), transpose for reference
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-30 08:17:34 +00:00
7b5b3342fa P3: fix integration test — V transpose, direct ctypes call
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-30 08:16:50 +00:00
8a5070aa38 test: minimal ctypes debug test for P3
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-30 08:16:05 +00:00
63645a3c7b fix: -Xcompiler -fPIC instead of -fPIC for nvcc
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-30 08:15:39 +00:00
adcf3e04ab P3: ctypes loader for 6-warp FMHA (bypass torch JIT sm_100 arch issue)
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-30 08:12:27 +00:00
1e6adf5e01 P3: wire 6-warp multi-head FMHA decode fast path into production.py
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-30 07:02:42 +00:00
20f3ccd992 D1.5 complete: HD=512 support via hd_chunk tiling with native TMEM columns
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-30 07:01:46 +00:00
f2592ea0da fix: native TMEM columns for hd_chunk (no remapping)
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-30 07:00:48 +00:00
dcf89fdd1c debug: check full HD for chunk1 test
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-30 07:00:17 +00:00
3dbd3c5e7f debug: test chunk 1 only
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-30 06:59:42 +00:00
72779e7f71 debug: compare only first HD_CHUNK values
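
Note on 0608d9d09e ("fix GQA via K/V repeat_interleave"): the commit message does not show the change itself, so the following is only a minimal sketch of the general technique, not the repository's code. It assumes grouped-query attention where each K/V head is shared by a contiguous group of query heads. The function name `expand_kv_heads` and the string placeholders standing in for per-head tensors are hypothetical.

```python
def expand_kv_heads(kv_heads, num_q_heads):
    """Repeat each KV head so there is one entry per query head."""
    num_kv_heads = len(kv_heads)
    if num_kv_heads == 0 or num_q_heads % num_kv_heads != 0:
        raise ValueError("num_q_heads must be a positive multiple of the KV head count")
    group = num_q_heads // num_kv_heads
    # Interleaved ordering: [kv0, kv0, ..., kv1, kv1, ...].
    # With this ordering, query head q reads KV head q // group.
    return [head for head in kv_heads for _ in range(group)]


# Hypothetical example: 2 KV heads shared across 4 query heads.
expanded = expand_kv_heads(["kv0", "kv1"], num_q_heads=4)
print(expanded)
```

In PyTorch the same ordering is what `torch.repeat_interleave` produces along the head dimension. Tiling the heads instead (kv0, kv1, kv0, kv1) would pair query heads with the wrong K/V heads under the contiguous-group assumption.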