biondizzle
  • Joined on 2025-12-10
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-22 09:32:09 +00:00
c82c1ddc1b test: add multiple seeds to verify softmax consistency
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-22 09:31:10 +00:00
a24b3e75a2 fix: use plain range loop for row_max (fmax not allowed in vectorized)
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-22 09:30:37 +00:00
c96454d70b fix: add missing old_row_max = row_max before softmax max computation
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-22 09:29:46 +00:00
aa9c2d2308 fix vectorize issue: remove vectorize from exp2 pass, add row_sum accumulation
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-22 09:28:35 +00:00
f631ff16d6 fix: use cute.arch.fmax instead of if-else in vectorized loop
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-22 09:27:37 +00:00
941bcae8e1 softmax: element-wise row_max computation instead of .reduce()
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-22 09:24:00 +00:00
5e51b726ba fix O normalization: use direct rmem tensor from partition_D shape
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-22 09:22:57 +00:00
0da960d8da FMHA Stage-C: real softmax + O normalization in 6-warp layout
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-22 09:19:18 +00:00
6ebccf1e7e fix: use make_smem_layout_epi not make_epilogue_smem_layout
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-22 09:18:57 +00:00
208af3eadd FMHA v3 Stage-C full: 12-warp pipeline with real softmax + correction + epilogue
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-22 08:57:44 +00:00
b81ed1924b more stuff
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-22 07:29:05 +00:00
7e1ba2b525 FMHA v3: per-row min test + explicit loop replacements
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-22 07:09:56 +00:00
791bdc53a0 FMHA v3: per-row patch from Mike + deadlock fix + V layout fix
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-22 05:52:15 +00:00
4761931c3e FMHA v3: add debug variants for C9 normalization investigation
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-22 01:45:08 +00:00
201f11a339 Fix indexer score kernel: use static shared memory, correct FP4 head offsets
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-22 01:20:42 +00:00
6e06aed46c Indexer: score+topk kernel, gather KV, compute_valid_lens
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-22 00:25:51 +00:00
8fcbc699a8 Flush compressor: schema fix, prepare_forward, flush_write kernels, state rotation
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-22 00:08:39 +00:00
23abfe9845 KV Cache: schema, allocator, pools, manager, append_swa kernel
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-21 23:32:06 +00:00
44582ec43b Fix layer construction: match existing API signatures, add RMSNorm impl
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-21 23:26:52 +00:00
39c1592d9c Clean up: remove debug/temp files and dangling test kernels