biondizzle
  • Joined on 2025-12-10
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-17 13:19:33 +00:00
bde81b95f4 Fix GEMM scale layout: pad to 128 tokens per expert
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-17 12:31:27 +00:00
7e692c3aec Fix cudaErrorStreamCaptureUnsupported: pre-allocate all tensors used during capture
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-17 11:39:06 +00:00
b0221662e7 Fix warmup: pass local expert IDs (not global), remove incorrect _warmup_done guard
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-17 11:11:01 +00:00
b531a98f8f Fix scale assembly: per-expert 128-row fixed slots, no dynamic sizing
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-17 10:48:25 +00:00
04245b664b Add warmup-based activation global scale computation in finalize_weights
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-17 09:59:59 +00:00
4445882ba7 Fix: return 2D scale tensor for GEMM (shape[1] access)
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-17 09:59:14 +00:00
3cd910193c Rewrite scale assembly: no .item() calls, no Python loops, fully GPU
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-17 09:58:10 +00:00
4f6217acb9 Fix padded_cols calculation in scale assembly
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-17 09:57:27 +00:00
918aa8aede Fix scale assembly output shape: reshape to 2D for GEMM
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-17 09:56:30 +00:00
d9bae6d770 Fix OOB in scale assembly: size padded_x_sf for max tokens, fix top_k/max_num_tokens passing, support variable-size expert blocks
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-17 09:39:44 +00:00
55ac60eb91 Add detailed debug prints for OOB investigation
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-17 09:19:12 +00:00
fed3c417ba Add debug OOB check for sorted_token_ids
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-17 09:01:27 +00:00
eb7d4f099b Update CURRENT_BUG.md with Bug 8 (global→local expert ID) and Bug 8b (.cpu() sync)
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-17 08:58:45 +00:00
ca3cba5bbd Fix global→local expert ID remapping for EP and remove .cpu() sync
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-17 08:30:44 +00:00
1330e2b2cf cleanup: remove debug prints, ready for testing
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-17 08:29:20 +00:00
d635dcbbb6 fix: keep token_indices on CPU, index with CPU sort_idx
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-17 08:27:49 +00:00
235d5b314f fix: fallback token indices allocation with verify+rebuild
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-17 08:25:26 +00:00
dd0b3fd4f9 debug: print sorted_token_ids in warmup
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-17 08:24:59 +00:00
04999d86cf fix: add quantize_to_nvfp4 import
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-17 08:24:30 +00:00
33e28100ee test: use runner's built-in warmup method