biondizzle

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-05-17 13:19:33 +00:00

bde81b95f4 Fix GEMM scale layout: pad to 128 tokens per expert
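Note on the commit above: a minimal sketch of what "pad to 128 tokens per expert" could mean for the scale layout, assuming each expert's row count is rounded up to the next multiple of 128 so that every expert's scale block starts on a 128-row boundary. The function name `padded_offsets` and the `align` parameter are hypothetical and are not taken from the repository.

```python
def padded_offsets(tokens_per_expert, align=128):
    """Return start offsets of each expert's block after padding.

    Hypothetical helper: each expert's token count is rounded up to a
    multiple of `align`, and offsets are the running sum of padded sizes.
    The returned list has len(tokens_per_expert) + 1 entries; the last
    entry is the total number of padded rows.
    """
    offsets = [0]
    for n in tokens_per_expert:
        padded = (n + align - 1) // align * align  # round up to multiple of align
        offsets.append(offsets[-1] + padded)
    return offsets


# Three experts receiving 3, 130, and 0 tokens respectively.
print(padded_offsets([3, 130, 0]))
```

An expert with zero tokens takes no rows under this scheme; a fixed-slot layout, as the earlier commit b531a98f8f describes, would instead reserve a slot regardless of the count.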

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-05-17 12:31:27 +00:00

7e692c3aec Fix cudaErrorStreamCaptureUnsupported: pre-allocate all tensors used during capture

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-05-17 11:39:06 +00:00

b0221662e7 Fix warmup: pass local expert IDs (not global), remove incorrect _warmup_done guard

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-05-17 11:11:01 +00:00

b531a98f8f Fix scale assembly: per-expert 128-row fixed slots, no dynamic sizing

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-05-17 10:48:25 +00:00

04245b664b Add warmup-based activation global scale computation in finalize_weights

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-05-17 09:59:59 +00:00

4445882ba7 Fix: return 2D scale tensor for GEMM (shape[1] access)

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-05-17 09:59:14 +00:00

3cd910193c Rewrite scale assembly: no .item() calls, no Python loops, fully GPU

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-05-17 09:58:10 +00:00

4f6217acb9 Fix padded_cols calculation in scale assembly

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-05-17 09:57:27 +00:00

918aa8aede Fix scale assembly output shape: reshape to 2D for GEMM

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-05-17 09:56:30 +00:00

d9bae6d770 Fix OOB in scale assembly: size padded_x_sf for max tokens, fix top_k/max_num_tokens passing, support variable-size expert blocks

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-05-17 09:39:44 +00:00

55ac60eb91 Add detailed debug prints for OOB investigation

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-05-17 09:19:12 +00:00

fed3c417ba Add debug OOB check for sorted_token_ids

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-05-17 09:01:27 +00:00

eb7d4f099b Update CURRENT_BUG.md with Bug 8 (global→local expert ID) and Bug 8b (.cpu() sync)

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-05-17 08:58:45 +00:00

ca3cba5bbd Fix global→local expert ID remapping for EP and remove .cpu() sync
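Note on the commit above: a minimal sketch of global-to-local expert ID remapping under expert parallelism (EP), assuming experts are split into contiguous, equally sized blocks per rank. The names `remap_to_local`, `ep_rank`, and `experts_per_rank`, and the use of -1 for experts owned by another rank, are illustrative assumptions rather than the repository's actual code.

```python
def remap_to_local(global_ids, ep_rank, experts_per_rank):
    """Map global expert IDs to IDs local to one EP rank.

    Assumes rank r owns the contiguous range
    [r * experts_per_rank, (r + 1) * experts_per_rank).
    IDs outside that range are mapped to -1 (not handled by this rank).
    """
    start = ep_rank * experts_per_rank
    end = start + experts_per_rank
    return [g - start if start <= g < end else -1 for g in global_ids]


# Rank 2 with 4 experts per rank owns global experts 8..11.
print(remap_to_local([0, 5, 8, 11, 12], ep_rank=2, experts_per_rank=4))
```

The list comprehension is for illustration only. On the GPU the same arithmetic would be done with tensor operations, so that no host synchronization such as `.cpu()` is needed.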

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-05-17 08:30:44 +00:00

1330e2b2cf cleanup: remove debug prints, ready for testing

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-05-17 08:29:20 +00:00

d635dcbbb6 fix: keep token_indices on CPU, index with CPU sort_idx

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-05-17 08:27:49 +00:00

235d5b314f fix: fallback token indices allocation with verify+rebuild

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-05-17 08:25:26 +00:00

dd0b3fd4f9 debug: print sorted_token_ids in warmup

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-05-17 08:24:59 +00:00

04999d86cf fix: add quantize_to_nvfp4 import

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-05-17 08:24:30 +00:00

33e28100ee test: use runner's built-in warmup method