biondizzle
  • Joined on 2025-12-10
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-17 17:24:27 +00:00
3f2f4e1882 Fix cudaErrorStreamCaptureUnsupported: no dynamic GPU-tensor slicing
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-17 16:59:52 +00:00
11b5aa5e37 Scale assembly: full-buffer swizzle, zero CPU syncs, no Python loops
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-17 16:56:53 +00:00
94dec5922d Scale assembly Phase 2: use CPU-computed offsets for Python slicing
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-17 16:55:48 +00:00
49c28e6562 Fix: use real padded expert offsets instead of fixed layout
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-17 16:52:43 +00:00
87a223f1ac Update CURRENT_BUG.md: current status, outstanding garbage output issue, hypotheses
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-17 16:25:41 +00:00
c03438fc4e crap shoot
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-17 16:06:59 +00:00
7c16f3cb46 Fix: init shared dict before using it, remove duplicate _output_buf
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-17 16:05:56 +00:00
ea8acf9852 Share padded_x_sf and output buffers across layers to save ~300 MB
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-17 15:52:08 +00:00
3d0b1408b4 Update CURRENT_BUG.md: Bug 21 (shared buffers), clean up status
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-17 15:47:39 +00:00
455ecb5631 Fix: define padded_max_slots before using it in shared buffer allocation
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-17 15:47:12 +00:00
b1ac74bb4d Fix shape mismatch: shared padded buffers, revert max_num_tokens cap
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-17 15:46:15 +00:00
e2f33596a2 Update CURRENT_BUG.md: status through Bug 20, fixed-layout padding architecture
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-17 15:18:09 +00:00
faf7c8cc51 Debug: print runner max_num_tokens and max_chunks
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-17 14:59:46 +00:00
c5af1aba6b Fix OOB: size padded buffers for num_experts*max_chunks*128
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-17 14:14:16 +00:00
8ac8e20fa9 Fix OOM: cap buffer pre-allocation at cudagraph max capture size
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-17 14:02:12 +00:00
5bb78564f5 Remove dynamic tensor allocation in scale assembly (cudagraph fix)
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-17 13:59:01 +00:00
8c31e78359 Fix cudagraph: fully fixed-layout per-expert sections, no GPU scalars in Python control flow
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-17 13:56:54 +00:00
ff74b33d2c Fix cudagraph: static loop for per-expert scale swizzle
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-17 13:55:12 +00:00
bf22b6f0e4 Fix scale assembly: variable-size per-expert padding matching GEMM offsets
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-17 13:32:43 +00:00
0d3c928ff2 Update CURRENT_BUG.md: full status through Bug 14, vLLM integration status, architecture docs