biondizzle

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-05-17 17:24:27 +00:00

3f2f4e1882 Fix cudaErrorStreamCaptureUnsupported: no dynamic GPU-tensor slicing

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-05-17 16:59:52 +00:00

11b5aa5e37 Scale assembly: full-buffer swizzle, zero CPU syncs, no Python loops

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-05-17 16:56:53 +00:00

94dec5922d Scale assembly Phase 2: use CPU-computed offsets for Python slicing

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-05-17 16:55:48 +00:00

49c28e6562 Fix: use real padded expert offsets instead of fixed layout

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-05-17 16:52:43 +00:00

87a223f1ac Update CURRENT_BUG.md: current status, outstanding garbage output issue, hypotheses

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-05-17 16:25:41 +00:00

c03438fc4e crap shoot

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-05-17 16:06:59 +00:00

7c16f3cb46 Fix: init shared dict before using it, remove duplicate _output_buf

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-05-17 16:05:56 +00:00

ea8acf9852 Share padded_x_sf and output buffers across layers to save ~300 MB

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-05-17 15:52:08 +00:00

3d0b1408b4 Update CURRENT_BUG.md: Bug 21 (shared buffers), clean up status

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-05-17 15:47:39 +00:00

455ecb5631 Fix: define padded_max_slots before using it in shared buffer allocation

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-05-17 15:47:12 +00:00

b1ac74bb4d Fix shape mismatch: shared padded buffers, revert max_num_tokens cap

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-05-17 15:46:15 +00:00

e2f33596a2 Update CURRENT_BUG.md: status through Bug 20, fixed-layout padding architecture

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-05-17 15:18:09 +00:00

faf7c8cc51 Debug: print runner max_num_tokens and max_chunks

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-05-17 14:59:46 +00:00

c5af1aba6b Fix OOB: size padded buffers for num_experts*max_chunks*128

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-05-17 14:14:16 +00:00

8ac8e20fa9 Fix OOM: cap buffer pre-allocation at cudagraph max capture size

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-05-17 14:02:12 +00:00

5bb78564f5 Remove dynamic tensor allocation in scale assembly (cudagraph fix)

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-05-17 13:59:01 +00:00

8c31e78359 Fix cudagraph: fully fixed-layout per-expert sections, no GPU scalars in Python control flow

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-05-17 13:56:54 +00:00

ff74b33d2c Fix cudagraph: static loop for per-expert scale swizzle

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-05-17 13:55:12 +00:00

bf22b6f0e4 Fix scale assembly: variable-size per-expert padding matching GEMM offsets

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-05-17 13:32:43 +00:00

0d3c928ff2 Update CURRENT_BUG.md: full status through Bug 14, vLLM integration status, architecture docs