biondizzle
  • Joined on 2025-12-10
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-20 03:26:28 +00:00
b1778eedf8 wip: Step 2 gate/up pairing — SiLU validated, runtime conditionals blocked by CuTeDSL
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-20 03:16:37 +00:00
842bb42ed1 wip: Step 1 SiLU validation complete, Step 2 gate/up pairing planning
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-20 03:12:25 +00:00
77cc28cc92 fix: cutlass.Float32 not cutlass.float32_t in fused epilogue
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-20 03:10:58 +00:00
ed89e678be wip: add run_fused_swiglu_grouped_gemm bridge + step1 test
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-20 03:07:09 +00:00
2fcd5f1902 wip: fused SwiGLU Stage 1 - SiLU in registers (full acc_vec)
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-20 03:04:40 +00:00
9cdf79fd9c wip: fused SwiGLU kernel scaffold + bridge interleave + plan
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-20 02:17:45 +00:00
2f8b26c176 chore: remove unused _expert_id_range after bincount migration
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-20 02:17:24 +00:00
7e2adb7e85 perf: replace expert counting O(n*E) comparison with torch.bincount O(n)
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-20 02:16:51 +00:00
d59b10e170 fix: zero out x_norm for underflow blocks before division in NVFP4 quantization
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-20 02:14:59 +00:00
c8fa87fac7 fix: detect zero blocks in NVFP4 quantization, force FP4+FP8 to exact zero
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-20 02:11:43 +00:00
3c6b5a0522 chore: deprecate prepare_weights_from_dequantized and prepare_weights_direct
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-20 02:08:28 +00:00
3181f74c86 fix: correct scale factor dimensions in warmup (K_sf = ceil_div(K_packed,8) not ceil_div(K_packed,16))
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-20 02:08:09 +00:00
cc6b094450 fix: root-cause JIT memory corruption myth, add eager warmup, remove _needs_token_refill
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-20 01:36:31 +00:00
039a9e27d6 fix: handle 3D swa_indices and correct kv_bf16 expand dims
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-20 01:28:07 +00:00
b3f6f260ce feat: add native CuTeDSL SWA decode attention kernel stub + batched SDPA fallback
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-20 00:02:18 +00:00
268dc251c1 fix: replace _allocate_buffers with _ensure_buffer_size for dynamic sizing
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-20 00:00:02 +00:00
09669dded4 fix: dynamic buffer sizing in nvfp4_linear for varying token counts
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-19 23:19:01 +00:00
02b9c1ac20 nuke vllm because this keep confusing people
02b57071be Update README.md and CURRENT_BUG.md: eliminate stale issues, document NaN investigation, clarify our kernels are clean
7070fadf72 Add full layer NaN test (attention + MoE, multi-layer chain)
152b0749df Use 16 experts for MoE runner test (fits in memory)
daa59a7c75 Add MoE runner NaN test (grouped GEMM with real weights)
biondizzle pushed to proper-nvfp4-integration at biondizzle/nvfp4-megamoe-kernel 2026-05-19 23:04:39 +00:00
02b9c1ac20 nuke vllm because this keep confusing people
biondizzle created repository biondizzle/dsv4-nvfp4-workspace 2026-05-19 22:43:52 +00:00