biondizzle

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-05-20 03:26:28 +00:00

b1778eedf8 wip: Step 2 gate/up pairing — SiLU validated, runtime conditionals blocked by CuTeDSL

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-05-20 03:16:37 +00:00

842bb42ed1 wip: Step 1 SiLU validation complete, Step 2 gate/up pairing planning

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-05-20 03:12:25 +00:00

77cc28cc92 fix: cutlass.Float32 not cutlass.float32_t in fused epilogue

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-05-20 03:10:58 +00:00

ed89e678be wip: add run_fused_swiglu_grouped_gemm bridge + step1 test

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-05-20 03:07:09 +00:00

2fcd5f1902 wip: fused SwiGLU Stage 1 - SiLU in registers (full acc_vec)

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-05-20 03:04:40 +00:00

9cdf79fd9c wip: fused SwiGLU kernel scaffold + bridge interleave + plan

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-05-20 02:17:45 +00:00

2f8b26c176 chore: remove unused _expert_id_range after bincount migration

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-05-20 02:17:24 +00:00

7e2adb7e85 perf: replace expert counting O(n*E) comparison with torch.bincount O(n)

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-05-20 02:16:51 +00:00

d59b10e170 fix: zero out x_norm for underflow blocks before division in NVFP4 quantization

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-05-20 02:14:59 +00:00

c8fa87fac7 fix: detect zero blocks in NVFP4 quantization, force FP4+FP8 to exact zero

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-05-20 02:11:43 +00:00

3c6b5a0522 chore: deprecate prepare_weights_from_dequantized and prepare_weights_direct

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-05-20 02:08:28 +00:00

3181f74c86 fix: correct scale factor dimensions in warmup (K_sf = ceil_div(K_packed,8) not ceil_div(K_packed,16))

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-05-20 02:08:09 +00:00

cc6b094450 fix: root-cause JIT memory corruption myth, add eager warmup, remove _needs_token_refill

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-05-20 01:36:31 +00:00

039a9e27d6 fix: handle 3D swa_indices and correct kv_bf16 expand dims

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-05-20 01:28:07 +00:00

b3f6f260ce feat: add native CuTeDSL SWA decode attention kernel stub + batched SDPA fallback

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-05-20 00:02:18 +00:00

268dc251c1 fix: replace _allocate_buffers with _ensure_buffer_size for dynamic sizing

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-05-20 00:00:02 +00:00

09669dded4 fix: dynamic buffer sizing in nvfp4_linear for varying token counts

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-05-19 23:19:01 +00:00

02b9c1ac20 nuke vllm because this keep confusing people

02b57071be Update README.md and CURRENT_BUG.md: eliminate stale issues, document NaN investigation, clarify our kernels are clean

7070fadf72 Add full layer NaN test (attention + MoE, multi-layer chain)

152b0749df Use 16 experts for MoE runner test (fits in memory)

daa59a7c75 Add MoE runner NaN test (grouped GEMM with real weights)

166 commits in this push

biondizzle pushed to proper-nvfp4-integration at biondizzle/nvfp4-megamoe-kernel

2026-05-19 23:04:39 +00:00

02b9c1ac20 nuke vllm because this keep confusing people

biondizzle created repository biondizzle/dsv4-nvfp4-workspace

2026-05-19 22:43:52 +00:00