biondizzle
  • Joined on 2025-12-10
biondizzle deleted branch main from biondizzle/nvfp4-megamoe-kernel 2026-05-14 12:46:38 +00:00
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-14 12:44:49 +00:00
d3f35c9465 cleanup: remove abandoned TileLang and Mojo files
802c4ee12c Revert stage_activation to simple quantize (staging kernel API incompatible with L1 output dims)
69e0174792 Fix stage_activation: use Triton staging kernel instead of broken simple quantize
c016e66e23 Add CUDA sync + NaN/Inf check after each expert GEMM in grouped kernel
1dfe5ffd05 Add comprehensive README documenting quirks, pitfalls, and setup
24 commits in this push (5 shown above)
biondizzle pushed to main at biondizzle/nvfp4-megamoe-kernel 2026-05-14 12:14:03 +00:00
802c4ee12c Revert stage_activation to simple quantize (staging kernel API incompatible with L1 output dims)
biondizzle pushed to main at biondizzle/nvfp4-megamoe-kernel 2026-05-14 12:01:35 +00:00
69e0174792 Fix stage_activation: use Triton staging kernel instead of broken simple quantize
biondizzle pushed to main at biondizzle/nvfp4-megamoe-kernel 2026-05-14 11:27:59 +00:00
c016e66e23 Add CUDA sync + NaN/Inf check after each expert GEMM in grouped kernel
biondizzle pushed to main at biondizzle/nvfp4-megamoe-kernel 2026-05-14 11:23:34 +00:00
1dfe5ffd05 Add comprehensive README documenting quirks, pitfalls, and setup
biondizzle pushed to main at biondizzle/nvfp4-megamoe-kernel 2026-05-14 10:50:28 +00:00
904fc37ad8 Fix: use idx2crd instead of get_coord for CuTe layout coordinate lookup
biondizzle pushed to main at biondizzle/nvfp4-megamoe-kernel 2026-05-14 10:48:58 +00:00
494d30b6ab Fix: use CuTe get_coord for proper scale factor remap to CUTLASS interleaved layout
biondizzle pushed to main at biondizzle/nvfp4-megamoe-kernel 2026-05-14 10:37:02 +00:00
869151d211 Fix kernel.py: remove broken expand on scale factors (was expanding sf to weight size)
biondizzle pushed to main at biondizzle/nvfp4-megamoe-kernel 2026-05-14 10:23:04 +00:00
84becfac93 Test: pass scales directly to CUTLASS (no remap) to diagnose layout issue
biondizzle pushed to main at biondizzle/nvfp4-megamoe-kernel 2026-05-14 10:21:11 +00:00
a272bc49b0 Fix: torch::kBFloat16
biondizzle pushed to main at biondizzle/nvfp4-megamoe-kernel 2026-05-14 10:20:01 +00:00
3f62e49e6e Fix PyTorch API: use c10::cuda and at::kBF16
biondizzle pushed to main at biondizzle/nvfp4-megamoe-kernel 2026-05-14 10:18:38 +00:00
2ee4e26772 Fix: remove compile-time SM100 guard from pytorch binding, use runtime check instead
biondizzle pushed to main at biondizzle/nvfp4-megamoe-kernel 2026-05-14 10:05:43 +00:00
540e68593f Add scale factor remap kernel: remap simple row-major SFs to CUTLASS interleaved layout
biondizzle pushed to main at biondizzle/nvfp4-megamoe-kernel 2026-05-14 09:50:55 +00:00
2998c889e7 Implement simple FP4 quantization for L1→L2 re-quant step (no vLLM fp4_utils dependency)
biondizzle pushed to main at biondizzle/nvfp4-megamoe-kernel 2026-05-14 09:40:18 +00:00
98913c9b1a Fix stage_activation: use Triton staging kernel from vLLM patch instead of fp4_utils
biondizzle pushed to main at biondizzle/nvfp4-megamoe-kernel 2026-05-14 09:26:04 +00:00
25cbc85afe Replace kernel.py with thin wrapper around pre-compiled _C extension
biondizzle pushed to main at biondizzle/nvfp4-megamoe-kernel 2026-05-13 23:28:04 +00:00
33e5d67326 Add CUTLASS_CHECK macro
biondizzle pushed to main at biondizzle/nvfp4-megamoe-kernel 2026-05-13 23:27:16 +00:00
b7c5cba407 Fix device_memory include path
biondizzle pushed to main at biondizzle/nvfp4-megamoe-kernel 2026-05-13 23:26:22 +00:00
3299d22ad6 Fix type casts and includes for CUTLASS NVFP4 GEMM