biondizzle

biondizzle deleted branch main from biondizzle/nvfp4-megamoe-kernel

2026-05-14 12:46:38 +00:00

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-05-14 12:44:49 +00:00

d3f35c9465 cleanup: remove abandoned TileLang and Mojo files

802c4ee12c Revert stage_activation to simple quantize (staging kernel API incompatible with L1 output dims)

69e0174792 Fix stage_activation: use Triton staging kernel instead of broken simple quantize

c016e66e23 Add CUDA sync + NaN/Inf check after each expert GEMM in grouped kernel

1dfe5ffd05 Add comprehensive README documenting quirks, pitfalls, and setup

(5 of 24 commits shown)

biondizzle pushed to main at biondizzle/nvfp4-megamoe-kernel

2026-05-14 12:14:03 +00:00

802c4ee12c Revert stage_activation to simple quantize (staging kernel API incompatible with L1 output dims)

biondizzle pushed to main at biondizzle/nvfp4-megamoe-kernel

2026-05-14 12:01:35 +00:00

69e0174792 Fix stage_activation: use Triton staging kernel instead of broken simple quantize

biondizzle pushed to main at biondizzle/nvfp4-megamoe-kernel

2026-05-14 11:27:59 +00:00

c016e66e23 Add CUDA sync + NaN/Inf check after each expert GEMM in grouped kernel

biondizzle pushed to main at biondizzle/nvfp4-megamoe-kernel

2026-05-14 11:23:34 +00:00

1dfe5ffd05 Add comprehensive README documenting quirks, pitfalls, and setup

biondizzle pushed to main at biondizzle/nvfp4-megamoe-kernel

2026-05-14 10:50:28 +00:00

904fc37ad8 Fix: use idx2crd instead of get_coord for CuTe layout coordinate lookup

biondizzle pushed to main at biondizzle/nvfp4-megamoe-kernel

2026-05-14 10:48:58 +00:00

494d30b6ab Fix: use CuTe get_coord for proper scale factor remap to CUTLASS interleaved layout

biondizzle pushed to main at biondizzle/nvfp4-megamoe-kernel

2026-05-14 10:37:02 +00:00

869151d211 Fix kernel.py: remove broken expand on scale factors (was expanding sf to weight size)

biondizzle pushed to main at biondizzle/nvfp4-megamoe-kernel

2026-05-14 10:23:04 +00:00

84becfac93 Test: pass scales directly to CUTLASS (no remap) to diagnose layout issue

biondizzle pushed to main at biondizzle/nvfp4-megamoe-kernel

2026-05-14 10:21:11 +00:00

a272bc49b0 Fix: torch::kBFloat16

biondizzle pushed to main at biondizzle/nvfp4-megamoe-kernel

2026-05-14 10:20:01 +00:00

3f62e49e6e Fix PyTorch API: use c10::cuda and at::kBF16

biondizzle pushed to main at biondizzle/nvfp4-megamoe-kernel

2026-05-14 10:18:38 +00:00

2ee4e26772 Fix: remove compile-time SM100 guard from pytorch binding, use runtime check instead

biondizzle pushed to main at biondizzle/nvfp4-megamoe-kernel

2026-05-14 10:05:43 +00:00

540e68593f Add scale factor remap kernel: remap simple row-major SFs to CUTLASS interleaved layout

biondizzle pushed to main at biondizzle/nvfp4-megamoe-kernel

2026-05-14 09:50:55 +00:00

2998c889e7 Implement simple FP4 quantization for L1→L2 re-quant step (no vLLM fp4_utils dependency)

biondizzle pushed to main at biondizzle/nvfp4-megamoe-kernel

2026-05-14 09:40:18 +00:00

98913c9b1a Fix stage_activation: use Triton staging kernel from vLLM patch instead of fp4_utils

biondizzle pushed to main at biondizzle/nvfp4-megamoe-kernel

2026-05-14 09:26:04 +00:00

25cbc85afe Replace kernel.py with thin wrapper around pre-compiled _C extension

biondizzle pushed to main at biondizzle/nvfp4-megamoe-kernel

2026-05-13 23:28:04 +00:00

33e5d67326 Add CUTLASS_CHECK macro

biondizzle pushed to main at biondizzle/nvfp4-megamoe-kernel

2026-05-13 23:27:16 +00:00

b7c5cba407 Fix device_memory include path

biondizzle pushed to main at biondizzle/nvfp4-megamoe-kernel

2026-05-13 23:26:22 +00:00

3299d22ad6 Fix type casts and includes for CUTLASS NVFP4 GEMM