nvfp4-megamoe-kernel

Files

biondizzle baf44c92f8 fix: memory-efficient E2M1 quantization — no 32x distance tensor

quantize_to_nvfp4 was allocating a (..., n_blocks, block_size, 8)
float32 tensor for nearest-neighbor distances to all 8 E2M1 values.
That's 32x the input size — 10.5GB for a typical batch, causing OOM
with only 3GB free.

New approach: clamp to [0, 6], scale to half-integer steps, round,
then map through a 13-byte lookup table to E2M1 indices.
Peak memory is now ~2x input (x_f32 + x_scaled) instead of 32x.

This makes activation quantization CUDA-graph-safe for the
memory-constrained DeepSeek-V4 on B200 (175GB model / 178GB GPU).
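
The clamp / scale / round / lookup steps described above can be illustrated with a scalar reference sketch. This is not the code in bridge.py: the names `E2M1_VALUES`, `HALF_STEP_TO_INDEX`, and `e2m1_index` are hypothetical. Taking the absolute value before clamping and sending in-between half-steps to the even-mantissa neighbor are also assumptions, since the commit message does not say how the sign or those table entries are handled. Per-block scaling is left out entirely.

```python
# Scalar reference sketch of the lookup-table idea (assumptions noted inline).

# The 8 representable E2M1 magnitudes, in index order (3-bit index 0..7).
E2M1_VALUES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)

# 13-entry table: half-step count k = round(2 * |x|), k in 0..12, mapped to an
# E2M1 index. Entries for k = 5, 7 and 10 sit exactly between two representable
# values; sending them to the even-mantissa neighbor is an ASSUMPTION here.
HALF_STEP_TO_INDEX = (0, 1, 2, 3, 4, 4, 5, 6, 6, 6, 6, 7, 7)


def e2m1_index(x: float) -> int:
    """Return the E2M1 magnitude index for x without building a distance table."""
    # ASSUMPTION: the sign is handled separately, so only the magnitude is used.
    mag = min(max(abs(x), 0.0), 6.0)  # clamp to [0, 6]
    k = round(mag * 2.0)              # scale to half-integer steps, then round
    return HALF_STEP_TO_INDEX[k]      # constant-size lookup


def quantize_values(values):
    """Map each value to its E2M1 index; one small integer per input element."""
    return [e2m1_index(v) for v in values]
```

In a tensorized version the same table would be indexed with an integer tensor, so the only temporaries are about the size of the input, rather than one distance per candidate value. One property of this sketch: because rounding happens on a half-step grid, an input such as 2.6 lands in the k = 5 bucket and maps to 2.0 even though 3.0 is closer. Whether the actual implementation behaves the same way near those midpoints is not stated in the commit.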

2026-05-16 07:49:38 +00:00

kernel

refactor: copy CuTeDSL kernel into repo with local imports

2026-05-16 02:57:54 +00:00

__init__.py

refactor: copy CuTeDSL kernel into repo with local imports

2026-05-16 02:57:54 +00:00

bridge.py

fix: memory-efficient E2M1 quantization — no 32x distance tensor

2026-05-16 07:49:38 +00:00

moe_pipeline.py

fix: same gate/up split fix in moe_pipeline.py

2026-05-16 04:04:53 +00:00