nvfp4-megamoe-kernel

Author	SHA1	Message	Date
biondizzle	f3b551956d	Cleanup Step 2: Archive Lineage P code, fix broken imports - Move dead dsv4/ modules to dsv4/_archive/ (52 files) - model/{dsv4,mtp,layer,layer_schedule} - layers/{embedding,attention,ffn,norm} (kept linear,mhc,router,moe,shared_expert,grouped_linear - live) - cache/, kernels/cache/, kernels/indexer/{csa_indexer,score_topk,compute_valid_lens} - kernels/router/{nvfp4_fused_router,dense_router_decode_kernel,dense_router_prefill} - ops/{topk,topk_select,rope,router}, loader/{hf_checkpoint,layout_convert} - reference/{attention,compressor,csa_attention,moe_pipeline} - kernels/compressor/{compress_tail,csa_hca} - Restore dsv4/ops/{router,custom_ops}.py (needed by live layers) - Fix dsv4/kernels/{indexer,compressor,attention}/__init__.py (removed broken imports) - Remove preload_all() from loader.py (dead, referenced nonexistent .cu file) - Fix loader.py docstring (fused_amax_quantize_nvfp4 → quantize_nvfp4_from_buffer) - Move broken tests to tests/e2e_archive/ - test_fused_router, production_values_test, e2e/{one_layer,model_construction,csa_hca} - vLLM has 0 imports of dsv4 (Step 0 confirmed)	2026-06-02 19:27:07 +00:00
biondizzle	c8faf20a99	P0 COMPLETE: Eliminate ALL .item() CPU-GPU syncs from NVFP4 activation path Fused kernels (zero CPU sync, single kernel launch per projection): - fused_amax_quantize.cu: amax→gsa→quantize in one pass. Replaces two-step compute_amax_gsa_gpu + quantize_nvfp4_gpu (had .item() sync). - fused_deinterleave_amax_quantize.cu: Same for MoE fused_swiglu L2 path. Deinterleave + amax + quantize in one pass. Replaces compute_amax_gsa_gpu + deinterleave_quantize_nvfp4_cuda (had .item() sync). All kernel loaders use dsv4/kernels/cuda/loader.py (compile-once cache). Was JIT-compiling on every call via torch.utils.cpp_extension.load (~100ms/call, ~500 calls/token). Now compiles once and reuses the cached module. Updated layers: - linear.py Nvfp4Linear._run_impl: fused kernel, gsa via GPU buffer - moe.py Nvfp4MoE._run_impl: fused for L1 and L2 (both fused_swiglu and non-fused paths) - shared_expert.py: fused for L1 and L2 - quantize.py: All functions use module loader cache - sampler.py: Uses module loader cache - indexer/score_topk.py: Uses module loader cache P2: Vectorized KVCache.append_swa — index_copy_ instead of Python loop. 2 kernel launches instead of 2T. No .item() in comp_pos either. P3: Pre-allocated comp_kv buffers — O(1) append instead of O(N) torch.cat. max_comp=32768 per layer (32MB). No more quadratic memory growth. ~486 .item() syncs per decoded token → ~0 (only argmax + token decode remain).	2026-06-01 21:05:03 +00:00
biondizzle	4f698baa5d	Production fused CUDA sampler + decode loop optimizations - Add dsv4/kernels/cuda/sampler.cu: fused temperature + repetition penalty + top-k + top-p (nucleus) sampling, single kernel launch, zero CPU syncs - Add dsv4/model/sampler.py: CUDASampler wrapper + PyTorch reference - Update single_shot_inference.py: - Use CUDASampler for non-greedy decoding (temperature=0.6, top_k=50, top_p=0.95) - Pre-allocate decode buffers (no per-step torch.tensor allocation) - Track thinking tokens (128821/128822) — not garbage for reasoning model - Reduce diagnostic CPU syncs (top-5 every 5 steps, NaN check every 20) - Add --top-k and --top-p CLI args - Default: temperature=0.6 (was 0.0 greedy), rep_penalty=1.1 (was 1.2)	2026-06-01 20:29:57 +00:00
biondizzle	7d9e70c5d5	Fix remaining mHC API references: layer_compare.py, layer.py comment	2026-05-31 18:38:34 +00:00
biondizzle	d3b772196d	E3: Implement DSV4Model — full model class - Token embedding → N×TransformerLayer → RMSNorm → lm_head - decode_step: single token decode with mHC state management - forward: prefill path (T tokens) - Cache handle acquisition per layer - mHC state initialization from embedding - Weight loading TODO (deferred to loader/)	2026-05-30 21:15:57 +00:00
biondizzle	4453d7475a	Fix layer construction: match existing API signatures, add RMSNorm impl - Nvfp4GroupedLinear: (n_local_groups, heads_per_group, head_dim, o_lora_rank) - mHCLayer: hidden_dim, t_max_sinkhorn (not hidden_size, sinkhorn_iters) - RMSNorm: PyTorch reference implementation (BF16, cudagraph-safe) - Verified: all 43 Flash + 61 Pro layers construct cleanly - All projection shapes validated against architecture spec	2026-05-21 23:31:58 +00:00
biondizzle	66a89859ed	Layer dispatch: config, schedule, attention/FFN sub-blocks, TransformerLayer DSV4Config: frozen dataclass with .flash() / .pro() classmethods. All architectural constants (dims, heads, MoE params, mHC) in one place. LayerSchedule: pure-data per-layer-index -> (attn_type, ffn_type, router_mode). Flash: SWA, SWA, CSA, HCA, CSA, HCA, ... (43 layers) Pro: HCA, HCA, CSA, HCA, CSA, HCA, ... (61 layers) Both: first 3 MoE layers = hash routing, rest = dense validate_schedule() enforces correctness at construction. AttentionSubBlock: CSA / HCA / SWA variants. - Low-rank Q projection (q_down -> q_up) - KV down-projection (varies by attn type: 4h/2h/1h) - CSA: indexer_q_up + indexer_head_weights - Grouped output projection (wo_a + wo_b) - Kernel calls are imports (NotImplementedError until kernel lands) - No PyTorch fallback paths FFNSubBlock: MoE + shared expert. - Router (hash/dense) mode from LayerSpec - Nvfp4MoE + Nvfp4SharedExpert TransformerLayer: composition of mHC + norm + attention + FFN. - Two mHC wrappers (attn + ffn sub-blocks) - Two RMSNorm (one per sub-block) - Pure orchestration, no learned params on the layer itself Tests: schedule construction + validation for both variants. No forward tests yet (depends on FMHA kernel + KV cache).	2026-05-21 23:11:09 +00:00
biondizzle	3fb3c925af	Restructure: cutedsl/ -> dsv4/ with proper layering - Split bridge.py -> ops/quantize.py, ops/layouts.py, ops/gemm_runner.py - Renamed classes: CuTeDSLNvfp4Linear -> Nvfp4Linear, etc. - Moved kernel code to dsv4/kernels/ (gemm, attention, compressor, decode, cuda) - Moved PyTorch bridges to dsv4/ops/ - Moved nn.Module layers to dsv4layers/ - Moved reference implementations to dsv4/reference/ - Moved vendored CUTLASS code to vendored/ - Archived ~190 debug tests to tests/archive/ - Kept ~15 canonical tests in tests/unit/ - Updated all import paths - Added stubs for future components (model/, cache/, loader/) - Updated pyproject.toml: dsv4-inference package name	2026-05-21 17:30:44 +00:00

8 Commits