f3b551956d
Cleanup Step 2: Archive Lineage P code, fix broken imports
...
- Move dead dsv4/ modules to dsv4/_archive/ (52 files)
  - model/{dsv4,mtp,layer,layer_schedule}
  - layers/{embedding,attention,ffn,norm} (kept linear,mhc,router,moe,shared_expert,grouped_linear - live)
  - cache/*, kernels/cache/*, kernels/indexer/{csa_indexer,score_topk,compute_valid_lens}
  - kernels/router/{nvfp4_fused_router,dense_router_decode_kernel,dense_router_prefill}
  - ops/{topk,topk_select,rope,router}, loader/{hf_checkpoint,layout_convert}
  - reference/{attention,compressor,csa_attention,moe_pipeline}
  - kernels/compressor/{compress_tail,csa_hca}
- Restore dsv4/ops/{router,custom_ops}.py (needed by live layers)
- Fix dsv4/kernels/{indexer,compressor,attention}/__init__.py (removed broken imports)
- Remove preload_all() from loader.py (dead, referenced nonexistent .cu file)
- Fix loader.py docstring (fused_amax_quantize_nvfp4 → quantize_nvfp4_from_buffer)
- Move broken tests to tests/e2e_archive/
  - test_fused_router, production_values_test, e2e/{one_layer,model_construction,csa_hca}
- vLLM has 0 imports of dsv4 (Step 0 confirmed)
2026-06-02 19:27:07 +00:00
c8faf20a99
P0 COMPLETE: Eliminate ALL .item() CPU-GPU syncs from NVFP4 activation path
...
Fused kernels (zero CPU sync, single kernel launch per projection):
- fused_amax_quantize.cu: amax→gsa→quantize in one pass. Replaces the two-step
  compute_amax_gsa_gpu + quantize_nvfp4_gpu path (which had an .item() sync).
- fused_deinterleave_amax_quantize.cu: the same for the MoE fused_swiglu L2 path.
  Deinterleave + amax + quantize in one pass. Replaces compute_amax_gsa_gpu
  + deinterleave_quantize_nvfp4_cuda (which had an .item() sync).
All kernel loaders use dsv4/kernels/cuda/loader.py (compile-once cache).
Previously each call re-invoked the JIT build via torch.utils.cpp_extension.load
(~100ms/call, ~500 calls/token). Now each kernel compiles once and the cached module is reused.
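A minimal sketch of the compile-once idea, in plain Python so it runs without a GPU toolchain. The names get_module and build are hypothetical and are not the API of dsv4/kernels/cuda/loader.py; build stands in for whatever expensive compile step the real loader performs.

```python
import threading
from typing import Any, Callable, Dict

# Process-wide cache: kernel name -> already-built module object.
_MODULES: Dict[str, Any] = {}
_LOCK = threading.Lock()


def get_module(name: str, build: Callable[[], Any]) -> Any:
    """Return the cached module for `name`, calling `build` on first use only.

    `build` is a hypothetical stand-in for the expensive compile step.
    """
    if name not in _MODULES:
        with _LOCK:
            # Re-check under the lock so two threads never build twice.
            if name not in _MODULES:
                _MODULES[name] = build()
    return _MODULES[name]


# Usage: the second lookup reuses the module built by the first.
build_calls = []


def fake_build() -> object:
    build_calls.append(1)  # record how many times we "compiled"
    return object()


first = get_module("fused_amax_quantize", fake_build)
second = get_module("fused_amax_quantize", fake_build)
```

With a cache of this shape, the per-call cost after the first lookup is a dictionary hit rather than a build check.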
Updated layers:
- linear.py Nvfp4Linear._run_impl: fused kernel, gsa via GPU buffer
- moe.py Nvfp4MoE._run_impl: fused for L1 and L2 (both fused_swiglu and
  non-fused paths)
- shared_expert.py: fused for L1 and L2
- quantize.py: All functions use module loader cache
- sampler.py: Uses module loader cache
- indexer/score_topk.py: Uses module loader cache
P2: Vectorized KVCache.append_swa — index_copy_ instead of Python loop.
2 kernel launches instead of 2T. No .item() in comp_pos either.
P3: Pre-allocated comp_kv buffers — O(1) append instead of O(N) torch.cat.
max_comp=32768 per layer (32MB). No more quadratic memory growth.
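The P3 change can be illustrated with a small stand-in that uses Python's array module instead of GPU tensors. This is a sketch under assumptions: the class name PreallocatedRows and its methods are hypothetical, and the real comp_kv buffers are not shown in this log. The point is only that storage is allocated once and each append writes into the next free slot rather than reallocating and copying everything, as a concatenation would.

```python
from array import array
from typing import List, Sequence


class PreallocatedRows:
    """Fixed-capacity row buffer: O(1) append, no reallocation (hypothetical)."""

    def __init__(self, max_rows: int, width: int) -> None:
        self.max_rows = max_rows
        self.width = width
        # Allocate all storage up front, zero-filled.
        self._store = array("f", [0.0]) * (max_rows * width)
        self.rows = 0  # number of valid rows written so far

    def append(self, row: Sequence[float]) -> None:
        if len(row) != self.width:
            raise ValueError("row has the wrong width")
        if self.rows >= self.max_rows:
            raise OverflowError("buffer is full")
        start = self.rows * self.width
        # Same-length slice assignment overwrites in place.
        self._store[start:start + self.width] = array("f", row)
        self.rows += 1

    def valid(self) -> List[List[float]]:
        """Copy out only the rows written so far (for inspection)."""
        w = self.width
        return [self._store[i * w:(i + 1) * w].tolist() for i in range(self.rows)]


# Usage with a tiny capacity; the commit uses max_comp=32768 per layer.
buf = PreallocatedRows(max_rows=2, width=2)
buf.append([1.0, 2.0])
buf.append([3.0, 4.0])
```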
~486 .item() syncs per decoded token → ~0 (only argmax + token decode remain).
2026-06-01 21:05:03 +00:00
4f698baa5d
Production fused CUDA sampler + decode loop optimizations
...
- Add dsv4/kernels/cuda/sampler.cu: fused temperature + repetition penalty
  + top-k + top-p (nucleus) sampling, single kernel launch, zero CPU syncs
- Add dsv4/model/sampler.py: CUDASampler wrapper + PyTorch reference
- Update single_shot_inference.py:
  - Use CUDASampler for non-greedy decoding (temperature=0.6, top_k=50, top_p=0.95)
  - Pre-allocate decode buffers (no per-step torch.tensor allocation)
  - Track thinking tokens (128821/128822): these are not garbage for a reasoning model
  - Reduce diagnostic CPU syncs (top-5 every 5 steps, NaN check every 20 steps)
  - Add --top-k and --top-p CLI args
  - Default: temperature=0.6 (was 0.0 greedy), rep_penalty=1.1 (was 1.2)
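For readers unfamiliar with the sampling chain, here is a pure-Python reference sketch of temperature + repetition penalty + top-k + top-p. It is not the code in sampler.cu or sampler.py. The function name sample_token, the order of the steps, and the penalty convention (divide positive logits, multiply negative ones) are assumptions; only the default values come from the commit.

```python
import math
import random
from typing import List, Optional, Sequence


def sample_token(
    logits: Sequence[float],
    prev_tokens: Sequence[int],
    temperature: float = 0.6,
    rep_penalty: float = 1.1,
    top_k: int = 50,
    top_p: float = 0.95,
    rng: Optional[random.Random] = None,
) -> int:
    rng = rng or random.Random()
    scores: List[float] = list(logits)

    # Repetition penalty (assumed convention): make seen tokens less likely.
    for t in set(prev_tokens):
        scores[t] = scores[t] / rep_penalty if scores[t] > 0 else scores[t] * rep_penalty

    # temperature <= 0 is treated as greedy decoding.
    if temperature <= 0.0:
        return max(range(len(scores)), key=scores.__getitem__)
    scores = [s / temperature for s in scores]

    # Top-k: keep the k highest-scoring token ids, best first.
    order = sorted(range(len(scores)), key=scores.__getitem__, reverse=True)
    if top_k > 0:
        order = order[:top_k]

    # Numerically stable softmax over the surviving candidates.
    peak = scores[order[0]]
    weights = [math.exp(scores[i] - peak) for i in order]
    total = sum(weights)

    # Top-p: keep the smallest prefix whose cumulative probability >= top_p.
    kept_ids: List[int] = []
    kept_probs: List[float] = []
    cumulative = 0.0
    for i, w in zip(order, weights):
        p = w / total
        kept_ids.append(i)
        kept_probs.append(p)
        cumulative += p
        if cumulative >= top_p:
            break

    # random.choices renormalizes the kept weights internally.
    return rng.choices(kept_ids, weights=kept_probs, k=1)[0]
```

A fused kernel would perform these steps in one launch on the GPU; the sketch only fixes the semantics one might compare against.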
2026-06-01 20:29:57 +00:00
7d9e70c5d5
Fix remaining mHC API references: layer_compare.py, layer.py comment
2026-05-31 18:38:34 +00:00
d3b772196d
E3: Implement DSV4Model — full model class
...
- Token embedding → N×TransformerLayer → RMSNorm → lm_head
- decode_step: single token decode with mHC state management
- forward: prefill path (T tokens)
- Cache handle acquisition per layer
- mHC state initialization from embedding
- Weight loading TODO (deferred to loader/)
2026-05-30 21:15:57 +00:00
4453d7475a
Fix layer construction: match existing API signatures, add RMSNorm impl
...
- Nvfp4GroupedLinear: (n_local_groups, heads_per_group, head_dim, o_lora_rank)
- mHCLayer: hidden_dim, t_max_sinkhorn (not hidden_size, sinkhorn_iters)
- RMSNorm: PyTorch reference implementation (BF16, cudagraph-safe)
- Verified: all 43 Flash + 61 Pro layers construct cleanly
- All projection shapes validated against architecture spec
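As a reminder of what an RMSNorm reference computes, a minimal sketch in plain Python floats follows. The commit's implementation is a PyTorch module in BF16; this version, the function name rms_norm, and the eps default are illustrative assumptions only.

```python
import math
from typing import List, Sequence


def rms_norm(x: Sequence[float], weight: Sequence[float], eps: float = 1e-6) -> List[float]:
    """Scale x by the reciprocal of its root-mean-square, then by a learned weight."""
    if len(x) != len(weight):
        raise ValueError("x and weight must have the same length")
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [v * inv_rms * w for v, w in zip(x, weight)]


# Usage: with unit weights and eps=0 the output has mean square 1.
y = rms_norm([3.0, 4.0], [1.0, 1.0], eps=0.0)
```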
2026-05-21 23:31:58 +00:00
66a89859ed
Layer dispatch: config, schedule, attention/FFN sub-blocks, TransformerLayer
...
DSV4Config: frozen dataclass with .flash() / .pro() classmethods.
All architectural constants (dims, heads, MoE params, mHC) in one place.
LayerSchedule: pure-data per-layer-index -> (attn_type, ffn_type, router_mode).
Flash: SWA, SWA, CSA, HCA, CSA, HCA, ... (43 layers)
Pro: HCA, HCA, CSA, HCA, CSA, HCA, ... (61 layers)
Both: first 3 MoE layers = hash routing, rest = dense
validate_schedule() enforces correctness at construction.
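The schedule described above could be generated along these lines. This is a sketch under stated assumptions: after the first two layers the pattern is assumed to keep alternating CSA/HCA, every layer is assumed to be an MoE layer (so "first 3 MoE layers" means layers 0-2), and the names LayerSpecSketch, build_schedule_sketch and check_schedule_sketch are hypothetical rather than the real LayerSchedule / validate_schedule() API.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class LayerSpecSketch:
    index: int
    attn_type: str    # "SWA", "CSA" or "HCA"
    ffn_type: str     # assumed to be "moe" for every layer
    router_mode: str  # "hash" or "dense"


def build_schedule_sketch(variant: str) -> List[LayerSpecSketch]:
    if variant == "flash":
        head, n_layers = ["SWA", "SWA"], 43
    elif variant == "pro":
        head, n_layers = ["HCA", "HCA"], 61
    else:
        raise ValueError(f"unknown variant: {variant}")
    specs = []
    for i in range(n_layers):
        if i < len(head):
            attn = head[i]
        else:
            # Assumed continuation: CSA on even indices, HCA on odd ones.
            attn = "CSA" if i % 2 == 0 else "HCA"
        router = "hash" if i < 3 else "dense"
        specs.append(LayerSpecSketch(i, attn, "moe", router))
    return specs


def check_schedule_sketch(specs: List[LayerSpecSketch], n_layers: int) -> None:
    """Hypothetical checks in the spirit of validating at construction."""
    if len(specs) != n_layers:
        raise ValueError("wrong number of layers")
    for i, spec in enumerate(specs):
        if spec.index != i:
            raise ValueError("layer indices must be contiguous from 0")
        if spec.attn_type not in ("SWA", "CSA", "HCA"):
            raise ValueError("unknown attention type")
        if spec.router_mode != ("hash" if i < 3 else "dense"):
            raise ValueError("unexpected router mode")


flash = build_schedule_sketch("flash")
pro = build_schedule_sketch("pro")
check_schedule_sketch(flash, 43)
check_schedule_sketch(pro, 61)
```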
AttentionSubBlock: CSA / HCA / SWA variants.
- Low-rank Q projection (q_down -> q_up)
- KV down-projection (varies by attn type: 4h/2h/1h)
- CSA: indexer_q_up + indexer_head_weights
- Grouped output projection (wo_a + wo_b)
- Kernel calls are imports (NotImplementedError until kernel lands)
- No PyTorch fallback paths
FFNSubBlock: MoE + shared expert.
- Router (hash/dense) mode from LayerSpec
- Nvfp4MoE + Nvfp4SharedExpert
TransformerLayer: composition of mHC + norm + attention + FFN.
- Two mHC wrappers (attn + ffn sub-blocks)
- Two RMSNorm (one per sub-block)
- Pure orchestration, no learned params on the layer itself
Tests: schedule construction + validation for both variants.
No forward tests yet (depends on FMHA kernel + KV cache).
2026-05-21 23:11:09 +00:00
3fb3c925af
Restructure: cutedsl/ -> dsv4/ with proper layering
...
- Split bridge.py -> ops/quantize.py, ops/layouts.py, ops/gemm_runner.py
- Renamed classes: CuTeDSLNvfp4Linear -> Nvfp4Linear, etc.
- Moved kernel code to dsv4/kernels/ (gemm, attention, compressor, decode, cuda)
- Moved PyTorch bridges to dsv4/ops/
- Moved nn.Module layers to dsv4/layers/
- Moved reference implementations to dsv4/reference/
- Moved vendored CUTLASS code to vendored/
- Archived ~190 debug tests to tests/archive/
- Kept ~15 canonical tests in tests/unit/
- Updated all import paths
- Added stubs for future components (model/, cache/, loader/)
- Updated pyproject.toml: dsv4-inference package name
2026-05-21 17:30:44 +00:00