nvfp4-megamoe-kernel

Author	SHA1	Message	Date
biondizzle	5e09be08af	Fix non-contiguous tensor in quantize_nvfp4_gpu_fused (T>1 prefill) The intermediate tensor from fused SwiGLU deinterleave is a column slice (non-contiguous). When T>1, quantize_nvfp4_gpu_fused receives this and the CUDA kernel crashes with 'input must be contiguous'. Fix: add is_contiguous() check + .contiguous() in quantize_nvfp4_gpu_fused and in SharedExpert._run_l2. This is the root cause, not a workaround — CUDA kernels legitimately require contiguous memory.	2026-06-03 07:56:19 +00:00
biondizzle	0b6ca0df80	P5 integration + B3 q_a_norm fused + gsa scalar fix P5: Wire up fused mHC pre_block + RMSNorm + NVFP4 quantize kernel - Replaces: pre_block bmm + rmsnorm (4+ launches) + quantize (2 launches) - With: 2 kernel launches (mhc_rmsnorm_amax_gsa + mhc_rmsnorm_quantize_nvfp4) - Both attn and ffn mHC paths now use P5 fused kernel - Savings: ~5 launches/site × 2 sites × 61 layers = 610 launches/token B3: Fused rmsnorm+quant for q_a_norm → q_b path - q_a output → rmsnorm_quantize_nvfp4 → QuantizedActivation → q_b.run_from_quantized - Eliminates BF16 round-trip between q_a_norm and q_b GEMM - Saves: ~6 kernel launches per layer (rmsnorm 4+ + quantize 2 vs fused 2) gsa scalar fix in Nvfp4Linear.run_from_quantized: - CuTeDSL NVFP4 GEMM expects global_scale_a as per-expert scalar (shape (1,)) - Per-row gsa from fused kernels must be reduced to scalar (max) for M>1 - For M=1 decode: already scalar, no reduction needed - Fixes potential correctness issue at prefill (M>1) when using fused paths Cleanup: Remove --ab-compare flag and A/B comparison code (replaced by P5)	2026-06-02 21:20:34 +00:00
biondizzle	f3b551956d	Cleanup Step 2: Archive Lineage P code, fix broken imports - Move dead dsv4/ modules to dsv4/_archive/ (52 files) - model/{dsv4,mtp,layer,layer_schedule} - layers/{embedding,attention,ffn,norm} (kept linear,mhc,router,moe,shared_expert,grouped_linear - live) - cache/, kernels/cache/, kernels/indexer/{csa_indexer,score_topk,compute_valid_lens} - kernels/router/{nvfp4_fused_router,dense_router_decode_kernel,dense_router_prefill} - ops/{topk,topk_select,rope,router}, loader/{hf_checkpoint,layout_convert} - reference/{attention,compressor,csa_attention,moe_pipeline} - kernels/compressor/{compress_tail,csa_hca} - Restore dsv4/ops/{router,custom_ops}.py (needed by live layers) - Fix dsv4/kernels/{indexer,compressor,attention}/__init__.py (removed broken imports) - Remove preload_all() from loader.py (dead, referenced nonexistent .cu file) - Fix loader.py docstring (fused_amax_quantize_nvfp4 → quantize_nvfp4_from_buffer) - Move broken tests to tests/e2e_archive/ - test_fused_router, production_values_test, e2e/{one_layer,model_construction,csa_hca} - vLLM has 0 imports of dsv4 (Step 0 confirmed)	2026-06-02 19:27:07 +00:00
biondizzle	0d1cd1e216	P4: Add QuantizedActivation + Nvfp4Linear.run_from_quantized - QuantizedActivation: carries (x_fp4, x_sf, gsa) for skip-quantize path - Nvfp4Linear.run_from_quantized(): runs GEMM with pre-quantized input - Enables fused RMSNorm+quantize to feed directly into all downstream linears (q_a, kv, o_proj, etc.) without re-quantizing	2026-06-02 16:37:38 +00:00
biondizzle	6cb5078821	Fix mHC Sinkhorn kernel: remove VLA, remove Python fallback Root cause: float row_max[n] is a VLA, which is not allowed in CUDA device code. Fix: use shared memory with MHC_MAX_N=16 fixed-size slots. Also: REMOVED the Python fallback in sinkhorn_knopp(). If the CUDA kernel fails, the pipeline DIES. No soft landing. This is the correct behavior: silent fallback to broken precision is worse than a loud crash. The residual growth |X|→500-700 at L60 was likely caused by the Python fallback running a DIFFERENT numerical path (BF16 accumulation in torch ops vs FP32 in the CUDA kernel). With the fixed kernel, Sinkhorn should produce properly doubly-stochastic B_l, bounding the residual.	2026-06-02 10:44:53 +00:00
biondizzle	f01d3f3eac	wip: SE fused SwiGLU deinterleave fix	2026-06-02 08:41:00 +00:00
biondizzle	1726cb64a9	fix: interleave_l1_weights granularity_bf16 (not granularity) in SE	2026-06-02 08:29:03 +00:00
biondizzle	7904cf05c4	Add set_fused_swiglu() method to Nvfp4MoE	2026-06-02 07:59:57 +00:00
biondizzle	d8e17d70c1	P0+P1+P2: Enable fused SwiGLU (MoE+SE), fix SE _run_l1_fused, remove per-call gsa fill_ P0: Enable fused SwiGLU for MoE (set_fused_swiglu(True)) - Saves 240+ unfused BF16 kernel launches per token - SiLU + clamp in kernel registers instead of separate launches P1: Fix shared expert _run_l1_fused + enable fused SwiGLU - Fixed: _l1_sf_view -> _l1_scale_b, _l1_gs_view -> _l1_gsb - Fixed: expert_offsets dtype int64 -> int32 - Added proper padded buffer + scale assembly (matching unfused path) - Added runtime gsa support (quantize_nvfp4_gpu_fused) P2: Remove per-call gsa_buf.fill_() in Nvfp4Linear - fill_() was H2D transfer every forward pass (~5µs × 244 calls = ~1.2ms/token) - _gsa_buf now initialized with _activation_global_scale (not zeros) - After warmup_gsa, buffer already has correct value — no fill needed	2026-06-02 07:57:39 +00:00
biondizzle	61d5e7ba53	revert: P2 gsa fill elimination — revert to proven path for e2e stability The fill_() is a CPU→GPU scalar write (tiny cost). The optimization was marginal and the output quality regression (CJK tokens) needs investigation separately. P2 can re-land after the regression is confirmed to be sampling-related (not gsa-related). P0/P1 (fused SwiGLU) still disabled — kernel arg-binding bug unfixed.	2026-06-02 07:32:10 +00:00
biondizzle	790f8c350a	perf: P2 landed (gsa fill elimination). P0/P1 fused SwiGLU disabled — CuTeDSL kernel arg-binding bug. P0/P1: The fused SwiGLU kernel's warmup_fused_swiglu_compilation() triggers 'TypeError: too many positional arguments' during cute.compile(). The kernel signature doesn't match the positional args being passed. This is a kernel-side fix, not a single_shot fix. Disabled until the fused kernel is debugged. P2: Landed — Nvfp4Linear skips redundant _gsa_buf.fill_() after warmup. SE fused SwiGLU infrastructure (set_fused_swiglu, _run_l1_fused, interleaved weight path) is wired but disabled. Will activate once kernel fix lands.	2026-06-02 07:16:08 +00:00
biondizzle	040b2eb6e7	perf: P0/P1/P2 — fused SwiGLU for MoE+SE, eliminate per-call gsa fill P0: Enable fused SwiGLU for all MoE instances (moe._fused_swiglu = True). Eliminates ~8 BF16 kernel launches per MoE per token (gate/up split, SiLU, clamp, elementwise multiply → single fused kernel launch). P1: Enable fused SwiGLU for shared expert (SE): - Added set_fused_swiglu() method to Nvfp4SharedExpert - Added _run_l1_fused() using run_fused_swiglu_grouped_gemm (1-group) - Interleave L1 weights at finalize time for fused kernel compatibility - Fused kernel handles SwiGLU + clamp in registers, outputs BF16 P2: Eliminate per-call _gsa_buf.fill_() in Nvfp4Linear: - _activation_global_scale is set once at warmup, never changes after - Skip redundant fill_() via _gsa_buf_initialized flag - Saves 244 CPU→GPU scalar fills per token (4 linears × 61 layers) P3: Deferred (in-kernel RoPE fusion — kernel-side change, not single_shot)	2026-06-02 06:59:25 +00:00
biondizzle	7e3fb5f4d0	fix: add missing import for quantize_nvfp4_gpu in linear.py fixed-gsa path	2026-06-02 04:28:29 +00:00
biondizzle	668a42e71a	debug: print mhc_sinkhorn CUDA kernel compile errors	2026-06-02 04:02:34 +00:00
biondizzle	7b82d31330	perf: fused mHC Sinkhorn CUDA kernel (1 launch vs 38)	2026-06-02 03:50:57 +00:00
biondizzle	583ad6cfe6	P0 complete: Kill .item() in grouped_linear, reduce hot-path syncs - grouped_linear.py: Replace .item() gsa + Python quantize with quantize_nvfp4_gpu_fused (zero CPU syncs). Flatten all groups into (G*T, D), single fused kernel launch, GPU-only gsa copy. - single_shot_inference.py: Reduce torch.cuda.synchronize() to every 20 steps instead of every step. Gate per-layer diagnostics to li<3 or li>=58 (avoid 61 .item() calls per decode step).	2026-06-01 22:21:12 +00:00
biondizzle	c8faf20a99	P0 COMPLETE: Eliminate ALL .item() CPU-GPU syncs from NVFP4 activation path Fused kernels (zero CPU sync, single kernel launch per projection): - fused_amax_quantize.cu: amax→gsa→quantize in one pass. Replaces two-step compute_amax_gsa_gpu + quantize_nvfp4_gpu (had .item() sync). - fused_deinterleave_amax_quantize.cu: Same for MoE fused_swiglu L2 path. Deinterleave + amax + quantize in one pass. Replaces compute_amax_gsa_gpu + deinterleave_quantize_nvfp4_cuda (had .item() sync). All kernel loaders use dsv4/kernels/cuda/loader.py (compile-once cache). Was JIT-compiling on every call via torch.utils.cpp_extension.load (~100ms/call, ~500 calls/token). Now compiles once and reuses the cached module. Updated layers: - linear.py Nvfp4Linear._run_impl: fused kernel, gsa via GPU buffer - moe.py Nvfp4MoE._run_impl: fused for L1 and L2 (both fused_swiglu and non-fused paths) - shared_expert.py: fused for L1 and L2 - quantize.py: All functions use module loader cache - sampler.py: Uses module loader cache - indexer/score_topk.py: Uses module loader cache P2: Vectorized KVCache.append_swa — index_copy_ instead of Python loop. 2 kernel launches instead of 2T. No .item() in comp_pos either. P3: Pre-allocated comp_kv buffers — O(1) append instead of O(N) torch.cat. max_comp=32768 per layer (32MB). No more quadratic memory growth. ~486 .item() syncs per decoded token → ~0 (only argmax + token decode remain).	2026-06-01 21:05:03 +00:00
biondizzle	360f76b970	Performance audit fixes: eliminate CPU-GPU syncs PERFORMANCE_AUDIT.md validation results: 1. Nvfp4Linear .item() sync (610/step) → FIXED: compute_amax_gsa_gpu kernel 2. MoE .item() sync (183/step) → FIXED: same kernel 3. SharedExpert .item() sync (122/step) → FIXED: same kernel 4. FMHA V clone → FIXED: V=K, transpose creates copy implicitly 5. torch.cuda.synchronize in moe_forward → FIXED: conditional on VERBOSE 6. RoPE 8x duplication → INVALIDATED: necessary for per-GPU HBM access 7. mHC BF16 bmm → INVALIDATED: 28K FLOPs, not a bottleneck 8. Router .float() cast → INVALIDATED: needed for FP32 topk, ~1μs New files: - dsv4/kernels/cuda/amax_gsa.cu: GPU-only amax→gsa kernel - dsv4/ops/quantize.py: compute_amax_gsa_gpu() wrapper Net effect: ~915 fewer CPU-GPU syncs per decode step Remaining syncs: ~10 per layer (quantize kernel parameter) + diagnostics	2026-06-01 20:40:19 +00:00
biondizzle	16b72b9581	PERF: Eliminate double quantization for o_a_proj + NVFP4 lm_head 1. o_a_proj (Nvfp4GroupedLinear): Added load_nvfp4_weight() method that loads checkpoint NVFP4 weights directly — no more dequant→BF16→requant. Each group's weight is transposed from (N, K_packed) checkpoint layout to (K_packed, N) layout expected by the grouped GEMM. 2. lm_head: Quantize BF16 weight to NVFP4 at load time, use production Nvfp4Linear GEMM instead of F.linear. Runtime gsa for activation. Frees the 1.8GB BF16 weight after quantization. 3. Hash router (L0-2): Already optimal — tid2eid is an int32 lookup, no GEMM to accelerate.	2026-06-01 19:41:21 +00:00
biondizzle	038fe81c68	Fix MoE non-fused L2 runtime gsa + update test harness for extra args	2026-06-01 15:03:54 +00:00
biondizzle	2b1fca6dae	CRITICAL FIX: runtime activation global scale to prevent E4M3 overflow The checkpoint's input_scale was designed for training-time FP8 quantization, not NVFP4 activation quantization. Using it as gsa causes the block scales computed from x/gsa to exceed the E4M3 maximum (448), leading to systematic magnitude loss in every projection. This accumulates over 61 layers, compressing the logit range and producing garbage tokens. Fix: compute gsa at runtime from actual activation magnitude: gsa = max(|x|) / (6.0 * 448.0) This ensures |x|/gsa ≤ 2688 = 6.0 × 448.0 (the largest FP4 magnitude times the largest E4M3 block scale). Applied to: Nvfp4Linear, Nvfp4GroupedLinear, Nvfp4MoE, Nvfp4SharedExpert, Router gate	2026-06-01 14:21:16 +00:00
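The runtime scale rule in 2b1fca6dae can be sketched in plain Python. This is a minimal illustration, not the CUDA kernel: the function names, the sample activations, the `stale_gsa` value, and the all-zero guard are assumptions made for the example. Only the formula gsa = max(|x|) / (6.0 * 448.0) comes from the commit message.

```python
FP4_MAX = 6.0     # largest magnitude representable in FP4 (E2M1)
E4M3_MAX = 448.0  # largest finite magnitude representable in E4M3


def runtime_gsa(x):
    """Global activation scale computed from the actual activation magnitude."""
    amax = max(abs(v) for v in x)
    # Assumption: fall back to 1.0 for an all-zero input to avoid dividing by zero later.
    return amax / (FP4_MAX * E4M3_MAX) if amax > 0.0 else 1.0


def block_scale(block, gsa):
    """Per-block scale that would be stored in E4M3: block amax / gsa / FP4_MAX."""
    return max(abs(v) for v in block) / gsa / FP4_MAX


# Hypothetical activations with one large outlier.
activations = [0.25, -3.0, 17.5, -1200.0, 42.0, 0.0, 8.0, -0.5]

gsa = runtime_gsa(activations)
stale_gsa = 0.01  # hypothetical static scale that is too small for this input

runtime_block_scale = block_scale(activations, gsa)      # stays at the E4M3 limit
stale_block_scale = block_scale(activations, stale_gsa)  # far above the E4M3 limit
```

With the runtime scale, the largest block scale lands at the E4M3 limit instead of above it, so no block has to be clamped. With the too-small static scale, the same block would need a scale well beyond 448.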
biondizzle	7a05d3d3af	NVFP4 router gate: use Nvfp4Linear for both checkpoint and quantized paths - Checkpoint path: load NVFP4 gate weight directly into Nvfp4Linear - BF16 path: quantize and load into Nvfp4Linear - Both paths use proven production GEMM (no custom kernel) - load_nvfp4_fused_gate now creates Nvfp4Linear from BF16 weight	2026-06-01 11:25:50 +00:00
biondizzle	e5dbe1ed22	Switch router to Nvfp4Linear production GEMM (custom CuTeDSL kernel crashes MLIR) The custom fused router kernel crashes the CuTeDSL MLIR optimizer even with a simplified epilogue. Switch to the proven Nvfp4Linear path which uses the same NVFP4 Blackwell tensor-core GEMM, just with 2 kernel launches (GEMM + activation_topk) instead of 1. - Router's load_nvfp4_fused_gate now stores raw tensors for future use - single_shot_inference.py creates Nvfp4Linear from quantized gate weight - _run_dense_impl prioritizes gate_lin (NVFP4) over BF16 fallback	2026-06-01 11:17:54 +00:00
biondizzle	31ebe4f2db	Wire NVFP4 fused router kernel into e2e single-shot pipeline - Add dense_router_dispatch_nvfp4_fused() in dense_router_decode.py: single-kernel NVFP4 blockscaled GEMM + fused router epilogue - Router.load_nvfp4_fused_gate(): stores raw NVFP4 tensors for fused path - Router._run_dense_impl() dispatch priority: fused > 2-kernel > BF16 - single_shot_inference.py: loads raw NVFP4 gate weights for fused kernel instead of building Nvfp4Linear (which was the 2-kernel path) - Fix selection sort bug in nvfp4_fused_router_kernel.py: pass 0 was missing t_s/t_i/t_a temp save before swap, causing undefined vars - Export dense_router_dispatch_nvfp4_fused from __init__.py	2026-06-01 09:47:48 +00:00
biondizzle	cf2b7ab7ec	feat: NVFP4 gate projection for router (replaces BF16 cuBLAS) The dense router now uses NVFP4 GEMM via Nvfp4Linear for the gate projection when NVFP4 scales are available in the checkpoint. This replaces the BF16 cuBLAS GEMM with Blackwell SM100 tensor-core NVFP4 acceleration. Changes: - dsv4/layers/router.py: add gate_lin (Nvfp4Linear) alongside W_gate fallback. New load_nvfp4_gate() method. - dsv4/kernels/router/dense_router_decode.py: add dense_router_dispatch_nvfp4() using Nvfp4Linear + activation_topk - dsv4/kernels/router/__init__.py: export new function - single_shot_inference.py: load NVFP4 gate weights when available, fall back to BF16 when not	2026-06-01 05:58:56 +00:00
biondizzle	8ad617e2ff	diag: NaN detection in shared expert gate/up split	2026-06-01 04:01:46 +00:00
biondizzle	a53936a17c	diag: print l1_out shape warning in shared expert	2026-06-01 03:54:29 +00:00
biondizzle	e671780008	fix: transpose checkpoint weights before make_b_k_major in Nvfp4Linear/SharedExpert Critical bug: checkpoint weights are (N_packed, K_packed) N-major format, but make_b_k_major expects (E, K_packed, N_packed) input. Without the permute, the K and N dimensions are swapped, producing garbage output with wrong dimensions (e.g., q_a output was 3584 instead of 1536). Also fix scale assembly: checkpoint scales are (N, K_sf) which should use assemble_raw_scales_2d3d_3d_side (no transpose), not assemble_scales_3d_side (which incorrectly transposes K_sf↔N).	2026-06-01 00:30:37 +00:00
biondizzle	e8a7a9256f	fix: convert uint8 checkpoint weights to float4_e2m1fn_x2 for CuTeDSL GEMM The CuTeDSL kernel expects float4_e2m1fn_x2 dtype for FP4 weight tensors, but checkpoint weights from safetensors are loaded as uint8. The uint8 and float4_e2m1fn_x2 have the same byte representation, so .view() is safe. Fixed in: - Nvfp4Linear.finalize_weights() - Nvfp4SharedExpert.finalize_weights() - Nvfp4MoE._ensure_stacked() (both stacked and legacy paths)	2026-06-01 00:18:34 +00:00
biondizzle	172448514c	fix: fold weight_scale_2 into global_scale_b for NVFP4 GEMM Critical bug fix: weight_scale_2 (the second-level NVFP4 scale) was being dropped entirely in the production pipeline. The dequant formula is lut[w] * weight_scale * weight_scale_2, so weight_scale_2 must be folded into the GEMM's global_scale_b parameter. Fixes in: - Nvfp4Linear: ws2 field, folded in finalize_weights() - Nvfp4MoE: l1_ws2/l2_ws2 lists, folded in _ensure_stacked() - Nvfp4SharedExpert: l1_ws2/l2_ws2 lists, folded in finalize_weights() - single_shot_inference.py: pass weight_scale_2 through all loading paths - Also fix missing o_a_prod key fallback in attention output	2026-06-01 00:10:50 +00:00
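The effect of folding weight_scale_2 into global_scale_b (172448514c) can be checked with a small pure-Python model of the dequant formula lut[w] * weight_scale * weight_scale_2. This is a sketch under assumptions: `E2M1_LUT` is the standard FP4 E2M1 value table, the helper names and sample values are hypothetical, and a single scalar `weight_scale` stands in for the per-block scales used by the real GEMM.

```python
# FP4 (E2M1) code -> value table; the top bit of the 4-bit code is the sign.
E2M1_LUT = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0,
            -0.0, -0.5, -1.0, -1.5, -2.0, -3.0, -4.0, -6.0]


def reference_dot(x, codes, weight_scale, weight_scale_2):
    """Dot product against fully dequantized weights (both scale levels applied)."""
    return sum(xi * E2M1_LUT[c] * weight_scale * weight_scale_2
               for xi, c in zip(x, codes))


def gemm_dot(x, codes, weight_scale, global_scale_b):
    """Model of the GEMM: first-level scale inside, one global scale applied at the end."""
    acc = sum(xi * E2M1_LUT[c] * weight_scale for xi, c in zip(x, codes))
    return acc * global_scale_b


# Hypothetical inputs, chosen so every product is exact in binary floating point.
x = [1.0, 2.0, -1.0, 4.0]
codes = [1, 7, 10, 3]   # decode to 0.5, 6.0, -1.0, 1.5
weight_scale = 0.5
weight_scale_2 = 0.25

expected = reference_dot(x, codes, weight_scale, weight_scale_2)
folded = gemm_dot(x, codes, weight_scale, global_scale_b=weight_scale_2)
dropped = gemm_dot(x, codes, weight_scale, global_scale_b=1.0)
```

Here `folded` matches `expected`, while `dropped` is off by exactly the factor 1 / weight_scale_2, which is the magnitude error the commit describes.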
biondizzle	2a886fe0f2	Add --no-thinking mode to skip thinking tokens and use second-best	2026-05-31 19:24:21 +00:00
biondizzle	7b123d159f	CRITICAL FIX: mHC fn/base/scale ordering [pre,post,comb] + comb transposed + Sinkhorn softmax Bugs fixed (verified against HuggingFace DeepseekV4HyperConnection): 1. fn/base/scale ordering was [pre,comb,post], should be [pre,post,comb] - Was applying Sinkhorn to post values and 2*sigmoid to comb values - This caused residual to grow unbounded (no doubly-stochastic constraint) 2. comb (B_l) must be TRANSPOSED in post_block - HF: comb.transpose(-1,-2) @ hidden_streams - Was using B_l @ X_l without transpose 3. Sinkhorn must start from softmax(logits) + eps, not exp(logits) - HF: softmax → col norm → (iters-1) alternating - Was using exp → alternating (different convergence behavior) 4. Missing hc_eps on pre (A_l) - HF: sigmoid(...) + hc_eps - Was missing the eps guard 5. Renamed W_res→W_comb, S_res→S_comb, alpha_res→alpha_comb throughout - Matches checkpoint naming and HF model 6. Fixed fallback mHC initialization to use new API	2026-05-31 18:38:12 +00:00
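Item 3 of 7b123d159f (softmax + eps, then a column normalization, then iters-1 alternating passes) can be sketched as follows. This is one plausible reading of that description in plain Python, not the HuggingFace or CUDA implementation: the function names, the default `iters` and `eps`, the order of row and column passes inside the loop, and the sample logits are all assumptions.

```python
import math


def softmax_rows(logits):
    """Row-wise softmax with max subtraction for numerical stability."""
    out = []
    for row in logits:
        m = max(row)
        e = [math.exp(v - m) for v in row]
        s = sum(e)
        out.append([v / s for v in e])
    return out


def normalize_cols(mat, eps):
    n_rows, n_cols = len(mat), len(mat[0])
    col_sums = [sum(mat[i][j] for i in range(n_rows)) for j in range(n_cols)]
    return [[mat[i][j] / (col_sums[j] + eps) for j in range(n_cols)]
            for i in range(n_rows)]


def normalize_rows(mat, eps):
    return [[v / (sum(row) + eps) for v in row] for row in mat]


def sinkhorn(logits, iters=20, eps=1e-6):
    """Start from softmax(logits) + eps, normalize columns, then alternate row/column passes."""
    mat = [[v + eps for v in row] for row in softmax_rows(logits)]
    mat = normalize_cols(mat, eps)
    for _ in range(iters - 1):
        mat = normalize_rows(mat, eps)
        mat = normalize_cols(mat, eps)
    return mat


# Hypothetical 4x4 logits.
logits = [[0.1, 0.5, -0.3, 0.2],
          [1.0, -0.2, 0.3, 0.0],
          [-0.5, 0.4, 0.8, 0.1],
          [0.3, 0.0, -0.1, 0.6]]

B = sinkhorn(logits)
row_sums = [sum(row) for row in B]
col_sums = [sum(B[i][j] for i in range(4)) for j in range(4)]
```

After enough passes every row and column of `B` sums to approximately 1, which is the doubly-stochastic property the commit relies on to keep the residual bounded.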
biondizzle	daf84524ac	E2/E3: compressor bridge, indexer bridge, flush pipeline wiring - compress_tail.py: PyTorch reference CSA/HCA compression (token-level softmax over m/m' entries, paper eq. 11-12) - compressor/__init__.py: csa_compress_and_store, hca_compress_and_store bridges (compression deferred to flush pipeline) - indexer/__init__.py: compute_index_scores_topk bridge (NotImplemented) - Fixed attention.py: removed extra positions arg to write_swa	2026-05-30 21:16:54 +00:00
biondizzle	c2e3d15633	NVFP4-1.1 integration: GPU-only quantize kernel + MoE pipeline wiring - Add quantize_nvfp4.cu: BF16→FP4 GPU kernel (no CPU sync, warp shuffle amax) - Add quantize_nvfp4_gpu() bridge in ops/quantize.py - Fix deinterleave_quantize kernel path (dsv4/ops/kernels → dsv4/kernels/cuda) - Wire GPU quantize into Nvfp4MoE._run_impl(): - L1 input: quantize_nvfp4_gpu (replaces quantize_activation_nvfp4) - Fused SwiGLU L2: deinterleave_quantize_nvfp4_cuda (single kernel) - Non-fused L2: quantize_nvfp4_gpu - Add test_nvfp4_gpu_quantize.py for both kernels	2026-05-25 16:19:07 +00:00
biondizzle	4453d7475a	Fix layer construction: match existing API signatures, add RMSNorm impl - Nvfp4GroupedLinear: (n_local_groups, heads_per_group, head_dim, o_lora_rank) - mHCLayer: hidden_dim, t_max_sinkhorn (not hidden_size, sinkhorn_iters) - RMSNorm: PyTorch reference implementation (BF16, cudagraph-safe) - Verified: all 43 Flash + 61 Pro layers construct cleanly - All projection shapes validated against architecture spec	2026-05-21 23:31:58 +00:00
biondizzle	66a89859ed	Layer dispatch: config, schedule, attention/FFN sub-blocks, TransformerLayer DSV4Config: frozen dataclass with .flash() / .pro() classmethods. All architectural constants (dims, heads, MoE params, mHC) in one place. LayerSchedule: pure-data per-layer-index -> (attn_type, ffn_type, router_mode). Flash: SWA, SWA, CSA, HCA, CSA, HCA, ... (43 layers) Pro: HCA, HCA, CSA, HCA, CSA, HCA, ... (61 layers) Both: first 3 MoE layers = hash routing, rest = dense validate_schedule() enforces correctness at construction. AttentionSubBlock: CSA / HCA / SWA variants. - Low-rank Q projection (q_down -> q_up) - KV down-projection (varies by attn type: 4h/2h/1h) - CSA: indexer_q_up + indexer_head_weights - Grouped output projection (wo_a + wo_b) - Kernel calls are imports (NotImplementedError until kernel lands) - No PyTorch fallback paths FFNSubBlock: MoE + shared expert. - Router (hash/dense) mode from LayerSpec - Nvfp4MoE + Nvfp4SharedExpert TransformerLayer: composition of mHC + norm + attention + FFN. - Two mHC wrappers (attn + ffn sub-blocks) - Two RMSNorm (one per sub-block) - Pure orchestration, no learned params on the layer itself Tests: schedule construction + validation for both variants. No forward tests yet (depends on FMHA kernel + KV cache).	2026-05-21 23:11:09 +00:00
biondizzle	abfe4485f7	Router: full kernel stack (hash, topk, activation+topk, dense decode/prefill) Step 1: Hash router (hash_router.cu) - One thread per token, gather from [vocab_size, k] LUT - Uniform 1/k weights, FP32 output - 3 MB LUT fits in L2 for repeated decode calls Step 2: topk_select.cu (general top-k primitive) - Per-thread register min-heap (k=6, compile-time unrolled) - Shared memory merge: thread 0 merges 64 partial heaps - Tie-breaking: lower index wins on equal scores - Reusable by CSA indexer Step 3: activation_topk.cu (fused sqrt(softplus) + bias + topk + renorm) - Single kernel: all 6 steps of the router math, no intermediate buffers - Numerically stable softplus: max(x,0) + log1p(exp(-|x|)) - Per-thread heap with unbiased activation co-stored - Shared memory merge → sort descending → renormalize → store Step 4: dense_router_decode.py (CuTeDSL fused GEMM kernel, skeleton) - BF16 GEMM with tcgen05.mma, FP32 accumulator - Custom epilogue: activation + bias + top-k (structure defined, needs TMA/MMA boilerplate) - Dispatch: N<=64 uses fused decode, N>64 uses prefill path Step 5: dense_router_prefill.py (prefill path) - torch.nn.functional.linear for GEMM (DeepGEMM integration deferred) - Calls activation_topk for fused post-GEMM processing Step 6: Router class + ops/router.py + test_router.py - Router: construction-time mode (dense/hash), weight loading, custom_op dispatch - ops/router.py: torch.library.custom_op wrappers, integer-keyed registry - test_router.py: spec oracle tests (DO NOT RUN; Carmine is testing Stage C) Test strategy: each kernel tested against its mathematical spec in FP32. No reference implementation, no two debug streams. The oracle IS the math.	2026-05-21 21:54:05 +00:00
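The router math from Step 3 of abfe4485f7 (sqrt(softplus), bias, top-k, renormalize) can be written as a small FP-reference in plain Python. This is a sketch, not the CUDA kernel: the function names and sample inputs are hypothetical, and it assumes that selection uses the biased score while the returned weights are the renormalized unbiased activations, which is one reading of "unbiased activation co-stored".

```python
import math


def stable_softplus(x):
    """Numerically stable softplus: max(x, 0) + log1p(exp(-|x|))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def activation_topk(logits, bias, k):
    """Return (expert indices, weights) for one token."""
    act = [math.sqrt(stable_softplus(v)) for v in logits]
    biased = [a + b for a, b in zip(act, bias)]
    # Sort by biased score, descending; on equal scores the lower index wins.
    order = sorted(range(len(logits)), key=lambda i: (-biased[i], i))[:k]
    # Assumption: weights come from the unbiased activations, renormalized to sum to 1.
    total = sum(act[i] for i in order)
    weights = [act[i] / total for i in order]
    return order, weights


# Hypothetical logits with a tie between experts 0 and 1, and zero bias.
indices, weights = activation_topk([2.0, 2.0, -1.0, 0.5], [0.0, 0.0, 0.0, 0.0], k=2)

# A bias can change which experts are selected without changing how weights are computed.
biased_indices, biased_weights = activation_topk(
    [2.0, 2.0, -1.0, 0.5], [0.0, 0.0, 0.0, 5.0], k=2)
```

The stable form matters at the extremes: for large positive inputs the naive log(1 + exp(x)) overflows, while `stable_softplus` returns the input itself, and for large negative inputs it returns 0.0 without raising.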
biondizzle	3fb3c925af	Restructure: cutedsl/ -> dsv4/ with proper layering - Split bridge.py -> ops/quantize.py, ops/layouts.py, ops/gemm_runner.py - Renamed classes: CuTeDSLNvfp4Linear -> Nvfp4Linear, etc. - Moved kernel code to dsv4/kernels/ (gemm, attention, compressor, decode, cuda) - Moved PyTorch bridges to dsv4/ops/ - Moved nn.Module layers to dsv4/layers/ - Moved reference implementations to dsv4/reference/ - Moved vendored CUTLASS code to vendored/ - Archived ~190 debug tests to tests/archive/ - Kept ~15 canonical tests in tests/unit/ - Updated all import paths - Added stubs for future components (model/, cache/, loader/) - Updated pyproject.toml: dsv4-inference package name	2026-05-21 17:30:44 +00:00

38 Commits