nvfp4-megamoe-kernel

Author	SHA1	Message	Date
biondizzle	23abfe9845	KV Cache: schema, allocator, pools, manager, append_swa kernel Complete KV cache substrate for DSV4 inference: schema.py: Per-layer cache shape derived from LayerSpec. - CSA: 32 entries/block, 32 indexer entries, tail=3 - HCA: 1 entry/block, no indexer, tail=127 - SWA: no classical pool, no tail - BLOCK_SIZE_ORIGINAL_TOKENS=128 (lcm of compression ratios) - compute_block_budget() for allocator sizing allocator.py: Fixed-size block free-list. - GPU stack with pinned host top pointer - acquire/release between graph captures only - OOM raises on exhaustion paged_cache.py: Per-layer classical KV storage. - FP8 (uint8) for non-RoPE dims, BF16 for RoPE dims (paper 2.3.4) - Per-entry inverse scale for FP8 dequant - FP4 indexer keys for CSA layers (NVFP4 scheme) - memory_bytes() tracking state_cache.py: Per-layer SWA window + tail buffer. - Ring buffer with position tracking (swa_head, swa_pos) - CSA: dual streams (ka/za/kb/zb) for overlapping compression - HCA: single stream (ka/za only) - SWA: no tail buffer - reset_slot() for request completion handle.py: LayerCacheHandle — typed per-call view. - write_swa(), read_swa_view(), read_classical_view(), read_indexer_view() - No GPU allocation in acquire() — 0 bytes delta (cudagraph safe) - SWAView/ClassicalView/IndexerView dataclasses for kernel signatures manager.py: KVCacheManager — owns everything. - Per-layer schema, pool, and allocator construction - admit_request()/release_request() lifecycle - allocate_block() for compression flush - acquire() returns LayerCacheHandle (zero-alloc) append_swa.cu: Native kernel for SWA writes. - One block per token, 128 threads per block - Warp-level amax reduction, BF16->FP8 E4M3 quantization - Atomic ring buffer head increment - FP8/BF16 split write + inv_scale + position metadata - FP8 round-trip: <3.6% relative error - RoPE half: exact match (no quantization) All tests pass on B200: - Schema correctness for CSA/HCA/SWA - Allocator acquire/release/OOM - Pool shapes match architecture spec - Manager lifecycle (admit/release/recycle/exhaustion) - Zero-alloc acquire() (cudagraph safe) - append_swa kernel: positions, RoPE exact, FP8 quality, wrap-around, multi-request isolation	2026-05-22 00:08:38 +00:00
biondizzle	39c1592d9c	Clean up: remove debug/temp files and dangling test kernels	2026-05-21 23:26:50 +00:00
biondizzle	b034c915d1	10-warp debug: MMA=warp4 TMA=warp5 idle=6-9 still gives cosine 0.29 Pipeline init uses __syncthreads (all 320 threads participate). Pipeline groups match 6-warp exactly. Only difference: threads_per_cta=320 vs 192. Direct comparison: 6-warp output [15,-129,-77.5,65,59] vs 10-warp output [-7.5,2.2,-22.7,7.3,12.0] for row 0. Completely different values. Something in CuTe DSL runtime uses blockDim.x or total CTA size in a way that breaks computation when CTA size changes from 192 to 320. The pipeline_init_wait calls agent_sync(ThreadBlock) = __syncthreads which all 320 threads reach. NamedBarriers use specific thread counts. TMA atoms are created from MMA thread layout, not CTA size. Hypothesis: the PipelineTmaUmma or PipelineUmmaAsync internally uses blockDim.x for barrier arithmetic, making the barriers expect more participants than the actual working threads.	2026-05-21 23:24:44 +00:00
biondizzle	0b8f4da323	Layer dispatch: config, schedule, attention/FFN sub-blocks, TransformerLayer DSV4Config: frozen dataclass with .flash() / .pro() classmethods. All architectural constants (dims, heads, MoE params, mHC) in one place. LayerSchedule: pure-data per-layer-index -> (attn_type, ffn_type, router_mode). Flash: SWA, SWA, CSA, HCA, CSA, HCA, ... (43 layers) Pro: HCA, HCA, CSA, HCA, CSA, HCA, ... (61 layers) Both: first 3 MoE layers = hash routing, rest = dense validate_schedule() enforces correctness at construction. AttentionSubBlock: CSA / HCA / SWA variants. - Low-rank Q projection (q_down -> q_up) - KV down-projection (varies by attn type: 4h/2h/1h) - CSA: indexer_q_up + indexer_head_weights - Grouped output projection (wo_a + wo_b) - Kernel calls are imports (NotImplementedError until kernel lands) - No PyTorch fallback paths FFNSubBlock: MoE + shared expert. - Router (hash/dense) mode from LayerSpec - Nvfp4MoE + Nvfp4SharedExpert TransformerLayer: composition of mHC + norm + attention + FFN. - Two mHC wrappers (attn + ffn sub-blocks) - Two RMSNorm (one per sub-block) - Pure orchestration, no learned params on the layer itself Tests: schedule construction + validation for both variants. No forward tests yet (depends on FMHA kernel + KV cache).	2026-05-21 23:11:09 +00:00
biondizzle	c681b591a0	10-warp idle test: no crash but cosine 0.29 (6-warp gives 0.999999) Adding 4 idle warps (4-7) to 320-thread CTA: - No crash, no deadlock (idle warps just pass) - But output is garbage: cosine 0.29 vs 0.999999 Same softmax+MMA code, same TMEM layout, same barriers. Only difference: mma_warp_id=8 (was 4), threads_per_cta=320 (was 192) and 4 idle warps 4-7. Something in the pipeline/barrier system assumes the old 6-warp topology. Need to identify which component uses threads_per_cta or warp_idx in a way that breaks with more warps.	2026-05-21 22:07:53 +00:00
biondizzle	fb243a4133	Router: full kernel stack — hash, topk, activation+topk, dense decode/prefill Step 1: Hash router (hash_router.cu) - One thread per token, gather from [vocab_size, k] LUT - Uniform 1/k weights, FP32 output - 3 MB LUT fits in L2 for repeated decode calls Step 2: topk_select.cu — general top-k primitive - Per-thread register min-heap (k=6, compile-time unrolled) - Shared memory merge: thread 0 merges 64 partial heaps - Tie-breaking: lower index wins on equal scores - Reusable by CSA indexer Step 3: activation_topk.cu — fused sqrt(softplus) + bias + topk + renorm - Single kernel: all 6 steps of the router math, no intermediate buffers - Numerically stable softplus: max(x,0) + log1p(exp(-\|x\|)) - Per-thread heap with unbiased activation co-stored - Shared memory merge → sort descending → renormalize → store Step 4: dense_router_decode.py — CuTeDSL fused GEMM kernel (skeleton) - BF16 GEMM with tcgen05.mma, FP32 accumulator - Custom epilogue: activation + bias + top-k (structure defined, needs TMA/MMA boilerplate) - Dispatch: N<=64 uses fused decode, N>64 uses prefill path Step 5: dense_router_prefill.py — prefill path - torch.nn.functional.linear for GEMM (DeepGEMM integration deferred) - Calls activation_topk for fused post-GEMM processing Step 6: Router class + ops/router.py + test_router.py - Router: construction-time mode (dense/hash), weight loading, custom_op dispatch - ops/router.py: torch.library.custom_op wrappers, integer-keyed registry - test_router.py: spec oracle tests (DO NOT RUN — Carmine is testing Stage C) Test strategy: each kernel tested against its mathematical spec in FP32. No reference implementation, no two debug streams. The oracle IS the math.	2026-05-21 21:54:05 +00:00
biondizzle	a4d12fd560	WIP: correction warp group architecture - compiles, illegal address at runtime 4 softmax warps (0-3), 4 correction warps (4-7), 1 MMA (8), 1 TMA (9). 320 threads total. Softmax: QK→softmax, write P, write row metadata to TMEM vector. Correction: read vector via QK partition, rescale O (C6), normalize O (C9). Compiles successfully but hits CUDA_ERROR_ILLEGAL_ADDRESS at runtime. Likely: vector TMEM offsets or correction TMEM access layout is wrong. Key files: - tests/unit/test_fmha_v3_correction.py (new correction architecture) - tests/unit/test_fmha_v3_softmax.py (working n=128, cosine 0.993)	2026-05-21 21:20:39 +00:00
biondizzle	bb3ad3d2ef	BREAKTHROUGH: cosine 0.993 for n=128! PV-partitioned P row sum works. C9 fix: instead of using QK-partitioned row_sum (which maps to wrong PV rows), read P from TMEM using PV partition and sum via .reduce(ADD). QK: thread N owns row N//4, PV: thread N owns row N. Reading P via PV partition gives each thread its correct row P values. n=128: cosine 0.993 (was 0.514) n=256: cosine 0.725 (C6 still broken for multi-tile) n=384: cosine 0.676 (same C6 issue) Remaining: C6 O-rescale for multi-tile needs same PV-partitioned fix. Small accuracy gap (0.993 vs 0.999) likely from BF16 P store/load round-trip.	2026-05-21 20:13:51 +00:00
biondizzle	7d1c402a6d	WIP: TMEM vector bridge not working (same cosine 0.513) row_sum is PROVEN correct (29.25 vs 29.22 for row 0, ratio 1.001). The ONLY bug is QK→PV row mapping in C9 normalization. Tried: composition(tStS,(128,1)) for write, composition(tOtO,(128,1)) for read. Same result — the composition preserves the fragments internal thread-to-address mapping, so the same thread writes and reads the same TMEM address regardless of which fragment layout is used for the composition. Need: absolute row-coordinate indexed TMEM vector. Each QK thread writes inv_row_sum to vec[QK_row_id], each PV thread reads from vec[PV_row_id]. The row_id comes from the identity tensor coordinate. Alternative: implement FMHA correction_epilog pattern with dedicated correction warp group that reads row metadata from the vector.	2026-05-21 19:26:15 +00:00
biondizzle	cae87fd744	WIP: confirmed row_sum is wrong (5.5 vs correct 29.22 for row 0) The packed f32x2 reduction SHOULD sum all 128 exp2 P values but gives a result ~5.3x too small. Need to debug inside the kernel with print statements to see what values the reduction is actually summing. Unnormalized P@V is perfect (cosine 0.999998). row_max is correct (because P is correct). The bug is specifically in row_sum computation.	2026-05-21 19:16:15 +00:00
biondizzle	c09c660110	WIP: scalar C9 normalization - confirmed inv_row_sum is wrong The C9 TMEM round-trip IS modifying O (confirmed by epilogue * 2.0 test). But inv_row_sum is wrong: each thread computes row_sum via .reduce(MAX) and packed f32x2 reduction, but the result appears to be the same for all threads. Next: need to dump the QK C-fragment coordinate tensor to understand which rows each thread actually owns in the TMEM load partition.	2026-05-21 19:09:32 +00:00
biondizzle	ce91aa26e4	WIP: QK-partitioned C9 normalization (does not work) The QK composition(tStS, (128,64)) view of O TMEM region does not align with the actual PV C-fragment layout. Cannot read O with QK partition. Need to use TMEM vector approach: 1. Store inv_row_sum via QK partition (composition(tStS, (128,1))) 2. Read inv_row_sum via PV partition (need PV-partitioned view of vector) 3. Apply normalization in PV-partitioned O TMEM access The key challenge: creating a PV-partitioned read of the vector TMEM region that was written with QK partition. This is what CUTLASS FMHA does with its correction warp group.	2026-05-21 18:59:21 +00:00
biondizzle	1fa093ee12	BREAKTHROUGH: unnormalized P@V cosine 0.999998 for n=128! The softmax math (exp2, P store, PV) is correct for single-tile. The bug is ONLY in C6/C9 normalization: applying inv_row_sum using PV partition instead of QK partition. n=128 (single tile): cosine 0.999998 PASS n=256/384 (multi-tile): C6 O-rescale using wrong partition = FAIL Fix: normalize O using QK row coordinates, not PV row coordinates. Can use TMEM vector to bridge QK partition to PV partition.	2026-05-21 18:55:00 +00:00
biondizzle	c2901b2ecc	WIP: TMEM vector for per-row row_sum (not yet working) Key finding: the root cause is that each epilogue thread owns MULTIPLE rows in the QK C-fragment, so scalar row_max/row_sum are wrong (global across all rows, not per-row). The V=ones diagnostic confirmed: all 128 threads use the same row_sum (from row 114). Tried: TMEM vector store+load of row_sum (composition(tStS, (128,2))). This is a no-op because both write and read use the SAME QK partition with a scalar row_sum. The vector approach only helps when different partitions are used for write vs read, or when per-row values are stored. Next steps: 1. Need PER-ROW row_max and row_sum, not per-thread scalar 2. The CUTLASS FMHA works because each thread owns exactly 1 row 3. Options: restructure thread layout, or compute per-row values differently 4. The vector must store ALL 128 per-row values, then read per-row in C9	2026-05-21 18:45:30 +00:00
biondizzle	4c203809ef	WIP: Stage C softmax - partial progress Key finding: cute.size(v, mode=[0]) in @cute.jit produces wrong code. Hardcoding s_k=128 (matching Stage B) fixes the base pipeline. Current status: kernel produces non-zero output but softmax math is still wrong. Applied fixes: pv_done_bar, acc_scale with scale, fastmath=True Need to debug row_sum computation and C9 normalization.	2026-05-21 18:04:21 +00:00
biondizzle	8e1facef01	Stage C fixes: pv_done_bar sync, acc_scale with scale, fastmath=True - Add pv_done_bar (barrier_id=4): MMA signals PV complete, epilogue waits before O rescale (C6) and final normalization (C9) - Fix acc_scale: exp2(scale * (old_max - new_max)) includes the scale_softmax_log2 factor matching CUTLASS FMHA reference - fastmath=True for both exp2 calls (P computation + rescale) - No *0.5 (our scalar row_sum pattern initializes (0,0) not (sum,sum))	2026-05-21 17:58:04 +00:00
biondizzle	58ca480fd1	Stage C: add validation harness with real softmax reference (C1)	2026-05-21 17:49:26 +00:00
biondizzle	9cbdc92744	Restructure: cutedsl/ -> dsv4/ with proper layering - Split bridge.py -> ops/quantize.py, ops/layouts.py, ops/gemm_runner.py - Renamed classes: CuTeDSLNvfp4Linear -> Nvfp4Linear, etc. - Moved kernel code to dsv4/kernels/ (gemm, attention, compressor, decode, cuda) - Moved PyTorch bridges to dsv4/ops/ - Moved nn.Module layers to dsv4layers/ - Moved reference implementations to dsv4/reference/ - Moved vendored CUTLASS code to vendored/ - Archived ~190 debug tests to tests/archive/ - Kept ~15 canonical tests in tests/unit/ - Updated all import paths - Added stubs for future components (model/, cache/, loader/) - Updated pyproject.toml: dsv4-inference package name	2026-05-21 17:30:44 +00:00
biondizzle	94b26ebdf0	Fix: add scale_softmax_log2, use O TMEM rescale for C9 normalization - scale_softmax_log2 was missing from _setup (patch artifact) - C9 normalization: load O from TMEM, multiply by 1/row_sum, store back instead of trying to capture runtime value in const_expr lambda - Then use standard epilogue_tma_store with identity transform	2026-05-21 17:15:15 +00:00
biondizzle	3f6f4a3ad8	Stage C: online softmax kernel (WIP) - test_fmha_v3_softmax.py - C1: Real softmax reference (torch.softmax, not identity) - C2: Per-thread row_max/row_sum registers - C3: QK scale folded (1/sqrt(d) * log2(e)) - C4: Row max via .reduce(MAX) - C5: Rescale factor (exp2(old_max - new_max)) - C6: O rescale in TMEM (correction_rescale pattern) - C7: Real exp2 for P computation - C8: Row sum via packed f32x2 reduction - C9: Final normalization (1/row_sum in epilogue) - Dynamic s_k for V FMHA reconstruction - fastmath=False for correctness first	2026-05-21 17:10:58 +00:00
biondizzle	2d10319e9f	Fix TMEM overlap in test_pv64_with_softmax.py too — cosine 0.999999 Same P/O overlap bug: O at col 64 overlapped P at [32,96). Same fixes: O at col 128, FMHA V reconstruction, power-of-2 TMEM alloc.	2026-05-21 15:32:49 +00:00
biondizzle	73f38acf74	STAGE B BUG 4b FIXED: TMEM P/O overlap + FMHA V reconstruction Root cause: PV output O started at TMEM column 64 (from find_tmem_tensor_col_offset), overlapping with P at columns [32,96). PV MMA reading P while writing O to overlapping columns corrupted the A operand mid-computation. For (128,128) PV, O started at 128 (no overlap) so it worked by accident. For (128,64) PV, O started at 64, overlapping P [32,96) -> NaN/garbage. Fix: Place O at column 128 (after both S [0,128) and P [32,96)). Also added FMHA-style V reconstruction: logical (HEAD_DIM, s_k, 1) stride (1, hd, hd*s_k) instead of passing DLPack V directly to TMA. test_fmha_v3.py: (128,64) PV with random V -> cosine 0.999999 PASS	2026-05-21 15:30:24 +00:00
biondizzle	dba4279364	Stage B Bug 4b debugging: P/A alias proven working, V layout issue for (128,64) PV Key findings: - P/A alias WORKS: PV reads non-zero P from TMEM at offset 32 (proven by no-softmax test) - V mode bug: V=(128,64) only loads 64 K-values, PV needs 128. Output = sum(S[:,:64]) = 0.67 cosine - FMHA-style V reconstruction (hd,n,1) stride (1,hd) gives NaN for (128,64) PV - K-major V (64,128) contiguous gives NaN for (128,64) PV - Square (128,128) PV works with ALL V approaches (cosine 0.999999) - Non-square PV consistently broken regardless of V layout Test files: - test_128_128_fmha_v.py: (128,128) with FMHA V - PASS - test_pv64_fmha_v.py: (128,64) with FMHA V - NaN - test_pv64_kmajor_v.py: (128,64) with K-major V - NaN - test_pv64_with_softmax.py: (128,64) with original V - 0.67 - test_pv64_no_softmax.py: proves P/A alias works - test_fmha_v3.py: full pipeline with QK C-fragment composition store	2026-05-21 15:20:14 +00:00
biondizzle	5b9d0280d9	FMHA v3: KV-tile interleaving pipeline - QK works, Bug 4b blocks PV	2026-05-21 12:52:29 +00:00
biondizzle	b0912fbf85	Stage B: PV(128,64) test + v2 pipeline fixes - test_pv64.py: (128,64) PV with separate V SMEM, single ab pipeline Result: cosine 0.669848 — data path works but P layout mismatch Softmax writes P via QK C-fragment layout, PV reads via PV A-fragment layout These differ for non-(128,128) PV — Bug 1 from README - test_fmha_v2_fixed.py: KV-tile interleaved pipeline with fixes Fix 1: per-pipeline tx_count (Q vs KV separate byte counts) Fix 2: NamedBarrier for softmax-done signal (replaces double-acquire deadlock) Fix 3: Separate SMEM for V (no recast_ptr overlap with K) Still produces zeros — needs P layout fix (same root cause as test_pv64)	2026-05-21 11:49:06 +00:00
biondizzle	579b2f6abe	stuff and stuff	2026-05-21 10:50:30 +00:00
biondizzle	04cbea072e	FMHA v1: pv_mma_tiler=(128,64,128) works with V=I, fails with real V (SMEM layout bug)	2026-05-21 10:47:46 +00:00
biondizzle	6c0a6cf50b	Root cause FOUND: V SMEM only holds 1 K-tile (2048 BF16), but PV MMA iterates 8 K-phases. For non-(128,128) PV, most K-phases read wrong V data. Zero-padded V works because V is (128,128) covering all 8 K-phases. FMHA interleaves QK+PV per KV-tile to avoid this.	2026-05-21 09:56:54 +00:00
biondizzle	80da5c51a6	Key finding: PV A-fragment layout is IDENTICAL for (128,128)/(128,32)/(128,16) PV. Bug is NOT TMEM alias. cta_tile_shape_mnk wrong for non-(128,128) PV. V SMEM and O C-fragment sizes look correct. Debugging V/epilogue paths.	2026-05-21 09:44:22 +00:00
biondizzle	464390fcfc	Update README: Bug 4 status, (128,16) PV zero output, (128,128) PV zero-pad workaround (cosine 1.0)	2026-05-21 09:20:09 +00:00
biondizzle	6a352cf6da	TMEM alias analysis: (128,16) PV broken, (128,128) PV with zero-pad works. Root cause: PV A-fragment layout differs from QK C-fragment layout for (128,16) PV, causing TMEM column mismatch. Using (128,128) PV as workaround.	2026-05-21 09:10:12 +00:00
biondizzle	73cb3a3277	Debugging TMEM alias for (128,16) PV: zero output confirmed, PV reads from wrong TMEM columns. Need to align softmax P write with PV A-fragment layout.	2026-05-21 09:00:42 +00:00
biondizzle	768a696373	Stage B N-tiling: (128,16) PV MMA compiles and runs, cosine 0.36 (TMEM alias mismatch bug). FMHA head_dim=64 passes. Debugging TMEM layout alignment.	2026-05-21 08:45:49 +00:00
biondizzle	f3a7dc1598	FOOTGUN #0 : num_tma_load_bytes MUST include V bytes. Fix v27, v29, comment all. Update README.	2026-05-21 07:13:14 +00:00
biondizzle	d873284e84	v29: FIX DEADLOCK - add V bytes to num_tma_load_bytes. V=I(128,128) cosine 1.0	2026-05-21 07:08:29 +00:00
biondizzle	6ec308dd50	v29 (padded V, deadlocks), v30 (diag copy, works) — debugging epilogue deadlock with (128,128) PV	2026-05-21 06:40:27 +00:00
biondizzle	3fdb9f008b	v28 attempt: PV MMA (128,64) - cosine 0.004, debugging	2026-05-21 05:41:44 +00:00
biondizzle	1f6fb856ea	more stuff	2026-05-21 05:08:57 +00:00
biondizzle	1314a67135	Stage B progress: PV works for square (128,128), broken for (128,64) - Bug 1 (V MN-major): Fix applied - Bug 2 (softmax packing): Confirmed correct (V=I test: cosine 1.0) - Bug 3 (ACCUMULATE): Fix applied (first PV must overwrite, not accumulate) - Bug 4 (CURRENT): PV MMA broken for non-square output - (128,128) PV with random V: cosine 0.999999 ✅ - (128,64) PV with MN-major V: cosine ~0.01 ❌ - Softmax packing, layout aliasing, pipeline ordering all verified correct - Root cause unknown — likely epilogue/V layout/MMA tiler issue Added test_pv_diag.py (V=I and random V, 128x128 output — PASS) Added test_layout_compare.py (TMEM layout inspection) Added test_inspect_types.py (TMEM pointer arithmetic verification) Updated test_mma_si_pv.py with head_dim param, pv_mma_tiler_mn fix, ACCUMULATE fix Updated READMEs with current state	2026-05-21 04:40:28 +00:00
biondizzle	2cbb0369fd	Stage B: pipeline deadlock fixed, V MN-major applied, PV output garbage Pipeline deadlock fixed: - No cta_layout_vmnk on mma_si PipelineUmmaAsync - TMA warp excluded from tmem.wait_for_alloc - PipelineTmaStore (not TmaStorePipeline) Bug 1 (V MN-major): fix applied - PV MMA uses v_major=OperandMajorMode.MN - V shaped (64,128) strides(1,64) via as_strided Bug 2 (softmax packing): C-fragment composition store applied - FP32 to BF16 packing works - St32x32bOp uses Float32 (not BFloat16) Bug 3 (PV garbage): investigating - PV MMA cosine ~0.01 against reference - Suspected TMEM layout mismatch between softmax P store and PV A-fragment read Test results: - test_mma_si_only: cosine 0.999999 PASS - test_mma_si_pv: cosine 0.01 FAIL (pipeline works, PV output wrong)	2026-05-21 04:10:07 +00:00
biondizzle	1eca10e39f	Stage B: C-fragment vs A-fragment TMEM layout mismatch diagnosed Key finding: C-fragment and A-fragment use different physical TMEM address mappings. St32x32bOp with C-fragment writes to C-layout addresses, but PV MMA reads from A-layout addresses. Forward FMHA recast validated FP16 only, not BF16. Working: FP32 ld/st roundtrip, BF16 elemwise, BF16 recast ld S0->st S1 (all cos 0.999999) Broken: C-frag st + A-frag read (NaN), A-frag store + PV MMA (cos -0.02) Next: Fix register data flow (128 FP16/thread load vs 64 BF16/thread store mismatch)	2026-05-21 00:12:47 +00:00
biondizzle	e678afcde0	Stage B: two MMAs + identity softmax — crash fixed, softmax output still wrong Key fixes: - PipelineUmmaAsync consumer group: 32*4=128 threads (not 4 warps) - TMEM offsets computed from find_tmem_tensor_col_offset (not hardcoded) - P fragment from p_tmem_s.outer + make_fragment_A (matching fmha.py) - V SMEM aliasing via recast_ptr Status: - Stage A: cosine 0.999999 ✅ - Stage B: runs without crash, identity softmax cosine -0.02 ❌ - Diagnostics: TMEM layout inspection, bisection results	2026-05-20 20:26:25 +00:00
biondizzle	dd7af0cd8a	feat: GPU-native SWA + sparse decode attention kernels (CuTeDSL) - native_swa_decode.py: BlackwellSWADecodeKernel - CTA mapping: 1 CTA per (decode_token, q_head_group) - Online softmax with KV tile streaming (16 tokens/tile) - Pre-dequantized bf16 KV (fp8 dequant on host - MLIR cvt_fpext requires 32-bit aligned vector, no scalar fp8->bf16 support) - Cosine 0.9999+ vs PyTorch batched SDPA reference - Fallback _fallback_batched_sdp when CuTeDSL unavailable - native_sparse_decode.py: BlackwellSparseDecodeKernel - Combined SWA + compressed KV in single attention pass - Supports CSA (cr=4) and HCA (cr=128) layers - Sink weight merge on host side - Cosine 0.9999+ vs combined SDPA reference - fp8_bf16.py: Documents MLIR limitation (cvt_fpext requires vector<4xf8>, no scalar support). Pre-dequant is the workaround. - vLLM wiring (attention.py): - SWA-only layers: native_swa_decode_attention - CSA/HCA layers: native_sparse_decode_attention with topk + attn_sink - csa_attention.py updated to use native kernels - Tests: test_decode_pipeline.py, test_sparse_decode.py both passing	2026-05-20 05:46:15 +00:00
biondizzle	d775d1075d	Fused SwiGLU epilogue with granularity-8 weight interleave - Fix interleave_l1_weights: remove //2 bug (g=granularity_bf16 for N-axis) - Apply L1 weight+SF interleave in runner._ensure_stacked() and moe_pipeline - De-interleave L1 GEMM output before gate/up split - Fused SwiGLU kernel: epi_tile=(128,8) for subtile-level pairing - Even subtiles = gate: SiLU in FP32 registers, save to register buffer - Odd subtiles = up: silu(gate)*up from buffer - Both branches produce same BF16 tensor type (CuTeDSL constraint) - run_nvfp4_moe_fused() pipeline: fused L1 + PyTorch L2 - Runner: fused_swiglu=True option for CuTeDSLMoERunner - Layertest: both fused and non-fused paths PASS (cosine 0.988) - README.md updated with current status and lessons learned	2026-05-20 04:13:52 +00:00
biondizzle	b1778eedf8	wip: Step 2 gate/up pairing — SiLU validated, runtime conditionals blocked by CuTeDSL SiLU in registers: PASS (0.034% error, Step 1 stable) Gate/up subtile detection: blocked by CuTeDSL type system CuTeDSL compiles the kernel for ALL subtile iterations at once. Runtime conditionals (if is_gate_subtile) that affect: - Register tensor assignment → DSLRuntimeError (type structure mismatch) - TMA store skipping → corrupted output - Mask blending → wrong results Path forward: use const_expr debug flag for the BF16 side output, or process gate/up in a separate post-GEMM kernel.	2026-05-20 03:26:20 +00:00
biondizzle	ed89e678be	wip: add run_fused_swiglu_grouped_gemm bridge + step1 test	2026-05-20 03:10:56 +00:00
biondizzle	9cdf79fd9c	wip: fused SwiGLU kernel scaffold + bridge interleave + plan - fused_swiglu_grouped_mm.py: copypaste of torch_scaled_grouped_mm.py with class rename and fused_swiglu/swiglu_limit params added - bridge.py: added interleave_l1_weights, deinterleave_l1_weights, warmup_fused_swiglu_compilation - Pure-PyTorch interleave invariant passes (A@cat vs deinterleave(A@interleave)) - Standalone GEMM interleave test fails due to kernel-internal N-tiling layout (expected, skipping per plan) - FUSED_EPILOGUE_PLAN.md updated with register layout, amax shuffle plan, 4-step implementation strategy	2026-05-20 03:04:38 +00:00
biondizzle	3c6b5a0522	chore: deprecate prepare_weights_from_dequantized and prepare_weights_direct Verified that our NVFP4 packing convention (odd<<4\|even, round-half-to-even) matches the DeepSeek-V4 checkpoint exactly: 100% byte-identical round-trip across all tested experts. The dequantize->requantize path is lossless in practice but wasteful. Marked both prepare_weights_from_dequantized and prepare_weights_direct as deprecated in favor of prepare_weights_from_stacked which loads checkpoint FP4 bytes directly via .view(). Also added test_fp4_roundtrip.py for future reference.	2026-05-20 02:11:40 +00:00
biondizzle	7070fadf72	Add full layer NaN test (attention + MoE, multi-layer chain)	2026-05-19 18:36:49 +00:00
biondizzle	152b0749df	Use 16 experts for MoE runner test (fits in memory)	2026-05-19 18:35:40 +00:00

1 2 3 4

177 Commits