nvfp4-megamoe-kernel

Author	SHA1	Message	Date
biondizzle	b4d58df620	KV Cache: schema, allocator, pools, manager, append_swa kernel Complete KV cache substrate for DSV4 inference: schema.py: Per-layer cache shape derived from LayerSpec. - CSA: 32 entries/block, 32 indexer entries, tail=3 - HCA: 1 entry/block, no indexer, tail=127 - SWA: no classical pool, no tail - BLOCK_SIZE_ORIGINAL_TOKENS=128 (lcm of compression ratios) - compute_block_budget() for allocator sizing allocator.py: Fixed-size block free-list. - GPU stack with pinned host top pointer - acquire/release between graph captures only - OOM raises on exhaustion paged_cache.py: Per-layer classical KV storage. - FP8 (uint8) for non-RoPE dims, BF16 for RoPE dims (paper 2.3.4) - Per-entry inverse scale for FP8 dequant - FP4 indexer keys for CSA layers (NVFP4 scheme) - memory_bytes() tracking state_cache.py: Per-layer SWA window + tail buffer. - Ring buffer with position tracking (swa_head, swa_pos) - CSA: dual streams (ka/za/kb/zb) for overlapping compression - HCA: single stream (ka/za only) - SWA: no tail buffer - reset_slot() for request completion handle.py: LayerCacheHandle — typed per-call view. - write_swa(), read_swa_view(), read_classical_view(), read_indexer_view() - No GPU allocation in acquire() — 0 bytes delta (cudagraph safe) - SWAView/ClassicalView/IndexerView dataclasses for kernel signatures manager.py: KVCacheManager — owns everything. - Per-layer schema, pool, and allocator construction - admit_request()/release_request() lifecycle - allocate_block() for compression flush - acquire() returns LayerCacheHandle (zero-alloc) append_swa.cu: Native kernel for SWA writes. - One block per token, 128 threads per block - Warp-level amax reduction, BF16->FP8 E4M3 quantization - Atomic ring buffer head increment - FP8/BF16 split write + inv_scale + position metadata - FP8 round-trip: <3.6% relative error - RoPE half: exact match (no quantization) All tests pass on B200: - Schema correctness for CSA/HCA/SWA - Allocator acquire/release/OOM - Pool shapes match architecture spec - Manager lifecycle (admit/release/recycle/exhaustion) - Zero-alloc acquire() (cudagraph safe) - append_swa kernel: positions, RoPE exact, FP8 quality, wrap-around, multi-request isolation	2026-05-22 00:08:38 +00:00
biondizzle	4453d7475a	Fix layer construction: match existing API signatures, add RMSNorm impl - Nvfp4GroupedLinear: (n_local_groups, heads_per_group, head_dim, o_lora_rank) - mHCLayer: hidden_dim, t_max_sinkhorn (not hidden_size, sinkhorn_iters) - RMSNorm: PyTorch reference implementation (BF16, cudagraph-safe) - Verified: all 43 Flash + 61 Pro layers construct cleanly - All projection shapes validated against architecture spec	2026-05-21 23:31:58 +00:00
biondizzle	d5ec0e5133	Clean up: remove debug/temp files and dangling test kernels	2026-05-21 23:26:50 +00:00
biondizzle	97a1b11f41	10-warp debug: MMA=warp4 TMA=warp5 idle=6-9 still gives cosine 0.29 Pipeline init uses __syncthreads (all 320 threads participate). Pipeline groups match 6-warp exactly. Only difference: threads_per_cta=320 vs 192. Direct comparison: 6-warp output [15,-129,-77.5,65,59] vs 10-warp output [-7.5,2.2,-22.7,7.3,12.0] for row 0. Completely different values. Something in CuTe DSL runtime uses blockDim.x or total CTA size in a way that breaks computation when CTA size changes from 192 to 320. The pipeline_init_wait calls agent_sync(ThreadBlock) = __syncthreads which all 320 threads reach. NamedBarriers use specific thread counts. TMA atoms are created from MMA thread layout, not CTA size. Hypothesis: the PipelineTmaUmma or PipelineUmmaAsync internally uses blockDim.x for barrier arithmetic, making the barriers expect more participants than the actual working threads.	2026-05-21 23:24:44 +00:00
biondizzle	66a89859ed	Layer dispatch: config, schedule, attention/FFN sub-blocks, TransformerLayer DSV4Config: frozen dataclass with .flash() / .pro() classmethods. All architectural constants (dims, heads, MoE params, mHC) in one place. LayerSchedule: pure-data per-layer-index -> (attn_type, ffn_type, router_mode). Flash: SWA, SWA, CSA, HCA, CSA, HCA, ... (43 layers) Pro: HCA, HCA, CSA, HCA, CSA, HCA, ... (61 layers) Both: first 3 MoE layers = hash routing, rest = dense validate_schedule() enforces correctness at construction. AttentionSubBlock: CSA / HCA / SWA variants. - Low-rank Q projection (q_down -> q_up) - KV down-projection (varies by attn type: 4h/2h/1h) - CSA: indexer_q_up + indexer_head_weights - Grouped output projection (wo_a + wo_b) - Kernel calls are imports (NotImplementedError until kernel lands) - No PyTorch fallback paths FFNSubBlock: MoE + shared expert. - Router (hash/dense) mode from LayerSpec - Nvfp4MoE + Nvfp4SharedExpert TransformerLayer: composition of mHC + norm + attention + FFN. - Two mHC wrappers (attn + ffn sub-blocks) - Two RMSNorm (one per sub-block) - Pure orchestration, no learned params on the layer itself Tests: schedule construction + validation for both variants. No forward tests yet (depends on FMHA kernel + KV cache).	2026-05-21 23:11:09 +00:00
biondizzle	dd364b6d4d	10-warp idle test: no crash but cosine 0.29 (6-warp gives 0.999999) Adding 4 idle warps (4-7) to 320-thread CTA: - No crash, no deadlock (idle warps just pass) - But output is garbage: cosine 0.29 vs 0.999999 Same softmax+MMA code, same TMEM layout, same barriers. Only difference: mma_warp_id=8 (was 4), threads_per_cta=320 (was 192) and 4 idle warps 4-7. Something in the pipeline/barrier system assumes the old 6-warp topology. Need to identify which component uses threads_per_cta or warp_idx in a way that breaks with more warps.	2026-05-21 22:07:53 +00:00
biondizzle	0d06e55770	Router: Blackwell-native fused decode kernel — real CuTeDSL implementation DenseRouterDecodeKernel: BF16 GEMM + sqrt(softplus) + bias + top-k in a single kernel launch on Blackwell SM100. Warp-specialized persistent GEMM: Warp 5 (TMA): X [M,K] and W_gate [K,E] GMEM->SMEM via TMA Warp 4 (MMA): tcgen05.mma BF16, FP32 accumulator -> TMEM Warps 0-3 (EPI): TMEM->register (tcgen05.ld), activation, top-k, store Key design decisions: - No EFC framework: our epilogue is a ROW-LEVEL top-k reduction, not a per-element transformation. The heap accumulates across subtiles, then merge+renorm+store once per row. - Per-thread register heap: 6 entries (score, index, unbiased act) as CuTeDSL scalars (not Python lists — those dont compile to registers) - Shared memory merge: 128 threads dump heaps, thread 0 merges final top-6 - Identity tensor for expert index: maps register position -> global e_idx - Numerically stable softplus: max(x,0) + log(1+exp(-\|x\|)) in FP32 dense_router_decode.py now dispatches to this kernel for N<=64, falls back to activation_topk.cu for N>64. This is a real Blackwell kernel. No pass statements. No fake code.	2026-05-21 22:04:20 +00:00
biondizzle	9c39f48443	Router: clean up dense_router_decode.py — realistic architecture, no fake code The first draft had a fake CuTeDSL kernel body with pass statements and Python lists as register heaps. That is not the right way. This commit replaces it with honest documentation of what the kernel does and what needs to happen. Current working path: - All N routes through torch.nn.functional.linear + activation_topk.cu - activation_topk is a single-pass fused CUDA kernel (all 6 steps) - This is correct and performant for all N CuTeDSL fused decode kernel (DenseRouterDecodeKernel): - Class structure and warp specialization defined - Full documentation of the TMA/MMA/epilogue pipeline - The novel part is the row-level top-k epilogue (cross-subtile heap) - EFC framework does not apply — our epilogue is not per-element - Implementation deferred until profiling shows the GMEM round-trip on logits matters for decode latency No fake code. No pass statements. No Python lists as GPU registers. The working path is the activation_topk kernel. The CuTeDSL kernel will be built on top of it when the optimization is needed.	2026-05-21 21:58:31 +00:00
biondizzle	abfe4485f7	Router: full kernel stack — hash, topk, activation+topk, dense decode/prefill Step 1: Hash router (hash_router.cu) - One thread per token, gather from [vocab_size, k] LUT - Uniform 1/k weights, FP32 output - 3 MB LUT fits in L2 for repeated decode calls Step 2: topk_select.cu — general top-k primitive - Per-thread register min-heap (k=6, compile-time unrolled) - Shared memory merge: thread 0 merges 64 partial heaps - Tie-breaking: lower index wins on equal scores - Reusable by CSA indexer Step 3: activation_topk.cu — fused sqrt(softplus) + bias + topk + renorm - Single kernel: all 6 steps of the router math, no intermediate buffers - Numerically stable softplus: max(x,0) + log1p(exp(-\|x\|)) - Per-thread heap with unbiased activation co-stored - Shared memory merge → sort descending → renormalize → store Step 4: dense_router_decode.py — CuTeDSL fused GEMM kernel (skeleton) - BF16 GEMM with tcgen05.mma, FP32 accumulator - Custom epilogue: activation + bias + top-k (structure defined, needs TMA/MMA boilerplate) - Dispatch: N<=64 uses fused decode, N>64 uses prefill path Step 5: dense_router_prefill.py — prefill path - torch.nn.functional.linear for GEMM (DeepGEMM integration deferred) - Calls activation_topk for fused post-GEMM processing Step 6: Router class + ops/router.py + test_router.py - Router: construction-time mode (dense/hash), weight loading, custom_op dispatch - ops/router.py: torch.library.custom_op wrappers, integer-keyed registry - test_router.py: spec oracle tests (DO NOT RUN — Carmine is testing Stage C) Test strategy: each kernel tested against its mathematical spec in FP32. No reference implementation, no two debug streams. The oracle IS the math.	2026-05-21 21:54:05 +00:00
biondizzle	c97661994e	WIP: correction warp group architecture - compiles, illegal address at runtime 4 softmax warps (0-3), 4 correction warps (4-7), 1 MMA (8), 1 TMA (9). 320 threads total. Softmax: QK→softmax, write P, write row metadata to TMEM vector. Correction: read vector via QK partition, rescale O (C6), normalize O (C9). Compiles successfully but hits CUDA_ERROR_ILLEGAL_ADDRESS at runtime. Likely: vector TMEM offsets or correction TMEM access layout is wrong. Key files: - tests/unit/test_fmha_v3_correction.py (new correction architecture) - tests/unit/test_fmha_v3_softmax.py (working n=128, cosine 0.993)	2026-05-21 21:20:39 +00:00
biondizzle	d2a16daf70	BREAKTHROUGH: cosine 0.993 for n=128! PV-partitioned P row sum works. C9 fix: instead of using QK-partitioned row_sum (which maps to wrong PV rows), read P from TMEM using PV partition and sum via .reduce(ADD). QK: thread N owns row N//4, PV: thread N owns row N. Reading P via PV partition gives each thread its correct row P values. n=128: cosine 0.993 (was 0.514) n=256: cosine 0.725 (C6 still broken for multi-tile) n=384: cosine 0.676 (same C6 issue) Remaining: C6 O-rescale for multi-tile needs same PV-partitioned fix. Small accuracy gap (0.993 vs 0.999) likely from BF16 P store/load round-trip.	2026-05-21 20:13:51 +00:00
biondizzle	7189165a67	WIP: TMEM vector bridge not working (same cosine 0.513) row_sum is PROVEN correct (29.25 vs 29.22 for row 0, ratio 1.001). The ONLY bug is QK→PV row mapping in C9 normalization. Tried: composition(tStS,(128,1)) for write, composition(tOtO,(128,1)) for read. Same result — the composition preserves the fragments internal thread-to-address mapping, so the same thread writes and reads the same TMEM address regardless of which fragment layout is used for the composition. Need: absolute row-coordinate indexed TMEM vector. Each QK thread writes inv_row_sum to vec[QK_row_id], each PV thread reads from vec[PV_row_id]. The row_id comes from the identity tensor coordinate. Alternative: implement FMHA correction_epilog pattern with dedicated correction warp group that reads row metadata from the vector.	2026-05-21 19:26:15 +00:00
biondizzle	26f6c1ba7f	WIP: confirmed row_sum is wrong (5.5 vs correct 29.22 for row 0) The packed f32x2 reduction SHOULD sum all 128 exp2 P values but gives a result ~5.3x too small. Need to debug inside the kernel with print statements to see what values the reduction is actually summing. Unnormalized P@V is perfect (cosine 0.999998). row_max is correct (because P is correct). The bug is specifically in row_sum computation.	2026-05-21 19:16:15 +00:00
biondizzle	4251af1f14	WIP: scalar C9 normalization - confirmed inv_row_sum is wrong The C9 TMEM round-trip IS modifying O (confirmed by epilogue * 2.0 test). But inv_row_sum is wrong: each thread computes row_sum via .reduce(MAX) and packed f32x2 reduction, but the result appears to be the same for all threads. Next: need to dump the QK C-fragment coordinate tensor to understand which rows each thread actually owns in the TMEM load partition.	2026-05-21 19:09:32 +00:00
biondizzle	8612bc5426	WIP: QK-partitioned C9 normalization (does not work) The QK composition(tStS, (128,64)) view of O TMEM region does not align with the actual PV C-fragment layout. Cannot read O with QK partition. Need to use TMEM vector approach: 1. Store inv_row_sum via QK partition (composition(tStS, (128,1))) 2. Read inv_row_sum via PV partition (need PV-partitioned view of vector) 3. Apply normalization in PV-partitioned O TMEM access The key challenge: creating a PV-partitioned read of the vector TMEM region that was written with QK partition. This is what CUTLASS FMHA does with its correction warp group.	2026-05-21 18:59:21 +00:00
biondizzle	d7aa4da686	BREAKTHROUGH: unnormalized P@V cosine 0.999998 for n=128! The softmax math (exp2, P store, PV) is correct for single-tile. The bug is ONLY in C6/C9 normalization: applying inv_row_sum using PV partition instead of QK partition. n=128 (single tile): cosine 0.999998 PASS n=256/384 (multi-tile): C6 O-rescale using wrong partition = FAIL Fix: normalize O using QK row coordinates, not PV row coordinates. Can use TMEM vector to bridge QK partition to PV partition.	2026-05-21 18:55:00 +00:00
biondizzle	a983a8fb41	WIP: TMEM vector for per-row row_sum (not yet working) Key finding: the root cause is that each epilogue thread owns MULTIPLE rows in the QK C-fragment, so scalar row_max/row_sum are wrong (global across all rows, not per-row). The V=ones diagnostic confirmed: all 128 threads use the same row_sum (from row 114). Tried: TMEM vector store+load of row_sum (composition(tStS, (128,2))). This is a no-op because both write and read use the SAME QK partition with a scalar row_sum. The vector approach only helps when different partitions are used for write vs read, or when per-row values are stored. Next steps: 1. Need PER-ROW row_max and row_sum, not per-thread scalar 2. The CUTLASS FMHA works because each thread owns exactly 1 row 3. Options: restructure thread layout, or compute per-row values differently 4. The vector must store ALL 128 per-row values, then read per-row in C9	2026-05-21 18:45:30 +00:00
biondizzle	331d9e95f3	WIP: Stage C softmax - partial progress Key finding: cute.size(v, mode=[0]) in @cute.jit produces wrong code. Hardcoding s_k=128 (matching Stage B) fixes the base pipeline. Current status: kernel produces non-zero output but softmax math is still wrong. Applied fixes: pv_done_bar, acc_scale with scale, fastmath=True Need to debug row_sum computation and C9 normalization.	2026-05-21 18:04:21 +00:00
biondizzle	84cd636ba9	Stage C fixes: pv_done_bar sync, acc_scale with scale, fastmath=True - Add pv_done_bar (barrier_id=4): MMA signals PV complete, epilogue waits before O rescale (C6) and final normalization (C9) - Fix acc_scale: exp2(scale * (old_max - new_max)) includes the scale_softmax_log2 factor matching CUTLASS FMHA reference - fastmath=True for both exp2 calls (P computation + rescale) - No *0.5 (our scalar row_sum pattern initializes (0,0) not (sum,sum))	2026-05-21 17:58:04 +00:00
biondizzle	52b46a2dee	Stage C: add validation harness with real softmax reference (C1)	2026-05-21 17:49:26 +00:00
biondizzle	96f900f5f0	README: add full DSV4 pipeline architecture diagram (CSA/HCA, not MLA)	2026-05-21 17:40:25 +00:00
biondizzle	2ec32eb8da	README: update for new dsv4/ package structure	2026-05-21 17:34:40 +00:00
biondizzle	3fb3c925af	Restructure: cutedsl/ -> dsv4/ with proper layering - Split bridge.py -> ops/quantize.py, ops/layouts.py, ops/gemm_runner.py - Renamed classes: CuTeDSLNvfp4Linear -> Nvfp4Linear, etc. - Moved kernel code to dsv4/kernels/ (gemm, attention, compressor, decode, cuda) - Moved PyTorch bridges to dsv4/ops/ - Moved nn.Module layers to dsv4layers/ - Moved reference implementations to dsv4/reference/ - Moved vendored CUTLASS code to vendored/ - Archived ~190 debug tests to tests/archive/ - Kept ~15 canonical tests in tests/unit/ - Updated all import paths - Added stubs for future components (model/, cache/, loader/) - Updated pyproject.toml: dsv4-inference package name	2026-05-21 17:30:44 +00:00
biondizzle	99e143dd0e	Fix: add scale_softmax_log2, use O TMEM rescale for C9 normalization - scale_softmax_log2 was missing from _setup (patch artifact) - C9 normalization: load O from TMEM, multiply by 1/row_sum, store back instead of trying to capture runtime value in const_expr lambda - Then use standard epilogue_tma_store with identity transform	2026-05-21 17:15:15 +00:00
biondizzle	df04ba40ee	Stage C: online softmax kernel (WIP) - test_fmha_v3_softmax.py - C1: Real softmax reference (torch.softmax, not identity) - C2: Per-thread row_max/row_sum registers - C3: QK scale folded (1/sqrt(d) * log2(e)) - C4: Row max via .reduce(MAX) - C5: Rescale factor (exp2(old_max - new_max)) - C6: O rescale in TMEM (correction_rescale pattern) - C7: Real exp2 for P computation - C8: Row sum via packed f32x2 reduction - C9: Final normalization (1/row_sum in epilogue) - Dynamic s_k for V FMHA reconstruction - fastmath=False for correctness first	2026-05-21 17:10:58 +00:00
biondizzle	20564425ec	README: full roadmap — Stage C (real softmax), D (paged KV), E (production kernel) Document canonical test files, obsolete test sprawl, and the path from test_fmha_v3.py → cutedsl/kernel/attention/fmha_kernel.py → vLLM integration. Also: TMEM layout for Stage C, key lessons from A&B.	2026-05-21 15:43:01 +00:00
biondizzle	ad24792fc7	Update both READMEs: Stage B complete, document TMEM overlap root cause - Workspace README: full rewrite with Stage B ✅, Bug 4b root cause (P/O overlap), FMHA V reconstruction, TMEM layout diagram, softmax store pattern, updated footguns - Kernel README: focused on the bug, fix, and current test status - Key lesson documented: NEVER use find_tmem_tensor_col_offset() as O placement	2026-05-21 15:36:06 +00:00
biondizzle	2030d41e41	Fix TMEM overlap in test_pv64_with_softmax.py too — cosine 0.999999 Same P/O overlap bug: O at col 64 overlapped P at [32,96). Same fixes: O at col 128, FMHA V reconstruction, power-of-2 TMEM alloc.	2026-05-21 15:32:49 +00:00
biondizzle	0f4f69907e	STAGE B BUG 4b FIXED: TMEM P/O overlap + FMHA V reconstruction Root cause: PV output O started at TMEM column 64 (from find_tmem_tensor_col_offset), overlapping with P at columns [32,96). PV MMA reading P while writing O to overlapping columns corrupted the A operand mid-computation. For (128,128) PV, O started at 128 (no overlap) so it worked by accident. For (128,64) PV, O started at 64, overlapping P [32,96) -> NaN/garbage. Fix: Place O at column 128 (after both S [0,128) and P [32,96)). Also added FMHA-style V reconstruction: logical (HEAD_DIM, s_k, 1) stride (1, hd, hd*s_k) instead of passing DLPack V directly to TMA. test_fmha_v3.py: (128,64) PV with random V -> cosine 0.999999 PASS	2026-05-21 15:30:24 +00:00
biondizzle	4564758466	Stage B Bug 4b debugging: P/A alias proven working, V layout issue for (128,64) PV Key findings: - P/A alias WORKS: PV reads non-zero P from TMEM at offset 32 (proven by no-softmax test) - V mode bug: V=(128,64) only loads 64 K-values, PV needs 128. Output = sum(S[:,:64]) = 0.67 cosine - FMHA-style V reconstruction (hd,n,1) stride (1,hd) gives NaN for (128,64) PV - K-major V (64,128) contiguous gives NaN for (128,64) PV - Square (128,128) PV works with ALL V approaches (cosine 0.999999) - Non-square PV consistently broken regardless of V layout Test files: - test_128_128_fmha_v.py: (128,128) with FMHA V - PASS - test_pv64_fmha_v.py: (128,64) with FMHA V - NaN - test_pv64_kmajor_v.py: (128,64) with K-major V - NaN - test_pv64_with_softmax.py: (128,64) with original V - 0.67 - test_pv64_no_softmax.py: proves P/A alias works - test_fmha_v3.py: full pipeline with QK C-fragment composition store	2026-05-21 15:20:14 +00:00
biondizzle	81d5d8d04c	FMHA v3: KV-tile interleaving pipeline - QK works, Bug 4b blocks PV	2026-05-21 12:52:29 +00:00
biondizzle	73e03cfa6d	Stage B: PV(128,64) test + v2 pipeline fixes - test_pv64.py: (128,64) PV with separate V SMEM, single ab pipeline Result: cosine 0.669848 — data path works but P layout mismatch Softmax writes P via QK C-fragment layout, PV reads via PV A-fragment layout These differ for non-(128,128) PV — Bug 1 from README - test_fmha_v2_fixed.py: KV-tile interleaved pipeline with fixes Fix 1: per-pipeline tx_count (Q vs KV separate byte counts) Fix 2: NamedBarrier for softmax-done signal (replaces double-acquire deadlock) Fix 3: Separate SMEM for V (no recast_ptr overlap with K) Still produces zeros — needs P layout fix (same root cause as test_pv64)	2026-05-21 11:49:06 +00:00
biondizzle	61b23efbcf	stuff and stuff	2026-05-21 10:50:30 +00:00
biondizzle	d72f854efb	FMHA v1: pv_mma_tiler=(128,64,128) works with V=I, fails with real V (SMEM layout bug)	2026-05-21 10:47:46 +00:00
biondizzle	750f1f09c9	README: Bug 4 ROOT CAUSE CONFIRMED - V SMEM 1 K-tile + PV 8 K-phases mismatch. Zero-pad V workaround correct.	2026-05-21 09:59:37 +00:00
biondizzle	dbb240adc9	Root cause FOUND: V SMEM only holds 1 K-tile (2048 BF16), but PV MMA iterates 8 K-phases. For non-(128,128) PV, most K-phases read wrong V data. Zero-padded V works because V is (128,128) covering all 8 K-phases. FMHA interleaves QK+PV per KV-tile to avoid this.	2026-05-21 09:56:54 +00:00
biondizzle	50e9b5da81	README: Bug 4 corrected — NOT TMEM alias. A-fragment identical for all PV sizes. Real bug in V/B or output C/D.	2026-05-21 09:47:08 +00:00
biondizzle	d4934371d0	Key finding: PV A-fragment layout is IDENTICAL for (128,128)/(128,32)/(128,16) PV. Bug is NOT TMEM alias. cta_tile_shape_mnk wrong for non-(128,128) PV. V SMEM and O C-fragment sizes look correct. Debugging V/epilogue paths.	2026-05-21 09:44:22 +00:00
biondizzle	422af26024	Update README: Bug 4 status, (128,16) PV zero output, (128,128) PV zero-pad workaround (cosine 1.0)	2026-05-21 09:20:09 +00:00
biondizzle	781684dd89	TMEM alias analysis: (128,16) PV broken, (128,128) PV with zero-pad works. Root cause: PV A-fragment layout differs from QK C-fragment layout for (128,16) PV, causing TMEM column mismatch. Using (128,128) PV as workaround.	2026-05-21 09:10:12 +00:00
biondizzle	96e7210db7	Debugging TMEM alias for (128,16) PV: zero output confirmed, PV reads from wrong TMEM columns. Need to align softmax P write with PV A-fragment layout.	2026-05-21 09:00:42 +00:00
biondizzle	ad3f63033d	Stage B N-tiling: (128,16) PV MMA compiles and runs, cosine 0.36 (TMEM alias mismatch bug). FMHA head_dim=64 passes. Debugging TMEM layout alignment.	2026-05-21 08:45:49 +00:00
biondizzle	5e37ea56e4	FOOTGUN #0 : num_tma_load_bytes MUST include V bytes. Fix v27, v29, comment all. Update README.	2026-05-21 07:13:14 +00:00
biondizzle	dd8d872bec	v29: FIX DEADLOCK - add V bytes to num_tma_load_bytes. V=I(128,128) cosine 1.0	2026-05-21 07:08:29 +00:00
biondizzle	b9b1b808a5	README: update with v28/v29 deadlock investigation, FMHA softmax bridge trace, new footguns	2026-05-21 06:46:02 +00:00
biondizzle	f1c4ee0e4d	v29 (padded V, deadlocks), v30 (diag copy, works) — debugging epilogue deadlock with (128,128) PV	2026-05-21 06:40:27 +00:00
biondizzle	4968ce064d	even more stuff	2026-05-21 05:55:22 +00:00
biondizzle	15c987244f	v28 attempt: PV MMA (128,64) - cosine 0.004, debugging	2026-05-21 05:41:44 +00:00
biondizzle	a7fd2761df	README: Bug 4 root cause — TMEM layout mismatch (128,64) PV A-fragment vs softmax P write - (128,64) PV MMA A-fragment has N_MMA=64, reads P with wrong stride - Softmax writes P with QK C-fragment layout (N_MMA=128) - O[m,d] ≈ P[m,2d] — every other column effect confirmed - All-ones and single-element V pass (uniform/sparse data hides mismatch) - epi_tile must use PV cta_tile (partial fix: 0.01 → 0.876) - Added footguns #9 (TMEM alias N_MMA match) and #10 (epi_tile) - Added diagnostic test results to test table	2026-05-21 05:17:12 +00:00
biondizzle	c20518332e	more stuff	2026-05-21 05:08:57 +00:00

1 2 3 4 5 ...

548 Commits