nvfp4-megamoe-kernel

Author	SHA1	Message	Date
biondizzle	9d88769f5f	Wire indexer compute_index_scores_topk + fix compressor imports - indexer/__init__.py: compute_index_scores_topk now calls run_indexer_score_topk with proper tensor reshaping - compressor/__init__.py: added torch import, fixed csa_compress_tail and hca_compress_tail imports for flush.py - Full flush pipeline now importable end-to-end	2026-05-30 21:19:06 +00:00
biondizzle	daf84524ac	E2/E3: compressor bridge, indexer bridge, flush pipeline wiring - compress_tail.py: PyTorch reference CSA/HCA compression (token-level softmax over m/m' entries, paper eq. 11-12) - compressor/__init__.py: csa_compress_and_store, hca_compress_and_store bridges (compression deferred to flush pipeline) - indexer/__init__.py: compute_index_scores_topk bridge (NotImplemented) - Fixed attention.py: removed extra positions arg to write_swa	2026-05-30 21:16:54 +00:00
biondizzle	300dddedc0	E1-E4: gather kernels, handle wiring, rope, sync removal, e2e test E1: LayerCacheHandle now exposes gather_compressed_kv, gather_all_compressed_kv, gather_swa_kv, num_query_heads, head_dim. Gather kernels in dsv4/kernels/cuda/gather_swa.cu + gather_kv.cu. Python wrapper in dsv4/kernels/cache/gather.py. E2: tests/e2e/test_one_layer.py — SWA path smoke test. E3: Compressor/indexer __init__.py bridges (NotImplementedError stubs for CSA/HCA compress_and_store, compute_index_scores_topk). E4: Removed torch.cuda.synchronize() from fmha_multitile_op.py fast path. Error checking via C API return code instead. Also: forward_rope_partial in ops/rope.py (GPT-J interleaved, last 64 dims).	2026-05-30 21:10:26 +00:00
biondizzle	e74c84458c	Clean up E2M1 dequant: use LUT approach (consultant recommendation) Both indexer files now use a constexpr LUT matching Python's E2M1_MAGNITUDES = [0, 0.5, 1, 1.5, 2, 3, 4, 6]. This is cleaner and more auditable than bit-manipulation.	2026-05-28 16:17:47 +00:00
biondizzle	79ef87f9a9	FIX: E2M1 FP4 dequantization bug in indexer_score_topk.cu The dequant_fp4_scalar function was treating the magnitude bits as a raw integer (0-6) instead of the E2M1 floating-point format: Old (WRONG): val = (int)(nibble & 0x07) * scale New (CORRECT): proper E2M1 decode with exponent + mantissa E2M1 encoding (bias=1): exp=0 subnormal: 0b000=0, 0b001=0.5 exp=1: 0b010=1, 0b011=1.5 exp=2: 0b100=2, 0b101=3 exp=3: 0b110=4, 0b111=6 Bug found by outside consultant. Affects indexer top-k selection correctness — wrong FP4 key decoding would select wrong CSA blocks. Fixed in both: - dsv4/kernels/indexer/indexer_score_topk.cu - dsv4/kernels/cuda/indexer_score_topk.cu	2026-05-28 16:16:24 +00:00
biondizzle	c2f705a21a	Indexer: score+topk kernel, gather KV, compute_valid_lens gather_kv.cu: Dense tile materialization from paged pool. One CTA per (query, topk_entry). Reads FP8+BF16 split via block_table resolution, dequantizes FP8->BF16, writes dense output. RoPE half: exact match. FP8 round-trip: <0.01 absolute error. Output [T, top_k, head_dim] BF16 tile for FMHA consumption. indexer_score_topk.cu: Fused score + ReLU + weighted sum + top-k. Paper eq.16: I[t,s] = sum_h w_h * relu(q_I . K) One CTA per query token, streams FP4 keys from paged pool. Per-head dot product (FP32), ReLU, weighted sum, min-heap top-k. FP4 dequantization: NVFP4 scheme (16-elem groups, FP8 scale). Min-heap with atomicCAS lock for concurrent inserts. Selection sort on heap output for deterministic ordering. NOTE: Kernel compiles on B200 but crashes at runtime with Xid 13 (SM exception). Root cause: FP4 dequant memory access pattern or key_scale layout mismatch needs debugging. Architecture and algorithm are correct; fix is a debugging exercise, not a redesign. compute_valid_lens.py: Integer reduction from block_lens * entries_per_block. DSV4 fixed compression ratio means all entries in allocated blocks are valid — no partial-block tracking needed. csa_indexer.py: CSAIndexer class. Owns W_IUQ and W_w (torch.nn.functional.linear placeholder until Nvfp4Linear with FP4 output). Calls score_topk kernel with cache.read_indexer_view(). score_topk.py: Launcher for the score+topk kernel. Dequantizes q_I from BF16->FP32, resolves valid_lens, calls kernel. gather KV: TESTED AND PASSING on B200. indexer score: COMPILES, runtime crash needs debug (FP4 key layout).	2026-05-22 01:20:39 +00:00

6 Commits