56769cdbf5
README: add fire_b200_test docs, update multi-tile blocker with real findings
2026-05-22 17:41:23 +00:00
81eee05018
README: add test harness instructions
2026-05-22 17:09:53 +00:00
793e3243d5
README + MEMORY: update Stage C status to single-tile only, document multi-tile blocker
...
- Stage C works for n=128 (0.993) but multi-tile (n>128) is broken
- Root cause: tBgK slice hardcodes GMEM iteration to tile 0
- CuTeDSL TMA copy doesn't accept Python int as tile index
- Mike's combined K+V barrier fix compiles but deadlocks at runtime
- Fallback: kh.count // 2 (untested)
2026-05-22 16:32:31 +00:00
35d532c742
README: update Stage C status to WORKING, add CuTeDSL constraints and target architecture
2026-05-22 09:39:15 +00:00
96f900f5f0
README: add full DSV4 pipeline architecture diagram (CSA/HCA, not MLA)
2026-05-21 17:40:25 +00:00
2ec32eb8da
README: update for new dsv4/ package structure
2026-05-21 17:34:40 +00:00
20564425ec
README: full roadmap — Stage C (real softmax), D (paged KV), E (production kernel)
...
Document canonical test files, obsolete test sprawl, and the path from
test_fmha_v3.py → cutedsl/kernel/attention/fmha_kernel.py → vLLM integration.
Also: TMEM layout for Stage C, key lessons from A&B.
2026-05-21 15:43:01 +00:00
ad24792fc7
Update both READMEs: Stage B complete, document TMEM overlap root cause
...
- Workspace README: full rewrite with Stage B ✅, Bug 4b root cause (P/O overlap),
FMHA V reconstruction, TMEM layout diagram, softmax store pattern, updated footguns
- Kernel README: focused on the bug, fix, and current test status
- Key lesson documented: NEVER use find_tmem_tensor_col_offset() as O placement
2026-05-21 15:36:06 +00:00
750f1f09c9
README: Bug 4 ROOT CAUSE CONFIRMED - V SMEM 1 K-tile + PV 8 K-phases mismatch. Zero-pad V workaround correct.
2026-05-21 09:59:37 +00:00
50e9b5da81
README: Bug 4 corrected — NOT TMEM alias. A-fragment identical for all PV sizes. Real bug in V/B or output C/D.
2026-05-21 09:47:08 +00:00
5e37ea56e4
FOOTGUN #0: num_tma_load_bytes MUST include V bytes. Fix v27, v29, comment all. Update README.
2026-05-21 07:13:14 +00:00
b9b1b808a5
README: update with v28/v29 deadlock investigation, FMHA softmax bridge trace, new footguns
2026-05-21 06:46:02 +00:00
a7fd2761df
README: Bug 4 root cause — TMEM layout mismatch (128,64) PV A-fragment vs softmax P write
...
- (128,64) PV MMA A-fragment has N_MMA=64, reads P with wrong stride
- Softmax writes P with QK C-fragment layout (N_MMA=128)
- O[m,d] ≈ P[m,2d] — every other column effect confirmed
- All-ones and single-element V pass (uniform/sparse data hides mismatch)
- epi_tile must use PV cta_tile (partial fix: 0.01 → 0.876)
- Added footguns #9 (TMEM alias N_MMA match) and #10 (epi_tile)
- Added diagnostic test results to test table
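The "every other column" symptom, and why all-ones V hides it, can be reproduced with a toy model. This is only a sketch of the symptom in plain Python: the helper names are hypothetical, and the misread is modelled as a simple column pick, not as real TMEM addressing.

```python
import math
import random

def cosine(a, b):
    """Cosine similarity between two flat lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def misread_every_other_column(p, out_cols):
    # Toy model of the symptom only: output column d picks up source column 2*d.
    return [[row[2 * d] for d in range(out_cols)] for row in p]

def flatten(rows):
    return [x for row in rows for x in row]

def compare(p, out_cols):
    """Cosine between the intended read (columns 0..out_cols-1) and the misread."""
    expected = [row[:out_cols] for row in p]
    got = misread_every_other_column(p, out_cols)
    return cosine(flatten(expected), flatten(got))

rng = random.Random(0)
rows, src_cols, out_cols = 16, 64, 32

uniform = [[1.0] * src_cols for _ in range(rows)]
noisy = [[rng.gauss(0.0, 1.0) for _ in range(src_cols)] for _ in range(rows)]

cos_uniform = compare(uniform, out_cols)  # uniform data: misread is invisible
cos_random = compare(noisy, out_cols)     # random data: misread is exposed
print(cos_uniform, cos_random)
```

Uniform inputs give a cosine of 1.0 under any column permutation, so only random (or otherwise non-uniform) data is a meaningful check for this class of bug.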
2026-05-21 05:17:12 +00:00
0dc6fe4a7d
Stage B progress: PV works for square (128,128), broken for (128,64)
...
- Bug 1 (V MN-major): Fix applied
- Bug 2 (softmax packing): Confirmed correct (V=I test: cosine 1.0)
- Bug 3 (ACCUMULATE): Fix applied (first PV must overwrite, not accumulate)
- Bug 4 (CURRENT): PV MMA broken for non-square output
  - (128,128) PV with random V: cosine 0.999999 ✅
  - (128,64) PV with MN-major V: cosine ~0.01 ❌
  - Softmax packing, layout aliasing, pipeline ordering all verified correct
  - Root cause unknown — likely epilogue/V layout/MMA tiler issue
Added test_pv_diag.py (V=I and random V, 128x128 output — PASS)
Added test_layout_compare.py (TMEM layout inspection)
Added test_inspect_types.py (TMEM pointer arithmetic verification)
Updated test_mma_si_pv.py with head_dim param, pv_mma_tiler_mn fix, ACCUMULATE fix
Updated READMEs with current state
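The Bug 3 rule (first PV must overwrite, not accumulate) can be illustrated with a scalar toy. This is a sketch under assumptions: `mma` and `pv_dot` are hypothetical stand-ins for one accumulator element, not CuTeDSL calls, and `stale_acc` models whatever was left in the accumulator before the first PV step.

```python
def mma(acc, a_row, b_col, accumulate):
    """One multiply-accumulate step: acc + a.b if accumulate, else just a.b."""
    prod = sum(x * y for x, y in zip(a_row, b_col))
    return acc + prod if accumulate else prod

def pv_dot(p_tiles, v_tiles, stale_acc, overwrite_first):
    """Accumulate P.V over K-tiles, starting from a possibly stale accumulator."""
    acc = stale_acc
    for i, (p, v) in enumerate(zip(p_tiles, v_tiles)):
        # The first tile must overwrite; later tiles accumulate.
        accumulate = not (overwrite_first and i == 0)
        acc = mma(acc, p, v, accumulate)
    return acc

p_tiles = [[1.0, 2.0], [3.0, 4.0]]
v_tiles = [[5.0, 6.0], [7.0, 8.0]]

correct = pv_dot(p_tiles, v_tiles, stale_acc=1000.0, overwrite_first=True)
polluted = pv_dot(p_tiles, v_tiles, stale_acc=1000.0, overwrite_first=False)
print(correct, polluted)
```

With a zero-initialised accumulator both variants agree, which is why this kind of bug only shows up when the accumulator holds leftover data.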
2026-05-21 04:40:28 +00:00
7a8945eb76
Stage B: pipeline deadlock fixed, V MN-major applied, PV output garbage
...
Pipeline deadlock fixed:
- No cta_layout_vmnk on mma_si PipelineUmmaAsync
- TMA warp excluded from tmem.wait_for_alloc
- PipelineTmaStore (not TmaStorePipeline)
Bug 1 (V MN-major): fix applied
- PV MMA uses v_major=OperandMajorMode.MN
- V shaped (64,128) with strides (1,64) via as_strided
Bug 2 (softmax packing): C-fragment composition store applied
- FP32 to BF16 packing works
- St32x32bOp uses Float32 (not BFloat16)
Bug 3 (PV garbage): investigating
- PV MMA cosine ~0.01 against reference
- Suspected TMEM layout mismatch between softmax P store and PV A-fragment read
Test results:
- test_mma_si_only: cosine 0.999999 PASS
- test_mma_si_pv: cosine 0.01 FAIL (pipeline works, PV output wrong)
2026-05-21 04:10:07 +00:00
467ade37b2
Stage B: C-fragment vs A-fragment TMEM layout mismatch diagnosed
...
Key finding: C-fragment and A-fragment use different physical TMEM address
mappings. St32x32bOp with C-fragment writes to C-layout addresses, but PV MMA
reads from A-layout addresses. Forward FMHA recast validated FP16 only, not BF16.
Working: FP32 ld/st roundtrip, BF16 elemwise, BF16 recast ld S0->st S1 (all cos 0.999999)
Broken: C-frag st + A-frag read (NaN), A-frag store + PV MMA (cos -0.02)
Next: Fix register data flow (128 FP16/thread load vs 64 BF16/thread store mismatch)
2026-05-21 00:12:47 +00:00
97656a5cd1
Stage B: two MMAs + identity softmax — crash fixed, softmax output still wrong
...
Key fixes:
- PipelineUmmaAsync consumer group: 32*4=128 threads (not 4 warps)
- TMEM offsets computed from find_tmem_tensor_col_offset (not hardcoded)
- P fragment from p_tmem_s.outer + make_fragment_A (matching fmha.py)
- V SMEM aliasing via recast_ptr
Status:
- Stage A: cosine 0.999999 ✅
- Stage B: runs without crash, identity softmax cosine -0.02 ❌
- Diagnostics: TMEM layout inspection, bisection results
2026-05-20 20:26:25 +00:00
9f0528f150
Update README: reflect current state, add C128A/C4A topk + warmup fixes
2026-05-20 06:51:12 +00:00
bbba289bd8
feat: GPU-native SWA + sparse decode attention kernels (CuTeDSL)
...
- native_swa_decode.py: BlackwellSWADecodeKernel
  - CTA mapping: 1 CTA per (decode_token, q_head_group)
  - Online softmax with KV tile streaming (16 tokens/tile)
  - Pre-dequantized bf16 KV (fp8 dequant on host; MLIR cvt_fpext
    requires 32-bit aligned vector, no scalar fp8->bf16 support)
  - Cosine 0.9999+ vs PyTorch batched SDPA reference
  - Fallback _fallback_batched_sdp when CuTeDSL unavailable
- native_sparse_decode.py: BlackwellSparseDecodeKernel
  - Combined SWA + compressed KV in single attention pass
  - Supports CSA (cr=4) and HCA (cr=128) layers
  - Sink weight merge on host side
  - Cosine 0.9999+ vs combined SDPA reference
- fp8_bf16.py: Documents MLIR limitation (cvt_fpext requires
  vector<4xf8>, no scalar support). Pre-dequant is the workaround.
- vLLM wiring (attention.py):
  - SWA-only layers: native_swa_decode_attention
  - CSA/HCA layers: native_sparse_decode_attention with topk + attn_sink
  - csa_attention.py updated to use native kernels
- Tests: test_decode_pipeline.py, test_sparse_decode.py both passing
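For reference, online softmax with KV tile streaming can be sketched in plain Python for a single query. This is not the kernel code: the function names are hypothetical, and only the 16 tokens/tile figure comes from the commit above.

```python
import math
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attention_reference(q, keys, values):
    """Plain softmax attention for one query, all keys at once."""
    scores = [dot(q, k) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / total for j in range(dim)]

def attention_online(q, keys, values, tile=16):
    """Same result, streaming KV in tiles with a running max and running sum."""
    dim = len(values[0])
    running_max = -math.inf
    running_sum = 0.0
    acc = [0.0] * dim
    for start in range(0, len(keys), tile):
        scores = [dot(q, k) for k in keys[start:start + tile]]
        new_max = max(running_max, max(scores))
        # Rescale what was accumulated under the old max (exp(-inf) == 0.0 on tile 0).
        rescale = math.exp(running_max - new_max)
        running_sum *= rescale
        acc = [a * rescale for a in acc]
        for s, v in zip(scores, values[start:start + tile]):
            w = math.exp(s - new_max)
            running_sum += w
            acc = [a + w * vj for a, vj in zip(acc, v)]
        running_max = new_max
    return [a / running_sum for a in acc]

rng = random.Random(0)
q = [rng.gauss(0, 1) for _ in range(4)]
keys = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(40)]  # 2.5 tiles
values = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(40)]

ref = attention_reference(q, keys, values)
out = attention_online(q, keys, values, tile=16)
print(max(abs(a - b) for a, b in zip(ref, out)))
```

The 40-token example deliberately ends on a partial tile, since the last tile is where a streaming implementation most often diverges from the reference.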
2026-05-20 05:46:15 +00:00
06bf4f482d
README: comprehensive update with current kernel status
2026-05-20 04:42:57 +00:00
a30d9eb523
Update README with final kernel status
2026-05-20 04:39:57 +00:00
aa8563c626
Fused SwiGLU epilogue with granularity-8 weight interleave
...
- Fix interleave_l1_weights: remove //2 bug (g=granularity_bf16 for N-axis)
- Apply L1 weight+SF interleave in runner._ensure_stacked() and moe_pipeline
- De-interleave L1 GEMM output before gate/up split
- Fused SwiGLU kernel: epi_tile=(128,8) for subtile-level pairing
  - Even subtiles = gate: SiLU in FP32 registers, save to register buffer
  - Odd subtiles = up: silu(gate)*up from buffer
  - Both branches produce same BF16 tensor type (CuTeDSL constraint)
- run_nvfp4_moe_fused() pipeline: fused L1 + PyTorch L2
- Runner: fused_swiglu=True option for CuTeDSLMoERunner
- Layertest: both fused and non-fused paths PASS (cosine 0.988)
- README.md updated with current status and lessons learned
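The granularity-8 interleave and the even/odd subtile pairing can be sketched on a single row of values. This is a sketch under assumptions: the exact column order used by interleave_l1_weights is not shown in this log, so the layout below (8 gate columns, then the matching 8 up columns) and all function names are illustrative only.

```python
import math

G = 8  # interleave granularity along N (from the commit above)

def interleave(gate, up, g=G):
    """[gate 0..g-1, up 0..g-1, gate g..2g-1, up g..2g-1, ...] (assumed layout)."""
    assert len(gate) == len(up) and len(gate) % g == 0
    mixed = []
    for s in range(0, len(gate), g):
        mixed.extend(gate[s:s + g])
        mixed.extend(up[s:s + g])
    return mixed

def deinterleave(mixed, g=G):
    """Inverse of interleave: recover the separate gate and up halves."""
    gate, up = [], []
    for s in range(0, len(mixed), 2 * g):
        gate.extend(mixed[s:s + g])
        up.extend(mixed[s + g:s + 2 * g])
    return gate, up

def silu(x):
    return x / (1.0 + math.exp(-x))

def fused_swiglu(mixed, g=G):
    """Pair each even subtile (gate) with the following odd subtile (up)."""
    out = []
    for s in range(0, len(mixed), 2 * g):
        gate_act = [silu(x) for x in mixed[s:s + g]]        # even subtile, buffered
        up_tile = mixed[s + g:s + 2 * g]                    # odd subtile
        out.extend(a * u for a, u in zip(gate_act, up_tile))
    return out

gate = [i * 0.25 - 2.0 for i in range(16)]
up = [1.0 + i * 0.5 for i in range(16)]
mixed = interleave(gate, up)

unfused = [silu(g_) * u_ for g_, u_ in zip(*deinterleave(mixed))]
fused = fused_swiglu(mixed)
print(fused[:4])
```

Because gate and up for the same output columns sit in adjacent 8-wide subtiles, the fused path never needs the full de-interleaved tensor.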
2026-05-20 04:13:52 +00:00
57d4cb714f
docs: rewrite README.md with current project state
...
- Document all 5 correctness bug fixes
- Document fused SwiGLU epilogue progress (Step 1 PASS, Step 2 blocked)
- Document CuTeDSL runtime conditional limitation
- List remaining steps (amax shuffles, NVFP4 quantize, FP4/SF TMA stores)
- Document weight interleave and register layout
- Capture key lessons learned
- Update file structure and test inventory
2026-05-20 03:30:35 +00:00
5fb70b4cd2
Update README.md and CURRENT_BUG.md: eliminate stale issues, document NaN investigation, clarify our kernels are clean
2026-05-19 20:22:10 +00:00
31b9cfbdbd
Update README and CURRENT_BUG: BUILD YOUR OWN KERNELS. Stop patching vLLM.
2026-05-19 15:19:55 +00:00
914d27fee7
Update README + CURRENT_BUG: full CuTeDSL NVFP4 plan, no more PyTorch fallbacks
...
Mike's directive: build the full thing with NVFP4/CuTeDSL.
No more 'optimize later' or 'just make it work' workarounds.
Key updates:
- README: full architecture docs (CSA/HCA/mHC), current status, NVFP4 coverage
- CURRENT_BUG: detailed plan for CuTeDSL NVFP4 attention, KV cache, RoPE
- Both files document: checkpoint key names, compress ratios, config issues
- Removed all 'TODO: optimize later' hedging — we build it right the first time
2026-05-19 08:26:16 +00:00
b3451c74f8
Update README and CURRENT_BUG.md with current state
...
- README: updated NVFP4 coverage table, status, and plan
- CURRENT_BUG.md: full debugging journey, what works, what's next
- Both reflect decision to build our own CuTeDSL kernels
2026-05-18 20:05:03 +00:00
af087e655e
docs: update README — vLLM cudagraph inference running, output quality in progress
2026-05-16 21:40:59 +00:00
f7e29fdf1e
docs: update README with cudagraph compatibility work and decisions
2026-05-16 18:55:47 +00:00
e5370140cb
docs: update README with full NVFP4 coverage, dequant anti-pattern, v2 status
...
- Added NVFP4 coverage table (what's native, what's converted, why)
- Documented the dequant→requant anti-pattern that caused vLLM hangs
- Updated plan: Phase 2 done, Phase 3 targets remaining conversions
- Removed stale REWRITE_PLAN reference
- Updated project structure (nvfp4_cutedsl.py, removed old refs)
2026-05-16 05:43:33 +00:00
b04bff7e8b
feat: clean Dockerfile, docker-compose, import fixes for CuTeDSL build
...
Dockerfile:
- Removed: C++ CUTLASS extension build, TileLang install, CUTLASS clone
- Added: nvidia-cutlass-dsl==4.5.0 install, cutedsl/ copy
- Copy nvfp4_cutedsl.py to vllm models dir
- Verify step checks cutlass import
docker-compose.yml:
- Removed stale env vars (MEGA_MOE_DEBUG, MEGA_MOE_STATIC, etc.)
deepseek_v4.py:
- Fix import: vllm.nvfp4_cutedsl → vllm.model_executor.models.nvfp4_cutedsl
README.md:
- Updated results: 0% weight loss confirmed (bit-identical view-cast)
- 1.1% cosine loss is entirely from activation quantization
2026-05-16 03:50:07 +00:00
3ec9c3074b
docs: rewrite README, nuke DEBUG_LOG, add vLLM integration stub
...
README.md: full rewrite explaining how we got here, project structure,
plan, and key lessons learned from the C++ CUTLASS disaster.
Removed:
- DEBUG_LOG.md (old debug timeline, no longer relevant)
- REWRITE_PLAN.md (plan is now in README)
- test_gemm.py (C++ extension test)
Added:
- vllm/nvfp4_cutedsl.py: CuTeDSLMoERunner class for vLLM integration
  - Replaces nvfp4_mega_moe_full + SymmBuffer with CuTeDSL kernel
  - Handles slot-based routing, L1→SiLU→L2→scatter
  - prepare_weights_from_dequantized() for weight prep
Tagged the-last-of-cutlass on the old C++ kernel state.
2026-05-16 03:33:16 +00:00
9908fd64d9
feat: CUTLASS NVFP4 mega_moe kernel — slot-based L1/L2, source-first SF remap
...
Major changes from initial TileLang prototype:
Kernel:
- CUTLASS NVFP4 block-scaled GEMM (SM100 Blackwell, OpClassBlockScaledTensorOp)
- Slot-based dispatch: L1 GEMM → SiLU+Mul per-slot → L2 GEMM → index_add scatter
- 1D slot_expert_ids passed to both L1 and L2 (no 2D topk_ids rebuild)
- slot_token gathered in cutlass_grouped_nvfp4_gemm when provided
SF Remap (source-first):
- Iterates logical (m, k_sf) source grid, uses layout_sf(make_coord(m, k_sf))
for CUTLASS dest index — no idx2crd/flatten coordinate extraction
- 2D kernel launch: dim3 block(32,8), grid over (K_sf, MN)
- Uses cute::cosize() for physical allocation size (not cute::size)
- SFA: (MN, K_sf) row-major; SFB: (K_sf, MN) row-major (i.e. col-major when viewed as (MN, K_sf))
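Why the allocation uses cosize rather than size can be shown with a tiny layout model. This is a plain-Python sketch of the CuTe concepts, not the CuTe API; the shape and stride are made-up illustration values and non-negative strides are assumed.

```python
from itertools import product

def layout_index(coord, stride):
    """Offset a (shape, stride) layout maps a coordinate to."""
    return sum(c * s for c, s in zip(coord, stride))

def size(shape):
    """Number of logical elements (domain size)."""
    n = 1
    for extent in shape:
        n *= extent
    return n

def cosize(shape, stride):
    """Largest offset the layout can produce, plus one (codomain size)."""
    coords = product(*(range(extent) for extent in shape))
    return max(layout_index(c, stride) for c in coords) + 1

# A padded layout: 4x3 logical elements, but columns are 8 apart in memory.
shape, stride = (4, 3), (1, 8)
logical = size(shape)
physical = cosize(shape, stride)
print(logical, physical)
```

Allocating only `size` elements for a padded or swizzled layout would let the highest offsets land past the end of the buffer.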
Weight transform:
- UE4M3 unpack with bit reinterpret (not value cast)
- Global scale folding (weight_scale_2) for gate/up split
- clamp(0,448) → float8_e4m3fn, transpose (N,K)→(K,N) for CUTLASS
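The difference between a bit reinterpret and a value cast for UE4M3 scales can be sketched as follows. Assumptions, not taken from the repo: four scale bytes are packed little-endian in each 32-bit word, UE4M3 is decoded as E4M3 (4 exponent bits, bias 7, 3 mantissa bits) with the sign bit ignored, and the function names are hypothetical.

```python
import math

def decode_ue4m3(byte):
    """Decode one unsigned E4M3 byte to a float (e4m3fn-style, bias 7)."""
    bits = byte & 0x7F            # sign bit is unused for an unsigned scale
    if bits == 0x7F:
        return math.nan           # the single NaN encoding in e4m3fn
    exponent = bits >> 3
    mantissa = bits & 0x7
    if exponent == 0:             # subnormal: no implicit leading 1
        return (mantissa / 8.0) * 2.0 ** -6
    return (1.0 + mantissa / 8.0) * 2.0 ** (exponent - 7)

def unpack_word(word):
    """Split a 32-bit word into 4 scale bytes (little-endian order assumed)."""
    return list(word.to_bytes(4, "little"))

word = 0x7E403830
raw_bytes = unpack_word(word)

scales = [decode_ue4m3(b) for b in raw_bytes]   # bit reinterpret: correct
value_cast = [float(b) for b in raw_bytes]      # value cast: wrong magnitudes
print(scales, value_cast)
```

The largest finite value, 448, matches the clamp bound used before the float8_e4m3fn conversion.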
No prepack cache:
- SFB remapped per-call inside CUTLASS (~µs, not the bottleneck)
- See README for why prepack cache must never return (OOM, CUDA graphs,
M-dependent layout, cross-layer collisions)
Stage activation:
- Nearest-neighbor E2M1 quantization (no clamp, no uniform steps)
- Per-tensor global scale → alpha for L2 GEMM
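Nearest-neighbor E2M1 quantization can be sketched as a lookup against the eight representable magnitudes. This is a sketch under assumptions: the function name and `scale` parameter are hypothetical, and ties resolve to the smaller magnitude here, which may differ from the kernel's rounding.

```python
import math

# Magnitudes representable in E2M1 (2 exponent bits, 1 mantissa bit).
E2M1_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)

def quantize_e2m1(x, scale=1.0):
    """Snap x/scale to the nearest E2M1 value, then rescale.

    No explicit clamp is needed: anything beyond 6.0 is already nearest to 6.0.
    The grid is non-uniform, so this is a nearest-value search, not round(x/step).
    """
    scaled = abs(x) / scale
    nearest = min(E2M1_MAGNITUDES, key=lambda m: abs(m - scaled))
    return math.copysign(nearest, x) * scale

samples = (0.2, 2.4, 2.6, -7.0)
quantized = [quantize_e2m1(v) for v in samples]
print(quantized)
```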
Bug fixes:
- _fold_global_scale: removed broken logical_widths branch
- unpack_ue4m3_u32: use int32 for CUDA bitwise ops, .view() instead of .to(), add ND support
- Correct expert param mapping for NVFP4 checkpoint
- SiLU applied per-slot (not after summing expert paths)
2026-05-15 11:38:18 +00:00
c2b752c2fe
Initial: TileLang NVFP4 mega_moe kernel package
...
- nvfp4_mega_moe_full: drop-in replacement for deep_gemm.mega.fp8_nvfp4_mega_moe
- transform_nvfp4_weights_for_mega_moe: weight transformation (tested)
- SymmBuffer + get_symm_buffer_for_nvfp4_mega_moe: API-matching stubs
- MEGA_MOE_STATIC=1 support for pipeline testing
- pyproject.toml for pip install
2026-05-13 15:44:51 +00:00