Commit Graph

5 Commits

Author SHA1 Message Date
a983a8fb41 WIP: TMEM vector for per-row row_sum (not yet working)
Key finding: the root cause is that each epilogue thread owns MULTIPLE rows
in the QK C-fragment, so scalar row_max/row_sum are wrong (global across
all rows, not per-row). The V=ones diagnostic confirmed: all 128 threads
use the same row_sum (from row 114).

Tried: TMEM vector store+load of row_sum (composition(tStS, (128,2))).
This is a no-op because both write and read use the SAME QK partition
with a scalar row_sum. The vector approach only helps when different
partitions are used for write vs read, or when per-row values are stored.

Next steps:
1. Need PER-ROW row_max and row_sum, not per-thread scalar
2. The CUTLASS FMHA works because each thread owns exactly 1 row
3. Options: restructure thread layout, or compute per-row values differently
4. The vector must store ALL 128 per-row values, then read per-row in C9
2026-05-21 18:45:30 +00:00
331d9e95f3 WIP: Stage C softmax - partial progress
Key finding: cute.size(v, mode=[0]) in @cute.jit produces wrong code.
Hardcoding s_k=128 (matching Stage B) fixes the base pipeline.

Current status: kernel produces non-zero output but softmax math is still wrong.
Applied fixes: pv_done_bar, acc_scale with scale, fastmath=True
Need to debug row_sum computation and C9 normalization.
2026-05-21 18:04:21 +00:00
84cd636ba9 Stage C fixes: pv_done_bar sync, acc_scale with scale, fastmath=True
- Add pv_done_bar (barrier_id=4): MMA signals PV complete, epilogue
  waits before O rescale (C6) and final normalization (C9)
- Fix acc_scale: exp2(scale * (old_max - new_max)) includes the
  scale_softmax_log2 factor matching CUTLASS FMHA reference
- fastmath=True for both exp2 calls (P computation + rescale)
- No *0.5 (our scalar row_sum pattern initializes (0,0) not (sum,sum))
2026-05-21 17:58:04 +00:00
52b46a2dee Stage C: add validation harness with real softmax reference (C1) 2026-05-21 17:49:26 +00:00
3fb3c925af Restructure: cutedsl/ -> dsv4/ with proper layering
- Split bridge.py -> ops/quantize.py, ops/layouts.py, ops/gemm_runner.py
- Renamed classes: CuTeDSLNvfp4Linear -> Nvfp4Linear, etc.
- Moved kernel code to dsv4/kernels/ (gemm, attention, compressor, decode, cuda)
- Moved PyTorch bridges to dsv4/ops/
- Moved nn.Module layers to dsv4layers/
- Moved reference implementations to dsv4/reference/
- Moved vendored CUTLASS code to vendored/
- Archived ~190 debug tests to tests/archive/
- Kept ~15 canonical tests in tests/unit/
- Updated all import paths
- Added stubs for future components (model/, cache/, loader/)
- Updated pyproject.toml: dsv4-inference package name
2026-05-21 17:30:44 +00:00