Commit Graph

719 Commits

Author SHA1 Message Date
363dd893f0 test: dimension sweep to isolate GEMM bug 2026-05-15 18:51:09 +00:00
fee5a97ebb fix: cosine_similarity dim for M>0 2026-05-15 18:50:45 +00:00
f9330a1777 test: standalone M=1 GEMM test with deterministic data 2026-05-15 18:47:26 +00:00
1b63a46168 docs: update DEBUG_LOG with cosine≈0 finding + new hypotheses 2026-05-15 18:35:00 +00:00
773967452f debug: fix gs scalar conversion + add traceback 2026-05-15 18:27:44 +00:00
df916b87eb debug: fix gs.item() for multi-element tensor 2026-05-15 18:09:41 +00:00
755f9ad567 debug: fix per_expert_alpha ref + clean up BF16 reference scaling 2026-05-15 17:55:11 +00:00
de8acc7965 debug: dump raw GEMM inputs + first 8 output values 2026-05-15 17:02:40 +00:00
9159cb6bb3 docs: add debug log — current state, hypotheses, fixes 2026-05-15 15:48:57 +00:00
2fd55a94c6 fix: weight reshape bug + igs double-count in BF16 reference 2026-05-15 15:46:16 +00:00
c421a668f3 debug: BF16 reference GEMM + cosine comparison for L1 2026-05-15 14:16:24 +00:00
995589ac8a debug: add FP4 quantization round-trip diagnostic 2026-05-15 13:41:09 +00:00
d0ed3d84a8 debug: add L2, SiLU, and scatter pipeline prints 2026-05-15 13:21:25 +00:00
da5572f497 clean: remove diagnostic scripts from repo 2026-05-15 12:50:14 +00:00
fd59222fc0 fix: stop folding global scale into float8 block scales
Folding block_sf (float8) * global_sf (float32) back into float8 loses ~25% precision.
The product of a ~56-448 block_sf and a ~4.65e-05 global_sf (about 0.0026-0.021) lands in
the float8 low-precision zone, where the step size is ~25% of the value. The model output
becomes garbage even though every value stays finite.
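
The precision loss described above can be reproduced without a GPU. The sketch below is a pure-Python model of float8_e4m3fn rounding (3 mantissa bits, smallest normal 2^-6, subnormal step 2^-9, saturation at 448); `round_to_e4m3fn` and the sweep are illustrative helpers, not code from this repository, and the global scale value is the one quoted in the message.

```python
import math

def round_to_e4m3fn(x: float) -> float:
    """Round a non-negative float to the nearest float8_e4m3fn value (saturating at 448)."""
    if x <= 0.0:
        return 0.0
    x = min(x, 448.0)
    # floor(log2(x)), floored at -6: below the smallest normal the grid is subnormal.
    e = max(math.frexp(x)[1] - 1, -6)
    step = 2.0 ** (e - 3)  # 3 mantissa bits -> 8 steps per binade
    # Python's round() is round-half-to-even, matching IEEE nearest-even on this grid.
    return round(x / step) * step

GLOBAL_SF = 4.65e-05  # typical global scale quoted in the commit message

# Every float8-representable block scale in the quoted 56..448 range.
block_sfs = sorted({round_to_e4m3fn(float(v)) for v in range(56, 449)})

# Relative error introduced by storing block_sf * global_sf as float8 again.
worst_rel_err = max(
    abs(round_to_e4m3fn(b * GLOBAL_SF) - b * GLOBAL_SF) / (b * GLOBAL_SF)
    for b in block_sfs
)
print(f"worst relative error after folding: {worst_rel_err:.1%}")
```

Keeping the block scale in float8 and the global scale in float32 avoids this second rounding entirely, which is what the fix below does.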

Fix: keep block scales as original float8, return global scales separately as
float32 per-expert vectors. Apply global scale as per-expert GEMM alpha in
cutlass_grouped_nvfp4_gemm (already iterates per-expert). For L1 with separate
gate/up global scales, use gate_gs as alpha and apply up_correction ratio to
the up half post-GEMM.
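
A minimal sketch of the alpha-plus-correction idea, assuming the L1 output columns are laid out as the gate half followed by the up half (the real layout is not stated here). `l1_with_alpha` and `matmul` are hypothetical names; the actual kernel applies alpha inside the grouped GEMM rather than in Python.

```python
def matmul(a, b):
    """Plain (M, K) x (K, N) matrix product on nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def l1_with_alpha(x, w_gate_up, act_gs, gate_gs, up_gs):
    """Scale the whole GEMM by alpha = act_gs * gate_gs, then fix up the up half."""
    n_half = len(w_gate_up[0]) // 2           # assumed: gate columns first, up columns second
    alpha = act_gs * gate_gs                  # per-expert GEMM alpha
    out = [[alpha * v for v in row] for row in matmul(x, w_gate_up)]
    up_correction = up_gs / gate_gs           # ratio applied post-GEMM
    for row in out:
        for j in range(n_half, 2 * n_half):
            row[j] *= up_correction
    return out

x = [[1.0, 2.0]]
w = [[1.0, 2.0, 3.0, 4.0],
     [5.0, 6.0, 7.0, 8.0]]
result = l1_with_alpha(x, w, act_gs=0.5, gate_gs=0.25, up_gs=0.125)

# Reference: scale each half by its own global scale directly.
raw = matmul(x, w)[0]
reference = [raw[0] * 0.5 * 0.25, raw[1] * 0.5 * 0.25,
             raw[2] * 0.5 * 0.125, raw[3] * 0.5 * 0.125]
print(result[0], reference)
```

The two paths agree because alpha * (up_gs / gate_gs) equals act_gs * up_gs, so a single alpha per expert is enough.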

weight_transform.py: no more _fold_global_scale, returns (w, sf, global_sf)
nvfp4_mega_moe.py: per-expert alpha = activation_gs * weight_gs
kernel.py: per_expert_alpha parameter in grouped GEMM
deepseek_v4.py: updated type hints and comments
2026-05-15 12:42:53 +00:00
56e62e916d revert: idx2crd remap approach — source-first needs hierarchical coords
cute::crd2idx requires hierarchical coordinates matching the layout's
nested shape, which we don't have from flat (m, k_sf). Reverted to
idx2crd dest-first approach. The real bug was cute::size vs
cute::cosize for allocation, not the remap direction.
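
The size/cosize distinction can be illustrated with a small Python model of a strided layout. This is not the CuTe API: it assumes a flat (non-hierarchical) shape with non-negative strides, and `offset_of`, `size` and `cosize` are illustrative helpers.

```python
from itertools import product
from math import prod

def offset_of(coord, stride):
    """Map a flat logical coordinate to a physical offset: sum(c_i * stride_i)."""
    return sum(c * s for c, s in zip(coord, stride))

def size(shape):
    """Number of logical elements (the layout's domain)."""
    return prod(shape)

def cosize(shape, stride):
    """Largest physical offset + 1 (the layout's codomain); assumes strides >= 0."""
    return offset_of([n - 1 for n in shape], stride) + 1

# A padded layout: 4 rows, 2 columns, columns 8 elements apart.
shape, stride = (4, 2), (1, 8)
offsets = [offset_of(c, stride) for c in product(*(range(n) for n in shape))]

print("size  :", size(shape))
print("cosize:", cosize(shape, stride))
print("max offset written:", max(offsets))
```

A buffer allocated with `size` elements is too small whenever the layout has padding, because writes go to offsets up to `cosize - 1`.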
2026-05-15 11:44:38 +00:00
d5949a23b4 fix: use cute::crd2idx for SF remap — layout_sf() not directly callable
CuTe Layout objects with hierarchical shapes can't be called directly
with flat (m, k_sf). Use cute::crd2idx(make_coord(m, k_sf), layout_sf)
to convert logical coordinates to physical indices.
2026-05-15 11:39:57 +00:00
9908fd64d9 feat: CUTLASS NVFP4 mega_moe kernel — slot-based L1/L2, source-first SF remap
Major changes from initial TileLang prototype:

Kernel:
- CUTLASS NVFP4 block-scaled GEMM (SM100 Blackwell, OpClassBlockScaledTensorOp)
- Slot-based dispatch: L1 GEMM → SiLU+Mul per-slot → L2 GEMM → index_add scatter
- 1D slot_expert_ids passed to both L1 and L2 (no 2D topk_ids rebuild)
- slot_token gathered in cutlass_grouped_nvfp4_gemm when provided
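
The slot-based flow listed above can be sketched with scalars standing in for the GEMMs. Everything here is a simplification under stated assumptions: each expert has a single scalar gate/up/down weight, routing weights are omitted, and `moe_slots` is a hypothetical name, not the kernel entry point.

```python
import math

def silu(x: float) -> float:
    return x / (1.0 + math.exp(-x))

def moe_slots(tokens, slot_token, slot_expert_ids, w_gate, w_up, w_down):
    """One (token, expert) pair per slot; results are scatter-added per token."""
    out = [0.0] * len(tokens)
    for tok, e in zip(slot_token, slot_expert_ids):
        x = tokens[tok]                          # gather the slot's token
        gate, up = x * w_gate[e], x * w_up[e]    # stand-in for the L1 GEMM
        h = silu(gate) * up                      # SiLU+Mul applied per slot
        out[tok] += h * w_down[e]                # stand-in for L2 GEMM + index_add
    return out

tokens = [1.0, 2.0]
slot_token = [0, 0, 1]           # token 0 is routed to two experts
slot_expert_ids = [0, 1, 1]      # 1D expert id per slot
w_gate, w_up, w_down = [0.5, -1.0], [2.0, 0.25], [1.0, 3.0]

out = moe_slots(tokens, slot_token, slot_expert_ids, w_gate, w_up, w_down)
print(out)
```

Because SiLU is nonlinear, applying it per slot before the scatter is not interchangeable with applying it after the expert paths have been summed.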

SF Remap (source-first):
- Iterates logical (m, k_sf) source grid, uses layout_sf(make_coord(m, k_sf))
  for CUTLASS dest index — no idx2crd/flatten coordinate extraction
- 2D kernel launch: dim3 block(32,8), grid over (K_sf, MN)
- Uses cute::cosize() for physical allocation size (not cute::size)
- SFA: (MN, K_sf) row-major; SFB: (K_sf, MN) row-major (i.e. col-major as (MN, K_sf))

Weight transform:
- UE4M3 unpack with bit reinterpret (not value cast)
- Global scale folding (weight_scale_2) for gate/up split
- clamp(0,448) → float8_e4m3fn, transpose (N,K)→(K,N) for CUTLASS
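
To show why the unpack must reinterpret bits rather than cast values, here is a pure-Python sketch. It assumes four scale bytes packed per 32-bit word with the lowest byte first, which may not match the checkpoint's actual packing; `unpack_u32_bytes` and `decode_e4m3_byte` are illustrative names, not the repository's functions.

```python
def unpack_u32_bytes(word: int) -> list[int]:
    """Split a 32-bit word into 4 raw bytes (assumed order: lowest byte first)."""
    return [(word >> (8 * i)) & 0xFF for i in range(4)]

def decode_e4m3_byte(byte: int) -> float:
    """Reinterpret one byte as e4m3: 1 sign bit, 4 exponent bits (bias 7), 3 mantissa bits."""
    sign = -1.0 if byte & 0x80 else 1.0
    exp = (byte >> 3) & 0xF
    man = byte & 0x7
    if exp == 0xF and man == 0x7:
        return float("nan")                       # the only NaN pattern in e4m3fn
    if exp == 0:
        return sign * man * 2.0 ** -9             # subnormal: (man / 8) * 2^-6
    return sign * (1.0 + man / 8.0) * 2.0 ** (exp - 7)

word = 0x7E403830
raw_bytes = unpack_u32_bytes(word)
reinterpreted = [decode_e4m3_byte(b) for b in raw_bytes]   # bit reinterpret
value_cast = [float(b) for b in raw_bytes]                 # what a value cast would give

print("bit reinterpret:", reinterpreted)
print("value cast     :", value_cast)
```

The byte 0x38 decodes to 1.0 when reinterpreted but becomes 56.0 under a value cast, so the two approaches disagree on nearly every scale.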

No prepack cache:
- SFB remapped per-call inside CUTLASS (~µs, not the bottleneck)
- See README for why prepack cache must never return (OOM, CUDA graphs,
  M-dependent layout, cross-layer collisions)

Stage activation:
- Nearest-neighbor E2M1 quantization (no clamp, no uniform steps)
- Per-tensor global scale → alpha for L2 GEMM
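
A minimal sketch of nearest-neighbor E2M1 quantization, assuming ties go to the smaller magnitude (the tie rule used by the kernel is not stated here). `quantize_e2m1` is a hypothetical helper; the eight magnitudes are the values representable with 2 exponent bits and 1 mantissa bit.

```python
import math

# The non-uniform E2M1 grid: steps of 0.5, then 1, then 2.
E2M1_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)

def quantize_e2m1(x: float, scale: float = 1.0) -> float:
    """Snap x / scale to the nearest representable magnitude, keep the sign, rescale."""
    mag = abs(x) / scale
    # min() returns the first minimum, so exact ties pick the smaller magnitude.
    nearest = min(E2M1_MAGNITUDES, key=lambda g: abs(g - mag))
    return math.copysign(nearest, x) * scale

samples = [2.4, 2.6, -0.7, 100.0]
quantized = [quantize_e2m1(v) for v in samples]
print(quantized)
```

No explicit clamp is needed in this formulation: anything beyond the grid is nearest to 6.0, so it saturates there on its own.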

Bug fixes:
- _fold_global_scale: removed broken logical_widths branch
- unpack_ue4m3_u32: int32 for CUDA bitwise ops, view (not to) for the dtype change, N-D support
- Correct expert param mapping for NVFP4 checkpoint
- SiLU applied per-slot (not after summing expert paths)
2026-05-15 11:38:18 +00:00
c2b752c2fe Initial: TileLang NVFP4 mega_moe kernel package
- nvfp4_mega_moe_full: drop-in replacement for deep_gemm.mega.fp8_nvfp4_mega_moe
- transform_nvfp4_weights_for_mega_moe: weight transformation (tested)
- SymmBuffer + get_symm_buffer_for_nvfp4_mega_moe: API-matching stubs
- MEGA_MOE_STATIC=1 support for pipeline testing
- pyproject.toml for pip install
2026-05-13 15:44:51 +00:00