nvfp4-megamoe-kernel

Author	SHA1	Message	Date
biondizzle	363dd893f0	test: dimension sweep to isolate GEMM bug	2026-05-15 18:51:09 +00:00
biondizzle	fee5a97ebb	fix: cosine_similarity dim for M>0	2026-05-15 18:50:45 +00:00
biondizzle	f9330a1777	test: standalone M=1 GEMM test with deterministic data	2026-05-15 18:47:26 +00:00
biondizzle	1b63a46168	docs: update DEBUG_LOG with cosine≈0 finding + new hypotheses	2026-05-15 18:35:00 +00:00
biondizzle	773967452f	debug: fix gs scalar conversion + add traceback	2026-05-15 18:27:44 +00:00
biondizzle	df916b87eb	debug: fix gs.item() for multi-element tensor	2026-05-15 18:09:41 +00:00
biondizzle	755f9ad567	debug: fix per_expert_alpha ref + clean up BF16 reference scaling	2026-05-15 17:55:11 +00:00
biondizzle	de8acc7965	debug: dump raw GEMM inputs + first 8 output values	2026-05-15 17:02:40 +00:00
biondizzle	9159cb6bb3	docs: add debug log — current state, hypotheses, fixes	2026-05-15 15:48:57 +00:00
biondizzle	2fd55a94c6	fix: weight reshape bug + igs double-count in BF16 reference	2026-05-15 15:46:16 +00:00
biondizzle	c421a668f3	debug: BF16 reference GEMM + cosine comparison for L1	2026-05-15 14:16:24 +00:00
biondizzle	995589ac8a	debug: add FP4 quantization round-trip diagnostic	2026-05-15 13:41:09 +00:00
biondizzle	d0ed3d84a8	debug: add L2, SiLU, and scatter pipeline prints	2026-05-15 13:21:25 +00:00
biondizzle	da5572f497	clean: remove diagnostic scripts from repo	2026-05-15 12:50:14 +00:00
biondizzle	fd59222fc0	fix: stop folding global scale into float8 block scales. The fold block_sf (float8) * global_sf (float32) -> float8 loses ~25% precision. Product of ~56-448 block_sf * ~4.65e-05 global_sf lands in float8 low-precision zone where step size is 25%. This makes model output garbage despite finite values. Fix: keep block scales as original float8, return global scales separately as float32 per-expert vectors. Apply global scale as per-expert GEMM alpha in cutlass_grouped_nvfp4_gemm (already iterates per-expert). For L1 with separate gate/up global scales, use gate_gs as alpha and apply up_correction ratio to the up half post-GEMM. weight_transform.py: no more _fold_global_scale, returns (w, sf, global_sf); nvfp4_mega_moe.py: per-expert alpha = activation_gs * weight_gs; kernel.py: per_expert_alpha parameter in grouped GEMM; deepseek_v4.py: updated type hints and comments.	2026-05-15 12:42:53 +00:00
biondizzle	56e62e916d	revert: idx2crd remap approach — source-first needs hierarchical coords. cute::crd2idx requires hierarchical coordinates matching the layout's nested shape, which we don't have from flat (m, k_sf). Reverted to idx2crd dest-first approach. The real bug was cute::size vs cute::cosize for allocation, not the remap direction.	2026-05-15 11:44:38 +00:00
biondizzle	d5949a23b4	fix: use cute::crd2idx for SF remap — layout_sf() not directly callable. CuTe Layout objects with hierarchical shapes can't be called directly with flat (m, k_sf). Use cute::crd2idx(make_coord(m, k_sf), layout_sf) to convert logical coordinates to physical indices.	2026-05-15 11:39:57 +00:00
biondizzle	9908fd64d9	feat: CUTLASS NVFP4 mega_moe kernel — slot-based L1/L2, source-first SF remap. Major changes from initial TileLang prototype. Kernel: CUTLASS NVFP4 block-scaled GEMM (SM100 Blackwell, OpClassBlockScaledTensorOp); slot-based dispatch: L1 GEMM → SiLU+Mul per-slot → L2 GEMM → index_add scatter; 1D slot_expert_ids passed to both L1 and L2 (no 2D topk_ids rebuild); slot_token gathered in cutlass_grouped_nvfp4_gemm when provided. SF Remap (source-first): iterates logical (m, k_sf) source grid, uses layout_sf(make_coord(m, k_sf)) for CUTLASS dest index — no idx2crd/flatten coordinate extraction; 2D kernel launch: dim3 block(32,8), grid over (K_sf, MN); uses cute::cosize() for physical allocation size (not cute::size); SFA: (MN, K_sf) row-major; SFB: (K_sf, MN) row-major (col-major). Weight transform: UE4M3 unpack with bit reinterpret (not value cast); global scale folding (weight_scale_2) for gate/up split; clamp(0,448) → float8_e4m3fn, transpose (N,K)→(K,N) for CUTLASS. No prepack cache: SFB remapped per-call inside CUTLASS (~µs, not the bottleneck); see README for why prepack cache must never return (OOM, CUDA graphs, M-dependent layout, cross-layer collisions). Stage activation: nearest-neighbor E2M1 quantization (no clamp, no uniform steps); per-tensor global scale → alpha for L2 GEMM. Bug fixes: _fold_global_scale: removed broken logical_widths branch; unpack_ue4m3_u32: int32 for CUDA bitwise, view not to, ND support; correct expert param mapping for NVFP4 checkpoint; SiLU applied per-slot (not after summing expert paths).	2026-05-15 11:38:18 +00:00
biondizzle	c2b752c2fe	Initial: TileLang NVFP4 mega_moe kernel package. nvfp4_mega_moe_full: drop-in replacement for deep_gemm.mega.fp8_nvfp4_mega_moe; transform_nvfp4_weights_for_mega_moe: weight transformation (tested); SymmBuffer + get_symm_buffer_for_nvfp4_mega_moe: API-matching stubs; MEGA_MOE_STATIC=1 support for pipeline testing; pyproject.toml for pip install.	2026-05-13 15:44:51 +00:00

719 Commits
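
Note on commit fd59222fc0: the precision loss it describes can be reproduced without a GPU. The sketch below is not code from the repository. It is a hypothetical pure-Python rounding model of float8 E4M3 (3 mantissa bits, smallest normal 2^-6, maximum 448), and the names `quantize_e4m3` and `relative_error` are invented for illustration. It assumes round-to-nearest-even and ignores NaN handling. The block-scale range 56-448 and the global scale 4.65e-05 are the figures quoted in the commit message.

```python
import math

# Assumed float8 E4M3 parameters: 3 mantissa bits, min normal 2**-6, max 448.
E4M3_MAX = 448.0
MIN_NORMAL_EXP = -6
MANTISSA_BITS = 3


def quantize_e4m3(x: float) -> float:
    """Round x to the nearest value representable in the E4M3 model above."""
    if x == 0.0:
        return 0.0
    sign = -1.0 if x < 0 else 1.0
    a = min(abs(x), E4M3_MAX)          # saturate instead of overflowing
    _, ex = math.frexp(a)              # a = m * 2**ex with 0.5 <= m < 1
    e = max(ex - 1, MIN_NORMAL_EXP)    # below 2**-6 the spacing stops shrinking
    step = 2.0 ** (e - MANTISSA_BITS)  # spacing between neighbouring values
    # Python's round() rounds halves to even, matching round-to-nearest-even.
    return sign * min(round(a / step) * step, E4M3_MAX)


def relative_error(x: float) -> float:
    """Relative error introduced by storing x in the E4M3 model."""
    return abs(quantize_e4m3(x) - x) / abs(x)


if __name__ == "__main__":
    global_sf = 4.65e-05  # figure quoted in the commit message
    for block_sf in (56.0, 112.0, 224.0, 448.0):
        folded = block_sf * global_sf
        print(f"block_sf={block_sf:6.1f}  folded={folded:.6f}  "
              f"stored={quantize_e4m3(folded):.6f}  "
              f"rel_err={relative_error(folded):.1%}")
```

In this model every folded product falls at or just above the subnormal range, where the spacing is fixed at 2^-9, so the smallest products lose roughly a quarter of their value. The unfolded block scales (for example 56 and 448) are stored exactly. That is consistent with the commit's decision to keep block scales in float8 and carry the global scale separately as float32.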