Commit Graph

128 Commits

Author SHA1 Message Date
054792c84e dangit 2026-05-12 18:42:39 +00:00
de055b1e77 syupid clankers 2026-05-12 18:26:37 +00:00
307574bc91 test: signal alarm timeout for kernel hang 2026-05-12 15:14:39 +00:00
fcd6de0a60 test: simplify SF fill to avoid shape mismatch 2026-05-12 15:13:16 +00:00
d4c557fddc test: fix float8 randn + SF int32 packing 2026-05-12 15:12:35 +00:00
28afc2406b test: add random FP4 data and kernel timeout 2026-05-12 15:11:41 +00:00
787d427847 test: fix NVFP4 mega_moe test dimensions for SMEM alignment 2026-05-12 15:07:35 +00:00
8737fd57c0 remove crap 2026-05-12 14:53:42 +00:00
52c3aefe73 bump cache busters to 33 for debug build 2026-05-12 13:10:37 +00:00
ca1d306890 fix: use torch.int8 for packed FP4 tensors (kPackedFP4=kInt8, not uint8) 2026-05-12 12:23:43 +00:00
b8f95ffad3 docker: add OMP_NUM_THREADS=64, remove --tool initcheck, mount cubin cache 2026-05-12 11:15:06 +00:00
5840291ea3 fix staging kernel packed_k_mask double-count 2026-05-12 08:08:24 +00:00
5ea5b579c3 Trim banner, no code changes 2026-05-12 07:24:36 +00:00
74af9984f6 Bug fixes: UE4M3 scale conversion, staging kernel SF/E2M1 packing, wo_a UE4M3, README overhaul
- Fix _ue8m0_to_float32: checkpoint is float8_e4m3fn (UE4M3), not UE8M0
  - Changed from shift-by-23 to .to(torch.float32) in both copies
  - Fix fold_global_scale in DeepGEMM mega/__init__.py
- Fix staging kernel SF pack: int32 shift >= 32 is UB on GPU
  - Split 8-group pack into two int32 writes (groups 0-3, 4-7)
- Fix staging kernel E2M1 output: was writing unpacked (1 byte/elem)
  into packed buffer (hidden/2 bytes), causing 2x overflow
  - Now packs even/odd nibble pairs correctly
- Fix wo_a on-the-fly BF16→NVFP4: was encoding UE8M0, now UE4M3
  - Use .clamp(0, 448).to(float8_e4m3fn) instead of log2/exp trick
- Remove dead code: _ue8m0_uint8_to_float, tmp/, .bak, .s11,
  quant_module_patched.py, patch_finegrained_fp8_blackwell.py,
  patch_vllm_weights.py
- Remove SCALE-FMT diagnostic histogram clutter
- Update stale UE8M0 comments throughout
- Rewrite README: clean instructions, confirmed format details
2026-05-12 05:52:30 +00:00
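
Note on the first fix in the commit above: a minimal sketch, in plain Python rather than the repository's code, of why a shift-by-23 decode goes wrong when the checkpoint bytes are really E4M3. The function names are hypothetical. The E4M3 decode assumes the usual float8_e4m3fn layout (4 exponent bits with bias 7, 3 mantissa bits), and scales are treated as unsigned.

```python
import struct


def decode_as_ue8m0(byte: int) -> float:
    """Shift-by-23 trick: place the byte in the FP32 exponent field."""
    bits = (byte & 0xFF) << 23
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def decode_as_e4m3(byte: int) -> float:
    """Decode a float8_e4m3fn byte as an unsigned scale (sign bit ignored)."""
    exp = (byte >> 3) & 0xF
    mant = byte & 0x7
    if exp == 0xF and mant == 0x7:
        return float("nan")          # the single NaN encoding of e4m3fn
    if exp == 0:
        return mant * 2.0 ** -9      # subnormal: (mant / 8) * 2**-6
    return (1.0 + mant / 8.0) * 2.0 ** (exp - 7)


scale_byte = 0x38                    # E4M3 encoding of 1.0
print(decode_as_e4m3(scale_byte))    # intended value
print(decode_as_ue8m0(scale_byte))   # 2**-71 when misread as UE8M0
```

The same byte differs by dozens of orders of magnitude between the two readings, which is consistent with the scale bug the commit describes.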
a36bf47f11 fix: use tl.split instead of indexing for E2M1 pair packing
Triton doesn't support constexpr tensor indexing (e2m1_pairs[:, 0]).
Use tl.split() which splits the last axis into two tensors.
2026-05-11 22:39:38 +00:00
27dbf2850f fix: replace nested tl.where with sum-of-comparisons for E2M1 quantization
Triton can't compile deeply nested tl.where. Use arithmetic instead:
idx = sum(abs_s >= threshold_i) for 7 threshold values.
2026-05-11 22:23:05 +00:00
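
The sum-of-comparisons idea in the commit above can be sketched in plain Python. This is an illustration, not the Triton kernel: the seven thresholds are assumed to be the midpoints between adjacent E2M1 magnitudes (0, 0.5, 1, 1.5, 2, 3, 4, 6), ties round up because of `>=`, and the sign is assumed to live in bit 3 of the 4-bit code. The kernel may differ on any of these points.

```python
# Representable E2M1 magnitudes, indexed by the 3-bit code.
E2M1_VALUES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)

# Assumed thresholds: midpoints between neighbouring magnitudes.
THRESHOLDS = (0.25, 0.75, 1.25, 1.75, 2.5, 3.5, 5.0)


def e2m1_index(x: float) -> int:
    """Count how many thresholds |x| reaches; no nested conditionals needed."""
    abs_x = abs(x)
    return sum(abs_x >= t for t in THRESHOLDS)


def e2m1_encode(x: float) -> int:
    """Return a 4-bit code: assumed sign bit in bit 3, magnitude index below."""
    sign = 0x8 if x < 0 else 0x0
    return sign | e2m1_index(x)


for value in (0.0, 1.2, 2.4, -3.0, 100.0):
    code = e2m1_encode(value)
    print(value, code, E2M1_VALUES[code & 0x7])
```

Because each comparison is independent, the same pattern maps onto elementwise tensor arithmetic without any branching.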
3d1f3de190 fix: syntax error — move triton imports before docstring, remove orphan @triton.jit 2026-05-11 22:08:50 +00:00
79d866995f bump cache buster 32 for packed FP4 mxf4nvf4 fix 2026-05-11 21:59:56 +00:00
c85b84b0fe fix: staging kernel outputs unpacked E2M1 (1 byte/element, not packed 2/byte)
Matches the SMEM layout: float_e2m1_unpacksmem_t is 1 byte/element.
L1→L2 handoff uses unpacked format (same byte count as FP8).
No bandwidth savings at L1→L2 for v1 — can optimize later.
2026-05-11 21:29:33 +00:00
01cfd02759 fix: same reshape fix in main patch file 2026-05-11 21:05:54 +00:00
076d325c97 fix: use reshape instead of risky [0::2] slicing for E2M1 packing 2026-05-11 21:04:53 +00:00
8dc917c498 fix: topk_weights_out store missing topk_offsets multiplier 2026-05-11 21:02:19 +00:00
17ba5a9d7b bump cache buster 30 for FP4 staging + DeepGEMM FP4 activations 2026-05-11 20:30:14 +00:00
7a4403fa98 feat: FP4 staging kernel - BF16 → E2M1 packed + UE4M3 block16 scales
mxf4nvf4 requires FP4×FP4, not FP8×FP4.
- New staging kernel: E2M1 nearest-neighbor quantization
- Output: uint8 packed (2 E2M1 per byte) + UE4M3 packed int32 scales
- Added CUDA sync diagnostics for error localization
2026-05-11 20:29:36 +00:00
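
A small sketch of the "2 E2M1 per byte" output layout named in the commit above. The nibble order is an assumption (even-index element in the low nibble, odd-index element in the high nibble), and the helper names are hypothetical. The staging kernel's actual order should be checked against the kernel source.

```python
def pack_e2m1_pairs(codes):
    """Pack 4-bit codes two per byte: even index -> low nibble (assumed)."""
    if len(codes) % 2 != 0:
        raise ValueError("need an even number of E2M1 codes")
    return bytes(
        (codes[i] & 0xF) | ((codes[i + 1] & 0xF) << 4)
        for i in range(0, len(codes), 2)
    )


def unpack_e2m1_pairs(packed):
    """Inverse of pack_e2m1_pairs under the same nibble-order assumption."""
    codes = []
    for byte in packed:
        codes.append(byte & 0xF)         # even-index element
        codes.append((byte >> 4) & 0xF)  # odd-index element
    return codes


codes = [1, 2, 3, 4, 13, 7]
packed = pack_e2m1_pairs(codes)
print(packed.hex(), len(packed))         # half as many bytes as elements
```

The packed buffer holds hidden/2 bytes per row, so writing one byte per element into it would overrun by a factor of two.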
0fd2d4f078 diag: add weight_scale uint8 histogram to verify E8M0 vs E4M3 format 2026-05-11 19:55:41 +00:00
50a945bde4 bump cache buster 29 2026-05-11 19:51:48 +00:00
48b905406a diag: add CUDA sync after mega_moe finalize + forward to catch errors 2026-05-11 19:51:44 +00:00
35f6b66678 fix: UE8M0 reinterpret in DeepGEMM fold_global_scale + bump cache 2026-05-11 19:40:08 +00:00
f32d6b5b48 bump cache buster to 27 2026-05-11 19:26:21 +00:00
cd24182e36 diag: add NaN/Inf + FP8-dtype checks after NVFP4 dequant 2026-05-11 19:26:12 +00:00
8ae2214bad fix: reorder Dockerfile ARG before COPY for proper cache busting 2026-05-11 18:48:07 +00:00
c4891e9ee2 fix: manual FP32→UE4M3 quant in Triton staging kernel
Triton can't cast float8e4nv → uint8 directly. Compute E4M3 bits manually:
extract FP32 exponent/mantissa, convert to E4M3 format (4-bit exp + 3-bit mant),
handle rounding and overflow, reconstruct dequantized value for FP8 activation quantization.
2026-05-11 16:38:52 +00:00
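
The manual conversion described in the commit above can be sketched in plain Python. This is not the Triton code. It assumes round-to-nearest-even, saturation at 448 (code 0x7E), and that zero, negative, and NaN inputs map to code 0. Function names are hypothetical.

```python
import struct


def fp32_to_ue4m3_bits(x: float) -> int:
    """Encode a non-negative scale as an unsigned E4M3 byte (bias 7)."""
    if not x > 0.0:                      # zero, negatives, NaN (assumption)
        return 0
    if x >= 448.0:                       # saturate at the largest finite value
        return 0x7E
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    exp32 = (bits >> 23) & 0xFF
    man32 = bits & 0x7FFFFF
    if exp32 == 0:                       # FP32 subnormal: far below E4M3 range
        return 0
    exp8 = exp32 - 127 + 7               # rebias the exponent
    if exp8 >= 1:
        mant = man32 >> 20               # keep the top 3 mantissa bits
        rem = man32 & 0xFFFFF
        if rem > 0x80000 or (rem == 0x80000 and mant & 1):
            mant += 1                    # a carry spills into the exponent
        return min((exp8 << 3) + mant, 0x7E)
    # E4M3 subnormal: value = m * 2**-9, with m rounded to nearest even.
    shift = 141 - exp32
    sig = (1 << 23) | man32
    m = sig >> shift
    rem = sig & ((1 << shift) - 1)
    half = 1 << (shift - 1)
    if rem > half or (rem == half and m & 1):
        m += 1                           # m == 8 is the smallest normal
    return m


def ue4m3_bits_to_float(code: int) -> float:
    """Reconstruct the dequantized scale from the 8-bit code."""
    exp, mant = (code >> 3) & 0xF, code & 0x7
    if exp == 0:
        return mant * 2.0 ** -9
    return (1.0 + mant / 8.0) * 2.0 ** (exp - 7)


for s in (1.0, 0.3, 447.0, 1e9):
    code = fp32_to_ue4m3_bits(s)
    print(s, hex(code), ue4m3_bits_to_float(code))
```

Quantizing activations against the reconstructed value rather than the original FP32 scale keeps the encode and decode sides consistent.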
436109081c bump cache buster to 24 2026-05-11 16:12:56 +00:00
5faf9916eb fix: UE4M3 activation scales + group_size=16 for NVFP4 mega_moe
The mxf4nvf4 MMA instruction shares scale_format_ between SFA and SFB.
For NVFP4 (UE4M3), both activation and weight scales must be UE4M3.

Changes to _stage_deepseek_v4_mega_moe_inputs_kernel:
- GROUP_K=16 (was 32) — NVFP4 scale_vec::4X has group_size=16
- Scale quantization: float → float8_e4m3fn (UE4M3) instead of UE8M0
  exponent extraction (>> 23). Pack 4 UE4M3 bytes per int32.
- FP8 activation quantized against UE4M3 rounded scale

Also updated class docstring (was stale MXFP4 conversion description).
2026-05-11 16:12:36 +00:00
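
A plain-Python sketch of "pack 4 UE4M3 bytes per int32" from the commit above. The byte order is an assumption (first group in the lowest byte), as is the wrap to a signed 32-bit value. Helper names are hypothetical.

```python
GROUP_K = 16  # elements covered by one scale, per the commit message


def pack4_to_int32(scale_bytes):
    """Pack four scale bytes into one signed 32-bit word (little-endian)."""
    if len(scale_bytes) != 4:
        raise ValueError("expected exactly four scale bytes")
    word = 0
    for i, b in enumerate(scale_bytes):
        word |= (b & 0xFF) << (8 * i)    # shifts are 0, 8, 16, 24: all < 32
    # Wrap into the signed int32 range, as an int32 tensor would store it.
    return word - (1 << 32) if word >= (1 << 31) else word


def pack_row_scales(scale_bytes):
    """Pack a row of per-group scale bytes, four groups per int32 word."""
    if len(scale_bytes) % 4 != 0:
        raise ValueError("number of groups must be a multiple of 4")
    return [
        pack4_to_int32(scale_bytes[i:i + 4])
        for i in range(0, len(scale_bytes), 4)
    ]


row = [0x38] * 8                         # 8 groups = 128 elements at GROUP_K=16
print(pack_row_scales(row))              # two int32 words
```

Keeping each word to four bytes means no shift ever reaches 32 bits.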
220649c188 docs: CORRECTED — mxf4nvf4 IS supported on sm_100a (B200)
Build 17-18 'scale_vec not supported' error was because we targeted
sm_100 instead of sm_100a. The 'a' suffix enables FP4 block-scaled
instructions. No need to fall back to mxf8f6f4 with UE8M0.

Path forward: target sm_100a, use mxf4nvf4.scale_vec::4X, keep
native UE4M3 scales + block16. No scale conversion needed.
2026-05-11 14:24:13 +00:00
cfead0012d docs: comprehensive README update through build 22
- Full architecture diagram and NVFP4→vLLM bridge details
- All 8 bugs documented with fixes
- SM100 hardware limitation (mxf4nvf4 unsupported)
- MegaMoE kernel architecture and debugging log (builds 1-22)
- Three paths forward (A: FlashInfer, B: BF16 mega_moe, C: SM103+)
- Container build pipeline, NVFP4 format spec, hard rules
2026-05-11 13:53:41 +00:00
8cb23bdb78 fix: import NVFP4 SymmBuffer from deep_gemm.mega 2026-05-11 08:05:50 +00:00
ff579c9767 fix: use NVFP4 SymmBuffer (2x SF size for group_size=16)
The NVFP4 mega_moe kernel needs a larger symmetric buffer because
group_size=16 produces 2x more scale factor entries than MXFP4's 32.
Switch from deep_gemm.get_symm_buffer_for_mega_moe to
deep_gemm.mega.nvfp4.get_symm_buffer_for_nvfp4_mega_moe.
2026-05-11 07:49:11 +00:00
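
The 2x figure in the commit above follows from the group sizes alone. A tiny sketch, using a hypothetical hidden size of 7168 that does not come from the source:

```python
def scale_entries(k: int, group_size: int) -> int:
    """Number of block scales needed to cover k elements (ceiling division)."""
    return -(-k // group_size)


hidden = 7168                            # hypothetical, for illustration only
nvfp4 = scale_entries(hidden, 16)        # NVFP4: one scale per 16 elements
mxfp4 = scale_entries(hidden, 32)        # MXFP4: one scale per 32 elements
print(nvfp4, mxfp4, nvfp4 // mxfp4)
```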
1da40c53da fix: add patch cache buster to Dockerfile 2026-05-11 07:19:10 +00:00
b532742530 debug: add shape/dtype logging to finalize_weights 2026-05-11 07:13:44 +00:00
b1cf4232ee feat: wire DeepGEMM NVFP4 mega_moe kernel into vLLM patch
- DeepseekV4MegaMoEExperts now uses native NVFP4 path
- finalize_weights: transform_nvfp4_weights_for_mega_moe() instead of
  NVFP4→BF16→MXFP4 conversion
- forward: fp8_nvfp4_mega_moe() with recipe=(1,1,16)
- Experts stay in NVFP4. No MXFP4 conversion. Period.
2026-05-11 06:22:11 +00:00
a2e9b5f17f fix: add --enable-expert-parallel to compose command 2026-05-11 06:15:11 +00:00
c8564caf9d fix: patch vLLM deepseek_v4.py directly in image 2026-05-11 06:09:40 +00:00
7c8c6cd67f fix: add PYTHONPATH for deep_gemm import 2026-05-11 06:06:52 +00:00
cffb373759 fix: symlink NVRTC lib into cuda/lib64 for linker 2026-05-11 06:04:24 +00:00
983ba02c5b fix: add CUDA/NVRTC lib paths to Dockerfile 2026-05-11 06:02:13 +00:00
f0471ed1c2 fix: correct CR URL to atl.vultrcr.com 2026-05-11 05:59:06 +00:00
c234190a80 feat: add Dockerfile + build/push script for NVFP4 container
- Extends dream-build with DeepGEMM nvfp4-mega-moe kernel
- build_push.sh: builds, logs into Vultr CR, pushes, updates docker-compose
- CACHE_BUSTER parameter for forcing fresh clones
2026-05-11 05:57:49 +00:00
e963325b61 WIP: MegaMoE NVFP4 kernel + diagnostics
- Force use_mega_moe=True for NVFP4 pipeline
- DeepseekV4MegaMoEExperts: load NVFP4 params (float8 block scales,
  float32 global/input scales), convert NVFP4→BF16→MXFP4 in
  finalize_weights for the DeepGEMM mega_moe kernel
- Add _nvfp4_to_bf16 and _bf16_to_mxfp4 conversion methods
- Remove expert_dtype check blocking mega_moe
- Add diagnostics for wo_a and bf16 layer conversion
- Still WIP: attention layer bugs under investigation
2026-05-11 05:19:49 +00:00
7e2f219259 fix: banner uses _os instead of os (not yet imported) 2026-05-11 04:57:24 +00:00