Commit Graph

116 Commits

Author SHA1 Message Date
5ea5b579c3 Trim banner, no code changes 2026-05-12 07:24:36 +00:00
74af9984f6 Bug fixes: UE4M3 scale conversion, staging kernel SF/E2M1 packing, wo_a UE4M3, README overhaul
- Fix _ue8m0_to_float32: checkpoint is float8_e4m3fn (UE4M3), not UE8M0
  - Changed from shift-by-23 to .to(torch.float32) in both copies
  - Fix fold_global_scale in DeepGEMM mega/__init__.py
- Fix staging kernel SF pack: int32 shift >= 32 is UB on GPU
  - Split 8-group pack into two int32 writes (groups 0-3, 4-7)
- Fix staging kernel E2M1 output: was writing unpacked (1 byte/elem)
  into packed buffer (hidden/2 bytes), causing 2x overflow
  - Now packs even/odd nibble pairs correctly
- Fix wo_a on-the-fly BF16→NVFP4: was encoding UE8M0, now UE4M3
  - Use .clamp(0, 448).to(float8_e4m3fn) instead of log2/exp trick
- Remove dead code: _ue8m0_uint8_to_float, tmp/, .bak, .s11,
  quant_module_patched.py, patch_finegrained_fp8_blackwell.py,
  patch_vllm_weights.py
- Remove SCALE-FMT diagnostic histogram clutter
- Update stale UE8M0 comments throughout
- Rewrite README: clean instructions, confirmed format details
2026-05-12 05:52:30 +00:00
a36bf47f11 fix: use tl.split instead of indexing for E2M1 pair packing
Triton doesn't support constexpr tensor indexing (e2m1_pairs[:, 0]).
Use tl.split() which splits the last axis into two tensors.
2026-05-11 22:39:38 +00:00
27dbf2850f fix: replace nested tl.where with sum-of-comparisons for E2M1 quantization
Triton can't compile deeply nested tl.where. Use arithmetic instead:
idx = sum(abs_s >= threshold_i) for 7 threshold values.
2026-05-11 22:23:05 +00:00
3d1f3de190 fix: syntax error — move triton imports before docstring, remove orphan @triton.jit 2026-05-11 22:08:50 +00:00
79d866995f bump cache buster 32 for packed FP4 mxf4nvf4 fix 2026-05-11 21:59:56 +00:00
c85b84b0fe fix: staging kernel outputs unpacked E2M1 (1 byte/element, not packed 2/byte)
Matches the SMEM layout: float_e2m1_unpacksmem_t is 1 byte/element.
L1→L2 handoff uses unpacked format (same byte count as FP8).
No bandwidth savings at L1→L2 for v1 — can optimize later.
2026-05-11 21:29:33 +00:00
01cfd02759 fix: same reshape fix in main patch file 2026-05-11 21:05:54 +00:00
076d325c97 fix: use reshape instead of risky [0::2] slicing for E2M1 packing 2026-05-11 21:04:53 +00:00
8dc917c498 fix: topk_weights_out store missing topk_offsets multiplier 2026-05-11 21:02:19 +00:00
17ba5a9d7b bump cache buster 30 for FP4 staging + DeepGEMM FP4 activations 2026-05-11 20:30:14 +00:00
7a4403fa98 feat: FP4 staging kernel - BF16 → E2M1 packed + UE4M3 block16 scales
mxf4nvf4 requires FP4×FP4, not FP8×FP4.
- New staging kernel: E2M1 nearest-neighbor quantization
- Output: uint8 packed (2 E2M1 per byte) + UE4M3 packed int32 scales
- Added CUDA sync diagnostics for error localization
2026-05-11 20:29:36 +00:00
0fd2d4f078 diag: add weight_scale uint8 histogram to verify E8M0 vs E4M3 format 2026-05-11 19:55:41 +00:00
50a945bde4 bump cache buster 29 2026-05-11 19:51:48 +00:00
48b905406a diag: add CUDA sync after mega_moe finalize + forward to catch errors 2026-05-11 19:51:44 +00:00
35f6b66678 fix: UE8M0 reinterpret in DeepGEMM fold_global_scale + bump cache 2026-05-11 19:40:08 +00:00
f32d6b5b48 bump cache buster to 27 2026-05-11 19:26:21 +00:00
cd24182e36 diag: add NaN/Inf + FP8-dtype checks after NVFP4 dequant 2026-05-11 19:26:12 +00:00
8ae2214bad fix: reorder Dockerfile ARG before COPY for proper cache busting 2026-05-11 18:48:07 +00:00
c4891e9ee2 fix: manual FP32→UE4M3 quant in Triton staging kernel
Triton can't cast float8e4nv → uint8 directly. Compute E4M3 bits manually:
extract FP32 exponent/mantissa, convert to E4M3 format (4-bit exp + 3-bit mant),
handle rounding and overflow, reconstruct dequantized value for FP8 activation quantization.
2026-05-11 16:38:52 +00:00
436109081c bump cache buster to 24 2026-05-11 16:12:56 +00:00
5faf9916eb fix: UE4M3 activation scales + group_size=16 for NVFP4 mega_moe
The mxf4nvf4 MMA instruction shares scale_format_ between SFA and SFB.
For NVFP4 (UE4M3), both activation and weight scales must be UE4M3.

Changes to _stage_deepseek_v4_mega_moe_inputs_kernel:
- GROUP_K=16 (was 32) — NVFP4 scale_vec::4X has group_size=16
- Scale quantization: float → float8_e4m3fn (UE4M3) instead of UE8M0
  exponent extraction (>> 23). Pack 4 UE4M3 bytes per int32.
- FP8 activation quantized against UE4M3 rounded scale

Also updated class docstring (was stale MXFP4 conversion description).
2026-05-11 16:12:36 +00:00
220649c188 docs: CORRECTED — mxf4nvf4 IS supported on sm_100a (B200)
Build 17-18 'scale_vec not supported' error was because we targeted
sm_100 instead of sm_100a. The 'a' suffix enables FP4 block-scaled
instructions. No need to fall back to mxf8f6f4 with UE8M0.

Path forward: target sm_100a, use mxf4nvf4.scale_vec::4X, keep
native UE4M3 scales + block16. No scale conversion needed.
2026-05-11 14:24:13 +00:00
cfead0012d docs: comprehensive README update through build 22
- Full architecture diagram and NVFP4→vLLM bridge details
- All 8 bugs documented with fixes
- SM100 hardware limitation (mxf4nvf4 unsupported)
- MegaMoE kernel architecture and debugging log (builds 1-22)
- Three paths forward (A: FlashInfer, B: BF16 mega_moe, C: SM103+)
- Container build pipeline, NVFP4 format spec, hard rules
2026-05-11 13:53:41 +00:00
8cb23bdb78 fix: import NVFP4 SymmBuffer from deep_gemm.mega 2026-05-11 08:05:50 +00:00
ff579c9767 fix: use NVFP4 SymmBuffer (2x SF size for group_size=16)
The NVFP4 mega_moe kernel needs a larger symmetric buffer because
group_size=16 produces 2x more scale factor entries than MXFP4's 32.
Switch from deep_gemm.get_symm_buffer_for_mega_moe to
deep_gemm.mega.nvfp4.get_symm_buffer_for_nvfp4_mega_moe.
2026-05-11 07:49:11 +00:00
1da40c53da fix: add patch cache buster to Dockerfile 2026-05-11 07:19:10 +00:00
b532742530 debug: add shape/dtype logging to finalize_weights 2026-05-11 07:13:44 +00:00
b1cf4232ee feat: wire DeepGEMM NVFP4 mega_moe kernel into vLLM patch
- DeepseekV4MegaMoEExperts now uses native NVFP4 path
- finalize_weights: transform_nvfp4_weights_for_mega_moe() instead of
  NVFP4→BF16→MXFP4 conversion
- forward: fp8_nvfp4_mega_moe() with recipe=(1,1,16)
- Experts stay in NVFP4. No MXFP4 conversion. Period.
2026-05-11 06:22:11 +00:00
a2e9b5f17f fix: add --enable-expert-parallel to compose command 2026-05-11 06:15:11 +00:00
c8564caf9d fix: patch vLLM deepseek_v4.py directly in image 2026-05-11 06:09:40 +00:00
7c8c6cd67f fix: add PYTHONPATH for deep_gemm import 2026-05-11 06:06:52 +00:00
cffb373759 fix: symlink NVRTC lib into cuda/lib64 for linker 2026-05-11 06:04:24 +00:00
983ba02c5b fix: add CUDA/NVRTC lib paths to Dockerfile 2026-05-11 06:02:13 +00:00
f0471ed1c2 fix: correct CR URL to atl.vultrcr.com 2026-05-11 05:59:06 +00:00
c234190a80 feat: add Dockerfile + build/push script for NVFP4 container
- Extends dream-build with DeepGEMM nvfp4-mega-moe kernel
- build_push.sh: builds, logs into Vultr CR, pushes, updates docker-compose
- CACHE_BUSTER parameter for forcing fresh clones
2026-05-11 05:57:49 +00:00
e963325b61 WIP: MegaMoE NVFP4 kernel + diagnostics
- Force use_mega_moe=True for NVFP4 pipeline
- DeepseekV4MegaMoEExperts: load NVFP4 params (float8 block scales,
  float32 global/input scales), convert NVFP4→BF16→MXFP4 in
  finalize_weights for the DeepGEMM mega_moe kernel
- Add _nvfp4_to_bf16 and _bf16_to_mxfp4 conversion methods
- Remove expert_dtype check blocking mega_moe
- Add diagnostics for wo_a and bf16 layer conversion
- Still WIP: attention layer bugs under investigation
2026-05-11 05:19:49 +00:00
7e2f219259 fix: banner uses _os instead of os (not yet imported) 2026-05-11 04:57:24 +00:00
cf54b4755a fix CRITICAL #7: UE8M0 block scale misinterpreted as E4M3
scale_fmt=ue8m0 means weight_scale bytes are E8M0 format (power-of-2 only).
A simple .to(float32) misinterprets them as E4M3 (which has mantissa bits),
producing completely wrong block scale values and garbled output.

Fix: add _ue8m0_to_float32() that reinterprets raw uint8 bits as IEEE 754
exponent field: (raw_byte << 23).view(float32) = 2^(raw-127).

Applied to:
- _dequant_nvfp4_to_bf16 (BF16 layers: fused_wqa_wkv, wq_b, wo_b)
- _convert_nvfp4_to_fp8 (wo_a FP8 conversion)
- _reconstruct_compressor_weight (compressor fused_wkv_wgate)
- BF16->FP4 quantization path (stores as UE8M0, reads back correctly)
2026-05-11 04:37:33 +00:00
7febeaeb71 README: document bugs #5 (input_scale) and #6 (fused_skip_regex), add version banner section, update status 2026-05-11 04:28:38 +00:00
26aaaba4a2 Add version banner to patch — prints commit, arch, bugs fixed at startup
Ensures we can always verify what's running inside the container
from the docker logs. No functional changes.
2026-05-11 04:28:10 +00:00
67f9086a26 Fix critical dequantization bug: remove input_scale from weight dequant
input_scale is for ACTIVATIONS, not weights. The correct NVFP4 weight
dequantization formula is: weight_bf16 = e2m1_value * block_scale * global_scale

Including input_scale made weights ~5000x too small, causing completely
garbled output (multilingual gibberish with repeating patterns).
2026-05-11 02:23:26 +00:00
02b8ea536f Update MEMORY.md and memory files with vLLM NVFP4 serving progress
Server running on B200 port 8000 with full NVFP4→vLLM bridge.
All critical bugs fixed: DeepGEMM scale format, compressor shapes, block scale values.
2026-05-11 02:02:49 +00:00
653e2d7a50 vLLM NVFP4 serving: full end-to-end pipeline working
Bridged the gap between ModelOpt NVFP4 and vLLM DeepSeek V4 attention.
Server loads and serves tokens on 8x B200 with TP=8, EP=8.

Key changes:
- wo_a: NVFP4->BF16->FP8 with DeepGEMM block-scale format for BMM einsum
  Uses deepgemm_post_process_fp8_weight_block for correct scale layout
  weight_scale_inv = DeepGEMM-formatted block scale (NOT per-tensor scalar)
  Block scale filled with fp8_scale (NOT all-ones -- causes garbage output)
- Attention: NVFP4->BF16 dequantization, UnquantizedLinearMethod
- Compressor: reconstruct fused_wkv_wgate from separate kv_proj+gate_proj
  Fixed indexer path: compressor.indexer.kv_proj (was loading main compressor)
- MoE experts: stay NVFP4, FLASHINFER_TRTLLM FusedMoE backend

Bugs fixed:
1. DeepGEMM sf.dim() assertion: weight_scale_inv must be block-scale tensor
2. Block scale dtype: float32 (not float8_e4m3fn)
3. Missing deepgemm_post_process args: quant_block_shape, use_e8m0
4. Compressor indexer shape mismatch: wrong checkpoint key prefix
5. All-ones block scale: DeepGEMM divides by 1.0 instead of actual scale

Updated README with full technical documentation of all fixes.
2026-05-11 02:01:46 +00:00
db16be8e5d S11: Fixed substr mapping, stacking, suffix, and o_a_proj - loads weights but attention forward uses FP8 einsum incompatible with NVFP4 2026-05-10 17:45:53 +00:00
6fd03a0aa0 vLLM serving: patched deepseek_v4.py, disabled mega_moe, updated docs
- Add patches/deepseek_v4.py: patched vllm source file with modelopt NVFP4
  weight name mappings (expert gate_proj→w1, mlp→ffn, self_attn→attn.mla_attn,
  compressor.kv_proj→wkv, etc.), E2M1 FP4→BF16 unpacking for stacked params,
  skip patterns for NVFP4 scale tensors on MergedColumnParallelLinear, and
  resilient loading for unknown params.

- Update docker-compose.yml: copy patched deepseek_v4.py over original at
  container startup, remove --moe-backend=deep_gemm_mega_moe (no NVFP4 kernel).

- Update patches/patch_vllm_weights.py: legacy runtime monkey-patch approach
  (doesn't work with worker processes), kept for reference.

- Update README.md: added vLLM serving run history table (S1-S10), documented
  all open issues (MergedColumnParallelLinear+NVFP4, no mega_moe kernel,
  resilient loading), added vLLM-specific bug list and key notes.

- Update scripts/serve_vllm.py: add WARN comment on mega_moe flag.
2026-05-10 16:14:17 +00:00
d88793dee6 Add vllm weight mapper patch and docker-compose 2026-05-10 09:33:48 +00:00
30608e3834 Config patches: document modelopt↔vllm gaps with NVIDIA reference 2026-05-10 08:59:28 +00:00
0d74b97fb2 Config patches doc + compress_ratios runtime patch in serve script 2026-05-10 08:23:11 +00:00
f65d4ab99f Run 11 SUCCESS: 881GB NVFP4 exported, add vLLM serve script 2026-05-10 07:54:34 +00:00