5ea5b579c3
Trim banner, no code changes
2026-05-12 07:24:36 +00:00
74af9984f6
Bug fixes: UE4M3 scale conversion, staging kernel SF/E2M1 packing, wo_a UE4M3, README overhaul
...
- Fix _ue8m0_to_float32: checkpoint is float8_e4m3fn (UE4M3), not UE8M0
- Changed from shift-by-23 to .to(torch.float32) in both copies
- Fix fold_global_scale in DeepGEMM mega/__init__.py
- Fix staging kernel SF pack: int32 shift >= 32 is UB on GPU
- Split 8-group pack into two int32 writes (groups 0-3, 4-7)
- Fix staging kernel E2M1 output: was writing unpacked (1 byte/elem)
into packed buffer (hidden/2 bytes), causing 2x overflow
- Now packs even/odd nibble pairs correctly
- Fix wo_a on-the-fly BF16→NVFP4: was encoding UE8M0, now UE4M3
- Use .clamp(0, 448).to(float8_e4m3fn) instead of log2/exp trick
- Remove dead code: _ue8m0_uint8_to_float, tmp/, .bak, .s11,
quant_module_patched.py, patch_finegrained_fp8_blackwell.py,
patch_vllm_weights.py
- Remove SCALE-FMT diagnostic histogram clutter
- Update stale UE8M0 comments throughout
- Rewrite README: clean instructions, confirmed format details
2026-05-12 05:52:30 +00:00
a36bf47f11
fix: use tl.split instead of indexing for E2M1 pair packing
...
Triton doesn't support constexpr tensor indexing (e2m1_pairs[:, 0]).
Use tl.split() which splits the last axis into two tensors.
2026-05-11 22:39:38 +00:00
27dbf2850f
fix: replace nested tl.where with sum-of-comparisons for E2M1 quantization
...
Triton can't compile deeply nested tl.where. Use arithmetic instead:
idx = sum(abs_s >= threshold_i) for 7 threshold values.
2026-05-11 22:23:05 +00:00
3d1f3de190
fix: syntax error — move triton imports before docstring, remove orphan @triton.jit
2026-05-11 22:08:50 +00:00
79d866995f
bump cache buster 32 for packed FP4 mxf4nvf4 fix
2026-05-11 21:59:56 +00:00
c85b84b0fe
fix: staging kernel outputs unpacked E2M1 (1 byte/element, not packed 2/byte)
...
Matches the SMEM layout: float_e2m1_unpacksmem_t is 1 byte/element.
L1→L2 handoff uses unpacked format (same byte count as FP8).
No bandwidth savings at L1→L2 for v1 — can optimize later.
2026-05-11 21:29:33 +00:00
01cfd02759
fix: same reshape fix in main patch file
2026-05-11 21:05:54 +00:00
076d325c97
fix: use reshape instead of risky [0::2] slicing for E2M1 packing
2026-05-11 21:04:53 +00:00
8dc917c498
fix: topk_weights_out store missing topk_offsets multiplier
2026-05-11 21:02:19 +00:00
17ba5a9d7b
bump cache buster 30 for FP4 staging + DeepGEMM FP4 activations
2026-05-11 20:30:14 +00:00
7a4403fa98
feat: FP4 staging kernel - BF16 → E2M1 packed + UE4M3 block16 scales
...
mxf4nvf4 requires FP4×FP4, not FP8×FP4.
- New staging kernel: E2M1 nearest-neighbor quantization
- Output: uint8 packed (2 E2M1 per byte) + UE4M3 packed int32 scales
- Added CUDA sync diagnostics for error localization
2026-05-11 20:29:36 +00:00
0fd2d4f078
diag: add weight_scale uint8 histogram to verify E8M0 vs E4M3 format
2026-05-11 19:55:41 +00:00
50a945bde4
bump cache buster 29
2026-05-11 19:51:48 +00:00
48b905406a
diag: add CUDA sync after mega_moe finalize + forward to catch errors
2026-05-11 19:51:44 +00:00
35f6b66678
fix: UE8M0 reinterpret in DeepGEMM fold_global_scale + bump cache
2026-05-11 19:40:08 +00:00
f32d6b5b48
bump cache buster to 27
2026-05-11 19:26:21 +00:00
cd24182e36
diag: add NaN/Inf + FP8-dtype checks after NVFP4 dequant
2026-05-11 19:26:12 +00:00
8ae2214bad
fix: reorder Dockerfile ARG before COPY for proper cache busting
2026-05-11 18:48:07 +00:00
c4891e9ee2
fix: manual FP32→UE4M3 quant in Triton staging kernel
...
Triton can't cast float8e4nv → uint8 directly. Compute E4M3 bits manually:
extract FP32 exponent/mantissa, convert to E4M3 format (4-bit exp + 3-bit mant),
handle rounding and overflow, reconstruct dequantized value for FP8 activation quantization.
2026-05-11 16:38:52 +00:00
436109081c
bump cache buster to 24
2026-05-11 16:12:56 +00:00
5faf9916eb
fix: UE4M3 activation scales + group_size=16 for NVFP4 mega_moe
...
The mxf4nvf4 MMA instruction shares scale_format_ between SFA and SFB.
For NVFP4 (UE4M3), both activation and weight scales must be UE4M3.
Changes to _stage_deepseek_v4_mega_moe_inputs_kernel:
- GROUP_K=16 (was 32) — NVFP4 scale_vec::4X has group_size=16
- Scale quantization: float → float8_e4m3fn (UE4M3) instead of UE8M0
exponent extraction (>> 23). Pack 4 UE4M3 bytes per int32.
- FP8 activation quantized against UE4M3 rounded scale
Also updated class docstring (was stale MXFP4 conversion description).
2026-05-11 16:12:36 +00:00
220649c188
docs: CORRECTED — mxf4nvf4 IS supported on sm_100a (B200)
...
Build 17-18 'scale_vec not supported' error was because we targeted
sm_100 instead of sm_100a. The 'a' suffix enables FP4 block-scaled
instructions. No need to fall back to mxf8f6f4 with UE8M0.
Path forward: target sm_100a, use mxf4nvf4.scale_vec::4X, keep
native UE4M3 scales + block16. No scale conversion needed.
2026-05-11 14:24:13 +00:00
cfead0012d
docs: comprehensive README update through build 22
...
- Full architecture diagram and NVFP4→vLLM bridge details
- All 8 bugs documented with fixes
- SM100 hardware limitation (mxf4nvf4 unsupported)
- MegaMoE kernel architecture and debugging log (builds 1-22)
- Three paths forward (A: FlashInfer, B: BF16 mega_moe, C: SM103+)
- Container build pipeline, NVFP4 format spec, hard rules
2026-05-11 13:53:41 +00:00
8cb23bdb78
fix: import NVFP4 SymmBuffer from deep_gemm.mega
2026-05-11 08:05:50 +00:00
ff579c9767
fix: use NVFP4 SymmBuffer (2x SF size for group_size=16)
...
The NVFP4 mega_moe kernel needs a larger symmetric buffer because
group_size=16 produces 2x more scale factor entries than MXFP4's 32.
Switch from deep_gemm.get_symm_buffer_for_mega_moe to
deep_gemm.mega.nvfp4.get_symm_buffer_for_nvfp4_mega_moe.
2026-05-11 07:49:11 +00:00
1da40c53da
fix: add patch cache buster to Dockerfile
2026-05-11 07:19:10 +00:00
b532742530
debug: add shape/dtype logging to finalize_weights
2026-05-11 07:13:44 +00:00
b1cf4232ee
feat: wire DeepGEMM NVFP4 mega_moe kernel into vLLM patch
...
- DeepseekV4MegaMoEExperts now uses native NVFP4 path
- finalize_weights: transform_nvfp4_weights_for_mega_moe() instead of
NVFP4→BF16→MXFP4 conversion
- forward: fp8_nvfp4_mega_moe() with recipe=(1,1,16)
- Experts stay in NVFP4. No MXFP4 conversion. Period.
2026-05-11 06:22:11 +00:00
a2e9b5f17f
fix: add --enable-expert-parallel to compose command
2026-05-11 06:15:11 +00:00
c8564caf9d
fix: patch vLLM deepseek_v4.py directly in image
2026-05-11 06:09:40 +00:00
7c8c6cd67f
fix: add PYTHONPATH for deep_gemm import
2026-05-11 06:06:52 +00:00
cffb373759
fix: symlink NVRTC lib into cuda/lib64 for linker
2026-05-11 06:04:24 +00:00
983ba02c5b
fix: add CUDA/NVRTC lib paths to Dockerfile
2026-05-11 06:02:13 +00:00
f0471ed1c2
fix: correct CR URL to atl.vultrcr.com
2026-05-11 05:59:06 +00:00
c234190a80
feat: add Dockerfile + build/push script for NVFP4 container
...
- Extends dream-build with DeepGEMM nvfp4-mega-moe kernel
- build_push.sh: builds, logs into Vultr CR, pushes, updates docker-compose
- CACHE_BUSTER parameter for forcing fresh clones
2026-05-11 05:57:49 +00:00
e963325b61
WIP: MegaMoE NVFP4 kernel + diagnostics
...
- Force use_mega_moe=True for NVFP4 pipeline
- DeepseekV4MegaMoEExperts: load NVFP4 params (float8 block scales,
float32 global/input scales), convert NVFP4→BF16→MXFP4 in
finalize_weights for the DeepGEMM mega_moe kernel
- Add _nvfp4_to_bf16 and _bf16_to_mxfp4 conversion methods
- Remove expert_dtype check blocking mega_moe
- Add diagnostics for wo_a and bf16 layer conversion
- Still WIP: attention layer bugs under investigation
2026-05-11 05:19:49 +00:00
7e2f219259
fix: banner uses _os instead of os (not yet imported)
2026-05-11 04:57:24 +00:00
cf54b4755a
fix CRITICAL #7 : UE8M0 block scale misinterpreted as E4M3
...
scale_fmt=ue8m0 means weight_scale bytes are E8M0 format (power-of-2 only).
A simple .to(float32) misinterprets them as E4M3 (which has mantissa bits),
producing completely wrong block scale values and garbled output.
Fix: add _ue8m0_to_float32() that reinterprets raw uint8 bits as IEEE 754
exponent field: (raw_byte << 23).view(float32) = 2^(raw-127).
Applied to:
- _dequant_nvfp4_to_bf16 (BF16 layers: fused_wqa_wkv, wq_b, wo_b)
- _convert_nvfp4_to_fp8 (wo_a FP8 conversion)
- _reconstruct_compressor_weight (compressor fused_wkv_wgate)
- BF16->FP4 quantization path (stores as UE8M0, reads back correctly)
2026-05-11 04:37:33 +00:00
7febeaeb71
README: document bugs #5 (input_scale) and #6 (fused_skip_regex), add version banner section, update status
2026-05-11 04:28:38 +00:00
26aaaba4a2
Add version banner to patch — prints commit, arch, bugs fixed at startup
...
Ensures we can always verify what's running inside the container
from the docker logs. No functional changes.
2026-05-11 04:28:10 +00:00
67f9086a26
Fix critical dequantization bug: remove input_scale from weight dequant
...
input_scale is for ACTIVATIONS, not weights. The correct NVFP4 weight
dequantization formula is: weight_bf16 = e2m1_value * block_scale * global_scale
Including input_scale made weights ~5000x too small, causing completely
garbled output (multilingual gibberish with repeating patterns).
2026-05-11 02:23:26 +00:00
02b8ea536f
Update MEMORY.md and memory files with vLLM NVFP4 serving progress
...
Server running on B200 port 8000 with full NVFP4→vLLM bridge.
All critical bugs fixed: DeepGEMM scale format, compressor shapes, block scale values.
2026-05-11 02:02:49 +00:00
653e2d7a50
vLLM NVFP4 serving: full end-to-end pipeline working
...
Bridged the gap between ModelOpt NVFP4 and vLLM DeepSeek V4 attention.
Server loads and serves tokens on 8x B200 with TP=8, EP=8.
Key changes:
- wo_a: NVFP4->BF16->FP8 with DeepGEMM block-scale format for BMM einsum
Uses deepgemm_post_process_fp8_weight_block for correct scale layout
weight_scale_inv = DeepGEMM-formatted block scale (NOT per-tensor scalar)
Block scale filled with fp8_scale (NOT all-ones -- causes garbage output)
- Attention: NVFP4->BF16 dequantization, UnquantizedLinearMethod
- Compressor: reconstruct fused_wkv_wgate from separate kv_proj+gate_proj
Fixed indexer path: compressor.indexer.kv_proj (was loading main compressor)
- MoE experts: stay NVFP4, FLASHINFER_TRTLLM FusedMoE backend
Bugs fixed:
1. DeepGEMM sf.dim() assertion: weight_scale_inv must be block-scale tensor
2. Block scale dtype: float32 (not float8_e4m3fn)
3. Missing deepgemm_post_process args: quant_block_shape, use_e8m0
4. Compressor indexer shape mismatch: wrong checkpoint key prefix
5. All-ones block scale: DeepGEMM divides by 1.0 instead of actual scale
Updated README with full technical documentation of all fixes.
2026-05-11 02:01:46 +00:00
db16be8e5d
S11: Fixed substr mapping, stacking, suffix, and o_a_proj - loads weights but attention forward uses FP8 einsum incompatible with NVFP4
2026-05-10 17:45:53 +00:00
6fd03a0aa0
vLLM serving: patched deepseek_v4.py, disabled mega_moe, updated docs
...
- Add patches/deepseek_v4.py: patched vllm source file with modelopt NVFP4
weight name mappings (expert gate_proj→w1, mlp→ffn, self_attn→attn.mla_attn,
compressor.kv_proj→wkv, etc.), E2M1 FP4→BF16 unpacking for stacked params,
skip patterns for NVFP4 scale tensors on MergedColumnParallelLinear, and
resilient loading for unknown params.
- Update docker-compose.yml: copy patched deepseek_v4.py over original at
container startup, remove --moe-backend=deep_gemm_mega_moe (no NVFP4 kernel).
- Update patches/patch_vllm_weights.py: legacy runtime monkey-patch approach
(doesn't work with worker processes), kept for reference.
- Update README.md: added vLLM serving run history table (S1-S10), documented
all open issues (MergedColumnParallelLinear+NVFP4, no mega_moe kernel,
resilient loading), added vLLM-specific bug list and key notes.
- Update scripts/serve_vllm.py: add WARN comment on mega_moe flag.
2026-05-10 16:14:17 +00:00
d88793dee6
Add vllm weight mapper patch and docker-compose
2026-05-10 09:33:48 +00:00
30608e3834
Config patches: document modelopt↔vllm gaps with NVIDIA reference
2026-05-10 08:59:28 +00:00
0d74b97fb2
Config patches doc + compress_ratios runtime patch in serve script
2026-05-10 08:23:11 +00:00
f65d4ab99f
Run 11 SUCCESS: 881GB NVFP4 exported, add vLLM serve script
2026-05-10 07:54:34 +00:00