deepseek-v4-quant

Author	SHA1	Message	Date
biondizzle	5ea5b579c3	Trim banner, no code changes	2026-05-12 07:24:36 +00:00
biondizzle	74af9984f6	Bug fixes: UE4M3 scale conversion, staging kernel SF/E2M1 packing, wo_a UE4M3, README overhaul - Fix _ue8m0_to_float32: checkpoint is float8_e4m3fn (UE4M3), not UE8M0 - Changed from shift-by-23 to .to(torch.float32) in both copies - Fix fold_global_scale in DeepGEMM mega/__init__.py - Fix staging kernel SF pack: int32 shift >= 32 is UB on GPU - Split 8-group pack into two int32 writes (groups 0-3, 4-7) - Fix staging kernel E2M1 output: was writing unpacked (1 byte/elem) into packed buffer (hidden/2 bytes), causing 2x overflow - Now packs even/odd nibble pairs correctly - Fix wo_a on-the-fly BF16→NVFP4: was encoding UE8M0, now UE4M3 - Use .clamp(0, 448).to(float8_e4m3fn) instead of log2/exp trick - Remove dead code: _ue8m0_uint8_to_float, tmp/, .bak, .s11, quant_module_patched.py, patch_finegrained_fp8_blackwell.py, patch_vllm_weights.py - Remove SCALE-FMT diagnostic histogram clutter - Update stale UE8M0 comments throughout - Rewrite README: clean instructions, confirmed format details	2026-05-12 05:52:30 +00:00
biondizzle	a36bf47f11	fix: use tl.split instead of indexing for E2M1 pair packing Triton doesn't support constexpr tensor indexing (e2m1_pairs[:, 0]). Use tl.split() which splits the last axis into two tensors.	2026-05-11 22:39:38 +00:00
biondizzle	27dbf2850f	fix: replace nested tl.where with sum-of-comparisons for E2M1 quantization Triton can't compile deeply nested tl.where. Use arithmetic instead: idx = sum(abs_s >= threshold_i) for 7 threshold values.	2026-05-11 22:23:05 +00:00
biondizzle	3d1f3de190	fix: syntax error — move triton imports before docstring, remove orphan @triton.jit	2026-05-11 22:08:50 +00:00
biondizzle	79d866995f	bump cache buster 32 for packed FP4 mxf4nvf4 fix	2026-05-11 21:59:56 +00:00
biondizzle	c85b84b0fe	fix: staging kernel outputs unpacked E2M1 (1 byte/element, not packed 2/byte) Matches the SMEM layout: float_e2m1_unpacksmem_t is 1 byte/element. L1→L2 handoff uses unpacked format (same byte count as FP8). No bandwidth savings at L1→L2 for v1 — can optimize later.	2026-05-11 21:29:33 +00:00
biondizzle	01cfd02759	fix: same reshape fix in main patch file	2026-05-11 21:05:54 +00:00
biondizzle	076d325c97	fix: use reshape instead of risky [0::2] slicing for E2M1 packing	2026-05-11 21:04:53 +00:00
biondizzle	8dc917c498	fix: topk_weights_out store missing topk_offsets multiplier	2026-05-11 21:02:19 +00:00
biondizzle	17ba5a9d7b	bump cache buster 30 for FP4 staging + DeepGEMM FP4 activations	2026-05-11 20:30:14 +00:00
biondizzle	7a4403fa98	feat: FP4 staging kernel - BF16 → E2M1 packed + UE4M3 block16 scales mxf4nvf4 requires FP4×FP4, not FP8×FP4. - New staging kernel: E2M1 nearest-neighbor quantization - Output: uint8 packed (2 E2M1 per byte) + UE4M3 packed int32 scales - Added CUDA sync diagnostics for error localization	2026-05-11 20:29:36 +00:00
biondizzle	0fd2d4f078	diag: add weight_scale uint8 histogram to verify E8M0 vs E4M3 format	2026-05-11 19:55:41 +00:00
biondizzle	50a945bde4	bump cache buster 29	2026-05-11 19:51:48 +00:00
biondizzle	48b905406a	diag: add CUDA sync after mega_moe finalize + forward to catch errors	2026-05-11 19:51:44 +00:00
biondizzle	35f6b66678	fix: UE8M0 reinterpret in DeepGEMM fold_global_scale + bump cache	2026-05-11 19:40:08 +00:00
biondizzle	f32d6b5b48	bump cache buster to 27	2026-05-11 19:26:21 +00:00
biondizzle	cd24182e36	diag: add NaN/Inf + FP8-dtype checks after NVFP4 dequant	2026-05-11 19:26:12 +00:00
biondizzle	8ae2214bad	fix: reorder Dockerfile ARG before COPY for proper cache busting	2026-05-11 18:48:07 +00:00
biondizzle	c4891e9ee2	fix: manual FP32→UE4M3 quant in Triton staging kernel Triton can't cast float8e4nv → uint8 directly. Compute E4M3 bits manually: extract FP32 exponent/mantissa, convert to E4M3 format (4-bit exp + 3-bit mant), handle rounding and overflow, reconstruct dequantized value for FP8 activation quantization.	2026-05-11 16:38:52 +00:00
biondizzle	436109081c	bump cache buster to 24	2026-05-11 16:12:56 +00:00
biondizzle	5faf9916eb	fix: UE4M3 activation scales + group_size=16 for NVFP4 mega_moe The mxf4nvf4 MMA instruction shares scale_format_ between SFA and SFB. For NVFP4 (UE4M3), both activation and weight scales must be UE4M3. Changes to _stage_deepseek_v4_mega_moe_inputs_kernel: - GROUP_K=16 (was 32) — NVFP4 scale_vec::4X has group_size=16 - Scale quantization: float → float8_e4m3fn (UE4M3) instead of UE8M0 exponent extraction (>> 23). Pack 4 UE4M3 bytes per int32. - FP8 activation quantized against UE4M3 rounded scale Also updated class docstring (was stale MXFP4 conversion description).	2026-05-11 16:12:36 +00:00
biondizzle	220649c188	docs: CORRECTED — mxf4nvf4 IS supported on sm_100a (B200) Build 17-18 'scale_vec not supported' error was because we targeted sm_100 instead of sm_100a. The 'a' suffix enables FP4 block-scaled instructions. No need to fall back to mxf8f6f4 with UE8M0. Path forward: target sm_100a, use mxf4nvf4.scale_vec::4X, keep native UE4M3 scales + block16. No scale conversion needed.	2026-05-11 14:24:13 +00:00
biondizzle	cfead0012d	docs: comprehensive README update through build 22 - Full architecture diagram and NVFP4→vLLM bridge details - All 8 bugs documented with fixes - SM100 hardware limitation (mxf4nvf4 unsupported) - MegaMoE kernel architecture and debugging log (builds 1-22) - Three paths forward (A: FlashInfer, B: BF16 mega_moe, C: SM103+) - Container build pipeline, NVFP4 format spec, hard rules	2026-05-11 13:53:41 +00:00
biondizzle	8cb23bdb78	fix: import NVFP4 SymmBuffer from deep_gemm.mega	2026-05-11 08:05:50 +00:00
biondizzle	ff579c9767	fix: use NVFP4 SymmBuffer (2x SF size for group_size=16) The NVFP4 mega_moe kernel needs a larger symmetric buffer because group_size=16 produces 2x more scale factor entries than MXFP4's 32. Switch from deep_gemm.get_symm_buffer_for_mega_moe to deep_gemm.mega.nvfp4.get_symm_buffer_for_nvfp4_mega_moe.	2026-05-11 07:49:11 +00:00
biondizzle	1da40c53da	fix: add patch cache buster to Dockerfile	2026-05-11 07:19:10 +00:00
biondizzle	b532742530	debug: add shape/dtype logging to finalize_weights	2026-05-11 07:13:44 +00:00
biondizzle	b1cf4232ee	feat: wire DeepGEMM NVFP4 mega_moe kernel into vLLM patch - DeepseekV4MegaMoEExperts now uses native NVFP4 path - finalize_weights: transform_nvfp4_weights_for_mega_moe() instead of NVFP4→BF16→MXFP4 conversion - forward: fp8_nvfp4_mega_moe() with recipe=(1,1,16) - Experts stay in NVFP4. No MXFP4 conversion. Period.	2026-05-11 06:22:11 +00:00
biondizzle	a2e9b5f17f	fix: add --enable-expert-parallel to compose command	2026-05-11 06:15:11 +00:00
biondizzle	c8564caf9d	fix: patch vLLM deepseek_v4.py directly in image	2026-05-11 06:09:40 +00:00
biondizzle	7c8c6cd67f	fix: add PYTHONPATH for deep_gemm import	2026-05-11 06:06:52 +00:00
biondizzle	cffb373759	fix: symlink NVRTC lib into cuda/lib64 for linker	2026-05-11 06:04:24 +00:00
biondizzle	983ba02c5b	fix: add CUDA/NVRTC lib paths to Dockerfile	2026-05-11 06:02:13 +00:00
biondizzle	f0471ed1c2	fix: correct CR URL to atl.vultrcr.com	2026-05-11 05:59:06 +00:00
biondizzle	c234190a80	feat: add Dockerfile + build/push script for NVFP4 container - Extends dream-build with DeepGEMM nvfp4-mega-moe kernel - build_push.sh: builds, logs into Vultr CR, pushes, updates docker-compose - CACHE_BUSTER parameter for forcing fresh clones	2026-05-11 05:57:49 +00:00
biondizzle	e963325b61	WIP: MegaMoE NVFP4 kernel + diagnostics - Force use_mega_moe=True for NVFP4 pipeline - DeepseekV4MegaMoEExperts: load NVFP4 params (float8 block scales, float32 global/input scales), convert NVFP4→BF16→MXFP4 in finalize_weights for the DeepGEMM mega_moe kernel - Add _nvfp4_to_bf16 and _bf16_to_mxfp4 conversion methods - Remove expert_dtype check blocking mega_moe - Add diagnostics for wo_a and bf16 layer conversion - Still WIP: attention layer bugs under investigation	2026-05-11 05:19:49 +00:00
biondizzle	7e2f219259	fix: banner uses _os instead of os (not yet imported)	2026-05-11 04:57:24 +00:00
biondizzle	cf54b4755a	fix CRITICAL #7 : UE8M0 block scale misinterpreted as E4M3 scale_fmt=ue8m0 means weight_scale bytes are E8M0 format (power-of-2 only). A simple .to(float32) misinterprets them as E4M3 (which has mantissa bits), producing completely wrong block scale values and garbled output. Fix: add _ue8m0_to_float32() that reinterprets raw uint8 bits as IEEE 754 exponent field: (raw_byte << 23).view(float32) = 2^(raw-127). Applied to: - _dequant_nvfp4_to_bf16 (BF16 layers: fused_wqa_wkv, wq_b, wo_b) - _convert_nvfp4_to_fp8 (wo_a FP8 conversion) - _reconstruct_compressor_weight (compressor fused_wkv_wgate) - BF16->FP4 quantization path (stores as UE8M0, reads back correctly)	2026-05-11 04:37:33 +00:00
biondizzle	7febeaeb71	README: document bugs #5 (input_scale) and #6 (fused_skip_regex), add version banner section, update status	2026-05-11 04:28:38 +00:00
biondizzle	26aaaba4a2	Add version banner to patch — prints commit, arch, bugs fixed at startup Ensures we can always verify what's running inside the container from the docker logs. No functional changes.	2026-05-11 04:28:10 +00:00
biondizzle	67f9086a26	Fix critical dequantization bug: remove input_scale from weight dequant input_scale is for ACTIVATIONS, not weights. The correct NVFP4 weight dequantization formula is: weight_bf16 = e2m1_value * block_scale * global_scale Including input_scale made weights ~5000x too small, causing completely garbled output (multilingual gibberish with repeating patterns).	2026-05-11 02:23:26 +00:00
biondizzle	02b8ea536f	Update MEMORY.md and memory files with vLLM NVFP4 serving progress Server running on B200 port 8000 with full NVFP4→vLLM bridge. All critical bugs fixed: DeepGEMM scale format, compressor shapes, block scale values.	2026-05-11 02:02:49 +00:00
biondizzle	653e2d7a50	vLLM NVFP4 serving: full end-to-end pipeline working Bridged the gap between ModelOpt NVFP4 and vLLM DeepSeek V4 attention. Server loads and serves tokens on 8x B200 with TP=8, EP=8. Key changes: - wo_a: NVFP4->BF16->FP8 with DeepGEMM block-scale format for BMM einsum Uses deepgemm_post_process_fp8_weight_block for correct scale layout weight_scale_inv = DeepGEMM-formatted block scale (NOT per-tensor scalar) Block scale filled with fp8_scale (NOT all-ones -- causes garbage output) - Attention: NVFP4->BF16 dequantization, UnquantizedLinearMethod - Compressor: reconstruct fused_wkv_wgate from separate kv_proj+gate_proj Fixed indexer path: compressor.indexer.kv_proj (was loading main compressor) - MoE experts: stay NVFP4, FLASHINFER_TRTLLM FusedMoE backend Bugs fixed: 1. DeepGEMM sf.dim() assertion: weight_scale_inv must be block-scale tensor 2. Block scale dtype: float32 (not float8_e4m3fn) 3. Missing deepgemm_post_process args: quant_block_shape, use_e8m0 4. Compressor indexer shape mismatch: wrong checkpoint key prefix 5. All-ones block scale: DeepGEMM divides by 1.0 instead of actual scale Updated README with full technical documentation of all fixes.	2026-05-11 02:01:46 +00:00
biondizzle	db16be8e5d	S11: Fixed substr mapping, stacking, suffix, and o_a_proj - loads weights but attention forward uses FP8 einsum incompatible with NVFP4	2026-05-10 17:45:53 +00:00
biondizzle	6fd03a0aa0	vLLM serving: patched deepseek_v4.py, disabled mega_moe, updated docs - Add patches/deepseek_v4.py: patched vllm source file with modelopt NVFP4 weight name mappings (expert gate_proj→w1, mlp→ffn, self_attn→attn.mla_attn, compressor.kv_proj→wkv, etc.), E2M1 FP4→BF16 unpacking for stacked params, skip patterns for NVFP4 scale tensors on MergedColumnParallelLinear, and resilient loading for unknown params. - Update docker-compose.yml: copy patched deepseek_v4.py over original at container startup, remove --moe-backend=deep_gemm_mega_moe (no NVFP4 kernel). - Update patches/patch_vllm_weights.py: legacy runtime monkey-patch approach (doesn't work with worker processes), kept for reference. - Update README.md: added vLLM serving run history table (S1-S10), documented all open issues (MergedColumnParallelLinear+NVFP4, no mega_moe kernel, resilient loading), added vLLM-specific bug list and key notes. - Update scripts/serve_vllm.py: add WARN comment on mega_moe flag.	2026-05-10 16:14:17 +00:00
biondizzle	d88793dee6	Add vllm weight mapper patch and docker-compose	2026-05-10 09:33:48 +00:00
biondizzle	30608e3834	Config patches: document modelopt↔vllm gaps with NVIDIA reference	2026-05-10 08:59:28 +00:00
biondizzle	0d74b97fb2	Config patches doc + compress_ratios runtime patch in serve script	2026-05-10 08:23:11 +00:00
biondizzle	f65d4ab99f	Run 11 SUCCESS: 881GB NVFP4 exported, add vLLM serve script	2026-05-10 07:54:34 +00:00
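Several commits in this log turn on how one block-scale byte is decoded. Commit cf54b4755a reads it as UE8M0 via `(raw_byte << 23).view(float32)`, commit 74af9984f6 later concludes the checkpoint stores float8_e4m3fn and switches to `.to(torch.float32)`, and commit 67f9086a26 gives the weight formula `e2m1_value * block_scale * global_scale`. The sketch below illustrates the two decodings in plain Python so the difference is visible on a single byte. The function names are hypothetical and are not taken from the repository; the E4M3 decoder follows the standard float8_e4m3fn layout (1 sign bit, 4 exponent bits with bias 7, 3 mantissa bits).

```python
import struct


def ue8m0_to_float(raw: int) -> float:
    """Decode a byte as UE8M0 by placing it in the FP32 exponent field.

    Mirrors the shift-by-23 trick from commit cf54b4755a: the result is
    2**(raw - 127) for raw in 1..254. With this trick raw == 0 yields 0.0
    and raw == 255 yields inf.
    """
    bits = (raw & 0xFF) << 23
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def e4m3fn_to_float(raw: int) -> float:
    """Decode a byte as float8_e4m3fn (sign, 4-bit exponent, 3-bit mantissa)."""
    sign = -1.0 if raw & 0x80 else 1.0
    exp = (raw >> 3) & 0xF
    mant = raw & 0x7
    if exp == 0xF and mant == 0x7:
        return float("nan")  # the only NaN encoding; the format has no inf
    if exp == 0:
        return sign * (mant / 8.0) * 2.0 ** -6  # subnormal range
    return sign * (1.0 + mant / 8.0) * 2.0 ** (exp - 7)


def dequant_weight(e2m1_value: float, block_scale_byte: int, global_scale: float) -> float:
    """Weight formula from commit 67f9086a26, with an E4M3 block scale.

    input_scale is deliberately absent: per that commit it applies to
    activations, not weights.
    """
    return e2m1_value * e4m3fn_to_float(block_scale_byte) * global_scale


if __name__ == "__main__":
    byte = 0x38
    print("as E4M3 :", e4m3fn_to_float(byte))
    print("as UE8M0:", ue8m0_to_float(byte))
    print("weight  :", dequant_weight(1.5, byte, 2.0))
```

The same byte 0x38 decodes to 1.0 as E4M3 but to 2^-71 as UE8M0, which is consistent with the garbled output the commit messages describe whenever the wrong interpretation was applied.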


116 Commits
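Commit 27dbf2850f replaces nested `tl.where` with `idx = sum(abs_s >= threshold_i)` over 7 thresholds, and commit 74af9984f6 fixes the staging kernel so it packs E2M1 values two per byte instead of one. A minimal plain-Python sketch of both ideas follows. It is not the Triton kernel. The commit messages do not list the thresholds, so the midpoints between adjacent E2M1 magnitudes are an assumption here, as is the nibble order (even element in the low nibble, odd element in the high nibble) and every function name.

```python
# The eight non-negative magnitudes representable in E2M1.
E2M1_VALUES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)

# Assumed thresholds: midpoints between adjacent magnitudes (7 values).
E2M1_THRESHOLDS = tuple((a + b) / 2 for a, b in zip(E2M1_VALUES, E2M1_VALUES[1:]))


def quantize_e2m1(x: float) -> int:
    """Return a 4-bit code: bit 3 is the sign, bits 0-2 index E2M1_VALUES.

    The index is the count of thresholds that |x| meets or exceeds, which
    avoids a chain of nested conditionals. Ties round away from zero.
    """
    mag = min(abs(x), E2M1_VALUES[-1])  # saturate at the largest magnitude
    idx = sum(mag >= t for t in E2M1_THRESHOLDS)
    sign = 0x8 if x < 0 else 0x0
    return sign | idx


def dequantize_e2m1(code: int) -> float:
    """Map a 4-bit code back to its real value."""
    mag = E2M1_VALUES[code & 0x7]
    return -mag if code & 0x8 else mag


def pack_e2m1(codes: list[int]) -> bytes:
    """Pack pairs of 4-bit codes into bytes (assumed order: even low, odd high)."""
    if len(codes) % 2:
        raise ValueError("need an even number of codes to pack two per byte")
    return bytes(
        (codes[i] & 0xF) | ((codes[i + 1] & 0xF) << 4)
        for i in range(0, len(codes), 2)
    )


if __name__ == "__main__":
    values = [0.3, -1.0, 2.4, 100.0]
    codes = [quantize_e2m1(v) for v in values]
    packed = pack_e2m1(codes)
    print(codes, packed.hex(), len(packed))
```

Packing halves the byte count relative to the codes list, which matches the `hidden/2` buffer size mentioned in commit 74af9984f6. Writing one byte per element into that buffer is the 2x overflow the commit describes.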