deepseek-v4-quant

Author	SHA1	Message	Date
biondizzle	054792c84e	dangit	2026-05-12 18:42:39 +00:00
biondizzle	de055b1e77	stupid clankers	2026-05-12 18:26:37 +00:00
biondizzle	307574bc91	test: signal alarm timeout for kernel hang	2026-05-12 15:14:39 +00:00
biondizzle	fcd6de0a60	test: simplify SF fill to avoid shape mismatch	2026-05-12 15:13:16 +00:00
biondizzle	d4c557fddc	test: fix float8 randn + SF int32 packing	2026-05-12 15:12:35 +00:00
biondizzle	28afc2406b	test: add random FP4 data and kernel timeout	2026-05-12 15:11:41 +00:00
biondizzle	787d427847	test: fix NVFP4 mega_moe test dimensions for SMEM alignment	2026-05-12 15:07:35 +00:00
biondizzle	8737fd57c0	remove crap	2026-05-12 14:53:42 +00:00
biondizzle	52c3aefe73	bump cache busters to 33 for debug build	2026-05-12 13:10:37 +00:00
biondizzle	ca1d306890	fix: use torch.int8 for packed FP4 tensors (kPackedFP4=kInt8, not uint8)	2026-05-12 12:23:43 +00:00
biondizzle	b8f95ffad3	docker: add OMP_NUM_THREADS=64, remove --tool initcheck, mount cubin cache	2026-05-12 11:15:06 +00:00
biondizzle	5840291ea3	fix staging kernel packed_k_mask double-count	2026-05-12 08:08:24 +00:00
biondizzle	5ea5b579c3	Trim banner, no code changes	2026-05-12 07:24:36 +00:00
biondizzle	74af9984f6	Bug fixes: UE4M3 scale conversion, staging kernel SF/E2M1 packing, wo_a UE4M3, README overhaul - Fix _ue8m0_to_float32: checkpoint is float8_e4m3fn (UE4M3), not UE8M0 - Changed from shift-by-23 to .to(torch.float32) in both copies - Fix fold_global_scale in DeepGEMM mega/__init__.py - Fix staging kernel SF pack: int32 shift >= 32 is UB on GPU - Split 8-group pack into two int32 writes (groups 0-3, 4-7) - Fix staging kernel E2M1 output: was writing unpacked (1 byte/elem) into packed buffer (hidden/2 bytes), causing 2x overflow - Now packs even/odd nibble pairs correctly - Fix wo_a on-the-fly BF16→NVFP4: was encoding UE8M0, now UE4M3 - Use .clamp(0, 448).to(float8_e4m3fn) instead of log2/exp trick - Remove dead code: _ue8m0_uint8_to_float, tmp/, .bak, .s11, quant_module_patched.py, patch_finegrained_fp8_blackwell.py, patch_vllm_weights.py - Remove SCALE-FMT diagnostic histogram clutter - Update stale UE8M0 comments throughout - Rewrite README: clean instructions, confirmed format details	2026-05-12 05:52:30 +00:00
biondizzle	a36bf47f11	fix: use tl.split instead of indexing for E2M1 pair packing. Triton doesn't support constexpr tensor indexing (e2m1_pairs[:, 0]). Use tl.split(), which splits the last axis into two tensors.	2026-05-11 22:39:38 +00:00
biondizzle	27dbf2850f	fix: replace nested tl.where with sum-of-comparisons for E2M1 quantization. Triton can't compile deeply nested tl.where. Use arithmetic instead: idx = sum(abs_s >= threshold_i) for 7 threshold values.	2026-05-11 22:23:05 +00:00
biondizzle	3d1f3de190	fix: syntax error — move triton imports before docstring, remove orphan @triton.jit	2026-05-11 22:08:50 +00:00
biondizzle	79d866995f	bump cache buster 32 for packed FP4 mxf4nvf4 fix	2026-05-11 21:59:56 +00:00
biondizzle	c85b84b0fe	fix: staging kernel outputs unpacked E2M1 (1 byte/element, not packed 2/byte). Matches the SMEM layout: float_e2m1_unpacksmem_t is 1 byte/element. L1→L2 handoff uses unpacked format (same byte count as FP8). No bandwidth savings at L1→L2 for v1; can optimize later.	2026-05-11 21:29:33 +00:00
biondizzle	01cfd02759	fix: same reshape fix in main patch file	2026-05-11 21:05:54 +00:00
biondizzle	076d325c97	fix: use reshape instead of risky [0::2] slicing for E2M1 packing	2026-05-11 21:04:53 +00:00
biondizzle	8dc917c498	fix: topk_weights_out store missing topk_offsets multiplier	2026-05-11 21:02:19 +00:00
biondizzle	17ba5a9d7b	bump cache buster 30 for FP4 staging + DeepGEMM FP4 activations	2026-05-11 20:30:14 +00:00
biondizzle	7a4403fa98	feat: FP4 staging kernel - BF16 → E2M1 packed + UE4M3 block16 scales mxf4nvf4 requires FP4×FP4, not FP8×FP4. - New staging kernel: E2M1 nearest-neighbor quantization - Output: uint8 packed (2 E2M1 per byte) + UE4M3 packed int32 scales - Added CUDA sync diagnostics for error localization	2026-05-11 20:29:36 +00:00
biondizzle	0fd2d4f078	diag: add weight_scale uint8 histogram to verify E8M0 vs E4M3 format	2026-05-11 19:55:41 +00:00
biondizzle	50a945bde4	bump cache buster 29	2026-05-11 19:51:48 +00:00
biondizzle	48b905406a	diag: add CUDA sync after mega_moe finalize + forward to catch errors	2026-05-11 19:51:44 +00:00
biondizzle	35f6b66678	fix: UE8M0 reinterpret in DeepGEMM fold_global_scale + bump cache	2026-05-11 19:40:08 +00:00
biondizzle	f32d6b5b48	bump cache buster to 27	2026-05-11 19:26:21 +00:00
biondizzle	cd24182e36	diag: add NaN/Inf + FP8-dtype checks after NVFP4 dequant	2026-05-11 19:26:12 +00:00
biondizzle	8ae2214bad	fix: reorder Dockerfile ARG before COPY for proper cache busting	2026-05-11 18:48:07 +00:00
biondizzle	c4891e9ee2	fix: manual FP32→UE4M3 quant in Triton staging kernel. Triton can't cast float8e4nv → uint8 directly. Compute E4M3 bits manually: extract FP32 exponent/mantissa, convert to E4M3 format (4-bit exp + 3-bit mant), handle rounding and overflow, reconstruct dequantized value for FP8 activation quantization.	2026-05-11 16:38:52 +00:00
biondizzle	436109081c	bump cache buster to 24	2026-05-11 16:12:56 +00:00
biondizzle	5faf9916eb	fix: UE4M3 activation scales + group_size=16 for NVFP4 mega_moe The mxf4nvf4 MMA instruction shares scale_format_ between SFA and SFB. For NVFP4 (UE4M3), both activation and weight scales must be UE4M3. Changes to _stage_deepseek_v4_mega_moe_inputs_kernel: - GROUP_K=16 (was 32) — NVFP4 scale_vec::4X has group_size=16 - Scale quantization: float → float8_e4m3fn (UE4M3) instead of UE8M0 exponent extraction (>> 23). Pack 4 UE4M3 bytes per int32. - FP8 activation quantized against UE4M3 rounded scale Also updated class docstring (was stale MXFP4 conversion description).	2026-05-11 16:12:36 +00:00
biondizzle	220649c188	docs: CORRECTED — mxf4nvf4 IS supported on sm_100a (B200) Build 17-18 'scale_vec not supported' error was because we targeted sm_100 instead of sm_100a. The 'a' suffix enables FP4 block-scaled instructions. No need to fall back to mxf8f6f4 with UE8M0. Path forward: target sm_100a, use mxf4nvf4.scale_vec::4X, keep native UE4M3 scales + block16. No scale conversion needed.	2026-05-11 14:24:13 +00:00
biondizzle	cfead0012d	docs: comprehensive README update through build 22 - Full architecture diagram and NVFP4→vLLM bridge details - All 8 bugs documented with fixes - SM100 hardware limitation (mxf4nvf4 unsupported) - MegaMoE kernel architecture and debugging log (builds 1-22) - Three paths forward (A: FlashInfer, B: BF16 mega_moe, C: SM103+) - Container build pipeline, NVFP4 format spec, hard rules	2026-05-11 13:53:41 +00:00
biondizzle	8cb23bdb78	fix: import NVFP4 SymmBuffer from deep_gemm.mega	2026-05-11 08:05:50 +00:00
biondizzle	ff579c9767	fix: use NVFP4 SymmBuffer (2x SF size for group_size=16). The NVFP4 mega_moe kernel needs a larger symmetric buffer because group_size=16 produces 2x more scale factor entries than MXFP4's 32. Switch from deep_gemm.get_symm_buffer_for_mega_moe to deep_gemm.mega.nvfp4.get_symm_buffer_for_nvfp4_mega_moe.	2026-05-11 07:49:11 +00:00
biondizzle	1da40c53da	fix: add patch cache buster to Dockerfile	2026-05-11 07:19:10 +00:00
biondizzle	b532742530	debug: add shape/dtype logging to finalize_weights	2026-05-11 07:13:44 +00:00
biondizzle	b1cf4232ee	feat: wire DeepGEMM NVFP4 mega_moe kernel into vLLM patch - DeepseekV4MegaMoEExperts now uses native NVFP4 path - finalize_weights: transform_nvfp4_weights_for_mega_moe() instead of NVFP4→BF16→MXFP4 conversion - forward: fp8_nvfp4_mega_moe() with recipe=(1,1,16) - Experts stay in NVFP4. No MXFP4 conversion. Period.	2026-05-11 06:22:11 +00:00
biondizzle	a2e9b5f17f	fix: add --enable-expert-parallel to compose command	2026-05-11 06:15:11 +00:00
biondizzle	c8564caf9d	fix: patch vLLM deepseek_v4.py directly in image	2026-05-11 06:09:40 +00:00
biondizzle	7c8c6cd67f	fix: add PYTHONPATH for deep_gemm import	2026-05-11 06:06:52 +00:00
biondizzle	cffb373759	fix: symlink NVRTC lib into cuda/lib64 for linker	2026-05-11 06:04:24 +00:00
biondizzle	983ba02c5b	fix: add CUDA/NVRTC lib paths to Dockerfile	2026-05-11 06:02:13 +00:00
biondizzle	f0471ed1c2	fix: correct CR URL to atl.vultrcr.com	2026-05-11 05:59:06 +00:00
biondizzle	c234190a80	feat: add Dockerfile + build/push script for NVFP4 container - Extends dream-build with DeepGEMM nvfp4-mega-moe kernel - build_push.sh: builds, logs into Vultr CR, pushes, updates docker-compose - CACHE_BUSTER parameter for forcing fresh clones	2026-05-11 05:57:49 +00:00
biondizzle	e963325b61	WIP: MegaMoE NVFP4 kernel + diagnostics - Force use_mega_moe=True for NVFP4 pipeline - DeepseekV4MegaMoEExperts: load NVFP4 params (float8 block scales, float32 global/input scales), convert NVFP4→BF16→MXFP4 in finalize_weights for the DeepGEMM mega_moe kernel - Add _nvfp4_to_bf16 and _bf16_to_mxfp4 conversion methods - Remove expert_dtype check blocking mega_moe - Add diagnostics for wo_a and bf16 layer conversion - Still WIP: attention layer bugs under investigation	2026-05-11 05:19:49 +00:00
biondizzle	7e2f219259	fix: banner uses _os instead of os (not yet imported)	2026-05-11 04:57:24 +00:00
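
Several commits above (27dbf2850f, 076d325c97, 74af9984f6) describe E2M1 quantization by sum-of-comparisons and packing two E2M1 codes per byte. The following is a minimal plain-Python sketch of that idea, not the repository's Triton kernel. The threshold values (midpoints between adjacent E2M1 magnitudes, with ties rounding up), the nibble order (even element in the low nibble), and all function names are assumptions made for illustration.

```python
# Magnitudes representable by E2M1 (FP4); the list index is the 3-bit magnitude code.
E2M1_VALUES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)

# Assumed thresholds: midpoints between neighbouring magnitudes (7 values).
# Using ">=" means exact ties round up, which may differ from the real kernel.
THRESHOLDS = (0.25, 0.75, 1.25, 1.75, 2.5, 3.5, 5.0)


def quantize_e2m1(x: float) -> int:
    """Return a 4-bit code: bit 3 is the sign, bits 0-2 the magnitude index."""
    mag = abs(x)
    # Sum-of-comparisons: count how many thresholds the magnitude has passed.
    # This replaces a chain of nested conditionals with plain arithmetic.
    idx = sum(mag >= t for t in THRESHOLDS)
    sign = 1 if x < 0 else 0
    return (sign << 3) | idx


def dequantize_e2m1(code: int) -> float:
    """Map a 4-bit code back to its float value."""
    mag = E2M1_VALUES[code & 0x7]
    return -mag if code & 0x8 else mag


def pack_pairs(codes) -> bytes:
    """Pack 4-bit codes two per byte (assumed order: even index in low nibble)."""
    if len(codes) % 2:
        raise ValueError("need an even number of codes to pack pairs")
    return bytes(
        (codes[i] & 0xF) | ((codes[i + 1] & 0xF) << 4)
        for i in range(0, len(codes), 2)
    )


if __name__ == "__main__":
    values = [0.1, -1.4, 2.6, 100.0]
    codes = [quantize_e2m1(v) for v in values]
    print(codes, [dequantize_e2m1(c) for c in codes], pack_pairs(codes).hex())
```

Packing halves the buffer size (hidden/2 bytes for hidden elements), which is why writing one byte per element into a packed buffer overflows it by 2x, as commit 74af9984f6 notes.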

128 Commits
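
Commits 5faf9916eb and 74af9984f6 mention packing four UE4M3 scale bytes per int32 and splitting an 8-group pack into two int32 writes, because a shift of 32 or more on an int32 is undefined behaviour on the GPU. Below is a rough plain-Python sketch of that packing scheme, assuming little-endian lane order (group 0 in the lowest byte); the function name and the byte order are hypothetical and not taken from the repository.

```python
import struct


def pack_scales_int32(scale_bytes):
    """Pack 8-bit scale values four per signed 32-bit word.

    Each word only uses shifts of 0, 8, 16 and 24 bits, so no shift ever
    reaches 32. Eight groups therefore become two words (groups 0-3, 4-7).
    """
    if len(scale_bytes) % 4:
        raise ValueError("number of scale bytes must be a multiple of 4")
    words = []
    for base in range(0, len(scale_bytes), 4):
        word = 0
        for lane in range(4):
            word |= (scale_bytes[base + lane] & 0xFF) << (8 * lane)
        # Reinterpret the unsigned 32-bit pattern as a signed int32,
        # mirroring how the value would be stored in an int32 tensor.
        words.append(struct.unpack("<i", struct.pack("<I", word))[0])
    return words


if __name__ == "__main__":
    # Eight per-group scales become two int32 words.
    print([hex(w & 0xFFFFFFFF) for w in pack_scales_int32([1, 2, 3, 4, 5, 6, 7, 8])])
```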