2e674f87c1
ba fongol again
2026-05-12 21:47:14 +00:00
5d127d8294
ba fongol again
2026-05-12 21:47:01 +00:00
52cf3f2e25
ba fongol again
2026-05-12 21:46:42 +00:00
02decb486e
ba fongol
2026-05-12 21:34:08 +00:00
48f1f9dc5e
clanker nonsense again
2026-05-12 21:30:36 +00:00
5cabc1f7d9
clanker nonsense again
2026-05-12 21:29:59 +00:00
25a2d4e6ad
clanker nonsense
2026-05-12 21:28:50 +00:00
d88ea9842b
fix: add missing staging_kernel.py to Dockerfile — BF16→E2M1+UE4M3 quantization was never in container
2026-05-12 21:21:24 +00:00
91d7d9bad7
fucken a
2026-05-12 21:18:48 +00:00
d68e113af1
remove spammy shit
2026-05-12 20:57:04 +00:00
f0652693a6
dangit again
2026-05-12 19:13:01 +00:00
054792c84e
dangit
2026-05-12 18:42:39 +00:00
de055b1e77
syupid clankers
2026-05-12 18:26:37 +00:00
307574bc91
test: signal alarm timeout for kernel hang
2026-05-12 15:14:39 +00:00
fcd6de0a60
test: simplify SF fill to avoid shape mismatch
2026-05-12 15:13:16 +00:00
d4c557fddc
test: fix float8 randn + SF int32 packing
2026-05-12 15:12:35 +00:00
28afc2406b
test: add random FP4 data and kernel timeout
2026-05-12 15:11:41 +00:00
787d427847
test: fix NVFP4 mega_moe test dimensions for SMEM alignment
2026-05-12 15:07:35 +00:00
8737fd57c0
remove crap
2026-05-12 14:53:42 +00:00
52c3aefe73
bump cache busters to 33 for debug build
2026-05-12 13:10:37 +00:00
ca1d306890
fix: use torch.int8 for packed FP4 tensors (kPackedFP4=kInt8, not uint8)
2026-05-12 12:23:43 +00:00
b8f95ffad3
docker: add OMP_NUM_THREADS=64, remove --tool initcheck, mount cubin cache
2026-05-12 11:15:06 +00:00
5840291ea3
fix staging kernel packed_k_mask double-count
2026-05-12 08:08:24 +00:00
5ea5b579c3
Trim banner, no code changes
2026-05-12 07:24:36 +00:00
74af9984f6
Bug fixes: UE4M3 scale conversion, staging kernel SF/E2M1 packing, wo_a UE4M3, README overhaul
...
- Fix _ue8m0_to_float32: checkpoint is float8_e4m3fn (UE4M3), not UE8M0
- Changed from shift-by-23 to .to(torch.float32) in both copies
- Fix fold_global_scale in DeepGEMM mega/__init__.py
- Fix staging kernel SF pack: int32 shift >= 32 is UB on GPU
- Split 8-group pack into two int32 writes (groups 0-3, 4-7)
- Fix staging kernel E2M1 output: was writing unpacked (1 byte/elem)
into packed buffer (hidden/2 bytes), causing 2x overflow
- Now packs even/odd nibble pairs correctly
- Fix wo_a on-the-fly BF16→NVFP4: was encoding UE8M0, now UE4M3
- Use .clamp(0, 448).to(float8_e4m3fn) instead of log2/exp trick
- Remove dead code: _ue8m0_uint8_to_float, tmp/, .bak, .s11,
quant_module_patched.py, patch_finegrained_fp8_blackwell.py,
patch_vllm_weights.py
- Remove SCALE-FMT diagnostic histogram clutter
- Update stale UE8M0 comments throughout
- Rewrite README: clean instructions, confirmed format details
2026-05-12 05:52:30 +00:00
a36bf47f11
fix: use tl.split instead of indexing for E2M1 pair packing
...
Triton doesn't support constexpr tensor indexing (e2m1_pairs[:, 0]).
Use tl.split() which splits the last axis into two tensors.
2026-05-11 22:39:38 +00:00
27dbf2850f
fix: replace nested tl.where with sum-of-comparisons for E2M1 quantization
...
Triton can't compile deeply nested tl.where. Use arithmetic instead:
idx = sum(abs_s >= threshold_i) for 7 threshold values.
2026-05-11 22:23:05 +00:00
3d1f3de190
fix: syntax error — move triton imports before docstring, remove orphan @triton.jit
2026-05-11 22:08:50 +00:00
79d866995f
bump cache buster 32 for packed FP4 mxf4nvf4 fix
2026-05-11 21:59:56 +00:00
c85b84b0fe
fix: staging kernel outputs unpacked E2M1 (1 byte/element, not packed 2/byte)
...
Matches the SMEM layout: float_e2m1_unpacksmem_t is 1 byte/element.
L1→L2 handoff uses unpacked format (same byte count as FP8).
No bandwidth savings at L1→L2 for v1 — can optimize later.
2026-05-11 21:29:33 +00:00
01cfd02759
fix: same reshape fix in main patch file
2026-05-11 21:05:54 +00:00
076d325c97
fix: use reshape instead of risky [0::2] slicing for E2M1 packing
2026-05-11 21:04:53 +00:00
8dc917c498
fix: topk_weights_out store missing topk_offsets multiplier
2026-05-11 21:02:19 +00:00
17ba5a9d7b
bump cache buster 30 for FP4 staging + DeepGEMM FP4 activations
2026-05-11 20:30:14 +00:00
7a4403fa98
feat: FP4 staging kernel - BF16 → E2M1 packed + UE4M3 block16 scales
...
mxf4nvf4 requires FP4×FP4, not FP8×FP4.
- New staging kernel: E2M1 nearest-neighbor quantization
- Output: uint8 packed (2 E2M1 per byte) + UE4M3 packed int32 scales
- Added CUDA sync diagnostics for error localization
2026-05-11 20:29:36 +00:00
0fd2d4f078
diag: add weight_scale uint8 histogram to verify E8M0 vs E4M3 format
2026-05-11 19:55:41 +00:00
50a945bde4
bump cache buster 29
2026-05-11 19:51:48 +00:00
48b905406a
diag: add CUDA sync after mega_moe finalize + forward to catch errors
2026-05-11 19:51:44 +00:00
35f6b66678
fix: UE8M0 reinterpret in DeepGEMM fold_global_scale + bump cache
2026-05-11 19:40:08 +00:00
f32d6b5b48
bump cache buster to 27
2026-05-11 19:26:21 +00:00
cd24182e36
diag: add NaN/Inf + FP8-dtype checks after NVFP4 dequant
2026-05-11 19:26:12 +00:00
8ae2214bad
fix: reorder Dockerfile ARG before COPY for proper cache busting
2026-05-11 18:48:07 +00:00
c4891e9ee2
fix: manual FP32→UE4M3 quant in Triton staging kernel
...
Triton can't cast float8e4nv → uint8 directly. Compute E4M3 bits manually:
extract FP32 exponent/mantissa, convert to E4M3 format (4-bit exp + 3-bit mant),
handle rounding and overflow, reconstruct dequantized value for FP8 activation quantization.
2026-05-11 16:38:52 +00:00
436109081c
bump cache buster to 24
2026-05-11 16:12:56 +00:00
5faf9916eb
fix: UE4M3 activation scales + group_size=16 for NVFP4 mega_moe
...
The mxf4nvf4 MMA instruction shares scale_format_ between SFA and SFB.
For NVFP4 (UE4M3), both activation and weight scales must be UE4M3.
Changes to _stage_deepseek_v4_mega_moe_inputs_kernel:
- GROUP_K=16 (was 32) — NVFP4 scale_vec::4X has group_size=16
- Scale quantization: float → float8_e4m3fn (UE4M3) instead of UE8M0
exponent extraction (>> 23). Pack 4 UE4M3 bytes per int32.
- FP8 activation quantized against UE4M3 rounded scale
Also updated class docstring (was stale MXFP4 conversion description).
2026-05-11 16:12:36 +00:00
220649c188
docs: CORRECTED — mxf4nvf4 IS supported on sm_100a (B200)
...
Build 17-18 'scale_vec not supported' error was because we targeted
sm_100 instead of sm_100a. The 'a' suffix enables FP4 block-scaled
instructions. No need to fall back to mxf8f6f4 with UE8M0.
Path forward: target sm_100a, use mxf4nvf4.scale_vec::4X, keep
native UE4M3 scales + block16. No scale conversion needed.
2026-05-11 14:24:13 +00:00
cfead0012d
docs: comprehensive README update through build 22
...
- Full architecture diagram and NVFP4→vLLM bridge details
- All 8 bugs documented with fixes
- SM100 hardware limitation (mxf4nvf4 unsupported)
- MegaMoE kernel architecture and debugging log (builds 1-22)
- Three paths forward (A: FlashInfer, B: BF16 mega_moe, C: SM103+)
- Container build pipeline, NVFP4 format spec, hard rules
2026-05-11 13:53:41 +00:00
8cb23bdb78
fix: import NVFP4 SymmBuffer from deep_gemm.mega
2026-05-11 08:05:50 +00:00
ff579c9767
fix: use NVFP4 SymmBuffer (2x SF size for group_size=16)
...
The NVFP4 mega_moe kernel needs a larger symmetric buffer because
group_size=16 produces 2x more scale factor entries than MXFP4's 32.
Switch from deep_gemm.get_symm_buffer_for_mega_moe to
deep_gemm.mega.nvfp4.get_symm_buffer_for_nvfp4_mega_moe.
2026-05-11 07:49:11 +00:00
1da40c53da
fix: add patch cache buster to Dockerfile
2026-05-11 07:19:10 +00:00