Commit Graph

104 Commits

Author SHA1 Message Date
76e9b078a2 more debug2 2026-05-15 05:08:53 +00:00
912e4622d7 more debug 2026-05-15 04:53:26 +00:00
c7f6a1dc4d fix: transpose B and SFB on the Python side at weight-load time, and adjust the SFB remap kernel to read from column-major source layout 2026-05-15 04:35:45 +00:00
c56cc34ae1 fix: LayoutBTag is now RowMajor 2026-05-15 04:30:27 +00:00
9975558c23 Add always-on alpha/x_sf debug prints for L1 and L2 GEMM calls 2026-05-15 03:59:07 +00:00
9c318c3353 force no cache 2026-05-15 03:52:00 +00:00
ff6bb32684 Plumb global scale as GEMM alpha instead of folding into UE4M3
stage_activation now returns (x_fp4, x_sf, input_global_scale).
The global scale is applied as the CUTLASS GEMM alpha parameter
in the epilogue: D = alpha * A @ B, avoiding the fp32→UE4M3
round-trip that folding would introduce.

Changes:
- stage_activation: returns global scale as 3rd value
- cutlass_nvfp4_gemm C++ binding: alpha param (was hardcoded 1.0)
- cutlass_grouped_nvfp4_gemm: passes alpha to per-expert GEMM
- nvfp4_mega_moe_l1/l2: accept alpha, pass to grouped GEMM
- nvfp4_moe_full: reads symm_buffer.input_global_scale for L1,
  uses stage_activation's returned global scale for L2
- SymmBuffer: added input_global_scale field
- vllm patch: stores global scale from stage_activation
2026-05-15 03:32:19 +00:00
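A minimal NumPy sketch of the epilogue described above, D = alpha * A @ B. It is not the CUTLASS binding itself: `gemm_with_alpha` and the example value of `input_global_scale` are hypothetical, and everything runs in float32. In plain float arithmetic, scaling via alpha and scaling the input give the same result; the commit's stated motivation is that folding the scale into UE4M3 block scales would add a rounding step, which this sketch does not model.

```python
import numpy as np

def gemm_with_alpha(a, b, alpha):
    """Reference epilogue D = alpha * (A @ B), accumulated in float32."""
    acc = a.astype(np.float32) @ b.astype(np.float32)
    return np.float32(alpha) * acc

rng = np.random.default_rng(0)
a = rng.standard_normal((4, 8)).astype(np.float32)
b = rng.standard_normal((8, 3)).astype(np.float32)
input_global_scale = 0.125  # hypothetical value, chosen as a power of two

# Global scale applied once in the epilogue ...
d_alpha = gemm_with_alpha(a, b, input_global_scale)
# ... versus pre-multiplying the activations and using alpha = 1.0.
d_folded = gemm_with_alpha(a * input_global_scale, b, 1.0)
```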
d547da2948 stage_activation: add per-tensor global scale matching NVFP4 spec
Without a global scale, block scales (block_max / 6.0) could exceed
UE4M3 max (448.0) for large activations, causing saturation and garbage
MoE outputs. The degeneration pattern (positions 1-5 OK, then constant
spaces) is consistent with UE4M3 overflow: first few tokens have small
activations that fit, but once SiLU(mul(gate, up)) produces larger
values, block scales overflow and the GEMM produces zeros/garbage.

Fix: compute input_global_scale = amax / (6.0 * 448.0), normalize
before block quantization, then fold global scale back into block
scales (same as weight_transform.py folds weight_scale_2). This
ensures block scales are always ≤ 448.0 in UE4M3 range.
2026-05-15 03:27:47 +00:00
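The overflow argument above can be checked numerically. This is a rough sketch, assuming a block size of 16 elements and using float64 NumPy arrays in place of real FP4/UE4M3 storage; the function names are hypothetical and are not taken from stage_activation.

```python
import numpy as np

FP4_MAX = 6.0      # largest magnitude representable in FP4 (E2M1)
UE4M3_MAX = 448.0  # largest block scale representable in UE4M3
BLOCK = 16         # assumed number of elements per scale block

def naive_block_scales(x):
    """Block scales without a global scale: block_max / 6.0."""
    blocks = np.abs(x).reshape(-1, BLOCK)
    return blocks.max(axis=1) / FP4_MAX

def normalized_block_scales(x):
    """Normalize by input_global_scale first, then compute block scales."""
    amax = float(np.abs(x).max())
    global_scale = amax / (FP4_MAX * UE4M3_MAX) if amax > 0 else 1.0
    return global_scale, naive_block_scales(x / global_scale)

x = np.zeros(32)
x[0] = 6000.0   # a large activation in the first block
x[16] = 3.0     # a small activation in the second block

naive = naive_block_scales(x)                    # first block: 1000.0
global_scale, scales = normalized_block_scales(x)
```

Here the naive scale for the first block is 1000.0, beyond the UE4M3 range, while the normalized scales top out at about 448.0.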
108ff07569 debug: remove one-shot gate from logit dump, log every forward 2026-05-15 03:01:05 +00:00
3600a4b06a debug: add logit quality dump in compute_logits (ungated, once) 2026-05-15 02:37:23 +00:00
29f8b8c174 fix: load lm_head.weight in outer model before forwarding to inner
lm_head lives on DeepseekV4ForCausalLM, not DeepseekV4Model. The inner
load_weights silently drops it (not in params_dict). Extract it in the
outer loader, load it directly, then forward the rest to the inner model.
2026-05-15 02:17:16 +00:00
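A minimal sketch of the split described above: pull out the weights owned by the outer model and forward the rest to the inner loader. `split_outer_weights` and the tuple-based weight stream are hypothetical and only illustrate the idea, not the actual vLLM loader code.

```python
def split_outer_weights(weights, outer_names=("lm_head.weight",)):
    """Separate (name, tensor) pairs owned by the outer model from the rest."""
    outer, inner = {}, []
    for name, tensor in weights:
        if name in outer_names:
            outer[name] = tensor            # loaded directly by the outer model
        else:
            inner.append((name, tensor))    # forwarded to the inner load_weights
    return outer, inner

# Placeholder strings stand in for real tensors.
stream = [
    ("layers.0.ffn.w1.weight", "t0"),
    ("lm_head.weight", "t1"),
    ("embed.weight", "t2"),
]
outer, inner = split_outer_weights(stream)
```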
46536e5ccf fix: hc param renames missing leading dot
.attn_hc.base -> hc_attn_base produced layers.0hc_attn_base (no dot).
Need .hc_attn_base to preserve the dot separator.
2026-05-15 01:54:53 +00:00
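The missing-dot failure is easy to reproduce with plain string replacement. The key below is a made-up example in the shape the commit describes; the real rename table may apply its rules differently.

```python
key = "layers.0.attn_hc.base"  # hypothetical checkpoint key

# The pattern consumes the leading dot but the replacement does not restore it.
broken = key.replace(".attn_hc.base", "hc_attn_base")

# Keeping the leading dot in the replacement preserves the separator.
fixed = key.replace(".attn_hc.base", ".hc_attn_base")
```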
086f3fa5c5 fix: hc params dot→underscore + compressor position_bias→ape combined rule
Two fixes:
1. attn_hc.base → hc_attn_base (underscore not dot before base/fn/scale)
   Same for fn, scale, and ffn_hc variants.
2. compressor.position_bias → compressor.ape was never firing because
   the .self_attn.compressor. rule matched first (break). Added combined
   .self_attn.compressor.position_bias → .attn.mla_attn.compressor.ape.
2026-05-15 01:29:00 +00:00
44d4b6c225 fix: add missing renames for Hadamard coding + compressor.ape
Three more dropped checkpoint→model mappings:

1. hc_head: checkpoint has hc_head.hc_base/fn/scale, model has
   hc_head_base/fn/scale (underscore not dot separator)
2. attn_hc/ffn_hc: checkpoint has .attn_hc. and .ffn_hc., model has
   .hc_attn. and .hc_ffn. (word order reversed)
3. compressor.position_bias → compressor.ape: checkpoint name is
   position_bias, model attr is ape (absolute position encoding)

All 461 remaining zero params should now be just indexer.k_norm.bias
(legit zero - no bias in checkpoint, only weight).
2026-05-15 01:16:19 +00:00
af6583eb19 fix: unpack uint8 NVFP4→bf16 for non-stacked params (weights_proj)
indexer.weights_proj is uint8 [64,3584] in checkpoint but bf16 [64,7168]
in model. The uint8→bf16 unpack logic only ran in the stacked_params
loop, so non-stacked NVFP4 params hit a size mismatch assertion.
2026-05-15 00:51:50 +00:00
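The shape change from [64,3584] to [64,7168] follows from each uint8 holding two 4-bit values. The sketch below decodes packed FP4 (E2M1) nibbles with NumPy, assuming the low nibble comes first and ignoring block scales. It returns float32 because NumPy has no bf16 dtype. `unpack_fp4` is a hypothetical helper, not the loader's actual function.

```python
import numpy as np

# Magnitudes for the 3-bit E2M1 code; bit 3 of each nibble is the sign.
E2M1_VALUES = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0], dtype=np.float32)

def unpack_fp4(packed):
    """Decode a uint8 array of packed FP4 pairs; the last dimension doubles."""
    lo = packed & 0x0F   # assumed: first element lives in the low nibble
    hi = packed >> 4     # assumed: second element lives in the high nibble
    codes = np.stack([lo, hi], axis=-1).reshape(*packed.shape[:-1], -1)
    magnitude = E2M1_VALUES[codes & 0x7]
    sign = np.where(codes & 0x8, -1.0, 1.0).astype(np.float32)
    return sign * magnitude

sample = unpack_fp4(np.array([[0x21, 0xF7]], dtype=np.uint8))
full = unpack_fp4(np.zeros((64, 3584), dtype=np.uint8))
```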
e6ed9facf3 fix: indexer + shared_experts + compressor checkpoint→model key renames
Three categories of missed renames in CKPT_KEY_SUBST:

1. Shared experts: .shared_experts.gate_proj.→.ffn.shared_experts.w1. fired
   but break prevented .mlp.→.ffn. from also applying, producing
   mlp.ffn.shared_experts.w1. (double prefix). Fixed by including .mlp.
   in the pattern. Added missing .shared_experts.down_proj. rule.

2. Indexer (layers 2+): .self_attn.compressor.indexer.* was caught by the
   generic .self_attn.compressor.→.attn.mla_attn.compressor. rule, producing
   wrong path attn.mla_attn.compressor.indexer.* instead of attn.indexer.*.
   Added indexer-specific patterns (q_b_proj→wq_b, kv_norm→k_norm,
   position_bias→compressor.ape, gate_proj→compressor.wgate,
   kv_proj→compressor.wkv) before the generic compressor rule.

3. Compressor kv_proj/gate_proj: old .compressor.kv_proj.→.compressor.wkv.
   pattern could never fire because .self_attn.compressor. matched first
   (break). Merged into combined patterns that handle both the
   self_attn.compressor→attn.mla_attn.compressor path AND the projection
   rename in one step.
2026-05-15 00:39:37 +00:00
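The first-match-then-break behaviour behind these three cases can be sketched as follows. `apply_first_match`, the two rule lists, and the example key are hypothetical and only mirror the patterns quoted above, not the real CKPT_KEY_SUBST table.

```python
def apply_first_match(key, rules):
    """Apply only the first rule whose pattern occurs in the key, then stop."""
    for old, new in rules:
        if old in key:
            return key.replace(old, new)
    return key

# The narrow rule fires first, so the later .mlp. -> .ffn. rule never runs.
BROKEN_RULES = [
    (".shared_experts.gate_proj.", ".ffn.shared_experts.w1."),
    (".mlp.", ".ffn."),
]

# Including .mlp. in the pattern handles both renames in a single step.
FIXED_RULES = [
    (".mlp.shared_experts.gate_proj.", ".ffn.shared_experts.w1."),
    (".mlp.", ".ffn."),
]

key = "layers.0.mlp.shared_experts.gate_proj.weight"
broken = apply_first_match(key, BROKEN_RULES)
fixed = apply_first_match(key, FIXED_RULES)
```

The same ordering concern applies to the indexer and compressor rules: a more specific pattern has to appear before the generic one, or be merged with it, because only one rule fires per key.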
21018fca8a fix: shared_experts missing ffn. prefix in checkpoint→model rename
Checkpoint keys are model.layers.N.shared_experts.gate_proj.weight
but model params are layers.N.ffn.shared_experts.gate_up_proj.weight.
The .ffn. was missing from the rename, so stacked gate_up_proj
never matched params_dict.
2026-05-15 00:17:59 +00:00
483046b9d6 fix: shared_experts gate_up_proj stacking was skipped by .experts. check
The stacking logic skipped any key containing '.experts.' to avoid
MoE routed expert weights. But 'shared_experts' also matches that
substring, so gate_proj and up_proj were never stacked into
gate_up_proj. Changed to '.ffn.experts.' which only matches the
routed experts path.

Also includes POST-LOAD all-zero param scan.
2026-05-15 00:08:04 +00:00
78dc83dc6e a little more debug1 2026-05-15 00:02:22 +00:00
8dbd616add a little more debug1 2026-05-15 00:02:00 +00:00
756ea2192f clean up and possible big fix 2026-05-14 23:41:10 +00:00
9f01307c5b debug more7 2026-05-14 23:20:19 +00:00
e4f52c8900 debug more5 2026-05-14 23:01:59 +00:00
e46ff41569 debug more4 2026-05-14 22:50:51 +00:00
fd5f04eb15 debug more3 2026-05-14 22:36:34 +00:00
7573f12659 debug more2 2026-05-14 22:26:22 +00:00
11bbf675af debug more 2026-05-14 22:21:30 +00:00
ce4c4b6fcb debug empty output 2026-05-14 22:13:32 +00:00
09d1307d78 damn clankers2 2026-05-14 20:34:51 +00:00
5bbe51357c damn clankers 2026-05-14 20:23:42 +00:00
6aae8f1393 more fixes7 2026-05-14 20:11:37 +00:00
4363eee2ce more fixes6 2026-05-14 20:08:25 +00:00
40b980b9d6 more fixes5 2026-05-14 19:55:34 +00:00
d56e86b40e more fixes4 2026-05-14 19:51:56 +00:00
bf17bd3fc4 more fixes3 2026-05-14 19:47:02 +00:00
c68f4e9d6e more fixes2 2026-05-14 19:43:24 +00:00
4749a92fca more fixes 2026-05-14 19:39:16 +00:00
1ceff541b0 more fixes 2026-05-14 19:35:39 +00:00
3be051e140 fix 2026-05-14 19:29:47 +00:00
57512d5f0d clean up 2026-05-14 19:20:08 +00:00
0d8e1bd035 restructure: move Dockerfile and docker-compose to root, docker/ → vllm/ 2026-05-14 18:47:30 +00:00
878ad4fc5b fix Dockerfile patch paths and add explicit env vars for debug suppression 2026-05-14 18:44:08 +00:00
072a1d4a0b clean up 2026-05-14 18:40:15 +00:00
1150e325bb Consolidate serving into kernel repo
- Dockerfile: COPY kernel source instead of git clone
- docker-compose: build context at repo root, all debug flags OFF
- vLLM patches: deepseek_v4.py, staging_kernel.py, deepseek_v4_attention.py
- serve_vllm.py script
- .dockerignore to keep image clean
2026-05-14 18:20:20 +00:00
2687d1fc53 fix: convert global expert IDs to local before GEMM
vLLM's symm_buffer stores topk_ids as GLOBAL expert IDs (0..383).
Our weight tensors are indexed by LOCAL IDs (0..47 per rank).
Each rank r handles experts [r*48, r*48+47]. Without conversion,
topk_ids like 137, 222, 378 would index way out of bounds in the
weight tensor (shape (48, N, K)), producing garbage.

Derive experts_start_idx from the topk_ids and subtract to get
local IDs. This was why all ranks except rank 0 produced zero
expert matches → zero output → garbage text.
2026-05-14 17:43:58 +00:00
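A small sketch of the global-to-local conversion, assuming 48 experts per rank as stated above. Unlike the commit, which derives experts_start_idx from topk_ids, this version takes the rank explicitly. `to_local_expert_ids` and the -1 sentinel for non-local experts are assumptions made for illustration.

```python
import numpy as np

def to_local_expert_ids(topk_ids, rank, experts_per_rank=48):
    """Map global expert IDs to local ones; non-local entries become -1."""
    experts_start_idx = rank * experts_per_rank
    local = topk_ids - experts_start_idx
    is_local = (local >= 0) & (local < experts_per_rank)
    return np.where(is_local, local, -1)

topk_ids = np.array([137, 222, 378])
on_rank2 = to_local_expert_ids(topk_ids, rank=2)  # rank 2 owns experts 96..143
on_rank7 = to_local_expert_ids(topk_ids, rank=7)  # rank 7 owns experts 336..383
```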
128ff84358 fix: 384 experts (not 256), clarify cross-rank reduce is in caller
DeepSeek-V4-Pro has 384 routed experts, 48 per rank (384/8).
The cross-rank all-reduce happens in the parent DeepseekV4MoE.forward,
not in our kernel. Our kernel writes local output; caller does reduce.
Fixed README, nvfp4_mega_moe.py comments.
2026-05-14 17:33:59 +00:00
1c15dadaa5 cleanup: remove dead _pack_ue4m3_to_uint32, fix data format docs
weight_transform.py returns float8_e4m3fn scales, NOT packed uint32.
The _pack_ue4m3_to_uint32 function was never called. Removed it.
Updated README data formats to accurately reflect the pipeline:
- Weight scales: float8_e4m3fn (direct to CUTLASS, no unpack)
- Activation scales: uint32 packed (from staging kernel, unpacked to float8)
2026-05-14 17:28:12 +00:00
008f8cccbd docs: comprehensive README with SF remap probe data, bug history, coordinate table
Added detailed SF remap section with the empirical coordinate dump table
showing flat_rank=8 decomposition. Documented all 5 bugs found/fixed,
the diagnostic trail (constant-scale test, single-element probes), and
the 6 verification probes confirming the extraction formula.
2026-05-14 17:02:53 +00:00
1e0cea055c cleanup: remove all debug printfs from CUDA kernel and weight_transform
Removed printf from remap kernel (flat_rank dump, coordinate probes,
first-coord log). Removed weight_scale_2 debug prints from
weight_transform.py. Production-ready now.
2026-05-14 16:57:32 +00:00
839835cba4 fix: correct SF remap coordinate extraction for flat_rank=8
m = f0 + f1*32 + f2*128  (CuTe 'first sub varies fastest')
k_sf = f4 + f5*4
f3 is the Step<2> stride (degenerate, always=total), NOT a coordinate.
Previous formula (f3*2+f2)*128 was catastrophically wrong — mapped
everything to m=0 or m=huge.
2026-05-14 16:40:48 +00:00
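The corrected extraction formula, written out as a small helper. `sf_remap_coords` is a hypothetical name. The function only restates the two formulas from the commit message and omits f3, which the commit identifies as a stride rather than a coordinate.

```python
def sf_remap_coords(f0, f1, f2, f4, f5):
    """Return (m, k_sf) from the flat sub-coordinates, first sub varying fastest."""
    m = f0 + f1 * 32 + f2 * 128
    k_sf = f4 + f5 * 4
    return m, k_sf

# Example: f0=5, f1=2, f2=1 gives m = 5 + 64 + 128; f4=3, f5=2 gives k_sf = 3 + 8.
m, k_sf = sf_remap_coords(f0=5, f1=2, f2=1, f4=3, f5=2)
```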