Commit Graph

86 Commits

Author SHA1 Message Date
78dc83dc6e a little more debug1 2026-05-15 00:02:22 +00:00
8dbd616add a little more debug1 2026-05-15 00:02:00 +00:00
756ea2192f clean up and possible big fix 2026-05-14 23:41:10 +00:00
9f01307c5b debug more7 2026-05-14 23:20:19 +00:00
e4f52c8900 debug more5 2026-05-14 23:01:59 +00:00
e46ff41569 debug more4 2026-05-14 22:50:51 +00:00
fd5f04eb15 debug more3 2026-05-14 22:36:34 +00:00
7573f12659 debug more2 2026-05-14 22:26:22 +00:00
11bbf675af debug more 2026-05-14 22:21:30 +00:00
ce4c4b6fcb debug empty output 2026-05-14 22:13:32 +00:00
09d1307d78 damn clankers2 2026-05-14 20:34:51 +00:00
5bbe51357c damn clankers 2026-05-14 20:23:42 +00:00
6aae8f1393 more fixes7 2026-05-14 20:11:37 +00:00
4363eee2ce more fixes6 2026-05-14 20:08:25 +00:00
40b980b9d6 more fixes5 2026-05-14 19:55:34 +00:00
d56e86b40e more fixes4 2026-05-14 19:51:56 +00:00
bf17bd3fc4 more fixes3 2026-05-14 19:47:02 +00:00
c68f4e9d6e more fixes2 2026-05-14 19:43:24 +00:00
4749a92fca more fixes 2026-05-14 19:39:16 +00:00
1ceff541b0 more fixes 2026-05-14 19:35:39 +00:00
3be051e140 fix 2026-05-14 19:29:47 +00:00
57512d5f0d clean up 2026-05-14 19:20:08 +00:00
0d8e1bd035 restructure: move Dockerfile and docker-compose to root, docker/ → vllm/ 2026-05-14 18:47:30 +00:00
878ad4fc5b fix Dockerfile patch paths and add explicit env vars for debug suppression 2026-05-14 18:44:08 +00:00
072a1d4a0b clean up 2026-05-14 18:40:15 +00:00
1150e325bb Consolidate serving into kernel repo
- Dockerfile: COPY kernel source instead of git clone
- docker-compose: build context at repo root, all debug flags OFF
- vLLM patches: deepseek_v4.py, staging_kernel.py, deepseek_v4_attention.py
- serve_vllm.py script
- .dockerignore to keep image clean
2026-05-14 18:20:20 +00:00
2687d1fc53 fix: convert global expert IDs to local before GEMM
vLLM's symm_buffer stores topk_ids as GLOBAL expert IDs (0..383).
Our weight tensors are indexed by LOCAL IDs (0..47 per rank).
Each rank r handles experts [r*48, r*48+47]. Without conversion,
topk_ids like 137, 222, 378 would index way out of bounds in the
weight tensor (shape (48, N, K)), producing garbage.

Derive experts_start_idx from the topk_ids and subtract to get
local IDs. This was why all ranks except rank 0 produced zero
expert matches → zero output → garbage text.
2026-05-14 17:43:58 +00:00
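The global-to-local conversion described in 2687d1fc53 can be sketched in plain Python. This is an illustration under stated assumptions, not the kernel code: `to_local_expert_ids` and the `-1` sentinel are hypothetical, and the start index is computed from the rank here, whereas the commit derives `experts_start_idx` from the topk_ids.

```python
NUM_EXPERTS = 384                            # routed experts, per the commit log
NUM_RANKS = 8                                # 384 / 8 = 48 experts per rank
EXPERTS_PER_RANK = NUM_EXPERTS // NUM_RANKS  # 48


def to_local_expert_ids(topk_ids, rank):
    """Map global expert IDs (0..383) to local IDs (0..47) for one rank.

    IDs owned by another rank become -1 (a hypothetical sentinel) so they
    can never index into the local (48, N, K) weight tensor.
    """
    start = rank * EXPERTS_PER_RANK  # rank r owns [r*48, r*48 + 47]
    local = []
    for gid in topk_ids:
        lid = gid - start
        local.append(lid if 0 <= lid < EXPERTS_PER_RANK else -1)
    return local


if __name__ == "__main__":
    for r in (2, 4, 7):
        print(r, to_local_expert_ids([137, 222, 378], r))
```

Without the subtraction, an ID such as 137 would be used directly as an index into a tensor with only 48 experts.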
128ff84358 fix: 384 experts (not 256), clarify cross-rank reduce is in caller
DeepSeek-V4-Pro has 384 routed experts, 48 per rank (384/8).
The cross-rank all-reduce happens in the parent DeepseekV4MoE.forward,
not in our kernel. Our kernel writes local output; caller does reduce.
Fixed README, nvfp4_mega_moe.py comments.
2026-05-14 17:33:59 +00:00
1c15dadaa5 cleanup: remove dead _pack_ue4m3_to_uint32, fix data format docs
weight_transform.py returns float8_e4m3fn scales, NOT packed uint32.
The _pack_ue4m3_to_uint32 function was never called. Removed it.
Updated README data formats to accurately reflect the pipeline:
- Weight scales: float8_e4m3fn (direct to CUTLASS, no unpack)
- Activation scales: uint32 packed (from staging kernel, unpacked to float8)
2026-05-14 17:28:12 +00:00
008f8cccbd docs: comprehensive README with SF remap probe data, bug history, coordinate table
Added detailed SF remap section with the empirical coordinate dump table
showing flat_rank=8 decomposition. Documented all 5 bugs found/fixed,
the diagnostic trail (constant-scale test, single-element probes), and
the 6 verification probes confirming the extraction formula.
2026-05-14 17:02:53 +00:00
1e0cea055c cleanup: remove all debug printfs from CUDA kernel and weight_transform
Removed printf from remap kernel (flat_rank dump, coordinate probes,
first-coord log). Removed weight_scale_2 debug prints from
weight_transform.py. Production-ready now.
2026-05-14 16:57:32 +00:00
839835cba4 fix: correct SF remap coordinate extraction for flat_rank=8
m = f0 + f1*32 + f2*128  (CuTe 'first sub varies fastest')
k_sf = f4 + f5*4
f3 is the Step<2> stride (degenerate, always=total), NOT a coordinate.
Previous formula (f3*2+f2)*128 was catastrophically wrong: it mapped
everything to m=0 or m=huge.
2026-05-14 16:40:48 +00:00
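The extraction formula from 839835cba4 can be written out as a small Python helper for checking values by hand. The function name `extract_m_ksf` is hypothetical. The sketch assumes the flattened coordinate arrives as a sequence of eight integers f0..f7, and it ignores f3, f6, and f7 because the formula as quoted does not use them.

```python
def extract_m_ksf(flat):
    """Recover logical (m, k_sf) from a flat_rank=8 coordinate.

    flat: sequence of 8 ints (f0..f7). Only f0, f1, f2, f4, f5 are used,
    following the formula quoted in the commit message.
    """
    if len(flat) != 8:
        raise ValueError("expected a flattened coordinate of rank 8")
    f0, f1, f2, _f3, f4, f5 = flat[:6]
    m = f0 + f1 * 32 + f2 * 128   # first sub-index varies fastest
    k_sf = f4 + f5 * 4
    return m, k_sf


if __name__ == "__main__":
    print(extract_m_ksf((5, 2, 1, 0, 3, 2, 0, 0)))
```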
1ef2fbc2fd debug: more indices for SF layout dump 2026-05-14 16:26:15 +00:00
c4b5b52a33 debug: single-thread SF layout dump at specific indices 2026-05-14 16:13:05 +00:00
17e6033ade debug: print specific indices for SF layout coordinate decomposition 2026-05-14 15:57:55 +00:00
8ee3f90e44 debug: handle flat_rank=8 for SF remap, add coordinate dump
Previous approach assumed rank 2-6, but actual rank is 8.
For R==8: 4 M sub-indices (inner_32, inner_4, tile_interleave, tile_m)
          4 K sub-indices (inner_16, inner_4_k, tile_k_interleave, tile_k)
m = (f3*2 + f2)*128 + f0*4 + f1
k_sf = f5 + f6*4  (tentative, needs printf verification)
Added printf of all 8 flat values for first 3 indices.
2026-05-14 15:45:52 +00:00
d2c1c76f5b debug: idx2crd+flatten approach with printf to determine flat_rank
Going back to the idx2crd approach which compiles and runs.
Added printf for flat_rank, MN, K_sf, and first coordinate extraction.
Handles ranks 2-6 with logical (m, k_sf) extraction.
This will tell us the actual flat_rank and whether our extraction is correct.
2026-05-14 15:34:46 +00:00
2ac3a7d631 fix: construct nested coordinate for CuTe layout shape ((32,4), K)
Calling layout_sf(m, k_elem) with flat ints fails with "Mismatched Ranks" because
the layout shape is ((32,4), K_padded), not (M, K).
Decompose m into (inner_m, sub_m) = (m/4, m%4) to match the (32,4)
sub-shape, and pass as make_tuple(make_tuple(inner, sub), k_elem).
2026-05-14 15:32:12 +00:00
593ae998f8 fix: clean rewrite of cutlass_nvfp4_gemm.cu — no more file splicing
Removed dead code from old idx2crd approach. File is now clean:
- Source-iterating SF remap kernel with layout_sf(m, k_elem)
- Zero-init dest buffers before remap
- Proper extern C wrapping
2026-05-14 15:31:03 +00:00
196ee37fdb fix: rewrite SF remap kernel — source-iterating with layout_sf(m, k_elem)
Ripped out idx2crd + flatten + get<> approach entirely. New kernel
iterates over source indices (m, k_group) and uses layout_sf(m, k_elem)
to compute the CUTLASS destination offset. CuTe handles nested shape
decomposition internally — no rank inspection needed.

K coordinate is in element-space (k_group * SFVecSize) as the layout
expects. Iterates over groups (not every element) since all 16 elements
within a group share one SF byte — avoids 16x redundant writes.

Grid size based on source count (MN * K_sf), not dest buffer size.
2026-05-14 15:28:44 +00:00
fb390b24e2 debug: add printf to SF remap kernel to check flat_rank and layout shape 2026-05-14 15:24:18 +00:00
8f5322ca31 fix: add missing extern "C" opening brace lost during file reconstruction 2026-05-14 15:04:43 +00:00
a8bd962452 fix: SF remap — iterate dest indices, extract logical (m, k_sf) from nested coord
The forward-map approach (src -> layout_sf(m, k)) failed because CuTe's
layout operator requires coordinates matching the nested shape rank, and
passing flat (int, int) to a ((32,4),K) shape triggers Mismatched Ranks.

New approach: iterate over CUTLASS dest indices, use idx2crd to get the
hierarchical coordinate, flatten it, then extract logical (m, k_sf) by
interpreting the flattened sub-coordinates correctly:
  flat[0..2] = (inner_M, sub_M, tile_M) -> m = tile_M*128 + inner_M*4 + sub_M
  flat[3..5] = (inner_K, sub_K, tile_K) -> k_sf = tile_K*4 + sub_K
  (inner_K is within one SF group — same byte, so ignored for k_sf)

Previous bug: get<0> and get<1> of flatten gave (inner_M, sub_M) — both
M sub-indices. K information was never extracted, so only k_group=0 worked.

Dest buffer is zero-initialized so padding slots (where m >= MN or
k_sf >= K_sf) stay zero.
2026-05-14 15:01:47 +00:00
395cc31883 fix: use layout_sf(m, k_elem) instead of make_coord for nested shapes
make_coord(m, k_elem) produces rank-2 coord, but tile_to_shape creates
nested shapes like ((32,4,tiles_m), (16,4,tiles_k)) which expect
matching nested coords. layout_sf(m, k) operator handles hierarchical
projection automatically.
2026-05-14 14:57:13 +00:00
d90967d6e9 fix: SF remap — element-space K coords + zero-init dest buffer
Two fixes:
1. CuTe layout uses element-space K, not group-space. k_group=3 with
   SFVecSize=16 maps to k_elem=48 in the layout, not k=3.
   Added SFVecSize param to remap kernel, multiply k_sf * SFVecSize
   before passing to layout_sf().

2. Zero-init CUTLASS dest buffer before remap. The layout pads to
   tile boundaries (128x64), so dest is larger than M*K_sf. Unmapped
   padding slots held garbage, which caused sporadic wrong results.
   Also fixed grid size to use source count (M*K_sf), not dest size.
2026-05-14 14:54:18 +00:00
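The two fixes in d90967d6e9 come down to simple arithmetic, sketched below in Python. The helper names are hypothetical. Reading the "128x64" tile as 128 rows by 64 K elements (4 scale-factor groups of 16) is an assumption, not something the commit message states.

```python
SF_VEC_SIZE = 16  # elements sharing one scale factor, per the commit
TILE_M = 128      # assumed: tile height in rows
TILE_K = 64       # assumed: tile width in K elements (4 SF groups)


def round_up(x, multiple):
    """Round x up to the next multiple of `multiple`."""
    return (x + multiple - 1) // multiple * multiple


def k_group_to_k_elem(k_group, sf_vec_size=SF_VEC_SIZE):
    """Convert a group-space K index to the element-space index."""
    return k_group * sf_vec_size


def remap_sizes(m, k):
    """Return (source_count, padded_dest_count), both in SF slots."""
    if k % SF_VEC_SIZE:
        raise ValueError("k must be a multiple of SF_VEC_SIZE")
    source = m * (k // SF_VEC_SIZE)
    dest = round_up(m, TILE_M) * (round_up(k, TILE_K) // SF_VEC_SIZE)
    return source, dest


def make_dest_buffer(m, k):
    """Allocate the padded destination; bytearray() is zero-filled."""
    _, dest = remap_sizes(m, k)
    return bytearray(dest)


if __name__ == "__main__":
    print(k_group_to_k_elem(3))   # group 3 -> element 48
    print(remap_sizes(100, 96))   # padded dest exceeds source count
```

Because the padded destination can be larger than the source, one work item per source slot leaves the padding untouched, which is why the buffer has to start out zeroed.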
5968ebad9f fix: SF remap was using idx2crd+flatten which gives atom sub-indices, not logical (m,k)
The remap kernel iterated over CUTLASS linear indices and tried to
reverse-map with idx2crd + flatten. But flatten() on the nested CuTe
coordinate (from tile_to_shape(SfAtom{}, ...)) gives atom-level
sub-indices, not logical (m, k). This caused all K-groups > 0 in SFA
to map to m*K_sf+0, losing K-group information entirely.

Proof: setting SFA[0,0]=2.0 changed row 0, but SFA[0,3]=2.0 produced
zero change. Only K-group 0 was being read.

Fix: iterate over SOURCE indices (row-major m, k) and use the CuTe
layout forward: layout_sf(make_coord(m, k)) -> CUTLASS dst index.
This is the correct forward direction that CuTe handles natively.

Constant-scale test (all SF=1.0) gave cosine=1.0, confirming the FP4
data path is correct. The bug was purely in the SF remap.
2026-05-14 14:51:02 +00:00
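The diagnostic idea in 5968ebad9f (perturb one scale factor and check whether the output moves) can be illustrated with a toy model in Python. Everything here is a stand-in: `scaled_row_sum` is a hypothetical substitute for the real GEMM, and its `buggy` flag only mimics the symptom of every K-group reading group 0.

```python
import math

GROUP = 16  # elements per scale factor (SFVecSize in the commits)


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def scaled_row_sum(row, scales, buggy=False):
    """Toy block-scaled reduction: sum of row[k] * scale of k's group.

    With buggy=True every element reads scales[0], mimicking a remap
    that loses K-group information.
    """
    total = 0.0
    for k, x in enumerate(row):
        group = 0 if buggy else k // GROUP
        total += x * scales[group]
    return total


def probe_changes_output(row, group, buggy):
    """Set one group's scale to 2.0 and report whether the result moves."""
    base = [1.0] * (len(row) // GROUP)
    probed = list(base)
    probed[group] = 2.0
    return scaled_row_sum(row, probed, buggy) != scaled_row_sum(row, base, buggy)


if __name__ == "__main__":
    row = [1.0] * 64
    print(probe_changes_output(row, 0, buggy=True))
    print(probe_changes_output(row, 3, buggy=True))
    print(probe_changes_output(row, 3, buggy=False))
```

In the toy model, a probe on group 3 has no effect when the reader is broken, which matches the symptom reported in the commit message.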
cf796e37cf debug: add weight_scale_2 shape/value logging in weight transform 2026-05-14 14:19:35 +00:00
879adc324d fix: _fold_global_scale — remove broken logical_widths branch
The logical_widths branch took the global scales of experts 0 and 1 and
applied them to ALL experts. For L1 with logical_widths=[3072,3072],
every expert got expert-0's scale on its gate half and expert-1's
scale on its up half. All other experts' global scales were discarded.

The else branch correctly broadcasts each expert's own (E,1) global
scale across (E, N, K//16). Removed the dead logical_widths code.
2026-05-14 14:17:44 +00:00
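The per-expert broadcast that 879adc324d keeps can be sketched with nested Python lists. This assumes that "folding" means multiplying each block scale by its expert's global scale. The function name `fold_global_scale` is reused here for illustration only and is not the repository's implementation.

```python
def fold_global_scale(block_scales, global_scales):
    """Multiply every block scale of expert e by that expert's global scale.

    block_scales:  nested list shaped (E, N, K//16)
    global_scales: list of E floats (the (E, 1) tensor, flattened)
    """
    if len(block_scales) != len(global_scales):
        raise ValueError("one global scale per expert is required")
    return [
        [[s * g for s in row] for row in expert]
        for expert, g in zip(block_scales, global_scales)
    ]


if __name__ == "__main__":
    scales = [[[1.0, 2.0]], [[3.0, 4.0]]]  # E=2, N=1, K//16=2
    print(fold_global_scale(scales, [2.0, 0.5]))
```

Each expert only ever sees its own global scale, which is the property the removed branch violated.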
ef9cd023a9 fix: unpack_ue4m3_u32 — uint32 lacks CUDA bitwise ops, use int32
PyTorch doesn't implement bitwise_and/shift for UInt32 on CUDA.
Cast to int32 first, then extract bytes, then uint8 → view float8.
2026-05-14 13:44:42 +00:00
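The byte extraction in ef9cd023a9 still works on signed values because masking with 0xFF discards the sign-extended bits. The Python sketch below models that on plain integers. The helper names are hypothetical, least-significant-byte-first ordering is an assumption, and the decoder follows the standard float8 E4M3 (fn) bit layout rather than any code from the repository.

```python
def as_int32(word):
    """Reinterpret an unsigned 32-bit value as a signed 32-bit value."""
    word &= 0xFFFFFFFF
    return word - (1 << 32) if word & 0x80000000 else word


def unpack_bytes(word):
    """Extract 4 bytes from a packed word, least-significant first (assumed).

    The shift is arithmetic on negative values, so the mask is what
    removes the sign-extended high bits.
    """
    signed = as_int32(word)
    return [(signed >> (8 * i)) & 0xFF for i in range(4)]


def decode_e4m3fn(byte):
    """Decode one E4M3 (fn) byte: 1 sign, 4 exponent, 3 mantissa bits, bias 7."""
    sign = -1.0 if byte & 0x80 else 1.0
    exp = (byte >> 3) & 0xF
    man = byte & 0x7
    if exp == 0xF and man == 0x7:
        return float("nan")                      # the only NaN encoding
    if exp == 0:
        return sign * (man / 8.0) * 2.0 ** -6    # subnormal
    return sign * (1.0 + man / 8.0) * 2.0 ** (exp - 7)


if __name__ == "__main__":
    packed = 0x3C304038
    print([decode_e4m3fn(b) for b in unpack_bytes(packed)])
```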
1c39e21d87 fix: remove broken L1 weight interleave
The interleave assumed gate/up were pre-interleaved in groups of 16
and that we needed 2CTA UMMA layout. Both wrong:
1. vLLM w13_weight is plain concat [gate; up] along output dim
2. Our CUTLASS kernel uses ClusterShape 1x1x1, not 2CTA

The interleave shuffled the weights into nonsense, so the L1 GEMM
computed the wrong result and chunk(2) split the wrong halves.
2026-05-14 13:05:45 +00:00
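With a plain [gate; up] concatenation along the output dimension, as 1c39e21d87 describes, recovering the two halves is a midpoint split. A minimal Python sketch on a list of rows follows. `split_gate_up` is a hypothetical helper, not the repository's code.

```python
def split_gate_up(w13_rows):
    """Split rows laid out as [gate; up] at the midpoint of the output dim."""
    n = len(w13_rows)
    if n % 2:
        raise ValueError("output dimension must be even")
    half = n // 2
    return w13_rows[:half], w13_rows[half:]


if __name__ == "__main__":
    rows = ["g0", "g1", "g2", "u0", "u1", "u2"]
    gate, up = split_gate_up(rows)
    print(gate, up)
```

Any reordering of the rows before this split, such as an interleave, would put gate rows into the up half and vice versa.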