nvfp4-megamoe-kernel

Author	SHA1	Message	Date
biondizzle	8758bc93ca	crap shoot	2026-05-18 11:13:29 +00:00
biondizzle	b8df4a8cc5	Fix NaN check: use os.environ gate instead of is_current_stream_capturing. torch.cuda.is_current_stream_capturing() returns bool, which breaks Dynamo FX tracing (non-Tensor output). Switch to env var gate: CLAWMINE_NAN_CHECK=1 enables NaN/Inf detection. Dynamo evaluates os.environ at trace time; if the env var is not set, the entire NaN check block is compiled away. Set it before first inference to get NaN detection during prefill only.	2026-05-18 02:20:14 +00:00
biondizzle	0c02d84514	Add NaN/Inf detection in DeepseekV4Model.forward layer loop: checks every layer during prefill (not during cudagraph capture); is_current_stream_capturing() gate prevents CPU-GPU syncs during capture; prints amax every 10 layers for magnitude tracking; breaks on first NaN/Inf to avoid wasting compute.	2026-05-17 23:37:12 +00:00
biondizzle	bedcfc4dab	Pipeline test: use max_num_tokens=8192 matching vLLM	2026-05-17 23:04:44 +00:00
biondizzle	c45364b3a8	Add MoE scale ratio output	2026-05-17 22:58:27 +00:00
biondizzle	bf99ad49ec	Print both MoE and residual cosine	2026-05-17 22:56:56 +00:00
biondizzle	8637020487	Fix multi-layer test: add residual connections	2026-05-17 22:55:40 +00:00
biondizzle	11dce13afe	Add multi-layer pipeline test to check error accumulation	2026-05-17 22:53:28 +00:00
biondizzle	87582fc9f7	HOTFIX: remove NaN checks from run() — torch.isnan().any() does CPU-GPU sync, breaks cudagraph	2026-05-17 22:28:32 +00:00
biondizzle	8717e0e411	Fix warmup: use same padded GEMM path as run(), add swiglu_limit clamping	2026-05-17 22:03:48 +00:00
biondizzle	d332f4f900	Add NaN debug checks after L1 and L2 GEMM	2026-05-17 22:02:24 +00:00
biondizzle	e65f2b2ba2	Update CURRENT_BUG.md with Bug 26 fix	2026-05-17 21:36:25 +00:00
biondizzle	72628fb689	Full pipeline test: runner vs BF16 reference	2026-05-17 21:29:16 +00:00
biondizzle	2796bd81e8	Fix: scatter FP4 as uint8 (float4 doesn't support index_put)	2026-05-17 21:28:04 +00:00
biondizzle	364f8372bb	Fix FP4 buffer shapes: D//2 for packed dimensions	2026-05-17 21:26:46 +00:00
biondizzle	5e4d674736	Test fix: quantize slot_hidden, scatter FP4, pass slot_x_sf	2026-05-17 21:25:58 +00:00
biondizzle	803e7160d8	Fix: allocate FP4 buffers as uint8 then view-cast	2026-05-17 21:25:04 +00:00
biondizzle	7256070dd3	FIX Bug 26: quantize slot tokens, not padded buffer. The runner was quantizing the padded_hidden (4096 rows) and then taking x_sf[:num_slots] (first 48 rows). This only got scales for expert 0 (the first 48 rows of the padded buffer), not the scales for tokens scattered across padded positions (expert 1 at row 128, etc.). Fix: quantize slot_hidden (sorted tokens, num_slots rows) to get correct per-token x_sf, then scatter x_fp4 into padded FP4 buffer for the GEMM. The scale assembly now receives the correct x_sf. Added hidden_fp4 and activated_fp4 padded buffers for FP4 scatter.	2026-05-17 21:24:43 +00:00
biondizzle	4d0b6d889d	Set runner weights before _ensure_stacked	2026-05-17 21:22:50 +00:00
biondizzle	b7acac5e4e	Call _ensure_stacked() before using runner buffers	2026-05-17 21:22:30 +00:00
biondizzle	1acf01fc1a	Fix token_indices: repeat each token ID top_k times, not arange	2026-05-17 21:22:11 +00:00
biondizzle	a478ca4746	Debug: trace runner logic step by step, test L1 GEMM	2026-05-17 21:21:45 +00:00
biondizzle	a100bd11c1	Simplify pipeline test: BF16 ref + bridge ref + full runner	2026-05-17 21:20:41 +00:00
biondizzle	6eade5e7f8	Fix: gs values are floats not tensors	2026-05-17 21:19:47 +00:00
biondizzle	b05a38a9bd	Test stages 1-2 first: sort + L1 GEMM	2026-05-17 21:19:23 +00:00
biondizzle	9728604ea1	Pipeline test: stage-by-stage with BF16 reference comparison	2026-05-17 21:19:17 +00:00
biondizzle	7fff5fd39b	Fix: correct intermediate_size=3072, weight key prefix, dequantize shapes	2026-05-17 21:18:20 +00:00
biondizzle	4ef345773d	Rewrite pipeline test: load real weights, step-by-step vs BF16 reference	2026-05-17 21:17:18 +00:00
biondizzle	b43541afdd	Fix test path setup	2026-05-17 21:00:00 +00:00
biondizzle	490ddfa294	Pipeline test: use synthetic weights at 256x512 (JIT at 7168x18432 hangs for hours)	2026-05-17 20:58:06 +00:00
biondizzle	c1bb551446	Fix weight loading: skip already-loaded experts correctly	2026-05-17 18:15:51 +00:00
biondizzle	955d7533f2	Use system Python for pipeline test (CuTeDSL in system site-packages)	2026-05-17 18:13:42 +00:00
biondizzle	925e390b93	Fix import: use direct import from vllm/ subdirectory	2026-05-17 18:12:53 +00:00
biondizzle	cd6144b832	Fix imports: all functions are in cutedsl.bridge, not separate modules	2026-05-17 18:11:03 +00:00
biondizzle	5e63a0d8a3	Rewrite pipeline test: use raw checkpoint weights, compare runner vs dynamic-gs reference	2026-05-17 18:10:05 +00:00
biondizzle	e51eafe288	Rewrite pipeline test: compare runner vs reference with real weights, step-by-step	2026-05-17 18:08:33 +00:00
biondizzle	e38d60a6e8	Add pipeline test with real model weights, add swiglu_limit to reference moe_pipeline	2026-05-17 18:07:44 +00:00
biondizzle	22e0370e6e	Fix AttributeError: DeepseekV4MegaMoEExperts has no swiglu_limit. Get swiglu_limit from vllm_config.model_config.hf_config instead of self (it was only set on the parent DeepseekV4MoE class).	2026-05-17 18:06:44 +00:00
biondizzle	6692166d0f	Update CURRENT_BUG.md: Bug 25 (swiglu_limit), shared expert path verification, variable padded offsets	2026-05-17 17:56:04 +00:00
biondizzle	a10c582cf4	Add swiglu_limit=10.0 activation clamping (was missing). DeepSeek-V4 uses SiluAndMulWithClamp(10.0), which clamps silu(gate) to max 10.0 and up to [-10.0, 10.0]. Our runner was doing plain F.silu(gate) * up without clamping. Large gate values could produce unbounded SiLU output, causing numerical issues in the L2 GEMM. This is likely contributing to garbage model output.	2026-05-17 17:52:16 +00:00
biondizzle	3f2f4e1882	Fix cudaErrorStreamCaptureUnsupported: no dynamic GPU-tensor slicing. Dynamic slicing with GPU scalars (e.g. buf[:gpu_scalar]) is a CUDA operation not permitted during stream capture. Use full pre-allocated buffers instead of dynamic slices. The GEMM only reads rows indicated by expert_offsets, ignoring the zero padding. Also pass x_sf[:num_slots] (Python int slicing, cudagraph-safe) to scale assembly so it only processes real token scale data.	2026-05-17 17:24:26 +00:00
biondizzle	11b5aa5e37	Scale assembly: full-buffer swizzle, zero CPU syncs, no Python loops. Removed .cpu().tolist() and per-expert Python loops. Apply the Blackwell 32_4_4 swizzle to the entire padded_x_sf buffer at once. The buffer is already 128-row aligned (padded per expert) and 4-col aligned, so the full-buffer swizzle produces the correct layout. The GEMM reads scale_a using padded_expert_offsets, which matches the scatter layout. Fully GPU, zero CPU syncs, cudagraph-safe.	2026-05-17 16:59:51 +00:00
biondizzle	94dec5922d	Scale assembly Phase 2: use CPU-computed offsets for Python slicing. GPU scalars can't be used for Python indexing (requires sync). Compute padded_expert_offsets on CPU via .cpu().tolist() for the Python loop. This is OK for cudagraph: Python code only runs during capture, not replay. The GPU kernel launches recorded during capture are deterministic.	2026-05-17 16:56:52 +00:00
biondizzle	49c28e6562	Fix: use real padded expert offsets instead of fixed layout. Root cause of garbage output: fixed-layout padding with max_chunks=ceil(avg) was too small for uneven expert assignment. Tokens beyond max_chunks*128 per expert were silently dropped (clamped_local overwrote the same row). Fix: compute padded_expert_offsets from actual tokens_per_expert (padded to 128). No clamping needed: each expert gets exactly the space it needs. Pass padded_expert_offsets to scale assembly and GEMM.	2026-05-17 16:55:47 +00:00
biondizzle	87a223f1ac	Update CURRENT_BUG.md: current status, outstanding garbage output issue, hypotheses	2026-05-17 16:52:40 +00:00
biondizzle	c03438fc4e	crap shoot	2026-05-17 16:25:38 +00:00
biondizzle	7c16f3cb46	Fix: init shared dict before using it, remove duplicate _output_buf	2026-05-17 16:06:58 +00:00
biondizzle	ea8acf9852	Share padded_x_sf and output buffers across layers to save ~300 MB. Per-layer padded_xsf (2.4 MB) + output_buf (4.2 MB) × 60 layers = ~400 MB. Sharing reduces to ~3.6 MB total. Layers run sequentially during both capture and replay.	2026-05-17 16:05:53 +00:00
biondizzle	3d0b1408b4	Update CURRENT_BUG.md: Bug 21 (shared buffers), clean up status	2026-05-17 15:52:06 +00:00
biondizzle	455ecb5631	Fix: define padded_max_slots before using it in shared buffer allocation	2026-05-17 15:47:38 +00:00
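
Commit 49c28e6562 describes computing padded_expert_offsets from the actual tokens_per_expert, with each expert's region padded to 128 rows. The arithmetic might look roughly like the plain-Python sketch below. The function name, the list-based representation, and the sample token counts are illustrative assumptions; the runner itself works on GPU tensors and its code is not shown in this log.

```python
def padded_expert_offsets(tokens_per_expert, align=128):
    """Return the start row of each expert's region, plus the total row count.

    Each expert's token count is rounded up to a multiple of `align`, so an
    expert with uneven load still gets all the rows it needs (hypothetical
    helper, not the runner's implementation).
    """
    offsets = [0]
    for count in tokens_per_expert:
        padded = -(-count // align) * align  # ceiling division, then scale back up
        offsets.append(offsets[-1] + padded)
    return offsets


def padded_row(offsets, expert, local_index):
    """Row in the padded buffer for the local_index-th token of an expert."""
    return offsets[expert] + local_index


# Illustrative counts: expert 0 has 48 tokens, expert 1 has 130, expert 2 none.
offsets = padded_expert_offsets([48, 130, 0, 5])
first_row_expert_1 = padded_row(offsets, expert=1, local_index=0)
```

With these sample counts, expert 1 starts at row 128, which matches the layout mentioned in commit 7256070dd3 (expert 0 in the first rows, expert 1 at row 128).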


267 Commits
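
Commit a10c582cf4 describes the clamped activation as limiting silu(gate) to at most 10.0 and up to the range [-10.0, 10.0] before multiplying. A scalar sketch of that description, in plain Python, is given below. It follows the commit message only; the function names are hypothetical, and the actual SiluAndMulWithClamp implementation in vLLM may differ in detail.

```python
import math


def silu(x):
    """Numerically stable SiLU: x * sigmoid(x)."""
    if x >= 0:
        return x / (1.0 + math.exp(-x))
    e = math.exp(x)  # x < 0, so exp(x) cannot overflow
    return x * e / (1.0 + e)


def silu_and_mul_with_clamp(gate, up, limit=10.0):
    """Clamp silu(gate) from above and up on both sides, then multiply.

    Mirrors the wording of the commit message (an assumption about the
    real kernel): silu(gate) <= limit, -limit <= up <= limit.
    """
    g = min(silu(gate), limit)
    u = max(-limit, min(up, limit))
    return g * u


# Without clamping, silu(100) * 50 would be about 5000; with it, 10 * 10.
clamped = silu_and_mul_with_clamp(100.0, 50.0)
```

The clamp bounds the magnitude of the product by limit squared on the positive side, which is the property the commit relies on to keep the L2 GEMM inputs in range.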