0612c1ab54
use proper backend
2026-05-19 02:08:18 +00:00
f74447bfd0
Proper NVFP4 integration: quantized compressor/indexer + mapper fixes
...
Weight mapper fixes:
- Reorder substr renames: compressor renames first, then .self_attn.compressor.
→ .attn.mla_attn.compressor., then indexer renames (so indexer keys end up
under mla_attn after the compressor rename already fired)
- Add compressor param renames: kv_proj→wkv, gate_proj→wgate, kv_norm→norm,
position_bias→ape (checkpoint uses NVFP4 naming, model uses internal names)
- Add indexer param renames: q_b_proj→wq_b, kv_proj→compressor.wkv,
gate_proj→compressor.wgate, kv_norm→k_norm, position_bias→compressor.ape,
weights_proj stays (structural: compressor.indexer → indexer.compressor)
- Remove broken suffix renames (already fixed in prior commit)
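
The ordering constraint above can be illustrated with a small sketch. Each rule is applied to the output of the previous one, so a rule written against the post-rename prefix only fires if the prefix rename has already run. The key layout and rule strings here are hypothetical and chosen only to show the ordering effect; the real mapper's table may differ.

```python
# Ordered substring renames: rules run in sequence, each on the result of the
# previous one. All key names and rule strings are hypothetical examples.
RENAMES = [
    # 1. compressor parameter renames
    (".compressor.kv_proj.", ".compressor.wkv."),
    (".compressor.gate_proj.", ".compressor.wgate."),
    # 2. structural prefix rename
    (".self_attn.compressor.", ".attn.mla_attn.compressor."),
    # 3. indexer rename, written against the prefix produced by step 2
    (".mla_attn.compressor.indexer.q_b_proj.", ".mla_attn.indexer.wq_b."),
]


def rename_key(key, rules=RENAMES):
    """Apply every (old, new) substring rule to `key`, in list order."""
    for old, new in rules:
        key = key.replace(old, new)
    return key


compressor_key = "layers.0.self_attn.compressor.kv_proj.weight"
indexer_key = "layers.0.self_attn.compressor.indexer.q_b_proj.weight"

print(rename_key(compressor_key))
print(rename_key(indexer_key))
# With the rules reversed, the indexer rule runs before its prefix exists
# and silently does nothing.
print(rename_key(indexer_key, list(reversed(RENAMES))))
```

A rule that does not match leaves the key unchanged without raising, which is why a wrong ordering tends to surface later as missing or unexpected weights rather than as an immediate error.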
Model architecture fixes:
- Patch deepseek_compressor.py to pass quant_config (was None, but NVFP4
checkpoint has quantized compressor weights with input_scale/weight_scale)
- Patch deepseek_v4_attention.py indexer: weights_proj now uses quant_config
(was None, but checkpoint has quantized weights)
- Add indexer.compressor.fused_wkv_wgate stacking in load_weights
Infrastructure:
- Add deepseek_compressor.py to Dockerfile
- Force MoE backend to flashinfer_cutedsl (was auto-selecting FLASHINFER_TRTLLM)
- Update unit test to 50 cases (compressor + indexer + quantization scales)
2026-05-18 23:20:13 +00:00
a83d364d45
Switch to cudagraph_mode=NONE (not enforce-eager) for real inference testing
2026-05-18 15:05:52 +00:00
a83c332059
Fix docker-compose: remove orphaned compilation-config arg, enforce-eager mode
2026-05-18 12:54:14 +00:00
9e7639fba4
Add layer-by-layer diagnostic prints (CLAWMINE_DEBUG=1, enforce-eager)
...
When CLAWMINE_DEBUG=1, prints amax/mean/NaN/Inf after each layer.
Must run with --enforce-eager (data-dependent prints break Dynamo).
Gated by os.environ, so the prints are dead-code-eliminated during compilation.
2026-05-18 12:51:51 +00:00
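
A minimal sketch of the kind of env-gated per-layer diagnostic described in the commit above. It works on plain Python floats so that it runs without a GPU stack; the actual patch presumably operates on tensors. The helper names `layer_stats` and `debug_layer` are hypothetical; only the `CLAWMINE_DEBUG` variable comes from the commit message.

```python
import math
import os

# Read the flag once at import time so it is a plain module-level constant.
DEBUG = os.environ.get("CLAWMINE_DEBUG") == "1"


def layer_stats(values):
    """Return amax/mean over the finite entries plus NaN and Inf counts."""
    finite = [v for v in values if math.isfinite(v)]
    return {
        "amax": max((abs(v) for v in finite), default=0.0),
        "mean": sum(finite) / len(finite) if finite else 0.0,
        "nan": sum(1 for v in values if math.isnan(v)),
        "inf": sum(1 for v in values if math.isinf(v)),
    }


def debug_layer(name, values):
    """Print stats for one layer's output, only when the flag is set."""
    if DEBUG:
        print(name, layer_stats(values), flush=True)


debug_layer("layer_0", [1.0, -3.0, float("nan"), float("inf")])
```

Excluding non-finite entries from amax and mean keeps a single NaN from hiding the magnitude of the remaining values, which is usually what one wants when hunting for the first layer that goes wrong.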
8758bc93ca
crap shoot
2026-05-18 11:13:29 +00:00
87a223f1ac
Update CURRENT_BUG.md: current status, outstanding garbage output issue, hypotheses
2026-05-17 16:52:40 +00:00
c03438fc4e
crap shoot
2026-05-17 16:25:38 +00:00
366a0240a5
vllm tweaks
2026-05-17 07:14:58 +00:00
34c43958d0
vllm tweaks
2026-05-17 07:10:16 +00:00
da31ce7e1a
allow for cuda graphs again
2026-05-16 19:23:41 +00:00
99c11c218d
fucken a
2026-05-16 08:39:13 +00:00
a51ef3d2cf
fucken a
2026-05-16 08:23:27 +00:00
72bf750a0b
fix: revert to eager mode — CUDA graphs OOM with 175GB model
...
CUDA graph capture needs extra memory on top of the model weights.
With 175GB model on 178GB GPUs, there's no room.
Going back to --enforce-eager with 10-min RPC timeout. The first
inference request will be slow (2-3 min JIT compilation) but won't
crash. Subsequent requests are fast.
CUDA graph mode requires either more GPU memory or a smaller model.
2026-05-16 08:07:44 +00:00
8496ac99bc
dang clonkurs
2026-05-16 06:28:16 +00:00
2e4ff6b8d4
fix: increase vLLM RPC timeout to 10 min for first-request JIT
...
First inference triggers Triton/TileLang kernel JIT compilation (2-3 min).
The default 5-min RPC timeout kills the engine. Bumped to 10 min via
VLLM_RPC_TIMEOUT_MS so the first request survives compilation.
Not ideal — would prefer to warm up the kernels during startup.
But CUDA graphs don't work well with grouped GEMMs and variable
expert counts. Will investigate vLLM warmup shape config later.
2026-05-16 06:02:11 +00:00
830f042443
fix: PYTHONUNBUFFERED=1 so progress bars stream in real-time
...
Python block-buffers stdout when it is not attached to a TTY, as under Docker.
Docker only sees the buffer dumps, so all progress bars appear at once when
the step completes. PYTHONUNBUFFERED=1 disables that buffering, so prints
flush immediately.
2026-05-16 04:18:07 +00:00
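
For comparison, a per-call alternative to the environment variable is to flush explicitly after each progress update. This is only a sketch of that alternative, not what the commit above did; `format_progress` and `report_progress` are hypothetical helpers.

```python
import sys


def format_progress(done, total):
    """Build one progress line."""
    return f"loading {done}/{total}"


def report_progress(done, total, stream=sys.stdout):
    """Overwrite the current line and flush so the update is visible at once."""
    stream.write("\r" + format_progress(done, total))
    stream.flush()


for step in range(1, 4):
    report_progress(step, 3)
print()
```

Explicit flushing only covers code you control; setting PYTHONUNBUFFERED=1 also covers output written by third-party libraries.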
b04bff7e8b
feat: clean Dockerfile, docker-compose, import fixes for CuTeDSL build
...
Dockerfile:
- Removed: C++ CUTLASS extension build, TileLang install, CUTLASS clone
- Added: nvidia-cutlass-dsl==4.5.0 install, cutedsl/ copy
- Copy nvfp4_cutedsl.py to vllm models dir
- Verify step checks cutlass import
docker-compose.yml:
- Removed stale env vars (MEGA_MOE_DEBUG, MEGA_MOE_STATIC, etc.)
deepseek_v4.py:
- Fix import: vllm.nvfp4_cutedsl → vllm.model_executor.models.nvfp4_cutedsl
README.md:
- Updated results: 0% weight loss confirmed (bit-identical view-cast)
- 1.1% cosine loss is entirely from activation quantization
2026-05-16 03:50:07 +00:00
9908fd64d9
feat: CUTLASS NVFP4 mega_moe kernel — slot-based L1/L2, source-first SF remap
...
Major changes from initial TileLang prototype:
Kernel:
- CUTLASS NVFP4 block-scaled GEMM (SM100 Blackwell, OpClassBlockScaledTensorOp)
- Slot-based dispatch: L1 GEMM → SiLU+Mul per-slot → L2 GEMM → index_add scatter
- 1D slot_expert_ids passed to both L1 and L2 (no 2D topk_ids rebuild)
- slot_token gathered in cutlass_grouped_nvfp4_gemm when provided
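
The slot-based flow above can be sketched as a tiny pure-Python reference, assuming each slot is one (token, expert) pair described by the 1D arrays `slot_token` and `slot_expert_ids`. Routing weights, quantization and the grouped GEMM itself are left out; the weight names `w_gate`, `w_up` and `w_down` are hypothetical.

```python
import math


def silu(x):
    return x / (1.0 + math.exp(-x))


def matvec(w, x):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def moe_slots(x, slot_token, slot_expert_ids, w_gate, w_up, w_down):
    """Reference for: L1 GEMM, SiLU+Mul per slot, L2 GEMM, index_add scatter."""
    hidden = len(x[0])
    out = [[0.0] * hidden for _ in x]
    for tok, e in zip(slot_token, slot_expert_ids):
        gate = matvec(w_gate[e], x[tok])               # L1 GEMM, gate half
        up = matvec(w_up[e], x[tok])                   # L1 GEMM, up half
        act = [silu(g) * u for g, u in zip(gate, up)]  # SiLU+Mul, per slot
        y = matvec(w_down[e], act)                     # L2 GEMM
        for j, v in enumerate(y):                      # index_add scatter
            out[tok][j] += v
    return out


# One token routed to two experts, hidden size 1, intermediate size 1.
x = [[2.0]]
w_gate = [[[0.0]], [[1.0]]]
w_up = [[[1.0]], [[1.0]]]
w_down = [[[1.0]], [[1.0]]]
print(moe_slots(x, [0, 0], [0, 1], w_gate, w_up, w_down))
```

Because SiLU is nonlinear, applying it inside each slot before the scatter is not equivalent to applying it once after the expert paths have been summed.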
SF Remap (source-first):
- Iterates logical (m, k_sf) source grid, uses layout_sf(make_coord(m, k_sf))
for CUTLASS dest index — no idx2crd/flatten coordinate extraction
- 2D kernel launch: dim3 block(32,8), grid over (K_sf, MN)
- Uses cute::cosize() for physical allocation size (not cute::size)
- SFA: (MN, K_sf) row-major; SFB: (K_sf, MN) row-major (i.e. column-major
when viewed as (MN, K_sf))
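
The size/cosize distinction matters whenever a layout is padded or strided. A minimal sketch for flat layouts with non-negative strides follows: size counts logical elements, while cosize is the largest reachable physical index plus one, which is what the allocation has to cover. The functions are hypothetical Python stand-ins and do not model CuTe's hierarchical layouts.

```python
from math import prod


def layout_index(coord, stride):
    """Map a logical coordinate to a physical offset."""
    return sum(c * s for c, s in zip(coord, stride))


def size(shape):
    """Number of logical elements in the layout."""
    return prod(shape)


def cosize(shape, stride):
    """Largest reachable offset + 1 (valid for non-negative strides)."""
    last = [n - 1 for n in shape]
    return layout_index(last, stride) + 1


# A 4x2 layout whose second mode is padded out to a stride of 8.
shape, stride = (4, 2), (1, 8)
print(size(shape), cosize(shape, stride))
```

Allocating `size` elements for this layout would leave the last column out of bounds, since coordinate (3, 1) maps to offset 11.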
Weight transform:
- UE4M3 unpack with bit reinterpret (not value cast)
- Global scale folding (weight_scale_2) for gate/up split
- clamp(0,448) → float8_e4m3fn, transpose (N,K)→(K,N) for CUTLASS
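
To make "bit reinterpret (not value cast)" concrete, here is a sketch that splits a packed 32-bit word into four bytes and decodes each as an unsigned E4M3 value (4 exponent bits, 3 mantissa bits, bias 7). The little-endian byte order and the choice to ignore the top bit are assumptions; the function names are hypothetical.

```python
def unpack_u32(word):
    """Split a 32-bit word into 4 bytes, lowest byte first (assumed order)."""
    return [(word >> (8 * i)) & 0xFF for i in range(4)]


def decode_ue4m3(byte):
    """Interpret the bit pattern of `byte` as E4M3, ignoring the top bit."""
    if (byte & 0x7F) == 0x7F:
        return float("nan")  # all-ones exponent and mantissa is NaN in e4m3fn
    exp = (byte >> 3) & 0xF
    man = byte & 0x7
    if exp == 0:
        return (man / 8.0) * 2.0 ** -6  # subnormal
    return (1.0 + man / 8.0) * 2.0 ** (exp - 7)


word = 0x7E383000
scales = [decode_ue4m3(b) for b in unpack_u32(word)]
print(scales)

# A value cast treats the byte as an integer instead of a bit pattern.
print(float(0x38), "vs", decode_ue4m3(0x38))
```

The byte 0x7E decodes to 448, the largest finite E4M3 value, which matches the clamp(0,448) bound in the list above.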
No prepack cache:
- SFB remapped per-call inside CUTLASS (~µs, not the bottleneck)
- See README for why prepack cache must never return (OOM, CUDA graphs,
M-dependent layout, cross-layer collisions)
Stage activation:
- Nearest-neighbor E2M1 quantization (no clamp, no uniform steps)
- Per-tensor global scale → alpha for L2 GEMM
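
A sketch of nearest-neighbor quantization onto the E2M1 grid. The eight magnitudes are unevenly spaced, so each value is matched against the grid itself rather than rounded to a uniform step. Resolving ties toward the smaller magnitude and the `scale` argument are assumptions of this sketch, not details taken from the kernel.

```python
# Magnitudes representable in E2M1; the list index is also the 3-bit code.
E2M1_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)


def quantize_e2m1(x, scale):
    """Return the 4-bit code (sign bit + 3-bit magnitude) nearest to x/scale."""
    mag = abs(x) / scale
    # min() keeps the first minimum, so ties go to the smaller magnitude.
    code = min(range(len(E2M1_GRID)), key=lambda i: abs(E2M1_GRID[i] - mag))
    sign = 1 if x < 0 else 0
    return (sign << 3) | code


def dequantize_e2m1(nibble, scale):
    """Map a 4-bit code back to a float."""
    val = E2M1_GRID[nibble & 0x7] * scale
    return -val if nibble & 0x8 else val


for x in (2.4, 2.6, -5.2, 100.0):
    q = quantize_e2m1(x, 1.0)
    print(x, "->", q, "->", dequantize_e2m1(q, 1.0))
```

No explicit clamp is needed: a magnitude above 6 is already closest to the last grid point, so it maps to code 7.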
Bug fixes:
- _fold_global_scale: removed broken logical_widths branch
- unpack_ue4m3_u32: use int32 for CUDA bitwise ops, .view() instead of .to(),
N-D input support
- Correct expert param mapping for NVFP4 checkpoint
- SiLU applied per-slot (not after summing expert paths)
2026-05-15 11:38:18 +00:00