cleanup: remove debug prints, ready for testing

Current state:
- Token indices on CPU (avoids CuTeDSL GPU memory corruption)
- Scale assembly uses per-expert swizzle + scatter (matches reference)
- compute_activation_global_scales warmup gets ~0.97 cosine
- expert_offsets passed without leading 0 (matches pipeline)
- layertest + cudagraph_test pass
2026-05-17 08:30:41 +00:00
parent d635dcbbb6
commit 1330e2b2cf
1 changed file with 0 additions and 6 deletions
--- a/vllm/nvfp4_cutedsl.py
+++ b/vllm/nvfp4_cutedsl.py
@@ -242,12 +242,6 @@ class CuTeDSLMoERunner:
            sorted_token_ids = token_indices[sort_idx.cpu()].to(device)
            slot_hidden = hidden_states_sample[sorted_token_ids]
            
-            # Debug: verify slot_hidden
-            torch.cuda.synchronize()
-            _slot_check = sorted_token_ids[:8].cpu().tolist()
-            _slot_amax = slot_hidden.abs().max().item()
-            print(f"  Warmup: sorted_token_ids[:8]={_slot_check}, slot_hidden amax={_slot_amax:.6f}")
-            
            # L1: get exact gs from quantize_to_nvfp4
            _, _, l1_gs = quantize_to_nvfp4(slot_hidden)
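
Note on the "~0.97 cosine" figure in the commit message: the commit does not say which two tensors are compared, so the following is only a minimal sketch of how such a score could be computed, assuming it is the cosine similarity between a flattened reference output and the flattened output of the quantized path. The helper name `cosine_similarity` and the sample values are hypothetical and are not taken from `vllm/nvfp4_cutedsl.py`. Pure Python is used so the sketch runs without a GPU; on real tensors a tensor-level operation would presumably be used instead.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length, flattened vectors."""
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Hypothetical stand-ins for a reference output and a quantized output.
reference = [1.0, 2.0, 3.0, 4.0]
quantized = [1.0, 2.0, 3.0, 3.0]
score = cosine_similarity(reference, quantized)
print(f"cosine similarity: {score:.4f}")
```

A score of 1.0 means the two vectors point in exactly the same direction, so a value near 0.97 indicates close but not exact agreement. Cosine similarity ignores overall magnitude, so it does not by itself confirm that the global scales are correct.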