Major refactor to eliminate all post-load hacks:
- deepseek_v4.py: use upstream model with NVFP4 weight mapper only
(gate_proj→w1, up_proj→w3, down_proj→w2, .self_attn→.attn, .mlp→.ffn)
- Add CuTeDSLMoEExperts as a FusedMoEExpertsModular subclass
that wraps our CuTeDSL runner as a proper vLLM MoE backend
- Register CUTEDSL backend in the NVFP4 oracle
- Use ModelOptNvFp4Config for quantization dispatch (not DeepseekV4FP8Config)
- ModelOptNvFp4LinearMethod handles NVFP4 attention/shared expert projections
- Remove nvfp4_cutedsl.py, cutedsl_quant_method.py, utils.py from Dockerfile
- CuTeDSL runner moved to cutedsl/runner.py for clean imports
- cos_sin_cache float32 fix in deepseek_v4_attention.py
No more monkey-patching, no _convert_nvfp4_post_load, no CuTeDSLNvfp4Method.