diff --git a/MEMORY.md b/MEMORY.md
deleted file mode 100644
index e356c2f..0000000
--- a/MEMORY.md
+++ /dev/null
@@ -1,29 +0,0 @@
-# MEMORY.md - Long-Term Memory
-
-## Mike
-- Working on DeepSeek V4 Pro NVFP4 quantization + vLLM serving on B200 node
-- B200 node: 45.76.247.107, root, password in project .env
-- Repo: https://sweetapi.com/biondizzle/deepseek-v4-quant.git (modelopt-nvfp4 branch)
-
-## DeepSeek V4 NVFP4 Project
-- Successfully quantized: 881GB NVFP4 (Run 11), 8× B200, $161/run
-- modelopt 0.45.0.dev64 + transformers 5.8.0.dev0
-- **vLLM server running on B200 port 8000** as of May 11, 2026 🎉
-- We built the entire NVFP4→vLLM bridge from scratch (NVIDIA hasn't done this)
-- Abandoned mega_moe (no kernel, format mismatch), using standard FusedMoE instead
-
-### Key Technical Decisions
-- **wo_a**: NVFP4→BF16→FP8 with DeepGEMM block-scale format for BMM einsum
-- **Attention layers**: NVFP4→BF16 dequantization, UnquantizedLinearMethod
-- **Compressor**: Reconstructed fused_wkv_wgate from separate kv_proj+gate_proj in checkpoint
-- **MoE experts**: Stay NVFP4, use FLASHINFER_TRTLLM FusedMoE backend
-
-### Critical Bugs Fixed (May 11)
-1. DeepGEMM `sf.dim()` crash: weight_scale_inv must be DeepGEMM-formatted block scale tensor
-2. Compressor indexer shape mismatch: checkpoint keys have `.indexer.` sub-path
-3. All-ones block scale → garbage output: must use `torch.full(..., fp8_scale)` not `torch.ones`
-4. Block scale dtype: must be float32, not float8_e4m3fn
-
-### Outstanding
-- Output quality under investigation - FP4 is aggressive quantization
-- All code in patches/deepseek_v4.py on modelopt-nvfp4 branch
diff --git a/index.yaml b/index.yaml
deleted file mode 100644
index b7e81dd..0000000
--- a/index.yaml
+++ /dev/null
@@ -1,3 +0,0 @@
-apiVersion: v1
-entries: {}
-generated: "2026-04-17T19:18:02.693243217Z"
diff --git a/memory/2026-05-08-dataset-cache-fix.md b/memory/2026-05-08-dataset-cache-fix.md
deleted file mode 100644
index 6605bd0..0000000
--- a/memory/2026-05-08-dataset-cache-fix.md
+++ /dev/null
@@ -1,59 +0,0 @@
-# Session: 2026-05-08 16:49:00 UTC
-
-- **Session Key**: agent:main:main
-- **Session ID**: 9143ad68-d6c2-4b80-8dca-6aaa80b37c8b
-- **Source**: webchat
-
-## Conversation Summary
-
-assistant: Token saved. Now pre-download the dataset:
-assistant: Access denied even with the token. The token might not have accepted the terms for this specific dataset. Let me check if the modelopt script actually needs this dataset or if we can just use `--dataset` to override:
-assistant: You're right, my bad. Let me persist the token properly and pre-download the dataset. Token saved. Now pre-download the dataset: Access denied even with the token. The token might not have accepted the terms for this specific dataset. Let me check if the modelopt script actually needs this dataset or if we can just use `--dataset` to override:
-user: System: [2026-05-08 15:38:05 UTC] Exec completed (marine-n, code 0) :: Fetching 204 files: 0%| | 0/204 [00:00 .attn.mla_attn.compressor. .shared_experts.w2 -> .shared_experts.down_proj === orig_to_new_prefix === layers. -> model.layers. embe…
-System: [2026-05-10 09:34:38 UTC] Exec completed (briny-ri, code 0) :: ver pid=1) WARNING 05-10 09:34:35 [vllm.py:1252] Auto-initialization of reasoning token IDs failed. Please check whether your reasoning parser has implem…
-
-Sender (untrusted metadata):
-```json
-{
-  "label": "openclaw-control-ui",
-  "id": "openclaw-control-ui"
-}
-```
-
-[Sun 2026-05-10 09:35 UTC] you did this in detached mode? If you did ill watch and tell you when we're good
-assistant: Yeah, it's detached - `docker compose up -d`. The container's already past the weight mapper crash (the patch worked 🎉), so it's in the model loading phase now. 881GB across 8 GPUs takes a hot minute.
-
-You can tail it with:
-```
-docker compose -f /root/nvidia-meeting/docker-compose.yml logs -f
-```
-
-What to look for:
-- **Good sign**: `Loading safetensors checkpoint shards: X/95` progressing
-- **Bad sign**: Another traceback, `UnboundLocalError` again, or `KeyError`
-- **Victory**: `Uvicorn running on http://0.0.0.0:8000` or `Application startup complete`
-
-If it crashes again, save the last 50 lines of logs and I'll dig into whatever's next. The mapper patch covered the `self_attn` → `attn.mla_attn` renaming, but there might be other weight name mismatches I couldn't fully predict without seeing what vllm's internal model actually expects vs what modelopt gave us.
-user: Sender (untrusted metadata):
-```json
-{
-  "label": "openclaw-control-ui",
-  "id": "openclaw-control-ui"
-}
-```
-
-[Sun 2026-05-10 10:58 UTC] im in bed. check if it came up or if we neeed more patches
-assistant: Crashed again. Container exited. Let me get the root cause - the worker error:
-assistant: Same `UnboundLocalError: name_mapped`. The patch didn't take. Let me check if the patch script actually ran:
-assistant: The patch ran in the APIServer process (pid=1) but the worker processes (pid=1068-1075) are separate processes spawned later. They don't inherit the monkeypatch! The `WeightsMapper` is rebuilt from scratch inside each worker. I need a different approach - patch the actual file on disk inside the container, or use a volume mount to override it.
-
-Let me go deeper. The real issue is in `utils.py`'s `_load_module` → `AutoWeightsLoader.load_weights` → `_map_name`. The `name_mapped` is set by the mapper. If the mapper can't map a name, `name_mapped` is never assigned. Let me look at that function:
-assistant: Wait - the error is at `deepseek_v4.py` line 1579, which is `loader.load_weights(weights, mapper=self.hf_to_vllm_mapper)`. But `name_mapped` isn't in that code. It's in the `deepseek_v2.py` code. Let me check if the DeepSeek V4 model is using the V2 load_weights somehow, or if the error is coming from the `AutoWeightsLoader` internally:
-assistant: There's ANOTHER `load_weights` in `deepseek_v4.py` at line 1417 - that's NOT the `DeepseekV4ForCausalLM.load_weights` at line 1579. This is a different class (probably `DeepseekV4Model` or a sub-model). Let me see:
-assistant: There it is. Line 1417: `loaded_params.add(name_mapped)`. If none of the `expert_mapping` entries match (the `continue` on line 1403 skips), `name_mapped` is never set, and it falls through to `loaded_params.add(name_mapped)` which blows up.
-
-This is the MoE expert weight loader. The expert weights from modelopt have names like `model.layers.0.mlp.experts.0.gate_proj.weight` but this code expects a different naming convention. The `expert_mapping` comes from `get_expert_mapping()`. Let me see what it expects:
-user: Sender (untrusted metadata):
-```json
-{
-  "label": "openclaw-control-ui",
-  "id": "openclaw-control-ui"
-}
-```
-
-[Sun 2026-05-10 13:06 UTC] what happened?
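
A note on the `UnboundLocalError: name_mapped` traced in the session above: the failure pattern (a variable assigned only inside a matching loop branch, then used after the loop) can be reproduced in a few lines. The sketch below is not the vLLM loader; the function names and the mapping entry are hypothetical, chosen only to illustrate the pattern and one defensive rewrite.

```python
from typing import Iterable, List, Set, Tuple

Mapping = List[Tuple[str, str]]  # (checkpoint substring, parameter substring)


def load_expert_weights_buggy(names: Iterable[str], expert_mapping: Mapping) -> Set[str]:
    """Mirrors the failure pattern: name_mapped is bound only when an entry matches."""
    loaded = set()
    for name in names:
        for ckpt_key, param_key in expert_mapping:
            if ckpt_key not in name:
                continue  # no match: name_mapped is left untouched
            name_mapped = name.replace(ckpt_key, param_key)
            break
        # Raises UnboundLocalError if no entry has matched yet. If an earlier
        # name did match, the stale value would be silently reused instead.
        loaded.add(name_mapped)
    return loaded


def load_expert_weights_safe(
    names: Iterable[str], expert_mapping: Mapping
) -> Tuple[Set[str], List[str]]:
    """Resets name_mapped per weight and reports unmatched names instead of crashing."""
    loaded, skipped = set(), []
    for name in names:
        name_mapped = None
        for ckpt_key, param_key in expert_mapping:
            if ckpt_key in name:
                name_mapped = name.replace(ckpt_key, param_key)
                break
        if name_mapped is None:
            skipped.append(name)
            continue
        loaded.add(name_mapped)
    return loaded, skipped
```

Returning the skipped names makes a naming mismatch visible in the logs rather than surfacing as a crash deep inside a worker process.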
diff --git a/memory/2026-05-10.md b/memory/2026-05-10.md
deleted file mode 100644
index 7653b60..0000000
--- a/memory/2026-05-10.md
+++ /dev/null
@@ -1,198 +0,0 @@
-# 2026-05-10
-
-## DeepSeek V4 Pro NVFP4 - vLLM Serving Debug Session
-
-- Quantization completed successfully (Run 11, 881GB NVFP4)
-- Spent the day debugging vLLM serving of the modelopt NVFP4 checkpoint
-- Key finding: modelopt and vllm were never integrated for NVFP4 on DeepSeek V4
-- NVIDIA themselves haven't gotten this far - we're in uncharted territory
-
-### What we fixed:
-- Expert weight name mapping (gate_proj→w1, up_proj→w3, down_proj→w2)
-- mlp→ffn module naming
-- Attention: self_attn→attn.mla_attn, kv_proj→wkv, etc.
-- Compressor: kv_proj→wkv, gate_proj→wgate
-- kv_norm moved from compressor to attention level
-- Class attribute patching (hf_to_vllm_mapper)
-- Source file patching (workers are separate processes)
-- E2M1 FP4→BF16 unpacking for stacked attention params
-- Skip patterns for NVFP4 scale tensors on MergedColumnParallelLinear
-
-### What we abandoned:
-- mega_moe: No NVFP4 kernel exists, format mismatch (16-col vs 32-col blocks)
-- Runtime monkey-patching: Workers don't inherit patches
-
-### Open issues (stop point):
-1. MergedColumnParallelLinear + NVFP4 incompatibility - ModelOptNvFp4Config only handles Linear, not MergedColumn. Weight param is bf16 (should be uint8), no weight_scale registered for stacked params
-2. Unknown params from modelopt (compressor.position_bias) crash loading
-3. Current approach (unpack uint8→bf16, skip scales) loses calibration-optimized scales for attention weights
-
-### Repo state:
-- All code/patches/docker-compose synced and committed on modelopt-nvfp4 branch
-- README fully updated with vLLM serving run history, open issues, bug list
-- B200 node at 45.76.247.107, weights at /root/nvidia-meeting/DeepSeek-V4-Pro-NVFP4
-# 2026-05-10
-
-## DeepSeek V4 Pro NVFP4 - vLLM Serving Debug Session
-
-- Quantization completed successfully (Run 11, 881GB NVFP4)
-- Spent the day debugging vLLM serving of the modelopt NVFP4 checkpoint
-- Key finding: modelopt and vllm were never integrated for NVFP4 on DeepSeek V4
-- NVIDIA themselves haven't gotten this far - we're in uncharted territory
-
-### What we fixed:
-- Expert weight name mapping (gate_proj→w1, up_proj→w3, down_proj→w2)
-- mlp→ffn module naming
-- Attention: self_attn→attn.mla_attn, kv_proj→wkv, etc.
-- Compressor: kv_proj→wkv, gate_proj→wgate
-- kv_norm moved from compressor to attention level
-- Class attribute patching (hf_to_vllm_mapper)
-- Source file patching (workers are separate processes)
-- E2M1 FP4→BF16 unpacking for stacked attention params
-- Skip patterns for NVFP4 scale tensors on MergedColumnParallelLinear
-
-### What we abandoned:
-- mega_moe: No NVFP4 kernel exists, format mismatch (16-col vs 32-col blocks)
-- Runtime monkey-patching: Workers don't inherit patches
-
-### Open issues (stop point):
-1. MergedColumnParallelLinear + NVFP4 incompatibility - ModelOptNvFp4Config only handles Linear, not MergedColumn. Weight param is bf16 (should be uint8), no weight_scale registered for stacked params
-2. Unknown params from modelopt (compressor.position_bias) crash loading
-3. Current approach (unpack uint8→bf16, skip scales) loses calibration-optimized scales for attention weights
-
-### Repo state:
-- All code/patches/docker-compose synced and committed on modelopt-nvfp4 branch
-- README fully updated with vLLM serving run history, open issues, bug list
-- B200 node at 45.76.247.107, weights at /root/nvidia-meeting/DeepSeek-V4-Pro-NVFP4
-
----
-
-## vLLM NVFP4 Serving - Second Session (16:28–19:35 UTC)
-
-Mike gave autonomous work instructions. Key directive: use weights AS-IS (NVFP4), do NOT convert to MXFP4. Try FusedMoE first, then if stuck, build a mega_moe NVFP4 kernel from scratch.
-
-### Major breakthroughs (S11→S14 progress):
-
-**Key insight: vLLM attention forward bypasses quant_method, uses deepseek_v4_fp8_einsum directly**
-- The attention code reads `self.wo_a.weight` (expects fp8) and `self.wo_a.weight_scale_inv` directly
-- NVFP4 uint8 weights are incompatible with this FP8 kernel
-- Solution: **NVFP4→bf16→FP8 dequantize/requant at load time** for attention layers
-
-**S12 fixes applied (weight loading now succeeds to 94%):**
-1. **Substr mapping fix**: Removed `.mla_attn.` prefix from attention projections. The model has `fused_wqa_wkv`, `wq_b`, `wo_a`, `wo_b` at `attn.*` level, not `attn.mla_attn.*`. The stacking code then correctly maps `attn.wq_a` → `attn.fused_wqa_wkv`.
-2. **Skip patterns fix**: Only skip compressor scale tensors (compressor uses `UnquantizedLinearMethod` with `quant_config=None`). Attention and shared expert scales now correctly load via stacking logic.
-3. **Suffix mapping fix**: Removed `"head.weight": "lm_head.weight"` which caused `lm_head.weight` → `lm_lm_head.weight` doubling.
-4. **Resilient loading**: Unknown params (e.g., `compressor.position_bias`) silently skipped.
-
-**S13 - Weight loading SUCCESS (32 seconds!)**
-- All 95 safetensors loaded without KeyError
-- New error: `MergedColumnParallelLinear` has no `weight_scale_inv` (FP8 attribute)
-
-**S13.5 - o_a_proj discovery:**
-- modelopt did NOT quantize `o_a_proj` - it's bf16 in the checkpoint (no scales)
-- But vLLM creates `wo_a` with NVFP4 quant (uint8 weight + scales)
-- Fix: convert bf16→FP8 directly at load time, set weight_scale_inv
-
-**S14 - NVFP4→FP8 post-load conversion approach:**
-- Added `_convert_nvfp4_attention_to_fp8()` and `_convert_nvfp4_module_to_fp8()` methods to `DeepseekV4Model`
-- Converts all uint8 NVFP4 attention weights (fused_wqa_wkv, wq_b, wo_a, wo_b, gate_up_proj) to FP8 at load time
-- Steps: unpack E2M1 FP4→bf16, dequantize with block/global scales, requantize to FP8 e4m3, set weight_scale_inv
-- For o_a_proj (bf16, no scales): convert directly bf16→FP8
-- For compressor fused_wkv_wgate: stays bf16 (UnquantizedLinearMethod)
-- For MoE experts: handled natively by ModelOptNvFp4FusedMoE
-
-**Bug found: E2M1 LUT indexing off-by-one**
-- FP4 4-bit values are 0-15 (bit 3 = sign, bits 0-2 = magnitude)
-- LUT has 8 entries (magnitudes 0-7), but code was indexing with full 4-bit value (0-15) → CUDA assert
-- Fix: mask with `& 0x07` for magnitude index, apply sign from bit 3 separately
-
-**Bug found: method placement inside Python class**
-- `_convert_nvfp4_attention_to_fp8` was being placed at top level (0 indent) instead of inside `DeepseekV4Model`
-- The class actually ends at `finalize_mega_moe_weights()` (line ~1600), followed by top-level `hc_head` function
-- Had to insert methods BEFORE the `@torch.compile` decorator that marks the class boundary
-
-**Bug found: logger not available in method**
-- `logger.info_once()` isn't accessible inside the conversion methods
-- Replaced with `print(f"...")` for now
-
-### Current status (as of 19:35 UTC):
-- Weight loading + NVFP4→FP8 conversion code is in place
-- Last test was running (loading 880GB checkpoint)
-- E2M1 sign handling fix applied but NOT YET TESTED
-- Need to fix `logger` → `print` issue
-- After load succeeds: FusedMoE expert weight handling needs verification
-- If FusedMoE fails: need to build mega_moe NVFP4 kernel
-
-### Key files on B200 node:
-- Patch: `/root/nvidia-meeting/deepseek-v4-quant/patches/deepseek_v4.py`
-- Docker: `docker compose up -d` (TP=8, no mega_moe, FLASHINFER_TRTLLM attn)
-- Weights: `/root/nvidia-meeting/DeepSeek-V4-Pro-NVFP4/`
-
-### Architecture decisions:
-- NVFP4→FP8 for attention/shared_experts (requant, preserves FP8 kernel compat)
-- BF16 for compressor (UnquantizedLinearMethod, no quant_config)
-- Native NVFP4 for MoE experts (ModelOptNvFp4FusedMoE handles it)
-- UnquantizedLinearMethod as no-op quant_method (attention forward bypasses it anyway)
-
----
-
-## vLLM NVFP4 Serving - Third Session (23:05+ UTC)
-
-### Current state of the B200 node:
-- Docker container ran 27 min ago and crashed with `BFloat16 != Float8_e4m3fn`
-- Uncommitted changes to `patches/deepseek_v4.py` (the _convert_nvfp4_post_load methods)
-- Repo on `modelopt-nvfp4` branch, last commit `db16be8`
-
-### Crash analysis (S15 - `BFloat16 != Float8_e4m3fn`):
-
-Weight loading succeeds (95/95, 330s). Post-load conversion reports: 122 layers → FP8, 183 → BF16. MoE setup runs. Crash during profile_run/_dummy_run.
-
-**Root cause**: `_convert_nvfp4_post_load` converts `fused_wqa_wkv` to FP8 and sets `quant_method = UnquantizedLinearMethod()`. The attention forward calls `self.fused_wqa_wkv(hidden_states)` which goes through `UnquantizedLinearMethod.forward()` → `F.linear(bf16_input, fp8_weight)` → dtype mismatch.
-
-**Key insight about the attention forward paths**:
-- `wo_a`: Attention code reads `self.wo_a.weight` and `self.wo_a.weight_scale_inv` DIRECTLY, passes to `deepseek_v4_fp8_einsum`. This bypasses `quant_method`. FP8 conversion works here.
-- `fused_wqa_wkv`: Called via `self.fused_wqa_wkv(hidden_states)` → `MergedColumnParallelLinear.forward()` → `quant_method.forward()`. **Cannot be FP8 with UnquantizedLinearMethod**.
-- `wq_b`, `wo_b`: Called via normal `.forward()`. Need BF16 + UnquantizedLinearMethod.
-- `compressor.fused_wkv_wgate`: Called via `torch.mm(hidden_states, weight.T, out_dtype=torch.float32)` DIRECTLY. **Needs BF16 weight** - currently uint8 (not in any conversion set!).
-
-**Critical finding from safetensors**: `o_a_proj.weight` is BF16 (modelopt did NOT quantize it). So `wo_a` weight is already BF16, not NVFP4. The post-load conversion code's `dtype != uint8` check skips it. This means `wo_a.weight` stays BF16 and `wo_a.weight_scale_inv` is never created. When `deepseek_v4_fp8_einsum` tries to read it as FP8 → crash.
-
-**Wait, but the log says 122 → FP8.** 61 layers × 2 (fused_wqa_wkv + wo_a) = 122. If wo_a.weight is BF16 and gets skipped, only 61 → FP8. The 122 count means wo_a IS being converted somehow. Hypothesis: `ModelOptNvFp4LinearMethod.create_weights()` creates `wo_a.weight` as uint8. When the BF16 checkpoint data is loaded into the uint8 param, the weight_loader might be casting it, or the param might be updated to BF16. Need to verify.
-
-### Unfixed bugs from S14 (still present):
-1. E2M1 sign handling fix applied but NOT TESTED
-2. `logger` → `print` issue in conversion methods
-
-### Compressor `fused_wkv_wgate` - PENDING CRASH:
-- NOT in any conversion set (fp8_proj_names, bf16_proj_names, bf16_shared_names)
-- Weight is uint8 after loading (NVFP4 packed)
-- Forward uses `torch.mm(hidden_states, weight.T, out_dtype=torch.float32)` directly
-- uint8 × BF16 would crash with a different error than the current one
-- Needs BF16 dequantization in post-load conversion
-
-### Checkpoint key format (verified from safetensors):
-- `model.layers.0.self_attn.q_a_proj.weight` - uint8
-- `model.layers.0.self_attn.q_a_proj.weight_scale` - float8_e4m3fn (block scale)
-- `model.layers.0.self_attn.q_a_proj.weight_scale_2` - float32 (per-tensor)
-- `model.layers.0.self_attn.q_a_proj.input_scale` - float32
-- `model.layers.0.self_attn.o_a_proj.weight` - **BF16** (NOT quantized by modelopt)
-- `model.layers.0.self_attn.o_b_proj.weight` - uint8
-- `model.layers.0.self_attn.kv_proj.weight` - uint8
-- `model.layers.0.self_attn.compressor.kv_proj.weight` - uint8
-- `model.layers.0.self_attn.compressor.gate_proj.weight` - uint8
-- `model.layers.0.self_attn.compressor.position_bias` - BF16 (unknown param, skipped)
-- Expert scales: `.weight_scale`, `.weight_scale_2`, `.input_scale` (NOT `.scale`)
-
-### FusedMoE NVFP4 status:
-- `ModelOptNvFp4FusedMoE` creates proper uint8 weights + float8_e4m3fn block scales + float32 per-tensor/input scales
-- `process_weights_after_loading` calls `convert_to_nvfp4_moe_kernel_format` then `make_nvfp4_moe_kernel`
-- Uses `cutlass_fp4_gemm` via nvfp4 backend
-- Warning: `w1_weight_scale_2 must match w3_weight_scale_2` - modelopt gives different global scales to w1 and w3, but FusedMoE uses a single w13_weight_scale_2 (takes w1's). Minor accuracy impact.
-- `expert_dtype: fp4` in config - causes weight mapper to use `.scale` → `.weight_scale` regex, but checkpoint already uses `.weight_scale` directly, so regex is a no-op. Correct behavior.
-- `scale_fmt: "ue8m0"` in config - used by attention FP8 einsum. Correct for NVFP4.
-
-### Config verification:
-- `compress_ratios` ✅ (copied from BF16 source)
-- `scale_fmt: "ue8m0"` ✅ (added by us)
-- `rope_parameters` ✅ (flattened)
-- `expert_dtype: fp4` ✅ (original, correct for weight mapper regex)
diff --git a/memory/2026-05-11.md b/memory/2026-05-11.md
deleted file mode 100644
index af54055..0000000
--- a/memory/2026-05-11.md
+++ /dev/null
@@ -1,58 +0,0 @@
-# 2026-05-11 - DeepSeek V4 NVFP4 vLLM Serving: Full End-to-End
-
-## 🎉 SERVER RUNNING ON PORT 8000
-
-The vLLM server successfully loads the NVFP4 model and serves API requests on 8× B200.
-
-### What We Fixed (Session Summary)
-
-#### 1. DeepGEMM `sf.dim()` Assertion (CRITICAL)
-- **Error**: `Assertion error layout.hpp:94: sf.dim() == num_groups + 2`
-- **Cause**: `weight_scale_inv` was 1D per-tensor scale. DeepGEMM expects 2D/3D block-scale tensor from `transform_sf_into_required_layout`.
-- **Fix**: Use `deepgemm_post_process_fp8_weight_block(wq, ws, quant_block_shape=(128,128), use_e8m0=True)` to produce correct block-scale format. Store result in `weight_scale_inv`.
-- **Key insight**: The attention runtime reads `self.wo_a.weight_scale_inv` as `b_scale` for the einsum. It MUST be the DeepGEMM-formatted block scale.
-
-#### 2. Block Scale dtype
-- **Error**: `Expected float32 or float8_e8m0fnu, got float8_e4m3fn`
-- **Fix**: Create block scale as `dtype=torch.float32`
-
-#### 3. Missing `deepgemm_post_process` args
-- **Error**: `missing 2 required positional arguments: 'quant_block_shape' and 'use_e8m0'`
-- **Fix**: Pass `quant_block_shape=(128, 128)` and `use_e8m0=True`
-
-#### 4. Compressor Indexer Shape Mismatch (CRITICAL)
-- **Error**: `split_with_sizes expects 2048, got split_sizes=[256, 256]`
-- **Cause**: `_reconstruct_compressor_weight` used wrong checkpoint prefix for indexer. Main compressor keys: `compressor.kv_proj.*`. Indexer keys: `compressor.indexer.kv_proj.*`. Loading main compressor weight into indexer's fused_wkv_wgate = 4× size mismatch.
-- **Fix**: Added `sub_path` parameter, pass `".indexer"` for indexer compressors.
-
-#### 5. All-Ones Block Scale → Garbage Output (CRITICAL)
-- **Symptom**: Server runs, outputs tokens, but text is incoherent gibberish (repeating "Palm", "sulfuric", "东桷")
-- **Cause**: Block scale was `torch.ones(...)` = 1.0. DeepGEMM divides by block scale at runtime, so output was divided by 1.0 instead of actual fp8_scale.
-- **Fix**: `torch.full(..., fp8_scale.item())` - fill each block with the per-tensor FP8 scale value.
-
-### Conversion Summary
-- 61 NVFP4→FP8 (wo_a attention, DeepGEMM block-scale BMM einsum)
-- 0 BF16→FP8
-- 305 attn/shared→BF16 (UnquantizedLinearMethod)
-- 91 compressor→BF16 (reconstructed from separate NVFP4 kv_proj+gate_proj)
-- MoE experts: stay NVFP4 (FLASHINFER_TRTLLM FusedMoE backend)
-
-### Architecture Map
-```
-wo_a → FP8 + DeepGEMM block scale (weight_scale_inv = dg_ws)
-fused_wqa_wkv, wo_b → BF16 (UnquantizedLinearMethod)
-compressor.fused_wkv_wgate → BF16 (read from checkpoint, unpack, dequant, cat)
-shared_expert → FP8 (Fp8LinearMethod, DeepGEMM)
-MoE w13/w2 → NVFP4 (FusedMoE, FLASHINFER_TRTLLM)
-```
-
-### Key Code Locations
-- Patch: `/root/nvidia-meeting/deepseek-v4-quant/patches/deepseek_v4.py`
-- Runtime attention: `deepseek_v4_attention.py:319` - reads `wo_a.weight_scale_inv`
-- Runtime einsum: `deepseek_v4_fp8_einsum` → DeepGEMM `fp8_einsum`
-- DeepGEMM scale format: `deepgemm_post_process_fp8_weight_block` in `fp8_utils.py`
-- Compressor forward: `deepseek_compressor.py:281` - `kv, score = kv_score.split(...)`
-
-### Outstanding Issues
-- **Output quality**: Still producing garbled text after block-scale fix. Need to verify the latest fix (fp8_scale in block scale) produces coherent output.
-- Possible causes if still garbled: subtle dequant bug, sign handling in E2M1 unpack, wrong scale ordering
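
Since sign handling in the E2M1 unpack is listed above as a possible cause, a small reference decoder may help when checking the real conversion code. This is a pure-Python sketch, not the code in `patches/deepseek_v4.py`. It assumes the standard FP4 E2M1 magnitude table and, as a labeled assumption, that the low nibble of each packed byte holds the first element; the function names are hypothetical.

```python
from typing import Tuple

# Magnitudes selected by the 3 low bits of an E2M1 code (standard FP4 E2M1 values).
E2M1_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)


def decode_e2m1(code: int) -> float:
    """Decode one 4-bit code: bit 3 is the sign, bits 0-2 index the 8-entry table."""
    if not 0 <= code <= 0x0F:
        raise ValueError(f"E2M1 code must fit in 4 bits, got {code}")
    magnitude = E2M1_MAGNITUDES[code & 0x07]  # mask first: never index with 8..15
    return -magnitude if code & 0x08 else magnitude


def unpack_byte(packed: int) -> Tuple[float, float]:
    """Split one uint8 into two FP4 values.

    Assumption: the low nibble is the first element. Verify this against the
    checkpoint's actual packing order before relying on it.
    """
    return decode_e2m1(packed & 0x0F), decode_e2m1((packed >> 4) & 0x0F)
```

For example, `unpack_byte(0xA7)` decodes the low nibble `0x7` to `6.0` and the high nibble `0xA` to `-1.0`. Comparing a few hand-decoded bytes like this against the tensor path is a cheap way to rule the unpack in or out.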