1. Compressor positional bias was being added to BOTH the gate (softmax
logits) AND the KV content. Per paper eq. 9-12, the positional bias
applies only to the softmax logits (Z+B), NOT to the KV content (C).
Adding pb to kv_val leaks the learned positional bias into every
compressed KV entry. Fixed in both the CSA and HCA paths in
compressor_reduce.cu.
2. SwiGLU clamp ordering: the code was clamping silu(gate) instead of
clamping the raw gate before SiLU. Per paper §4.2.3, the gate is clamped
first, gate = clamp(gate, max=limit), and the output is then
silu(gate) * clamp(up). Fixed in moe.py (both unfused paths) and
fused_swiglu.py (CuTeDSL kernel). shared_expert.py was already correct.
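
Illustration for item 1: a minimal pure-Python sketch of the difference, not the code in compressor_reduce.cu. It assumes the compressor pools one block of entries with a softmax over Z+B and a weighted sum of C, reduced here to scalars. The function names and the scalar layout are hypothetical.

```python
import math

def compress_block(z, b, c):
    """Fixed behaviour: bias b enters the softmax logits only.

    z, b, c are equal-length lists of floats (logits, positional bias,
    KV content). Scalar stand-in for the real per-channel kernel.
    """
    logits = [zi + bi for zi, bi in zip(z, b)]
    m = max(logits)                          # subtract max for stability
    w = [math.exp(l - m) for l in logits]
    s = sum(w)
    # Content c is pooled as-is; no bias is added to it.
    return sum(wi / s * ci for wi, ci in zip(w, c))

def compress_block_buggy(z, b, c):
    """Old behaviour: bias b is also added to the content before pooling."""
    biased_content = [ci + bi for ci, bi in zip(c, b)]
    return compress_block(z, b, biased_content)

z = [0.0, 0.0]
b = [1.0, -1.0]
c = [2.0, 2.0]
fixed = compress_block(z, b, c)
buggy = compress_block_buggy(z, b, c)
```

With constant content, the fixed version returns that constant because the softmax weights sum to 1, while the buggy version shifts the result by a weighted average of the bias.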
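
Illustration for item 2: a scalar sketch of the two clamp orderings, not the code in moe.py or fused_swiglu.py. The function names are hypothetical, and the symmetric [-limit, limit] clamp on up is an assumption, since this note only specifies max=limit for the gate.

```python
import math

def silu(x: float) -> float:
    # SiLU(x) = x * sigmoid(x); fine for the moderate inputs used here.
    return x / (1.0 + math.exp(-x))

def clamp_up(up: float, limit: float) -> float:
    # ASSUMPTION: symmetric clamp on `up`; the real bounds may differ.
    return max(-limit, min(up, limit))

def swiglu_fixed(gate: float, up: float, limit: float) -> float:
    # Clamp the raw gate from above, THEN apply SiLU.
    g = min(gate, limit)
    return silu(g) * clamp_up(up, limit)

def swiglu_buggy(gate: float, up: float, limit: float) -> float:
    # Old ordering: SiLU first, then clamp the activation.
    g = min(silu(gate), limit)
    return g * clamp_up(up, limit)

limit = 7.0
fixed_out = swiglu_fixed(10.0, 1.0, limit)   # equals silu(7.0)
buggy_out = swiglu_buggy(10.0, 1.0, limit)   # equals 7.0
```

The two orderings agree whenever gate <= limit and differ only above it: the fixed version saturates at silu(limit), which is slightly below limit, while the old ordering saturates at limit itself.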