Previous approach used @CustomOp.register which doesn't create
torch.ops.vllm.mhc_pre. The model code calls torch.ops.vllm.mhc_pre()
directly, which requires direct_register_custom_op.
Use direct_register_custom_op to register mhc_pre, mhc_post,
mhc_fused_post_pre, and hc_head_fused_kernel as PyTorch custom ops
with torch (eager) implementations.
Patch kernels/mhc/__init__.py to import from both .torch (original)
and .mhc_torch_ops (our replacements), skipping tilelang import.