nvfp4-megamoe-kernel

Files

biondizzle 91c370360a Fix CuTeDSL from_dlpack device mismatch in CUDA graph capture (v3)

Patch torch.cuda.current_device to return the tensor's device index
during from_dlpack calls inside CUDA graph capture. This bypasses the
device check in __dlpack__ without changing the CUDA stream (which
caused 'Capture must end on the same stream' in v1) and without
triggering a cross-device copy (which caused 'Cannot copy between
CPU and CUDA tensors' in v2).
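
A minimal sketch of the patching idea described in this commit message, not the actual patch from the repository. The helper names `override_current_device` and `convert_on_tensor_device` are hypothetical, and the `cuda_module` and `convert` parameters are assumptions standing in for `torch.cuda` and the CuTeDSL `from_dlpack` call. Stand-in objects are used so the sketch runs without a GPU.

```python
import contextlib
import types


@contextlib.contextmanager
def override_current_device(cuda_module, device_index):
    """Temporarily make cuda_module.current_device() return device_index.

    Hypothetical helper: only the reported device index changes; no stream
    is switched and no tensor is copied.
    """
    original = cuda_module.current_device
    cuda_module.current_device = lambda: device_index
    try:
        yield
    finally:
        # Always restore the real function, even if the conversion raises.
        cuda_module.current_device = original


def convert_on_tensor_device(tensor, convert, cuda_module):
    """Run convert(tensor) while current_device() reports the tensor's device."""
    with override_current_device(cuda_module, tensor.device.index):
        return convert(tensor)


# Stand-ins (assumptions for illustration): the "current" device is 0,
# while the tensor lives on device 1.
fake_cuda = types.SimpleNamespace(current_device=lambda: 0)
fake_tensor = types.SimpleNamespace(device=types.SimpleNamespace(index=1))

# The conversion callback records what current_device() reports during the call.
seen_during_call = convert_on_tensor_device(
    fake_tensor, lambda t: fake_cuda.current_device(), fake_cuda
)
seen_after_call = fake_cuda.current_device()
```

With real PyTorch, one would presumably pass `torch.cuda` as `cuda_module` and the actual `from_dlpack` function as `convert`; whether that matches the repository's patch in detail is not confirmed by the commit message.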

2026-06-03 21:09:12 +00:00

_archive

Cleanup Step 2: Archive Lineage P code, fix broken imports

2026-06-02 19:27:07 +00:00

cache

Cleanup Step 2: Archive Lineage P code, fix broken imports

2026-06-02 19:27:07 +00:00

decode

CUDA graph: Fix remaining sync violations from B200 detector run 2

2026-06-03 17:20:34 +00:00

kernels

Fix compressor: do not add positional bias to KV content