biondizzle

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-06-04 04:13:43 +00:00

fae61d3ef7 Add c10/cuda/CUDAStream.h include for getCurrentCUDAStream

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-06-04 03:58:01 +00:00

ee86969f6c Fix CUDA stream: use c10::cuda::getCurrentCUDAStream() directly in kernel launch

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-06-04 03:43:08 +00:00

e26c28a1ce Fix CUDA stream API: getCurrentCUDAStream().stream()

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-06-04 03:29:13 +00:00

9b3917e248 Fix blackwell_swizzle.cu: add pybind11 bindings for torch extension loader

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-06-04 03:14:29 +00:00

5487a58df4 Fix NameError: add rows/cols variables to MoE swizzle

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-06-04 03:03:04 +00:00

a434545d12 Blackwell swizzle CUDA kernel for CUDA graph capture

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-06-04 02:42:02 +00:00

e7766254b7 Pre-allocate ALL GEMM output buffers for CUDA graph capture

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-06-04 02:06:20 +00:00

676a0448c0 CRITICAL FIX: _l1_out_buf was 2x too narrow — caused GPU memory corruption

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-06-04 01:49:14 +00:00

0890e578f4 DEBUG: print l1_out shape before gate/up split

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-06-04 01:38:26 +00:00

8546ed725f DEBUG: check SE input magnitude

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-06-04 01:30:31 +00:00

26ecf96328 DEBUG: check intermediate magnitude before SE L2

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-06-04 01:27:28 +00:00

5303d6a82f DEBUG: test copy_ with contiguous slice vs scalar assign for gsa

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-06-04 01:16:40 +00:00

ccbc713658 DEBUG: check gsa values and pinpoint exact failing operation

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-06-04 01:05:49 +00:00

e77455c3ba DEBUG: add sync inside quantize_nvfp4_gpu_fused to catch async errors

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-06-04 01:03:38 +00:00

55def5eef9 Restore A/B split + gsa scalar fix (error is pre-existing, not regression)

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-06-04 00:53:12 +00:00

59eccd04ab REVERT: test if cudaErrorInvalidValue is pre-existing or regression

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-06-04 00:41:36 +00:00

5e3ced0b60 DEBUG: isolate which kernel causes cudaErrorInvalidValue in SE L2 path

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-06-04 00:30:33 +00:00

b314fde9b7 Fix gsa copy_ cudaErrorInvalidValue: replace view-based copy_ with scalar assignment

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-06-04 00:15:40 +00:00

993bb345d1 DEBUG: fix VERBOSE reference in shared_expert, always print L2 gsa debug

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-06-04 00:05:10 +00:00

f0f87df906 DEBUG: add sync + shape prints to shared_expert L2 gsa copy