biondizzle
  • Joined on 2025-12-10
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-06-04 04:13:43 +00:00
fae61d3ef7 Add c10/cuda/CUDAStream.h include for getCurrentCUDAStream
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-06-04 03:58:01 +00:00
ee86969f6c Fix CUDA stream: use c10::cuda::getCurrentCUDAStream() directly in kernel launch
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-06-04 03:43:08 +00:00
e26c28a1ce Fix CUDA stream API: getCurrentCUDAStream().stream()
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-06-04 03:29:13 +00:00
9b3917e248 Fix blackwell_swizzle.cu: add pybind11 bindings for torch extension loader
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-06-04 03:14:29 +00:00
5487a58df4 Fix NameError: add rows/cols variables to MoE swizzle
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-06-04 03:03:04 +00:00
a434545d12 Blackwell swizzle CUDA kernel for CUDA graph capture
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-06-04 02:42:02 +00:00
e7766254b7 Pre-allocate ALL GEMM output buffers for CUDA graph capture
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-06-04 02:06:20 +00:00
676a0448c0 CRITICAL FIX: _l1_out_buf was 2x too narrow — caused GPU memory corruption
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-06-04 01:49:14 +00:00
0890e578f4 DEBUG: print l1_out shape before gate/up split
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-06-04 01:38:26 +00:00
8546ed725f DEBUG: check SE input magnitude
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-06-04 01:30:31 +00:00
26ecf96328 DEBUG: check intermediate magnitude before SE L2
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-06-04 01:27:28 +00:00
5303d6a82f DEBUG: test copy_ with contiguous slice vs scalar assign for gsa
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-06-04 01:16:40 +00:00
ccbc713658 DEBUG: check gsa values and pinpoint exact failing operation
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-06-04 01:05:49 +00:00
e77455c3ba DEBUG: add sync inside quantize_nvfp4_gpu_fused to catch async errors
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-06-04 01:03:38 +00:00
55def5eef9 Restore A/B split + gsa scalar fix (error is pre-existing, not regression)
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-06-04 00:53:12 +00:00
59eccd04ab REVERT: test if cudaErrorInvalidValue is pre-existing or regression
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-06-04 00:41:36 +00:00
5e3ced0b60 DEBUG: isolate which kernel causes cudaErrorInvalidValue in SE L2 path
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-06-04 00:30:33 +00:00
b314fde9b7 Fix gsa copy_ cudaErrorInvalidValue: replace view-based copy_ with scalar assignment
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-06-04 00:15:40 +00:00
993bb345d1 DEBUG: fix VERBOSE reference in shared_expert, always print L2 gsa debug
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-06-04 00:05:10 +00:00
f0f87df906 DEBUG: add sync + shape prints to shared_expert L2 gsa copy