biondizzle

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-05-28 05:27:37 +00:00

00ac46c9d3 FMHA SM100: Phase 1 — reference scalar implementation

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-05-28 05:26:00 +00:00

6f7449ce71 FMHA SM100: Fix tcgen05.mma PTX syntax — correct register constraints

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-05-28 05:24:35 +00:00

a11a245307 fix: use unsigned short for BF16 storage, inline PTX for conversions

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-05-28 05:22:25 +00:00

2d4e2c57e0 auto: pre-test commit

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-05-28 05:22:16 +00:00

97df02ea07 fix: -Xcompiler -fPIC for nvcc shared library

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-05-28 05:21:43 +00:00

4dfb71bc20 test: nvcc direct compilation test (avoid torch JIT __bf16 ICE)

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-05-28 05:20:32 +00:00

373900fa08 FMHA SM100: Fix launch wrapper to match new kernel API

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-05-28 05:19:39 +00:00

a30ebfb197 FMHA SM100: Full kernel with TMEM PTX, UMMA descriptors, softmax loop

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-05-28 05:16:43 +00:00

09dfd4a41f fix: rename .cpp to .cu for CUDA compilation

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-05-28 05:15:55 +00:00

4c194b7254 fix: add CUDA include path for host compiler

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-05-28 05:15:09 +00:00

48baea7728 FMHA SM100: Remove CUTLASS includes, write raw PTX inline asm

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-05-28 05:14:04 +00:00

88d5995ec9 fix: define bf16_t using __bf16 built-in, avoid cuda_bf16.h bug

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-05-28 05:13:21 +00:00

f0660d0bd7 fix: use C++20 for cuda_bf16.h compat

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-05-28 05:13:04 +00:00

6bd3356582 fix: include cuda_bf16.h unconditionally, add --expt-relaxed-constexpr

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-05-28 05:12:32 +00:00

c1266b5275 fix: include cuda_bf16.h only in device code

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-05-28 05:12:09 +00:00

a64e55665b fix: avoid cuda_bf16.h, use inline PTX for BF16 conversion

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-05-28 05:11:40 +00:00

1734d13f60 fix: restore cuda_bf16.h include

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-05-28 05:11:17 +00:00

8783a25deb fix: guard cuda_bf16.h with __CUDA_ARCH__

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-05-28 05:10:44 +00:00

5e389b5ed9 fix: remove duplicate desc declaration

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-05-28 05:10:17 +00:00

7ac2499266 fix: defer UMMA descriptor — use placeholder for now