biondizzle
  • Joined on 2025-12-10
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-21 23:24:47 +00:00
b034c915d1 10-warp debug: MMA=warp4 TMA=warp5 idle=6-9 still gives cosine 0.29
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-21 23:11:11 +00:00
0b8f4da323 Layer dispatch: config, schedule, attention/FFN sub-blocks, TransformerLayer
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-21 22:07:55 +00:00
c681b591a0 10-warp idle test: no crash but cosine 0.29 (6-warp gives 0.999999)
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-21 22:04:32 +00:00
c3a9e53253 Router: Blackwell-native fused decode kernel — real CuTeDSL implementation
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-21 21:58:38 +00:00
a813d2824b Router: clean up dense_router_decode.py — realistic architecture, no fake code
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-21 21:54:17 +00:00
fb243a4133 Router: full kernel stack — hash, topk, activation+topk, dense decode/prefill
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-21 21:20:41 +00:00
a4d12fd560 WIP: correction warp group architecture - compiles, illegal address at runtime
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-21 20:13:54 +00:00
bb3ad3d2ef BREAKTHROUGH: cosine 0.993 for n=128! PV-partitioned P row sum works.
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-21 19:26:18 +00:00
7d1c402a6d WIP: TMEM vector bridge not working (same cosine 0.513)
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-21 19:16:18 +00:00
cae87fd744 WIP: confirmed row_sum is wrong (5.5 vs correct 29.22 for row 0)
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-21 19:10:49 +00:00
8eb569e31c BREAKTHROUGH: Found the real bug!
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-21 19:09:34 +00:00
c09c660110 WIP: scalar C9 normalization - confirmed inv_row_sum is wrong
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-21 18:59:22 +00:00
ce91aa26e4 WIP: QK-partitioned C9 normalization (does not work)
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-21 18:55:03 +00:00
1fa093ee12 BREAKTHROUGH: unnormalized P@V cosine 0.999998 for n=128!
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-21 18:45:32 +00:00
c2901b2ecc WIP: TMEM vector for per-row row_sum (not yet working)
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-21 18:04:23 +00:00
4c203809ef WIP: Stage C softmax - partial progress
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-21 17:58:06 +00:00
8e1facef01 Stage C fixes: pv_done_bar sync, acc_scale with scale, fastmath=True
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-21 17:49:29 +00:00
58ca480fd1 Stage C: add validation harness with real softmax reference (C1)
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-21 17:40:27 +00:00
e8485b9cf5 README: add full DSV4 pipeline architecture diagram (CSA/HCA, not MLA)
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-21 17:34:49 +00:00
364d9edcd3 README: update for new dsv4/ package structure