DeepGEMM

Author	SHA1	Message	Date
biondizzle	e608a20dec	docs: major README update - packed FP4 SMEM layout, L1 epilogue, TMA descriptors. Added detailed documentation of the packed FP4 architecture: mxf4nvf4 reads packed (2 per byte), NOT unpacked like mxf8f6f4; SMEM layout: float_e2m1_t, BLOCK_K/2 swizzle, UMMA desc byte math; L1 epilogue: st.shared.u16, no swizzle, kWarpBytesPerRow; Host TMA: hidden/2 K-dim, block_k/2 inner, fp4_unpacked_smem=false; Build history through Build 35.	2026-05-11 22:40:09 +00:00
biondizzle	e80fe9af60	docs: CORRECTED - mxf4nvf4 IS supported on sm_100a (B200). The build 17-18 'scale_vec not supported on sm_100f' error occurred because we targeted sm_100 instead of sm_100a. The 'a' suffix is required for FP4 block-scaled MMA instructions. Reverting to mxf4nvf4 with the correct arch target is the path forward.	2026-05-11 14:24:55 +00:00
biondizzle	c2f4a30780	docs: comprehensive README update through build 22	2026-05-11 13:55:17 +00:00
biondizzle	f44ff7f6ca	docs: document SM100 hardware constraint and full debugging log	2026-05-11 09:30:44 +00:00
biondizzle	acbe006498	docs: update debugging log in README	2026-05-11 07:33:02 +00:00
biondizzle	42c215d49b	docs: add NVFP4 mega MoE kernel README	2026-05-11 05:41:25 +00:00