grace-gpu-containers

Author	SHA1	Message	Date
biondizzle	0698298d13	Bleeding edge: vLLM main branch, flashinfer v0.6.7, Gitea fork source	2026-04-28 10:17:50 +00:00
biondizzle	6e03b5d357	custom weights tweaks	2026-04-28 08:59:53 +00:00
biondizzle	10c71a446c	Remove flash-attn GIT_TAG override to main: causes FLASHATTENTION_FP8_TWO_LEVEL_INTERVAL undefined error. v0.19.0 pins a compatible flash-attn commit (2921022). The sed that forced GIT_TAG to main pulled in newer code that references FLASHATTENTION_FP8_TWO_LEVEL_INTERVAL, which isn't defined in v0.19.0's build config. Use the pinned commit instead.	2026-04-28 03:07:14 +00:00
biondizzle	550a04a0ca	custom weights	2026-04-28 02:10:48 +00:00
biondizzle	e43c8c97f1	custom weights	2026-04-28 02:08:00 +00:00
biondizzle	edf12f7996	Clean up: remove PLAN-triton-kernels.md (merged into main)	2026-04-06 17:25:06 +00:00
biondizzle	e6cc28a942	Add triton_kernels for MoE support (vLLM v0.19.0): add build-triton-kernels stage to fetch triton_kernels from Triton v3.6.0; install to site-packages for vLLM to find at runtime. Resolves: No module named 'triton_kernels.matmul_ogs'. Image tag: gh200-vllm-tfa:v0.19.0-tfa	2026-04-06 16:39:56 +00:00
biondizzle	643d5589a3	Switch flashinfer to v0.6.6 for vLLM v0.19.0 (v0.6.7 works with v0.18.2rc0) v0.19.0	2026-04-03 13:15:56 +00:00
biondizzle	3290adb0ac	Upgrade vLLM to v0.19.0 for Gemma 4 support (requires transformers>=5.5.0)	2026-04-03 11:55:16 +00:00
biondizzle	cd5d58a6f9	Patch vLLM torch_utils.py: remove hoist=True for NGC PyTorch 2.11 compatibility	2026-04-03 11:40:51 +00:00
biondizzle	659c79638c	✅ WORKING BUILD #43: GH200 vLLM container builds successfully. Versions locked: vLLM v0.18.2rc0; flashinfer v0.6.7; flash-attention hopper branch; lmcache dev branch; infinistore main; triton 3.6.0 (PyPI wheel); base nvcr.io/nvidia/pytorch:26.03-py3 (PyTorch 2.11.0a0, CUDA 13.2.0). DO NOT MODIFY WITHOUT MIKE'S APPROVAL	2026-04-03 11:08:29 +00:00
biondizzle	2442906d95	Add -y flag to pip uninstall pynvml for non-interactive Docker build	2026-04-03 10:57:42 +00:00
biondizzle	5280a28205	Bump flashinfer from v0.6.6 to v0.6.7 (required by vLLM v0.18.2rc0)	2026-04-03 10:52:19 +00:00
biondizzle	dbca81bba2	Switch vLLM from main to v0.18.2rc0 for CUDA 13.2 compatibility	2026-04-03 09:19:01 +00:00
biondizzle	202b9c4e23	Add -y flag to pip uninstall infinistore for non-interactive Docker build	2026-04-03 09:06:42 +00:00
biondizzle	c2cebcf962	Add apache-tvm-ffi dependency for flashinfer build	2026-04-03 09:00:18 +00:00
biondizzle	beb26d3573	Fix python -m build flag: use --no-isolation instead of --no-build-isolation	2026-04-03 08:54:45 +00:00
biondizzle	4e8a765c72	Fix wheel install conflict, use python -m build instead of pip build	2026-04-03 08:52:43 +00:00
biondizzle	ce55e45db2	Fix NGC PyTorch image tag format (26.03-py3)	2026-04-03 08:46:43 +00:00
biondizzle	c92c4ec68a	Switch to NVIDIA NGC PyTorch 26.03 base image (PyTorch 2.11.0a0, CUDA 13.2.0, ARM SBSA support)	2026-04-03 08:44:36 +00:00
biondizzle	54e609b2c5	Update lmcache/Dockerfile to CUDA 13.0.1, PyTorch nightly, LMCache dev branch	2026-04-03 08:39:35 +00:00
biondizzle	4980d9e49a	Use PyTorch nightly with CUDA 13.0 (torch 2.11.0.dev)	2026-04-03 08:36:36 +00:00
biondizzle	6a97539682	Fix duplicate corrupted lines in Dockerfile	2026-04-03 08:31:56 +00:00
biondizzle	f55789c53b	Bump to CUDA 13.0.1 + PyTorch 2.9.0, add version output on git checkouts	2026-04-03 08:26:53 +00:00
biondizzle	e514e0cd1e	Revert my patches - try v0.18.2rc0	2026-04-03 08:09:05 +00:00
biondizzle	4860bcee41	Skip LMCache CUDA extensions (NO_CUDA_EXT=1). PyTorch 2.9.0+cu130 was compiled with CUDA 12.8 but container has CUDA 13.0. Skip CUDA extension build to avoid version mismatch.	2026-04-03 08:05:44 +00:00
biondizzle	360b0dea58	Restore CUDA 13.0.1 + patch vLLM for cuMemcpyBatchAsync API change. CUDA 13 removed the fail_idx parameter from cuMemcpyBatchAsync. Patch cache_kernels.cu to match new API signature instead of downgrading: restore CUDA 13.0.1, PyTorch 2.9.0+cu130, flashinfer cu130; patch: remove fail_idx variable and parameter from cuMemcpyBatchAsync call; simplify error message to not reference fail_idx.	2026-04-03 07:53:12 +00:00
biondizzle	6255c94359	Downgrade to CUDA 12.8.1 for vLLM compatibility. cuMemcpyBatchAsync API changed in CUDA 13 (removed fail_idx parameter). vLLM code targets CUDA 12.8 API. Downgrade to CUDA 12.8.1.	2026-04-03 07:43:19 +00:00
biondizzle	ceab7ada22	Update flashinfer to v0.6.6 to match vLLM 0.18.x requirements. vLLM 0.18.x depends on flashinfer-python==0.6.6; was building 0.4.1.	2026-04-03 07:13:16 +00:00
biondizzle	9d88d4c7d8	Skip xformers: vLLM has built-in FlashAttention kernels. xformers requires TORCH_STABLE_ONLY, which needs torch/csrc/stable/ headers not present in PyTorch 2.9.0. vLLM 0.18.1 includes its own FA2/FA3 kernels.	2026-04-03 05:50:02 +00:00
biondizzle	45b6109ee1	Fix xformers TORCH_STABLE_ONLY issue + ramp up MAX_JOBS for native GH200: switch to official facebook/xformers (johnnynunez fork has TORCH_STABLE_ONLY requiring PyTorch headers not in 2.9.0); increase MAX_JOBS from 2-4 to 8 for all builds (native GH200 has 97GB HBM3); increase NVCC_THREADS from 1 to 4 for flash-attention.	2026-04-03 05:46:11 +00:00
biondizzle	b223c051de	move things	2026-04-03 04:27:21 +00:00
biondizzle	7f7ca4a742	move things	2026-04-03 04:26:52 +00:00
biondizzle	2dc2008475	move things	2026-04-03 04:26:26 +00:00
biondizzle	980cd1b749	move things	2026-04-03 04:26:08 +00:00
biondizzle	1540b0c54e	move things	2026-04-03 04:25:23 +00:00
biondizzle	0b4ede8047	Add .gitignore for internal docs	2026-04-03 03:49:41 +00:00
biondizzle	5c29d2bea7	Fix: LMCache default branch is 'dev' not 'main'	2026-04-03 03:34:58 +00:00
biondizzle	9259555802	Fix: Actually update LMCache to main branch (previous edit failed)	2026-04-03 03:16:45 +00:00
biondizzle	750906e649	Bleeding edge build: LMCache main, vLLM main, latest transformers	2026-04-03 03:14:01 +00:00
biondizzle	a399fbc8c6	Add MAX_JOBS=2 for LMCache, restore vLLM build from source. LMCache: reduced parallelism to avoid memory pressure; vLLM: restored build from source (was using PyPI wheel); will test with docker --memory=24g limit.	2026-04-03 02:49:43 +00:00
biondizzle	f8a9d372e5	Use PyPI vLLM wheel instead of building (QEMU cmake try_compile fails): vLLM 0.18.1 aarch64 wheel includes pre-compiled FA2, FA3, MoE kernels; original build-from-source code commented out for GH200 restoration; CMake compiler ABI detection fails under QEMU emulation.	2026-04-03 00:05:56 +00:00
biondizzle	436214bb72	Use PyPI triton wheel instead of building (QEMU segfaults). Triton 3.6.0 has official aarch64 wheel on PyPI. Building triton from source causes segfaults under QEMU emulation.	2026-04-02 23:58:20 +00:00
biondizzle	e5445512aa	Reduce MAX_JOBS by half to reduce QEMU memory pressure: xformers 6 -> 3; flash-attention 8 -> 4; vllm 8 -> 4. Testing if lower parallelism helps avoid segfaults under emulation.	2026-04-02 23:44:11 +00:00
biondizzle	4f94431af6	Revert CC/CXX to full paths, keep QEMU_CPU=max	2026-04-02 22:50:23 +00:00
biondizzle	866c9d9db8	Add QEMU_CPU=max for better emulation compatibility during cross-compilation	2026-04-02 22:47:53 +00:00
biondizzle	2ed1b1e2dd	Fix: use CC=gcc CXX=g++ instead of full paths for QEMU compatibility	2026-04-02 22:47:22 +00:00
biondizzle	14467bef70	Fix: add --no-build-isolation to pip wheel for flash-attention. Without this flag, pip runs the build in an isolated environment that doesn't have access to torch in the venv.	2026-04-02 20:55:32 +00:00
biondizzle	82b2ceacd5	Update build history and fix pip command docs	2026-04-02 20:24:26 +00:00
biondizzle	8f870921f8	Fix: use 'pip wheel' instead of 'uv pip wheel' (uv has no wheel subcommand)	2026-04-02 20:22:11 +00:00

77 Commits