Background

With AI and ML in the limelight and hardware refreshes taking place, now seemed like a good time to put some spare Nvidia RTX 3070 Ti cards to use and pick up some dirt-cheap Tesla M40 24GB cards.

After a little research I settled on using Ubuntu 22.04.4 LTS as my server operating system.

It’s no mystery that a large base of Ubuntu users creating and updating documentation and answering posts draws in even more Ubuntu users. So yes, this decision was driven by market saturation, up-to-date documentation, and a prolific number of Stack Overflow posts.

Warnings & Gotchas

Hardware

I’ve used these steps successfully with the following Nvidia cards:

  • Tesla M40 24GB
    • These cards make up the bulk of my homelab CUDA count
  • RTX 3070 Ti 8GB
    • Ventus 3X Series
  • RTX 3070 LHR 8GB
    • Zotac GAMING Twin Edge OC

Secure Boot

If Secure Boot is enabled, be prepared to create a new signing key (MOK) during the driver install and to enter its password at the enrollment screen on the next boot before the modules will load.
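
If you’re not sure whether Secure Boot is enforced on your box, a quick sanity check up front saves a surprise later. mokutil is present on most Ubuntu UEFI installs; if it’s missing, it’s an apt install mokutil away:

# Prints "SecureBoot enabled" or "SecureBoot disabled"
mokutil --sb-state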

How to Reset / Wipe Clean

If something goes wrong and you need to reset:

# Purge every NVIDIA driver, library, and CUDA package
sudo apt-get remove --purge '^nvidia-.*'
sudo apt-get remove --purge '^libnvidia-.*'
sudo apt-get remove --purge '^cuda-.*'
sudo apt autoremove -y
sudo reboot now

# After the reboot, reinstall the kernel headers before starting over
sudo apt-get install linux-headers-$(uname -r)
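
Before starting over, it’s also worth confirming that nothing NVIDIA- or CUDA-related survived the purge:

# Should return no installed ("ii") entries if the purge worked
dpkg -l | grep -iE 'nvidia|cuda'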

Prerequisites

Blacklist Nouveau

Create the file /etc/modprobe.d/blacklist-nouveau.conf to prevent the nouveau driver from loading.

Open the file for editing:

sudo vim /etc/modprobe.d/blacklist-nouveau.conf

Add the following lines:

blacklist nouveau
options nouveau modeset=0
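
If you’d rather skip the editor, the same two lines can be written in one shot with tee; this is equivalent to the manual edit above, just non-interactive:

sudo tee /etc/modprobe.d/blacklist-nouveau.conf > /dev/null <<'EOF'
blacklist nouveau
options nouveau modeset=0
EOF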

Regenerate the kernel initramfs:

sudo update-initramfs -u
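
After your next reboot you can confirm nouveau is actually out of the picture; if the blacklist took effect, this prints nothing:

# No output means nouveau is not loaded
lsmod | grep nouveau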

Choose your installation method

(Preferred) Package Manager Install

I’ve ended up preferring this installation method as it’s easier to manage. I did explore the local installer without much success, and I wouldn’t recommend it if you’re a casual homelabber like me.

Official Documentation

  1. CUDA
    • The top-level official documentation
  2. docs/common-installation-instructions-for-ubuntu
    • Direct link to the common installation instructions for Ubuntu
  3. /cuda-downloads for Ubuntu 22.04
    • Download page that provides the correct, up-to-date URLs for your rig

Keyring and Toolkit

We need the toolkit before the drivers.

# Get the latest URL from the cuda-downloads page linked above
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt-get update
sudo apt-get -y install cuda-toolkit
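
At this point the toolkit is installed but not yet on your PATH (that comes later), so you can sanity check it through the /usr/local/cuda symlink the Ubuntu packages normally create (if that symlink is missing on your install, use the versioned /usr/local/cuda-12.x path instead):

# Print the compiler version to confirm the toolkit landed
/usr/local/cuda/bin/nvcc --version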

Now Choose Your Driver Adventure

Note: If Secure Boot is enabled, you will need to create a new signing key during the install and enter its password at the enrollment screen on the next boot. (This means you may have to do like I did and grab a spare monitor and keyboard for this.)

The documentation provides two options for drivers, which they describe as “Legacy” and “Open Source”.

In my case, and I’m guessing because I’m using older-generation cards, only the “Legacy” drivers worked for me. (That tracks for the M40s at least: the open kernel modules only support Turing and newer GPUs, so Maxwell-era cards are out.)

Legacy Drivers (worked on the 3070s and M40s)

sudo apt-get install -y cuda-drivers

# I've found that this is not necessary, but some people seem to recommend it at this point
# sudo reboot now 
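
If you’re curious which concrete driver version the cuda-drivers meta-package resolved to on your system, a quick package listing will show it (the version number will vary depending on when you run this):

# Lists the driver and meta-packages that were just installed
dpkg -l | grep -E 'cuda-drivers|nvidia-driver'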

Open Source Drivers (did not work for me)

sudo apt-get install -y nvidia-kernel-open-545
sudo reboot now
sudo apt-get install -y cuda-drivers-545

GDS

sudo apt-get install nvidia-gds
Output:

Backing up initrd.img-5.15.0-89-generic to /boot/initrd.img-5.15.0-89-generic.old-dkms
Making new initrd.img-5.15.0-89-generic
(If next boot fails, revert to initrd.img-5.15.0-89-generic.old-dkms image)
update-initramfs.......

Note: I’ve seen “modprobe: ERROR: could not insert 'nvidia_fs': No such device” appear here at the end, but upon reboot GDS was working.
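
To double-check GDS after that reboot, look for the nvidia_fs module. The gdscheck path in the comment below is an assumption based on where the toolkit drops the GDS tools, so adjust it if your install differs:

# Should show the nvidia_fs module once GDS is working
lsmod | grep nvidia_fs

# Optional deeper check (path is an assumption; adjust to your install)
# /usr/local/cuda/gds/tools/gdscheck -p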

Update your shell

Either add the PATH and LD_LIBRARY_PATH exports to your shell profile or enter them directly in your current session:

Open up your profile

vim ~/.zshrc # or ~/.bashrc

Add the following lines:

export PATH=$PATH:/usr/local/cuda-12/bin
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda-12/lib64
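
To pick up the new exports without logging out, you can reload your profile and confirm the shell now finds the CUDA binaries:

source ~/.zshrc   # or ~/.bashrc
which nvcc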

Reboot

We do need to reboot here. If Secure Boot is enabled, this is the boot where you’ll be prompted to enroll the new key and enter its password.

sudo reboot now

Verify

Once you’re back up you can verify your installation and that your paths are set by entering:

nvidia-smi

You should see the familiar nvidia-smi table listing each GPU along with the driver and CUDA versions.
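
Beyond nvidia-smi, two more quick checks confirm the PATH export works and that every card is visible; both commands ship with the toolkit and driver installed above:

# Confirms the shell can find the toolkit
nvcc --version

# Lists every GPU the driver can see
nvidia-smi -L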

Docker Setup

Documentation

  1. https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/index.html

Install from Package Manager

# Install the container toolkit, then register the NVIDIA runtime with
# both Docker and containerd and restart each daemon
sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
sudo nvidia-ctk runtime configure --runtime=containerd
sudo systemctl restart containerd
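
Before running a GPU container, you can confirm Docker actually registered the nvidia runtime; it should show up alongside runc:

# "nvidia" should appear in the Runtimes line
sudo docker info | grep -i runtimes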

Verify

sudo docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi

The output should be the same nvidia-smi table, this time printed from inside the container.

Samples

You can run some samples from the Nvidia repo – here’s a quick example:

git clone https://github.com/nvidia/cuda-samples
cd cuda-samples
make

These can take a while to build, but I’ve found a few of them useful when verifying my setup (see the note below if you’d rather not build everything).
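
If you only need one or two of the binaries, you can build a single sample directory instead of the whole tree. This is a sketch that assumes the Makefile-based layout the repo used when I cloned it; newer tags have moved things around, so check the repo’s README if the path doesn’t match:

# Build just deviceQuery; the per-sample Makefile copies the binary
# into the top-level bin directory used below
make -C Samples/1_Utilities/deviceQuery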

Commands like:

./bin/x86_64/linux/release/deviceQuery
./bin/x86_64/linux/release/bandwidthTest

You can run these with --help to see what they do and what options you can pass them.

Example: bandwidthTest output showing host-to-device, device-to-host, and device-to-device transfer rates.