TensorFlow on raad2 GPU Cluster

From TAMUQ Research Computing User Documentation Wiki


How to set up a Python environment for TensorFlow

We will use an Anaconda (Conda) virtual environment to install TensorFlow.

Step 01: Request a GPU node from raad2-gfx. Once you issue the sinteractive command, the terminal prompt changes from raad2-gfx to one of gfx[1-4], confirming that you are now on a GPU node.

[muarif092@raad2-gfx ~]$ sinteractive
[muarif092@gfx1 ~]$
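The prompt change can also be checked from a script, which is handy before starting a long run. The following is a minimal sketch, assuming the GPU nodes are named gfx1 to gfx4 as the prompt above suggests; the function name on_gpu_node is hypothetical.

```python
import re
import socket

# Assumption: GPU nodes are named gfx1..gfx4, matching the gfx[1-4] prompt described above.
GPU_NODE_PATTERN = re.compile(r"^gfx[1-4]$")


def on_gpu_node(hostname=None):
    """Return True if the short hostname looks like a raad2 GPU node."""
    if hostname is None:
        hostname = socket.gethostname()
    short_name = hostname.split(".")[0]  # drop any domain suffix
    return bool(GPU_NODE_PATTERN.match(short_name))


if __name__ == "__main__":
    print("On a GPU node:", on_gpu_node())
```

If this prints False, you are most likely still on the raad2-gfx login node and need to issue sinteractive first.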

Step 02: Load the CUDA module. You can list the available CUDA versions by issuing 'module avail cuda'; using the latest CUDA version available on the system is recommended.

[muarif092@gfx1 ~]$ module load cuda11.2
[muarif092@gfx1 ~]$
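To confirm from Python that a CUDA module is loaded in the current shell, one option is to inspect the LOADEDMODULES environment variable. This is a sketch under the assumption that the cluster's module system sets LOADEDMODULES as a colon-separated list, which Environment Modules and Lmod normally do; the function name loaded_cuda_modules is hypothetical.

```python
import os


def loaded_cuda_modules(environ=None):
    """Return the names of loaded modules that start with 'cuda'.

    Assumption: the module system exports LOADEDMODULES as a
    colon-separated list of the currently loaded modules.
    """
    environ = os.environ if environ is None else environ
    loaded = environ.get("LOADEDMODULES", "")
    return [name for name in loaded.split(":") if name.lower().startswith("cuda")]


if __name__ == "__main__":
    print("Loaded CUDA modules:", loaded_cuda_modules())
```

An empty list suggests that 'module load' has not been run in this session.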


Step 03: Source the Conda setup script. This makes the conda executables available in your shell.

[muarif092@gfx1 ~]$ source /cm/shared/apps/anaconda3/etc/profile.d/conda.sh
[muarif092@gfx1 ~]$ conda -V
conda 4.9.2

Step 04: Create a Conda virtual environment, specifying the Python version

[muarif092@gfx1 ~]$ conda create -n tfProj python=3.8
..
..
Executing transaction: done
#
# To activate this environment, use
#
#     $ conda activate tfProj
#
# To deactivate an active environment, use
#
#     $ conda deactivate

Step 05: Activate the Conda environment. At this point you can confirm that the Python version in the environment is the one you specified when creating it

[muarif092@gfx1 ~]$ conda activate tfProj
(tfProj) [muarif092@gfx1 ~]$ python -V
Python 3.8.11
(tfProj) [muarif092@gfx1 ~]$ which python
~/.conda/envs/tfProj/bin/python
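The same confirmation can be done from inside Python. The sketch below reports the interpreter version, the interpreter path, and the active environment name read from CONDA_DEFAULT_ENV, a variable that conda activate sets. The helper names describe_environment and check_environment are hypothetical, and the expected values default to the tfProj environment and Python 3.8 used in this guide.

```python
import os
import sys


def describe_environment(environ=None):
    """Collect the interpreter version, path, and active Conda environment name."""
    environ = os.environ if environ is None else environ
    return {
        "python_version": "{}.{}.{}".format(*sys.version_info[:3]),
        "executable": sys.executable,
        # Set by 'conda activate'; None if no environment is active.
        "conda_env": environ.get("CONDA_DEFAULT_ENV"),
    }


def check_environment(expected_env="tfProj", expected_python=(3, 8), environ=None):
    """Compare the running interpreter against the expected environment."""
    info = describe_environment(environ)
    return {
        "env_matches": info["conda_env"] == expected_env,
        "python_matches": tuple(sys.version_info[:2]) == tuple(expected_python),
    }


if __name__ == "__main__":
    print(describe_environment())
    print(check_environment())
```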

Step 06: Install the required packages in the Conda virtual environment

(tfProj) [muarif092@gfx1 ~]$ conda install -c anaconda tensorflow-gpu keras-gpu
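After the install finishes, a quick way to see whether the packages are visible to the interpreter, without paying the cost of importing them, is to ask importlib for their specs. This is a minimal sketch; it assumes the two Conda packages provide the import names tensorflow and keras, and the function name installed_modules is hypothetical.

```python
import importlib.util


def installed_modules(names=("tensorflow", "keras")):
    """Map each top-level module name to True if it can be found, else False."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


if __name__ == "__main__":
    for name, found in installed_modules().items():
        print(name, "found" if found else "NOT found")
```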

Step 07: Run a test program to confirm that TensorFlow can detect the GPU

(tfProj) [muarif092@gfx1 ~]$ python
Python 3.8.11 (default, Aug  3 2021, 15:09:35)
[GCC 7.5.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow as tf
>>> print("Num GPUs Available: ", len(tf.config.list_physical_devices('GPU')))
2021-09-14 10:53:29.239845: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2021-09-14 10:53:29.354295: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1561] Found device 0 with properties:
pciBusID: 0000:18:00.0 name: Tesla V100-PCIE-16GB computeCapability: 7.0
coreClock: 1.38GHz coreCount: 80 deviceMemorySize: 15.78GiB deviceMemoryBandwidth: 836.37GiB/s
2021-09-14 10:53:29.417433: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2021-09-14 10:53:30.977573: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2021-09-14 10:53:31.642137: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10
2021-09-14 10:53:31.981036: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10
2021-09-14 10:53:32.654807: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10
2021-09-14 10:53:32.881767: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10
2021-09-14 10:53:33.750680: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2021-09-14 10:53:33.753288: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1703] Adding visible gpu devices: 0
Num GPUs Available:  1
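The interactive check above can also be saved as a small script so that it is easy to rerun. The sketch below wraps the same tf.config.list_physical_devices('GPU') call in a function and returns -1 when TensorFlow cannot be imported; the function name count_gpus and the -1 convention are illustrative choices, not part of TensorFlow.

```python
def count_gpus():
    """Return the number of GPUs TensorFlow can see, or -1 if TensorFlow is not importable."""
    try:
        import tensorflow as tf
    except ImportError:
        return -1
    return len(tf.config.list_physical_devices("GPU"))


if __name__ == "__main__":
    n = count_gpus()
    if n < 0:
        print("TensorFlow is not installed in this environment.")
    elif n == 0:
        print("TensorFlow is installed but no GPU was detected.")
    else:
        print("Num GPUs Available:", n)
```

A result of 0 on a gfx node usually points to a CPU-only TensorFlow build or a missing CUDA library, so the log lines printed during import are worth reading.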

Step 08: Deactivate the Conda environment and exit from the job

(tfProj) [muarif092@gfx1 ~]$ conda deactivate
[muarif092@gfx1 ~]$ exit

Run a test program in interactive mode

Step 01: Submit an interactive job and activate the Conda environment

[muarif092@raad2-gfx ~]$ sinteractive
[muarif092@gfx1 ~]$ module load cuda11.2
[muarif092@gfx1 ~]$ source /cm/shared/apps/anaconda3/etc/profile.d/conda.sh
[muarif092@gfx1 ~]$ conda activate tfProj
(tfProj) [muarif092@gfx1 ~]$

Step 02: Run a sample program

(tfProj) [muarif092@gfx1 ~]$ mkdir myDLProject
(tfProj) [muarif092@gfx1 ~]$ cd myDLProject
(tfProj) [muarif092@gfx1 myDLProject]$ cp /lustre/share/examples/gpu-tutorial/04_tensorflow/neuralNet.py .
(tfProj) [muarif092@gfx1 myDLProject]$
(tfProj) [muarif092@gfx1 myDLProject]$ python neuralNet.py
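The contents of neuralNet.py are not reproduced on this page. As an independent illustration of a program that actually places work on the GPU, the sketch below multiplies two random matrices on the first GPU if TensorFlow reports one, and falls back to the CPU otherwise. The function name run_matmul and the matrix size are hypothetical, and the code is not taken from neuralNet.py.

```python
def run_matmul(size=1000):
    """Multiply two random size x size matrices on the GPU if one is available.

    Returns (device_name, result_shape), or None if TensorFlow is not importable.
    """
    try:
        import tensorflow as tf
    except ImportError:
        return None

    gpus = tf.config.list_physical_devices("GPU")
    device = "/GPU:0" if gpus else "/CPU:0"

    # Pin the tensors and the matmul to the chosen device.
    with tf.device(device):
        a = tf.random.normal((size, size))
        b = tf.random.normal((size, size))
        product = tf.matmul(a, b)

    return device, product.shape.as_list()


if __name__ == "__main__":
    print(run_matmul())
```

Save it in the project directory under any file name and run it with python from the activated environment, in the same way as neuralNet.py.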