Installing the TensorFlow GPU Version on ESXi
Note: the NVIDIA driver (384), CUDA (9.0), and TensorFlow (1.6.0) versions must match one another; otherwise you will run into trouble.
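The version pairing in the note can be written down as a small lookup. This is only a sketch: the table holds the single combination tested in this article, and `KNOWN_GOOD` and `versions_match` are illustrative names, not part of any library.

```python
# The one combination verified in this article; anything else is untested here.
KNOWN_GOOD = {
    "1.6.0": {"cuda": "9.0", "driver_major": 384},
}


def versions_match(tensorflow, cuda, driver):
    """Return True only if the triple equals the combination tested in this article."""
    expected = KNOWN_GOOD.get(tensorflow)
    if expected is None:
        return False  # TensorFlow version not in the table, so treated as unverified
    driver_major = int(driver.split(".")[0])
    return cuda == expected["cuda"] and driver_major == expected["driver_major"]


print(versions_match("1.6.0", "9.0", "384.111"))
print(versions_match("1.6.0", "9.1", "390.48"))
```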

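Create a new virtual machine on VMware

Enable passthrough for the GPU on the ESXi host

In the ESXi web console, go to "Manage" / "Hardware" / "PCI Devices", find the NVIDIA device, select it, and click "Toggle passthrough" to enable passthrough mode for the GPU. ESXi must be rebooted afterwards.

Create the virtual machine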

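  • OS: choose Ubuntu 64-bit. I installed Ubuntu 17.10.1 Server.
  • Virtual machine hardware settings
      • Memory: 4096 MB
      • Hard disk: 100 GB
      • Add a CD-ROM device and attach the Ubuntu installation ISO
      • Add a PCI device (the NVIDIA GPU that was switched to passthrough on the host)
  • Virtual machine boot parameters
      • Add the following parameter; otherwise the GPU cannot be found inside the VM: hypervisor.cpuid.v0 = FALSE
During OS installation, install the SSH server so that the machine can be accessed remotely.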
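Whether the boot parameter actually ended up in the VM configuration can be checked by reading the VM's .vmx file. A minimal sketch, assuming the usual `key = "value"` line format of .vmx files; `parse_vmx`, `cpuid_hidden`, and the sample text are hypothetical and only for illustration.

```python
def parse_vmx(text):
    """Parse 'key = "value"' lines of a .vmx file into a dict (keys lower-cased)."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip().lower()] = value.strip().strip('"')
    return settings


def cpuid_hidden(text):
    """True if hypervisor.cpuid.v0 is set to FALSE (compared case-insensitively)."""
    return parse_vmx(text).get("hypervisor.cpuid.v0", "").upper() == "FALSE"


# Hypothetical two-line excerpt of a .vmx file.
sample = 'memSize = "4096"\nhypervisor.cpuid.v0 = "FALSE"\n'
print(cpuid_hidden(sample))
```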

Install Python 3

iasc@chenhao:~$ sudo apt-get update && sudo apt-get upgrade
iasc@chenhao:~$ sudo apt-get install -y python3-pip python3-dev python-virtualenv
Create a Python virtual environment; we name it tensorflow.
iasc@chenhao:~$ virtualenv --system-site-packages -p python3 tensorflow
Activate the virtual environment.
iasc@chenhao:~$ . tensorflow/bin/activate
(tensorflow) iasc@chenhao:~$ python --version
Python 3.6.3
(tensorflow) iasc@chenhao:~$ pip --version
pip 9.0.3 from /home/iasc/tensorflow/lib/python3.6/site-packages (python 3.6)
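The same check can be made from inside Python, which is handy when a script must refuse to run outside the virtual environment. A small sketch; `in_virtualenv` is an illustrative helper name.

```python
import sys


def in_virtualenv():
    """True when the interpreter runs inside a virtualenv or venv."""
    # Classic virtualenv sets sys.real_prefix; venv (and newer virtualenv
    # releases) make sys.base_prefix differ from sys.prefix instead.
    if hasattr(sys, "real_prefix"):
        return True
    return getattr(sys, "base_prefix", sys.prefix) != sys.prefix


print(sys.version_info[:3], in_virtualenv())
```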

Create a working directory (optional)

(tensorflow) iasc@chenhao:~$ mkdir workspaces
(tensorflow) iasc@chenhao:~$ cd workspaces/
(tensorflow) iasc@chenhao:~/workspaces$

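Install the TensorFlow GPU version (1.6)

Note that this must be done inside the Python 3 virtual environment. The version is pinned to 1.6.0 so that it stays matched with CUDA 9.0.
(tensorflow) iasc@chenhao:~$ pip install --upgrade tensorflow==1.6.0
(tensorflow) iasc@chenhao:~$ pip install --upgrade tensorflow-gpu==1.6.0
Write a small program to verify the environment.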
(tensorflow) iasc@chenhao:~/workspaces$ vi hello_tensorflow.py

# Python
import tensorflow as tf
hello = tf.constant('Hello, TensorFlow!')
sess = tf.Session()
print(sess.run(hello))
Run it. First, a small appetizer: a taste of mild frustration.
(tensorflow) iasc@chenhao:~/workspaces$ python hello_tensorflow.py
Traceback (most recent call last):
  File "/home/iasc/tensorflow/lib/python3.6/site-packages/tensorflow/python/pywrap_tensorflow.py", line 58, in <module>
    from tensorflow.python.pywrap_tensorflow_internal import *
  File "/home/iasc/tensorflow/lib/python3.6/site-packages/tensorflow/python/pywrap_tensorflow_internal.py", line 28, in <module>
    _pywrap_tensorflow_internal = swig_import_helper()
  File "/home/iasc/tensorflow/lib/python3.6/site-packages/tensorflow/python/pywrap_tensorflow_internal.py", line 24, in swig_import_helper
    _mod = imp.load_module('_pywrap_tensorflow_internal', fp, pathname, description)
  File "/home/iasc/tensorflow/lib/python3.6/imp.py", line 243, in load_module
    return load_dynamic(name, filename, file)
  File "/home/iasc/tensorflow/lib/python3.6/imp.py", line 343, in load_dynamic
    return _load(spec)
ImportError: libcublas.so.9.0: cannot open shared object file: No such file or directory

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "hello_tensorflow.py", line 2, in <module>
    import tensorflow as tf
  File "/home/iasc/tensorflow/lib/python3.6/site-packages/tensorflow/__init__.py", line 24, in <module>
    from tensorflow.python import *
  File "/home/iasc/tensorflow/lib/python3.6/site-packages/tensorflow/python/__init__.py", line 49, in <module>
    from tensorflow.python import pywrap_tensorflow
  File "/home/iasc/tensorflow/lib/python3.6/site-packages/tensorflow/python/pywrap_tensorflow.py", line 74, in <module>
    raise ImportError(msg)
ImportError: Traceback (most recent call last):
  File "/home/iasc/tensorflow/lib/python3.6/site-packages/tensorflow/python/pywrap_tensorflow.py", line 58, in <module>
    from tensorflow.python.pywrap_tensorflow_internal import *
  File "/home/iasc/tensorflow/lib/python3.6/site-packages/tensorflow/python/pywrap_tensorflow_internal.py", line 28, in <module>
    _pywrap_tensorflow_internal = swig_import_helper()
  File "/home/iasc/tensorflow/lib/python3.6/site-packages/tensorflow/python/pywrap_tensorflow_internal.py", line 24, in swig_import_helper
    _mod = imp.load_module('_pywrap_tensorflow_internal', fp, pathname, description)
  File "/home/iasc/tensorflow/lib/python3.6/imp.py", line 243, in load_module
    return load_dynamic(name, filename, file)
  File "/home/iasc/tensorflow/lib/python3.6/imp.py", line 343, in load_dynamic
    return _load(spec)
ImportError: libcublas.so.9.0: cannot open shared object file: No such file or directory


Failed to load the native TensorFlow runtime.

See https://www.tensorflow.org/install/install_sources#common_installation_problems

for some common reasons and solutions. Include the entire stack trace
above this error message when asking for help.
(tensorflow) iasc@chenhao:~/workspaces$
Reading the log carefully, the problem is that libcublas.so.9.0 cannot be found. Of course: we have not installed CUDA yet.
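The same failure can be reproduced without importing TensorFlow by asking the dynamic loader for the library directly. A minimal sketch: `can_load` is an illustrative helper, and apart from libcublas.so.9.0 (taken from the traceback) the library names listed are an assumption about what this TensorFlow build needs.

```python
import ctypes


def can_load(library_name):
    """Return True if the dynamic loader can find and open the library."""
    try:
        ctypes.CDLL(library_name)
    except OSError:
        return False
    return True


# libcublas.so.9.0 comes from the traceback; the CUDA runtime and cuDNN
# names below are assumed, not taken from the log.
for name in ("libcublas.so.9.0", "libcudart.so.9.0", "libcudnn.so.7"):
    print(name, "OK" if can_load(name) else "MISSING")
```

Running it again after the CUDA and cuDNN steps gives a quick way to see which library is still missing before starting Python with TensorFlow.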

Check the environment

See the CUDA installation guide, and pay attention to its System Requirements section.
Check the GPU information
$ lspci | grep -i nvidia
03:00.0 VGA compatible controller: NVIDIA Corporation GP104 [GeForce GTX 1080] (rev a1)
Check that the Linux distribution is supported
$ uname -m && cat /etc/*release
x86_64
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=17.10
DISTRIB_CODENAME=artful
DISTRIB_DESCRIPTION="Ubuntu 17.10"
NAME="Ubuntu"
VERSION="17.10 (Artful Aardvark)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 17.10"
VERSION_ID="17.10"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=artful
UBUNTU_CODENAME=artful
Check the gcc version (it should be > 6.3.0)
$ gcc --version
gcc (Ubuntu 7.2.0-8ubuntu3.2) 7.2.0
Copyright (C) 2017 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
Check the Linux kernel version (it should be > 4.9.0)
$ uname -r
4.13.0-21-generic
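Version checks like these are easy to get wrong when done on strings, because "4.13.0" sorts before "4.9.0" as text. A short sketch of a numeric comparison, using the values shown above; `version_tuple` is an illustrative helper name.

```python
import re


def version_tuple(text):
    """Extract the leading dotted number of a version string as a tuple of ints."""
    match = re.match(r"\d+(?:\.\d+)*", text)
    if match is None:
        raise ValueError("no version number in %r" % text)
    return tuple(int(part) for part in match.group(0).split("."))


# Values taken from the `uname -r` and `gcc --version` output above.
print(version_tuple("4.13.0-21-generic") > (4, 9, 0))
print(version_tuple("7.2.0") > (6, 3, 0))
```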

Install the NVIDIA driver

Run nvidia-smi and see for yourself: the NVIDIA driver is not installed yet.
$ nvidia-smi
nvidia-smi: command not found
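If a driver is already installed and you need to change its version, the following commands remove the old driver and nouveau. The pattern is quoted so that the shell does not expand it against files in the current directory.
$ sudo apt-get purge 'nvidia*'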
$ sudo apt-get --purge remove xserver-xorg-video-nouveau
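Install driver version 384, as required by CUDA 9.0.
The following command shows that the highest GPU driver version available for the current release is 384.
$ sudo apt-cache search nvidia | grep -E "nvidia-[0-9]{3}"
Install it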
$ sudo add-apt-repository ppa:graphics-drivers/ppa
$ sudo apt-get update && sudo apt-get upgrade -y
$ sudo apt-get install -y nvidia-384
Reboot the system; nvidia-smi can then be used to check the GPU status.
$ nvidia-smi
Thu Mar 29 00:32:39 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.111                Driver Version: 384.111                   |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 1080    Off  | 00000000:03:00.0 Off |                  N/A |
|  0%   35C    P5    22W / 200W |      0MiB /  8114MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
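To confirm the driver version from a script rather than by eye, the header line of the nvidia-smi output can be parsed. A minimal sketch, assuming the "Driver Version: X.Y" wording shown above; `driver_version` is an illustrative helper name.

```python
import re


def driver_version(smi_output):
    """Return the driver version found in nvidia-smi text as a tuple of ints, or None."""
    match = re.search(r"Driver Version:\s*(\d+(?:\.\d+)*)", smi_output)
    if match is None:
        return None
    return tuple(int(part) for part in match.group(1).split("."))


# Header line copied from the output above.
header = "| NVIDIA-SMI 384.111                Driver Version: 384.111                   |"
print(driver_version(header))  # (384, 111)
```

On the VM itself the text could come from `subprocess.check_output(["nvidia-smi"]).decode()` instead of the sample string.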

Install CUDA 9.0

CUDA 9.0 is required by TensorFlow 1.6.
$ wget https://developer.nvidia.com/compute/cuda/9.0/Prod/local_installers/cuda-repo-ubuntu1704-9-0-local_9.0.176-1_amd64-deb
$ wget https://developer.nvidia.com/compute/cuda/9.0/Prod/patches/1/cuda-repo-ubuntu1704-9-0-local-cublas-performance-update_1.0-1_amd64-deb
$ wget https://developer.nvidia.com/compute/cuda/9.0/Prod/patches/2/cuda-repo-ubuntu1704-9-0-local-cublas-performance-update-2_1.0-1_amd64-deb
Installation commands
$ sudo dpkg -i cuda-repo-ubuntu1704-9-0-local_9.0.176-1_amd64-deb
$ sudo apt-key add /var/cuda-repo-9-0-local/7fa2af80.pub
$ sudo dpkg -i cuda-repo-ubuntu1704-9-0-local-cublas-performance-update_1.0-1_amd64-deb
$ sudo dpkg -i cuda-repo-ubuntu1704-9-0-local-cublas-performance-update-2_1.0-1_amd64-deb
$ sudo apt-get update
$ sudo apt-get install cuda
After installation, /usr/local/cuda is present.
$ ls -l /usr/local/
total 36
...
lrwxrwxrwx 1 root root 8 Mar 29 02:41 cuda -> cuda-9.0
drwxr-xr-x 15 root root 4096 Mar 29 02:40 cuda-9.0
...
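Because /usr/local/cuda is a symlink to the versioned directory, the active CUDA version can be read by resolving the link. A small sketch, assuming the `cuda-X.Y` directory naming shown in the listing; both function names are illustrative.

```python
import os
import re


def cuda_version_from_path(path):
    """Extract (major, minor) from a directory name such as /usr/local/cuda-9.0."""
    match = re.search(r"cuda-(\d+)\.(\d+)$", path.rstrip("/"))
    return (int(match.group(1)), int(match.group(2))) if match else None


def active_cuda_version(link="/usr/local/cuda"):
    """Follow the symlink and parse the versioned directory it points to."""
    return cuda_version_from_path(os.path.realpath(link))


print(cuda_version_from_path("/usr/local/cuda-9.0"))  # (9, 0)
print(active_cuda_version())  # None when the link does not exist on this machine
```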
Add the CUDA paths to ~/.bash_profile:
$ vi ~/.bash_profile

export PATH=/usr/local/cuda/bin${PATH:+:${PATH}}
export LD_LIBRARY_PATH=/usr/local/cuda/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
export CUDA_HOME=/usr/local/cuda

$ source ~/.bash_profile
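The `${PATH:+:${PATH}}` form in the export lines appends ":" plus the old value only when the variable is already set and non-empty, which avoids a stray trailing colon (an empty entry is treated as the current directory). The Python sketch below mirrors that rule for illustration; `prepend_path` and `on_path` are hypothetical helpers.

```python
def prepend_path(directory, current):
    """Mimic DIR${VAR:+:${VAR}}: add ':' + current only if current is non-empty."""
    return directory + (":" + current if current else "")


def on_path(directory, value):
    """True if directory is one of the colon-separated entries in value."""
    return directory in value.split(":")


print(prepend_path("/usr/local/cuda/bin", ""))
print(prepend_path("/usr/local/cuda/bin", "/usr/bin:/bin"))
```

After sourcing the profile, `on_path("/usr/local/cuda/lib64", os.environ.get("LD_LIBRARY_PATH", ""))` (with `import os`) is one way to confirm that the setting reached the current shell's child processes.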

Install cuDNN

Download link: https://developer.nvidia.com/rdp/cudnn-download . Choose the "cuDNN v7.1.2 Library for Linux" download.
After extracting the archive, copy the files into the CUDA installation directory:
tar -zxvf cudnn-9.0-linux-x64-v7.1.tgz
sudo cp cuda/include/cudnn.h /usr/local/cuda/include/
sudo cp cuda/lib64/libcudnn* /usr/local/cuda/lib64/
sudo chmod a+r /usr/local/cuda/include/cudnn.h
sudo chmod a+r /usr/local/cuda/lib64/libcudnn*
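To confirm which cuDNN version was copied, the version macros in cudnn.h can be read back. A minimal sketch, assuming the header defines CUDNN_MAJOR, CUDNN_MINOR, and CUDNN_PATCHLEVEL as cuDNN 7 headers do; `cudnn_version` and the sample text are illustrative.

```python
import re


def cudnn_version(header_text):
    """Read CUDNN_MAJOR/MINOR/PATCHLEVEL from the text of cudnn.h, or return None."""
    parts = []
    for name in ("CUDNN_MAJOR", "CUDNN_MINOR", "CUDNN_PATCHLEVEL"):
        match = re.search(r"^\s*#define\s+%s\s+(\d+)" % name, header_text, re.MULTILINE)
        if match is None:
            return None
        parts.append(int(match.group(1)))
    return tuple(parts)


# Hypothetical excerpt standing in for the real header file.
sample = "#define CUDNN_MAJOR 7\n#define CUDNN_MINOR 1\n#define CUDNN_PATCHLEVEL 2\n"
print(cudnn_version(sample))  # (7, 1, 2)
```

On the VM, the input would be the contents of /usr/local/cuda/include/cudnn.h.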

Final verification

With everything installed, let's verify again.
iasc@chenhao:~/workspaces$ . ~/tensorflow/bin/activate
(tensorflow) iasc@chenhao:~/workspaces$ python hello_tensorflow.py
2018-03-29 03:00:57.990863: I tensorflow/core/platform/cpu_feature_guard.cc:140] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2018-03-29 03:00:58.706255: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:898] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2018-03-29 03:00:58.707094: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1212] Found device 0 with properties:
name: GeForce GTX 1080 major: 6 minor: 1 memoryClockRate(GHz): 1.835
pciBusID: 0000:03:00.0
totalMemory: 7.92GiB freeMemory: 7.80GiB
2018-03-29 03:00:58.707109: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1312] Adding visible gpu devices: 0
2018-03-29 03:01:03.001985: I tensorflow/core/common_runtime/gpu/gpu_device.cc:993] Creating TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 7540 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080, pci bus id: 0000:03:00.0, compute capability: 6.1)
b'Hello, TensorFlow!'
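Great, it ran successfully. But why is there still a message saying "not compiled to use: AVX2 FMA"? It is informational only (severity I): the prebuilt pip binary was not compiled with those CPU instructions. Setting TF_CPP_MIN_LOG_LEVEL to 2 hides INFO and WARNING messages.
Modify the program: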
(tensorflow) iasc@chenhao:~/workspaces$ vi hello_tensorflow.py
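import os

# Set the log level before importing TensorFlow so that it reliably takes effect.
# '2' hides INFO and WARNING messages from the C++ runtime.
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'

import tensorflow as tf

hello = tf.constant('Hello, TensorFlow!')
sess = tf.Session()
print(sess.run(hello))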
Run it again. Yeah:
(tensorflow) iasc@chenhao:~/workspaces$ python hello_tensorflow.py
b'Hello, TensorFlow!'
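Why the lines disappear can be illustrated with a simplified model of the threshold: each runtime log line carries a severity letter (I, W, E, F), and only lines at or above the minimum level are printed. This sketch is not TensorFlow's implementation; `SEVERITY`, `LOG_LINE`, and `is_shown` are illustrative names.

```python
import re

# Severity letters seen in the runtime's log lines, mapped to numeric levels.
SEVERITY = {"I": 0, "W": 1, "E": 2, "F": 3}
LOG_LINE = re.compile(r"^\d{4}-\d{2}-\d{2} [\d:.]+: ([IWEF]) ")


def is_shown(line, min_level):
    """True if the line passes a threshold where only levels >= min_level print."""
    match = LOG_LINE.match(line)
    if match is None:
        return True  # not a runtime log line, e.g. the program's own output
    return SEVERITY[match.group(1)] >= min_level


# Log line copied from the first run above.
info = ("2018-03-29 03:00:58.707109: I tensorflow/core/common_runtime/gpu/"
        "gpu_device.cc:1312] Adding visible gpu devices: 0")
print(is_shown(info, 0), is_shown(info, 2), is_shown("b'Hello, TensorFlow!'", 2))
```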
