Tags: gpu, machine-learning, tensorflow, google-colaboratory
I am running a ConvNet on a Colab Pro GPU. I selected GPU as the runtime accelerator and can confirm that a GPU is available. The network is exactly the same one I ran yesterday evening, but now each epoch takes roughly 2 hours... last night each epoch took about 3 minutes... nothing at all has changed. I have a feeling Colab may be throttling my GPU usage, but I don't know how to tell whether that is the problem. Does GPU speed fluctuate a lot depending on the time of day, etc.? Below are some diagnostics I printed; does anyone know how I can dig deeper into the root cause of this slow behavior?
I also tried changing the accelerator in Colab to "None", and my network ran at the same speed as with "GPU" selected, which suggests that for some reason I am no longer training on the GPU, or that resources are being heavily throttled. I am using TensorFlow 2.1.
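One quick sanity check is whether TensorFlow itself can see the GPU, independently of what the VM reports. A minimal sketch using the TF 2.x API:

```python
import tensorflow as tf

# Lists the GPUs TensorFlow can actually use; an empty list means
# training is silently falling back to the CPU.
print(tf.config.list_physical_devices('GPU'))
print(tf.test.gpu_device_name())  # e.g. '/device:GPU:0', or '' if none
```

If this prints an empty list while `nvidia-smi` still shows a device, the GPU is attached to the VM but TensorFlow is not using it.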
```python
gpu_info = !nvidia-smi
gpu_info = '\n'.join(gpu_info)
if gpu_info.find('failed') >= 0:
    print('Select the Runtime → "Change runtime type" menu to enable a GPU accelerator, ')
    print('and then re-execute this cell.')
else:
    print(gpu_info)
```

```
Sun Mar 22 11:33:14 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.64.00    Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla P100-PCIE...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   40C    P0    32W / 250W |   8747MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+
```
```python
import humanize
import psutil
import GPUtil

def mem_report():
    print("CPU RAM Free: " + humanize.naturalsize(psutil.virtual_memory().available))

    GPUs = GPUtil.getGPUs()
    for i, gpu in enumerate(GPUs):
        print('GPU {:d} ... Mem Free: {:.0f}MB / {:.0f}MB | Utilization {:3.0f}%'.format(
            i, gpu.memoryFree, gpu.memoryTotal, gpu.memoryUtil*100))

mem_report()
```
```
CPU RAM Free: 24.5 GB
GPU 0 ... Mem Free: 7533MB / 16280MB | Utilization  54%
```
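A single snapshot can show low utilization simply because it was taken between batches. To watch the GPU while training is actually running, `nvidia-smi` can poll in a loop (standard flags; run this in a separate cell):

```
!nvidia-smi --query-gpu=utilization.gpu,memory.used --format=csv -l 5
```

If utilization stays near 0% for a whole epoch, the GPU is idle and the bottleneck is almost certainly elsewhere, e.g. in the input pipeline.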
Still no luck speeding things up. Here is my code, in case I've overlooked something... by the way, the images come from an old Kaggle competition; the data can be found at the link below. The training images are stored on my Google Drive. https://www.kaggle.com/c/datasciencebowl
```python
import os
import zipfile
import pathlib
import numpy as np
from PIL import Image
from IPython import display

#loading images from kaggle api

#os.environ['KAGGLE_USERNAME'] = ""
#os.environ['KAGGLE_KEY'] = ""

#!kaggle competitions download -c datasciencebowl

#unpacking zip files

#zipfile.ZipFile('./sampleSubmission.csv.zip', 'r').extractall('./')
#zipfile.ZipFile('./test.zip', 'r').extractall('./')
#zipfile.ZipFile('./train.zip', 'r').extractall('./')

data_dir = pathlib.Path('train')

image_count = len(list(data_dir.glob('*/*.jpg')))
CLASS_NAMES = np.array([item.name for item in data_dir.glob('*') if item.name != "LICENSE.txt"])

shrimp_zoea = list(data_dir.glob('shrimp_zoea/*'))
for image_path in shrimp_zoea[:5]:
    display.display(Image.open(str(image_path)))
```
```python
import tensorflow as tf

image_generator = tf.keras.preprocessing.image.ImageDataGenerator(rescale=1./255,
                                                                  validation_split=0.2)
                                                                  #rotation_range = 40,
                                                                  #width_shift_range = 0.2,
                                                                  #height_shift_range = 0.2,
                                                                  #shear_range = 0.2,
                                                                  #zoom_range = 0.2,
                                                                  #horizontal_flip = True,
                                                                  #fill_mode='nearest')
```
```python
validation_split = 0.2
BATCH_SIZE = 32
BATCH_SIZE_VALID = 10
IMG_HEIGHT = 224
IMG_WIDTH = 224
STEPS_PER_EPOCH = int(np.ceil(image_count*(1-validation_split)/BATCH_SIZE))
# divide by the validation batch size here, otherwise validation covers
# only part of the validation split each epoch
VALIDATION_STEPS = int(np.ceil(image_count*validation_split/BATCH_SIZE_VALID))
```
```python
train_data_gen = image_generator.flow_from_directory(directory=str(data_dir),
                                                     subset='training',
                                                     batch_size=BATCH_SIZE,
                                                     class_mode='categorical',
                                                     shuffle=True,
                                                     target_size=(IMG_HEIGHT, IMG_WIDTH),
                                                     classes=list(CLASS_NAMES))

validation_data_gen = image_generator.flow_from_directory(directory=str(data_dir),
                                                          subset='validation',
                                                          batch_size=BATCH_SIZE_VALID,
                                                          class_mode='categorical',
                                                          shuffle=True,
                                                          target_size=(IMG_HEIGHT, IMG_WIDTH),
                                                          classes=list(CLASS_NAMES))
```
```python
model_basic = tf.keras.models.Sequential([
    tf.keras.layers.Conv2D(16, (3,3), activation='relu', input_shape=(224, 224, 3)),
    tf.keras.layers.MaxPooling2D(2, 2),
    tf.keras.layers.Conv2D(32, (3,3), activation='relu'),
    tf.keras.layers.MaxPooling2D(2,2),
    tf.keras.layers.Conv2D(64, (3,3), activation='relu'),
    tf.keras.layers.MaxPooling2D(2,2),
    tf.keras.layers.Conv2D(128, (3,3), activation='relu'),
    tf.keras.layers.MaxPooling2D(2,2),
    tf.keras.layers.Conv2D(128, (3,3), activation='relu'),
    tf.keras.layers.MaxPooling2D(2,2),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(1000, activation='relu'),
    tf.keras.layers.Dense(121, activation='softmax')
])

model_basic.summary()
```
```python
model_basic.compile(optimizer='adam',
                    loss='categorical_crossentropy',
                    metrics=['accuracy'])
```
```python
history = model_basic.fit(
    train_data_gen,
    epochs=10,
    verbose=1,
    validation_data=validation_data_gen,
    steps_per_epoch=STEPS_PER_EPOCH,
    validation_steps=VALIDATION_STEPS,
    initial_epoch=0
)
```
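One way to separate input-pipeline time from compute time is to time the generator on its own. A rough sketch (the batch count is arbitrary): if drawing batches is slow even with no model involved, the bottleneck is data loading rather than the GPU:

```python
import time

# Time the data pipeline alone: draw a few batches and average.
n_batches = 10  # arbitrary sample size
t0 = time.time()
for _ in range(n_batches):
    next(train_data_gen)
print('{:.2f} s per batch (data loading only)'.format((time.time() - t0) / n_batches))
```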
Update: in the end, the bottleneck seems to have been loading the images from Google Drive into Colab on every batch. Loading the images onto the local disk cut each epoch to about 30 seconds... this is the code I used to load them onto the local disk:
```
!mkdir train_local
!unzip train.zip -d train_local
```
after uploading my train.zip file to Colab first.
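As an alternative to uploading the archive by hand, the same zip can be pulled from a mounted Google Drive; a sketch, assuming the archive sits at the top of My Drive (adjust the path to match):

```python
from google.colab import drive
drive.mount('/content/drive')

# Copy the archive once to the VM's local disk and unzip it there;
# per-batch reads then hit local disk instead of the Drive FUSE mount.
!cp '/content/drive/My Drive/train.zip' /content/
!unzip -q /content/train.zip -d /content/train_local
```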
Your `nvidia-smi` output clearly shows that a GPU is attached. Where are you storing your training data? If it isn't on the local disk, I recommend storing it there. The transfer speed of remote training data can vary depending on where your Colab backend happens to be located.
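Since transfer speed depends on where the Colab backend sits, one quick way to see the VM's public location is an external IP lookup, e.g. via the third-party ipinfo.io service:

```
!curl ipinfo.io
```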