Posts by Sur*_*ale

ValueError: Error initializing torch.distributed using env:// rendezvous: environment variable MASTER_ADDR expected, but not set

I am unable to initialize the process group in PyTorch for a BERT model. I tried to initialize it with the following code:

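import datetime

import torch

torch.distributed.init_process_group(
    backend='nccl',
    init_method='env://',  # rendezvous settings (e.g. MASTER_ADDR, MASTER_PORT) come from environment variables
    timeout=datetime.timedelta(0, 1800),  # 0 days, 1800 seconds
    world_size=0,
    rank=0,
    store=None,
    group_name=''
)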

and then tried to call the get_world_size() function:

num_train_optimization_steps = num_train_optimization_steps // torch.distributed.get_world_size()
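The line above splits the total number of optimizer steps across the processes in the group. A minimal sketch of that arithmetic as a standalone function, using the question's variable names; the function name optimization_steps and the world_size check are illustrative additions, not part of the original script. Since the result is divided by the world size, a world size of 0 (the value passed to init_process_group in the question) could never work here, so the sketch rejects it explicitly instead of hitting ZeroDivisionError.

```python
def optimization_steps(num_examples: int, train_batch_size: int,
                       gradient_accumulation_steps: int, num_train_epochs: int,
                       world_size: int = 1) -> int:
    """Return the number of optimizer steps each process will run.

    Same arithmetic as the question: total steps over the whole run,
    then integer-divided by the number of processes in the group.
    """
    if world_size < 1:
        # Every process counts toward the world size, so 0 is never valid
        # and would otherwise raise ZeroDivisionError below.
        raise ValueError(f"world_size must be >= 1, got {world_size}")
    total = int(num_examples / train_batch_size / gradient_accumulation_steps) * num_train_epochs
    return total // world_size


# 1000 examples, batch size 32, no accumulation, 3 epochs, 2 processes
print(optimization_steps(1000, 32, 1, 3, world_size=2))  # 46
```

With a single process (world_size=1) the function returns the undivided total, which matches the non-distributed path of the script.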

Full code:

train_examples = None
num_train_optimization_steps = None
if do_train:
    train_examples = processor.get_train_examples(data_dir)
    num_train_optimization_steps = int(
        len(train_examples) / train_batch_size / gradient_accumulation_steps) * num_train_epochs
    if local_rank != -1:
        import datetime
        torch.distributed.init_process_group(
            backend='nccl', init_method='env://', timeout=datetime.timedelta(0, 1800),
            world_size=0, rank=0, store=None, group_name='')
        num_train_optimization_steps = num_train_optimization_steps // torch.distributed.get_world_size()
        print(num_train_optimization_steps)
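The error comes from the env:// init method, which reads its rendezvous settings from environment variables: MASTER_ADDR and MASTER_PORT, plus WORLD_SIZE and RANK unless those are passed to init_process_group(). Launchers such as python -m torch.distributed.launch normally export them for each process. A minimal, torch-free sketch for checking them before the call; the helper names and the single-process defaults (127.0.0.1, port 29500) are assumptions for illustration, and the commented-out call uses the CPU "gloo" backend only as an example.

```python
import os
from typing import Dict, List, Mapping

# Environment variables read by init_method="env://".
# WORLD_SIZE and RANK may instead be passed to init_process_group().
RENDEZVOUS_VARS = ("MASTER_ADDR", "MASTER_PORT", "WORLD_SIZE", "RANK")


def missing_rendezvous_vars(environ: Mapping[str, str] = os.environ) -> List[str]:
    """Return the env:// variables that are not set in `environ`."""
    return [name for name in RENDEZVOUS_VARS if name not in environ]


def single_process_env(port: int = 29500) -> Dict[str, str]:
    """Hypothetical helper: settings for a one-process run on this machine."""
    return {
        "MASTER_ADDR": "127.0.0.1",
        "MASTER_PORT": str(port),
        "WORLD_SIZE": "1",
        "RANK": "0",
    }


if __name__ == "__main__":
    missing = missing_rendezvous_vars()
    if missing:
        print("env:// rendezvous would fail, missing:", ", ".join(missing))
        # Fill in defaults only for a local single-process test run;
        # a real multi-process launch should set these per process.
        for key, value in single_process_env().items():
            os.environ.setdefault(key, value)
    # With the variables set, the call could then be, for example:
    # import torch.distributed as dist
    # dist.init_process_group(backend="gloo", init_method="env://")
```

Checking first turns the ValueError into an explicit list of what is missing, which makes it easier to see whether the script was started without a launcher.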

python pytorch

Sur*_*ale

2019-06-28

Score: 8
Answers: 2
Views: 30k

ImportError: Please install apex from https://www.github.com/nvidia/apex to use distributed and fp16 training

I am unable to install apex for distributed and fp16 training of a BERT model. I tried to install it by cloning apex from GitHub and then installing the package with pip.

I cloned apex from GitHub with the following command:

git clone https://github.com/NVIDIA/apex.git

then ran cd apex to move into the apex directory and tried to install the package with the following pip command:

pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext"
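One way to confirm whether the installation worked, before rerunning the training script, is to ask Python whether an apex package is importable. A minimal sketch; the helper names are hypothetical, and needs_apex assumes the script follows the common pattern of importing apex only when fp16 or distributed training (local_rank != -1) is requested, which matches the wording of the error above. find_spec only shows that some package named apex is on the path, not that it is NVIDIA's build with the extensions compiled.

```python
import importlib.util


def apex_available() -> bool:
    """True if some package named `apex` can be imported in this interpreter."""
    return importlib.util.find_spec("apex") is not None


def needs_apex(fp16: bool, local_rank: int) -> bool:
    """Hypothetical helper: apex is only required for fp16 or distributed runs.

    local_rank == -1 is the convention in the question's code for
    "not distributed".
    """
    return bool(fp16) or local_rank != -1


if __name__ == "__main__":
    print("apex importable:", apex_available())
    print("needed for fp16 + single process:", needs_apex(True, -1))
```

Running this with the same interpreter that runs the training script matters: a package installed into a different Python environment will not be found.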

The full code is:

def main(server_ip,server_port,local_rank,no_cuda,fp16,train_batch_size,gradient_accumulation_steps,seed,do_train,do_eval,output_dir,task_name,data_dir,do_lower_case,bert_model,num_train_epochs,cache_dir,learning_rate,warmup_proportion,loss_scale,max_seq_length):
        if server_ip and server_port:
            # Distant debugging - see https://code.visualstudio.com/docs/python/debugging#_attach-to-a-local-script
            import ptvsd
            print("Waiting for debugger attach")
            ptvsd.enable_attach(address=(server_ip, server_port), redirect_output=True)
            ptvsd.wait_for_attach()

        processors = {"ner":NerProcessor}
        print(processors)

        if local_rank == -1 or no_cuda:
            device = torch.device("cuda" if torch.cuda.is_available() and not no_cuda else "cpu")
            n_gpu = …

nlp python-3.x deep-learning pytorch

Sur*_*ale

2019-07-03

Score: 3
Answers: 1
Views: 6588

Failed to connect to duckling http server. Make sure the duckling server is running and the proper host and port are set in the configuration

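I created a workspace on Slack and registered an app there, which gave me what I needed, such as the Slack token and channel, to put into Rasa's credentials.yml file. After getting all the credentials, I tried to connect the Rasa bot to Slack with the following command: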

rasa run

My credentials.yml contains:

slack:
  slack_token: "xoxb-****************************************"
  slack_channel: "#ghale"

I am using ngrok to expose the web server running on my local machine to the internet.

But I get the following error:

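rasa.nlu.extractors.duckling_http_extractor - Failed to connect to duckling http server. Make sure the duckling server is running and the proper host and port are set in the configuration. More information on how to run the server can be found on github: https://github.com/facebook/duckling#quickstart Error: HTTPConnectionPool(host='localhost', port=8000): Max retries exceeded with url: /parse (Caused by NewConnectionError(': Failed to establish a new connection: [WinError 10061] No connection could be made because the target machine actively refused it',))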
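[WinError 10061] means nothing accepted the TCP connection on localhost:8000, the address where the duckling extractor expected a Duckling server. A small sketch for checking that host and port before running rasa run; is_port_open is a hypothetical helper, and it only tests whether something is listening, not whether the listener is actually Duckling.

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout`."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers "connection refused" (WinError 10061 on Windows) and timeouts.
        return False


if __name__ == "__main__":
    # localhost:8000 is the host and port shown in the error message above.
    print("duckling reachable:", is_port_open("localhost", 8000))
```

If this prints False, the problem is on the Duckling side (server not started, or running on a different host or port) rather than in the Slack credentials or the ngrok tunnel.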

chatbot python-3.x ngrok slack-api rasa

Sur*_*ale

2019-07-18

Score: 1
Answers: 1
Views: 5248