NotImplementedError: Loading a dataset cached in a LocalFileSystem is not supported

Ari*_*Ari 15 python python-3.x openai-api huggingface-datasets

I am trying to load a dataset with the datasets Python module in a local Python notebook. I am running a Python 3.10.13 kernel, the same version I used for my virtual environment.

I cannot load the dataset from the tutorial I am following. This is the error:

---------------------------------------------------------------------------
NotImplementedError                       Traceback (most recent call last)
/Users/ari/Downloads/00-fine-tuning.ipynb Celda 2 line 3
      1 from datasets import load_dataset
----> 3 data = load_dataset(
      4     "jamescalam/agent-conversations-retrieval-tool",
      5     split="train"
      6 )
      7 data

File ~/Documents/fastapi_language_tutor/env/lib/python3.10/site-packages/datasets/load.py:2149, in load_dataset(path, name, data_dir, data_files, split, cache_dir, features, download_config, download_mode, verification_mode, ignore_verifications, keep_in_memory, save_infos, revision, token, use_auth_token, task, streaming, num_proc, storage_options, **config_kwargs)
   2145 # Build dataset for splits
   2146 keep_in_memory = (
   2147     keep_in_memory if keep_in_memory is not None else is_small_dataset(builder_instance.info.dataset_size)
   2148 )
-> 2149 ds = builder_instance.as_dataset(split=split, verification_mode=verification_mode, in_memory=keep_in_memory)
   2150 # Rename and cast features to match task schema
   2151 if task is not None:
   2152     # To avoid issuing the same warning twice

File ~/Documents/fastapi_language_tutor/env/lib/python3.10/site-packages/datasets/builder.py:1173, in DatasetBuilder.as_dataset(self, split, run_post_process, verification_mode, ignore_verifications, in_memory)
   1171 is_local = not is_remote_filesystem(self._fs)
   1172 if not is_local:
-> 1173     raise NotImplementedError(f"Loading a dataset cached in a {type(self._fs).__name__} is not supported.")
   1174 if not os.path.exists(self._output_dir):
   1175     raise FileNotFoundError(
   1176         f"Dataset {self.dataset_name}: could not find data in {self._output_dir}. Please make sure to call "
   1177         "builder.download_and_prepare(), or use "
   1178         "datasets.load_dataset() before trying to access the Dataset object."
   1179     )

NotImplementedError: Loading a dataset cached in a LocalFileSystem is not supported.

How can I fix this? I don't understand how this error applies, since the dataset is something I am fetching and so cannot already be cached in my LocalFileSystem.

Pal*_*ine 34

Try:

pip install -U datasets

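This error stems from a breaking change in fsspec. It has been fixed in the latest datasets release (2.14.6). Updating your installation with pip install -U datasets should resolve the issue.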

GitHub link: https://github.com/huggingface/datasets/issues/6352


If you are using fsspec, pin it to the earlier release instead:

pip install fsspec==2023.9.2

There is a problem with fsspec==2023.10.0.

GitHub link: https://github.com/huggingface/datasets/issues/6330
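
To see whether an environment matches the combination described above, a quick version check can help. This is only a minimal sketch: the helper names (installed_version, as_tuple, likely_affected) are made up for illustration, the 2.14.6 and 2023.10.0 thresholds are taken from this answer, and treating every fsspec release from 2023.10.0 onward as problematic for older datasets versions is an assumption, not something the linked issues are quoted as saying.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of `package`, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def as_tuple(version):
    """Turn '2.14.6' into (2, 14, 6); non-numeric parts are dropped."""
    return tuple(int(part) for part in version.split(".") if part.isdigit())


def likely_affected(datasets_version, fsspec_version):
    """Heuristic (assumption): datasets < 2.14.6 together with fsspec >= 2023.10.0."""
    if datasets_version is None or fsspec_version is None:
        return False
    return (as_tuple(datasets_version) < (2, 14, 6)
            and as_tuple(fsspec_version) >= (2023, 10, 0))


if __name__ == "__main__":
    ds = installed_version("datasets")
    fs = installed_version("fsspec")
    print(f"datasets={ds} fsspec={fs} likely affected: {likely_affected(ds, fs)}")
```

If this reports that the environment is likely affected, either of the two pip commands in this answer should apply; if it does not, the error probably has another cause.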

  • This only worked after I restarted the kernel. I also never explicitly installed "fsspec". Thanks! (5 upvotes)
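
The need for a kernel restart is consistent with how Python imports work: a module that has already been imported stays in memory even after pip replaces the files on disk. A rough way to spot this from inside the notebook is sketched below. The helper name loaded_and_installed is hypothetical, and the sketch assumes each package exposes a __version__ attribute and that its import name matches its distribution name.

```python
import sys
from importlib import metadata


def loaded_and_installed(package):
    """Return (version loaded in this process, version installed on disk)."""
    module = sys.modules.get(package)  # only present if already imported
    loaded = getattr(module, "__version__", None) if module else None
    try:
        installed = metadata.version(package)
    except metadata.PackageNotFoundError:
        installed = None
    return loaded, installed


for name in ("datasets", "fsspec"):
    loaded, installed = loaded_and_installed(name)
    # A mismatch suggests the kernel still holds the old module in memory.
    stale = loaded is not None and installed is not None and loaded != installed
    print(f"{name}: loaded={loaded} installed={installed} restart needed: {stale}")
```

A package that has not been imported yet in the current kernel shows loaded=None, which is fine; it will pick up the installed version on first import.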