Sam*_*Sam 5 gpu dataset pytorch huggingface-transformers
I am running T5-base-grammar-correction for grammar correction on a data frame with a text column:
from happytransformer import HappyTextToText
from happytransformer import TTSettings
from tqdm.notebook import tqdm

tqdm.pandas()

happy_tt = HappyTextToText("T5", "./t5-base-grammar-correction")
beam_settings = TTSettings(num_beams=5, min_length=1, max_length=30)

def grammer_pipeline(text):
    text = "gec: " + text
    result = happy_tt.generate_text(text, args=beam_settings)
    return result.text

df['new_text'] = df['original_text'].progress_apply(grammer_pipeline)
The pandas apply function runs and produces the desired results, but it is quite slow.
I also get the following warning when executing the code:
/home/.local/lib/python3.6/site-packages/transformers/pipelines/base.py:908: UserWarning: You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset
UserWarning,
I have access to a GPU. Can anyone offer some guidance on speeding up execution and making full use of the GPU?
---------------- Edit ----------------
I tried using a PyTorch Dataset as follows, but processing is still slow:
from torch.utils.data import Dataset, DataLoader

class CustomD(Dataset):
    def __init__(self, text):
        self.text = text
        self.len = text.shape[0]

    def __len__(self):
        return self.len

    def __getitem__(self, idx):
        text = self.text[idx]
        text = "gec: " + text
        # Runs the model on a single text per call
        result = happy_tt.generate_text(text, args=beam_settings)
        return result.text

TD = CustomD(df.original_text)
final_data = DataLoader(dataset=TD,
                        batch_size=10,
                        shuffle=False
                        )

list_modified = []
for (idx, batch) in enumerate(final_data):
    list_modified.append(batch)

flat_list = [item for sublist in list_modified for item in sublist]
df["new_text"] = flat_list
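One likely reason the Dataset attempt is no faster: `__getitem__` still calls `generate_text` on one string at a time, so the DataLoader only groups strings that have already been generated. Below is a minimal sketch of batching before generation instead. It is not the happytransformer API: it assumes the local `./t5-base-grammar-correction` directory loads with the `transformers` Auto classes, and the helper names `batched`, `correct_in_batches` and `make_t5_generate_batch`, as well as the batch size of 32, are hypothetical choices made for illustration.

```python
from typing import Callable, Iterator, List, Sequence


def batched(items: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


def correct_in_batches(
    texts: Sequence[str],
    generate_batch: Callable[[List[str]], List[str]],
    batch_size: int = 32,      # assumption: tune to fit GPU memory
    prefix: str = "gec: ",     # same prefix as in the question
) -> List[str]:
    """Prefix each text, send whole batches to `generate_batch`, keep input order."""
    results: List[str] = []
    for batch in batched(list(texts), batch_size):
        results.extend(generate_batch([prefix + text for text in batch]))
    return results


def make_t5_generate_batch(model_dir: str, device: str = "cuda"):
    """Hypothetical helper: build a batch generator on top of transformers.

    Imports are local so the batching helpers above work without a GPU stack.
    """
    import torch
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_dir)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_dir).to(device).eval()

    def generate_batch(batch: List[str]) -> List[str]:
        # Pad to the longest text in the batch so one forward pass covers all of it
        inputs = tokenizer(
            batch, return_tensors="pt", padding=True, truncation=True
        ).to(device)
        with torch.no_grad():
            output_ids = model.generate(
                **inputs, num_beams=5, min_length=1, max_length=30
            )
        return tokenizer.batch_decode(output_ids, skip_special_tokens=True)

    return generate_batch


# Usage sketch with the data frame from the question:
# generate_batch = make_t5_generate_batch("./t5-base-grammar-correction")
# df["new_text"] = correct_in_batches(df["original_text"].tolist(), generate_batch)
```

Passing the generator in as a callable keeps the batching logic independent of the model, so it can be checked on CPU with a stub before trying it on the GPU. Whether this actually helps, and which batch size works best, depends on the GPU memory and text lengths, so it needs to be measured.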