How to fine-tune a zero-shot model for text classification

cro*_*oik 1 python nlp huggingface-transformers

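I need a model that can classify text into an unknown number of classes (i.e. the number may grow over time). The entailment approach to zero-shot text classification looks like the solution to my problem, but the model I tried, facebook/bart-large-mnli, does not perform well on my annotated data. Is there a way to fine-tune it without losing the model's robustness?

My dataset looks like this: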

# http://groups.di.unipi.it/~gulli/AG_corpus_of_news_articles.html
World, "Afghan Army Dispatched to Calm Violence KABUL, Afghanistan - Government troops intervened in Afghanistan's latest outbreak of deadly fighting between warlords, flying from the capital to the far west on U.S. and NATO airplanes to retake an air base contested in the violence, officials said Sunday..."
Sports, "Johnson Helps D-Backs End Nine-Game Slide (AP) AP - Randy Johnson took a four-hitter into the ninth inning to help the Arizona Diamondbacks end a nine-game losing streak Sunday, beating Steve Trachsel and the New York Mets 2-0." 
Business, "Retailers Vie for Back-To-School Buyers (Reuters) Reuters - Apparel retailers are hoping their back-to-school fashions will make the grade among style-conscious teens and young adults this fall, but it could be a tough sell, with students and parents keeping a tighter hold on their wallets."

PS: This is an artificial question, created because the topic came up in the comments section of this post, which is related to this post.

cro*_*oik 7

Concept explanation

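Before answering your question, it is important to understand how the entailment approach to zero-shot text classification works. The approach requires a model trained for NLI, which means the model can determine whether a hypothesis is:

  • supported,
  • not supported, or
  • undetermined

by a given premise [1]. You can verify this for the model you mentioned with the following code: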

from transformers import AutoModelForSequenceClassification, AutoTokenizer
nli_model = AutoModelForSequenceClassification.from_pretrained('facebook/bart-large-mnli')
# It will output three logits
print(nli_model.classification_head.out_proj)
# Each vector corresponds to the following labels
print(nli_model.config.id2label)

Output:

Linear(in_features=1024, out_features=3, bias=True)
{0: 'contradiction', 1: 'neutral', 2: 'entailment'}

The entailment approach proposed by Yin et al. exploits these NLI capabilities by using the text as the premise and formulating one hypothesis for each possible class with a template such as:

"the text is about {}"

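This means that when you have one text and three potential classes, you pass three sequences to the NLI model and compare the entailment logits to classify the text.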
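As a minimal sketch of that comparison (not code from the original answer): the helper names `build_nli_pairs`, `pick_class`, and `classify` are hypothetical, and the template is the one quoted in this answer. `classify` downloads facebook/bart-large-mnli when called, so its imports are deferred until then.

```python
from typing import List, Sequence, Tuple

# Hypothesis template quoted in this answer.
TEMPLATE = "the text is about {}"


def build_nli_pairs(text: str, classes: Sequence[str],
                    template: str = TEMPLATE) -> List[Tuple[str, str]]:
    """Return one (premise, hypothesis) pair per candidate class."""
    return [(text, template.format(c)) for c in classes]


def pick_class(entailment_logits: Sequence[float], classes: Sequence[str]) -> str:
    """Return the class whose hypothesis received the highest entailment logit."""
    best = max(range(len(classes)), key=lambda i: entailment_logits[i])
    return classes[best]


def classify(text: str, classes: Sequence[str],
             model_name: str = "facebook/bart-large-mnli") -> str:
    """Zero-shot classification: one NLI forward pass per candidate class."""
    # Imported here so the pure helpers above work without these packages.
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(model_name)
    model.eval()

    pairs = build_nli_pairs(text, classes)
    premises = [premise for premise, _ in pairs]
    hypotheses = [hypothesis for _, hypothesis in pairs]
    inputs = tokenizer(premises, hypotheses, return_tensors="pt",
                       padding=True, truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits  # shape: (num_classes, 3)

    # Look up which of the three outputs is the entailment logit.
    entail_id = model.config.label2id["entailment"]
    return pick_class(logits[:, entail_id].tolist(), classes)
```

Calling `classify(some_text, ["World", "Sports", "Business"])` would therefore run three premise/hypothesis sequences through the model, one per class.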

Fine-tuning

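So, to fine-tune an NLI model on your annotated data, you need to formulate your text classification task as an NLI task! That means you need to generate premise/hypothesis pairs, and the labels need to be contradiction or entailment. The contradiction label is included so that the model does not only ever see hypotheses that are entailed by their premise (i.e. the model needs to learn contradiction in order to predict low entailment scores in the zero-shot text classification task).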

To prepare your dataset, pair each text with hypotheses generated from a template such as:

"the text is about {}"

Output for one sample record, before and after conversion:

{'text': "Fears for T N pension after talks Unions representing workers at Turner   Newall say they are 'disappointed' after talks with stricken parent firm Federal Mogul.",
'class': 'Business'}

{'input_ids': [0, 597, 12541, 13, 255, 234, 4931, 71, 1431, 1890, 2485, 4561, 1138, 23, 6980, 1437, 1437, 188, 1250, 224, 51, 32, 128, 7779, 19051, 108, 71, 1431, 19, 35876, 4095, 933, 1853, 18059, 922, 4, 2, 2, 713, 1246, 16, 2090, 4, 2],
'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
'labels': 2,
'input_sentence': "<s>Fears for T N pension after talks Unions representing workers at Turner   Newall say they are 'disappointed' after talks with stricken parent firm Federal Mogul.</s></s>This example is Business.</s>"}
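The conversion described in this section could be sketched roughly as follows. This is an illustration under assumptions, not the answer's original code: the function name `to_nli_examples` is hypothetical, the template "This example is {}." is taken from the `input_sentence` shown in the sample output, the label ids follow the `id2label` mapping printed earlier, and sampling exactly one wrong class per text for the contradiction example is an arbitrary choice.

```python
import random
from typing import Dict, List, Sequence

# Label ids as printed by nli_model.config.id2label for facebook/bart-large-mnli.
LABEL2ID = {"contradiction": 0, "neutral": 1, "entailment": 2}
# Template as it appears in the 'input_sentence' of the sample output.
TEMPLATE = "This example is {}."


def to_nli_examples(record: Dict[str, str], classes: Sequence[str],
                    rng: random.Random, template: str = TEMPLATE) -> List[Dict]:
    """Turn one {'text', 'class'} record into NLI training examples.

    Produces an entailment example for the true class and, if any other
    class exists, a contradiction example for one randomly chosen wrong class.
    """
    true_class = record["class"]
    examples = [{
        "premise": record["text"],
        "hypothesis": template.format(true_class),
        "labels": LABEL2ID["entailment"],
    }]
    negatives = [c for c in classes if c != true_class]
    if negatives:
        wrong_class = rng.choice(negatives)  # assumption: one negative per text
        examples.append({
            "premise": record["text"],
            "hypothesis": template.format(wrong_class),
            "labels": LABEL2ID["contradiction"],
        })
    return examples


if __name__ == "__main__":
    sample = {"text": "Fears for T N pension after talks", "class": "Business"}
    for example in to_nli_examples(sample, ["World", "Sports", "Business"],
                                   random.Random(0)):
        print(example)
```

Each resulting pair can then be tokenized as a sentence pair, e.g. `tokenizer(example["premise"], example["hypothesis"], truncation=True)`, which for BART-style tokenizers yields the `<s>premise</s></s>hypothesis</s>` layout seen in the sample output.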

Robustness

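Fine-tuning will obviously reduce the robustness of your model (i.e. its ability to give good results for classes that were not part of the fine-tuning dataset). To avoid that, you could try:

  • Stopping training before convergence and checking whether the performance is still sufficient for your needs.
  • WiSE-FT, proposed by Wortsman et al. Pseudocode is given in Appendix A of the paper.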
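The core idea of WiSE-FT is a linear interpolation between the weights of the original (zero-shot) model and the fine-tuned model. The following is a minimal sketch of that idea, not the paper's pseudocode; the function name `interpolate_weights` and the parameter name `alpha` are assumptions made for illustration.

```python
from typing import Any, Dict, Mapping


def interpolate_weights(zero_shot_state: Mapping[str, Any],
                        fine_tuned_state: Mapping[str, Any],
                        alpha: float) -> Dict[str, Any]:
    """Blend two sets of weights: (1 - alpha) * zero_shot + alpha * fine_tuned.

    alpha = 0.0 returns the zero-shot weights, alpha = 1.0 the fine-tuned ones.
    Values can be anything that supports scalar multiplication and addition,
    for example floats or torch tensors. Non-float buffers in a real state
    dict may need to be copied over instead of interpolated.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if set(zero_shot_state) != set(fine_tuned_state):
        raise ValueError("both models must have the same parameter names")
    return {
        name: (1.0 - alpha) * zero_shot_state[name] + alpha * fine_tuned_state[name]
        for name in zero_shot_state
    }
```

With two PyTorch models of the same architecture, one could try `model.load_state_dict(interpolate_weights(zero_shot.state_dict(), fine_tuned.state_dict(), 0.5))` and then evaluate several values of `alpha` on both seen and unseen classes to pick a trade-off.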