I'm working on an NLP project and trying to follow this tutorial: https://medium.com/@ageitgey/natural-language-processing-is-fun-9a0bff37854e

When I run this part:

    import spacy

    # Load the large English NLP model
    nlp = spacy.load('en_core_web_lg')

    # Replace a token with "REDACTED" if it is a name
    def replace_name_with_placeholder(token):
        if token.ent_iob != 0 and token.ent_type_ == "PERSON":
            return "[REDACTED] "
        else:
            return token.string

    # Loop through all the entities in a document and check if they are names
    def scrub(text):
        doc = nlp(text)
        for ent in doc.ents:
            ent.merge()
        tokens = map(replace_name_with_placeholder, doc)
        return "".join(tokens)

    s = """
    In 1950, Alan Turing published his famous article "Computing Machinery and Intelligence".
    In 1957, Noam Chomsky’s
    Syntactic Structures revolutionized Linguistics with 'universal grammar', a rule based system of
    syntactic structures.
    """

    print(scrub(s))

I get this error:

    ---------------------------------------------------------------------------
    AttributeError                            Traceback (most recent call last)
    <ipython-input-62-ab1c786c4914> in <module>
          4 """
          5
    ----> 6 print(scrub(s))

    <ipython-input-60-4742408aa60f> in scrub(text)
          3     doc = nlp(text)
          4     for ent in doc.ents:
    ----> 5         ent.merge()
          6     tokens = map(replace_name_with_placeholder, doc)
          7     return "".join(tokens)

    AttributeError: 'spacy.tokens.span.Span' object has no attribute 'merge'

小智 6
spaCy has removed the span.merge() method since that tutorial was written. The way to do this now is with doc.retokenize(): https://spacy.io/api/doc#retokenize. I've reimplemented your scrub function with it:

    # Loop through all the entities in a document and check if they are names
    def scrub(text):
        doc = nlp(text)
        # Merge each entity into a single token; the merges are applied
        # when the with-block exits
        with doc.retokenize() as retokenizer:
            for ent in doc.ents:
                retokenizer.merge(ent)
        tokens = map(replace_name_with_placeholder, doc)
        return "".join(tokens)

    s = """
    In 1950, Alan Turing published his famous article "Computing Machinery and Intelligence".
    In 1957, Noam Chomsky’s
    Syntactic Structures revolutionized Linguistics with 'universal grammar', a rule based system of
    syntactic structures.
    """

    print(scrub(s))

Other notes:
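
Your replace_name_with_placeholder function will also throw an error, because token.string has been removed as well. Use token.text_with_ws, which keeps each token's trailing whitespace so that the joined output keeps its spaces (token.text also avoids the error, but it drops that whitespace). I fixed it below:

    def replace_name_with_placeholder(token):
        if token.ent_iob != 0 and token.ent_type_ == "PERSON":
            return "[REDACTED] "
        else:
            # text_with_ws is the token text plus any trailing whitespace
            return token.text_with_ws

If you are merging entities along with other spans, such as doc.noun_chunks, you may run into errors like this one: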

    ValueError: [E102] Can't merge non-disjoint spans. 'Computing' is already part of
    tokens to merge. If you want to find the longest non-overlapping spans, you can
    use the util.filter_spans helper:
    https://spacy.io/api/top-level#util.filter_spans

So you may also want to look at spacy.util.filter_spans: https://spacy.io/api/top-level#util.filter_spans.
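
As a rough illustration of how filter_spans avoids the E102 error, here is a minimal sketch. It assumes a blank English pipeline (tokenizer only, so no model download is needed), and the three spans are hand-made stand-ins for what doc.ents and doc.noun_chunks might return; the sentence, labels, and variable names are illustrative and not taken from the tutorial.

```python
import spacy
from spacy.tokens import Span
from spacy.util import filter_spans

# Assumption: a blank pipeline, so the spans are built by hand instead of
# coming from a trained NER model or parser.
nlp = spacy.blank("en")
doc = nlp("Alan Turing published Computing Machinery and Intelligence in 1950.")

candidate_spans = [
    Span(doc, 0, 2, label="PERSON"),       # "Alan Turing"
    Span(doc, 3, 7, label="WORK_OF_ART"),  # "Computing Machinery and Intelligence"
    Span(doc, 3, 5),                       # "Computing Machinery", overlaps the span above
]

# Keep only the longest non-overlapping spans; passing candidate_spans
# directly to retokenizer.merge() would hit the overlap error.
spans_to_merge = filter_spans(candidate_spans)

with doc.retokenize() as retokenizer:
    for span in spans_to_merge:
        retokenizer.merge(span)

merged_tokens = [token.text for token in doc]
print(merged_tokens)
```

With a trained pipeline, the same pattern should apply by passing list(doc.ents) + list(doc.noun_chunks) to filter_spans before merging.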