Sus*_*ant · 6 · python, word2vec, spacy
The sense2vec documentation mentions three main files, the first of which is merge_text.py. I tried several input types (txt, csv, bzipped files), because merge_text.py tries to open files compressed with bzip2.
The file is located at: https://github.com/spacy-io/sense2vec/blob/master/bin/merge_text.py
What input format does this script expect? Also, could someone please suggest how to train the model?
I extended and adjusted the code samples from sense2vec.
You go from this input text:
"As far as Saudi Arabia and its motives, that is very simple also. The Saudis are good at money and arithmetic. Faced with the painful choice of losing money maintaining current production at US$60 per barrel or taking two million barrels per day off the market and losing much more money - it's an easy choice: take the path that is less painful. If there are secondary reasons like hurting US tight oil producers or hurting Iran and Russia, that's great, but it's really just about the money."
To this:
as|ADV far|ADV as|ADP saudi_arabia|ENT and|CCONJ its|ADJ motive|NOUN that|ADJ is|VERB very|ADV simple|ADJ also|ADV saudis|ENT are|VERB good|ADJ at|ADP money|NOUN and|CCONJ arithmetic|NOUN faced|VERB with|ADP painful_choice|NOUN of|ADP losing|VERB money|NOUN maintaining|VERB current_production|NOUN at|ADP us$|SYM 60|MONEY per|ADP barrel|NOUN or|CCONJ taking|VERB two_million|CARDINAL barrel|NOUN per|ADP day|NOUN off|ADP market|NOUN and|CCONJ losing|VERB much_more_money|NOUN it|PRON 's|VERB easy_choice|NOUN take|VERB path|NOUN that|ADJ is|VERB less|ADV painful|ADJ if|ADP there|ADV are|VERB secondary_reason|NOUN like|ADP hurting|VERB us|ENT tight_oil_producer|NOUN or|CCONJ hurting|VERB iran|ENT and|CCONJ russia|ENT 's|VERB great|ADJ but|CCONJ it|PRON 's|VERB really|ADV just|ADV about|ADP money|NOUN
Here is the code. Let me know if you have any questions.
import spacy
import re

# spaCy 1.x-style loading of the English model; the matcher is disabled.
nlp = spacy.load('en')
nlp.matcher = None
# Collapse most spaCy entity labels to a generic ENT tag.
LABELS = {
    'ENT': 'ENT',
    'PERSON': 'PERSON',
    'NORP': 'ENT',
    'FAC': 'ENT',
    'ORG': 'ENT',
    'GPE': 'ENT',
    'LOC': 'ENT',
    'LAW': 'ENT',
    'PRODUCT': 'ENT',
    'EVENT': 'ENT',
    'WORK_OF_ART': 'ENT',
    'LANGUAGE': 'ENT',
    'DATE': 'DATE',
    'TIME': 'TIME',
    'PERCENT': 'PERCENT',
    'MONEY': 'MONEY',
    'QUANTITY': 'QUANTITY',
    'ORDINAL': 'ORDINAL',
    'CARDINAL': 'CARDINAL'
}
pre_format_re = re.compile(r'^[\`\*\~]')
post_format_re = re.compile(r'[\`\*\~]$')
url_re = re.compile(r'(https?:\/\/)?([a-z0-9-]+\.)?([\d\w]+?\.[^\/]{2,63})')
single_linebreak_re = re.compile('\n')
double_linebreak_re = re.compile('\n{2,}')
whitespace_re = re.compile(r'[ \t]+')
quote_re = re.compile(r'"|`|´')
def strip_meta(text):
    text = text.replace('per cent', 'percent')
    # Unescape HTML entities.
    text = text.replace('&gt;', '>').replace('&lt;', '<')
    text = pre_format_re.sub('', text)
    text = post_format_re.sub('', text)
    # Protect paragraph breaks, join single line breaks, then restore them.
    text = double_linebreak_re.sub('{2break}', text)
    text = single_linebreak_re.sub(' ', text)
    text = text.replace('{2break}', '\n')
    text = whitespace_re.sub(' ', text)
    text = quote_re.sub('', text)
    return text
def transform_doc(doc):
    # Merge each named entity into a single token, mapped through LABELS.
    for ent in doc.ents:
        ent.merge(ent.root.tag_, ent.text, LABELS[ent.label_])
    # Merge noun chunks, first trimming leading tokens that are not modifiers.
    for np in doc.noun_chunks:
        while len(np) > 1 and np[0].dep_ not in ('advmod', 'amod', 'compound'):
            np = np[1:]
        np.merge(np.root.tag_, np.text, np.root.ent_type_)
    strings = []
    for sent in doc.sents:
        sentence = []
        if sent.text.strip():
            for w in sent:
                if w.is_space:
                    continue
                w_ = represent_word(w)
                if w_:
                    sentence.append(w_)
            strings.append(' '.join(sentence))
    if strings:
        return '\n'.join(strings) + '\n'
    else:
        return ''
def represent_word(word):
    if word.like_url:
        x = url_re.search(word.text.strip().lower())
        if x:
            return x.group(3) + '|URL'
        else:
            return word.text.lower().strip() + '|URL?'
    text = re.sub(r'\s', '_', word.text.strip().lower())
    tag = LABELS.get(word.ent_type_)
    # Drop punctuation such as commas, and determiners such as "the".
    if tag is None and word.pos_ not in ['PUNCT', 'DET']:
        tag = word.pos_
    elif tag is None:
        return None
    return text + '|' + tag
corpus = '''
As far as Saudi Arabia and its motives, that is very simple also. The Saudis are
good at money and arithmetic. Faced with the painful choice of losing money
maintaining current production at US$60 per barrel or taking two million barrels
per day off the market and losing much more money - it's an easy choice: take
the path that is less painful. If there are secondary reasons like hurting US
tight oil producers or hurting Iran and Russia, that's great, but it's really
just about the money.
'''
corpus_stripped = strip_meta(corpus)
doc = nlp(corpus_stripped)
corpus_ = []
for word in doc:
    # Only lemmatize NOUN and PROPN tokens whose lemma differs in length.
    if word.pos_ in ['NOUN', 'PROPN'] and len(word.text) > 3 and len(word.text) != len(word.lemma_):
        # Keep the casing of the original first character, append the rest
        # of the lemma, then restore any trailing whitespace.
        lemma_ = str(word.text[:1] + word.lemma_[1:] + word.text_with_ws[len(word.text):])
        corpus_.append(lemma_)
    # All other words are added unchanged.
    else:
        corpus_.append(word.text_with_ws)
result = transform_doc(nlp(''.join(corpus_)))
sense2vec_filename = 'text.txt'
with open(sense2vec_filename, 'w') as file:
    file.write(result)
print(result)
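Once text.txt has been written, a model can be trained on it, for example with Gensim's word2vec. This is a minimal sketch, not the sense2vec training pipeline itself; the file names and hyperparameters are illustrative, and size=/most_similar reflect the Gensim API of that era (newer releases renamed these to vector_size= and model.wv.most_similar):

from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

# Each line of text.txt is one sentence of space-separated "word|TAG"
# tokens, so LineSentence can stream it directly.
sentences = LineSentence('text.txt')

# Illustrative hyperparameters: tune min_count and size for a real corpus.
model = Word2Vec(sentences, size=300, window=5, min_count=1, workers=4)
model.save('sense2vec_gensim.model')

# Queries use the same "word|TAG" scheme produced above.
print(model.most_similar('saudi_arabia|ENT'))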
You can visualize your model with Gensim in TensorBoard using: https://github.com/ArdalanM/gensim2tensorboard
I would also adapt this code to work with the sense2vec approach (e.g. words become lowercase in the preprocessing step; simply comment that out in the code).
Happy coding, woltob
小智 · 0
The input file should be bzipped JSON. To use plain text files, just edit merge_text.py as follows (note that the edited function still opens its input with bz2, so the text file itself must still be bzip2-compressed):
def iter_comments(loc):
    with bz2.BZ2File(loc) as file_:
        for i, line in enumerate(file_):
            # Yield each raw text line instead of parsing JSON
            # and extracting its 'body' field.
            yield line.decode('utf-8', errors='ignore')
            # yield ujson.loads(line)['body']
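To prepare such an input, a plain-text corpus can be compressed up front. A minimal sketch using only the standard library (corpus.txt is a hypothetical filename):

import bz2

# Compress a plain-text corpus so the edited merge_text.py, which still
# opens its input with bz2.BZ2File, can read it.
with open('corpus.txt', 'rb') as src, bz2.BZ2File('corpus.txt.bz2', 'wb') as dst:
    for line in src:
        dst.write(line)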