Loo*_*ast 3 python nlp vectorization feature-extraction countvectorizer
I'm trying to convert a string into a numeric vector:
### Clean the string
import re

def names_to_words(names):
    print('a')
    # Replace every non-letter with a space, lowercase, then split on whitespace
    words = re.sub("[^a-zA-Z]", " ", names).lower().split()
    print('b')
    return words
### Vectorization
from sklearn.feature_extraction.text import CountVectorizer

def Vectorizer():
    Vectorizer = CountVectorizer(
        analyzer = "word",
        tokenizer = None,
        preprocessor = None,
        stop_words = None,
        max_features = 5000)
    return Vectorizer
### Test a string
s = 'abc...'
r = names_to_words(s)
feature = Vectorizer().fit_transform(r).toarray()
But when I run into input like this:
['g', 'o', 'm', 'd']
I get an error:
ValueError: empty vocabulary; perhaps the documents only contain stop words
There seems to be a problem with single-letter strings like these. What should I do? Thanks.
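The default token_pattern regexp in CountVectorizer only selects words of at least 2 characters, as stated in the documentation:
token_pattern : string
Regular expression denoting what constitutes a "token", only used if analyzer == 'word'. The default regexp select tokens of 2 or more alphanumeric characters (punctuation is completely ignored and always treated as a token separator).
In the CountVectorizer source code, the default is r"(?u)\b\w\w+\b".
Change it to r"(?u)\b\w+\b" to include one-letter words.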
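To see why the two patterns behave differently, here is a minimal sketch that applies each regexp directly with Python's `re` module. It does not go through scikit-learn. The helper name `tokenize` and the sample string are made up for illustration.

```python
import re

# Default pattern quoted above: a token needs at least two word characters.
DEFAULT_PATTERN = r"(?u)\b\w\w+\b"
# Relaxed pattern: a single word character is enough.
RELAXED_PATTERN = r"(?u)\b\w+\b"

def tokenize(text, pattern):
    """Hypothetical helper: return every substring of text matching pattern."""
    return re.findall(pattern, text)

sample = "g o m d"
default_tokens = tokenize(sample, DEFAULT_PATTERN)
relaxed_tokens = tokenize(sample, RELAXED_PATTERN)

print(default_tokens)  # []
print(relaxed_tokens)  # ['g', 'o', 'm', 'd']
```

With the default pattern no token survives, which matches the "empty vocabulary" error from the question.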
Change your code to the following (including the token_pattern parameter suggested above):
Vectorizer = CountVectorizer(
    analyzer = "word",
    tokenizer = None,
    preprocessor = None,
    stop_words = None,
    max_features = 5000,
    token_pattern = r"(?u)\b\w+\b")
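As a quick check, the following sketch fits the single-letter list from the question with both settings. It assumes scikit-learn is installed. The variable names (`docs`, `default_failed`, `vocab`) are illustrative and not part of the original code.

```python
from sklearn.feature_extraction.text import CountVectorizer

# The single-letter tokens from the question, one per document.
docs = ['g', 'o', 'm', 'd']

# With the default token_pattern no document yields a token,
# so fitting raises ValueError (empty vocabulary).
try:
    CountVectorizer().fit_transform(docs)
    default_failed = False
except ValueError:
    default_failed = True

# With the relaxed pattern, one-letter words are kept as tokens.
vectorizer = CountVectorizer(token_pattern=r"(?u)\b\w+\b")
features = vectorizer.fit_transform(docs).toarray()
vocab = sorted(vectorizer.vocabulary_)

print(default_failed)  # True
print(vocab)           # ['d', 'g', 'm', 'o']
print(features.shape)  # (4, 4)
```

Each row of `features` corresponds to one input string and each column to one vocabulary entry, so every row here contains a single 1.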