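I have 900 different text files loaded into my console, totaling about 3.5 million words. I am running the document clustering algorithm seen here and am having problems with the TfidfVectorizer function. Here is what I am looking at: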
from sklearn.feature_extraction.text import TfidfVectorizer

# define vectorizer parameters
tfidf_vectorizer = TfidfVectorizer(max_df=0.8, max_features=200000,
                                   min_df=0.4, stop_words='english',
                                   use_idf=True, tokenizer=tokenize_and_stem, ngram_range=(1,3))

store_matrix = {}
for key,value in speech_dict.items():
    tfidf_matrix = tfidf_vectorizer.fit_transform(value) #fit the vectorizer to synopses
    store_matrix[key] = tfidf_matrix
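This code runs until ValueError: After pruning, no terms remain. Try a lower min_df or a higher max_df. pops up. The error only goes away when I raise max_df to 0.99 and lower min_df to 0.01. Then it seems to run forever, because it is including basically all 3.5 million terms.
How can I fix this?
My text files are stored in speech_dict, whose keys are the file names and whose values are the text.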
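One possible cause, offered as an assumption rather than a confirmed diagnosis: the loop fits the vectorizer on one file at a time, while min_df and max_df are document-frequency thresholds. In a "corpus" of a single document every term appears in 100% of the documents, which exceeds max_df=0.8, so everything is pruned. The sketch below fits once on all texts instead. The small speech_dict is made-up sample data, and the custom tokenize_and_stem tokenizer and the n-gram range are left out to keep the example self-contained.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical stand-in for the real speech_dict (file name -> full text).
speech_dict = {
    "a.txt": "the economy is growing and jobs are growing",
    "b.txt": "the economy needs new jobs and new ideas",
    "c.txt": "education and health care need new ideas",
    "d.txt": "health care costs are growing every year",
}

names = list(speech_dict)
texts = [speech_dict[name] for name in names]

# min_df / max_df are fractions of documents, so fit on the whole corpus once.
tfidf_vectorizer = TfidfVectorizer(max_df=0.8, min_df=0.4, stop_words="english")
tfidf_matrix = tfidf_vectorizer.fit_transform(texts)  # one row per file

# Map each file name back to its row of the matrix.
store_matrix = {name: tfidf_matrix[i] for i, name in enumerate(names)}
print(tfidf_matrix.shape)
```

With 900 real files, min_df=0.4 still keeps only terms that occur in at least 40% of the files, so a lower value may be needed depending on the vocabulary.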
How would you write Python code to check whether the operation ∘ on the set {0, 1, ..., n−1}, defined by a Cayley table, is associative?
The code I tried is:
def is_associative_cayley_table(table):
    # is_cayley_table is assumed to be defined elsewhere
    if not is_cayley_table(table):
        return False
    for i in range(len(table)):
        for j in range(len(table)):
            for k in range(len(table)):
                # compare (i*j)*k with i*(j*k)
                if table[table[i][j]][k] != table[i][table[j][k]]:
                    return False
    return True
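For reference, here is a self-contained sketch of the same check that does not depend on is_cayley_table. It assumes the input is already a valid n×n table whose entries all lie in 0..n-1, with table[i][j] giving the result of i ∘ j. The function name and the two sample tables are illustrative only.

```python
from itertools import product


def is_associative(table):
    """Return True if (i*j)*k == i*(j*k) for all i, j, k, where i*j is table[i][j]."""
    n = len(table)
    return all(
        table[table[i][j]][k] == table[i][table[j][k]]
        for i, j, k in product(range(n), repeat=3)
    )


# Addition modulo 3 is associative.
add_mod3 = [[(i + j) % 3 for j in range(3)] for i in range(3)]
# Subtraction modulo 3 is not: (0 - 1) - 1 = 1, but 0 - (1 - 1) = 0.
sub_mod3 = [[(i - j) % 3 for j in range(3)] for i in range(3)]

print(is_associative(add_mod3))  # True
print(is_associative(sub_mod3))  # False
```

The check visits all n³ triples, which is fine for small tables; all() stops at the first triple that fails.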
I have many text files, each of which has a blank line at the end. My script does not seem to remove them. Can anyone help?
# python 2.7
import os
import sys
import re

filedir = 'F:/WF/'
dir = os.listdir(filedir)
for filename in dir:
    if 'ABC' in filename:
        filepath = os.path.join(filedir,filename)
        all_file = open(filepath,'r')
        lines = all_file.readlines()
        output = 'F:/WF/new/' + filename
        # Read in each row and parse out components
        for line in lines:
            # Weed out blank lines
            line = filter(lambda x: not x.isspace(), lines)
            # Write to the new directory
            f = open(output,'w')
            f.writelines(line)
            f.close()
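A minimal sketch of one way to drop only the trailing blank lines, written for Python 3 rather than the 2.7 used in the script. The helper names strip_trailing_blank_lines and clean_file are made up for illustration, and the sketch assumes "blank" means a line that contains nothing but whitespace.

```python
def strip_trailing_blank_lines(lines):
    """Return a copy of lines without the whitespace-only lines at the end."""
    end = len(lines)
    while end > 0 and lines[end - 1].strip() == "":
        end -= 1
    return lines[:end]


def clean_file(src_path, dst_path):
    """Copy src_path to dst_path, leaving out trailing blank lines."""
    with open(src_path) as src:
        lines = src.readlines()
    with open(dst_path, "w") as dst:
        dst.writelines(strip_trailing_blank_lines(lines))


lines = ["first\n", "\n", "second\n", "\n", "   \n"]
cleaned = strip_trailing_blank_lines(lines)
print(cleaned)  # ['first\n', '\n', 'second\n']
```

Blank lines in the middle of the file are kept. If the "blank line" is really just the newline that ends the last line of text, that newline is kept too and would have to be stripped from the last element separately.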
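I am trying to do something that I know must be basic pandas, but I am racking my brain trying to figure it out. I would like the proportion and the count of each group to be available for an arbitrary level of grouping: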
import pandas as pd
df = pd.DataFrame({'A': [1, 0, 1, 0, 1, 0, 0, 0], 'B': ['A'] * 4 + ['B'] * 4})
gb = df.groupby(['A', 'B']).size()
prop_gb = gb / gb.groupby(level=0).sum()
prop_gb is now:
prop_gb
Out[116]:
A  B
0  A    0.400000
   B    0.600000
1  A    0.666667
   B    0.333333
dtype: float64
However, what I ultimately want is this:
A  B      prop  count
0  A  0.400000      2
   B  0.600000      3
1  A  0.666667      2
   B  0.333333      1
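I tried merging the two pandas.Series objects, gb and prop_gb, and also converting them to dictionaries and "joining" them that way, but I know there must be a native pandas way to do this...
This technically achieves what I want: …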
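One possible pandas-native approach, sketched under the assumption that both Series share the same MultiIndex: build a DataFrame from a dict of the two Series so that they align on that index. The column names prop and count come from the desired output above; using transform('sum') in place of the division against the aggregated sum is a substitution made here so that the denominator keeps the full index.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 0, 1, 0, 1, 0, 0, 0], 'B': ['A'] * 4 + ['B'] * 4})

# Count of each (A, B) group, as in the question.
gb = df.groupby(['A', 'B']).size()

# Share of each group within its level-0 (A) total; transform keeps gb's index.
prop_gb = gb / gb.groupby(level=0).transform('sum')

# Series with the same MultiIndex line up as columns of one DataFrame.
result = pd.DataFrame({'prop': prop_gb, 'count': gb})
print(result)
```

For a deeper grouping, the level argument can name whichever index level or levels should form the denominator.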
Suppose I have the following data:
import pandas as pd
import numpy as np
import random
from string import ascii_uppercase
random.seed(100)
n = 1000000
# Create a bunch of factor data... throw some NaNs in there for good measure
data = {letter: [random.choice(list(ascii_uppercase) + [np.nan]) for _ in range(n)] for letter in ascii_uppercase}
df = pd.DataFrame(data)
I want to quickly compute the global number of occurrences of each value across all the values in the DataFrame.
This works:
from collections import Counter
c = Counter([v for c in df for v in df[c].fillna(-999)])
But it is slow:
%timeit Counter([v for c in df for v in …
I have some training pipelines that make heavy use of XGBoost rather than scikit-learn, solely because of the way XGBoost cleanly handles null values.
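However, I have been tasked with introducing machine learning to non-technical people, and I thought it would be best to take the idea of a single-tree classifier and talk about how XGBoost, in general, takes that data structure and "puts it on steroids." Specifically, I want to plot this single-tree classifier to show the cut points.
Is specifying n_estimators=1 roughly equivalent to using scikit-learn's DecisionTreeClassifier?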
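As a rough illustration of the scikit-learn side only, the sketch below fits a shallow DecisionTreeClassifier and prints its cut points as text with sklearn.tree.export_text. The iris dataset and max_depth=2 are arbitrary choices made for this example, and it does not settle whether a one-tree XGBoost model behaves the same way.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()

# A shallow tree keeps the printed rules short enough for a presentation.
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(iris.data, iris.target)

# Each "<=" line in the output is one cut point on a named feature.
rules = export_text(tree, feature_names=list(iris.feature_names))
print(rules)
```

sklearn.tree.plot_tree draws the same tree graphically when matplotlib is available.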
I am trying to find the right number of clusters, k, according to the silhouette score, using sklearn.cluster.MiniBatchKMeans.
from sklearn.cluster import MiniBatchKMeans
from sklearn.feature_extraction.text import HashingVectorizer

docs = ['hello monkey goodbye thank you', 'goodbye thank you hello', 'i am going home goodbye thanks', 'thank you very much sir', 'good golly i am going home finally']
vectorizer = HashingVectorizer()
X = vectorizer.fit_transform(docs)
for k in range(5):
    model = MiniBatchKMeans(n_clusters = k)
    model.fit(X)
I get this error:
Warning (from warnings module):
File "C:\Python34\lib\site-packages\sklearn\cluster\k_means_.py", line 1279
0, n_samples - 1, init_size)
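DeprecationWarning: This function is deprecated. Please call randint(0, 4 …
I want to port an XGBoost model trained in R to Python, and vice versa, to ensure predictive performance across the two platforms.
I can pip install and import just about any package on my Mac in a virtual environment by doing the following:
Setting up the virtual environment: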
Last login: Mon Oct 3 18:47:06 on ttys000
me-MacBook-Pro-3:~ me$ cd /Users/me/Desktop/
me-MacBook-Pro-3:Desktop me$ virtualenv env
New python executable in /Users/me/Desktop/env/bin/python
Installing setuptools, pip, wheel...done.
me-MacBook-Pro-3:Desktop me$ source env/bin/activate
Let's pip install pandas:
(env) me-MacBook-Pro-3:Desktop me$ pip install pandas
Collecting pandas
Using cached pandas-0.19.0-cp27-cp27m-macosx_10_6_intel.macosx_10_9_intel.macosx_10_9_x86_64.macosx_10_10_intel.macosx_10_10_x86_64.whl
Collecting pytz>=2011k (from pandas)
Using cached pytz-2016.7-py2.py3-none-any.whl
Collecting python-dateutil (from pandas)
Using cached python_dateutil-2.5.3-py2.py3-none-any.whl
Collecting numpy>=1.7.0 (from pandas)
Using cached numpy-1.11.1-cp27-cp27m-macosx_10_6_intel.macosx_10_9_intel.macosx_10_9_x86_64.macosx_10_10_intel.macosx_10_10_x86_64.whl
Collecting six>=1.5 (from python-dateutil->pandas)
Using cached six-1.10.0-py2.py3-none-any.whl
Installing …
I am trying to create a pipeline with sklearn.compose.ColumnTransformer for transforming both categorical and continuous input data:
import numpy as np  # needed for np.nan below
import pandas as pd
from sklearn.base import TransformerMixin, BaseEstimator
from sklearn.preprocessing import OneHotEncoder, FunctionTransformer
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer, make_column_transformer
from sklearn.impute import SimpleImputer

df = pd.DataFrame(
    {
        'a': [1, 'a', 1, np.nan, 'b'],
        'b': [1, 2, 3, 4, 5],
        'c': list('abcde'),
        'd': list('aaabb'),
        'e': [0, 1, 1, 0, 1],
    }
)
for col in df.select_dtypes('object'):
    df[col] = df[col].astype(str)
categorical_columns = list('acd')
continuous_columns = list('be')
categorical_transformer = OneHotEncoder(sparse=False, handle_unknown='ignore')
continuous_transformer = 'passthrough'
column_transformer = …