Posts by pla*_*nne

Pyspark CountVectorizer and term frequency in a corpus

I am currently working on a text corpus.
Suppose I have cleaned my verbatims and I have the following pyspark DataFrame:

df = spark.createDataFrame([(0, ["a", "b", "c"]),
                            (1, ["a", "b", "b", "c", "a"])],
                            ["label", "raw"])
df.show()

+-----+---------------+
|label|            raw|
+-----+---------------+
|    0|      [a, b, c]|
|    1|[a, b, b, c, a]|
+-----+---------------+


I now want to apply a CountVectorizer, so I use pyspark.ml.feature.CountVectorizer as follows:

from pyspark.ml.feature import CountVectorizer

cv = CountVectorizer(inputCol="raw", outputCol="vectors")
model = cv.fit(df)
model.transform(df).show(truncate=False)

+-----+---------------+-------------------------+
|label|raw            |vectors                  |
+-----+---------------+-------------------------+
|0    |[a, b, c]      |(3,[0,1,2],[1.0,1.0,1.0])|
|1    |[a, b, b, c, a]|(3,[0,1,2],[2.0,2.0,1.0])|
+-----+---------------+-------------------------+


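Now I would also like to get the vocabulary selected by the CountVectorizer, along with the corresponding term frequencies in the corpus.
Using model.vocabulary only provides the vocabulary:

voc = model.vocabulary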
voc
[u'b', u'a', …

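The corpus-wide term frequencies asked about here are the per-document counts summed over all rows. Below is a minimal plain-Python sketch of that idea, using the two token lists from the example DataFrame. It relies on collections.Counter, not on the CountVectorizer API, and the tie-breaking rule in the sort is an assumption made for illustration only.

```python
from collections import Counter

# Token lists copied from the example DataFrame above.
docs = [["a", "b", "c"], ["a", "b", "b", "c", "a"]]

# Sum the per-document counts to get corpus-wide term frequencies.
corpus_counts = Counter()
for tokens in docs:
    corpus_counts.update(tokens)

# Sort by descending frequency. Ties are broken alphabetically here,
# which is an illustrative choice and may differ from Spark's ordering.
term_freq = sorted(corpus_counts.items(), key=lambda kv: (-kv[1], kv[0]))
print(term_freq)
```

In Spark itself, the same totals could presumably be obtained by exploding the raw column with pyspark.sql.functions.explode and counting per token with groupBy(...).count(); that route is a suggestion, not something stated in the post.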

python text-mining pyspark

pla*_*nne

2018 05-10

6
votes

1
answer

7723
views

Unexpected result from python re.sub() with a non-capturing group

I can't understand the following output:

import re 

re.sub(r'(?:\s)ff','fast-forward',' ff')
'fast-forward'


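According to the documentation:

Return the string obtained by replacing the leftmost non-overlapping occurrences of pattern in string by the replacement repl.

So why is the whitespace included in the matched occurrence and then replaced, given that I put a non-capturing group in front of it?

I would like the following output: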

' fast-forward'

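A non-capturing group only prevents the group from being numbered; whatever it matches is still part of the overall match, so re.sub() replaces it too. The sketch below shows two common ways to keep the leading whitespace. The variable names are illustrative and not from the post.

```python
import re

text = " ff"

# A non-capturing group is still part of the overall match, so the
# whitespace it matches is replaced together with "ff".
assert re.sub(r"(?:\s)ff", "fast-forward", text) == "fast-forward"

# Option 1: a lookbehind asserts the whitespace without consuming it.
with_lookbehind = re.sub(r"(?<=\s)ff", "fast-forward", text)

# Option 2: capture the whitespace and put it back via a backreference.
with_backref = re.sub(r"(\s)ff", r"\1fast-forward", text)

print(repr(with_lookbehind), repr(with_backref))
```

The lookbehind variant leaves the replacement string untouched, while the backreference variant also works when the preceding pattern has variable width, which Python's re lookbehind does not allow.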

python regex

pla*_*nne

lucky-day

4
votes

1
answer

814
views

Pandas groupby week given a datetime column

Suppose I have the following data sample:

import pandas as pd

df = pd.DataFrame({'date':['2011-01-01','2011-01-02',
                       '2011-01-03','2011-01-04','2011-01-05',
                       '2011-01-06','2011-01-07','2011-01-08',
                       '2011-01-09','2011-12-30','2011-12-31'],
                   'revenue':[5,3,2,
                              10,12,2,
                              1,0,6,10,12]})

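# Let's format the date and add the week number and year
df['date'] = pd.to_datetime(df['date'],format='%Y-%m-%d')
# Series.dt.week was removed in pandas 2.0; isocalendar().week gives the same ISO week number
df['week_of_year'] = df['date'].dt.isocalendar().week
df['year'] = df['date'].dt.year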

df

        date        revenue     week_of_year    year
0       2011-01-01  5           52              2011
1       2011-01-02  3           52              2011
2       2011-01-03  2           1               2011
3       2011-01-04  10          1               2011
4       2011-01-05  12          1               2011
5       2011-01-06  2           1               2011
6       2011-01-07  1           1               2011
7       2011-01-08  0           1               2011
8       2011-01-09  6           1 …

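The question text is cut off here, so the following only illustrates a pitfall visible in the table above: 2011-01-01 falls in ISO week 52 (which belongs to ISO year 2010), so grouping by calendar year plus week number would merge it with the last days of December 2011. This is a minimal standard-library sketch, assuming the goal is weekly revenue totals. The names rows, iso_totals and naive_totals are hypothetical.

```python
from collections import defaultdict
from datetime import date

# (date, revenue) pairs copied from the sample data above.
rows = [("2011-01-01", 5), ("2011-01-02", 3), ("2011-01-03", 2),
        ("2011-01-04", 10), ("2011-01-05", 12), ("2011-01-06", 2),
        ("2011-01-07", 1), ("2011-01-08", 0), ("2011-01-09", 6),
        ("2011-12-30", 10), ("2011-12-31", 12)]

iso_totals = defaultdict(int)    # keyed by (ISO year, ISO week)
naive_totals = defaultdict(int)  # keyed by (calendar year, ISO week)

for day, revenue in rows:
    d = date.fromisoformat(day)
    iso = d.isocalendar()        # (ISO year, ISO week, ISO weekday)
    iso_totals[(iso[0], iso[1])] += revenue
    naive_totals[(d.year, iso[1])] += revenue

print(dict(iso_totals))
print(dict(naive_totals))
```

With the ISO year as part of the key, the first two days of January 2011 stay separate from the last two days of December 2011; with the calendar year they collapse into a single (2011, 52) group.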

python datetime pandas pandas-groupby

pla*_*nne

2018 06-30

3
votes

1
answer

2762
views