Posts by Bon*_*son

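Best tool for extracting text from a PDF in Python 3.4

I am using Python 3.4 and need to extract all the text from a PDF, which I will then use for text processing.

All the answers I have seen suggest options for Python 2.7.

I need something that works in Python 3.4.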

Bonson

pdf python-3.x

Bon*_*son

lucky-day

40
votes

2
answers

40k
views

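How to pass multiple arguments to the apply function

I have a method named counting that takes two arguments. I need to call this method using the apply() method. However, when I pass both arguments to apply, it gives the following error:

TypeError: counting() takes exactly 2 arguments (1 given)

I have already seen the following thread: python pandas: apply a function with arguments to a series. Update: I do not want to use functools.partial, because I do not want to import additional classes just to be able to pass arguments.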

def counting(dic, strWord):
    if strWord in dic:
        return dic[strWord]
    else:
        return 0

DF['new_column'] = DF['dic_column'].apply(counting, 'word')


If I give it a single argument, it works:

def awesome_count(dic):
    # strWord is not a parameter here, so it must already exist in an enclosing scope
    if strWord in dic:
        return dic[strWord]
    else:
        return 0

DF['new_column'] = DF['dic_column'].apply(awesome_count)

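One way to forward the second argument without functools.partial is the `args` parameter of `Series.apply`, or a keyword argument. The sketch below uses made-up sample data (the contents of `DF` are an assumption, not taken from the question) and a slightly shortened `counting`:

```python
import pandas as pd

def counting(dic, strWord):
    """Return the count stored for strWord, or 0 if it is absent."""
    return dic.get(strWord, 0)

# Hypothetical sample data standing in for the real DF
DF = pd.DataFrame({'dic_column': [{'word': 3, 'other': 1}, {'other': 2}]})

# Extra positional arguments go into the `args` tuple ...
DF['new_column'] = DF['dic_column'].apply(counting, args=('word',))

# ... or they can be passed by keyword name.
DF['new_column_kw'] = DF['dic_column'].apply(counting, strWord='word')
```

Both calls pass each dictionary as `dic` and the string 'word' as `strWord`, so nothing extra has to be imported.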

python python-3.x pandas

Bon*_*son

2017 05-23

10
votes

1
answers

10k
views

Adding pandas columns to a sparse matrix

I have additional derived values for the X variables that I want to use in my model.

XAll = pd_data[['title','wordcount','sumscores','length']]
y = pd_data['sentiment']
X_train, X_test, y_train, y_test = train_test_split(XAll, y, random_state=1)


Because the title column holds text data, I first convert it separately into a document-term matrix (dtm):

vect = CountVectorizer(max_df=0.5)
vect.fit(X_train['title'])
X_train_dtm = vect.transform(X_train['title'])
column_index = X_train_dtm.indices

print(type(X_train_dtm))    # This is <class 'scipy.sparse.csr.csr_matrix'>
print("X_train_dtm shape",X_train_dtm.get_shape())  # This is (856, 2016)
print("column index:",column_index)     # This is column index: [ 533  754  859 ...,  633  950 1339]


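Now that I have the text as a document-term matrix, I want to add the other, numeric features to X_train_dtm, such as 'wordcount', 'sumscores' and 'length'. I will build the model from the new dtm, so I expect the added features to make it more accurate.

How do I add additional numeric columns of a pandas DataFrame to a sparse csr matrix?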
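A possible approach is `scipy.sparse.hstack`, which joins matrices column-wise. The sketch below uses tiny hypothetical stand-ins for `X_train_dtm` and `X_train` (their values are assumptions), and it relies on the rows of both objects being in the same order:

```python
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix, hstack

# Hypothetical stand-ins for the real X_train_dtm and X_train
X_train_dtm = csr_matrix(np.array([[1, 0, 2], [0, 3, 0]]))
X_train = pd.DataFrame({'wordcount': [5, 7],
                        'sumscores': [0.5, 1.5],
                        'length': [20, 31]})

# Convert the numeric columns to a sparse block and append it on the right
extra = X_train[['wordcount', 'sumscores', 'length']].to_numpy(dtype=float)
X_train_all = hstack([X_train_dtm, csr_matrix(extra)], format='csr')

print(X_train_all.shape)
```

The same columns would have to be appended to the test matrix in the same order before predicting.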

python pandas scikit-learn sklearn-pandas

Bon*_*son

lucky-day

10
votes

1
answers

3502
views

Adding a column to a sparse matrix

When I execute the following code, I get a sparse matrix:

import numpy as np
from scipy.sparse import csr_matrix

row = np.array([0, 0, 1, 2, 2, 2])
col = np.array([0, 2, 2, 0, 1, 2])
data = np.array([1, 2, 3, 4, 5, 6])
sp = csr_matrix((data, (row, col)), shape=(3, 3))
print(sp)

  (0, 0)        1
  (0, 2)        2
  (1, 2)        3
  (2, 0)        4
  (2, 1)        5
  (2, 2)        6


I want to add another column to this sparse matrix, so that the output is:

  (0, 0)        1
  (0, 2)        2
  (0, 3)        7
  (1, 2)        3
  (1, 3)        7
  (2, 0)        4
  (2, 1) …

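One way to get such a matrix is to extend the coordinate arrays and rebuild it with one more column. This is only a sketch: since the desired output is cut off, it assumes the new column holds 7 in every row.

```python
import numpy as np
from scipy.sparse import csr_matrix

row = np.array([0, 0, 1, 2, 2, 2])
col = np.array([0, 2, 2, 0, 1, 2])
data = np.array([1, 2, 3, 4, 5, 6])

n_rows, n_cols = 3, 3
new_values = np.full(n_rows, 7)  # assumption: 7 in every row of the new column

# Append one (row, new column index, value) triple per row
row2 = np.concatenate([row, np.arange(n_rows)])
col2 = np.concatenate([col, np.full(n_rows, n_cols)])
data2 = np.concatenate([data, new_values])

sp2 = csr_matrix((data2, (row2, col2)), shape=(n_rows, n_cols + 1))
print(sp2.toarray())
```

When the original matrix already exists and its coordinate arrays are not at hand, stacking it with a one-column sparse matrix via `scipy.sparse.hstack` is an alternative.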

python numpy scipy sparse-matrix python-3.x

Bon*_*son

2017 01-31

9
votes

1
answers

7785
views

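Error installing nltk supporting packages: nltk.download()

I have installed the nltk package. After that, I tried to download the supporting packages with nltk.download(), but I get this error:

[Errno 11001] getaddrinfo

My machine and software details are:

OS: Windows 8.1; Python: 3.3.4; NLTK package: 3.0

Below are the commands I ran in Python: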

Python 3.3.4 (v3.3.4:7ff62415e426, Feb 10 2014, 18:13:51) [MSC v.1600 64 bit (AMD64)] on win32
Type "copyright", "credits" or "license()" for more information.

import nltk

nltk.download()
showing info http://nltk.github.com/nltk_data/
True

nltk.download("all")
[nltk_data] Error loading all: <urlopen error [Errno 11001]
[nltk_data]     getaddrinfo failed>
False


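It looks like it is going to http://nltk.github.com/nltk_data/, whereas ideally it should try to fetch the data from http://www.nltk.org/nltk_data/.

On another machine, when we enter http://nltk.github.com/nltk_data/ in a browser, it redirects to http://www.nltk.org/nltk_data/. I do not understand why the redirect does not happen on my laptop.

I think this may be the problem.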
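The message "getaddrinfo failed" comes from host name resolution, so it may be worth checking whether Python on this machine can resolve the two host names at all before suspecting the redirect. A minimal diagnostic sketch (the helper name `can_resolve` is made up for illustration):

```python
import socket

def can_resolve(hostname):
    """Return True if the host name resolves to an address, else False."""
    try:
        socket.getaddrinfo(hostname, 80)
    except OSError:  # socket.gaierror is a subclass of OSError
        return False
    return True

if __name__ == "__main__":
    for host in ("nltk.github.com", "www.nltk.org"):
        print(host, "resolves" if can_resolve(host) else "does not resolve")
```

If neither name resolves from Python while a browser works, a proxy or DNS setting that only the browser uses is a plausible cause.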

Please help.

I have added a screenshot of the command prompt. I need help.

Regards, Bonson

python nltk python-3.x

Bon*_*son

2015 01-04

7
votes

2
answers

30k
views

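Maximum common subgraph in a directed graph

I am trying to represent a set of sentences as a directed graph in which each word is represented by a node. If a word repeats, the node is not duplicated; the previously existing node is reused. Let's call this graph MainG.

After this, I take a new sentence, create a directed graph of that sentence (call this graph SubG), and then look for the maximum common subgraph of SubG in MainG.

I am using the NetworkX API in Python 3.5. I understand that this is an NP-complete problem for ordinary graphs, but a linear one for directed graphs. One of the links I referred to:

How can I find the maximum common subgraph of two graphs?

I tried to execute the following code: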

import networkx as nx
import pandas as pd
import nltk

class GraphTraversal:
    def createGraph(self, sentences):
        DG=nx.DiGraph()
        tokens = nltk.word_tokenize(sentences)
        token_count = len(tokens)
        for i in range(token_count):
            if i == 0:
                continue
            DG.add_edges_from([(tokens[i-1], tokens[i])], weight=1)
        return DG


    def getMCS(self, G_source, G_new):
        """
        Creator: Bonson
        Return the MCS of the G_new graph that is present 
        in the G_source graph
        """
        order = …

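If nodes in both graphs are matched by their word labels (an assumption about the intent here, not something the question states outright), the edges shared by SubG and MainG are simply the intersection of the two edge sets. The sketch below illustrates that idea with plain Python sets instead of NetworkX, and with a whitespace split standing in for nltk.word_tokenize; the function names are made up:

```python
def sentence_edges(sentence):
    """Return the set of directed (word, next_word) edges for one sentence."""
    tokens = sentence.split()  # stand-in for nltk.word_tokenize
    return set(zip(tokens, tokens[1:]))

def common_edges(main_sentences, new_sentence):
    """Return the edges of the new sentence that also occur in the main graph."""
    main = set()
    for s in main_sentences:
        main |= sentence_edges(s)  # repeated words reuse the same node label
    return main & sentence_edges(new_sentence)

shared = common_edges(
    ["the cat sat on the mat", "the dog sat on the rug"],
    "a cat sat on the rug",
)
print(sorted(shared))
```

This yields every shared edge; picking the largest connected piece of that result would be a further step that is not shown.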

python graph networkx python-3.x

Bon*_*son

2020 03-24

5
votes

1
answers

1938
views

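Applying a custom function to text in a Pandas DataFrame without iterating over individual elements

My pandas DataFrame is very large, so I would like to modify the textLower(frame) function so that it runs as a single command and I do not have to iterate over every row to perform a series of string operations on each element.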

import pandas as pd

# Function iterates over all the rows of a pandas DataFrame
def textLower(frame):
    for index, row in frame.iterrows():
        row['Text'] = row['Text'].lower()
        # further modification on row['Text']
    return frame


def tryLower():
    cities = ['Chicago', 'New York', 'Portland', 'San Francisco',
     'Austin', 'Boston']
    dfCities = pd.DataFrame(cities, columns=['Text'])
    frame = textLower(dfCities)

    for index, row in frame.iterrows():
        print(row['Text'])
#########################  main () #########################    
def main():
    tryLower()

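A vectorised version of the lower-casing step can be sketched with the `.str` accessor, which applies a string method to the whole column at once. The function name `text_lower` and the decision to return a copy are choices made for this example:

```python
import pandas as pd

def text_lower(frame):
    """Return a copy of frame with the 'Text' column lower-cased in one step."""
    out = frame.copy()
    out['Text'] = out['Text'].str.lower()
    # further vectorised steps can be chained, e.g. out['Text'].str.strip()
    return out

cities = ['Chicago', 'New York', 'Portland', 'San Francisco',
          'Austin', 'Boston']
dfCities = pd.DataFrame(cities, columns=['Text'])
lowered = text_lower(dfCities)
print(lowered['Text'].tolist())
```

Operations that have no `.str` equivalent can still be applied per element with `out['Text'].apply(some_function)`, which avoids iterrows() but is not truly vectorised.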

python pandas

Bon*_*son

lucky-day

3
votes

1
answers

1445
views

Connecting LinkedMDB and DBpedia through a federated SPARQL query

I ran the following query and got data on films and their corresponding DBpedia URIs from LinkedMDB.

SELECT ?film ?label ?dbpediaLink
WHERE { 
  ?film rdf:type movie:film .
  ?film rdfs:label ?label .
  ?film owl:sameAs ?dbpediaLink 
  FILTER(regex(str(?dbpediaLink), "dbpedia", "i"))
}
LIMIT 100


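I want to use the ?dbpediaLink URIs to get the categories of these films from DBpedia. In addition, I need to get the value of the dcterms:subject property for the films from DBpedia. I cannot work out how to connect the two. Can I do this through SPARQL, or do I need to write code for it?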
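SPARQL 1.1 has a SERVICE keyword for federated queries, so in principle this can be done in a single query. The sketch below only builds the query text in Python. It assumes that the endpoint running the query supports SERVICE, that the rdf, rdfs, owl and movie prefixes are predefined there as in the original query, and that DBpedia is reachable at http://dbpedia.org/sparql; none of this is confirmed by the question.

```python
DBPEDIA_ENDPOINT = "http://dbpedia.org/sparql"  # assumed public DBpedia endpoint

def build_federated_query(limit=100):
    """Build a query that looks up dcterms:subject on DBpedia for each film."""
    return f"""
PREFIX dcterms: <http://purl.org/dc/terms/>

SELECT ?film ?label ?dbpediaLink ?subject
WHERE {{
  ?film rdf:type movie:film .
  ?film rdfs:label ?label .
  ?film owl:sameAs ?dbpediaLink .
  FILTER(regex(str(?dbpediaLink), "dbpedia", "i"))
  SERVICE <{DBPEDIA_ENDPOINT}> {{
    ?dbpediaLink dcterms:subject ?subject .
  }}
}}
LIMIT {limit}
"""

query = build_federated_query()
print(query)
```

If the endpoint does not support SERVICE, the fallback is to run the original query first and then query DBpedia separately for each returned ?dbpediaLink from code.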

rdf semantic-web sparql dbpedia linkedmdb

Bon*_*son

2013 10-07

1
votes

1
answers

3231
views