Posts by SPY*_*G96

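Exporting Excel data to .csv with a custom delimiter

I want to export a huge Excel file to .csv, but the data contains commas inside the cells.

How do I export the Excel data to a .csv that uses | as the delimiter?

I tried the usual "Save As" operation, but it does not work for my data.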
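
One way to sidestep commas inside cells is to write the rows with a different delimiter such as `|`. The sketch below uses Python's standard `csv` module and assumes the worksheet rows have already been loaded into a list of lists (for example with a library such as openpyxl or pandas, which is not shown here). The function name `write_delimited` and the sample rows are hypothetical.

```python
import csv
import io

def write_delimited(rows, out, delimiter="|"):
    """Write rows (an iterable of sequences) to the text stream `out`."""
    writer = csv.writer(out, delimiter=delimiter, lineterminator="\n")
    writer.writerows(rows)

# Hypothetical sample data standing in for worksheet rows;
# note the comma inside the first cell of the second row.
rows = [["name", "address"], ["Acme, Inc.", "1 Main St"]]

buf = io.StringIO()
write_delimited(rows, buf)
print(buf.getvalue())
```

To write to disk instead of an in-memory buffer, pass a file opened with `open(path, "w", newline="")` as `out`.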

excel excel-2010

SPY*_*G96

lucky-day

8
Votes

1
Answers

20k
Views

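Can Spark split a DataFrame into parts for toPandas?

I have a DataFrame with 10 million records. I need to do some operations on this data in pandas, and I do not have enough memory to hold all 10 million records in pandas at once. So I would like to break it into chunks and call toPandas on each chunk.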

df = sqlContext.sql("select * from db.table")
# do chunking to take X records at a time
# how do I generate chunked_df?
p_df = chunked_df.toPandas()
#do things to p_df

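How can I split my DataFrame into x equal parts, or into parts by record count, say 1 million at a time? Either solution is acceptable; I just need to process it in smaller chunks.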
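
Chunking by record count can be sketched with a small generic helper. This is only one possible approach, not necessarily the accepted answer: it assumes the rows are streamed to the driver with PySpark's `DataFrame.toLocalIterator()` and that each chunk is converted to pandas by hand rather than through `toPandas`. The helper name `iter_chunks` is hypothetical, and the Spark usage is left as a comment because it needs a running Spark session.

```python
from itertools import islice

def iter_chunks(rows, chunk_size):
    """Yield lists of at most chunk_size items from any iterable of rows."""
    it = iter(rows)
    while True:
        chunk = list(islice(it, chunk_size))
        if not chunk:
            return
        yield chunk

# Hypothetical usage with the question's `df` (requires pyspark and pandas):
#   import pandas as pd
#   for chunk in iter_chunks(df.toLocalIterator(), 1000000):
#       p_df = pd.DataFrame([row.asDict() for row in chunk])
#       # do things to p_df

# Plain-Python demonstration of the chunking itself.
print(list(iter_chunks(range(5), 2)))
```

Only one chunk is held in driver memory at a time, at the cost of pulling the rows through the driver sequentially.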

python pandas apache-spark

tes*_*acc

2018 10-27

7
Votes

1
Answers

4916
Views

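Is there a way to close the file opened by PdfFileReader?

I open a lot of PDFs and want to delete them after parsing, but the files stay open until the program finishes running. How do I close a PDF that was opened with PyPDF2?

Code: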

def getPDFContent(path):
    content = ""
    # Load PDF into pyPDF
    pdf = PyPDF2.PdfFileReader(file(path, "rb"))

    #Check for number of pages, prevents out of bounds errors
    max = 0
    if pdf.numPages > 3:
        max = 3
    else:
        max = (pdf.numPages - 1)

    # Iterate pages
    for i in range(0, max): 
        # Extract text from page and add to content
        content += pdf.getPage(i).extractText() + "\n"
    # Collapse whitespace
    content = " ".join(content.replace(u"\xa0", " ").strip().split())
    #pdf.close()
    return …
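
A common way to control when the file is released is to open it yourself in a `with` block and pass the open file object to the reader, so the handle is closed as soon as parsing finishes. The sketch below illustrates that pattern under stated assumptions: `parse_then_delete` and its `parse` callback are hypothetical names, and the PyPDF2 call appears only as a comment so the example runs without the library installed.

```python
import os
import tempfile

def parse_then_delete(path, parse):
    """Open path, hand the open file to parse, close it, then delete it."""
    # The with-block closes the handle when parsing finishes,
    # so the file is no longer held open when os.remove runs.
    with open(path, "rb") as f:
        result = parse(f)
    os.remove(path)
    return result

# Hypothetical usage with PyPDF2; all reading must happen inside `parse`,
# while the file is still open:
#   import PyPDF2
#   text = parse_then_delete(
#       path, lambda f: PyPDF2.PdfFileReader(f).getPage(0).extractText())

# Self-contained demo with a throwaway file standing in for a PDF.
fd, demo_path = tempfile.mkstemp(suffix=".pdf")
os.write(fd, b"%PDF-demo")
os.close(fd)
content = parse_then_delete(demo_path, lambda f: f.read())
print(content, os.path.exists(demo_path))
```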

python python-2.7 pypdf2

SPY*_*G96

2017 10-31

5
Votes

1
Answers

7138
Views

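When to use re.compile

Bear with me: I cannot include my 1,000+ line program, and there are several questions in this description.

I have several types of patterns that I search for: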

#literally just a regular word
re.search("Word", arg)

#Varying complex pattern
re.search("[0-9]{2,6}-[0-9]{2}-[0-9]{1}", arg)

#Words with varying cases and the possibility of ending special characters 
re.search("Supplier [Aa]ddress:?|Supplier [Ii]dentification:?|Supplier [Nn]ame:?", arg)

#I also use re.findall for the above patterns as well
re.findall("uses patterns above", arg)

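I have about 75 of these in total, and some need to be passed down into deeply nested functions.

When and where should I compile the patterns?

Right now I am trying to improve my program by compiling everything in main and then passing the right list of compiled RegexObjects to the functions that use them. Will this improve performance?

Would doing something like the following make my program faster?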

re.compile("pattern").search(arg)

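Does a compiled pattern stay in memory, so that if the function is called multiple times the compile step is skipped? That way I would not have to move data from function to function.

If I am moving the data around that much, is it even worth compiling all the patterns?

Is there a better way to match regular words without a regex?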
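
For reference, the `re` module keeps an internal cache of recently compiled patterns, so repeated `re.search("pattern", arg)` calls do not recompile on every call. If explicit compilation is still wanted, one option is to compile at module level so nested functions can use the patterns without receiving them as arguments. The sketch below shows that layout under stated assumptions: the names `ID_PATTERN`, `SUPPLIER_PATTERN`, `find_ids`, `has_supplier_label`, and `has_word` are hypothetical, and the patterns are copied from the question.

```python
import re

# Compiled once at import time; any function in this module can use them
# without having the pattern objects passed in as arguments.
ID_PATTERN = re.compile(r"[0-9]{2,6}-[0-9]{2}-[0-9]{1}")
SUPPLIER_PATTERN = re.compile(
    r"Supplier [Aa]ddress:?|Supplier [Ii]dentification:?|Supplier [Nn]ame:?"
)

def find_ids(text):
    """Return every ID-like substring found in text."""
    return ID_PATTERN.findall(text)

def has_supplier_label(text):
    """Return True if text contains one of the supplier labels."""
    return SUPPLIER_PATTERN.search(text) is not None

def has_word(text):
    """A literal word needs no regex; the `in` operator is enough."""
    return "Word" in text

print(find_ids("ref 1234-56-7 and 12-34-5"))
```

Whether this is measurably faster than relying on the cache depends on the workload, so it is worth timing before restructuring a large program around it.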

A short example of my code:

import re

def foo(arg, allWords):
   #Does some things with arg, then puts the result into a variable, 
   # this function does not use allWords

   data = …

python regex performance python-2.7

SPY*_*G96

2018 12-14

5
Votes

2
Answers

5671
Views