标签: text-processing

如何在 R 中将文本文件读取为一行

我正在尝试处理一个文本文件。总的来说，我有一个想要分析的语料库。为了使用 tm 包（R 中的文本挖掘包）创建一个语料库对象，我需要使该段落成为一个巨大的向量，以便能够正确阅读。

我有一个段落

          Commercial exploitation over the past two hundred years drove                  
          the great Mysticete whales to near extinction.  Variation in                   
          the sizes of populations prior to exploitation, minimal                        
          population size during exploitation and current population                     
          sizes permit analyses of the effects of differing levels of                    
          exploitation on species with different biogeographical                         
          distributions and life-history characteristics.

Run Code Online (Sandbox Code Playgroud)

我使用了 scan 和 readLine 方法，它处理文本的方式如下：

[28]“过去两百年的商业开发使
巨须鲸濒临灭绝”
[30]“开发前种群规模极小”

有没有办法摆脱换行符？或者将文本文件作为一个巨大的向量读取？

到目前为止，发布的所有解决方案都很棒，谢谢。

regex text text-processing r text-mining

Zay*_*iwa

2014 12-08

4
推荐指数

1
解决办法

6521
查看次数

从 PostgreSQL 文本正文中提取长度为 1、2 和 3 的所有 n 元语法的最快方法是什么？

我有很多文本体，对于每个文本体，我想提取所有一元组、二元组和三元组（单词，而不是字符），并将计数和 ngram 长度插入到另一个表中。

现在我正在考虑使用WITH ORDINALITY取消嵌套正则表达式分割的文本正文，然后对二元组和三元组使用多个子查询，但这需要 ordering 。但是，我认为这可能是一种低效的方法，因为这种位置数据通常应该通过索引访问。

我目前正在用Python实现这个，一个巨大的瓶颈是字典插入和字典/集合的停用词搜索。

这是一个非常基本的例子：

输入：

This is a small, small sentence.

输出

ngram                | count | length
-------------------------------------
this                 | 1 | 1
is                   | 1 | 1
a                    | 1 | 1
small                | 2 | 1
sentence             | 1 | 1
this is              | 1 | 2 
is a                 | 1 | 2
a small              | 1 | 2
small small          | 1 | 2
small sentence       | 1 | 2
this is …

Run Code Online (Sandbox Code Playgroud)

postgresql text-processing n-gram

Max*_*cia

2017 05-27

4
推荐指数

1
解决办法

1381
查看次数

使用pdfminer检测pdf的部分

我正在尝试将会议/期刊论文中的 pdf 转换为 .txt 文件。我基本上希望有一个比当前 pdf 更清晰的结构：在句子结束前没有换行符并突出显示论文的各个部分。我目前正在处理的问题是尝试自动检测部分。也就是说，在下图中，我希望能够找到 ABSTRACT、CSS CONCEPT、1 INTRODUCTION、2 THE BODY OF THE PAPER。.

如果目前使用一个简单的想法，它是有效的。我基本上让 pdf miner 完成它的工作，然后使用 NTLK 来查找句子。

def convert_pdf_to_txt(path, year):
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    codec = 'utf-8'
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    fp = open(path, 'rb')
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    password = ""
    maxpages = 0
    caching = True
    pagenos=set()

    for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages, password=password,caching=caching, check_extractable=True):
        interpreter.process_page(page)

    text = retstr.getvalue()

    sentences = sent_tokenize(text)

    size = len(sentences)
    i …

Run Code Online (Sandbox Code Playgroud)

python pdf text-processing nlp pdfminer

LBe*_*Bes

2018 11-09

4
推荐指数

1
解决办法

5625
查看次数

断言错误：某些对象具有未恢复的属性

我正在通过关注官方 TensorFlow 网站此处来训练关于文本预测的基本 LSTM 。我设法在 GTX 1050ti 上训练我的模型多达 40 个时期，并将 checkPoint 文件保存在一个单独的文件夹中。但是，当我现在尝试恢复模型时，出现了这个长错误：-

StreamExecutor device (0): GeForce GTX 1050 Ti, Compute Capability 6.1
WARNING:tensorflow:Entity <function standard_gru at 0x7f9e121324d0> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: converting <function standard_gru at 0x7f9e121324d0>: AttributeError: module 'gast' has no attribute 'Num'
WARNING:tensorflow:Entity <function cudnn_gru …

Run Code Online (Sandbox Code Playgroud)

text-processing python-3.x keras tensorflow

nee*_*l g

2020 03-31

4
推荐指数

1
解决办法

2461
查看次数

从文件中删除行

我正在unix系统上进行一些文本处理.我可以访问这台机器上的命令行,它有Python,Perl和安装的默认文本处理程序,awk等.

我有一个文本文件,如下所示:

2029754527851451717 
2029754527851451717 
2029754527851451717 
2029754527851451717 
2029754527851451717 
2029754527851451717 1232453488239 Tue Mar  3 10:47:44 2009
2029754527851451717 1232453488302 Tue Mar  3 10:47:44 2009
2029754527851451717 1232453488365 Tue Mar  3 10:47:44 2009
2895635937120524206 
2895635937120524206 
2895635937120524206 
2895635937120524206 
2895635937120524206 
2895635937120524206 
5622983575622325494 1232453323986 Thu Feb 12 15:57:49 2009

Run Code Online (Sandbox Code Playgroud)

它基本上是3行:ID ID Date

我希望删除所有没有2个ID和日期的行.因此,结果将是这样的:

2029754527851451717 1232453488239 Tue Mar  3 10:47:44 2009
2029754527851451717 1232453488302 Tue Mar  3 10:47:44 2009
2029754527851451717 1232453488365 Tue Mar  3 10:47:44 2009
5622983575622325494 1232453323986 Thu Feb 12 15:57:49 2009

Run Code Online (Sandbox Code Playgroud)

你们怎么建议这样做？总的来说,文本文件大约有30,000行.

干杯

EEF

python perl awk text text-processing

Rai*_*Son

2013 08-08

3
推荐指数

3
解决办法

781
查看次数

从字符串中删除不匹配的括号

我想从字符串中删除"un-partnered"括号.

也就是说,(除非它们跟)在字符串中的某处,否则应该删除所有内容.同样,应该删除字符串中某个地方)之前的所有内容(.

理想情况下,算法也会考虑嵌套.

例如:

"(a)".remove_unmatched_parents # => "(a)"
"a(".remove_unmatched_parents # => "a"
")a(".remove_unmatched_parents # => "a"

Run Code Online (Sandbox Code Playgroud)

ruby regex string recursion text-processing

Tom*_*man

2011 03-25

3
推荐指数

1
解决办法

1623
查看次数

无法用Python拆分String

我绝望地尝试使用Python拆分字符串,但我需要解析的文本文件有点棘手:

文本文件是逗号分隔的数据文件

我做了以下事情:

import fileinput
for line in fileinput.input("sample.txt"):
data = line.strip().split(',')
pass

Run Code Online (Sandbox Code Playgroud)

这实际上应该使工作正确吗？

好了,现在是棘手的部分:我有一些包含逗号的字段,如下所示:

"(CONTRACTS OF 5,000 BUSHELS)"

Run Code Online (Sandbox Code Playgroud)

使用我的代码,脚本也将此字段拆分为2.

我如何让python使用逗号作为分隔符,但是当它们被""括起来时却不能？

提前感谢您的回答

Crak

python parsing text-processing split

Cra*_*rak

lucky-day

3
推荐指数

2
解决办法

984
查看次数

在Haskell中表达一系列脚本化的Python字符串替换

我经常使用Python来替换文本中的各种类型的字符,使用如下所示的脚本:

#!/usr/bin/env python                                                                                                                                                                                                                                                         
# coding=UTF-8

import sys

for file in sys.argv[1:]:                                                                                                                                                                                                                                                     
    f = open(file)                                                                                                                                                                                                                                                            
    fs = f.read()
    r1 = fs.replace('\n',' ')
    r2 = r1.replace('\r',' ')                                                                                                                                                                                                                                                   
    r3 = r2.replace('. ','.\n\n')                                                                                                                                                                                                                                                   
    r4 = r3.replace('é','e')
    r5 = r4.replace('\xc2',' ')
    r6 = r5.replace('\xa0',' ')
    r7 = r6.replace(' ',' ')
    r8 = r7.replace(' ',' ')
    r9 = r8.replace('\n ','\n')
    f.close()                                                                                                                                                                                                                                                                 
    print r8

Run Code Online (Sandbox Code Playgroud)

但我现在正在学习Haskell,因为我厌倦了Python.

我在Haskell做的最好的尝试是

#!/usr/bin/runhaskell 

import System.IO

main :: IO ()
main = do 
       inh <- getArgs >>= withFileLines
       outh <- -- ??
       mainloop inh …

Run Code Online (Sandbox Code Playgroud)

file-io text-processing haskell replace

mag*_*tar

lucky-day

3
推荐指数

1
解决办法

152
查看次数

找到页面中的所有href并用链接维护以前的链接替换 - PHP

我正在尝试在网页上找到所有href链接,并用我自己的代理链接替换该链接.

例如

<a href="http://www.google.com">Google</a>

Run Code Online (Sandbox Code Playgroud)

需要是

<a href="http://www.example.com/?loadpage=http://www.google.com">Google</a>

Run Code Online (Sandbox Code Playgroud)

php text-processing hyperlink

Gle*_*ton

2012 06-28

3
推荐指数

1
解决办法

2728
查看次数

使用`awk`打印BEGIN部分文件中的行数

我正在尝试编写一个awk脚本,在完成任何操作之前告诉用户文件中有多少行.我知道如何在END部分中执行此操作但在BEGIN部分中无法执行此操作.我搜索过SE和Google,但是在END部分或者作为bash脚本的一部分只找到了六种方法,而不是在完成任何处理之前如何做到这一点.我希望得到以下内容:

#!/usr/bin/awk -f

BEGIN{
        print "There are a total of " **TOTAL LINES** " lines in this file.\n"
     }
{

        if($0==4587){print "Found record on line number "NR; exit 0;}
}

Run Code Online (Sandbox Code Playgroud)

但是如果有可能的话,一直无法确定如何做到这一点.谢谢.

awk text text-processing

Dyl*_*lan

lucky-day

3
推荐指数

1
解决办法

6417
查看次数