Extracting data from a file in Python

Question

I have a text file that contains, say:

text1 text2 text text
text text text text

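I want to first count the number of strings in the file (all separated by spaces) and then output the first two of them (text1 text2).

Any ideas?

Thanks in advance for your help.

Edit: this is what I have so far: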

>>> f=open('test.txt')
>>> for line in f:
...     print line
...
ï»¿text1 text2 text text text text hello
>>> words=line.split()
>>> words
['\xef\xbb\xbftext1', 'text2', 'text', 'text', 'text', 'text', 'hello']
>>> len(words)
7
>>> if len(words) > 2:
...     print "there are more than 2 words"

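The first problem I ran into is that my text file contains: text1 text2 text text text

But when I print the list of words, I get: ['\xef\xbb\xbftext1', 'text2', 'text', 'text', 'text', 'text', 'hello']

Where does the '\xef\xbb\xbf' come from?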

Answer 1

Mar*_*ers 16

To read a file line by line, just loop over the open file object with for:

for line in open(filename):
    # do something with line

To split a line on whitespace into a list of separate words, use str.split():

words = line.split()

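As a small illustration of why the no-argument form is convenient here (a sketch written for Python 3, with a made-up sample line): called without arguments, str.split() treats any run of whitespace as a single separator and ignores leading and trailing whitespace, including the newline at the end of each line read from a file.

```python
# Hypothetical sample line, as it might come out of iterating over a file.
line = "text1   text2\ttext text\n"

# No argument: split on runs of whitespace; the trailing newline is dropped.
words = line.split()
print(words)

# Explicit single-space separator: empty strings, the tab and the newline survive.
naive = line.split(" ")
print(naive)
```

For counting words, the no-argument form is therefore the one that gives the expected count.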

To count the number of items in a Python list, use len(yourlist):

count = len(words)

To select the first two items from a Python list, use slicing:

firsttwo = words[:2]

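One property of slicing that matters for this task, sketched in Python 3 with made-up lists: a slice never raises IndexError when the list is shorter than requested, so words[:2] is safe even for an empty line, whereas plain indexing is not.

```python
words = ["text1", "text2", "text", "text"]
short = ["only"]
empty = []

print(words[:2])   # the first two items
print(short[:2])   # fewer than two items: returns what is there
print(empty[:2])   # empty list: returns an empty list, no exception

# Indexing, by contrast, fails when the item is missing.
try:
    short[1]
except IndexError:
    print("short[1] raises IndexError")
```

A len() check is thus only needed when you must be sure that two words are actually present.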

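I'll leave building the complete program to you, but you won't need much more than the pieces above, plus an if statement to check whether you already have your two words.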

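For readers who want to see the pieces combined, here is one possible way to assemble them, written for Python 3. It is only a sketch: the function name count_and_first_two and the sample contents are hypothetical, and counting words across all lines (rather than only the last line, as in the question's session) is an assumption about what is wanted. BOM handling is deliberately left out here, since it is covered in the rest of this answer.

```python
import os
import tempfile


def count_and_first_two(path):
    """Return (total word count, first two words) of a whitespace-separated file."""
    count = 0
    first_two = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            words = line.split()
            count += len(words)
            # Keep collecting until two words have been seen.
            if len(first_two) < 2:
                first_two.extend(words[:2 - len(first_two)])
    return count, first_two


if __name__ == "__main__":
    # Write a small sample file (made-up contents) so the sketch runs on its own.
    with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as tmp:
        tmp.write(b"text1 text2 text text\ntext text text text\n")
    try:
        print(count_and_first_two(tmp.name))
    finally:
        os.remove(tmp.name)
```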
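The three extra bytes you see at the start of your file are the UTF-8 BOM (Byte Order Mark); it marks your file as UTF-8 encoded, but it is redundant (UTF-8 has no byte-order ambiguity) and is mostly written by Windows programs.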

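To make the connection to the output in the question concrete, the following Python 3 sketch shows that the three bytes are exactly the standard-library constant codecs.BOM_UTF8, and that they are the UTF-8 encoding of the single code point U+FEFF. The sample bytes are made up to mirror the question's first line, and cp1252 as the console encoding is an assumption.

```python
import codecs

raw = b"\xef\xbb\xbftext1 text2 text"

# The constant from the standard library is the same three bytes.
print(codecs.BOM_UTF8 == b"\xef\xbb\xbf")
print(raw.startswith(codecs.BOM_UTF8))

# Decoded as plain UTF-8, the BOM becomes one character, U+FEFF.
text = raw.decode("utf-8")
print(repr(text[0]))
print(len(codecs.BOM_UTF8), len(text[0]))

# Interpreted as cp1252 (assumed here to be the console's code page),
# the same bytes look like the three characters shown in the question.
mojibake = raw[:3].decode("cp1252")
print(mojibake == "\u00ef\u00bb\u00bf")
```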
You can remove it with:

import codecs
# Python 2: line is a byte string here, so compare against the raw BOM bytes
if line.startswith(codecs.BOM_UTF8):
    line = line[len(codecs.BOM_UTF8):]

You will probably also want to decode the byte string to unicode using that encoding:

line = line.decode('utf-8')

You can also open the file with codecs.open():

file = codecs.open(filename, encoding='utf-8')

Note that codecs.open() will not strip the BOM for you; the easiest way to do that is with .lstrip():

import codecs
BOM = codecs.BOM_UTF8.decode('utf8')
with codecs.open(filename, encoding='utf-8') as f:
    for line in f:
        line = line.lstrip(BOM)

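As an alternative to stripping the BOM by hand, Python's standard library also ships a utf-8-sig codec that skips a leading BOM while decoding. This is a different technique from the .lstrip() approach above and is not part of the original answer; the sketch below assumes Python 3's built-in open() and uses a temporary file with made-up contents.

```python
import os
import tempfile

# Create a sample file that starts with a UTF-8 BOM (contents are made up).
with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as tmp:
    tmp.write(b"\xef\xbb\xbftext1 text2 text text\n")

try:
    # Plain utf-8 keeps the BOM as the character U+FEFF on the first word.
    with open(tmp.name, encoding="utf-8") as f:
        with_bom = f.read().split()

    # utf-8-sig consumes the BOM, if present, before returning any text.
    with open(tmp.name, encoding="utf-8-sig") as f:
        without_bom = f.read().split()
finally:
    os.remove(tmp.name)

print(with_bom[0] == "\ufefftext1")
print(without_bom[:2])
```

The utf-8-sig codec also reads files that have no BOM, so it can be used unconditionally when the input may or may not carry one.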

Archived: 13 years, 1 month ago
Views: 19384
Last activity: 9 years, 10 months ago