Python: detecting the tab character

Question

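I am trying to separate words from integers in a particular file. The lines in the file take the following forms (lines containing words have no '\t' character, but the integers, all of them positive, do; some of the words are numbers containing a '-' character):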

-1234
\t22
\t44
\t46
absv
\t1
\t2
\t4
...

So my idea was to tell words and numbers apart by trying to convert each token on a line to a float.

import codecs

def is_number(s):
    """Return True if s can be parsed as a float."""
    try:
        float(s)
        return True
    except ValueError:
        return False

with codecs.open("/media/New Volume/3rd_step.txt", 'Ur') as file:  # open the file
    for line in file:  # read line by line
        temp_buffer = line.split()  # split on whitespace (this discards the tab)
        for word in temp_buffer:
            if not('-' in word or not is_number(word)):
            ....

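So if the token is a word I get an exception, and if not, it is a number. The file is 50 GB, and somewhere in the middle the file format seems to be broken. That means the only reliable way to separate words from numbers is the \t character. But how can I detect it? I split the line to get the strings, and in doing so I lose the special characters.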

Edit:

I was being really silly, sorry for wasting your time. It seems I can find it much more easily this way:

with codecs.open("/media/D60A6CE00A6CBEDD/InvertedIndex/1.txt", 'Ur') as file:  # open the file
    for line in file:  # read line by line
        if '\t' not in line:  # no tab, so this line holds a word
            print(line)

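The check in the edit can be extended to label every line rather than only print the words. The following is a minimal sketch, assuming the layout described in the question (number lines contain a tab, word lines do not); the function name `classify_lines` and the in-memory sample are hypothetical and stand in for iterating over the real file.

```python
def classify_lines(lines):
    """Yield (kind, value) pairs: 'number' for lines with a tab, 'word' otherwise."""
    for line in lines:
        text = line.rstrip("\r\n")  # drop the line ending but keep any tab
        if not text:
            continue  # skip blank lines
        if "\t" in text:
            yield ("number", text.strip())  # strip the tab once the line is labelled
        else:
            yield ("word", text)

# Hypothetical sample in the format shown in the question.
sample = ["-1234\n", "\t22\n", "\t44\n", "absv\n", "\t1\n"]
result = list(classify_lines(sample))
print(result)
```

Because the function is a generator and looks at one line at a time, it could be passed an open file object directly, which matters for a file of this size.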

Answer 1

The*_*nse 5

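You should try passing an argument to split() instead of relying on the default, which splits on all whitespace characters. You can split on every whitespace character except \t to begin with. Try this: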

import string

white_str = list(string.whitespace)    # string.whitespace contains all whitespace.
white_str.remove("\t")                 # Remove \t
white_str = ''.join(white_str)         # New whitespace string, without \t

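Then, instead of split(), split on any of the characters in white_str. Note that str.split(white_str) treats the whole string as a single separator, so it will not do this on its own; a regular-expression character class, for example re.split('[' + re.escape(white_str) + ']+', line), will. This splits your line on all whitespace except \t and gives you your strings. You can then detect \t later wherever you need it.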
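Splitting on every whitespace character except the tab can be sketched as follows. This is one possible approach using `re.split` with a character class, not necessarily what the answer's author had in mind; the names `non_tab_ws`, `splitter` and `split_keep_tabs` are hypothetical.

```python
import re
import string

# All whitespace characters except the tab.
non_tab_ws = string.whitespace.replace("\t", "")

# A character class matches any ONE of the listed characters, unlike
# str.split(sep), where sep is treated as one whole separator string.
splitter = re.compile("[" + re.escape(non_tab_ws) + "]+")

def split_keep_tabs(line):
    """Split on runs of non-tab whitespace, dropping empty pieces."""
    return [piece for piece in splitter.split(line) if piece]

tokens = split_keep_tabs("absv\n\t1 \t2\n")
# Tokens that still start with a tab are the numbers.
numbers = [int(t) for t in tokens if t.startswith("\t")]
print(tokens, numbers)
```

The tab survives inside each token, so a later `startswith("\t")` check is enough to tell numbers from words.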
