Python: detect number separator symbols and parse into a float without a locale

Question

Jos*_*eak 4 python formatting python-2.x

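I have a dataset of millions of text files in which numbers are stored as strings, formatted according to a variety of locales. What I want to do is guess which symbol is the decimal separator and which is the thousands separator.

This shouldn't be too hard, but the question doesn't seem to have been asked yet, so for posterity it should be asked and answered here.

What I do know is that there is always a decimal separator, and that it is always the last non-[0-9] symbol in the string.

As you can see in the dataset below, a naive fix such as numStr.replace(',', '.') for the variation in decimal separators conflicts with the possible thousands separators.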
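To illustrate the conflict described above, here is a minimal sketch. The helper name `naive_parse` is hypothetical and not part of the original post; it only wraps the `replace(',', '.')` idea so the failure is visible.

```python
def naive_parse(num_str):
    # Naive fix: treat every comma as a decimal point.
    return float(num_str.replace(',', '.'))

# Works when the comma really is the decimal separator.
print(naive_parse('1,0000'))

# Fails when the comma is a thousands separator:
# '10,000.0000' becomes '10.000.0000', which float() rejects.
try:
    naive_parse('10,000.0000')
except ValueError as exc:
    print('failed:', exc)
```

The second call raises `ValueError` because the string ends up with two dots.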

I have seen ways to do this when the locale is known, but in this case I do not know the locale.

Dataset:

1.0000 //1.0
1,0000 //1.0
10,000.0000 //10000.0
10.000,0000 //10000.0
1,000,000.0000 // 1000000.0
1.000.000,0000 // 1000000.0

//also possible

1 000 000.0000 //1000000.0 with spaces as thousand separators

Answer 1

Joh*_*024 5

One approach:

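import re

# 'numbers' is a text file with one number per line.
with open('numbers') as fhandle:
    for line in fhandle:
        line = line.strip()
        # Everything that is not an ASCII digit counts as a separator.
        separators = re.sub('[0-9]', '', line)
        # All separators except the last are thousands separators: drop them.
        for sep in separators[:-1]:
            line = line.replace(sep, '')
        # The last separator is the decimal separator: normalize it to '.'.
        if separators:
            line = line.replace(separators[-1], '.')
        print(line)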

On the sample input (with the comments removed), the output is:

1.0000
1.0000
10000.0000
10000.0000
1000000.0000
1000000.0000
1000000.0000
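
The loop above prints normalized strings. As a minimal sketch of the same idea packaged as a function that returns a float, one might write the following. The name `parse_number` is hypothetical, and the sketch relies on the question's assumption that a decimal separator is always present and is the last non-digit symbol. Like the loop above, it does not preserve a leading sign.

```python
import re

def parse_number(text):
    """Parse a number whose last non-digit symbol is the decimal separator."""
    text = text.strip()
    separators = re.sub(r'[0-9]', '', text)
    if not separators:
        # No separator at all: plain integer string.
        return float(text)
    decimal_sep = separators[-1]
    # Split at the last separator; everything before it is the integer part.
    integer_part, _, fraction_part = text.rpartition(decimal_sep)
    # Drop every remaining non-digit symbol (the thousands separators).
    integer_part = re.sub(r'[^0-9]', '', integer_part)
    return float(integer_part + '.' + fraction_part)

samples = ['1.0000', '1,0000', '10,000.0000', '10.000,0000',
           '1,000,000.0000', '1.000.000,0000', '1 000 000.0000']
for s in samples:
    print(s, '->', parse_number(s))
```

Splitting with `rpartition` keeps the decimal separator distinct from the thousands separators even when the two are unrelated characters, such as a space and a dot.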


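Update: handling Unicode

As NeoZenith points out in the comments, with modern Unicode fonts the venerable regular expression [0-9] is not reliable. Use the following instead: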

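import re

# 'numbers' is a text file with one number per line.
with open('numbers') as fhandle:
    for line in fhandle:
        line = line.strip()
        # With re.U, \d matches any Unicode decimal digit, not just 0-9.
        separators = re.sub(r'\d', '', line, flags=re.U)
        # All separators except the last are thousands separators: drop them.
        for sep in separators[:-1]:
            line = line.replace(sep, '')
        # The last separator is the decimal separator: normalize it to '.'.
        if separators:
            line = line.replace(separators[-1], '.')
        print(line)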

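Without the re.U flag, \d is equivalent to [0-9]. With the flag, \d matches anything classified as a decimal digit in the Unicode character properties database. (This describes Python 2, which the question is tagged with; in Python 3, \d in a str pattern already matches Unicode decimal digits unless re.ASCII is set.) Alternatively, to handle unusual digit characters, one may want to consider using unicode.translate.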
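
As a rough sketch of handling unusual digit characters, the following assumes Python 3 and uses `unicodedata.decimal` in place of the `unicode.translate` approach mentioned above, which is a Python 2 method. The helper names `to_ascii_digits` and `parse_unicode_number` are hypothetical. Each Unicode decimal digit is mapped to its ASCII equivalent first, then the same last-separator rule is applied.

```python
import re
import unicodedata

def to_ascii_digits(text):
    # Map every Unicode decimal digit to its ASCII counterpart;
    # all other characters (including separators) pass through unchanged.
    return ''.join(
        str(unicodedata.decimal(ch)) if ch.isdecimal() else ch
        for ch in text
    )

def parse_unicode_number(text):
    text = to_ascii_digits(text.strip())
    # After the mapping, only separators remain once digits are removed.
    separators = re.sub(r'\d', '', text)
    # All separators except the last are thousands separators: drop them.
    for sep in separators[:-1]:
        text = text.replace(sep, '')
    # The last separator is the decimal separator: normalize it to '.'.
    if separators:
        text = text.replace(separators[-1], '.')
    return float(text)

# Arabic-Indic digits with the Arabic thousands (U+066C)
# and decimal (U+066B) separators.
arabic = '\u0661\u0660\u066c\u0660\u0660\u0660\u066b\u0665'
print(parse_unicode_number(arabic))  # 10000.5
```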
