Filter out all non-kanji characters from text with Python 3

alp*_*411 4 character

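I have some text that contains Latin letters and Japanese characters (hiragana, katakana, and kanji).

I want to filter out all Latin characters, hiragana, and katakana, but I don't know how to do this elegantly. My straightforward approach would be to filter out every letter of the Latin alphabet and every hiragana/katakana character individually, but I'm sure there is a better way.

I guess I have to use a regular expression, but I'm not sure how to go about it. Are characters somehow classified as Roman letters, Japanese, Chinese, and so on? If so, can I make use of that?

Here is some sample text: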

"Lesson 1:",, "?","???","I" "???","?????","We" "? ??","???","You" "???","????","That person" "???","????","That person (polite)" "???","????"

The program should return only the kanji, like this:

`???????`

alp*_*411 5

Thanks to Olsgaarddk on Reddit, I found the answer.

https://github.com/olsgaard/Japanese_nlp_scripts/blob/master/jp_regex.py

# -*- coding: utf-8 -*-
import re

''' This is a library of functions and variables that are helpful to have handy 
    when manipulating Japanese text in python.
    This is optimized for Python 3.x, and takes advantage of the fact that all strings are unicode.
    Copyright (c) 2014-2015, Mads Sørensen Ølsgaard
    All rights reserved.
    Released under BSD3 License, see http://opensource.org/licenses/BSD-3-Clause or license.txt '''




## UNICODE BLOCKS ##

# Regular expression unicode blocks collected from 
# http://www.localizingjapan.com/blog/2012/01/20/regular-expressions-for-japanese-text/

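# Ranges are written as \u escapes, which the re module resolves itself
# (Python 3.3+), so the patterns do not depend on the file's encoding.
hiragana_full = r'[\u3041-\u309F]'
katakana_full = r'[\u30A0-\u30FF]'
kanji = r'[\u3400-\u4DB5\u4E00-\u9FCB\uF900-\uFA6A]'
radicals = r'[\u2E80-\u2FD5]'
katakana_half_width = r'[\uFF5F-\uFF9F]'
alphanum_full = r'[\uFF01-\uFF5E]'
symbols_punct = r'[\u3001-\u303F]'
misc_symbols = r'[\u31F0-\u31FF\u3220-\u3243\u3280-\u32FE\u3300-\u337F]'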
ascii_char = r'[ -~]'

## FUNCTIONS ##

def extract_unicode_block(unicode_block, string):
    ''' extracts and returns all texts from a unicode block from string argument.
        Note that you must use the unicode blocks defined above, or patterns of similar form '''
    return re.findall( unicode_block, string)

def remove_unicode_block(unicode_block, string):
    ''' removes all characters from a unicode block and returns all remaining texts from string argument.
        Note that you must use the unicode blocks defined above, or patterns of similar form '''
    return re.sub( unicode_block, '', string)

## EXAMPLES ## 

text = '????? ???????????????????????????????????????????????????????????????abc????????????'

print('Original text string:', text, '\n')
print('All kanji removed:', remove_unicode_block(kanji, text))
print('All hiragana in text:', ''.join(extract_unicode_block(hiragana_full, text)))
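The examples above remove kanji and extract hiragana, while the question asks for the opposite: keep only the kanji. A minimal sketch of that, assuming the same three code point ranges as the `kanji` pattern (CJK Extension A, CJK Unified Ideographs, CJK Compatibility Ideographs). The names `KANJI`, `keep_kanji`, and the sample string are hypothetical and not part of the linked library:

```python
import re

# Assumed ranges, mirroring the `kanji` pattern in the script above:
# U+3400-U+4DB5, U+4E00-U+9FCB, U+F900-U+FA6A.
KANJI = re.compile(r'[\u3400-\u4DB5\u4E00-\u9FCB\uF900-\uFA6A]')

def keep_kanji(text):
    """Return only the kanji found in text, in their original order."""
    return ''.join(KANJI.findall(text))

# Hypothetical sample mixing Latin letters, hiragana, katakana and kanji.
sample = '私はWeです。あの人 That person 漢字カタカナ'
print(keep_kanji(sample))  # prints 私人漢字
```

Note that these ranges are fixed code point intervals, so kanji added to Unicode after U+9FCB and the iteration mark 々 (U+3005) are not matched. Widen the character class if your text needs them.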