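I have a string that contains both English and Arabic. I am getting it from an API, so I cannot set an indicator in it to mark where each language starts.
What I want is to split the Arabic and the English into two separate parts. Here is a sample string: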
“ ??????????????????????????????????????????????????????????????? ?????????????????????????????????????????????????????????????? ???? ????????????? ?????? ????????? ???? ????????? ??????? Bismika rabbee wadaAAtu janbee wabika arfaAAuh, fa-in amsakta nafsee farhamha, wa-in asaltaha fahfathha bima tahfathu bihi AAibadakas-saliheen. In Your name my Lord, I lie down and in Your name I rise, so if You should take my soul then have mercy upon it, and if You should return my soul then protect it in the manner You do so with Your righteous servants.”
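I can't figure out how to split it into two parts, one Arabic and one English.
What I want is:
The string can contain either language, so my problem is to extract only the English or only the Arabic and show each in its respective field.
How can I achieve this?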
You can use a Natural Language Tagger, which would work even if both scripts are intermingled:
import NaturalLanguage

let str = "¿como? ????? start ??? middle ?????? ??????? ????? ????? end. ?????. "
let tagger = NLTagger(tagSchemes: [.script])
tagger.string = str

var index = str.startIndex
var dictionary = [String: String]()   // script code -> text collected for that script
var lastScript = "other"              // bucket for text seen before any script is identified

while index < str.endIndex {
    // Tag the token at `index`; the result is (optional tag, range of the token).
    let (tag, range) = tagger.tag(at: index, unit: .word, scheme: .script)
    if let script = tag?.rawValue {
        // A script was identified: move any pending untagged text into its bucket.
        lastScript = script
        dictionary[script, default: ""] += dictionary["other", default: ""] + str[range]
        dictionary.removeValue(forKey: "other")
    } else {
        // No tag (for example whitespace or punctuation): attach it to the previous script.
        dictionary[lastScript, default: ""] += str[range]
    }
    index = range.upperBound
}

print(dictionary)
To print each script on its own line instead:
for entry in dictionary {
    print(entry.key, ":", entry.value)
}
which yields:
Hant : ?????.
Cyrl : ?????? ??????? ?????
Arab : ????? ??? ?????
Latn : ¿como? start middle end.
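To fill the two fields from the question, it should be enough to look up the "Arab" and "Latn" keys in that dictionary. The following is a minimal sketch under stated assumptions: `tagged` is a hypothetical stand-in for the dictionary built by the loop above, `field(for:in:)` is a helper name invented here, and "Latn" means Latin script in general, not English specifically.

```swift
import Foundation

// Hypothetical stand-in for the dictionary built by the tagger loop:
// keys are the script codes shown in the output, values are the collected text.
let tagged: [String: String] = [
    "Arab": "\u{0633}\u{0644}\u{0627}\u{0645} ",   // Arabic word "salam" plus a trailing space
    "Latn": " start middle end. ",
]

/// Returns the trimmed text collected for one script code, or "" if that script is absent.
func field(for scriptCode: String, in tagged: [String: String]) -> String {
    tagged[scriptCode, default: ""].trimmingCharacters(in: .whitespacesAndNewlines)
}

let arabicField = field(for: "Arab", in: tagged)
let englishField = field(for: "Latn", in: tagged)
print(englishField)
```

Using a default of "" means a string that contains only one of the two languages simply leaves the other field empty.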
This is still not perfect, because the tagger assigns each word to whichever script most of its letters belong to. For example, in the string you're working with, the tagger would treat ?????????????.Bismika as a single word. To work around this, we can walk the original string with two indices and check the script of each word individually, where a word is defined as a run of contiguous letters:
import Foundation
import NaturalLanguage

let str = "????????? ?????? ???????? ???????? ?????? ??????????? ?????? ?????????? ??????? ????????????? ?????? ????????????? ????????????? ????? ???????? ???? ????????? ?????????????.Bismika rabbee wadaAAtu janbee wabika arfaAAuh, fa-in amsakta nafsee farhamha, wa-in arsaltaha fahfathha bima tahfathu bihi AAibadakas-saliheen. In Your name my Lord, I lie down and in Your name I rise, so if You should take my soul then have mercy upon it, and if You should return my soul then protect it in the manner You do so with Your righteous servants."
let tagger = NLTagger(tagSchemes: [.script])
var i = str.startIndex
var dictionary = [String: String]()   // script code -> text collected for that script
var lastScript = "glyphs"             // bucket for non-letters seen before the first word

// True if every Unicode scalar in the character belongs to CharacterSet.letters.
func isLetter(_ character: Character) -> Bool {
    CharacterSet.letters.isSuperset(of: CharacterSet(charactersIn: String(character)))
}

while i < str.endIndex {
    // 1. Skip the run of non-letters (spaces, punctuation) starting at i.
    //    Using !isLetter guarantees that every character is consumed by one of the two loops.
    var j = i
    while j < str.endIndex, !isLetter(str[j]) {
        j = str.index(after: j)
    }
    // Attach the skipped run to the script of the preceding word.
    if i != j { dictionary[lastScript, default: ""] += str[i..<j] }
    if j < str.endIndex { i = j } else { break }

    // 2. Collect the run of letters: this is one word.
    while j < str.endIndex, isLetter(str[j]) {
        j = str.index(after: j)
    }

    // 3. Tag the word on its own so that neighbouring text cannot influence the result.
    let tempo = String(str[i..<j])
    tagger.string = tempo
    let res = tagger.tag(at: tempo.startIndex, unit: .word, scheme: .script)
    if let s = res.0?.rawValue {
        lastScript = s
        // Move any leading non-letters into the first script that is identified.
        dictionary[s, default: ""] += dictionary["glyphs", default: ""] + tempo
        dictionary.removeValue(forKey: "glyphs")
    } else {
        dictionary["other", default: ""] += tempo
    }
    i = j
}

print(dictionary)
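If only Arabic and Latin text need to be told apart, a tagger-free variant is also possible. This is not the NLTagger approach above but a plainly different technique: each character is classified by the Unicode block of its scalars. The sketch below is an assumption-laden illustration: `Script`, `script(of:)` and `splitByScript(_:)` are names invented here, the block list is approximate, and text that never meets an Arabic or Latin letter is dropped. As in the loops above, spaces and punctuation are attached to the script of the preceding word.

```swift
// Hypothetical helper types and functions; not part of any framework.
enum Script { case arabic, latin, neutral }

/// Classifies one Unicode scalar by code-point range (approximate block list).
func script(of scalar: Unicode.Scalar) -> Script {
    switch scalar.value {
    case 0x0600...0x06FF,   // Arabic (letters, diacritics, Arabic punctuation)
         0x0750...0x077F,   // Arabic Supplement
         0x08A0...0x08FF,   // Arabic Extended-A
         0xFB50...0xFDFF,   // Arabic Presentation Forms-A
         0xFE70...0xFEFC:   // Arabic Presentation Forms-B (excluding U+FEFF)
        return .arabic
    case 0x0041...0x005A,   // A-Z
         0x0061...0x007A,   // a-z
         0x00C0...0x024F:   // accented Latin letters (range also contains × and ÷)
        return .latin
    default:
        return .neutral     // spaces, digits, punctuation, every other script
    }
}

/// Splits mixed text into its Arabic and Latin parts.
func splitByScript(_ text: String) -> (arabic: String, latin: String) {
    var parts: [Script: String] = [:]
    var current = Script.neutral
    for character in text {
        // A character takes the script of its first non-neutral scalar, if it has one.
        let kind = character.unicodeScalars.lazy
            .map(script(of:))
            .first(where: { $0 != .neutral })
        if let kind = kind {
            // Leading neutral text goes to the first script that shows up.
            if current == .neutral, let leading = parts.removeValue(forKey: .neutral) {
                parts[kind] = leading
            }
            current = kind
        }
        // Neutral characters stay with the script of the preceding word.
        parts[current, default: ""].append(character)
    }
    return (parts[.arabic, default: ""], parts[.latin, default: ""])
}

// Usage: the Arabic words are written as escapes so the source survives re-encoding.
let sample = "\u{0645}\u{0631}\u{062D}\u{0628}\u{0627} Hello, world! \u{0633}\u{0644}\u{0627}\u{0645}"
let result = splitByScript(sample)
print(result.latin)
print(result.arabic)
```

The trade-off is that this only knows the two scripts listed in `script(of:)`, whereas the tagger recognises many scripts without a hand-written table.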