Split Urdu words on a non-existent space

use*_*181 5 c# regex urdu

I have an Urdu word "لاعلم" and more similar words. How can I split the word so that I get "لا" and "علم" separately in an array? I have tried converting the words to Unicode characters, but I can't detect the break between "لا" and "علم".


English words can be easily separated based on spaces, but I am stuck on separating Urdu words, where there are no spaces.


Sho*_*aib 5

There is no space because it's a single word meaning "ignorant." As a matter of fact, "لا" and "علم" separated wouldn't mean anything.


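Inserting spaces in Urdu (and Arabic script in general) is a practical need: it divides words where the font would otherwise automatically join a character to its neighbors. The only way to undo that joining is to insert a superfluous space between the characters. Technically, the zero-width non-joiner (U+200C) is meant for exactly this purpose, but humans are slow learners and a space is easy to insert.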


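Some characters do not join to the letter that follows them. For example, "ا" never joins to any following character, but it can join to a preceding character such as "ل" to form the ligature "لا". You can use this list of characters (same rules for Arabic) and write a custom tokenizer that ends a word after "Right Joining" characters, ZWNJ or a space.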
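A minimal sketch of such a tokenizer follows, written in Python for brevity; the same loop should port to C# almost line for line. The function name `split_joining_segments` and the `RIGHT_JOINING` set are made up for illustration, and the set is only a partial, hand-picked selection of right-joining letters common in Urdu. A real implementation would load the joining types from Unicode's ArabicShaping.txt. Combining marks are kept attached to the letter they follow, which is an extra assumption not stated above.

```python
import unicodedata

# Partial, hand-picked set of right-joining letters (an assumption for
# illustration; the authoritative data is Unicode's ArabicShaping.txt):
# alef, alef madda, alef hamza above, waw hamza, alef hamza below,
# teh marbuta, dal, thal, reh, zain, waw, ddal, rreh, jeh,
# yeh barree, yeh barree with hamza.
RIGHT_JOINING = set(
    "\u0627\u0622\u0623\u0624\u0625\u0629\u062f\u0630"
    "\u0631\u0632\u0648\u0688\u0691\u0698\u06d2\u06d3"
)
ZWNJ = "\u200c"  # zero-width non-joiner


def split_joining_segments(text):
    """Split text after right-joining letters, at ZWNJ, and at whitespace."""
    segments = []
    current = []
    break_pending = False  # True once the last letter was right-joining

    for ch in text:
        is_mark = unicodedata.category(ch).startswith("M")

        # A right-joining letter ends the segment, but any diacritics that
        # follow it still belong to that letter, so wait for a non-mark.
        if break_pending and not is_mark:
            segments.append("".join(current))
            current = []
            break_pending = False

        # Explicit separators are dropped from the output.
        if ch == ZWNJ or ch.isspace():
            if current:
                segments.append("".join(current))
                current = []
            continue

        current.append(ch)
        if ch in RIGHT_JOINING:
            break_pending = True

    if current:
        segments.append("".join(current))
    return segments


# "\u0644\u0627\u0639\u0644\u0645" is the word from the question.
print(split_joining_segments("\u0644\u0627\u0639\u0644\u0645"))
```

For the word in the question this prints `['لا', 'علم']`. Keep in mind that this rule splits after every right-joining letter, so the pieces are joining segments rather than dictionary words: any word with "ا", "د", "ر" or "و" in the middle will also be cut there.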
