Sté*_*las 19 shell-script unicode
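What is the closest thing to a portable way of getting the display width of a string of characters from a shell script (on a terminal at least, that is, one that displays characters with the correct width in the current locale)?
I'm primarily interested in the width of non-control characters, but solutions that take into account control characters such as backspace, carriage return and horizontal tab are also welcome.
In other words, I'm looking for a shell API around the POSIX `wcswidth()` function.
That command should return: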
$ that-command 'ｕｎｉｘ' # 4 fullwidth characters
8
$ that-command 'Stéphane' # 9 characters, one of which zero-width
8
$ that-command 'もで 諤奯ゞ' # 5 double-width Japanese characters and a space
11
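One could use ksh93's `printf '%<n>Ls'`, which takes character width into account when padding to `<n>` columns, or the `col` command (for instance with `printf '++%s\b\b--\n' <character> | col -b`) to try to derive it. There is a Text::CharWidth module for `perl` at least, but are there more direct or more portable approaches?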
In a terminal emulator, one could use the cursor position report to get the positions before and after, e.g., from
...record position
printf '%s' "$string"
...record position
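and find out how wide the characters printed on the terminal are. Since this is an ECMA-48 (as well as VT100) control sequence supported by almost any terminal you are likely to use, it is fairly portable.
For reference:
CSI Ps n  Device Status Report (DSR). ... Ps = 6 -> Report Cursor Position (CPR) [row;column]. Result is CSI r ; c R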
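The before/after measurement described above could be sketched in POSIX shell roughly as follows. This is only an illustration under stated assumptions: the function names (`cpr_column`, `query_cpr`, `string_width`) are made up for this example, the terminal is assumed to be reachable through /dev/tty and to answer `CSI 6 n`, and the text is assumed to fit on the current line without wrapping. The `stty min 0 time 5` setting is used so that a read gives up after half a second instead of blocking forever.

```shell
#!/bin/sh
# cpr_column REPLY: print the column from a reply of the form ESC [ row ; col R.
# (Hypothetical helper name; pure string handling, no terminal needed.)
cpr_column() {
    _r=${1#*\[}                 # drop everything up to and including "["
    _r=${_r%%R*}                # drop "R" and anything after it
    printf '%s\n' "${_r#*;}"    # keep what follows the ";"
}

# query_cpr: send CSI 6 n and print the terminal's raw reply.
# Assumes a real terminal on /dev/tty.
query_cpr() {
    _saved=$(stty -g </dev/tty) || return 1
    # Raw mode, no echo; a read returns empty after 0.5 s without input.
    stty raw -echo min 0 time 5 </dev/tty
    printf '\033[6n' >/dev/tty
    _reply=$(dd bs=1 count=16 2>/dev/null </dev/tty)
    stty "$_saved" </dev/tty
    printf '%s\n' "$_reply"
}

# string_width STRING: number of columns the cursor moved while printing STRING.
# The text stays on the screen; erasing it is left out of this sketch.
string_width() {
    _before=$(cpr_column "$(query_cpr)")
    printf '%s' "$1" >/dev/tty
    _after=$(cpr_column "$(query_cpr)")
    if [ -z "$_before" ] || [ -z "$_after" ]; then
        return 1                # the terminal did not answer
    fi
    printf '%s\n' "$(( $_after - $_before ))"
}
```

Run interactively, `string_width "$string"` prints the measured width, or fails if the terminal sends no report. Only `cpr_column` can be exercised without a terminal, for instance on a canned reply produced with `printf '\033[12;34R'`.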
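Ultimately, the terminal emulator determines the printable width, for reasons such as these:
- `wcswidth` alone does not tell how combining characters are handled; POSIX does not mention this aspect in the description of that function.
- Some characters are "ambiguous-width" in Unicode, which undermines the portability of an application that uses `wcswidth` alone (see for example Chapter 2, Setting Up Cygwin). `xterm`, for instance, can be told to select double-width characters for configurations that need it.

Shell APIs that call `wcswidth` are supported to varying degrees:
This module provides functions similar to wcwidth(3) and wcswidth(3) in the C language.
These are more or less direct: `wcswidth` is simulated in the case of Perl, while Ruby and Python call the C runtime. You could even use curses, e.g., from Python (which would handle combining characters):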
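- use the `filter` function (for single lines)
- write the text with `addstr`, checking for an error (in case it is too long), and then check the ending position
- call `endwin` (which should not do a `refresh`)

Using curses for output (rather than feeding the information back to a script or calling `tput` directly) would clear the entire line (`filter` does limit it to one line).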
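In my `.profile`, I call a script to determine the width of a string on the terminal. I use it when logging in on the console of a machine where I don't trust the system-set `LC_CTYPE`, or when I log in remotely and can't trust `LC_CTYPE` to match the remote side. My script queries the terminal rather than calling any library, because that is the whole point in my use case: determining the terminal's encoding.
This is fragile in several ways:
- It displays the text and then erases it, so it modifies the display.
- It blocks if the terminal does not reply. (I ran into this with a Windows Emacs accessing remote files from a Linux machine with the `plink` method, and I solved the problem by using the `plinkx` method instead.)

This may or may not match your use case.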
#! /bin/sh
if [ z"$ZSH_VERSION" = z ]; then :; else
  emulate sh 2>/dev/null
fi
set -e
help_and_exit () {
  cat <<EOF
Usage: $0 {-NUMBER|TEXT}
Find out the width of TEXT on the terminal.

LIMITATION: this program has been designed to work in an xterm. Only
xterm and sufficiently compatible terminals will work. If you think
this program may be blocked waiting for input from the terminal,
try entering the characters "0n0n" (digit 0, lowercase letter n,
repeat).

Display TEXT and erase it. Find out the position of the cursor before
and after displaying TEXT so as to compute the width of TEXT. The width
is returned as the exit code of the program. A value of 100 is returned if
the text is wider than 100 columns.

TEXT may contain backslash-escapes: \\0DDD represents the byte whose numeric
value is DDD in octal. Use '\\\\' to include a single backslash character.

You may use -NUMBER instead of TEXT (if TEXT begins with a dash, use
"-- TEXT"). This selects one of the built-in texts that are designed
to discriminate between common encodings. The following table lists
supported values of NUMBER (leftmost column) and the widths of the
sample text in several encodings.

  1  ASCII=0 UTF-8=2 latinN=3 8bits=4
EOF
  exit
}
builtin_text () {
  case $1 in
    -*[!0-9]*)
      echo 1>&2 "$0: bad number: $1"
      exit 119;;
    -1) # UTF8: {\'E\'e}; latin1: {\~A\~A\copyright}; ASCII: {}
      text='\0303\0211\0303\0251';;
    *)
      echo 1>&2 "$0: there is no text number $1. Stop."
      exit 118;;
  esac
}
text=
if [ $# -eq 0 ]; then
  help_and_exit 1>&2
fi
case "$1" in
  --) shift;;
  -h|--help) help_and_exit;;
  -[0-9]) builtin_text "$1";;
  -*)
    echo 1>&2 "$0: unknown option: $1"
    exit 119;;
esac
if [ z"$text" = z ]; then
  text="$1"
fi
printf "" # test that it is there (abort on very old systems)
csi='\033['
dsr_cpr="${csi}6n" # Device Status Report --- Report Cursor Position
dsr_ok="${csi}5n" # Device Status Report --- Status Report
stty_save=`stty -g`
if [ z"$stty_save" = z ]; then
  echo 1>&2 "$0: \`stty -g' failed ($?)."
  exit 3
fi
initial_x=
final_x=
delta_x=
cleanup () {
  set +e
  # Restore terminal settings
  stty "$stty_save"
  # Restore cursor position (unless something unexpected happened)
  if [ z"$2" = z ]; then
    if [ z"$initial_report" = z ]; then :; else
      x=`expr "${initial_report}" : "\\(.*\\)0"`
      printf "%b" "${csi}${x}H"
    fi
  fi
  if [ z"$1" = z ]; then
    # cleanup was called explicitly, so don't exit.
    # We use `trap : 0' rather than `trap - 0' because the latter doesn't
    # work in older Bourne shells.
    trap : 0
    return
  fi
  exit $1
}
trap 'cleanup 120 no' 0
trap 'cleanup 129' 1
trap 'cleanup 130' 2
trap 'cleanup 131' 3
trap 'cleanup 143' 15
stty eol 0 eof n -echo
printf "%b" "$dsr_cpr$dsr_ok"
initial_report=`tr -dc \;0123456789`
# Get the initial cursor position. Time out if the terminal does not reply
# within 1 second. The trick of calling tr and sleep in a pipeline to put
# them in a process group, and using "kill 0" to kill the whole process
# group, was suggested by Stephane Gimenez at
# https://unix.stackexchange.com/questions/10698/timing-out-in-a-shell-script
#trap : 14
#set +e
#initial_report=`sh -c 'ps -t $(tty) -o pid,ppid,pgid,command >/tmp/p;
# { tr -dc \;0123456789 >&3; kill -14 0; } |
# { sleep 1; kill -14 0; }' 3>&1`
#set -e
#initial_report=`{ sleep 1; kill 0; } |
# { tr -dc \;0123456789 </dev/tty; kill 0; }`
if [ z"$initial_report" = z"" ]; then
  # We couldn't read the initial cursor position, so abort.
  cleanup 120
fi
# Write some text and get the final cursor position.
printf "%b%b" "$text" "$dsr_cpr$dsr_ok"
final_report=`tr -dc \;0123456789`
initial_x=`expr "$initial_report" : "[0-9][0-9]*;\\([0-9][0-9]*\\)0" || test $? -eq 1`
final_x=`expr "$final_report" : "[0-9][0-9]*;\\([0-9][0-9]*\\)0" || test $? -eq 1`
delta_x=`expr "$final_x" - "$initial_x" || test $? -eq 1`
cleanup
# Zsh has function-local EXIT traps, even in sh emulation mode. This
# is a long-standing bug.
trap : 0
if [ $delta_x -gt 100 ]; then
  delta_x=100
fi
exit $delta_x
The script returns the width in its exit status, clipped to 100. Sample usage:
widthof -1
case $? in
0) export LC_CTYPE=C;; # 7-bit charset
2) locale_search .utf8 .UTF-8;; # utf8
3) locale_search .iso88591 .ISO8859-1 .latin1 '';; # 8-bit with nonprintable 128-159, we assume latin1
4) locale_search .iso88591 .ISO8859-1 .latin1 '';; # some full 8-bit charset, we assume latin1
*) export LC_CTYPE=C;; # weird charset
esac
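The `expr` patterns in the script end in a literal `0`, which can look odd at first. As far as I can tell, this is because the script sends two requests back to back: the cursor position request (answered with `ESC [ row ; col R`) and the status request (answered with `ESC [ 0 n`). With `stty eol 0 eof n`, the final `n` acts as end-of-file, so `tr` sees everything up to and including that last `0`. The sketch below replays this on canned replies instead of a real terminal; the helper names `digits_only` and `column_of` and the sample positions are invented for the illustration.

```shell
#!/bin/sh
# Simulated terminal replies (no terminal is queried here):
#   ESC [ 24 ; 10 R   cursor position report: row 24, column 10
#   ESC [ 0           status report, minus the "n" consumed as end-of-file
reply_before=$(printf '\033[24;10R\033[0')
reply_after=$(printf '\033[24;18R\033[0')

# Keep only digits and semicolons, as the script does with tr -dc.
digits_only() { printf '%s' "$1" | tr -dc ';0123456789'; }

# Extract the column; the final "0" in the pattern eats the digit that
# comes from the status report.
column_of() { expr "$1" : '[0-9][0-9]*;\([0-9][0-9]*\)0'; }

before=$(column_of "$(digits_only "$reply_before")")
after=$(column_of "$(digits_only "$reply_after")")
width=$(( $after - $before ))
printf '%s\n' "$width"
```

Here the filtered reports are `24;100` and `24;180`, the extracted columns are 10 and 18, and the computed width is 8.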
Eric Pruitt wrote an impressive implementation of `wcwidth()` and `wcswidth()` in awk, available at wcwidth.awk. It mainly provides 4 functions

wcscolumns(), wcstruncate(), wcwidth(), wcswidth()

where `wcscolumns()` also tolerates non-printable characters.

$ cat wcscolumns.awk
{ printf "%d\n", wcscolumns($0) }
$ awk -f wcwidth.awk -f wcscolumns.awk <<< 'ｕｎｉｘ'
8
$ awk -f wcwidth.awk -f wcscolumns.awk <<< 'Stéphane'
8
$ awk -f wcwidth.awk -f wcscolumns.awk <<< 'もで 諤奯ゞ'
11
$ awk -f wcwidth.awk -f wcscolumns.awk <<< $'My sign is\t鼠鼠'
14

I opened an issue asking about the handling of TABs, since `wcscolumns($'My sign is\t鼠鼠')` should be greater than 14. Update: Eric added the function `wcsexpand()` to expand TABs to spaces.