I'm trying to parse a large CSV file in J, and this is the line-splitting routine I came up with:
splitlines =: 3 : 0
NB. y is the input string
nl_positions =. (y = (10 { a.)) NB. 1 if the character in that position is a newline, 0 otherwise
nl_idx =. (# i.@#) nl_positions NB. A list of newline indexes in the input string
prev_idx =. (# nl_idx) {. 0 , nl_idx NB. The list above, shifted one position to the right, with 0 as the first element
result =. ''
for_i. nl_idx do. NB. For each newline
to_drop =. i_index { prev_idx NB. The number of characters from the start of the string to skip
to_take =. i - to_drop NB. The number of characters in the current line
result =. result , < (to_take {. to_drop }. y) NB. Take the current line, box it and add to the result
end.
)
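It is really slow, though. The performance monitor shows that line 8 takes the most time, probably because of all the memory allocation involved in dropping and taking elements and extending the result list: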
 Time (seconds)
┌────────┬────────┬─────┬─────────────────────────────────────────┐
│all     │here    │rep  │splitlines                               │
├────────┼────────┼─────┼─────────────────────────────────────────┤
│0.000011│0.000011│    1│monad                                    │
│0.003776│0.003776│    1│[1] nl_positions=.(y=(10{a.))            │
│0.012429│0.012429│    1│[2] nl_idx=.(#i.@#)nl_positions          │
│0.000144│0.000144│    1│[3] prev_idx=.(#nl_idx){.0,nl_idx        │
│0.000002│0.000002│    1│[4] result=.''                           │
│0.027566│0.027566│    1│[5] for_i. nl_idx do.                    │
│0.940466│0.940466│20641│[6] to_drop=.i_index{prev_idx            │
│0.011238│0.011238│20641│[7] to_take=.i-to_drop                   │
│4.310495│4.310495│20641│[8] result=.result,<(to_take{.to_drop}.y)│
│0.006926│0.006926│20641│[9] end.                                 │
│5.313052│5.313052│    1│total monad                              │
└────────┴────────┴─────┴─────────────────────────────────────────┘
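Is there a better way? I'm looking for a way to:
replace the whole for loop with a single array instruction

If I understand correctly, you currently just want to split a string containing multiple lines into separate lines. (I imagine splitting the lines into fields will be the next step later?)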
Cut (;.) is the key primitive for most of what you want to do. For example:
<;._2 InputString NB. box each segment terminated by the last character in the string
<;._1 InputString NB. box each segment of InputString starting with the first character in the string
cut;._2 InputString NB. split into segments terminated by the last character, then box the space-separated words within each segment
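As a minimal sketch of how the loop in the question could collapse into a single cut expression: the sample string `text` and the one-line redefinition of `splitlines` are illustrative assumptions, not the asker's data, and CRLF line endings are not handled here.

```j
NB. Hypothetical sample input: three LF-terminated lines (LF is the stdlib newline noun)
text =: 'id,name', LF, '1,foo', LF, '2,bar', LF

NB. Box every segment terminated by the final character (LF here); no explicit loop
lines =: <;._2 text
# lines   NB. 3

NB. Assumed variant: append an LF first when the input does not already end with one
splitlines =: 3 : '<;._2 y , (LF ~: {: y) # LF'
```

Because cut builds all the boxes in one primitive call, it should avoid the repeated `result , <...` appends that dominate the profile above.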
Other related resources you may find useful: splitstring, freads, and the tables/dsv and tables/csv addons. freads and splitstring are both available in the standard library (after J6).
'b' freads 'myfile.txt' NB. returns contents of myfile.txt boxed by the last character (equivalent to <;._2 freads 'myfile.txt')
'","' splitstring InputString NB. boxed sub-strings of input string delimited by left argument
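For the later step of splitting a line into fields, the same primitive can be used with the delimiter prefixed to the line. This is only a sketch on a made-up `line`; it does not handle quoted fields that contain commas, which is one reason to prefer the addons for real CSV data.

```j
NB. Hypothetical single line of comma-separated data
line =: '1,foo,bar'

NB. Prefix the delimiter so every field starts with one, then cut on the first character
fields =: <;._1 ',' , line
# fields   NB. 3
```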
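The tables/dsv and tables/csv addons can be installed using the Package Manager. Once installed, they can be used to split a file into lines and the lines into fields, as follows: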
require 'tables/csv'
readcsv 'myfile.csv'
',' readdsv 'myfile.txt'
TAB readdsv 'myfile.txt'