How to efficiently split a string into lines in J?

Mih*_*hai 3 j

I'm trying to parse a large CSV file in J, and this is the line-splitting routine I came up with:

splitlines =: 3 : 0
                                     NB. y is the input string
nl_positions =. (y = (10 { a.))      NB. 1 if the character in that position is a newline, 0 otherwise
nl_idx =. (# i.@#) nl_positions      NB. A list of newline indexes in the input string
prev_idx =. (# nl_idx) {. 0 , nl_idx NB. The list above, shifted one position to the right, with 0 as the first element
result =. ''
for_i. nl_idx do.                                  NB. For each newline
    to_drop =. i_index { prev_idx                  NB. The number of characters from the start of the string to skip
    to_take =. i - to_drop                         NB. The number of characters in the current line
    result =. result , < (to_take {. to_drop }. y) NB. Take the current line, box it and add to the result
end.
)

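However, this is really slow. The performance monitor shows that line 8 takes the most time, probably because of all the memory allocation involved in dropping and taking elements and in extending the result list: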

 Time (seconds)
┌────────┬────────┬─────┬─────────────────────────────────────────┐
│all     │here    │rep  │splitlines                               │
├────────┼────────┼─────┼─────────────────────────────────────────┤
│0.000011│0.000011│    1│monad                                    │
│0.003776│0.003776│    1│[1] nl_positions=.(y=(10{a.))            │
│0.012429│0.012429│    1│[2] nl_idx=.(#i.@#)nl_positions          │
│0.000144│0.000144│    1│[3] prev_idx =.(#nl_idx){.0,nl_idx       │
│0.000002│0.000002│    1│[4] result=.''                           │
│0.027566│0.027566│    1│[5] for_i. nl_idx do.                    │
│0.940466│0.940466│20641│[6] to_drop=.i_index{prev_idx            │
│0.011238│0.011238│20641│[7] to_take=.i-to_drop                   │
│4.310495│4.310495│20641│[8] result=.result,<(to_take{.to_drop}.y)│
│0.006926│0.006926│20641│[9] end.                                 │
│5.313052│5.313052│    1│total monad                              │
└────────┴────────┴─────┴─────────────────────────────────────────┘

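Is there a better way? I'm looking for a way to:

  1. Slice a list without allocating memory
  2. Perhaps replace the whole for loop with a single array instruction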

Tik*_*anz 5

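If I understand correctly, for now you just want to split a string containing multiple lines into separate lines. (I assume splitting each line into fields will be the next step?)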

Cut (;.) is the key primitive that does most of the work for what you want. For example:

   <;._2 InputString   NB. box each segment terminated by the last character in the string
   <;._1 InputString   NB. box each segment of InputString starting with the first character in the string
   cut;._2 InputString NB. apply cut to each segment: box the parts of each line separated by 1 or more spaces
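As a rough illustration of how cut could replace the explicit loop from the question, here is a minimal sketch. The name splitlines2 and the sample string are made up for this example and are not part of the original answer. Because <;._2 treats the last character of its argument as the delimiter, the sketch first appends LF when the input does not already end with one.

```j
NB. Hypothetical helper, not part of the original answer.
splitlines2 =: 3 : 0
  if. LF ~: {: y do. y =. y , LF end.   NB. ensure the string ends with LF
  <;._2 y                               NB. box each LF-terminated segment
)

NB. Made-up sample input; note that it has no trailing LF
lines =: splitlines2 'a,b,c', LF, 'd,e', LF, 'f'
# lines                                 NB. number of lines found
```

The whole split happens inside a single primitive, so there is no per-line append to a growing result. If the file uses CRLF line endings, each line would still end with CR under this sketch, and that would need to be stripped separately.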

Other related resources you may find useful: splitstring, freads, and the tables/dsv and tables/csv addons. freads and splitstring are both available in the standard library (after J6).

   'b' freads 'myfile.txt'  NB. returns contents of myfile.txt boxed by the last character (equivalent to <;._2 freads 'myfile.txt')
   '","' splitstring InputString  NB. boxed sub-strings of input string delimited by left argument
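To show how the two steps might fit together, here is a small sketch that first boxes the lines with cut and then splits each line into fields with splitstring. The sample data and the names lines and fields are assumptions made for illustration. It also assumes a simple file with no quoted fields, since a plain split on commas would break fields that contain commas.

```j
NB. Made-up sample data: two LF-terminated lines, no quoted fields
InputString =: 'ab,cd,ef', LF, 'gh,ij', LF

lines  =: <;._2 InputString            NB. one box per LF-terminated line
fields =: ','&splitstring each lines   NB. split every line on commas

> {. fields                            NB. the boxed fields of the first line
```

Each element of fields is itself a boxed list, so lines with different numbers of fields are kept as they are rather than padded into a table.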

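The tables/dsv and tables/csv addons can be installed using the Package Manager. Once installed, they can be used to split both the lines and the fields within each line, like this: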

   require 'tables/csv'
   readcsv 'myfile.csv'
   ',' readdsv 'myfile.txt'
   TAB readdsv 'myfile.txt'