Parsing a table from JSON whose variable names contain Unicode characters, using SAS Base

San*_*nik 14 regex parsing json sas

I have a problem parsing JSON whose variable names contain Unicode characters. Here is an example of the JSON:

 {  
   "SASJSONExport":"1.0",
   "SASTableData+TEST":[  
      {  
         "\u041f\u0435\u0440\u0435\u043c\u0435\u043d\u043d\u0430\u044f":2,
         "\u0421\u0440\u0435\u0434\u043d\u0435\u0435":4,
         "\u0421\u0442\u0440\u043e\u043a\u0430":"\u0427\u0442\u043e\u002d\u0442\u043e\u0031"
      },
      {  
         "\u041f\u0435\u0440\u0435\u043c\u0435\u043d\u043d\u0430\u044f":2,
         "\u0421\u0440\u0435\u0434\u043d\u0435\u0435":2,
         "\u0421\u0442\u0440\u043e\u043a\u0430":"\u0427\u0442\u043e\u002d\u0442\u043e\u0032"
      },
      {  
         "\u041f\u0435\u0440\u0435\u043c\u0435\u043d\u043d\u0430\u044f":1,
         "\u0421\u0440\u0435\u0434\u043d\u0435\u0435":42,
         "\u0421\u0442\u0440\u043e\u043a\u0430":"\u0427\u0442\u043e\u002d\u0442\u043e\u0033"
      }
   ]
}

To parse the table from the JSON, I use the SAS JSON libname engine:

libname jsonfl JSON fileref=injson ;

The code above decodes the characters in the cells, but the variable names look like missing values (runs of underscores):

+--------------+---------------------------+------------+---------+---------+
| ordinal_root | ordinal_SASTableData_TEST | __________ | _______ | ______  |
+--------------+---------------------------+------------+---------+---------+
|            1 |                         1 |          2 |       4 | Что-то1 |
|            1 |                         2 |          2 |       2 | Что-то2 |
|            1 |                         3 |          1 |      42 | Что-то3 |
+--------------+---------------------------+------------+---------+---------+

The header should look like this:

+--------------+---------------------------+------------+---------+---------+
| ordinal_root | ordinal_SASTableData_TEST | Переменная | Среднее | Строка  |
+--------------+---------------------------+------------+---------+---------+

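So I decided to replace the Unicode-escaped variable names with names like DIM_N_. To do that, I have to find every string that matches the following regular expression: /([\s\w\d\\]+)\"\:/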
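To see what that pattern actually captures, here is a minimal sketch in Python (not SAS), run on two lines from the example above. The helper name `rename_keys` and the `DIM_1_`, `DIM_2_` numbering are assumptions made for illustration; the question only says the replacement names should look like DIM_N_.

```python
import json
import re

# The pattern from the question, written as a Python raw string.
KEY_RE = re.compile(r'([\s\w\d\\]+)":')

def rename_keys(lines):
    """Map each escaped key to (decoded name, hypothetical DIM_N_ name)."""
    mapping = {}
    for line in lines:
        m = KEY_RE.search(line)
        if m is None:
            continue
        raw = m.group(1)
        if raw not in mapping:
            # json.loads turns the \uXXXX escapes into real characters.
            decoded = json.loads('"' + raw + '"')
            mapping[raw] = (decoded, "DIM_%d_" % (len(mapping) + 1))
    return mapping

sample_lines = [
    r'"\u0421\u0440\u0435\u0434\u043d\u0435\u0435":4,',
    r'"\u0421\u0442\u0440\u043e\u043a\u0430":"\u0427\u0442\u043e\u002d\u0442\u043e\u0031"',
]

mapping = rename_keys(sample_lines)
for raw, (decoded, new_name) in mapping.items():
    print(raw, "->", new_name)
```

Because the pattern only looks for a run of characters followed by `":`, it would also fire on a string value that happens to contain that sequence, so a real JSON parser is the safer choice wherever one is available.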

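But to extract those strings from the JSON, I need to use the characters '{', '}', '[', ']' and ',' as delimiters. If I set those characters as the delimiter, they are dropped and I can no longer reassemble the JSON. So I decided to insert a ~ before each of those characters and use ~ as the delimiter instead.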

data delim;
    infile injson lrecl=1073741823 nopad;
    file  delim;
    input char1 $char1. @@;
        if char1 in ('{','}','[',']',',') then
            put '7E'x;
        put char1 $CHAR1. @@;
run;

The result is an invalid JSON file:

~
{"SASJSONExport":"1.0"~
,"SASTableData+TEST":~
[  ~
{"\u0056\u0061\u0072":2~
,"\u006d\u0065\u0061\u006e":4~
,"\u004e\u0061\u006d\u0065":"\u0073\u006d\u0074\u0068\u0031"~
}~
,  ~
{"\u0056\u0061\u0072":2~
,"\u006d\u0065\u0061\u006e":2~
,"\u004e\u0061\u006d\u0065":"\u0073\u006d\u0074\u0068\u0032"~
}~
,  ~
{"\u0056\u0061\u0072":1~
,"\u006d\u0065\u0061\u006e":42~
,"\u004e\u0061\u006d\u0065":"\u0073\u006d\u0074\u0068\u0033"~
}  ~
]~
}   
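The marker idea can be checked outside SAS. The following Python sketch (an illustration, not the SAS code above) puts `~` plus a line break in front of every structural character, splits on that marker, and then shows that removing the marker restores the original text. The function names `mark` and `unmark` and the shortened sample string are assumptions for this example, and the round trip only holds if the input never contains `~` followed by a line break on its own.

```python
import json
import re

# The structural characters the question uses as split points.
DELIMS = re.compile(r'([{}\[\],])')

def mark(text):
    """Put '~' plus a line break in front of every structural character."""
    return DELIMS.sub(lambda m: "~\n" + m.group(1), text)

def unmark(marked):
    """Undo mark() by dropping every '~' + line break pair."""
    return marked.replace("~\n", "")

# A shortened, single-line version of the escaped JSON shown above.
compact = (r'{"SASJSONExport":"1.0","SASTableData+TEST":'
           r'[{"\u0056\u0061\u0072":2,"\u006d\u0065\u0061\u006e":4}]}')

marked = mark(compact)
chunks = marked.split("~\n")   # each chunk still starts with its delimiter
restored = unmark(marked)

print(marked)
print(json.loads(restored))
```

Each chunk keeps its own delimiter as the first character, so nothing is lost by splitting on `~`.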

In the next step I parse the JSON using ~ as the delimiter:

data transfer;
length column $2000;
retain r;
    infile delim  delimiter='7E'x nopad;
    input char1 : $4000. @@;
            r = prxparse('/([\s\w\d\\]+)\"\:/');
            pos = prxmatch(r,char1);
            column = prxposn(r,1,char1);
        n= _n_;
run;

It works... but I feel this approach is too crude, and it has limitations.

UPD1
Setting the options

options vAlidfmtname=long VALIDMEMNAME=extend VALIDVARNAME=any;

returns:

+--------------+---------------------------+----------------------------+---------+--------------+
| ordinal_root | ordinal_SASTableData_TEST |         __________         | _______ |    ______    |
+--------------+---------------------------+----------------------------+---------+--------------+
|            1 |                         1 | ????2 ????? = ???? - ????? |       4 | ???-??1 ,,,, |
|            1 |                         2 | ????2 ????? = ???? - ????? |       2 | ???-??2      |
|            1 |                         3 | ????2 ????? = ???? - ????? |    2017 | ???-??3      |
+--------------+---------------------------+----------------------------+---------+--------------+

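So my questions are:

  1. Can I decode the whole file without an infile statement?
  2. Can I use infile delimiter with some option that keeps the delimiters instead of dropping them?

Constructive criticism is welcome.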

And*_*gov 2

UPD
I found a solution that uses a regular expression instead of editing the JSON map file by hand.

libname _all_ clear;
filename _all_ clear;
filename _PDFOUT temp;
filename _GSFNAME temp;
proc datasets lib=work kill memtype=data nolist; quit;
filename jsf '~/sasuser.v94/.json' encoding='utf-8';
data _null_;
  file jsf;
  length js varchar(*);
  retain js;
  input;
  js=unicode(_infile_);
  put js;
  datalines;
{
  "SASJSONExport":"1.0",
  "SASTableData+TEST":[
    {
      "\u041f\u0435\u0440\u0435\u043c\u0435\u043d\u043d\u0430\u044f":2,
      "\u0421\u0440\u0435\u0434\u043d\u0435\u0435":4,
      "\u0421\u0442\u0440\u043e\u043a\u0430":"\u0427\u0442\u043e\u002d\u0442\u043e\u0031"
    },
    {
      "\u041f\u0435\u0440\u0435\u043c\u0435\u043d\u043d\u0430\u044f":2,
      "\u0421\u0440\u0435\u0434\u043d\u0435\u0435":2,
      "\u0421\u0442\u0440\u043e\u043a\u0430":"\u0427\u0442\u043e\u002d\u0442\u043e\u0032"
    },
    {
      "\u041f\u0435\u0440\u0435\u043c\u0435\u043d\u043d\u0430\u044f":1,
      "\u0421\u0440\u0435\u0434\u043d\u0435\u0435":42,
      "\u0421\u0442\u0440\u043e\u043a\u0430":"\u0427\u0442\u043e\u002d\u0442\u043e\u0033"
    }
  ]
}
;
run;
filename jsm '~/sasuser.v94/.json.map' encoding='utf-8';
libname jsd json fileref=jsf map=jsm automap=replace;
libname jsm json fileref=jsm;
data jsmm;
  merge jsm.datasets jsm.datasets_variables;
  by ordinal_DATASETS;
run;
proc sort data=jsmm; by ordinal_root ordinal_DATASETS; run;
data _null_;
  set work.jsmm end=last;
  if _N_=1 then do;
    length s varchar(*) ds varchar(*);
    retain s ds prx;
    s='{"DATASETS":[';
    ds='';
    prx=prxparse('/[^_]/');
  end;
  if ds=dsname then s=s||',';
  else do;
    ds=dsname;
    if _N_^=1 then s=s||']},';
    s=cats(s,'{"DSNAME":"',ds,'","TABLEPATH":"',tablepath,'","VARIABLES":[');
  end;
  s=cats(s,'{"NAME":"',name,'","TYPE":"',type,'","PATH":"',path,'"');
  if prxmatch(prx,name) > length(name) then
    s=cats(s,',"LABEL":"',scan(path,-1,'/'),'"');
  s=s||'}';
  if last then do;
    s=s||']}]}';
    file jsm;
    put s;
  end;
run;
libname jsd json fileref=jsf map=jsm;
proc print data=jsd.SASTableData_TEST label noobs; run;

First variant of the solution
This is the quick 'n' dirty version.
First, prepare the input data:

libname _all_ clear;
filename _all_ clear;
filename jsf '~/sasuser.v94/.json' encoding='utf-8';
data _null_;
  file jsf;
  length js varchar(*);
  input;
  js=unicode(_infile_);
  put js;
  datalines;
{
  "SASJSONExport":"1.0",
  "SASTableData+TEST": [
    {
      "\u041f\u0435\u0440\u0435\u043c\u0435\u043d\u043d\u0430\u044f":2,
      "\u0421\u0440\u0435\u0434\u043d\u0435\u0435":4,
      "\u0421\u0442\u0440\u043e\u043a\u0430":"\u0427\u0442\u043e\u002d\u0442\u043e\u0031"
    },
    {
      "\u041f\u0435\u0440\u0435\u043c\u0435\u043d\u043d\u0430\u044f":2,
      "\u0421\u0440\u0435\u0434\u043d\u0435\u0435":2,
      "\u0421\u0442\u0440\u043e\u043a\u0430":"\u0427\u0442\u043e\u002d\u0442\u043e\u0032"
    },
    {
      "\u041f\u0435\u0440\u0435\u043c\u0435\u043d\u043d\u0430\u044f":1,
      "\u0421\u0440\u0435\u0434\u043d\u0435\u0435":42,
      "\u0421\u0442\u0440\u043e\u043a\u0430":"\u0427\u0442\u043e\u002d\u0442\u043e\u0033"
    }
  ]
}
;
run;

The output file .json:

{
"SASJSONExport":"1.0",
"SASTableData+TEST": [
{
"Переменная":2,
"Среднее":4,
"Строка":"Что-то1"
},
{
"Переменная":2,
"Среднее":2,
"Строка":"Что-то2"
},
{
"Переменная":1,
"Среднее":42,
"Строка":"Что-то3"
}
]
}

Then create the JSON map file .json.map:

filename jsmf '~/sasuser.v94/.json.map' encoding='utf-8';
libname jsm json fileref=jsf map=jsmf automap=create;

Contents of .json.map:

{
  "DATASETS": [
    {
      "DSNAME": "root",
      "TABLEPATH": "/root",
      "VARIABLES": [
        {
          "NAME": "ordinal_root",
          "TYPE": "ORDINAL",
          "PATH": "/root"
        },
        {
          "NAME": "SASJSONExport",
          "TYPE": "CHARACTER",
          "PATH": "/root/SASJSONExport",
          "CURRENT_LENGTH": 3
        }
      ]
    },
    {
      "DSNAME": "SASTableData_TEST",
      "TABLEPATH": "/root/SASTableData+TEST",
      "VARIABLES": [
        {
          "NAME": "ordinal_root",
          "TYPE": "ORDINAL",
          "PATH": "/root"
        },
        {
          "NAME": "ordinal_SASTableData_TEST",
          "TYPE": "ORDINAL",
          "PATH": "/root/SASTableData+TEST"
        },
        {
          "NAME": "____________________",
          "TYPE": "NUMERIC",
          "PATH": "/root/SASTableData+TEST/Переменная"
        },
        {
          "NAME": "______________",
          "TYPE": "NUMERIC",
          "PATH": "/root/SASTableData+TEST/Среднее"
        },
        {
          "NAME": "____________",
          "TYPE": "CHARACTER",
          "PATH": "/root/SASTableData+TEST/Строка",
          "CURRENT_LENGTH": 12
        }
      ]
    }
  ]
}

Let's edit the file a little: remove the descriptions of the datasets we don't need and add labels:

{
  "DATASETS": [
    {
      "DSNAME": "SASTableData_TEST",
      "TABLEPATH": "/root/SASTableData+TEST",
      "VARIABLES": [
        {
          "NAME": "ordinal_root",
          "TYPE": "ORDINAL",
          "PATH": "/root"
        },
        {
          "NAME": "ordinal_SASTableData_TEST",
          "TYPE": "ORDINAL",
          "PATH": "/root/SASTableData+TEST"
        },
        {
          "NAME": "____________________",
          "TYPE": "NUMERIC",
          "PATH": "/root/SASTableData+TEST/Переменная",
          "LABEL": "Переменная"
        },
        {
          "NAME": "______________",
          "TYPE": "NUMERIC",
          "PATH": "/root/SASTableData+TEST/Среднее",
          "LABEL": "Среднее"
        },
        {
          "NAME": "____________",
          "TYPE": "CHARACTER",
          "PATH": "/root/SASTableData+TEST/Строка",
          "LABEL": "Строка",
          "CURRENT_LENGTH": 12
        }
      ]
    }
  ]
}
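The labeling rule used in this hand edit is simple: when a variable's NAME consists only of underscores, its LABEL is the last segment of its PATH. As a rough illustration of that rule in Python (not SAS, and not part of the original answer), the sketch below applies it to a cut-down map. The function name `add_labels` and the two-variable sample map are assumptions for the example; dropping the unneeded `root` dataset is not shown.

```python
import json

def add_labels(map_doc):
    """Return a copy of a JSON-map dict with LABEL added where NAME is only underscores."""
    out = json.loads(json.dumps(map_doc))  # simple deep copy via a JSON round trip
    for ds in out["DATASETS"]:
        for var in ds["VARIABLES"]:
            name = var["NAME"]
            if name and set(name) == {"_"}:
                # The label is the last path segment, i.e. the original JSON key.
                var["LABEL"] = var["PATH"].rsplit("/", 1)[-1]
    return out

# Cut-down sample map; the Cyrillic key is written with \u escapes
# so that this source file stays plain ASCII.
sample_map = {
    "DATASETS": [{
        "DSNAME": "SASTableData_TEST",
        "TABLEPATH": "/root/SASTableData+TEST",
        "VARIABLES": [
            {"NAME": "ordinal_root", "TYPE": "ORDINAL", "PATH": "/root"},
            {"NAME": "______________", "TYPE": "NUMERIC",
             "PATH": "/root/SASTableData+TEST/\u0421\u0440\u0435\u0434\u043d\u0435\u0435"},
        ],
    }]
}

labeled = add_labels(sample_map)
# json.dumps escapes non-ASCII by default, so this prints on any console.
print(json.dumps(labeled, indent=2))
```

Variables with ordinary names such as `ordinal_root` are left untouched, which matches the edited map shown above.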

Then try again:

libname jsd json fileref=jsf map=jsmf;
proc print data=jsd.SASTableData_TEST label noobs; run;

Result:

+--------------+---------------------------+------------+---------+---------+
| ordinal_root | ordinal_SASTableData_TEST | Переменная | Среднее | Строка  |
+--------------+---------------------------+------------+---------+---------+
|            1 |                         1 |          2 |       4 | Что-то1 |
|            1 |                         2 |          2 |       2 | Что-то2 |
|            1 |                         3 |          1 |      42 | Что-то3 |
+--------------+---------------------------+------------+---------+---------+

All of this was done in SAS University Edition.
