PROC DS2 performance problem

ree*_*cec 5 sas sas-ds2

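I am trying to use PROC DS2 to get a performance gain over a normal DATA step by using its multithreading capability.
fred.testdata is an SPDE dataset containing 5 million observations. My code is below: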

proc ds2; 
   thread home_claims_thread / overwrite = yes;
   /*declare char(10) producttype;
   declare char(12) wrknat_clmtype;
   declare char(7) claimtypedet;
   declare char(1) event_flag;*/
   /*declare date week_ending having format date9.;*/
   method run();
      /*declare char(7) _week_ending;*/
      set fred.testdata;
      if claim = 'X' then claimtypedet= 'ABC';
      else if claim = 'Y' then claimtypedet= 'DEF';
      /*_week_ending = COMPRESS(exposmth,'M');
    week_ending = to_date(substr(_week_ending,1,4) || '-' || substr(_week_ending,5,2) || '-01');*/
   end;
   endthread;

data home_claims / overwrite = yes;
   declare thread home_claims_thread t;  
   method run();
      set from t threads=8;
   end;
enddata;
run;
quit;

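I have not included all of the IF statements, only a few, otherwise it would take up several pages (you should get the idea). As it stands, the code runs much faster than a normal DATA step, but serious performance problems appear when either of the following happens:

  1. I uncomment any of the declare statements
  2. I include any numeric variable in fred.testdata (even without performing any calculation on the numeric variable)

My questions are:

  1. Is there any way to bring numeric variables into fred.testdata without a slowdown that makes DS2 much slower than a normal DATA step? (For this small table of 5 million rows that includes numeric columns, real time is about 1 minute 30 seconds for DS2 versus about 20 seconds for the normal DATA step.) The actual full table is closer to 600 million rows. For example, I would like to be able to do the week_ending conversion without incurring a 5x penalty in run time. DS2 WITHOUT the declare statements and numeric variables takes about 7 seconds to run.
  2. Is there any way to compress the table in DS2 without having to run an additional DATA step to compress it?

Thanks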

Stu*_*ski 4

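There are two methods to try: using proc hpds2 to have SAS handle the parallel execution, or a more manual approach. Note that row order cannot be guaranteed with either of these methods.

Method 1: PROC HPDS2

HPDS2 is a way of performing massively parallel processing of data. In single-machine mode, it runs in parallel on each core and then puts all of the data back together. You only need to make a few slight modifications to your code to run it.

hpds2 has a setup where you declare the input and output tables with the data= and out= options on the proc statement. Your data and set statements will then always use the following syntax: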

    data DS2GTF.out;
        method run();
            set DS2GTF.in;
            <code>;
        end;
    enddata;

Knowing that, we can adapt your code to run on HPDS2:

proc hpds2 data=fred.testdata
           out=home_claims; 

   data DS2GTF.out;
   /*declare char(10) producttype;
   declare char(12) wrknat_clmtype;
   declare char(7) claimtypedet;
   declare char(1) event_flag;*/

   /*declare date week_ending having format date9.;*/
   method run();

      /*declare char(7) _week_ending;*/
      set DS2GTF.in;

      if claim = 'X' then claimtypedet= 'ABC';
      else if claim = 'Y' then claimtypedet= 'DEF';

      /*_week_ending = COMPRESS(exposmth,'M');
    week_ending = to_date(substr(_week_ending,1,4) || '-' || substr(_week_ending,5,2) || '-01');*/

   end;
   enddata;

run;
quit;
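If you want to control how many threads HPDS2 uses, one option is the PERFORMANCE statement. The following is only a minimal sketch: `nthreads=8` is an assumption that mirrors the `threads=8` in the question, and `details` is included just to request a timing breakdown. The method body is left empty apart from the `set` statement, so your IF logic would go where indicated.

```sas
proc hpds2 data=fred.testdata
           out=home_claims;

   /* Assumption: 8 threads, mirroring threads=8 in the question.
      DETAILS asks the procedure to report timing information. */
   performance nthreads=8 details;

   data DS2GTF.out;
      method run();
         set DS2GTF.in;
         /* your IF / ELSE IF logic goes here */
      end;
   enddata;

run;
quit;
```

Timing the same step at a few different thread counts should show whether extra threads actually help for your table.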

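Method 2: Split the data with rsubmit and append

The code below uses rsubmit and direct observation access (firstobs= and obs=) to read the data in chunks, then appends all of the chunks together at the end. This can work especially well if your data is set up for block I/O.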

options sascmd='!sascmd'
        autosignon=yes
        noconnectwait
        noconnectpersist
        ;

%let cpucount = %sysfunc(getoption(cpucount));

%macro parallel_execute(data=, out=, threads=&cpucount);

    /* Get total obs from data */
    %let dsid = %sysfunc(open(&data.));
    %let n    = %sysfunc(attrn(&dsid., nlobs));
    %let rc   = %sysfunc(close(&dsid.));

    /* Run &threads rsubmit sessions */
    %do i = 1 %to &threads;

        /* Determine the records that each worker will read */
        %let firstobs = %sysevalf(&n.-(&n./&threads.)*(&threads.-&i+1)+1, floor);
        %let lastobs  = %sysevalf(&n.-(&n./&threads.)*(&threads.-&i.), floor);

        /* Get this session's work directory */
        %let workdir = %sysfunc(getoption(work));

        /* Send all macro variables to the remote session, and simultaneously start the remote session */
        %syslput _USER_ / remote=worker&i.;

        /* Check for an input libname */
        %if(%scan(&data., 2, .) NE) %then %do;
            %let inlib = %scan(&data., 1, .);
            %let indsn = %scan(&data., 2, .);
        %end;
            %else %do;
                %let inlib = workdir;
                %let indsn = &data.;
            %end;

        /* Check for an output libname */
        %if(%scan(&out., 2, .) NE) %then %do;
            %let outlib = %scan(&out., 1, .);
            %let outdsn = %scan(&out., 2, .);
        %end;
            %else %do;
                %let outlib = workdir;
                %let outdsn = &out.;
            %end;

        /* Work library location of this session to be inherited by the parallel session */
        %let workdir = %sysfunc(getoption(work));

        /* Sign on to a remote session and send over all user-made macro variables */
        %syslput _USER_ / remote=worker&i.;

        /* Run code on remote session &i */
        rsubmit remote=worker&i. inheritlib=(&inlib.);

             libname workdir "&workdir.";

             data workdir._&outdsn._&i.;
                 set &inlib..&indsn.(firstobs=&firstobs. obs=&lastobs.);
/*               <PUT CODE HERE>;*/
             run;
        endrsubmit;

    %end;

    /* Wait for everything to complete */
    waitfor _ALL_;

    /* Append all of the chunks together */
    proc datasets nolist;
        delete &out.;

        %do i = 1 %to &threads.;
            append base=&out.
                   data=_&outdsn._&i.
                   force
            ;
        %end;

/* Optional: remove all temporary data */
/*      delete _&outdsn._:;*/
    quit;

    libname workdir clear;
%mend;

You can test its functionality with the following code:

data pricedata;
    set sashelp.pricedata;
run;

%parallel_execute(data=pricedata, out=test, threads=3);

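If you look at the temporary datasets in your WORK directory, you will see that the macro split the dataset evenly among the 3 parallel processes, and that the chunk sizes add up to the original total.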

_test_1 = 340
_test_2 = 340
_test_3 = 340
TOTAL   = 1020

pricedata = 1020
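To see where those chunk sizes come from without starting any remote sessions, the boundary arithmetic can be pulled out on its own. This is a small sketch: `show_chunks` is a hypothetical helper name, not part of the macro above, and it only reuses the same `firstobs`/`lastobs` formulas and writes the ranges to the log.

```sas
/* Hypothetical helper: print the observation range each worker would read.
   Uses the same firstobs/lastobs formulas as %parallel_execute. */
%macro show_chunks(n=, threads=);
    %local i firstobs lastobs;

    %do i = 1 %to &threads.;
        %let firstobs = %sysevalf(&n.-(&n./&threads.)*(&threads.-&i.+1)+1, floor);
        %let lastobs  = %sysevalf(&n.-(&n./&threads.)*(&threads.-&i.), floor);
        %put NOTE: worker &i. reads observations &firstobs. to &lastobs.;
    %end;
%mend show_chunks;

%show_chunks(n=1020, threads=3);
```

For 1020 observations and 3 workers the ranges are 1 to 340, 341 to 680 and 681 to 1020, which matches the 340-row chunks listed above. When the row count does not divide evenly, the floor rounding makes the chunk sizes differ by a row.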