Since nested views are considered a big no-no, how else should I structure my extremely long query?

Ami*_*rni 5 sql database-administration snowflake-cloud-data-platform

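Background: a web developer who didn't take SQL seriously enough in college, and who now regrets it while working at a financial firm that uses Snowflake as a data warehouse to calculate statistics.

We have 3 source tables that are used for all calculations:

  • Positions: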
create or replace TABLE POS (
    ACCOUNT_NUMBER VARCHAR(15) NOT NULL,
    ACCOUNT_TYPE VARCHAR(30),
    SECURITY_TYPE VARCHAR(30) NOT NULL,
    SYMBOL VARCHAR(30) NOT NULL,
    QUANTITY NUMBER(15,4),
    AMOUNT NUMBER(15,4),
    FILE_DATE DATE NOT NULL,
    primary key (ACCOUNT_NUMBER, SYMBOL, FILE_DATE)
); 
  • Transactions:
create or replace TABLE TRN (
    REP_CODE VARCHAR(10),
    FILE_DATE DATE NOT NULL,
    ACCOUNT_NUMBER VARCHAR(15) NOT NULL,
    CODE VARCHAR(10),
    CANCEL_STATUS_FLAG VARCHAR(1),
    SYMBOL VARCHAR(100),
    SECURITY_CODE VARCHAR(2),
    TRADE_DATE DATE,
    QUANTITY NUMBER(15,4),
    NET_AMOUNT NUMBER(15,4),
    PRINCIPAL NUMBER(15,4),
    BROKER_FEES NUMBER(15,4),
    OTHER_FEES NUMBER(15,4),
    SETTLE_DATE DATE,
    FROM_TO_ACCOUNT VARCHAR(30),
    ACCOUNT_TYPE VARCHAR(30),
    ACCRUED_INTEREST NUMBER(15,4),
    CLOSING_ACCOUNT_METHOD VARCHAR(30),
    DESCRIPTION VARCHAR(500)
); 
  • Prices:
create or replace TABLE PRI (
    SYMBOL VARCHAR(100) NOT NULL,
    SECURITY_TYPE VARCHAR(2) NOT NULL,
    FILE_DATE DATE NOT NULL,
    PRICE NUMBER(15,4) NOT NULL,
    FACTOR NUMBER(15,10),
    primary key (SYMBOL, FILE_DATE)
); 

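On their own, these tables are practically useless and messy. They almost always need to be joined to each other (or to themselves) and have many additional calculations applied before they can be interpreted in any meaningful way. Views have helped me encapsulate this problem.

I use two core views downstream of these tables:

  1. Holdings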
SELECT 
    POS.FILE_DATE, 
    POS.ACCOUNT_NUMBER, 
    POS.SYMBOL,
    CASE WHEN POS.QUANTITY > 0 THEN POS.QUANTITY ELSE POS.AMOUNT END AS QUANTITY,
    CASE WHEN POS.SECURITY_TYPE IN ('FI', 'MB', 'UI') THEN
        COALESCE(
            PRI.FACTOR * PRI.PRICE * .01,
            PRI.PRICE * .01
        )
        ELSE PRI.PRICE END AS PPU,
    COALESCE(
        POS.AMOUNT,
        QUANTITY * PPU
    ) AS MARKET_VALUE
FROM POS AS POS 
LEFT JOIN PRI AS PRI 
    ON POS.FILE_DATE = PRI.FILE_DATE AND POS.SYMBOL = PRI.SYMBOL; 

  2. Cash flows (this one is a weird one... our data provider really doesn't help much here)
select t.file_date, T.ACCOUNT_NUMBER,
    COALESCE (
        CASE WHEN T.SECURITY_CODE = 'MB' THEN INIT * p.factor * .01 ELSE NULL END, -- IF Factor and Par needed
        CASE WHEN T.SECURITY_CODE IN ('FI', 'UI') THEN INIT * .01 ELSE NULL END, -- if par val needed
        CASE WHEN T.QUANTITY > 0 AND P.PRICE > 0 THEN t.quantity * p.price ELSE NULL END,
        CASE WHEN T.NET_AMOUNT > 0 and p.price is not null THEN T.NET_AMOUNT * p.price ELSE NULL END,
        T.NET_AMOUNT, -- if the transaction has a net value
        BUYS.NET_AMOUNT, -- if there is a buy aggregate match for the day
        SELLS.NET_AMOUNT -- if there is a sell aggregate match for the day
    ) AS DERIVED, -- this records the initial cash flow value
    COALESCE( 
        CASE WHEN t.code IN ('DEP', 'REC') THEN DERIVED ELSE NULL END,
        CASE WHEN t.code IN ('WITH', 'DEL', 'FRTAX', 'EXABP') THEN -1 * DERIVED ELSE NULL END
    ) as DIRECTION, -- this determines if it was an inflow or outflow
    CASE 
        WHEN T.CANCEL_STATUS_FLAG = 'Y' THEN -1*DIRECTION 
        ELSE DIRECTION 
    END AS FLOW, -- this cancels out an existing transaction
    CASE WHEN T.CODE = 'MFEE' THEN INIT ELSE NULL END AS FEES,
    t.code, 
    t.symbol, 
    t.net_amount, 
    t.quantity, 
    p.price,
    p.factor
from trn t
LEFT JOIN PRI p 
    ON t.symbol = p.symbol 
    AND t.file_date = p.file_date
-- in the rare case that we dont have a securities price, it means that a buy/sell 
-- transaction occurred to remove the position from our 
-- data feed. This must mean that the transaction value 
-- is equivalent to the total internal operation that occurred to a particular security in 
-- this account on this day.
LEFT JOIN (
    select file_date, 
        account_number, 
        symbol, 
        SUM(net_amount) as net_amount 
    from TRN 
    where code = 'BUY' 
    group by file_date, account_number, symbol
) AS buys 
    ON t.code = 'DEL'   
    AND buys.file_date = t.file_date  
    AND buys.symbol = t.symbol  
    AND buys.account_number = t.account_number
    AND p.price IS NULL
    AND t.net_amount = 0
    AND buys.net_amount != 0
LEFT JOIN (
    select file_date, 
        account_number, 
        symbol, 
        SUM(net_amount) as net_amount 
    from TRN 
    where code = 'SELL' 
    group by file_date, account_number, symbol
) AS sells 
    ON t.code = 'REC' 
    AND t.file_date = sells.file_date 
    AND sells.symbol = t.symbol 
    AND sells.account_number = t.account_number
    AND p.price IS NULL
    AND t.net_amount = 0
    AND sells.net_amount != 0
WHERE 
    t.code in ('DEP', 'WITH', 'DEL', 'REC', 'FRTAX', 'MFEE', 'EXABP')
ORDER BY t.file_date; 

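I've also written views that group the two views above by account number, named account_balances and grouped_cashflows respectively. I call these two views frequently from the application layer and have been happy with the execution speed so far.

With all of that out of the way...

I'm now trying to calculate the time-weighted performance of each investment account. I'd prefer to do this in SQL rather than in the application layer so that I can view the output in those sweet Snowflake dashboards.

The formula I'm using is called TWRR (time-weighted rate of return).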

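In summary, it requires me to collect all historical balances plus all cash flows, calculate the net difference between each pair of consecutive market closes, and record it as a percentage. If we call this percentage + 1 a "factor", take the product of all these factors over a given time frame, and subtract 1, we get the performance for that time frame.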
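
Written out, the description above might look like the following. This is only a sketch: the symbols are labels chosen here for illustration (EMV for ending market value, BMV for beginning market value, CF for the net cash flow on day t), not notation from any particular reference.

```latex
\begin{align*}
r_t &= \frac{\mathrm{EMV}_t - \mathrm{BMV}_t - \mathrm{CF}_t}{\mathrm{BMV}_t} \\
f_t &= 1 + r_t \\
\mathrm{TWRR}_{[1,T]} &= \prod_{t=1}^{T} f_t \; - \; 1
\end{align*}
```

For example, with a beginning value of 1000, an ending value of 1030, and a deposit of 20 during the day, the return is (1030 - 1000 - 20) / 1000 = 0.01, so the factor for that day is 1.01.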

So... for my first attempt, I did what you'd expect: I created another view called "factors" that references my other views:

SELECT 
B.FILE_DATE, 
B.ACCOUNT_NUMBER, 
B.MARKET_VALUE AS EMV,
COALESCE(CF.FLOW, 0) AS NET,
COALESCE(CF.FEES, 0) AS FEES,
COALESCE(NET + FEES, NET, 0) AS GRS,
LAG(B.MARKET_VALUE, 1, NULL) OVER (PARTITION BY B.ACCOUNT_NUMBER ORDER BY B.FILE_DATE) AS LAST_BAL,
COALESCE( 
    LAST_BAL, 
    B.MARKET_VALUE - NET,
    B.MARKET_VALUE
) AS BMV,
EMV - BMV AS DIFF,
DIFF - NET AS NET_DIFF,
DIFF - GRS AS GRS_DIFF,
CASE WHEN BMV > 10 AND EMV > 10 AND NET_DIFF / BMV < 1 AND GRS != 0 THEN 1 + (NET_DIFF / BMV) ELSE 1 END AS NET_FACTOR,
CASE WHEN BMV > 10 AND EMV > 10 AND GRS_DIFF / BMV < 1 AND GRS != 0 THEN 1 + (GRS_DIFF / BMV) ELSE 1 END AS GRS_FACTOR
FROM ACCOUNT_BALANCES B 
LEFT JOIN GROUPED_CASHFLOWS CF 
    ON B.FILE_DATE = CF.FILE_DATE 
    AND B.ACCOUNT_NUMBER = CF.ACCOUNT_NUMBER
order by ACCOUNT_NUMBER, FILE_DATE;
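
Once the per-day factors exist, the "product of all factors minus 1" step still has to be expressed somehow. Snowflake, as far as I know, has no built-in product aggregate, so one common workaround is to sum logarithms and exponentiate. The sketch below uses a small inline sample (the `sample_factors` CTE and its values are made up for illustration) and assumes every factor is strictly positive, because LN is undefined otherwise.

```sql
-- Hypothetical sample data standing in for rows of the factors view.
WITH sample_factors AS (
    SELECT 'ACC1' AS account_number, CAST('2023-01-02' AS DATE) AS file_date, 1.01 AS net_factor
    UNION ALL
    SELECT 'ACC1', CAST('2023-01-03' AS DATE), 0.98
    UNION ALL
    SELECT 'ACC1', CAST('2023-01-04' AS DATE), 1.02
)
SELECT
    account_number,
    -- product(f) = exp(sum(ln(f))); only valid while every factor is > 0
    EXP(SUM(LN(net_factor))) - 1 AS twrr
FROM sample_factors
WHERE file_date BETWEEN '2023-01-02' AND '2023-01-04'
GROUP BY account_number;
```

For this sample the product is 1.01 * 0.98 * 1.02 = 1.009596, so the result should come out at roughly 0.0096, subject to floating-point rounding.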

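This query works but, as you can guess, it's really... really... slow. For example, it takes 10 seconds for certain accounts (admittedly on an XS Snowflake instance, but still).

At this point it was obvious I was doing something wrong and, sure enough, a quick Google search made it clear that nested views like this are a huge no-no.

But the thing is... writing all of this as a single query without using my views seems... horrible.

So, to all the SQL/Snowflake gurus out there... is there a better way to do this?

Any advice would be greatly appreciated.

Edit: Including the Snowflake query profile for the factors view:

[image: query profile]

Thank you!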

Sim*_*rim 3

So far I've only spotted some small things, and I don't think they add up to anything big.

From holdings:

    CASE WHEN POS.SECURITY_TYPE IN ('FI', 'MB', 'UI') THEN
        COALESCE(
            PRI.FACTOR * PRI.PRICE * .01,
            PRI.PRICE * .01
        )
        ELSE PRI.PRICE END AS PPU,

In Snowflake, a two-legged CASE is the same as using IFF, and IFF is easier to read IMHO. The math can also be rearranged:

    IFF(POS.SECURITY_TYPE IN ('FI', 'MB', 'UI'),
        PRI.PRICE * .01 * COALESCE(PRI.FACTOR, 1),
        PRI.PRICE) AS PPU,
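
To check that the rearranged math gives the same result, the two forms can be compared side by side on a couple of rows. This is a minimal sketch with made-up values in a hypothetical `sample_prices` CTE: one row has a factor and one has a NULL factor.

```sql
-- Hypothetical sample rows; column types mirror the PRI table definition.
WITH sample_prices AS (
    SELECT 'with factor' AS label,
           CAST(98.5 AS NUMBER(15,4))  AS price,
           CAST(0.75 AS NUMBER(15,10)) AS factor
    UNION ALL
    SELECT 'null factor',
           CAST(98.5 AS NUMBER(15,4)),
           CAST(NULL AS NUMBER(15,10))
)
SELECT
    label,
    COALESCE(factor * price * .01, price * .01) AS ppu_original,
    price * .01 * COALESCE(factor, 1)           AS ppu_rewritten
FROM sample_prices;
```

Both columns should agree numerically on each row: 0.73875 when the factor is 0.75, and 0.985 when the factor is NULL. The two forms also agree when the price itself is NULL, since both then evaluate to NULL.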

In cash flows, the big COALESCE for DERIVED could become a CASE statement, though perhaps that won't be any faster.

Thus:

    COALESCE (
        IFF( T.SECURITY_CODE = 'MB', INIT * p.factor * .01, NULL), -- IF Factor and Par needed
        IFF( T.SECURITY_CODE IN ('FI', 'UI'), INIT * .01, NULL), -- if par val needed
        IFF( T.QUANTITY > 0 AND P.PRICE > 0, t.quantity * p.price, NULL),
        IFF( T.NET_AMOUNT > 0 and p.price is not null, T.NET_AMOUNT * p.price, NULL),
        T.NET_AMOUNT, -- if the transaction has a net value
        BUYS.NET_AMOUNT, -- if there is a buy aggregate match for the day
        SELLS.NET_AMOUNT -- if there is a sell aggregate match for the day
    ) AS DERIVED, -- this records the initial cash flow value

could be:

    CASE 
        WHEN T.SECURITY_CODE = 'MB' THEN INIT * p.factor * .01
        WHEN T.SECURITY_CODE IN ('FI', 'UI') THEN INIT * .01
        WHEN T.QUANTITY > 0 AND P.PRICE > 0 THEN t.quantity * p.price
        WHEN T.NET_AMOUNT > 0 and p.price is not null THEN T.NET_AMOUNT * p.price
        ELSE COALESCE(
            T.NET_AMOUNT, -- if the transaction has a net value
            BUYS.NET_AMOUNT, -- if there is a buy aggregate match for the day
            SELLS.NET_AMOUNT -- if there is a sell aggregate match for the day
        ) 
    END AS DERIVED, -- this records the initial cash flow value
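
One caveat worth checking before swapping the forms: a COALESCE of IFFs and a CASE are not strictly interchangeable. COALESCE keeps falling through whenever a branch evaluates to NULL, while CASE stops at the first WHEN whose condition is true, even if its result is NULL. The sketch below illustrates this with a single made-up row; `base_amount` is a hypothetical stand-in for the INIT value, which is not defined in the posted view.

```sql
-- Hypothetical row: an 'MB' security whose factor is missing.
WITH sample_trn AS (
    SELECT 'MB' AS security_code,
           CAST(1000 AS NUMBER(15,4))  AS base_amount,  -- stand-in for INIT (assumption)
           CAST(NULL AS NUMBER(15,10)) AS factor,
           CAST(250 AS NUMBER(15,4))   AS net_amount
)
SELECT
    -- The first argument is NULL (factor is NULL), so COALESCE moves on to net_amount.
    COALESCE(
        IFF(security_code = 'MB', base_amount * factor * .01, NULL),
        net_amount
    ) AS derived_coalesce,
    -- The first WHEN matches, so CASE returns its (NULL) result and never reaches ELSE.
    CASE
        WHEN security_code = 'MB' THEN base_amount * factor * .01
        ELSE net_amount
    END AS derived_case
FROM sample_trn;
```

Here `derived_coalesce` is 250 and `derived_case` is NULL. Whether that difference matters depends on whether 'MB' rows can actually have a NULL factor in the real data.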

Hmm, this might be something.

In cash flows, you compute the buys and sells aggregates and only join them when t.net_amount = 0, BUT those values are only used in:

        ELSE COALESCE(
            T.NET_AMOUNT, -- if the transaction has a net value
            BUYS.NET_AMOUNT, -- if there is a buy aggregate match for the day
            SELLS.NET_AMOUNT -- if there is a sell aggregate match for the day
        ) 

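The COALESCE will only use those values if t.net_amount is null. But the join conditions mean those values are only present when t.net_amount is zero, so buys and sells are 100% wasted computation. So either the join condition should be t.net_amount is null, or those joins can be removed.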
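
The effect is easy to reproduce in isolation. In the sketch below (made-up values, hypothetical column names), the first expression mirrors the posted logic and always returns the zero. The second shows one possible variation that is not part of the suggestion above: NULLIF turns a zero into NULL so the aggregate can take over. That would change behaviour for every zero-amount row, so treat it as an assumption to verify rather than a fix.

```sql
-- Hypothetical row: the joins only match when the transaction's net amount is 0.
WITH sample AS (
    SELECT CAST(0 AS NUMBER(15,4))    AS t_net_amount,
           CAST(5000 AS NUMBER(15,4)) AS buys_net_amount
)
SELECT
    -- As written: 0 is not NULL, so the buy aggregate is never reached.
    COALESCE(t_net_amount, buys_net_amount)            AS as_written,
    -- Variation (assumption): treat 0 as "missing" so the aggregate is used.
    COALESCE(NULLIF(t_net_amount, 0), buys_net_amount) AS with_nullif
FROM sample;
```

Here `as_written` is 0 and `with_nullif` is 5000.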

Then there are also things like:

CASE WHEN T.CODE = 'MFEE' THEN INIT ELSE NULL END AS FEES

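If this is null, it gets coalesced to zero later on (which probably also handles the left join), but it could just be zero here. It also shows that T.CODE can equal 'MFEE', which DIRECTION does not handle, so DIRECTION can be null, and therefore FLOW can be null.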
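
A small sketch of both points, using one made-up 'MFEE' row. The `derived` column here is a hypothetical stand-in for the computed cash flow value, and the zero default in `fees` is just one way to express "could just be zero here".

```sql
-- Hypothetical row: a management-fee transaction.
WITH sample AS (
    SELECT 'MFEE' AS code,
           CAST(25 AS NUMBER(15,4)) AS derived  -- stand-in value (assumption)
)
SELECT
    -- Neither code list contains 'MFEE', so both CASE branches are NULL.
    COALESCE(
        CASE WHEN code IN ('DEP', 'REC') THEN derived END,
        CASE WHEN code IN ('WITH', 'DEL', 'FRTAX', 'EXABP') THEN -1 * derived END
    ) AS direction,
    -- Defaulting to 0 here removes the need to coalesce FEES downstream.
    IFF(code = 'MFEE', derived, 0) AS fees
FROM sample;
```

For this row `direction` is NULL and `fees` is 25, so any FLOW derived from DIRECTION would also be NULL unless it is coalesced later.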