Mat*_*att 7 haskell type-systems pipeline type-level-computation
我使用质谱数据创建了许多数据处理管道,其中来自仪器的数据被清理、转换、缩放、检查和最终分析。我倾向于为此使用递归类型定义——这是一个非常简化的例子:
data Dataset = Initial { x::(Vector Double), y::(Vector Double) name::String}
| Cleaned { x::(Vector Double), y::(Vector Double) name::String}
| Transformed { x::(Vector Double), y::(Vector Double) name::String}
Run Code Online (Sandbox Code Playgroud)
那么一个典型的管道将只是一个以Dataset创建者开始的函数链,然后继续消耗类型的函数Dataset,并产生类型的东西Dataset:
createDataset :: Vector Double -> Vector Double -> String -> Dataset
createDataset x y name = Initial x y name
removeOutliers :: Dataset -> Dataset
removeOutliers (Initial x y n) = let
(new_x, new_y) = outlierRemovalFunction x y
in Cleaned new_x new_y (n ++"_outliersRemoved")
(Cleaned x y n) = error "Already been cleaned"
(Scaled x y n) = error "Scaled data should have already been cleaned"
(Transformed x y n) = error "Transformed data should have already been cleaned"
logTransform :: Dataset -> Dataset
logTransform (Initial x y n) = error "Need to clean first"
(Cleaned x y n) = let
(new_x, new_y) = logTransformFunction x y
in Transformed new_x new_y (n ++ "_logTransformed)
Run Code Online (Sandbox Code Playgroud)
因此,这可以确保管道中的处理步骤以正确的顺序发生,并且您可以使用组合创建整个管道
(logTransform . removeOutliers . createDataset) init_y init_y "ourData"
Run Code Online (Sandbox Code Playgroud)
但由于几个原因,这种方法似乎非常有限。第一个原因是通过构造函数上的模式匹配检测到不正确,因此对管道的添加和更改将需要在模式匹配的任何地方进行更改。想象一个更复杂的例子,有几个清理和几个转换步骤——基本上每个可能的组合都需要它自己独特的构造函数,并且所有的模式匹配都必须是非穷尽的,或者绝对重复。
这似乎受到限制的第二个原因是错误构造的管道只能在运行时通过故障检测到。我已经对所有处理步骤进行了排序,因此在管道中的每个点,我都确切地知道数据发生了什么。首先,类型系统应该能够防止我错误地将步骤放在一起,并且在编译时应该可以检测到在未清理的输入上使用期望清理数据的函数。
我想过为管道中的每个阶段使用单独的类型,然后将“数据集”接口实现为类型类,例如:
class Dataset a where
x :: a -> Vector Double
y :: a -> Vector Double
name :: a -> String
data Initial = Initial x y name
instance Dataset Initial where ...
data Cleaned a = Cleaned a
instance Dataset Cleaned where ...
data Transformed a = Transformed a
instance Dataset Transformed where ...
Run Code Online (Sandbox Code Playgroud)
那么你可以做一些事情(我认为......),比如:
removeOutliers :: (Dataset a) => a -> Cleaned a
removeOutliers = ...
logTransform :: (Dataset a) => Cleaned a -> Transformed Cleaned a
logTransform = ...
Run Code Online (Sandbox Code Playgroud)
我相信这种方法解决了上面的问题 1:我们现在可以在编译时检测管道的不正确性,并且我们不再需要使用所有这些不同的构造函数来描述处理步骤。
但是,似乎我只是将问题“上一级”。我现在正在处理类型变量和所有这些嵌套类型。Dataset我现在不需要为每个可能的管道步骤组合创建一个构造函数,而是需要Dataset为每个类型组合创建一个实例!
我真正想要的是一种让处理管道中的类型在其约束中既非常具体又非常普遍的方法。我想使用类型/约束来详细说明应用特定处理步骤的顺序,但我也希望类型/约束能够传达更一般的东西——即“除了其他不重要的步骤,离群值去除已经完成”。所以基本上是已经删除异常值的事物类型。
传递订购信息将是一个额外的好处——“除了发生了其他不重要的步骤,异常值去除发生了,并且在稍后的某个时刻发生了日志转换”。该类型的东西,有异常值去掉,他们对数变换之前(不一定立即之前)。
使用 Haskell 的类型系统可以实现这种事情吗?
您可以使用幻像类型在其类型中存储有关数据集的信息,例如:
data Initial
data Cleaned
data Scaled
data Dataset a = Dataset { x :: Vector Double, y :: Vector Double, name :: String }
createDataset :: Vector Double -> Vector Double -> String -> Dataset Initial
createDataset x y name = Dataset x y name
removeOutliers :: Dataset Initial -> Dataset Cleaned
removeOutliers (Dataset x y n) =
let (x', y') = clean x y
in Dataset x' y' (n ++ "_clean")
Run Code Online (Sandbox Code Playgroud)
通过一些 GHC 扩展,您可以将幻像类型限制为给定的状态类型,并避免显式声明空数据类型。例如:
{-# LANGUAGE DataKinds, KindSignatures #-}
data State = Initial | Cleaned | Scaled
data Dataset (a :: State) = Dataset { x :: Vector Double, y :: Vector Double, name :: String }
Run Code Online (Sandbox Code Playgroud)
是的,现代 Haskell 类型系统可以处理这个问题。但是,与通常的术语级别编程相比,Haskell 中的类型级别编程仍然很难。语法和技术比较复杂,文档有点欠缺。也往往是这样的情况,需求的相对较小的变化会导致实现中的大变化(即,向您的实现添加一个新的“特性”可以级联成所有类型的重大重组),这可能会使实现变得困难如果您对自己的实际要求仍然有点不确定,请提出解决方案。
@JonPurdy 的评论和 @AtnNn 的回答给出了一些可能的想法。这是一个尝试解决您的特定要求的解决方案。但是,除非您愿意坐下来自学一些类型级编程,否则它可能难以使用(或至少难以适应您的要求)。
无论如何,假设您有兴趣使用已在其上执行的进程的类型级别列表标记固定数据结构(即,始终具有相同类型的相同字段),并通过一种方法检查进程列表所需进程的有序子列表。
我们需要一些扩展:
{-# LANGUAGE ConstraintKinds #-}
{-# LANGUAGE DataKinds #-}
{-# LANGUAGE PolyKinds #-}
{-# LANGUAGE TypeFamilies #-}
{-# LANGUAGE TypeOperators #-}
{-# LANGUAGE UndecidableInstances #-}
Run Code Online (Sandbox Code Playgroud)
过程标签本身被定义为 sum 类型的构造函数,DataKinds扩展将标签从术语级别提升到类型级别:
data Process = Cleaned | Transformed | Scaled | Inspected | Analyzed
Run Code Online (Sandbox Code Playgroud)
然后数据结构被标记为一个应用进程列表,它的“管道”:
data Dataset (pipeline :: [Process])
= Dataset { x :: [Double]
, y :: [Double]
, name :: String }
Run Code Online (Sandbox Code Playgroud)
注意:管道以相反的顺序排列将是最方便的,最近应用的Process首先应用。
为了允许我们要求 apipeline具有特定的有序进程子序列,我们需要一个类型级函数(即类型系列)来检查子序列。这是一个版本:
type family a || b where
True || b = True
False || b = b
type family Subseq xs ys where
Subseq '[] ys = True
Subseq nonempty '[] = False
Subseq (x:xs) (x:ys) = Subseq xs ys || Subseq (x:xs) ys
Subseq xs (y:ys) = Subseq xs ys
Run Code Online (Sandbox Code Playgroud)
我们可以在 GHCi 中测试这个类型级别的函数:
?> :kind! Subseq '[Inspected, Transformed] '[Analyzed, Inspected, Transformed, Cleaned]
Subseq '[Inspected, Transformed] '[Analyzed, Inspected, Transformed, Cleaned] :: Bool
= 'True
?> :kind! Subseq '[Inspected, Transformed] '[Analyzed, Transformed, Cleaned]
Subseq '[Inspected, Transformed] '[Analyzed, Transformed, Cleaned] :: Bool
= 'False
?> :kind! Subseq '[Inspected, Transformed] '[Transformed, Inspected]
Subseq '[Inspected, Transformed] '[Transformed, Inspected] :: Bool
= 'False
Run Code Online (Sandbox Code Playgroud)
如果您想编写一个函数,该函数需要对数据集进行转换,然后清除异常值(按此顺序),可能与其他不重要的步骤混合在一起,并且函数本身应用了缩放步骤,那么签名将如下所示:
-- remember: pipeline type is in reverse order
foo1 :: (Subseq [Cleaned, Transformed] pipeline ~ True)
=> Dataset pipeline -> Dataset (Scaled : pipeline)
foo1 = undefined
Run Code Online (Sandbox Code Playgroud)
如果要防止双重缩放,可以引入另一个类型级函数:
type family Member x xs where
Member x '[] = 'False
Member x (x:xs) = 'True
Member x (y:xs) = Member x xs
Run Code Online (Sandbox Code Playgroud)
并添加另一个约束:
foo2 :: ( Subseq [Cleaned, Transformed] pipeline ~ True
, Member Scaled pipeline ~ False)
=> Dataset pipeline -> Dataset (Scaled : pipeline)
foo2 = undefined
Run Code Online (Sandbox Code Playgroud)
然后:
> foo2 (Dataset [] [] "x" :: Dataset '[Transformed])
... Couldn't match type ‘'False’ with ‘'True’ ...
> foo2 (Dataset [] [] "x" :: Dataset '[Cleaned, Scaled, Transformed])
... Couldn't match type ‘'False’ with ‘'True’ ...
> foo2 (Dataset [] [] "x" :: Dataset '[Cleaned, Transformed])
-- typechecks okay
foo2 (Dataset [] [] "x" :: Dataset '[Cleaned, Transformed])
:: Dataset '[ 'Scaled, 'Cleaned, 'Transformed]
Run Code Online (Sandbox Code Playgroud)
在约束语法和错误消息方面,您可以使用一些额外的类型别名和类型系列使其更加友好:
import Data.Kind
import GHC.TypeLits
type Require procs pipeline = Require1 (Subseq procs pipeline) procs pipeline
type family Require1 b procs pipeline :: Constraint where
Require1 True procs pipeline = ()
Require1 False procs pipeline
= TypeError (Text "The pipeline " :<>: ShowType pipeline :<>:
Text " lacks required processing " :<>: ShowType procs)
type Forbid proc pipeline = Forbid1 (Member proc pipeline) proc pipeline
type family Forbid1 b proc pipeline :: Constraint where
Forbid1 False proc pipeline = ()
Forbid1 True proc pipeline
= TypeError (Text "The pipeline " :<>: ShowType pipeline :<>:
Text " must not include " :<>: ShowType proc)
foo3 :: (Require [Cleaned, Transformed] pipeline, Forbid Scaled pipeline)
=> Dataset pipeline -> Dataset (Scaled : pipeline)
foo3 = undefined
Run Code Online (Sandbox Code Playgroud)
这使:
> foo3 (Dataset [] [] "x" :: Dataset '[Transformed])
...The pipeline '[ 'Transformed] lacks required processing '[ 'Cleaned, 'Transformed]...
> foo3 (Dataset [] [] "x" :: Dataset '[Cleaned, Scaled, Transformed])
...The pipeline '[ 'Cleaned, 'Scaled, 'Transformed] must not include 'Scaled...
> foo3 (Dataset [] [] "x" :: Dataset '[Cleaned, Transformed])
-- typechecks okay
foo3 (Dataset [] [] "x" :: Dataset '[Cleaned, Transformed])
:: Dataset '[ 'Scaled, 'Cleaned, 'Transformed]
Run Code Online (Sandbox Code Playgroud)
完整的代码示例:
{-# LANGUAGE ConstraintKinds #-}
{-# LANGUAGE DataKinds #-}
{-# LANGUAGE PolyKinds #-}
{-# LANGUAGE TypeFamilies #-}
{-# LANGUAGE TypeOperators #-}
{-# LANGUAGE UndecidableInstances #-}
import Data.Kind
import GHC.TypeLits
data Process = Cleaned | Transformed | Scaled | Inspected | Analyzed
data Dataset (pipeline :: [Process])
= Dataset { x :: [Double]
, y :: [Double]
, name :: String }
type family a || b where
True || b = True
False || b = b
type family Subseq xs ys where
Subseq '[] ys = True
Subseq nonempty '[] = False
Subseq (x:xs) (x:ys) = Subseq xs ys || Subseq (x:xs) ys
Subseq xs (y:ys) = Subseq xs ys
type family Member x xs where
Member x '[] = False
Member x (x:xs) = True
Member x (y:xs) = Member x xs
type Require procs pipeline = Require1 (Subseq procs pipeline) procs pipeline
type family Require1 b procs pipeline :: Constraint where
Require1 True procs pipeline = ()
Require1 False procs pipeline
= TypeError (Text "The pipeline " :<>: ShowType pipeline :<>:
Text " lacks required processing " :<>: ShowType procs)
type Forbid proc pipeline = Forbid1 (Member proc pipeline) proc pipeline
type family Forbid1 b proc pipeline :: Constraint where
Forbid1 False proc pipeline = ()
Forbid1 True proc pipeline
= TypeError (Text "The pipeline " :<>: ShowType pipeline :<>:
Text " must not include " :<>: ShowType proc)
foo1 :: (Subseq [Cleaned, Transformed] pipeline ~ True)
=> Dataset pipeline -> Dataset (Scaled : pipeline)
foo1 = undefined
foo2 :: ( Subseq [Cleaned, Transformed] pipeline ~ True
, Member Scaled pipeline ~ False)
=> Dataset pipeline -> Dataset (Scaled : pipeline)
foo2 = undefined
foo3 :: (Require [Cleaned, Transformed] pipeline, Forbid Scaled pipeline)
=> Dataset pipeline -> Dataset (Scaled : pipeline)
foo3 = undefined
Run Code Online (Sandbox Code Playgroud)
| 归档时间: |
|
| 查看次数: |
225 次 |
| 最近记录: |