Consider the following code:
import org.apache.spark.sql.Dataset
import org.apache.spark.sql.functions.{count, sum}

case class Person(
  personId: Long, name: String, ageGroup: String, gender: String,
  relationshipStatus: String, country: String, state: String
)
case class PerPersonPower(personId: Long, power: Double)
val people: Dataset[Person] = ... // Around 50 million entries.
val powers: Dataset[PerPersonPower] = ... // Around 50 million entries.
people.join(powers, "personId")
  .groupBy("ageGroup", "gender", "relationshipStatus", "country", "state")
  .agg(
    sum("power").alias("totalPower"),
    count("*").alias("personCount")
  )
It runs on a cluster with roughly 100 GB of RAM in total, but the cluster runs out of memory, and I am not sure what to do about it. In fact, people is already repartitioned by $"personId" and cached: people.repartition($"personId").cache().
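For reference, here is a minimal sketch of that caching step (assuming the standard Spark SQL imports; the name peopleByKey is introduced here purely for illustration):

import org.apache.spark.sql.functions.col

// Repartition by the join key so that rows with the same personId end up in
// the same partition, then cache the result for reuse across actions.
val peopleByKey = people.repartition(col("personId")).cache()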
Does anyone have ideas for optimizing this computation?
The cluster is a vanilla Google Dataproc cluster (so it uses YARN in client mode) consisting of 14 nodes with 8 GB of RAM each.
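For context, a hypothetical sizing sketch for such a cluster; the configuration keys below are standard Spark settings, but the concrete values are assumptions rather than anything from the post:

import org.apache.spark.sql.SparkSession

// On 14 nodes with 8 GB each, the executor heap plus its off-heap overhead
// must fit inside the per-node YARN container budget.
val spark = SparkSession.builder()
  .appName("power-aggregation")                   // app name assumed for illustration
  .config("spark.executor.instances", "14")       // one executor per node (assumed)
  .config("spark.executor.memory", "4g")          // heap per executor (assumed)
  .config("spark.executor.memoryOverhead", "1g")  // off-heap overhead (assumed)
  .getOrCreate()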
Tags: scala, apache-spark, apache-spark-sql, google-cloud-dataproc, apache-spark-dataset
I am learning Haskell and am confused by this example. Consider the following:
class Tofu t where
  tofu :: j a -> t a j

data Frank a b = Frank {frankField :: b a} deriving (Show)

instance Tofu Frank where
  tofu x = Frank x
Why is it that, when making Frank an instance of Tofu, we supply (as far as I can tell) a type constructor, Frank x, rather than the value constructor, i.e. tofu x = Frank {frankField = x}?
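For what it is worth, a small runnable sketch (assuming GHC; the main function is added here purely for illustration) showing that Frank in the instance is in fact the value constructor, and that Frank x and Frank {frankField = x} build exactly the same value:

class Tofu t where
  tofu :: j a -> t a j

data Frank a b = Frank {frankField :: b a} deriving (Show)

instance Tofu Frank where
  tofu x = Frank x  -- Frank applied positionally; same as Frank {frankField = x}

main :: IO ()
main = do
  -- Both lines print: Frank {frankField = Just 'a'}
  print (tofu (Just 'a') :: Frank Char Maybe)
  print (Frank {frankField = Just 'a'})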