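I'm running a 5-node Spark cluster on AWS EMR, each node an m3.xlarge (1 master, 4 slaves). I successfully ran a 146 MB bzip2-compressed CSV file through it and ended up with a perfectly aggregated result.
Now I'm trying to process a ~5GB bzip2 CSV file on this cluster, but I'm receiving this error:
16/11/23 17:29:53 WARN TaskSetManager: Lost task 49.2 in stage 6.0 (TID xxx, xxx.xxx.xxx.compute.internal): ExecutorLostFailure (executor 16 exited caused by one of the running tasks) Reason: Container killed by YARN for exceeding memory limits. 10.4 GB of 10.4 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
I'm confused as to why I'm hitting a ~10.5GB memory limit on a ~75GB cluster (15GB per m3.xlarge instance)...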
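For context, the limit in that message is enforced per executor container, not across the whole cluster, so it is not expected to match the ~75GB total. The sketch below is only an illustration of the arithmetic. It assumes the Spark-on-YARN default of max(384 MiB, 10% of executor memory) for spark.yarn.executor.memoryOverhead, and the 9.5 GiB executor size is a hypothetical value chosen for the example, not one read from this cluster.

```python
# Rough sketch: how a per-executor YARN container limit is derived.
# Assumption: the default overhead is max(384 MiB, 10% of executor memory).

MIN_OVERHEAD_MIB = 384
OVERHEAD_FACTOR = 0.10


def default_overhead_mib(executor_memory_mib):
    """Default off-heap overhead reserved on top of the executor heap."""
    return max(MIN_OVERHEAD_MIB, int(executor_memory_mib * OVERHEAD_FACTOR))


def container_limit_mib(executor_memory_mib, overhead_mib=None):
    """Memory YARN allows one executor container: heap plus overhead."""
    if overhead_mib is None:
        overhead_mib = default_overhead_mib(executor_memory_mib)
    return executor_memory_mib + overhead_mib


if __name__ == "__main__":
    # Hypothetical executor heap of 9.5 GiB (an assumption for illustration).
    executor_memory_mib = 9 * 1024 + 512
    limit = container_limit_mib(executor_memory_mib)
    print("per-executor container limit: %.2f GiB" % (limit / 1024))
```

Under these assumptions the limit comes out in the neighbourhood of the figure in the log, which suggests the error concerns a single executor's container rather than the cluster as a whole.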
Here's my EMR config:
[
  {
    "classification": "spark-env",
    "properties": {},
    "configurations": [
      {
        "classification": "export",
        "properties": {
          "PYSPARK_PYTHON": "python34"
        },
        "configurations": []
      }
    ]
  },
  {
    "classification": "spark",
    "properties": {
      "maximizeResourceAllocation": "true"
    },
    "configurations": []
  }
]
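The log message suggests boosting spark.yarn.executor.memoryOverhead. One way this might be expressed is by adding a spark-defaults classification to the configuration above. The sketch below generates such a configuration. The helper name emr_configurations and the 2048 MiB value are assumptions for illustration only, and nothing here establishes that this value resolves the error.

```python
import json


def emr_configurations(memory_overhead_mib):
    """Build the EMR configuration list above plus a memoryOverhead setting.

    memory_overhead_mib is a hypothetical value; tune it for the workload.
    """
    return [
        {
            "classification": "spark-env",
            "properties": {},
            "configurations": [
                {
                    "classification": "export",
                    "properties": {"PYSPARK_PYTHON": "python34"},
                    "configurations": [],
                }
            ],
        },
        {
            "classification": "spark",
            "properties": {"maximizeResourceAllocation": "true"},
            "configurations": [],
        },
        {
            # Extra classification (not in the original config): sets the
            # per-executor overhead that the YARN error message refers to.
            "classification": "spark-defaults",
            "properties": {
                "spark.yarn.executor.memoryOverhead": str(memory_overhead_mib)
            },
            "configurations": [],
        },
    ]


if __name__ == "__main__":
    print(json.dumps(emr_configurations(2048), indent=2))
```

Raising the overhead enlarges each container request, so the executor heap may need to shrink correspondingly for the container to still fit on a node.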
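From what I've read, setting the maximizeResourceAllocation property should tell EMR to configure Spark to fully utilize all resources available on the cluster. That is, I should have ~75GB of memory available... so why am I getting a ~10.5GB memory limit error? Here is the code I'm running: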
import pyspark.sql.functions  # import added; the snippet uses pyspark.sql.*

def sessionize(raw_data, timeout):
    # https://www.dataiku.com/learn/guide/code/reshaping_data/sessionization.html
    # Window over each (user, site) pair, ordered by event time.
    window = (pyspark.sql.Window.partitionBy("user_id", "site_id")
              .orderBy("timestamp"))
    # Timestamp of the previous event in the same window.
    diff = (pyspark.sql.functions.lag(raw_data.timestamp, 1)
            .over(window))
    # Flag rows whose gap from the previous event reaches the timeout.
    time_diff = (raw_data.withColumn("time_diff", raw_data.timestamp - diff)
                 .withColumn("new_session", pyspark.sql.functions.when(pyspark.sql.functions.col("time_diff") >= timeout.seconds, 1).otherwise(0)))
    window = (pyspark.sql.Window.partitionBy("user_id", "site_id")
              .orderBy("timestamp")
              .rowsBetween(-1, 0))
    sessions = (time_diff.withColumn("session_id", pyspark.sql.functions.concat_ws("_", "user_id", …

I want to see how to set a timeout for queries in Sequelize.
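I've looked through the Sequelize docs for information, but I can't quite find what I'm looking for. The closest thing I've found is the "pool.acquire" option, but I'm not looking to set a timeout on acquiring a connection. I want a timeout on a query that is already in flight, so that I can short-circuit deadlocks quickly.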
http://docs.sequelizejs.com/class/lib/sequelize.js~Sequelize.html
Here's my sample code:
const db = new Sequelize(database, username, password, {
  host: hostname,
  dialect: "mysql",
  define: {},
  pool: {
    max: 10,
    min: 0,
    idle: 10000
  },
});
Any insight would be greatly appreciated!