Ser*_*nov 6 hadoop hadoop-yarn apache-spark pyspark
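On our Hadoop cluster running under YARN, we have the problem that some "clever" people can grab a much larger share of resources by configuring their Spark jobs in pySpark Jupyter notebooks like this: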
from pyspark import SparkConf, SparkContext

# Requests 1000 executors with 64 GB of memory each
conf = (SparkConf()
        .setAppName("name")
        .setMaster("yarn-client")
        .set("spark.executor.instances", "1000")
        .set("spark.executor.memory", "64g")
        )
sc = SparkContext(conf=conf)
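This leads to a situation where these people literally squeeze out others who are less "clever".
Is there a way to forbid users from allocating resources themselves and leave resource allocation solely to YARN?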
YARN has very good support for capacity planning in multi-tenant clusters through queues, and the YARN ResourceManager uses the CapacityScheduler by default.

Here, for demonstration, we pass alpha as the queue name to spark-submit.

$ ./bin/spark-submit --class path/to/class/file \
    --master yarn-cluster \
    --queue alpha \
    jar/location \
    args
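
Since the question configures Spark from a notebook rather than through spark-submit, the same queue can be selected there with the `spark.yarn.queue` property, which is what `--queue` sets. Below is a minimal sketch, assuming the alpha queue from the command above. `queue_settings` and `apply_settings` are hypothetical helper names, and `apply_settings` accepts any object with a `set(key, value)` method so it can be tried without a cluster:

```python
def queue_settings(queue):
    """Return Spark properties that route an application to a YARN queue."""
    return {
        "spark.master": "yarn",
        "spark.submit.deployMode": "client",  # a notebook keeps the driver local
        "spark.yarn.queue": queue,            # same effect as --queue on spark-submit
    }


def apply_settings(conf, settings):
    """Copy each property onto a SparkConf-like object and return it."""
    for key, value in settings.items():
        conf.set(key, value)
    return conf


# In a notebook with pyspark installed (not executed here):
#   from pyspark import SparkConf, SparkContext
#   conf = apply_settings(SparkConf().setAppName("name"), queue_settings("alpha"))
#   sc = SparkContext(conf=conf)
```

This only chooses the queue. The limits themselves come from the scheduler configuration on the cluster side.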

Setting up the queues:
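
CapacityScheduler has a predefined queue called root. All queues in the system are children of the root queue. In capacity-scheduler.xml, the parameter yarn.scheduler.capacity.root.queues is used to define the child queues. For example, to create 3 queues, specify the queue names as a comma-separated list.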

<property>
  <name>yarn.scheduler.capacity.root.queues</name>
  <value>alpha,beta,default</value>
  <description>The queues at this level (root is the root queue).</description>
</property>
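
Once the child queues are declared, each one needs its own capacity, and the capacities at a given level must add up to 100. The following is a rough sketch of how a configuration could be checked for that before deploying it. The beta and default capacities (30 and 20) are made-up values for illustration only, and `load_properties` and `child_capacities` are hypothetical helper names:

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: the beta and default capacities are invented
# values chosen only so that the example adds up to 100.
CONFIG = """
<configuration>
  <property>
    <name>yarn.scheduler.capacity.root.queues</name>
    <value>alpha,beta,default</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.alpha.capacity</name>
    <value>50</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.beta.capacity</name>
    <value>30</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.default.capacity</name>
    <value>20</value>
  </property>
</configuration>
"""

PREFIX = "yarn.scheduler.capacity."


def load_properties(xml_text):
    """Map each <name> to its <value> in a Hadoop-style configuration."""
    root = ET.fromstring(xml_text.strip())
    return {p.findtext("name"): p.findtext("value") for p in root.iter("property")}


def child_capacities(props, parent="root"):
    """Return {queue: capacity} for the child queues declared under parent."""
    names = [q.strip() for q in props[f"{PREFIX}{parent}.queues"].split(",")]
    return {q: float(props[f"{PREFIX}{parent}.{q}.capacity"]) for q in names}


capacities = child_capacities(load_properties(CONFIG))
total = sum(capacities.values())
print(capacities, "sum =", total)
```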

These are a few important properties to consider when doing capacity planning.

<property>
  <name>yarn.scheduler.capacity.root.alpha.capacity</name>
  <value>50</value>
  <description>Queue capacity in percentage (%) as a float (e.g. 12.5). The sum of capacities for all queues, at each level, must be equal to 100. Applications in the queue may consume more resources than the queue's capacity if there are free resources, providing elasticity.</description>
</property>

<property>
  <name>yarn.scheduler.capacity.root.alpha.maximum-capacity</name>
  <value>80</value>
  <description>Maximum queue capacity in percentage (%) as a float. This limits the elasticity for applications in the queue. Defaults to -1 which disables it.</description>
</property>

<property>
  <name>yarn.scheduler.capacity.root.alpha.minimum-user-limit-percent</name>
  <value>10</value>
  <description>Each queue enforces a limit on the percentage of resources allocated to a user at any given time, if there is demand for resources. The user limit can vary between a minimum and maximum value. The former (the minimum value) is set to this property value and the latter (the maximum value) depends on the number of users who have submitted applications. For example, suppose the value of this property is 25. If two users have submitted applications to a queue, no single user can use more than 50% of the queue resources. If a third user submits an application, no single user can use more than 33% of the queue resources. With 4 or more users, no user can use more than 25% of the queue's resources. A value of 100 implies no user limits are imposed. The default is 100. Value is specified as an integer.</description>
</property>
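
The per-user arithmetic in the last description can be sketched as below. This is a simplification of the rule as the description states it, not the scheduler's actual implementation, and `max_user_share` is a hypothetical name:

```python
def max_user_share(min_user_limit_percent, active_users):
    """Upper bound (in % of the queue) for one user, per the description above.

    Resources are split evenly among active users, but the share never
    drops below the configured minimum percentage.
    """
    if active_users < 1:
        raise ValueError("active_users must be at least 1")
    return max(100.0 / active_users, float(min_user_limit_percent))


# Reproduce the example from the description (property value 25).
for users in (2, 3, 4, 5):
    print(users, "users:", round(max_user_share(25, users), 1), "% each at most")
```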

Link: YARN CapacityScheduler Queue Properties