Hi Earthlings! I am using Airflow to schedule and run Spark tasks. The only thing I have found so far is Python DAGs that Airflow can manage.
DAG example:
spark_count_lines.py
import logging
from airflow import DAG
from airflow.operators import PythonOperator
from datetime import datetime
args = {
    'owner': 'airflow'
    , 'start_date': datetime(2016, 4, 17)
    , 'provide_context': True
}
dag = DAG(
    'spark_count_lines'
    , start_date = datetime(2016, 4, 17)
    , schedule_interval = '@hourly'
    , default_args = args
)
def run_spark(**kwargs):
    import pyspark
    sc = pyspark.SparkContext()
    df = sc.textFile('file:///opt/spark/current/examples/src/main/resources/people.txt')
    logging.info('Number of lines in people.txt = {0}'.format(df.count()))
    sc.stop()
t_main = PythonOperator(
    task_id = 'call_spark'
    , dag = dag
    , python_callable = run_spark  # the source snippet is cut off here; completing the call with run_spark defined above
)

I recently configured Airflow to execute my tasks. I have a master node and 2 workers that run the tasks. I want to monitor my cluster with Graphite and Grafana. So far I have installed Graphite and Grafana on the master node and tested them with simple bash commands. Now I want to monitor the cluster while Airflow is executing tasks. I created a metrics.properties file and put it into spark/conf:
# Enable Graphite
*.sink.graphite.class=org.apache.spark.metrics.sink.GraphiteSink
*.sink.graphite.host=192.168.2.241
*.sink.graphite.port=2003
*.sink.graphite.period=10
# Enable jvm source for instance master, worker, driver and executor
master.source.jvm.class=org.apache.spark.metrics.source.JvmSource
worker.source.jvm.class=org.apache.spark.metrics.source.JvmSource
driver.source.jvm.class=org.apache.spark.metrics.source.JvmSource
executor.source.jvm.class=org.apache.spark.metrics.source.JvmSource
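As a sanity check that the host and port above are reachable, a datapoint can be pushed over carbon's plaintext protocol; a minimal sketch in Python, with a made-up metric name:

import socket
import time

# carbon's plaintext protocol: "<metric.path> <value> <unix-timestamp>\n" on port 2003
sock = socket.create_connection(('192.168.2.241', 2003), timeout=5)
sock.sendall('test.airflow.spark.lines 1 {0}\n'.format(int(time.time())).encode())
sock.close()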
I added the following flags to my spark-submit:
--files=/path/to/metrics.properties \
--conf spark.metrics.conf=metrics.properties
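For completeness, here is a minimal sketch of how those flags could be attached to a task in the DAG above, assuming the job is launched through spark-submit from a BashOperator instead of the in-process SparkContext (the import style follows the example above; the application path and task_id are placeholders):

from airflow.operators import BashOperator  # hypothetical alternative to the PythonOperator task above

# spark-submit ships metrics.properties to the driver/executors and points spark.metrics.conf at it
spark_submit_cmd = (
    'spark-submit '
    '--files /path/to/metrics.properties '
    '--conf spark.metrics.conf=metrics.properties '
    '/path/to/the_spark_job.py'  # placeholder application
)

t_submit = BashOperator(
    task_id = 'call_spark_submit'
    , bash_command = spark_submit_cmd
    , dag = dag
)

With this route the driver and executors receive the sink settings at start-up; the SparkContext created inside run_spark, by contrast, would only read metrics.properties if it sits in spark/conf on the worker that executes the task.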
After opening the Graphite UI, all I can find is Graphite -> carbon -> agents -> cluster1-a and some graphs. I am sure it is monitoring something else, not my Airflow cluster.
Maybe I need to install grafana-spark-dashboards? But that project is all about YARN, and I am using Airflow as the management system.
Or should I add a block to Carbon's storage-schemas.conf?
This block is what currently shows up in the Graphite dashboard:
[carbon]
pattern = ^carbon\.
retentions = 60:90d
Which metrics can I check that Spark sends to Graphite …
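One way to see what has actually arrived in Graphite is to query the Graphite-web /metrics/find API; a minimal sketch, assuming Graphite-web runs next to carbon on the master node (host and port are placeholders, only carbon itself listens on 2003):

import requests

# Spark's GraphiteSink names metrics with a prefix such as <spark.app.id>.driver.* or <spark.app.id>.executor.*,
# so any top-level entry besides carbon.* listed here is likely coming from Spark.
GRAPHITE_WEB = 'http://192.168.2.241:8080'  # placeholder: the Graphite-web port, not the carbon receiver on 2003

resp = requests.get(GRAPHITE_WEB + '/metrics/find', params={'query': '*'})
resp.raise_for_status()
for node in resp.json():
    print(node['id'])  # top-level metric namespaces currently stored in Graphite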