How to use global variables in a PySpark function

yun*_*ula 3 python-3.x apache-spark pyspark

First, I have two variables at the beginning of my code:

numericColumnNames = []
categoricalColumnsNames = []

Then, in the main method, I assign values to them:

def main():
  #clickRDD = sc.textFile("s3a://wer-display-ads/day_0_1000.csv")
  clickRDD = sc.textFile("data/day_0_1000.csv")
  numericColumnNames, categoricalColumnsNames = getColumnStructure()

But when I then try to use these variables in the following function, they have not been updated and are empty:

def dataToVectorForLinear(clickDF):
  print(categoricalColumnsNames)  # why is this list empty?
  clickDF = oneHotEncoding(clickDF, categoricalColumnsNames)

Unfortunately, I can't figure out what's wrong. Thanks for your help!

Moh*_*hif 7

Just declare them with the `global` keyword inside each function, like this:

def main():

    global numericColumnNames
    global categoricalColumnsNames

    clickRDD = sc.textFile("data/day_0_1000.csv")
    numericColumnNames, categoricalColumnsNames = getColumnStructure()

Similarly:

def dataToVectorForLinear(clickDF):

    global categoricalColumnsNames
    print(categoricalColumnsNames)
    clickDF = oneHotEncoding(clickDF, categoricalColumnsNames)
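The reason this works is Python's scoping rules, not anything Spark-specific: assigning to a name inside a function creates a new *local* variable unless the name is declared `global`, so `main()` was writing to locals and leaving the module-level lists empty. A minimal sketch (plain Python, no Spark, with made-up variable names) illustrating the difference:

```python
numeric_names = []  # module-level, like numericColumnNames above

def set_without_global():
    # Assignment creates a new local variable;
    # the module-level list is untouched.
    numeric_names = ["price", "qty"]

def set_with_global():
    global numeric_names
    # Now assignment rebinds the module-level name.
    numeric_names = ["price", "qty"]

set_without_global()
print(numeric_names)  # still []

set_with_global()
print(numeric_names)  # ['price', 'qty']
```

Note that `global` is strictly required only where you *assign* to the name (here, `main()`); a function that only reads the variable, like `dataToVectorForLinear`, will find it through normal scope lookup, though the extra declaration is harmless. Also be aware that if such a variable is referenced inside an RDD/DataFrame closure that Spark ships to executors, each task gets a serialized copy of its value at submission time, so later driver-side updates won't be visible there.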

Reference: