I have a simple program in Spark:
/* SimpleApp.scala */
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf

object SimpleApp {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setMaster("spark://10.250.7.117:7077")
      .setAppName("Simple Application")
      .set("spark.cores.max", "2")
    val sc = new SparkContext(conf)
    val ratingsFile = sc.textFile("hdfs://hostname:8020/user/hdfs/mydata/movieLens/ds_small/ratings.csv")

    // print the first 10 records
    println("Getting the first 10 records: ")
    ratingsFile.take(10).foreach(println)

    // print the number of records in the movie ratings file
    println("The number of records in the movie list is: " + ratingsFile.count())
  }
}
When I try to run this program from spark-shell, i.e. I log in to the name node (a Cloudera installation) and run the commands sequentially in spark-shell:
val ratingsFile = sc.textFile("hdfs://hostname:8020/user/hdfs/mydata/movieLens/ds_small/ratings.csv")
println("Getting the first 10 records: ") …Run Code Online (Sandbox Code Playgroud) 我有一个 python 包,如下所示:
package/
├── __init__.py
├── PyMySQL-0.7.6-py2.7.egg
├── pymysql
├── PyMySQL-0.7.x.pth
└── tests.py

The folder structure cannot be changed, because it comes from a third-party library.
The contents of the .pth file are:
import sys; sys.__plen = len(sys.path)
./PyMySQL-0.7.6-py2.7.egg
import sys; new=sys.path[sys.__plen:]; del sys.path[sys.__plen:]; p=getattr(sys,'__egginsert',0); sys.path[p:p]=new; sys.__egginsert = p+len(new)

What is the best way to include pymysql in tests.py?
Obviously I can't use from PyMySQL-0.7.6-py2.7.egg here, since the folder name contains dots.
P.S. The absolute path is unknown, because this code is to be deployed to AWS Lambda.
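One possible approach (a minimal sketch, not an officially endorsed mechanism): since the egg sits in the same directory as tests.py, its location can be computed from __file__ at runtime and prepended to sys.path; a pure-Python egg such as PyMySQL's is zip-importable, so pymysql then resolves normally:

import os
import sys

# Sketch: locate the bundled egg relative to this file, since the
# absolute path is unknown on AWS Lambda. Assumes the egg is pure
# Python (zip-importable), which PyMySQL is.
_HERE = os.path.dirname(os.path.abspath(__file__))
sys.path.insert(0, os.path.join(_HERE, "PyMySQL-0.7.6-py2.7.egg"))

import pymysql  # now resolvable from the egg via zipimport

This sidesteps both the dotted folder name (the egg is never named in an import statement) and the unknown deployment path (everything is resolved relative to tests.py itself).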