I am trying to run a simple GraphFrame example using pyspark.
Spark version: 2.0
graphframes version: 0.2.0
I can import graphframes in Jupyter:
from graphframes import GraphFrame
GraphFrame
graphframes.graphframe.GraphFrame
I get this error when I try to create the GraphFrame object:
---------------------------------------------------------------------------
Py4JJavaError Traceback (most recent call last)
<ipython-input-23-2bf19c66804d> in <module>()
----> 1 gr_links = GraphFrame(df_web_page, df_parent_child_link)
/Users/roopal/software/graphframes-release-0.2.0/python/graphframes/graphframe.pyc in __init__(self, v, e)
60 self._sc = self._sqlContext._sc
61 self._sc._jvm.org.apache.spark.ml.feature.Tokenizer()
---> 62 self._jvm_gf_api = _java_api(self._sc)
63 self._jvm_graph = self._jvm_gf_api.createGraph(v._jdf, e._jdf)
64
/Users/roopal/software/graphframes-release-0.2.0/python/graphframes/graphframe.pyc in _java_api(jsc)
32 def _java_api(jsc):
33 javaClassName = "org.graphframes.GraphFramePythonAPI"
---> 34 return jsc._jvm.Thread.currentThread().getContextClassLoader().loadClass(javaClassName) \
35 .newInstance()
36
/Users/roopal/software/spark-2.0.0-bin-hadoop2.7/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py in __call__(self, *args)
931 answer = self.gateway_client.send_command(command)
932 return_value = get_return_value(
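--> …

Hi, I'm trying to build a simple Wikipedia scraping tool in Python that lets me analyze the text and build a timeline of the events in a person's life. I've searched online for possible approaches, and so far I've been able to retrieve the data using BeautifulSoup and urllib2. The code so far looks like this: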
from bs4 import BeautifulSoup
import urllib2
import re
import nltk
import json
#get source code of page (function used later)
def fetchsource(url):
    source = urllib2.urlopen(url).read()
    return source

if __name__=='__main__':
    #url = "http://en.wikipedia.org/w/index.php?action=raw&title=Tom_Cruise" #works
    url="http://en.wikipedia.org/w/api.php?action=query&prop=revisions&rvprop=content&format=xml&&titles=Tom_Cruise" #works
    print url
    source = fetchsource(url)
    soup = BeautifulSoup(source)
    print soup.prettify()
This works, but the output I get is a bit hard to parse. Is there a better way, or a more manageable syntax, for retrieving the data? Please comment.
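One possible direction, offered only as a sketch: the same API URL can be requested with format=json instead of format=xml, and the response can then be read with the standard json module rather than BeautifulSoup. The helper name extract_wikitext and the sample data below are hypothetical. The code assumes the API's default (legacy) JSON layout, in which each page sits under query → pages and the revision text is stored under the "*" key; if the response is shaped differently, the lookups would need adjusting.

```python
import json


def extract_wikitext(response_text):
    """Return {page title: raw wikitext} from a format=json query response.

    Assumption: the response uses the legacy JSON layout, where the
    revision content is stored under the "*" key of each revision.
    """
    data = json.loads(response_text)
    pages = data.get("query", {}).get("pages", {})
    result = {}
    for page in pages.values():
        revisions = page.get("revisions")
        if revisions:  # missing pages carry no "revisions" entry
            result[page["title"]] = revisions[0]["*"]
    return result


if __name__ == "__main__":
    # Hypothetical, hand-written sample standing in for a real API response.
    sample = json.dumps({
        "query": {
            "pages": {
                "12345": {
                    "title": "Tom Cruise",
                    "revisions": [{"*": "'''Tom Cruise''' is an actor."}],
                }
            }
        }
    })
    print(extract_wikitext(sample))
```

The string returned for each title is still raw wikitext, so building a timeline would need a further parsing step on top of it.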