use*_*011 7 nlp nltk python-3.x pyspark
I want to call NLTK from pyspark on Databricks to do some NLP. I installed NLTK from the Libraries tab in Databricks, so it should be accessible from all nodes.
My py3 code:
import pyspark.sql.functions as F
from pyspark.sql.types import StringType
import nltk
nltk.download('punkt')
def get_keywords1(col):
    sentences = []
    sentence = nltk.sent_tokenize(col)
get_keywords_udf = F.udf(get_keywords1, StringType())
I ran the code above and got:
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data] Package punkt is already up-to-date!
When I run the following code:
t = spark.createDataFrame(
    [(2010, 1, 'rdc', 'a book'), (2010, 1, 'rdc', 'a car'),
     (2007, 6, 'utw', 'a house'), (2007, 6, 'utw', 'a hotel')
    ],
    ("year", "month", "u_id", "objects"))
t1 = t.withColumn('keywords', get_keywords_udf('objects'))
t1.show() # error here !
I get the error:
PythonException:
An exception was thrown from the Python worker. Please see the stack trace below.
Traceback (most recent call last):
LookupError:
**********************************************************************
Resource punkt not found.
Please use the NLTK Downloader to obtain the resource:
>>> import nltk
>>> nltk.download('punkt')
For more information see: https://www.nltk.org/data.html
Attempted to load tokenizers/punkt/PY3/english.pickle
Searched in:
- '/root/nltk_data'
- '/databricks/python/nltk_data'
- '/databricks/python/share/nltk_data'
- '/databricks/python/lib/nltk_data'
- '/usr/share/nltk_data'
- '/usr/local/share/nltk_data'
- '/usr/lib/nltk_data'
- '/usr/local/lib/nltk_data'
- ''
I have already downloaded 'punkt'. It is located at
/root/nltk_data/tokenizers
I have updated PATH in the Spark environment with the folder location.
Why can't it be found?
I tried the solutions in "NLTK. Punkt not found" and "How to config nltk data directory from code?", but neither worked for me.
I tried updating the NLTK search path:
nltk.data.path.append('/root/nltk_data/tokenizers/')
This does not work. It seems nltk cannot see the newly added path!
I also copied punkt into a path that nltk searches:
cp -r /root/nltk_data/tokenizers/punkt /root/nltk_data
However, nltk still cannot see it.
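For readers following the two attempts above, here is a minimal sketch of how the lookup appears to combine each search directory with the resource name `tokenizers/punkt` from the traceback. It is a simplified model written with the standard library, not NLTK's actual implementation; `candidate_locations` is a hypothetical helper name.

```python
import posixpath


def candidate_locations(search_dirs, resource="tokenizers/punkt"):
    """Join each search directory with the resource name.

    Simplified model of the lookup: every search directory is treated
    as a root, and the resource name is appended to it.
    """
    return [posixpath.join(d, resource) for d in search_dirs]


# A root directory, as listed under "Searched in:" in the traceback.
print(candidate_locations(["/root/nltk_data"]))

# The path appended in the post: "tokenizers" ends up in the result twice.
print(candidate_locations(["/root/nltk_data/tokenizers/"]))
```

Under this model, appending `/root/nltk_data/tokenizers/` points the lookup one level too deep, and the `cp` command places the data at `/root/nltk_data/punkt`, which does not match `tokenizers/punkt` under any listed root. Neither attempt would therefore be expected to change the result.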
Thanks
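One possible explanation, which the post itself does not confirm, is that `nltk.download('punkt')` ran only in the driver process, while the UDF executes in Python worker processes that may sit on other nodes with their own `/root/nltk_data`. The sketch below checks for the resource inside the function the UDF calls and downloads it only when the lookup fails. `ensure_punkt` and `get_sentences` are hypothetical names, and the approach assumes the workers have network access.

```python
def ensure_punkt(nltk_module):
    """Download punkt only if the current process cannot find it."""
    try:
        nltk_module.data.find("tokenizers/punkt")
    except LookupError:
        # quiet=True suppresses the downloader's progress messages.
        nltk_module.download("punkt", quiet=True)


def get_sentences(text):
    """Split text into sentences, joined with ' | ' to fit a StringType column."""
    # Imported inside the function so that the import and the check
    # run in whichever process actually executes the UDF.
    import nltk

    ensure_punkt(nltk)
    return " | ".join(nltk.sent_tokenize(text))
```

With the imports from the post, this could be registered as `F.udf(get_sentences, StringType())`. Checking on every call adds some overhead, so a cluster-level setup step that installs the data on each node may be preferable where one is available.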