NLTK raises "Resource punkt not found" when called from a PySpark UDF on Databricks

use*_*011 7 nlp nltk python-3.x pyspark

I want to use NLTK from PySpark on Databricks to do some NLP. I installed NLTK from the Libraries tab in Databricks, so it should be accessible from all nodes.

My Python 3 code:

 import pyspark.sql.functions as F
 from pyspark.sql.types import StringType
 import nltk
 nltk.download('punkt')
 

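 def get_keywords1(col):
     # Split the input text into sentences (needs the punkt tokenizer data).
     sentences = nltk.sent_tokenize(col)
     # Return a string so the result matches the UDF's declared StringType.
     return ", ".join(sentences)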

 get_keywords_udf = F.udf(get_keywords1, StringType())

Running the code above prints:

 [nltk_data] Downloading package punkt to /root/nltk_data...
 [nltk_data]   Package punkt is already up-to-date!

When I then run the following code:

 t = spark.createDataFrame(
 [(2010, 1, 'rdc', 'a book'), (2010, 1, 'rdc','a car'),
  (2007, 6, 'utw', 'a house'), (2007, 6, 'utw','a hotel')
 ], 
 ("year", "month", "u_id", "objects"))
 
 t1 = t.withColumn('keywords', get_keywords_udf('objects'))
 t1.show() # error here !

I get this error:

 PythonException: 
  An exception was thrown from the Python worker. Please see the stack trace below.
 Traceback (most recent call last):
  
 LookupError: 
 **********************************************************************
 Resource punkt not found.
 Please use the NLTK Downloader to obtain the resource:

 >>> import nltk
 >>> nltk.download('punkt')

 For more information see: https://www.nltk.org/data.html

 Attempted to load tokenizers/punkt/PY3/english.pickle

 Searched in:
 - '/root/nltk_data'
 - '/databricks/python/nltk_data'
 - '/databricks/python/share/nltk_data'
 - '/databricks/python/lib/nltk_data'
 - '/usr/share/nltk_data'
 - '/usr/local/share/nltk_data'
 - '/usr/lib/nltk_data'
 - '/usr/local/lib/nltk_data'
 - ''

I have already downloaded punkt. It is located at:

 /root/nltk_data/tokenizers

I have also updated the PATH in the Spark environment with that folder location.

Why can't it be found?

I tried the solutions from "NLTK. Punkt not found" and "How to config nltk data directory from code?", but neither worked for me.

I also tried extending the NLTK data path:

 nltk.data.path.append('/root/nltk_data/tokenizers/')

This does not work. It seems nltk cannot see the newly added path!

I also copied punkt into a path that nltk searches:

 cp -r /root/nltk_data/tokenizers/punkt /root/nltk_data

However, nltk still cannot see it.
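
For reference, NLTK resolves a resource such as `tokenizers/punkt` by joining it onto each base directory in `nltk.data.path`, so an entry has to be the directory that contains `tokenizers`, not `tokenizers` itself. The sketch below is a simplified stand-in for that lookup, not NLTK's own code (it ignores zip archives, for instance). `find_resource` and the temporary directories are hypothetical names used only for illustration.

```python
import os
import tempfile


def find_resource(resource, search_paths):
    """Simplified lookup: join each base directory with the resource name."""
    for base in search_paths:
        candidate = os.path.join(base, *resource.split("/"))
        if os.path.exists(candidate):
            return candidate
    raise LookupError(f"Resource {resource} not found in {search_paths}")


# Throwaway layout mirroring /root/nltk_data/tokenizers/punkt.
root = tempfile.mkdtemp()
data_dir = os.path.join(root, "nltk_data")
os.makedirs(os.path.join(data_dir, "tokenizers", "punkt"))

# Layout mirroring the result of copying punkt directly under a base directory.
flat_dir = os.path.join(root, "flat")
os.makedirs(os.path.join(flat_dir, "punkt"))

# The base directory that contains "tokenizers" resolves correctly.
print("found:", find_resource("tokenizers/punkt", [data_dir]))

# Neither the "tokenizers" subdirectory nor the flat copy can resolve the
# resource, because the lookup would need <base>/tokenizers/punkt to exist.
for bad_base in (os.path.join(data_dir, "tokenizers"), flat_dir):
    try:
        find_resource("tokenizers/punkt", [bad_base])
    except LookupError as exc:
        print("not found:", exc)
```

Under this reading, neither appending `/root/nltk_data/tokenizers/` nor copying `punkt` straight into `/root/nltk_data` adds a location where `tokenizers/punkt` can resolve.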

Thanks

小智 -1

This helped me solve the problem:

 import nltk
 nltk.download('all')
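
A possible explanation, offered as an assumption rather than something this thread confirms: a module-level `nltk.download(...)` runs on the driver, whereas a Python UDF runs in worker processes on the executors, which may not have the data under `/root/nltk_data`. A minimal sketch of checking for the resource inside the function that runs on the worker follows. `make_worker_tokenizer` and its parameters are hypothetical names; the NLTK functions are passed in as arguments so the helper can be tried without a cluster.

```python
def make_worker_tokenizer(find, download, tokenize,
                          resource="tokenizers/punkt", package="punkt"):
    """Build a function that ensures `package` is present before tokenizing.

    `find`, `download`, and `tokenize` are expected to behave like
    nltk.data.find, nltk.download, and nltk.sent_tokenize respectively.
    """
    def tokenize_on_worker(text):
        # Spark passes None for NULL column values.
        if text is None:
            return None
        try:
            # Raises LookupError when the resource is missing on this node.
            find(resource)
        except LookupError:
            # Fetch the data on the node where the function is executing.
            download(package)
        # Join the sentences so the result fits a StringType column.
        return " | ".join(tokenize(text))

    return tokenize_on_worker
```

On a cluster this would presumably be wired up as `F.udf(make_worker_tokenizer(nltk.data.find, nltk.download, nltk.sent_tokenize), StringType())`. Whether the workers can reach the download server depends on the cluster's network setup, so this may not apply everywhere.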