Tag: johnsnowlabs-spark-nlp

Do Spark-NLP pretrained pipelines only work on Linux systems?

I am trying to set up a simple piece of code where I pass in a DataFrame and test it with the pretrained explain pipeline provided by the John Snow Labs Spark-NLP library. I am using Jupyter notebooks from Anaconda and have a Spark Scala kernel set up with Apache Toree. Every time I run the step that should load the pretrained pipeline, it throws a TensorFlow error. Is there a way to run this locally on Windows?

I tried this in a Maven project earlier and the same error occurred. A colleague tried it on a Linux system and it worked. Below is the code I have tried and the error it gave.


import org.apache.spark.ml.PipelineModel
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
import com.johnsnowlabs.nlp.SparkNLP
import org.apache.spark.sql.SparkSession

val spark: SparkSession = SparkSession
    .builder()
    .appName("test")
    .master("local[*]")
    .config("spark.driver.memory", "4G")
    .config("spark.kryoserializer.buffer.max", "200M")
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") …
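
For context, loading a pretrained pipeline against the session above would look roughly like this (a minimal sketch: "explain_document_ml" is one of the pipeline names published by John Snow Labs, the download requires internet access, and df is assumed to be a DataFrame with a "text" column):

// Download the pretrained pipeline (cached locally after the first run)
val pipeline = PretrainedPipeline("explain_document_ml", lang = "en")

// Annotate a single string ...
val annotations = pipeline.annotate("John Snow Labs Spark-NLP runs on Apache Spark")

// ... or transform a DataFrame that has a "text" column
val annotated = pipeline.transform(df)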

johnsnowlabs-spark-nlp

7 votes · 1 answer · 1186 views

Unable to download pipelines provided by the spark-nlp library

I am unable to use the predefined pipeline "recognize_entities_dl" provided by the spark-nlp library.

I have tried installing different versions of pyspark and the spark-nlp library.

import sparknlp
from sparknlp.pretrained import PretrainedPipeline

#create or get Spark Session

spark = sparknlp.start()

sparknlp.version()
spark.version

#download, load, and annotate a text by pre-trained pipeline

pipeline = PretrainedPipeline('recognize_entities_dl', lang='en')
result = pipeline.annotate('Harry Potter is a great movie')

2.1.0
recognize_entities_dl download started this may take some time.
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-13-b71a0f77e93a> in <module>
     11 #download, load, and annotate a text by pre-trained pipeline
     12 
---> 13 pipeline = PretrainedPipeline('recognize_entities_dl', …

python apache-spark johnsnowlabs-spark-nlp

7 votes · 1 answer · 2050 views

How should we use setDictionary with the Lemmatizer annotator in Spark-NLP?

I have a requirement where I must add a dictionary in the lemmatization step. When I try to use it in a pipeline and run pipeline.fit(), I get an ArrayIndexOutOfBoundsException. What is the correct way to implement this? Is there any example?

I pass token as the lemmatizer's input column and lemma as its output column. Below is my code:

import com.johnsnowlabs.nlp.DocumentAssembler
import com.johnsnowlabs.nlp.annotator.{SentenceDetector, Tokenizer, Lemmatizer}
import com.johnsnowlabs.nlp.util.io.ExternalResource
import org.apache.spark.ml.Pipeline

// DocumentAssembler annotator
val document = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

// SentenceDetector annotator
val sentenceDetector = new SentenceDetector()
    .setInputCols("document")
    .setOutputCol("sentence")

// Tokenizer annotator
val token = new Tokenizer()
    .setInputCols("sentence")
    .setOutputCol("token")

// Lemmatizer annotator with an external dictionary
val lemmatizer = new Lemmatizer()
    .setInputCols(Array("token"))
    .setOutputCol("lemma")
    .setDictionary(ExternalResource("C:/data/notebook/lemmas001.txt", "LINE_BY_LINE", Map("keyDelimiter" -> ",", "valueDelimiter" -> "|")))

val pipeline = new Pipeline().setStages(Array(document, sentenceDetector, token, lemmatizer))
val result = pipeline.fit(df).transform(df)

The error message is:

    Name: java.lang.ArrayIndexOutOfBoundsException
    Message: 1
    StackTrace:   at com.johnsnowlabs.nlp.util.io.ResourceHelper$$anonfun$flattenRevertValuesAsKeys$1$$anonfun$apply$14.apply(ResourceHelper.scala:315)
      at com.johnsnowlabs.nlp.util.io.ResourceHelper$$anonfun$flattenRevertValuesAsKeys$1$$anonfun$apply$14.apply(ResourceHelper.scala:312)
      at scala.collection.Iterator$class.foreach(Iterator.scala:891)
      at scala.collection.AbstractIterator.foreach(Iterator.scala:1334)
      at …
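
For what it's worth, the Message: 1 above is consistent with flattenRevertValuesAsKeys failing to find the key delimiter on some dictionary line (element 1 of the split does not exist). With keyDelimiter "," and valueDelimiter "|", each line of the dictionary is expected to map one lemma to its surface forms, along these lines (an illustrative sample, not the actual contents of lemmas001.txt):

pick,picks|picking|picked
be,am|are|is|was|were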

johnsnowlabs-spark-nlp

6 votes · 1 answer · 1103 views

Unable to import sparknlp after installing sparknlp

The following runs successfully on a Cloudera CDSW cluster gateway.

import pyspark
from pyspark.sql import SparkSession
spark = (SparkSession
            .builder
            .config("spark.jars.packages","JohnSnowLabs:spark-nlp:1.2.3")
            .getOrCreate()
         )

It produces this output.

Ivy Default Cache set to: /home/cdsw/.ivy2/cache
The jars for the packages stored in: /home/cdsw/.ivy2/jars
:: loading settings :: url = jar:file:/opt/cloudera/parcels/SPARK2-2.2.0.cloudera1-1.cdh5.12.0.p0.142354/lib/spark2/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
JohnSnowLabs#spark-nlp added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent;1.0
    confs: [default]
    found JohnSnowLabs#spark-nlp;1.2.3 in spark-packages
    found com.typesafe#config;1.3.0 in central
    found org.fusesource.leveldbjni#leveldbjni-all;1.8 in central
downloading http://dl.bintray.com/spark-packages/maven/JohnSnowLabs/spark-nlp/1.2.3/spark-nlp-1.2.3.jar ...
    [SUCCESSFUL ] JohnSnowLabs#spark-nlp;1.2.3!spark-nlp.jar (3357ms)
downloading https://repo1.maven.org/maven2/com/typesafe/config/1.3.0/config-1.3.0.jar ...
    [SUCCESSFUL ] com.typesafe#config;1.3.0!config.jar(bundle) (348ms)
downloading https://repo1.maven.org/maven2/org/fusesource/leveldbjni/leveldbjni-all/1.8/leveldbjni-all-1.8.jar ... …

apache-spark pyspark apache-spark-mllib spark-packages johnsnowlabs-spark-nlp

5 votes · 2 answers · 4297 views

spark-nlp: DocumentAssembler initialization fails with "java.lang.NoClassDefFoundError: org/apache/spark/ml/util/MLWritable$class"

I am trying the ContextAwareSpellChecker described at https://medium.com/spark-nlp/applying-context-aware-spell-checking-in-spark-nlp-3c29c46963bc

The first component in the pipeline is the DocumentAssembler.

from sparknlp.annotator import *
from sparknlp.base import *
import sparknlp


spark = sparknlp.start()
documentAssembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

When run, the above code fails as follows:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\pab\AppData\Local\Continuum\anaconda3.7\envs\MailChecker\lib\site-packages\pyspark\__init__.py", line 110, in wrapper
    return func(self, **kwargs)
  File "C:\Users\pab\AppData\Local\Continuum\anaconda3.7\envs\MailChecker\lib\site-packages\sparknlp\base.py", line 148, in __init__
    super(DocumentAssembler, self).__init__(classname="com.johnsnowlabs.nlp.DocumentAssembler")
  File "C:\Users\pab\AppData\Local\Continuum\anaconda3.7\envs\MailChecker\lib\site-packages\pyspark\__init__.py", line 110, in wrapper
    return func(self, **kwargs)
  File "C:\Users\pab\AppData\Local\Continuum\anaconda3.7\envs\MailChecker\lib\site-packages\sparknlp\internal.py", line 72, in __init__
    self._java_obj = self._new_java_obj(classname, self.uid)
  File "C:\Users\pab\AppData\Local\Continuum\anaconda3.7\envs\MailChecker\lib\site-packages\pyspark\ml\wrapper.py", line 69, in _new_java_obj
    return …

python apache-spark pyspark johnsnowlabs-spark-nlp

5 votes · 1 answer · 1674 views

How to load a spark-nlp pretrained model from disk

I downloaded a .zip file containing a pretrained NerCRFModel from the spark-nlp GitHub page. The zip contains three folders: embeddings, fields, and metadata.

How do I load it into a NerCrfModel in Scala so that I can use it? Do I have to put it in HDFS or on the host where the Spark shell is started? How do I reference it?
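
A hedged sketch of the usual approach: unzip the archive and point the model's static load method at the extracted folder. In local mode a plain local path is enough; on a cluster the path has to be reachable by the driver and executors (for example on HDFS). The path below is a placeholder:

import com.johnsnowlabs.nlp.annotators.ner.crf.NerCrfModel

// Folder containing the unzipped embeddings/, fields/ and metadata/ directories
val nerModel = NerCrfModel.load("/path/to/unzipped/ner_crf_model")

Once loaded, the model behaves like any other pipeline stage; NerCrfModel expects document, token, POS, and word-embeddings annotations upstream.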

nlp scala apache-spark apache-spark-mllib johnsnowlabs-spark-nlp

3 votes · 1 answer · 1244 views

How to use the JohnSnowLabs NLP spell-correction module NorvigSweetingModel?

I went through the JohnSnowLabs SpellChecker here.

I found Norvig's algorithm implementation there, and the example section has only the following two lines:

import com.johnsnowlabs.nlp.annotator.NorvigSweetingModel
NorvigSweetingModel.pretrained()

How can I apply this pretrained model to the dataframe (df) below to correct the spelling of the "names" column?

+----------------+---+------------+
|           names|age|       color|
+----------------+---+------------+
|      [abc, cde]| 19|    red, abc|
|[eefg, efa, efb]|192|efg, efz efz|
+----------------+---+------------+

I tried doing it like this:

val schk = NorvigSweetingModel.pretrained().setInputCols("names").setOutputCol("Corrected")

val cdf = schk.transform(df)

But the above code gave me the following error:

java.lang.IllegalArgumentException: requirement failed: Wrong or missing inputCols annotators in SPELL_a1f11bacb851. Received inputCols: names. Make sure such columns have following annotator types: token
  at scala.Predef$.require(Predef.scala:224)
  at com.johnsnowlabs.nlp.AnnotatorModel.transform(AnnotatorModel.scala:51)
  ... 49 elided
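
The error message itself points at the fix: NorvigSweetingModel consumes token annotations, so its input column has to be produced by an upstream DocumentAssembler and Tokenizer rather than being a raw string or array column. A minimal sketch (assuming the names array is first concatenated into a plain string column called "text"; the concat_ws step and the "corrected" column name are illustrative choices):

import com.johnsnowlabs.nlp.DocumentAssembler
import com.johnsnowlabs.nlp.annotator.{Tokenizer, NorvigSweetingModel}
import org.apache.spark.ml.Pipeline
import org.apache.spark.sql.functions.{col, concat_ws}

// Flatten the array column into plain text (illustrative preprocessing step)
val textDf = df.withColumn("text", concat_ws(" ", col("names")))

val document = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val token = new Tokenizer()
  .setInputCols("document")
  .setOutputCol("token")

// The spell checker reads token annotations, as the error message requires
val spell = NorvigSweetingModel.pretrained()
  .setInputCols("token")
  .setOutputCol("corrected")

val pipeline = new Pipeline().setStages(Array(document, token, spell))
val corrected = pipeline.fit(textDf).transform(textDf)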

nlp scala apache-spark apache-spark-ml johnsnowlabs-spark-nlp

3 votes · 1 answer · 946 views