I am trying to delete stop words via Spark; the code is as follows:
from nltk.corpus import stopwords
from pyspark.context import SparkContext
from pyspark.sql.session import SparkSession

sc = SparkContext('local')
spark = SparkSession(sc)

word_list = ["ourselves", "out", "over", "own", "same", "shan't", "she", "she'd", "what", "the", "fuck", "is", "this", "world", "too", "who", "who's", "whom", "yours", "yourself", "yourselves"]
wordlist = spark.createDataFrame([word_list]).rdd

def stopwords_delete(word_list):
    filtered_words = []
    print(word_list)
    for word in word_list:
        print(word)
        if word not in stopwords.words('english'):
            filtered_words.append(word)

filtered_words = wordlist.map(stopwords_delete)
print(filtered_words)
and I got the following error:
pickle.PicklingError: args[0] from __newobj__ args has the …
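
For reference, a minimal working sketch, assuming the error comes from the closure capturing NLTK's lazy corpus reader (and that the nltk stopwords corpus has already been downloaded): materialize the stopwords into a plain Python set on the driver and broadcast it, so the workers never touch the corpus reader. Names like stop_bc below are illustrative, not from the original code.

from nltk.corpus import stopwords
from pyspark.sql import SparkSession

spark = SparkSession.builder.master('local').getOrCreate()
sc = spark.sparkContext

word_list = ["ourselves", "out", "over", "own", "the", "is", "this", "world"]

# A plain set pickles cleanly, unlike the lazy NLTK corpus reader object.
stop_set = set(stopwords.words('english'))
stop_bc = sc.broadcast(stop_set)

words = sc.parallelize(word_list)
filtered = words.filter(lambda w: w not in stop_bc.value)
print(filtered.collect())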
I have a dataframe like this:
Date PlumeO Distance
2014-08-13 13:48:00 754.447905 5.844577
2014-08-13 13:48:00 754.447905 6.888653
2014-08-13 13:48:00 754.447905 6.938860
2014-08-13 13:48:00 754.447905 6.977284
2014-08-13 13:48:00 754.447905 6.946430
2014-08-13 13:48:00 754.447905 6.345506
2014-08-13 13:48:00 754.447905 6.133567
2014-08-13 13:48:00 754.447905 5.846046
2014-08-13 16:59:00 754.447905 6.345506
2014-08-13 16:59:00 754.447905 6.694847
2014-08-13 16:59:00 754.447905 5.846046
2014-08-13 16:59:00 754.447905 6.977284
2014-08-13 16:59:00 754.447905 6.938860
2014-08-13 16:59:00 754.447905 5.844577
2014-08-13 16:59:00 754.447905 6.888653
2014-08-13 16:59:00 754.447905 6.133567
2014-08-13 16:59:00 754.447905 6.946430
I am trying to keep, for each date, the row with the smallest distance, that is, drop duplicate dates and keep only the minimum distance.
Is there a way to achieve this in pandas, e.g. with df.drop_duplicates,
or am I stuck finding the smallest distance with if statements?
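
For what it's worth, a minimal sketch of two common pandas idioms for this, assuming the columns are named Date, PlumeO, and Distance as shown above (the small frame below just reproduces a few rows of the data for illustration):

import pandas as pd

df = pd.DataFrame({
    "Date": pd.to_datetime(["2014-08-13 13:48:00", "2014-08-13 13:48:00",
                            "2014-08-13 16:59:00", "2014-08-13 16:59:00"]),
    "PlumeO": [754.447905] * 4,
    "Distance": [5.844577, 6.888653, 6.345506, 5.846046],
})

# Option 1: sort so the smallest Distance comes first, then keep the
# first row seen for each Date.
smallest = df.sort_values("Distance").drop_duplicates(subset="Date", keep="first")

# Option 2: select the row with the minimum Distance within each Date group.
smallest = df.loc[df.groupby("Date")["Distance"].idxmin()]
print(smallest)

Both variants keep exactly one row per Date, the one with the minimum Distance, so no explicit if statements are needed.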