我有一个带有文本列的数据框。我将它们分成x_train和x_test。
我的问题是,最好Tokenizer.fit_on_text()在整个x数据集上使用Keras的功能x_train?
像这样:
tokenizer = Tokenizer()
tokenizer.fit_on_texts(x_data)
Run Code Online (Sandbox Code Playgroud)
要么
tokenizer.fit_on_texts(x_train) # <- fixed typo
tokenizer.texts_to_sequences(x_train)
Run Code Online (Sandbox Code Playgroud)
有关系吗?我也必须x_test稍后再进行标记化,所以我可以只使用相同的标记化器吗?
我有这样一个DF:
Name Food Year_eaten Month_eaten
Maria Rice 2014 3
Maria Rice 2015 NaN
Maria Rice 2016 NaN
Jack Steak 2011 NaN
Jack Steak 2012 5
Jack Steak 2013 NaN
Run Code Online (Sandbox Code Playgroud)
我希望输出看起来像这样:
Name Food Year_eaten Month_eaten
Maria Rice 2014 3
Maria Rice 2015 3
Maria Rice 2016 3
Jack Steak 2011 5
Jack Steak 2012 5
Jack Steak 2013 5
Run Code Online (Sandbox Code Playgroud)
我想根据这个条件填写NaN:
If the row's Name, Food is the same and the Year's are consecutive:
Fill the NaN's with the Month_eaten corresponding to …Run Code Online (Sandbox Code Playgroud) 假设我有一个像这样的数据框:
ID Name Description
0 Manny V e r y calm
1 Joey Keen and a n a l y t i c a l
2 Lisa R a s h and careless
3 Ash Always joyful
Run Code Online (Sandbox Code Playgroud)
我想删除列中每个字母之间的所有空格Description,而不完全删除单词之间的所有必要空格。
Pandas 有一个简单的方法吗?
我有一个垃圾邮件数据集,它具有以下数据类型:
\n\npyspark.rdd.PipelinedRDD
当我这样做时spams.take(3),我得到:
[["Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C\'s apply 08452810075over18\'s"],\n [\'WINNER!! As a valued network customer you have been selected to receivea \xc2\xa3900 prize reward! To claim call 09061701461. Claim code KL341. Valid 12 hours only.\'],\n [\'Had your mobile 11 months or more? U R entitled to Update to the latest colour mobiles with …