I have a reference list
ref = ['September', 'August', 'July', 'June', 'May', 'April', 'March']
and a dataframe
df = pd.DataFrame({'Month_List': [['July'], ['August'], ['July', 'June'], ['May', 'April', 'March']]})
df
Month_List
0 [July]
1 [August]
2 [July, June]
3 [May, April, March]
I want to check which elements of the reference list are present in each row, and then convert the result to a binary list.
I can do this with apply:
def convert_month_to_binary(ref, lst):
    s = pd.Series(ref)
    return s.isin(lst).astype(int).tolist()
df['Binary_Month_List'] = df['Month_List'].apply(lambda x: convert_month_to_binary(ref, x))
df
Month_List Binary_Month_List
0 [July] [0, 0, 1, 0, 0, 0, 0]
1 [August] [0, 1, 0, 0, 0, 0, 0]
2 [July, June] [0, 0, 1, 1, 0, 0, …

I want to sample 140 numbers between 1000 and 100000 such that the sum of the 140 numbers is approximately 2 million (2000000):
sample(1000:100000,140)
so that:
sum(sample(1000:100000,140)) = 2000000
Any pointers on how to achieve this?
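One workable scheme, sketched here in Python with numpy rather than R (so the exact calls are an illustration, not the poster's code): draw the sample, then repeatedly rescale toward the target sum, clipping back into range each time so the bounds still hold.

```python
import numpy as np

# Hypothetical sketch: sample, rescale toward the target sum, and clip back
# into [lo, hi] each pass so the range constraint is preserved.
rng = np.random.default_rng(42)
target, n, lo, hi = 2_000_000, 140, 1_000, 100_000

x = rng.integers(lo, hi + 1, size=n).astype(float)
for _ in range(25):
    x = np.clip(x * (target / x.sum()), lo, hi)
x = np.round(x).astype(int)
# x now holds 140 integers in [1000, 100000] whose sum is close to 2000000
```

The rescale-and-clip loop converges because the required mean (2000000 / 140 ≈ 14286) lies inside the allowed range; only the final rounding perturbs the sum, by at most a few tens.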
I have a DenseVector RDD like this
>>> frequencyDenseVectors.collect()
[DenseVector([1.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 0.0, 1.0, 1.0, 1.0, 0.0, 1.0]), DenseVector([1.0, 1.0, 1.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]), DenseVector([1.0, 1.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0]), DenseVector([0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0])]
I want to convert it into a DataFrame. I tried this:
>>> spark.createDataFrame(frequencyDenseVectors, ['rawfeatures']).collect()
It gives an error like this:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/opt/BIG-DATA/spark-2.0.0-bin-hadoop2.7/python/pyspark/sql/session.py", line 520, …

apache-spark pyspark apache-spark-ml apache-spark-mllib apache-spark-2.0
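A common cause of this error (an assumption here, since the traceback is cut off) is that `createDataFrame` expects each RDD element to be a row, i.e. a tuple or `Row`, not a bare vector. The wrapping step, sketched in plain Python with a hypothetical helper:

```python
# `wrap_as_rows` is a hypothetical helper, not part of any Spark API.
def wrap_as_rows(vectors):
    """Turn each element into a one-field tuple, i.e. a one-column row."""
    return [(v,) for v in vectors]

# With Spark itself the same idea would be (assumed, not run here):
#   df = spark.createDataFrame(frequencyDenseVectors.map(lambda v: (v,)), ['rawfeatures'])
```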
I want to install pip for my Python 2.6.6; I am on Oracle Linux 6.
I followed the answer given at this Link.
I downloaded the get-pip.py file and ran the following command
sudo python2.6 get-pip.py
But I get the following error:
[root@bigdatadev3 Downloads]# sudo python2.6 get-pip.py
DEPRECATION: Python 2.6 is no longer supported by the Python core team, please upgrade your Python. A future version of pip will drop support for Python 2.6
Collecting pip
Retrying (Retry(total=4, connect=None, read=None, redirect=None)) after connection broken by 'NewConnectionError('<pip._vendor.requests.packages.urllib3.connection.VerifiedHTTPSConnection object at 0x3cad210>: Failed to establish a new connection: [Errno 101] Network is unreachable',)': /simple/pip/
Retrying (Retry(total=3, connect=None, …

I have a pyspark DataFrame, and I need to convert it into a Python dictionary.
The code below is reproducible:
from pyspark.sql import Row
rdd = sc.parallelize([Row(name='Alice', age=5, height=80),Row(name='Alice', age=5, height=80),Row(name='Alice', age=10, height=80)])
df = rdd.toDF()
Once I have this dataframe, I need to convert it into a dictionary.
I tried this:
df.set_index('name').to_dict()
But it gives an error. How can I do this?
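One reason the attempt fails: `set_index` and `to_dict` are pandas methods, which a pyspark DataFrame does not have. One route, a sketch under the assumption that collecting the rows to the driver is acceptable: pull the rows down (e.g. via `df.rdd.map(lambda r: r.asDict()).collect()`) and key the result by `name`. The reshaping step in plain Python:

```python
# rows as they might look after collecting the example dataframe to the driver
rows = [
    {"name": "Alice", "age": 5, "height": 80},
    {"name": "Alice", "age": 5, "height": 80},
    {"name": "Alice", "age": 10, "height": 80},
]

# Key by name; note that duplicate names overwrite earlier entries,
# which matters for data like the example above.
result = {r["name"]: {k: v for k, v in r.items() if k != "name"} for r in rows}
```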
I have sparse vectors like this
>>> countVectors.rdd.map(lambda vector: vector[1]).collect()
[SparseVector(13, {0: 1.0, 2: 1.0, 3: 1.0, 6: 1.0, 8: 1.0, 9: 1.0, 10: 1.0, 12: 1.0}), SparseVector(13, {0: 1.0, 1: 1.0, 2: 1.0, 4: 1.0}), SparseVector(13, {0: 1.0, 1: 1.0, 3: 1.0, 4: 1.0, 7: 1.0}), SparseVector(13, {1: 1.0, 2: 1.0, 5: 1.0, 11: 1.0})]
I am trying to convert these to dense vectors in pyspark 2.0.0 like this:
>>> frequencyVectors = countVectors.rdd.map(lambda vector: vector[1])
>>> frequencyVectors.map(lambda vector: Vectors.dense(vector)).collect()
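For reference, the sparse-to-dense expansion this map is attempting can be sketched in plain Python (`sparse_to_dense` is a hypothetical helper; on the Spark side, `SparseVector.toArray()` performs the same expansion):

```python
def sparse_to_dense(size, entries):
    """Expand a {index: value} mapping into a dense list of floats."""
    dense = [0.0] * size
    for i, v in entries.items():
        dense[i] = v
    return dense
```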
I get an error like this:
16/12/26 14:03:35 ERROR Executor: Exception in task 0.0 in stage 13.0 (TID 13)
org.apache.spark.api.python.PythonException: Traceback (most …

I am trying to build a custom NER with Apache OpenNLP 1.7. From the documentation available here, I developed the following code
import java.io.BufferedOutputStream;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.charset.Charset;
import opennlp.tools.namefind.NameFinderME;
import opennlp.tools.namefind.NameSample;
import opennlp.tools.namefind.NameSampleDataStream;
import opennlp.tools.namefind.TokenNameFinderFactory;
import opennlp.tools.namefind.TokenNameFinderModel;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;
import opennlp.tools.util.TrainingParameters;
public class PersonClassifierTrainer {

    static String modelFile = "/opt/NLP/data/en-ner-customperson.bin";

    public static void main(String[] args) throws IOException {
        Charset charset = Charset.forName("UTF-8");
        ObjectStream<String> lineStream = new PlainTextByLineStream(new FileInputStream("/opt/NLP/data/person.train"), charset);
        ObjectStream<NameSample> sampleStream = new NameSampleDataStream(lineStream);
        TokenNameFinderModel model;
        TokenNameFinderFactory nameFinderFactory = null;
        try {
            model = NameFinderME.train("en", "person", sampleStream, TrainingParameters.defaultParams(),
                    nameFinderFactory);
        } …

I have a matrix like this
df <- matrix(c(rep(1,3),rep(2,3)),nrow=3,ncol=2)
df:
[,1] [,2]
[1,] 1 2
[2,] 1 2
[3,] 1 2
I want to convert each cell to Yes if its value is greater than 0, and No otherwise.
I understand I can do this:
apply(df, 2, function(x) ifelse(x > 0, "Yes","No"))
However, my matrix is very large (millions × 5000), so using apply takes a very long time.
I also tried
df <- ifelse(df > 0, "Yes","No")
However, even this takes a lot of time.
Can I get better performance for this?
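For comparison, the same cell-wise labeling fully vectorized in Python/numpy (a sketch for reference, not the R answer itself): `np.where` evaluates the condition on the whole array at once, with no per-cell function calls.

```python
import numpy as np

m = np.array([[1, 2],
              [1, 2],
              [0, 2]])
labels = np.where(m > 0, "Yes", "No")  # one vectorized pass over the matrix
```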
I have a pyspark dataframe with a column containing strings. I want to split this column into words.
Code:
>>> sentenceData = sqlContext.read.load('file://sample1.csv', format='com.databricks.spark.csv', header='true', inferSchema='true')
>>> sentenceData.show(truncate=False)
+---+---------------------------+
|key|desc |
+---+---------------------------+
|1 |Virat is good batsman |
|2 |sachin was good |
|3 |but modi sucks big big time|
|4 |I love the formulas |
+---+---------------------------+
Expected Output
---------------
>>> sentenceData.show(truncate=False)
+---+-------------------------------------+
|key|desc |
+---+-------------------------------------+
|1 |[Virat,is,good,batsman] |
|2 |[sachin,was,good] |
|3 |.... |
|4 |... |
+---+-------------------------------------+
How can I do this?
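With Spark, the usual tool is `pyspark.sql.functions.split`; the transformation would look roughly like `sentenceData.withColumn('desc', split(col('desc'), ' '))` (assumed here, not run). The string side of the operation, sketched in plain Python:

```python
# Plain-Python sketch of the per-row split; with Spark the same would be
# done column-wise via pyspark.sql.functions.split (assumed, not run here).
def to_words(desc):
    return desc.split(" ")

rows = {1: "Virat is good batsman", 2: "sachin was good"}
words = {key: to_words(desc) for key, desc in rows.items()}
```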
I have a data.table like this
ds <- data.table(ID = c(1,1,1,1,1,2,2,2,2,2),
Month = c("Jan", "Feb", "Mar", "Apr", "May", "Jan", "Feb", "Mar", "Apr", "May"),
val = c(1,2,3,4,5,6,7,8,9,10))
ds
ID Month val
1: 1 Jan 1
2: 1 Feb 2
3: 1 Mar 3
4: 1 Apr 4
5: 1 May 5
6: 2 Jan 6
7: 2 Feb 7
8: 2 Mar 8
9: 2 Apr 9
10: 2 May 10
I would like to rearrange my data.table so that, within each ID group, the rows are ordered by Month like this
ID Month val
4: 1 Apr 4
5: …
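The post's target ordering is cut off above, so as a sketch only: the same within-group reordering expressed in Python/pandas (the month order below is hypothetical, standing in for whatever order the poster intends) can be done with an ordered Categorical plus a sort.

```python
import pandas as pd

ds = pd.DataFrame({
    "ID": [1] * 5 + [2] * 5,
    "Month": ["Jan", "Feb", "Mar", "Apr", "May"] * 2,
    "val": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
})

# Hypothetical target order -- the post's intended order is truncated.
month_order = ["Apr", "Feb", "Jan", "Mar", "May"]
ds["Month"] = pd.Categorical(ds["Month"], categories=month_order, ordered=True)

# Sorting by (ID, Month) now respects the categorical order within each ID.
out = ds.sort_values(["ID", "Month"]).reset_index(drop=True)
```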