Posts by bib*_*bib

How can we save a huge PySpark DataFrame?

I have a large PySpark DataFrame that I want to save to myfile (.tsv) for later use. To do this, I wrote the following code:

import csv

with open(myfile, "a") as csv_file:
    writer = csv.writer(csv_file, delimiter='\t')
    # One list element per column; csv.writer inserts the tab delimiter itself.
    writer.writerow(["vertex", "id_source", "id_target", "similarity"])

    # Collect and write one partition at a time to limit driver memory use.
    for part_id in range(joinDesrdd_df.rdd.getNumPartitions()):
        part_rdd = joinDesrdd_df.rdd.mapPartitionsWithIndex(make_part_filter(part_id), True)
        data_from_part_rdd = part_rdd.collect()
        vertex_list = set()

        for row in data_from_part_rdd:
            writer.writerow([....])

    csv_file.flush()

My code never gets past this step; it raises an exception:

1. In the workers log:
19/07/22 08:58:57 INFO Worker: Executor app-20190722085320-0000/2 finished with state KILLED exitStatus 143
14: 19/07/22 08:58:57 INFO ExternalShuffleBlockResolver: Application app-20190722085320-0000 removed, cleanupLocalDirs = true
14: 19/07/22 08:58:57 INFO …
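One way to keep memory use low when writing a large result is to stream rows to the file one at a time instead of holding a collected list. The sketch below is a minimal illustration in plain Python, assuming the rows can be supplied as an iterator; `write_rows_tsv`, `sample_rows`, and the output path are hypothetical names that do not appear in the post.

```python
import csv
import os
import tempfile


def write_rows_tsv(rows, path, header):
    """Stream rows to a tab-separated file and return how many were written."""
    # newline="" lets the csv module control line endings (Python 3).
    with open(path, "w", newline="") as tsv_file:
        writer = csv.writer(tsv_file, delimiter="\t")
        writer.writerow(header)
        count = 0
        for row in rows:  # only one row is held in memory at a time
            writer.writerow(row)
            count += 1
    return count


# Hypothetical sample rows standing in for the DataFrame's contents.
sample_rows = [("v1", 1, 2, 0.9), ("v2", 3, 4, 0.5)]
out_path = os.path.join(tempfile.gettempdir(), "example_output.tsv")
written = write_rows_tsv(
    iter(sample_rows), out_path,
    ["vertex", "id_source", "id_target", "similarity"],
)
```

In PySpark, `DataFrame.toLocalIterator()` can supply such an iterator, and `df.write.csv(path, sep="\t", header=True)` writes the data from the executors without passing through the driver at all. Whether either one resolves the killed executors reported in the post is not something the excerpt confirms.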

apache-spark pyspark pyspark-sql

7
Score
1
Answers
3726
Views

How can I maintain a temporary dictionary in a PySpark application?

I want to use a pretrained embedding model (fastText) in a PySpark application.

If I broadcast the file (.bin), the following exception is thrown: Traceback (most recent call last):

cPickle.PicklingError: Could not serialize broadcast: OverflowError: cannot serialize a string larger than 2 GiB

Instead, I tried to use sc.addFile(modelpath), where modelpath=path/to/model.bin, as follows:

I created a file named fasttextSpark.py:

import gensim
from gensim.models.fasttext import FastText as FT_gensim
# Load model (loads when this library is being imported)
model = FT_gensim.load_fasttext_format("/project/6008168/bib/wiki.en.bin")

# This is the function we use in the UDF to look up the embedding vector of a given msg
def get_vector(msg):
    pred = model[msg]
    return pred

and testSubmit.sh:

#!/bin/bash
#SBATCH -N 2
#SBATCH -t 00:10:00
#SBATCH --mem 20000
#SBATCH --ntasks-per-node …

python apache-spark word2vec pyspark fasttext

5
Score
1
Answers
288
Views

How do I compute a percentile for each key in a PySpark DataFrame?

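I have a PySpark DataFrame consisting of three columns x, y, z.

x may appear in multiple rows of this DataFrame. How can I compute the percentile separately for each key in x?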

+------+---------+------+
|  Name|     Role|Salary|
+------+---------+------+
|   bob|Developer|125000|
|  mark|Developer|108000|
|  carl|   Tester| 70000|
|  carl|Developer|185000|
|  carl|   Tester| 65000|
| roman|   Tester| 82000|
| simon|Developer| 98000|
|  eric|Developer|144000|
|carlos|   Tester| 75000|
| henry|Developer|110000|
+------+---------+------+

Desired output:

+------+---------+------+---------+
|  Name|     Role|Salary|      50%|
+------+---------+------+---------+
|   bob|Developer|125000|117500.0 |
|  mark|Developer|108000|117500.0 |
|  carl|   Tester| 70000|72500.0  |
|  carl|Developer|185000|117500.0 |
|  carl|   Tester| 65000|72500.0  |
| roman|   Tester| 82000|72500.0  |
| simon|Developer| 98000|117500.0 |
|  eric|Developer|144000|117500.0 …
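To make the desired output concrete, here is a minimal plain-Python sketch of a per-key 50th percentile on the sample rows. It assumes the key is the Role column (as the desired output suggests) and uses the interpolated median from the standard library; `median_by_key` is a hypothetical helper name, and this is not a PySpark implementation.

```python
from collections import defaultdict
from statistics import median

# Sample rows copied from the table in the question: (Name, Role, Salary).
rows = [
    ("bob", "Developer", 125000),
    ("mark", "Developer", 108000),
    ("carl", "Tester", 70000),
    ("carl", "Developer", 185000),
    ("carl", "Tester", 65000),
    ("roman", "Tester", 82000),
    ("simon", "Developer", 98000),
    ("eric", "Developer", 144000),
    ("carlos", "Tester", 75000),
    ("henry", "Developer", 110000),
]


def median_by_key(records):
    """Group salaries by role and return the median of each group."""
    groups = defaultdict(list)
    for _name, role, salary in records:
        groups[role].append(salary)
    return {role: float(median(values)) for role, values in groups.items()}


medians = median_by_key(rows)
# Attach each row's group median as a fourth column, as in the desired output.
with_median = [(name, role, salary, medians[role]) for name, role, salary in rows]
```

For the sample data this gives 117500.0 for Developer and 72500.0 for Tester, matching the values shown in the desired output.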

apache-spark apache-spark-sql pyspark

4
Score
2
Answers
10k
Views

How to get the tf-idf vector of a given document

I have the following documents:

id  review
1   "Human machine interface for lab abc computer applications."
2   "A survey of user opinion of computer system response time."
3   "The EPS user interface management system."
4   "System and human system engineering testing of EPS."              
5   "Relation of user perceived response time to error measurement."
6   "The generation of random binary unordered trees."
7   "The intersection graph of paths in trees."
8   "Graph minors IV Widths of trees and well quasi ordering."
9   "Graph minors A …
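As an illustration of what a per-document tf-idf vector looks like, here is a minimal plain-Python sketch over the first three reviews. It assumes one common unsmoothed weighting, tf = count / document length and idf = ln(N / df); libraries use different variants, so their numbers will differ. `tokenize` and `tfidf_vectors` are hypothetical helper names.

```python
import math
import re
from collections import Counter

# The first three reviews from the question, keyed by id.
docs = {
    1: "Human machine interface for lab abc computer applications.",
    2: "A survey of user opinion of computer system response time.",
    3: "The EPS user interface management system.",
}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(documents):
    """Return {doc_id: {term: weight}} using tf * ln(N / df)."""
    tokens = {doc_id: tokenize(text) for doc_id, text in documents.items()}
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for toks in tokens.values() for term in set(toks))
    n_docs = len(documents)
    vectors = {}
    for doc_id, toks in tokens.items():
        counts = Counter(toks)
        vectors[doc_id] = {
            term: (count / len(toks)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        }
    return vectors


vectors = tfidf_vectors(docs)
vector_for_doc_1 = vectors[1]  # sparse tf-idf vector of document 1
```

Each vector is sparse: it holds only the terms that occur in that document, and a term that appears in fewer documents gets a larger weight.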

tf-idf

3
Score
1
Answers
5792
Views

How do I activate a specific Python environment as part of my Slurm submission?

I want to run a script (an SBATCH file) on a cluster. How do I activate my virtual environment (path/to/env_name/bin/activate)? Do I just need to add:

module load python/2.7.14
source "/pathto/Python_directory/ENV2.7_new/bin/activate"

to the my_script.sh file?

python slurm sbatch

3
Score
2
Answers
2472
Views

How do I format the output of COUNT in SPARQL as a plain number?

SELECT ?class (count(distinct ?subClass) AS ?noci)
WHERE { ?class jooo:hasJSubClass ?subClass }
GROUP BY ?class

I get the correct answer, but in this format: "2"^^<http://www.w3.org/2001/XMLSchema#integer>. I need the answer to be just 2. What should I do?
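The value shown is a typed RDF literal: a lexical form ("2") followed by its datatype IRI. One option is to strip the datatype on the client side after the query returns. The sketch below assumes the result arrives as a string in exactly that serialized form; `lexical_value` and `TYPED_LITERAL` are hypothetical names, not part of any SPARQL library.

```python
import re

# Matches a serialized typed literal such as "2"^^<http://...#integer>.
TYPED_LITERAL = re.compile(r'^"(?P<lexical>.*)"\^\^<(?P<datatype>[^>]+)>$')


def lexical_value(term):
    """Return the plain value of a typed literal, or the input unchanged."""
    match = TYPED_LITERAL.match(term)
    if match is None:
        return term  # not a typed literal; leave as-is
    lexical = match.group("lexical")
    if match.group("datatype").endswith("#integer"):
        return int(lexical)  # xsd:integer becomes a Python int
    return lexical


count = lexical_value('"2"^^<http://www.w3.org/2001/XMLSchema#integer>')
```

Inside the query itself, SPARQL 1.1's `STR()` function returns the lexical form of a literal, so projecting `(STR(COUNT(DISTINCT ?subClass)) AS ?noci)` is another option; how the result is then displayed still depends on the client.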

rdf sparql

1
Score
1
Answers
437
Views

How to use the compareTo function in Java

public class Pair<F, S> implements Comparable<Pair<F, S>> {
    public F first;
    public S second;

    public F first() {
        return first;
    }

    public void setFirst(F first) {
        this.first = first;
    }

    public S second() {
        return second;
    }

    public void setSecond(S second) {
        this.second = second;
    }

    public Pair(F first, S second) {
        super();
        this.first = first;
        this.second = second;
    }

    @Override
    public int hashCode() {
        return first.hashCode() ^ second.hashCode();
    }

    // Two pairs are equal when both components are equal.
    @Override
    public boolean equals(Object obj) {
        return obj instanceof Pair
                && ((Pair<?, ?>) obj).first.equals(first)
                && ((Pair<?, ?>) obj).second.equals(second);
    }

    @Override
    public String toString() {
        return first + " / …

java

1
Score
1
Answers
406
Views

SPARQL: extracting distinct values from DBpedia

I want to extract a graph of 5 Film (or movie) individuals from DBpedia.

My query is:

ParameterizedSparqlString qs = new ParameterizedSparqlString("" +
    "construct { ?s ?p ?o } " + "where { ?s a <http://dbpedia.org/ontology/Film> . " + "?s ?p ?o } OFFSET 0 LIMIT 5");

I get the following result:

1- http://dbpedia.org/resource/1001_Inventions_and_the_World_of_Ibn_Al-Haytham http://www.w3.org/1999/02/22-rdf-syntax-ns#type http://dbpedia.org/ontology/Film

2- http://dbpedia.org/resource/1001_Inventions_and_the_World_of_Ibn_Al-Haytham http://www.w3.org/1999/02/22-rdf-syntax-ns#type http://www.w3.org/2002/07/owl#Thing

3- http://dbpedia.org/resource/1001_Inventions_and_the_World_of_Ibn_Al-Haytham http://www.w3.org/1999/02/22-rdf-syntax-ns#type http://www.wikidata.org/entity/Q386724

4- http://dbpedia.org/resource/1001_Inventions_and_the_World_of_Ibn_Al-Haytham http://www.w3.org/1999/02/22-rdf-syntax-ns#type http://dbpedia.org/ontology/Wikidata:Q11424

5- http://dbpedia.org/resource/1001_Inventions_and_the_World_of_Ibn_Al-Haytham http://www.w3.org/1999/02/22-rdf-syntax-ns#type http://dbpedia.org/ontology/Work

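The problem: the same film is returned 5 times, once for each class. Film, Thing, Q386724, Wikidata:Q11424 and Work are equivalent classes (or are related by a subclass relation).

My question:

I want it returned only once: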

 <http://dbpedia.org/resource/1001_Inventions_and_the_World_of_Ibn_Al-Haytham>    
 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> 
 <http://dbpedia.org/ontology/Film> .

and the other 4 triples filtered out.

How can I do that?

Thanks in advance.

sparql dbpedia

1
Score
1
Answers
1316
Views