I am using Spark 2 and Scala 2.11 in a Zeppelin 0.7 notebook. I have a dataframe which I can print like this:
dfLemma.select("text", "lemma").show(20,false)
The output looks like:
+---------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|text |lemma |
+---------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|RT @Dope_Promo: When you and your crew beat your high scores on FUGLY FROG https://time.com/Sxp3Onz1w8 |[rt, @dope_promo, :, when, you, and, you, crew, beat, you, high, score, on, FUGLY, FROG, https://time.com/sxp3onz1w8] |
|RT @axolROSE: Did yall just call Kermit the frog a lizard? https://time.com/wDAEAEr1Ay |[rt, @axolrose, :, do, yall, just, call, Kermit, the, frog, a, lizard, ?, https://time.com/wdaeaer1ay] |
I tried to make the output nicer in Zeppelin with:
val printcols= dfLemma.select("text", "lemma")
println("%table " + printcols) …
I am trying to imitate some GraphQL functionality, but I do not have access to run the original version. It has the form:
query {
  dataSources(dataType: Ais) {
    ... on AisDataSource {
      messages(filter: {broadcastType: Static}) {
        ... on AisStaticBroadcast {
          field1
          field2
(I have omitted the closing braces.)
My understanding is that ... on either includes a fragment (there is none here) or selects between alternatives (but these are nested). So is this query wrong, or is there more to ... on than that?
I am using HDP-2.6.0.3, but I need Zeppelin 0.8, so I have installed it as a standalone service. When I run:
%sql
show tables
I get nothing back, and when I run Spark2 SQL commands I get 'table not found'. The tables can be seen in the 0.7 Zeppelin that comes as part of HDP.
Can anyone tell me what I am missing to get Zeppelin/Spark to see Hive?
The steps I performed to create zep0.8 were as follows:
mvn clean package -DskipTests -Pspark-2.1 -Phadoop-2.7 -Dhadoop.version=2.7.3 -Pyarn -Ppyspark -Psparkr -Pr -Pscala-2.11
Copied zeppelin-site.xml and shiro.ini from /usr/hdp/2.6.0.3-8/zeppelin/conf to /home/ed/zeppelin/conf.
Created /home/ed/zeppelin/conf/zeppelin-env.sh, in which I put the following:
export JAVA_HOME=/usr/jdk64/jdk1.8.0_112
export HADOOP_CONF_DIR=/etc/hadoop/conf
export ZEPPELIN_JAVA_OPTS="-Dhdp.version=2.6.0.3-8"
Copied /etc/hive/conf/hive-site.xml to /home/ed/zeppelin/conf.
EDIT: I have also tried:
import org.apache.spark.sql.SparkSession
val spark = SparkSession
  .builder()
  .appName("interfacing spark sql to hive metastore without configuration file")
  .config("hive.metastore.uris", "thrift://s2.royble.co.uk:9083") // replace with your hivemetastore service's thrift url
  .config("url", "jdbc:hive2://s2.royble.co.uk:10000/default")
  .config("UID", "admin")
  .config("PWD", "admin")
  .config("driver", "org.apache.hive.jdbc.HiveDriver")
  .enableHiveSupport() // don't forget to enable hive support
  .getOrCreate() …
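A hedged check I would add here (my own sketch, not something from the original setup), to see which catalog the session is actually using:

// Sketch: if only an empty "default" database shows up, the session is
// probably using a local Derby metastore rather than the one in hive-site.xml.
spark.sql("show databases").show()
spark.catalog.listTables("default").show()
println(spark.conf.get("spark.sql.catalogImplementation")) // expect "hive"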
I am using a mapper that converts BinaryFiles (jpegs) to a Hadoop Sequence File (HSF):
public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
    String uri = value.toString().replace(" ", "%20");
    Configuration conf = new Configuration();
    FSDataInputStream in = null;
    try {
        FileSystem fs = FileSystem.get(URI.create(uri), conf);
        in = fs.open(new Path(uri));
        java.io.ByteArrayOutputStream bout = new ByteArrayOutputStream();
        byte[] buffer = new byte[1024 * 1024];
        // write only the bytes actually read; writing the whole buffer
        // each pass (bout.write(buffer)) pads the output with stale bytes
        int bytesRead;
        while ((bytesRead = in.read(buffer, 0, buffer.length)) != -1) {
            bout.write(buffer, 0, bytesRead);
        }
        context.write(value, new BytesWritable(bout.toByteArray()));
I then have a second mapper that reads the HSF, thus:
public class ImagePHashMapper extends Mapper<Text, BytesWritable, Text, Text>{
public void map(Text key, BytesWritable value, Context …
I need to compare a large number of strings similar to 50358c591cef4d76. I have a Hamming distance function (using pHash) that I can use. How do I do this efficiently? My pseudocode is:
For each string
    currentstring = string
    For each string other than currentstring
        Calculate Hamming distance
I would like to output the results as a matrix and be able to retrieve the values. I would also like to run it through Hadoop Streaming!
Any pointers gratefully received.
This is what I have tried, but it is slow:
import glob

path = lotsdir + '*.*'
files = glob.glob(path)
files.sort()
setOfFiles = set(files)
print len(setOfFiles)

i = 0
j = 0
for fname in files:
    print 'fname', fname, 'setOfFiles', len(setOfFiles)
    # copy, not alias: plain "oneLessSetOfFiles = setOfFiles" would make
    # the remove() below mutate the shared set
    oneLessSetOfFiles = setOfFiles.copy()
    oneLessSetOfFiles.remove(fname)
    i += 1
    for compareFile in oneLessSetOfFiles:
        j += 1
        # recomputing both image hashes inside the inner loop is what
        # makes this so slow: each file gets re-hashed O(n) times
        hash1 = pHash.imagehash(fname)
        hash2 = pHash.imagehash(compareFile)
        print ...
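For reference, the distance computation itself is cheap next to the image hashing; a hedged Scala sketch of the comparison step, assuming the strings are 64-bit pHash values in hex:

// Sketch: Hamming distance between two 64-bit hex hashes is the
// popcount of their XOR.
def hamming(a: String, b: String): Int =
  java.lang.Long.bitCount(
    java.lang.Long.parseUnsignedLong(a, 16) ^ java.lang.Long.parseUnsignedLong(b, 16))

hamming("50358c591cef4d76", "50358c591cef4d77") // == 1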
I am using Spark's MultilayerPerceptronClassifier. This generates a 'prediction' column in 'predictions'. When I try to show it I get the error:
SparkException: Failed to execute user defined function($anonfun$1: (vector) => double) ...
Caused by: java.lang.IllegalArgumentException: requirement failed: A & B Dimension mismatch!
Other columns, for example vector, display OK. Part of the predictions schema:
|-- vector: vector (nullable = true)
|-- prediction: double (nullable = true)
My code is:
//racist is boolean, needs to be string:
val train2 = train.withColumn("racist", 'racist.cast("String"))
val test2 = test.withColumn("racist", 'racist.cast("String"))
val indexer = new StringIndexer().setInputCol("racist").setOutputCol("indexracist")
val word2Vec = new Word2Vec().setInputCol("lemma").setOutputCol("vector") //.setVectorSize(3).setMinCount(0)
val layers = Array[Int](4,5, 2)
val mpc = new MultilayerPerceptronClassifier().setLayers(layers).setBlockSize(128).setSeed(1234L).setMaxIter(100).setFeaturesCol("vector").setLabelCol("indexracist")
val pipeline = new Pipeline().setStages(Array(indexer, word2Vec, …
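My suspicion (an assumption, not something confirmed above): MultilayerPerceptronClassifier requires the first layer size to equal the length of the feature vector, and Word2Vec's default vectorSize is 100, so layers = Array(4, 5, 2) mismatches the 100-dimensional vectors. A hedged sketch:

// Sketch: tie the input layer to the Word2Vec vector size explicitly.
val vectorSize = 10 // hypothetical size; the point is that both must agree
val word2Vec = new Word2Vec().setInputCol("lemma").setOutputCol("vector")
  .setVectorSize(vectorSize).setMinCount(0)
val layers = Array[Int](vectorSize, 5, 2) // input, hidden, two output classes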
I am trying to build using:
FROM mcr.microsoft.com/dotnet/core/sdk:2.1 AS builder
COPY pythonnet/src/ pythonnet/src
WORKDIR /pythonnet/src/runtime
RUN dotnet build -f netstandard2.0 -p:DefineConstants=\"MONO_LINUX\;XPLAT\;PYTHON3\;PYTHON37\;UCS4\;NETSTANDARD\" Python.Runtime.15.csproj
# copy myApp csproj and restore
COPY src/myApp/*.csproj /src/myApp/
WORKDIR /src/myApp
RUN dotnet restore
# now copy everything else as separate docker step
# (copy to staging folder, remove csproj, and copy down - so we don't overwrite project above)
WORKDIR /
COPY src/myApp/ ./staging/src/myApp
RUN rm ./staging/src/myApp/*.csproj \
&& cp -r ./staging/* ./ \
&& rm -rf ./staging
This worked fine, and still does in Windows 10, but in …
I ran into a problem similar to this one. I have updated my GraphQLHttpClient and now need to supply an extra parameter; the solution given was:
GraphQLHttpClient gql = new GraphQLHttpClient(o => {
    o.EndPoint = _config["API:Endpoint"];
    o.JsonSerializer = new NewtonsoftJsonSerializer();
});
But this tells me:
Error CS1729 'GraphQLHttpClient' does not contain a constructor that takes 1 arguments
I have also tried:
using Newtonsoft.Json;
GraphQLHttpClient gql = new GraphQLHttpClient(_options.Url, new Newtonsoft.Json.JsonSerializer());
which gives: Error CS1503 Argument 2: cannot convert from 'Newtonsoft.Json.JsonSerializer' to 'GraphQL.Client.Abstractions.Websocket.IGraphQLWebsocketJsonSerializer'
I know very little C#, so I would be grateful for any pointers.
I have an Azure storage account. When I allow access from all networks, my GitHub Actions can run and update my Azure static website.
When I disallow all networks except named ones (147.243.0.0/16 and my machine's IP), I get a 403 (request denied) error in GitHub Actions.
I assume I need to add GitHub to those IPs, but when I run:
curl -H "Accept: application/vnd.github.v3+json" https://api.github.com/meta
there are a huge number of IPs! Do I need to add them all?
I am not much use on the Linux CLI. I am trying to run the following to randomly sort a file and then split it, using the output-file prefix 'out' (each output file should get 50 lines, with the remainder in the last one):
sort -R somefile | split -l 50 out
I get the error:
split: cannot open ‘out’ for reading: No such file or directory
presumably because split expects its first positional argument to be its input file. How do I pass the result of sort into split? TIA!