I'm trying to run a query on BigQuery using the standard SQL dialect (i.e., not legacy SQL). My query is:
SELECT
date, hits.referer
FROM `refresh.ga_sessions_xxxxxx*`
LIMIT 1000
but I keep getting the error:
Error: Cannot access field referer on a value with type
ARRAY<STRUCT<hitNumber INT64, time INT64, hour INT64, ...>> at [2:12]
Does anyone know the correct syntax?
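The error message says hits is an ARRAY of STRUCTs, so in standard SQL it has to be flattened before its fields can be referenced. A hedged sketch of the usual fix, reusing the table name from the question and assuming the field names match the GA export schema:

```sql
SELECT
  date,
  h.referer
FROM `refresh.ga_sessions_xxxxxx*`,
  UNNEST(hits) AS h  -- flatten the ARRAY<STRUCT<...>> into one row per hit
LIMIT 1000
```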
I have a dict of dicts:

{'user': {movie: rating}}

For example,
{'Jill': {'Avenger: Age of Ultron': 7.0,
          'Django Unchained': 6.5,
          'Gone Girl': 9.0,
          'Kill the Messenger': 8.0},
 'Toby': {'Avenger: Age of Ultron': 8.5,
          'Django Unchained': 9.0,
          'Zoolander': 2.0}}
I want to convert this dict of dicts into a pandas DataFrame whose first column is the user name and whose other columns are the movie ratings, i.e.

user Gone_Girl Horrible_Bosses_2 Django_Unchained Zoolander etc.

However, some users did not rate certain movies, so those movies do not appear in that user's inner dict. In those cases it would be nice to fill the entries with NaN.
Right now I iterate over the keys, fill a list, and then build the DataFrame from that list:
import pandas as pd

data = []
for key in movie_user_preferences.keys():
    try:
        data.append((key,
                     movie_user_preferences[key]['Gone Girl'],
                     movie_user_preferences[key]['Horrible Bosses 2'],
                     movie_user_preferences[key]['Django Unchained'],
                     movie_user_preferences[key]['Zoolander'],
                     movie_user_preferences[key]['Avenger: Age of Ultron'],
                     movie_user_preferences[key]['Kill the Messenger']))
    # if the user is missing any of these ratings, skip the user entirely
    except KeyError:
        pass
df = pd.DataFrame(data=data,
                  columns=['user', 'Gone_Girl', 'Horrible_Bosses_2',
                           'Django_Unchained', 'Zoolander',
                           'Avenger_Age_of_Ultron', 'Kill_the_Messenger'])
But this only gives me a DataFrame of the users who rated every movie in the set.
My goal is, first, to append to the data list by iterating over the movie labels (rather than the brute-force approach shown above) and, second, to create a DataFrame containing all users, with null values for the movies a user did not rate.
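For the stated goal, pandas can build the frame directly from the nested dict and fill the missing entries with NaN itself; a minimal sketch, using the sample data from the question:

```python
import pandas as pd

# Outer keys become the rows (users); the union of inner keys becomes
# the columns, and any movie a user did not rate is filled with NaN.
movie_user_preferences = {
    'Jill': {'Avenger: Age of Ultron': 7.0,
             'Django Unchained': 6.5,
             'Gone Girl': 9.0,
             'Kill the Messenger': 8.0},
    'Toby': {'Avenger: Age of Ultron': 8.5,
             'Django Unchained': 9.0,
             'Zoolander': 2.0},
}

df = pd.DataFrame.from_dict(movie_user_preferences, orient='index')
df.index.name = 'user'
df = df.reset_index()  # turn the user index into the first column
```

This avoids hard-coding the movie titles entirely; new movies in the dict become new columns automatically.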
I'm working through the Databricks example. The DataFrame's schema looks like this:
> parquetDF.printSchema
root
|-- department: struct (nullable = true)
| |-- id: string (nullable = true)
| |-- name: string (nullable = true)
|-- employees: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- firstName: string (nullable = true)
| | |-- lastName: string (nullable = true)
| | |-- email: string (nullable = true)
| | |-- salary: integer (nullable = true)
In that example they show how to explode the employees column into 4 additional columns:
val explodeDF = parquetDF.explode($"employees") {
  case Row(employee: Seq[Row]) …
I'm trying to create a pandas DataFrame from a SQL table. I read in the data with data = pd.read_sql(query, con=con), and this works fine. However, I want to control which kinds of elements end up as NaN in the DataFrame. When reading a CSV you can use pd.read_csv('file.csv', na_values=['']). Is a similar flag available with read_sql?
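read_sql has no na_values parameter; a common workaround is to normalise sentinel values after loading. A self-contained sketch against an in-memory SQLite table (the table and column names are invented for illustration):

```python
import sqlite3

import numpy as np
import pandas as pd

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE scores (name TEXT, score REAL)")
con.execute("INSERT INTO scores VALUES ('a', 1.0), ('', NULL)")
con.commit()

df = pd.read_sql("SELECT * FROM scores", con=con)
# SQL NULLs already arrive as NaN/None; map any remaining
# sentinel values (here: empty strings) to NaN by hand.
df = df.replace({'': np.nan})
```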
My goal is to write a program that can extract tone, personality, and intent from human-language queries (e.g., I type: How was your day? and the AI system responds: Great. How about yours?).
I know this is a non-trivial problem, so which deep learning topics should I start getting familiar with, and which Python modules are most useful? I've started looking at NLTK. Thanks.
I'm trying to replace all instances of ":" -> "_" in a single column of a Spark DataFrame. I'm attempting this:
val url_cleaner = (s:String) => {
s.replaceAll(":","_")
}
val url_cleaner_udf = udf(url_cleaner)
val df = old_df.withColumn("newCol", url_cleaner_udf(old_df("oldCol")) )
But I keep getting the error:
SparkException: Job aborted due to stage failure: Task 0 in stage 25.0 failed 4 times, most recent failure: Lost task 0.3 in stage 25.0 (TID 692, ip-10-81-194-29.ec2.internal): java.lang.NullPointerException
Where am I going wrong in the udf?
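The NullPointerException usually comes from null values in oldCol: replaceAll is then invoked on a null receiver inside the UDF. One common fix, sketched here, is to make the function null-safe with Option (column names copied from the question):

```scala
import org.apache.spark.sql.functions.udf

// Option(null) is None, so null rows pass through as null
// instead of crashing the task
val url_cleaner = (s: String) =>
  Option(s).map(_.replaceAll(":", "_")).orNull
val url_cleaner_udf = udf(url_cleaner)

val df = old_df.withColumn("newCol", url_cleaner_udf(old_df("oldCol")))
```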
I'm trying to copy a BigQuery table (Table1) stored in one Google Cloud project (Project1) to another Google Cloud project (Project2). The table is on the order of terabytes. What is the best way to do this so that I don't have to export the table locally? Should I export the table from Project1 to Google Cloud Storage and then into Project2? Or is there a better way?
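Assuming both projects use the same BigQuery location, the copy can stay entirely server-side; one hedged option is the bq command-line tool (the dataset names below are placeholders):

```shell
# Server-side table copy: no data leaves BigQuery
bq cp Project1:source_dataset.Table1 Project2:dest_dataset.Table1
```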
I'm on Ubuntu 16.04, trying to build TensorFlow from source with GPU support following this guide. Everything works until the "Build TensorFlow" step, where I execute:
bazel build -c opt --config=cuda //tensorflow/tools/pip_package:build_pip_package
The build fails with this error output:
ERROR: /home/thomas/tensorflow/tensorflow/core/BUILD:978:28: Executing genrule //tensorflow/core:proto_text_srcs_all failed: bash failed: error executing command /bin/bash -c ... (remaining 1 argument(s) skipped): com.google.devtools.build.lib.shell.BadExitStatusException: Process exited with status 1.
bazel-out/host/bin/tensorflow/tools/proto_text/gen_proto_text_functions: /home/thomas/anaconda2/lib/libstdc++.so.6: version `GLIBCXX_3.4.20' not found (required by bazel-out/host/bin/tensorflow/tools/proto_text/gen_proto_text_functions)
bazel-out/host/bin/tensorflow/tools/proto_text/gen_proto_text_functions: /home/thomas/anaconda2/lib/libstdc++.so.6: version `CXXABI_1.3.8' not found (required by bazel-out/host/bin/tensorflow/tools/proto_text/gen_proto_text_functions)
bazel-out/host/bin/tensorflow/tools/proto_text/gen_proto_text_functions: /home/thomas/anaconda2/lib/libstdc++.so.6: version `GLIBCXX_3.4.21' not found (required by bazel-out/host/bin/tensorflow/tools/proto_text/gen_proto_text_functions)
Target //tensorflow/tools/pip_package:build_pip_package failed to build
Use --verbose_failures to see the command lines of failed build steps.
I suspect this error is related to anaconda, because bazel seems to be looking for libstdc …
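That suspicion is plausible: the error shows the host tool loading Anaconda's older libstdc++.so.6, which lacks the GLIBCXX/CXXABI versions the system compiler expects. A hedged workaround (paths taken from the error message; adjust for your machine):

```shell
# Either keep Anaconda's lib directory from shadowing the system one…
export LD_LIBRARY_PATH="/usr/lib/x86_64-linux-gnu:${LD_LIBRARY_PATH}"
# …or give Anaconda a libstdc++ new enough for GLIBCXX_3.4.21:
conda install libgcc
```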
I have a dictionary whose keys are pairs of movie titles and whose values are similarity scores:
{('Source Code ', 'Hobo with a Shotgun '): 1.0, ('Adjustment Bureau, The ', 'Just Go with It '): 1.0, ('Limitless ', 'Arthur '): 1.0, ('Adjustment Bureau, The ', 'Kung Fu Panda 2 '): 1.0, ('Rise of the Planet of the Apes ', 'Scream 4 '): 1.0, ('Source Code ', 'Take Me Home Tonight '): 1.0, ('Midnight in Paris ', 'Take Me Home Tonight '): 1.0, ('Harry Potter and the Deathly Hallows: Part 2 ', 'Pina '): 1.0, ('Avengers, The …

I'm trying to replace the null elements of a Scala list with empty strings using map. I currently have:
val arr = Seq("A:B|C", "C:B|C", null)
val arr2 = arr.map(_.replaceAll(null, "") )
This gives me a NullPointerException. What is the best way to do this?
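replaceAll(null, "") fails on two counts: the regex argument may not be null, and the method is invoked on a null element. Wrapping each element in Option is a simple null-safe alternative:

```scala
val arr = Seq("A:B|C", "C:B|C", null)
// Option(null) is None, so null elements fall back to the default ""
val arr2 = arr.map(s => Option(s).getOrElse(""))
// arr2: Seq("A:B|C", "C:B|C", "")
```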
I'm running a Spark cluster with Databricks. I want to pull data from a server using curl. For example,
curl -H "Content-Type: application/json" -H "auth:xxxx" -X GET "https://websites.net/Automation/Offline?startTimeInclusive=201609240100&endTimeExclusive=201609240200&dataFormat=json" -k > automation.json
How do I do this in a Databricks notebook (preferably in Python, though Scala works too)?
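One option is to shell out from Python with subprocess; a sketch, where the URL, token, output path, and the build_curl_cmd helper are all placeholders based on the question:

```python
import subprocess

def build_curl_cmd(url, auth_token, out_path):
    """Assemble the curl invocation from the question as an argv list."""
    return [
        "curl", "-k",
        "-H", "Content-Type: application/json",
        "-H", "auth:" + auth_token,
        "-X", "GET", url,
        "-o", out_path,
    ]

cmd = build_curl_cmd(
    "https://websites.net/Automation/Offline?dataFormat=json",
    "xxxx",
    "/tmp/automation.json",
)
# In a notebook cell: subprocess.run(cmd, check=True)
```

Databricks notebooks also support a %sh cell magic, so the original curl command line can be pasted into a %sh cell unchanged.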
I'm trying to use a boolean array to select particular elements of another array. For example:
val arr = Seq("A", "B", "C")
val mask = Seq(true,false,true)
I want the output to be a new array:
val arr_new = Seq("A","C")
Is there a way to achieve this in Scala?
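zip pairs each element with its flag, and collect keeps only the pairs whose flag is true; a minimal sketch:

```scala
val arr = Seq("A", "B", "C")
val mask = Seq(true, false, true)

// Pair each element with its flag, keep the ones flagged true
val arr_new = arr.zip(mask).collect { case (x, true) => x }
// arr_new: Seq("A", "C")
```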