Posts by jas*_*san

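TypeError: argument of type 'float' is not iterable

I am new to Python and TensorFlow. I recently started working through and running the TensorFlow examples, and came across this one: https://www.tensorflow.org/versions/r0.10/tutorials/wide_and_deep/index.html

I get the error TypeError: argument of type 'float' is not iterable, and I believe the problem is in the following line of code: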

df_train[LABEL_COLUMN] = (df_train['income_bracket'].apply(lambda x: '>50K' in x)).astype(int)

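(income_bracket is the label column of the census dataset, where '>50K' is one of the possible label values and the other is '<=50K'. The dataset is read into df_train. The documentation explains the reason for the line above: "Since the task is a binary classification problem, we'll construct a label column named "label" whose value is 1 if the income is over 50K, and 0 otherwise.")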
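One way this error can arise, assuming the income_bracket column contains missing values (pandas represents a missing cell as a float NaN), is that the membership test runs against a float instead of a string. The sketch below reproduces that in plain Python. The sample values and the helper to_label are made up for illustration; this shows the mechanism under that assumption and is not a confirmed diagnosis of the tutorial data.

```python
def to_label(value):
    """Return 1 if the income bracket string contains '>50K', else 0.

    Non-string values (for example a float NaN from a missing cell)
    map to 0 instead of raising TypeError.
    """
    return int(isinstance(value, str) and '>50K' in value)


# Hypothetical sample values; float('nan') stands in for a missing cell.
income_bracket = [' >50K', ' <=50K', float('nan')]

# The unguarded membership test fails as soon as it reaches the float.
try:
    [('>50K' in x) for x in income_bracket]
    raised = False
except TypeError:
    raised = True

# The guarded helper handles every entry.
labels = [to_label(x) for x in income_bracket]
print(raised, labels)
```

With pandas, the same guard could go inside the apply lambda, or rows with a missing income_bracket could be dropped beforehand with dropna(subset=['income_bracket']).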

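It would be great if someone could explain exactly what is happening and how I can fix it. I tried both Python 2.7 and Python 3.4, so I don't think the problem is the language version. Also, if anyone knows of good tutorials for people new to TensorFlow and pandas, please share the links.

Full program: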

import pandas as pd
import urllib
import tempfile
import tensorflow as tf

gender = tf.contrib.layers.sparse_column_with_keys(column_name="gender", keys=["female", "male"])
race = tf.contrib.layers.sparse_column_with_keys(column_name="race", keys=["Amer-Indian-Eskimo", "Asian-Pac-Islander", "Black", "Other", "White"])
education = tf.contrib.layers.sparse_column_with_hash_bucket("education", hash_bucket_size=1000)
marital_status = tf.contrib.layers.sparse_column_with_hash_bucket("marital_status", hash_bucket_size=100)
relationship = tf.contrib.layers.sparse_column_with_hash_bucket("relationship", hash_bucket_size=100)
workclass = tf.contrib.layers.sparse_column_with_hash_bucket("workclass", hash_bucket_size=100)
occupation = tf.contrib.layers.sparse_column_with_hash_bucket("occupation", hash_bucket_size=1000)
native_country = tf.contrib.layers.sparse_column_with_hash_bucket("native_country", hash_bucket_size=1000)


age = tf.contrib.layers.real_valued_column("age")
age_buckets = tf.contrib.layers.bucketized_column(age, boundaries=[18, 25, 30, 35, 40, 45, …


python pandas tensorflow

jas*_*san

2016-08-31

Score: 6

Answers: 2

Views: 20k

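My program flow is as follows:

1. Read 4 billion rows (~700 GB) of data from Parquet files into a DataFrame. The partition size used is 2296.

2. Clean it and filter out 2.5 billion rows.

3. Transform the remaining 1.5 billion rows using a pipeline model and then a trained model. The model is trained with logistic regression to predict 0 or 1, and 30% of the data is filtered out of the transformed DataFrame.

4. Left outer join the above DataFrame with another dataset of about 1 TB (also read from Parquet files). The partition size is 4000.

5. Join it with another dataset of around 100 MB, like this: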

joined_data = data1.join(broadcast(small_dataset_100MB), data1.field == small_dataset_100MB.field, "left_outer")

6. Then explode the above DataFrame by a factor of ~2000:

exploded_data = joined_data.withColumn('field', explode('field_list'))

7. Perform the aggregation:

aggregate = exploded_data.groupBy(*cols_to_select)\
    .agg(F.countDistinct(exploded_data.field1).alias('distincts'), F.count("*").alias('count_all'))

The cols_to_select list contains 10 columns in total.

8. Finally, perform an action: aggregate.count().
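To make steps 6 to 8 concrete, here is a small plain-Python sketch of what explode followed by a grouped countDistinct/count computes. The three rows and the single grouping key are hypothetical stand-ins for the real data and for the 10 columns in cols_to_select; it illustrates the semantics only and says nothing about how Spark executes them.

```python
from collections import defaultdict

# Hypothetical rows: (group_key, field1, field_list)
rows = [
    ("a", "x", [1, 2, 3]),
    ("a", "y", [4]),
    ("b", "x", [5, 6]),
]

# Step 6: explode -> one output row per element of field_list.
exploded = [
    (key, field1, item)
    for key, field1, items in rows
    for item in items
]

# Step 7: group by key, counting all rows and distinct field1 values.
count_all = defaultdict(int)
distinct_values = defaultdict(set)
for key, field1, _item in exploded:
    count_all[key] += 1
    distinct_values[key].add(field1)

aggregate = {
    key: {"distincts": len(distinct_values[key]), "count_all": count_all[key]}
    for key in count_all
}

# Step 8: the final count() is the number of groups.
num_groups = len(aggregate)
print(len(exploded), aggregate, num_groups)
```

Note how the explode multiplies the row count before the aggregation sees it; with a factor of ~2000, the groupBy has to shuffle far more rows than the join produced.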

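The problem is that the third-to-last stage of the count (200 tasks) gets stuck at task 199 forever. Even though 4 cores and 56 executors are allocated, the count uses only one core and one executor to run the job. I tried cutting the input from 4 billion rows down to 700 million rows, about 1/6 of the data, and it still took four hours. I would really appreciate some help on how to speed this up. Thanks.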
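The post does not say why a single task lags. One common cause of one straggler in a 200-task shuffle stage (200 is the default value of spark.sql.shuffle.partitions) is key skew, where one grouping key carries far more rows than the rest. The plain-Python sketch below only illustrates that idea: the keys are invented, and the modulo is a stand-in for Spark's own hash partitioner, so treat it as an assumption to check rather than a diagnosis.

```python
from collections import Counter

NUM_PARTITIONS = 200  # Spark's default for spark.sql.shuffle.partitions

# Hypothetical grouping keys: key 0 is "hot" and dominates the data.
keys = [0] * 10_000 + list(range(1, 1_001))

# Stand-in for a hash partitioner. Spark uses its own hash function,
# so the modulo here only illustrates how rows map to partitions.
rows_per_partition = Counter(key % NUM_PARTITIONS for key in keys)

sizes = sorted(rows_per_partition.values())
largest = sizes[-1]
median = sizes[len(sizes) // 2]
print(largest, median)
```

If the real data looks like this, one task processes thousands of times more rows than the others while the remaining cores sit idle. In PySpark, a comparable check would be to count rows per group, for example with exploded_data.groupBy(*cols_to_select).count(), and inspect the largest groups.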

hadoop-yarn apache-spark pyspark spark-dataframe

jas*_*san

2017-12-15

Score: 4

Answers: 1

Views: 2106

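Presto Cassandra connector: number of connections

I am looking to increase the number of connections from Presto to ScyllaDB. I am using Presto's Cassandra connector to connect to ScyllaDB. I don't see any property in the documentation that can be used to increase the number of connections: https://prestodb.io/docs/current/connector/cassandra.html

Here is my scylladb.properties file: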

connector.name=cassandra
cassandra.contact-points=scylla1,scylla2,scylla3,scylla4
cassandra.client.read-timeout=3600000ms
cassandra.split-size=1024
cassandra.fetch-size=5000
cassandra.load-policy.token-aware.shuffle-replicas=true
cassandra.load-policy.use-token-aware=true


What is the default number of connections from Presto to Cassandra/ScyllaDB, and how can I set this property? Thanks.

presto scylla

jas*_*san

2018-11-21

Score: 2

Answers: 1

Views: 147