Posts by Pet*_*rov

Pandas: filter a DataFrame by data type

I have a DataFrame. Here is part of it:

   member_id event_duration             domain   category
0     299819             17  element.yandex.ru       None
1     299819              0        mozilla.org  Программы
2     299819              4          vbmail.ru       None
3     299819            aaa          vbmail.ru       None

How can I filter df by type? Usually I filter with str.contains. Is it possible to specify something like df[df.event_duration.astype(int) == True]?
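One possible approach, sketched here under the assumption that "filter by type" means keeping the rows whose event_duration parses as a number: pd.to_numeric with errors="coerce" turns unparseable values such as 'aaa' into NaN, and the resulting mask selects the rows. The sample frame below is rebuilt from the excerpt above, with event_duration held as strings.

```python
import pandas as pd

# Sample rebuilt from the excerpt in the question; event_duration is stored as text.
df = pd.DataFrame({
    "member_id": [299819, 299819, 299819, 299819],
    "event_duration": ["17", "0", "4", "aaa"],
    "domain": ["element.yandex.ru", "mozilla.org", "vbmail.ru", "vbmail.ru"],
})

# errors="coerce" replaces values that cannot be parsed as numbers with NaN.
as_number = pd.to_numeric(df["event_duration"], errors="coerce")
is_numeric = as_number.notna()

numeric_rows = df[is_numeric]      # rows 0, 1, 2
non_numeric_rows = df[~is_numeric]  # row 3 ('aaa')

print(numeric_rows)
print(non_numeric_rows)
```

Note that this mask also accepts values such as "1.5". If only whole numbers should pass, an extra check on the parsed values would be needed.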


python pandas

13
votes
2
answers
30k
views

TypeError: first argument must be an iterable of pandas objects, you passed an object of type "DataFrame"

I have a large DataFrame that I am trying to process in chunks and then concat back together. I use

df2 = pd.read_csv('et_users.csv', header=None, names=names2, chunksize=100000)
for chunk in df2:
    chunk['ID'] = chunk.ID.map(rep.set_index('member_id')['panel_mm_id'])

df2 = pd.concat(chunk, ignore_index=True)

But it returns an error

TypeError: first argument must be an iterable of pandas objects, you passed an object of type "DataFrame"

How can I fix this?
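The error message points at the likely cause: pd.concat expects a list (or other iterable) of DataFrames, while the code passes only the last chunk. A minimal sketch of the usual pattern follows. It collects each processed chunk in a list and concatenates once at the end. The CSV text, the column names and the contents of rep are small stand-ins invented for illustration, not the asker's data.

```python
import io

import pandas as pd

# Stand-in for 'et_users.csv' (hypothetical contents).
csv_text = "1,a\n2,b\n3,c\n4,d\n"

# Stand-in for the lookup frame `rep` used in the question.
rep = pd.DataFrame({"member_id": [1, 2, 3, 4], "panel_mm_id": [10, 20, 30, 40]})
mapping = rep.set_index("member_id")["panel_mm_id"]

reader = pd.read_csv(io.StringIO(csv_text), header=None,
                     names=["ID", "value"], chunksize=2)

chunks = []
for chunk in reader:
    chunk["ID"] = chunk["ID"].map(mapping)
    chunks.append(chunk)  # keep every processed chunk, not just the last one

# pd.concat receives a list of DataFrames, which is what it expects.
df2 = pd.concat(chunks, ignore_index=True)
print(df2)
```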

python pandas

12
votes
4
answers
40k
views

Python: UserWarning: This pattern has match groups. To actually get the groups, use str.extract

I have a DataFrame and I am trying to get the rows where a column contains certain strings. The df looks like

member_id,event_path,event_time,event_duration
30595,"2016-03-30 12:27:33",yandex.ru/,1
30595,"2016-03-30 12:31:42",yandex.ru/,0
30595,"2016-03-30 12:31:43",yandex.ru/search/?lr=10738&msid=22901.25826.1459330364.89548&text=%D1%84%D0%B8%D0%BB%D1%8C%D0%BC%D1%8B+%D0%BE%D0%BD%D0%BB%D0%B0%D0%B9%D0%BD&suggest_reqid=168542624144922467267026838391360&csg=3381%2C3938%2C2%2C3%2C1%2C0%2C0,0
30595,"2016-03-30 12:31:44",yandex.ru/search/?lr=10738&msid=22901.25826.1459330364.89548&text=%D1%84%D0%B8%D0%BB%D1%8C%D0%BC%D1%8B+%D0%BE%D0%BD%D0%BB%D0%B0%D0%B9%D0%BD&suggest_reqid=168542624144922467267026838391360&csg=3381%2C3938%2C2%2C3%2C1%2C0%2C0,0
30595,"2016-03-30 12:31:45",yandex.ru/search/?lr=10738&msid=22901.25826.1459330364.89548&text=%D1%84%D0%B8%D0%BB%D1%8C%D0%BC%D1%8B+%D0%BE%D0%BD%D0%BB%D0%B0%D0%B9%D0%BD&suggest_reqid=168542624144922467267026838391360&csg=3381%2C3938%2C2%2C3%2C1%2C0%2C0,0
30595,"2016-03-30 12:31:46",yandex.ru/search/?lr=10738&msid=22901.25826.1459330364.89548&text=%D1%84%D0%B8%D0%BB%D1%8C%D0%BC%D1%8B+%D0%BE%D0%BD%D0%BB%D0%B0%D0%B9%D0%BD&suggest_reqid=168542624144922467267026838391360&csg=3381%2C3938%2C2%2C3%2C1%2C0%2C0,0
30595,"2016-03-30 12:31:49",kinogo.co/,1
30595,"2016-03-30 12:32:11",kinogo.co/melodramy/,0

and another df with URLs

url
003\.ru\/[a-zA-Z0-9-_%$#?.:+=|()]+\/mobilnyj_telefon_bq_phoenix
003\.ru\/[a-zA-Z0-9-_%$#?.:+=|()]+\/mobilnyj_telefon_fly_
003\.ru\/sonyxperia
003\.ru\/[a-zA-Z0-9-_%$#?.:+=|()]+\/mobilnye_telefony_smartfony
003\.ru\/[a-zA-Z0-9-_%$#?.:+=|()]+\/mobilnye_telefony_smartfony\/brands5D5Bbr_23
1click\.ru\/sonyxperia
1click\.ru\/[a-zA-Z0-9-_%$#?.:+=|()]+\/chasy-motorola

I use

urls = pd.read_csv('relevant_url1.csv', error_bad_lines=False)
substr = urls.url.values.tolist()
data = pd.read_csv('data_nts2.csv', error_bad_lines=False, chunksize=50000)
result = pd.DataFrame()
for i, df in enumerate(data):
    res = df[df['event_time'].str.contains('|'.join(substr), regex=True)]

But it returns

UserWarning: This pattern has match groups. To actually get the groups, use str.extract.

How can I fix this?
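pandas emits this warning when the pattern passed to str.contains has at least one capturing group. In the patterns shown above, the parentheses sit inside a character class, where they are literal, so the group presumably comes from another entry in the full URL list. The sketch below uses two made-up patterns to show two options: rewriting groups as non-capturing (?:...), or silencing the warning locally. Filtering still works either way, since the message is only a warning.

```python
import warnings

import pandas as pd

s = pd.Series(["1click.ru/sonyxperia", "kinogo.co/", "003.ru/sonyxperia"])

# Hypothetical patterns for illustration only.
capturing = r"(1click|003)\.ru/sonyxperia"        # has a match group
non_capturing = r"(?:1click|003)\.ru/sonyxperia"  # same matches, no group

# Option 1: non-capturing groups do not trigger the warning.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    mask = s.str.contains(non_capturing, regex=True)
user_warnings = [w for w in caught if issubclass(w.category, UserWarning)]

# Option 2: keep the pattern and ignore this specific warning locally.
with warnings.catch_warnings():
    warnings.filterwarnings("ignore", message="This pattern", category=UserWarning)
    mask_capturing = s.str.contains(capturing, regex=True)

print(s[mask])
```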

python regex pandas

12
votes
5
answers
10k
views

Pandas: append a DataFrame to another df

I have a problem appending DataFrames. I am trying to run this code

df_all = pd.read_csv('data.csv', error_bad_lines=False, chunksize=1000000)
urls = pd.read_excel('url_june.xlsx')
substr = urls.url.values.tolist()
df_res = pd.DataFrame()
for df in df_all:
    for i in substr:
        res = df[df['url'].str.contains(i)]
        df_res.append(res)

When I try to save df_res, I get an empty DataFrame. df_all looks like

ID,"url","used_at","active_seconds"
b20f9412f914ad83b6611d69dbe3b2b4,"mobiguru.ru/phones/apple/comp/32gb/apple_iphone_5s.html",2015-10-01 00:00:25,1
b20f9412f914ad83b6611d69dbe3b2b4,"mobiguru.ru/phones/apple/comp/32gb/apple_iphone_5s.html",2015-10-01 00:00:31,30
f85ce4b2f8787d48edc8612b2ccaca83,"4pda.ru/forum/index.php?showtopic=634566&view=getnewpost",2015-10-01 00:01:49,2
d3b0ef7d85dbb4dbb75e8a5950bad225,"shop.mts.ru/smartfony/mts/smartfon-smart-sprint-4g-sim-lock-white.html?utm_source=admitad&utm_medium=cpa&utm_content=300&utm_campaign=gde_cpa&uid=3",2015-10-01 00:03:19,34
078d388438ebf1d4142808f58fb66c87,"market.yandex.ru/product/12675734/spec?hid=91491&track=char",2015-10-01 00:03:48,2
d3b0ef7d85dbb4dbb75e8a5950bad225,"avito.ru/yoshkar-ola/telefony/mts",2015-10-01 00:04:21,4
d3b0ef7d85dbb4dbb75e8a5950bad225,"shoppingcart.aliexpress.com/order/confirm_order",2015-10-01 00:04:25,1
d3b0ef7d85dbb4dbb75e8a5950bad225,"shoppingcart.aliexpress.com/order/confirm_order",2015-10-01 00:04:26,9

urls looks like

url
shoppingcart.aliexpress.com/order/confirm_order
ozon.ru/?context=order_done&number=
lk.wildberries.ru/basket/orderconfirmed
lamoda.ru/checkout/onepage/success/quick
mvideo.ru/confirmation?_requestid=
eldorado.ru/personal/order.php?step=confirm

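When I print res inside the loop, it is not empty. But when I print df_res inside the loop after the append, it is an empty DataFrame. I cannot find my mistake. How can I fix it?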
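The likely explanation is that DataFrame.append never modified df_res in place. It returned a new DataFrame, which the loop discards. The method was also removed in pandas 2.0. A sketch of the usual replacement is to gather the matches in a list and call pd.concat once. The rows below are a reduced sample, and the mvideo.ru URL with a request id is made up for illustration. regex=False is an assumption that the entries in urls are plain substrings, since they contain '?' and '.', which are special characters in a regular expression.

```python
import pandas as pd

# Reduced sample; the last URL is hypothetical.
df = pd.DataFrame({
    "ID": ["d3b0", "d3b0", "078d"],
    "url": [
        "shoppingcart.aliexpress.com/order/confirm_order",
        "avito.ru/yoshkar-ola/telefony/mts",
        "mvideo.ru/confirmation?_requestid=42",
    ],
})
substr = [
    "shoppingcart.aliexpress.com/order/confirm_order",
    "mvideo.ru/confirmation?_requestid=",
]

matches = []
for pattern in substr:
    # regex=False treats the pattern as a literal substring.
    matches.append(df[df["url"].str.contains(pattern, regex=False)])

# Concatenate once; the result must be assigned to a name.
df_res = pd.concat(matches, ignore_index=True)
print(df_res)
```

With chunked input, the same list can simply be extended across all chunks before the single concat at the end.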

python pandas

10
votes
3
answers
20k
views

Pandas: timestamp to datetime

I have a DataFrame with a date column that looks like

 date
 1476329529    
 1476329530    
 1476329803 
 1476329805 
 1476329805  
 1476329805 

I use df['date'] = pd.to_datetime(df.date, format='%Y-%m-%d %H:%M:%S') to convert it, but I get a strange result

 date 
 1970-01-01 00:00:01.476329529   
 1970-01-01 00:00:01.476329530   
 1970-01-01 00:00:01.476329803  
 1970-01-01 00:00:01.476329805    
 1970-01-01 00:00:01.476329805   
 1970-01-01 00:00:01.476329805   

Maybe I am doing something wrong
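The values look like Unix timestamps in seconds, and the output above is consistent with them having been read as nanoseconds since 1970-01-01. A minimal sketch, assuming the column really holds epoch seconds in UTC, is to pass unit="s" instead of a format string:

```python
import pandas as pd

df = pd.DataFrame({"date": [1476329529, 1476329530, 1476329803, 1476329805]})

# unit="s" tells pandas the integers are seconds since the Unix epoch.
df["date"] = pd.to_datetime(df["date"], unit="s")

print(df)
# The first row becomes 2016-10-13 03:32:09
```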

python datetime pandas

10
votes
1
answer
6032
views

Pandas: merge repeated rows

I have a DataFrame

ID     url     date   active_seconds
111    vk.com   12.01.2016   5
111    facebook.com   12.01.2016   4
111    facebook.com   12.01.2016   3
111    twitter.com    12.01.2016    12
222    vk.com      12.01.2016   8
222    twitter.com    12.01.2016   34
111    facebook.com   12.01.2016   5

I need to get

ID     url     date   active_seconds
111    vk.com   12.01.2016   5
111    facebook.com   12.01.2016   7
111    twitter.com    12.01.2016    12
222    vk.com      12.01.2016   8
222    twitter.com    12.01.2016   34
111    facebook.com   12.01.2016   5

If I try

df.groupby(['ID', 'url'])['active_seconds'].sum()

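it combines all the rows with the same ID and url, not only the consecutive ones. What should I do to get the desired result?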
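Judging by the desired output, only consecutive rows with the same ID and url should be merged. One way to sketch that is to compare each row with the previous one via shift, use cumsum to number each run, and group by that run number instead of by the column values. The column names follow the question. Taking the first date of each run is an assumption.

```python
import pandas as pd

df = pd.DataFrame({
    "ID": [111, 111, 111, 111, 222, 222, 111],
    "url": ["vk.com", "facebook.com", "facebook.com", "twitter.com",
            "vk.com", "twitter.com", "facebook.com"],
    "date": ["12.01.2016"] * 7,
    "active_seconds": [5, 4, 3, 12, 8, 34, 5],
})

keys = df[["ID", "url"]]
# True wherever a row differs from the row directly above it.
starts_new_run = (keys != keys.shift()).any(axis=1)
# Consecutive identical rows share the same run number.
run_id = starts_new_run.cumsum()

result = (
    df.groupby(run_id)
      .agg({"ID": "first", "url": "first", "date": "first", "active_seconds": "sum"})
      .reset_index(drop=True)
)
print(result)
```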

python pandas

5
votes
1
answer
139
views

Python: using map with multiprocessing

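I am trying to write a function that takes two arguments, then pass it to multiprocessing.Pool and run it in parallel. I ran into some complications while writing this simple function.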

df = pd.DataFrame()
df['ind'] = [111, 222, 333, 444, 555, 666, 777, 888]
df['ind1'] = [111, 444, 222, 555, 777, 333, 666, 777]

def mult(elem1, elem2):
    return elem1 * elem2

if __name__ == '__main__':
    pool = Pool(processes=4) 
    print(pool.map(mult, df.ind.astype(int).values.tolist(), df.ind1.astype(int).values.tolist()))
    pool.terminate()

It returns an error:

TypeError: unsupported operand type(s) for //: 'int' and 'list'

I cannot understand what is wrong. Can anyone explain what this error means and how I can fix it?
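Unlike the built-in map, Pool.map takes a single iterable. Its third positional parameter is chunksize, so the second list ends up being used as a chunk size, which would explain the integer arithmetic against a list in the traceback. A sketch of one common fix is to zip the two lists into pairs and use Pool.starmap, which unpacks each pair into the two arguments. Plain lists replace the DataFrame columns here to keep the example short.

```python
from multiprocessing import Pool


def mult(elem1, elem2):
    return elem1 * elem2


ind = [111, 222, 333, 444, 555, 666, 777, 888]
ind1 = [111, 444, 222, 555, 777, 333, 666, 777]

# One (elem1, elem2) tuple per task.
pairs = list(zip(ind, ind1))

if __name__ == "__main__":
    with Pool(processes=4) as pool:
        # starmap calls mult(a, b) for every (a, b) in pairs.
        print(pool.starmap(mult, pairs))
```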

python multiprocessing

5
votes
1
answer
751
views

Python: executing code from a string with eval

I have a syntax tree

Tree(if, [Tree(condition, [Token(VARIABLE, 'age'), Token(ACTION_OPERATOR, '>'), Token(SIGNED_NUMBER, '18')]), Tree(result, [Tree(if, [Tree(condition, [Token(VARIABLE, 'salary'), Token(ACTION_OPERATOR, '>'), Token(SIGNED_NUMBER, '100000')]), Tree(result, [Token(STRING, 'success')]), Tree(condition, [Token(VARIABLE, 'salary'), Token(ACTION_OPERATOR, '<'), Token(SIGNED_NUMBER, '50000')]), Tree(result, [Token(STRING, 'fail')]), Tree(else, [Token(STRING, 'get_more_info')])])]), Tree(else, [Token(STRING, 'fail')])])

I convert it to a string:

if age > 18:
    if salary > 100000:
        print('success')
    elif salary < 50000:
        print('fail')
    else: 
        print('get_more_info')
else:
    print('fail')

I declare the variables:

age = 20
salary = 60000

and try to execute this code

eval(code)

and get an error

File "<string>", line 1
if age > 18: 
 ^
SyntaxError: invalid syntax
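eval only evaluates a single expression, and an if block is a statement, hence the SyntaxError on the first line. Statements can be run with exec instead. The sketch below passes the variables through an explicit namespace dict and captures the printed output so it can be inspected. Both are illustrative choices rather than requirements. As with eval, exec should only be given strings from a trusted source.

```python
import contextlib
import io

code = """\
if age > 18:
    if salary > 100000:
        print('success')
    elif salary < 50000:
        print('fail')
    else:
        print('get_more_info')
else:
    print('fail')
"""

# Variables visible to the executed code.
namespace = {"age": 20, "salary": 60000}

buffer = io.StringIO()
with contextlib.redirect_stdout(buffer):
    exec(code, namespace)  # exec handles statements; eval does not

output = buffer.getvalue().strip()
print(output)  # get_more_info
```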

python

5
votes
1
answer
2239
views

Pandas: groupby with a condition

I have a DataFrame:

ID,used_at,active_seconds,subdomain,visiting,category
123,2016-02-05 19:39:21,2,yandex.ru,2,Computers
123,2016-02-05 19:43:01,1,mail.yandex.ru,2,Computers
123,2016-02-05 19:43:13,6,mail.yandex.ru,2,Computers
234,2016-02-05 19:46:09,16,avito.ru,2,Automobiles
234,2016-02-05 19:48:36,21,avito.ru,2,Automobiles
345,2016-02-05 19:48:59,58,avito.ru,2,Automobiles
345,2016-02-05 19:51:21,4,avito.ru,2,Automobiles
345,2016-02-05 19:58:55,4,disk.yandex.ru,2,Computers
345,2016-02-05 19:59:21,2,mail.ru,2,Computers
456,2016-02-05 19:59:27,2,mail.ru,2,Computers
456,2016-02-05 20:02:15,18,avito.ru,2,Automobiles
456,2016-02-05 20:04:55,8,avito.ru,2,Automobiles
456,2016-02-05 20:07:21,24,avito.ru,2,Automobiles
567,2016-02-05 20:09:03,58,avito.ru,2,Automobiles
567,2016-02-05 20:10:01,26,avito.ru,2,Automobiles
567,2016-02-05 20:11:51,30,disk.yandex.ru,2,Computers

I need to do

group = df.groupby(['category']).agg({'active_seconds': sum}).rename(columns={'active_seconds': 'count_sec_target'}).reset_index()

but I want to add a condition based on

df.groupby(['category'])['ID'].count()

If the count for a category is less than 5, I want to drop that category. I don't know how to write this condition there.
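One way to sketch this is to compute the per-category row count with transform, which broadcasts the count back onto every row, keep only the rows whose category reaches the threshold, and then aggregate as before. The small frame below is invented for illustration (five Automobiles rows, two Computers rows) so that the filter has a visible effect. The threshold of 5 comes from the question.

```python
import pandas as pd

# Illustrative data, not the asker's: 5 'Automobiles' rows and 2 'Computers' rows.
df = pd.DataFrame({
    "ID": [234, 234, 345, 345, 456, 123, 123],
    "category": ["Automobiles"] * 5 + ["Computers"] * 2,
    "active_seconds": [16, 21, 58, 4, 18, 2, 6],
})

# Number of rows in each row's category, aligned with df's index.
rows_per_category = df.groupby("category")["ID"].transform("count")
kept = df[rows_per_category >= 5]

group = (
    kept.groupby("category", as_index=False)["active_seconds"]
        .sum()
        .rename(columns={"active_seconds": "count_sec_target"})
)
print(group)
```

df.groupby('category').filter(lambda g: len(g) >= 5) would select the same rows. The transform version tends to be faster when there are many groups.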

python group-by filter conditional-statements pandas

4
votes
1
answer
30k
views

Python: sum of list elements between zeros

I have a list:

lst = [1, 2, 3, 5, 0, 0, 9, 45, 3, 0, 1, 7]

I need a new list containing the sums of the elements between the 0s. I tried

lst1 = []
summ = 0
for i, elem in enumerate(lst):
    if elem != 0:
        summ = summ + elem
    else:
        lst1.append(summ)
        lst1.append(elem)
        summ = 0

But it returns [11, 0, 0, 0, 57, 0], whereas I expected [11, 0, 0, 57, 0, 8]
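Two things appear to go wrong in the attempt above: a sum is appended at every zero even when no non-zero elements came before it (the extra 0), and the final run is never appended because the list does not end with a zero (the missing 8). A sketch that stays close to the original loop tracks whether a run is in progress and flushes it after the loop. The function name is made up for illustration.

```python
def sum_between_zeros(values):
    """Replace each run of non-zero elements with its sum; keep zeros as they are."""
    result = []
    running_sum = 0
    in_run = False  # True while we are inside a run of non-zero elements

    for elem in values:
        if elem != 0:
            running_sum += elem
            in_run = True
        else:
            if in_run:  # only emit a sum if a run actually preceded this zero
                result.append(running_sum)
                running_sum = 0
                in_run = False
            result.append(elem)

    if in_run:  # flush the last run when the list does not end with 0
        result.append(running_sum)
    return result


lst = [1, 2, 3, 5, 0, 0, 9, 45, 3, 0, 1, 7]
print(sum_between_zeros(lst))  # [11, 0, 0, 57, 0, 8]
```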

python list

4
votes
1
answer
485
views