小编Kub*_*888的帖子

什么是糟糕,体面,优秀和出色的F1测量范围？

我知道F1-measure是精确度和召回率的调和平均值.但是什么价值观定义了F1衡量标准的优劣？我似乎找不到任何引用(谷歌或学术)回答我的问题.

precision performance measurement machine-learning precision-recall

Kub*_*888

lucky-day

15
推荐指数

2
解决办法

2万
查看次数

如何解释几乎完美的准确性和AUC-ROC但零f1分数,精确度和召回率

我正在训练ML逻辑分类器使用python scikit-learn对两个类进行分类.他们的数据非常不平衡(约14300:1).我的准确度和ROC-AUC几乎达到100%,但精度,召回率和f1得分均为0%.我知道准确性通常对非常不平衡的数据没有用,但为什么ROC-AUC测量结果也接近完美？

from sklearn.metrics import roc_curve, auc

# Get ROC 
y_score = classifierUsed2.decision_function(X_test)
false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test, y_score)
roc_auc = auc(false_positive_rate, true_positive_rate)
print 'AUC-'+'=',roc_auc

1= class1
0= class2
Class count:
0    199979
1        21

Accuracy: 0.99992
Classification report:
             precision    recall  f1-score   support

          0       1.00      1.00      1.00     99993
          1       0.00      0.00      0.00         7

avg / total       1.00      1.00      1.00    100000

Confusion matrix:
[[99992     1]
 [    7     0]]
AUC= 0.977116255281

Run Code Online (Sandbox Code Playgroud)

以上是使用逻辑回归,下面是使用决策树,决策矩阵看起来几乎相同,但AUC有很大不同.

1= class1
0= class2
Class count:
0    199979
1        21 …

Run Code Online (Sandbox Code Playgroud)

classification machine-learning roc precision-recall

Kub*_*888

2016 03-12

12
推荐指数

1
解决办法

9634
查看次数

良好的ROC曲线但精确回忆曲线较差

我有一些我不太了解的机器学习结果.我正在使用python sciki-learn,拥有大约14个功能的200多万个数据.对于精确回忆曲线,'ab'的分类看起来非常差,但是Ab的ROC看起来和大多数其他群体的分类一样好.有什么可以解释的？

machine-learning performance-testing roc scikit-learn precision-recall

Kub*_*888

2016 03-12

9
推荐指数

1
解决办法

2893
查看次数

Python/BeautifulSoup抓取中的多线程技术根本没有加速

我有一个csv文件("SomeSiteValidURLs.csv"),它列出了我需要抓取的所有链接.代码正在运行,将通过csv中的url,抓取信息并记录/保存在另一个csv文件("Output.csv")中.但是,由于我计划在网站的大部分区域(大于10,000,000页)进行此操作,因此速度非常重要.对于每个链接,爬行并将信息保存到csv大约需要1秒,这对于项目的大小来说太慢了.所以我已经整合了多线程模块,令我惊讶的是它根本没有加速,它仍然需要1个人链接.我做错什么了吗？还有其他方法可以加快处理速度吗？

没有多线程:

import urllib2
import csv
from bs4 import BeautifulSoup
import threading

def crawlToCSV(FileName):

    with open(FileName, "rb") as f:
        for URLrecords in f:

            OpenSomeSiteURL = urllib2.urlopen(URLrecords)
            Soup_SomeSite = BeautifulSoup(OpenSomeSiteURL, "lxml")
            OpenSomeSiteURL.close()

            tbodyTags = Soup_SomeSite.find("tbody")
            trTags = tbodyTags.find_all("tr", class_="result-item ")

            placeHolder = []

            for trTag in trTags:
                tdTags = trTag.find("td", class_="result-value")
                tdTags_string = tdTags.string
                placeHolder.append(tdTags_string)

            with open("Output.csv", "ab") as f:
                writeFile = csv.writer(f)
                writeFile.writerow(placeHolder)

crawltoCSV("SomeSiteValidURLs.csv")

Run Code Online (Sandbox Code Playgroud)

使用多线程:

import urllib2
import csv
from bs4 import BeautifulSoup
import threading

def crawlToCSV(FileName):

    with open(FileName, "rb") …

Run Code Online (Sandbox Code Playgroud)

parallel-processing multithreading beautifulsoup web-scraping python-2.7

Kub*_*888

lucky-day

8
推荐指数

1
解决办法

1万
查看次数

在64位Windows 7和64位Python 2.7上安装Pygame

标题说明了一切.我确实看到类似的问题,有人建议http://www.lfd.uci.edu/~gohlke/pythonlibs/#pygame,但所有pygame可下载文件都是.whl格式,我不知道如何在Windows上运行7.我试过"cd [directory]> pip install [filename]"但没有成功.

pygame installation-package python-2.7

Kub*_*888

lucky-day

7
推荐指数

1
解决办法

1万
查看次数

使用 Scikit-learn 进行加权线性回归

我的数据：

State           N           Var1            Var2
Alabama         23          54              42
Alaska          4           53              53
Arizona         53          75              65

Run Code Online (Sandbox Code Playgroud)

Var1并且Var2是州级的汇总百分比值。N是每个状态的参与者数量。我想之间运行的线性回归Var1和Var2与所述考虑的N作为重量与在Python 2.7 sklearn。

一般线路是：

fit(X, y[, sample_weight])

Run Code Online (Sandbox Code Playgroud)

假设数据被加载到df使用 Pandas 并且N变成了df["N"]，我是简单地将数据放入下一行还是我需要在使用它之前以某种方式处理 N 就像sample_weight在命令中一样？

fit(df["Var1"], df["Var2"], sample_weight=df["N"])

Run Code Online (Sandbox Code Playgroud)

python regression machine-learning python-2.7 scikit-learn

Kub*_*888

2020 05-01

7
推荐指数

1
解决办法

2万
查看次数

如何从本地目录而不是 pip 安装的库导入模块？

我已经将 python 库克隆xyz到我的计算机中。文件结构如下：

>> project (folder)
    * main.py
    >> xyz_git (folder)
          >> xyz (folder)

Run Code Online (Sandbox Code Playgroud)

在里面main.py

import sys
sys.path.insert(0, './xyz_git')
from xyz import Xyz

instance = Xyz()
print(instance.some_function())

Run Code Online (Sandbox Code Playgroud)

问题是我也有 pip 安装的xyz默认 python 模块。即使我删除本地文件夹，from xyz import Xyz由于默认的 pip 安装，它仍然可以工作。如何确保xyz导入不是从本地目录导入时会报错？

python module python-import python-3.x

Kub*_*888

2019 04-04

7
推荐指数

1
解决办法

3369
查看次数

如何在python中修改/覆盖继承的类函数？

我say_hello在父类和继承类中都有确切的函数名称。我想name在 Kitten 类中指定参数，但允许用户在 Cat 类中指定参数。

有没有办法避免需要在 Kitten 类return ('Hello '+name)中的say_hello函数中重复该行？

目前：

class Cat:
    def __init__(self):
        pass

    def say_hello(name):
        return ('Hello '+name)

class Kitten(Cat):
    def __init__(self):
        super().__init__()

    def say_hello(name='Thomas'):
        return ('Hello '+name)

x = Cat
print (x.say_hello("Sally"))
y = Kitten
print (y.say_hello())

Run Code Online (Sandbox Code Playgroud)

理想情况下：

class Cat:
    def __init__(self):
        pass

    def say_hello(name):
        return ('Hello '+name)

class Kitten(Cat):
    def __init__(self):
        super().__init__()

    def say_hello():
        return super().say_hello(name='Thomas') # Something like this, so this portion of the code doesn't …

Run Code Online (Sandbox Code Playgroud)

inheritance overriding class python-3.x

Kub*_*888

2019 01-26

6
推荐指数

1
解决办法

5707
查看次数

如何使用tweepy.Cursor和api.search从Tweepy中提取Hashtags？

Tweepy通过应用tweepy.Cursor和api.search方法(如下所示)提取了我需要的所有其他信息(除了主题标签).我从文档中知道Hashtags属于这种结构Status> entities> hashtags.我试图找到(下面)方法中的"hashtags"目录,但无济于事:

print "tweet", dir(tweet)
print "////////////////"
print "tweet._api", dir(tweet._api)
print "////////////////"
print "tweet.text", dir(tweet.text)
print "////////////////"
print "tweet.entities", dir(tweet.entities)
print "////////////////"
print "tweet.author", dir(tweet.author)
print "////////////////"
print "tweet.user", dir(tweet.user)

Run Code Online (Sandbox Code Playgroud)

我的代码在这里:

import tweepy

ckey = ""
csecret = ""
atoken = ""
asecret = ""

OAUTH_KEYS = {'consumer_key':ckey, 'consumer_secret':csecret,
    'access_token_key':atoken, 'access_token_secret':asecret}
auth = tweepy.OAuthHandler(OAUTH_KEYS['consumer_key'], OAUTH_KEYS['consumer_secret'])
api = tweepy.API(auth)

for tweet in tweepy.Cursor(api.search, q=('"good book"'), since='2014-09-16', until='2014-09-17').items(5):

    print "Name:", tweet.author.name.encode('utf8')
    print "Screen-name:", tweet.author.screen_name.encode('utf8')
    print "Tweet created:", tweet.created_at
    print "Tweet:", tweet.text.encode('utf8') …

Run Code Online (Sandbox Code Playgroud)

twitter tweepy python-2.7 hashtag

Kub*_*888

lucky-day

5
推荐指数

1
解决办法

9782
查看次数