Reasonable way to have different versions of None?

Question

Clu*_*cat 6 python numpy python-3.x nonetype

Working in Python 3.

Say you have a million beetles, and your task is to catalogue the size of their spots. So you make a table where each row is a beetle and the numbers in the row represent the sizes of its spots:

 [[.3, 1.2, 0.5],
  [.6, .7],
  [1.4, .9, .5, .7],
  [.2, .3, .1, .7, .1]]

Also, you decide to store this in a numpy array, for which you pad the lists with None (numpy will convert this to np.nan).

 [[.3, 1.2, 0.5, None, None],
  [.6, .7, None, None, None],
  [1.4, .9, .5, .7, None],
  [.2, .3, .1, .7, .1]]
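
A minimal sketch of the padding step, assuming the ragged data starts out as a list of lists (the names `spots` and `table` are illustrative, not from the question). Note that `np.array` only turns `None` into `np.nan` when a float dtype is requested; without one, a list containing `None` becomes an object array. Filling a NaN array directly avoids the question altogether:

```python
import numpy as np

# Ragged spot sizes, one inner list per beetle (values from the question).
spots = [[.3, 1.2, 0.5],
         [.6, .7],
         [1.4, .9, .5, .7],
         [.2, .3, .1, .7, .1]]

# Allocate a float array filled with NaN, wide enough for the longest row.
width = max(len(row) for row in spots)
table = np.full((len(spots), width), np.nan)

# Copy each row into the left part of its slot; the rest stays NaN.
for i, row in enumerate(spots):
    table[i, :len(row)] = row
```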

But there is a problem: a value represented as None can be None for one of 3 reasons:

The beetle doesn't have that many spots; that quantity does not exist.
The beetle won't stay still and you can't measure the spot.
You haven't got round to measuring that beetle yet, so the value is unassigned.

My problem doesn't actually involve beetles, but the principles are the same. I want 3 different None values so I can keep these missing value causes distinct. My current solution is to use a value so large that it is physically improbable, but this is not a very safe solution.

Assume you cannot use negative numbers - in reality the quantity I am measuring could be negative.

The data is big and read speed is important.

Edit: comments rightly point out that saying speed is important without saying which operations is a bit meaningless. Principal component analysis is probably going to be used for variable decorrelation, Euclidean distance squared calculations for a clustering algorithm (but the data is sparse in that variable), and possibly some interpolation. Eventually a recursive neural network, but that will come from a library, so I will just have to put the data into an input form. So maybe nothing worse than linear algebra; it should all fit in RAM if I am careful, I think.

What is a good strategy?

Answer 1

Jos*_*der 5

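The simplest approach would be to use strings: "not counted", "unknown" and "not applicable". However, if you want fast processing in numpy, arrays of mixed numbers and objects are not your friend.

My suggestion is to add several arrays of the same shape as your data, consisting of 0s and 1s. So the array missing is 1 where the spot is missing and 0 otherwise, and likewise for an array not_measured, and so on.

Then you can use NaN everywhere, and later mask your data with, say, np.where(missing == 1) to easily find the specific NaNs you need.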
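
A minimal sketch of this idea, assuming three reasons and a small made-up data array. The mask names and values are illustrative, and boolean arrays are used here in place of 0/1 integers, which is an equivalent choice:

```python
import numpy as np

# Data array: NaN marks every missing value, whatever the reason.
data = np.array([[0.3, 1.2, np.nan],
                 [0.6, np.nan, np.nan],
                 [np.nan, np.nan, np.nan]])

# One boolean array per reason, same shape as `data` (hypothetical names).
not_applicable = np.zeros(data.shape, dtype=bool)  # the spot does not exist
not_measured = np.zeros(data.shape, dtype=bool)    # the measurement failed
unassigned = np.zeros(data.shape, dtype=bool)      # not looked at yet

not_applicable[0, 2] = True
not_applicable[1, 2] = True
not_measured[1, 1] = True
unassigned[2, :] = True

# Sanity check: every NaN has exactly one reason, and no real value has one.
reasons = not_applicable.astype(int) + not_measured + unassigned
assert np.array_equal(reasons == 1, np.isnan(data))

# Locate the values that are missing for one particular reason.
rows, cols = np.where(not_measured)

# Numerical work ignores the reasons and uses NaN-aware functions.
row_means = np.nanmean(data[:2], axis=1)
```

The data array stays a plain float array, so the linear algebra runs at full speed, and the masks are only consulted when the reason for a gap matters.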

Answer 2

Oli*_*çon 1

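I suggest creating three distinct object instances, one for each case.

Since you want these objects to have the properties of NaN, you could try creating three different NaN instances.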

NOT_APPLICABLE = float("nan")
NOT_MEASURED = float("nan")
UNKNOWN = float("nan")

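This is borderline a hack, so use it at your own risk, but I do not believe any Python implementation optimizes NaN to always reuse the same object. Nonetheless, you can add a sentinel condition to check this before running.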

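# Guard: the three sentinels must be distinct objects for `is` checks to work.
if NOT_APPLICABLE is NOT_MEASURED or NOT_MEASURED is UNKNOWN or UNKNOWN is NOT_APPLICABLE:
    raise ValueError("NaN sentinels are not distinct objects")  # or try something else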

If this works, it has the advantage of letting you compare the NaN's identity to check its meaning.

row = [1.0, 2.4, UNKNOWN]

...

if value is UNKNOWN:
    ...

At the same time, it preserves any optimization numpy might do on its arrays.
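
One point worth checking before relying on this is whether the identity survives once the row is stored in a numpy array. The sketch below assumes the data ends up either in a default float64 array or in an object array; the variable names are illustrative:

```python
import numpy as np

UNKNOWN = float("nan")

# In a float64 array the value is copied into a raw buffer, so the
# identity of the original Python object is not kept.
as_float = np.array([1.0, 2.4, UNKNOWN])
kept_in_float = as_float[2] is UNKNOWN

# In an object array each element is a reference to the original object.
as_object = np.array([1.0, 2.4, UNKNOWN], dtype=object)
kept_in_object = as_object[2] is UNKNOWN

print(as_float.dtype, kept_in_float)    # float64 False
print(as_object.dtype, kept_in_object)  # object True
```

So the `is` comparison works on plain Python lists and on object arrays, but not on the float64 arrays that fast numerical routines operate on.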

Disclosure: this is a hacky suggestion, and I am eager to hear what others think of it.
