Why do operations on large integers silently overflow?

Question


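I have a list containing very large integers that I want to convert into a pandas column with a specific dtype. For example, if the list contains 2**31, which is outside the range of the int32 dtype, converting it to dtype int32 raises an OverflowError. That tells me to use another dtype or to handle the number some other way beforehand.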

import pandas as pd
pd.Series([2**31], dtype='int32')

# OverflowError: Python int too large to convert to C long

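However, if a number is large but still within the dtype's range (i.e. 2**31-1), and something is added to it that pushes the value past the dtype's limit, the operation runs without any error. Instead of an OverflowError, the value wraps around and becomes a completely wrong number for that column.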

pd.Series([2**31-1], dtype='int32') + 1

0   -2147483648
dtype: int32

Why does this happen? Why doesn't it raise an error like in the first case?

P.S. I'm using pandas 2.1.1 and numpy 1.26.0 on Python 3.12.0.

Answer 1

Tim*_*ess 8


In short, this is because of the way NumPy handles overflow.

On my platform (with the same Python and package versions as yours):

from platform import *
import numpy as np; import pandas as pd

system(), version(), machine()
python_version(), pd.__version__, np.__version__

('Linux',
 '#34~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Thu Sep  7 13:12:03 UTC 2',
 'x86_64')
('3.12.0', '2.1.1', '1.26.0')


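I can reproduce your issue, but only with a much larger integer than the one you chose as an example:

`pd.Series([2**63], dtype="int32")` raises this:

OverflowError: Python int too large to convert to C long

While `pd.Series([2**31], dtype="int32")` raises this:

ValueError: Values are too large to be losslessly converted to int32. To cast anyway, use pd.Series(values).astype(int32)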

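Details

We can agree that you are using two different kinds of objects, which can lead to two different outcomes, i.e. 1) an error is raised or 2) no error is raised:

pd.Series: the Series constructor
pd.Series.add: a method of the former

The construction: `pd.Series([2**31], dtype="int32")`

This is handled behind the scenes, in this case by sanitize_array, which receives your input (the list [2**31], i.e. [2147483648]) and calls maybe_cast_to_integer_array. The latter does a classic NumPy construction using np.array: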

casted = np.array([2147483648], dtype="int32")

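DeprecationWarning: NumPy will stop allowing conversion of out-of-bound Python integers to integer arrays. The conversion of 2147483648 to int32 will fail in the future. For the old behavior, usually: np.array(value).astype(dtype) will give the desired result (the cast overflows). np.array([2147483648], dtype='int32')

You may ask yourself why the warning above doesn't show up when you construct your Series. That's because pandas silences it. Then, after the cast, pandas calls np.asarray without specifying a dtype and lets NumPy infer it (int64 here): arr = np.asarray(arr). Since casted.dtype < arr.dtype, a ValueError is raised.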
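The check described above can be approximated in plain NumPy. This is only a minimal sketch, not the pandas implementation: `lossless_int_cast` is a hypothetical helper name, and it compares values after a wrapping cast instead of comparing dtypes. It also assumes every input fits in int64.

```python
import numpy as np

def lossless_int_cast(values, dtype):
    """Cast integers to `dtype`, raising if any value would change (hypothetical helper)."""
    # Build a wide reference array first (assumption: all inputs fit in int64).
    wide = np.asarray(values, dtype="int64")
    # astype() casts unsafely by default, so out-of-range values wrap silently.
    narrow = wide.astype(dtype)
    # If the narrowed values differ from the wide ones, the cast was lossy.
    if not np.array_equal(wide, narrow):
        raise ValueError(f"Values are too large to be losslessly converted to {dtype}")
    return narrow

print(lossless_int_cast([1, 2, 3], "int32"))
```

Calling `lossless_int_cast([2**31], "int32")` raises the ValueError, because the wrapped int32 value no longer equals the int64 original.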

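The addition: `pd.Series([2**31-1], dtype="int32") + 1`

The operation is delegated to _na_arithmetic_op, which receives array([2147483647], dtype=int32) and 1 and tries to add them together with the help of _evaluate_standard. That performs a classic operator.add, equivalent to np.array([2147483647], dtype="int32") + 1. NumPy numeric types have a fixed size, so a value that needs more memory than the data type provides overflows. The result is array([-2147483648], dtype=int32), which is passed to sanitize_array to construct a Series again: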

pd.Series([2**31-1], dtype="int32") + 1

0   -2147483648
dtype: int32

Note: when the limit of int32 is exceeded, NumPy wraps around to the minimum value:

a = np.array([2**31-1], dtype="int32"); b = 1
a+b # this gives array([-2147483648], dtype=int32)

Here are some other examples:

def wrap_int32(i, N, l=2**31):
    # Shift into [0, 2**32), wrap modulo 2**32, then shift back to [-2**31, 2**31)
    return ((i + N + l) % (2 * l)) - l

wrap_int32(2**31, 0) # -2147483648
wrap_int32(2**31, 1) # -2147483647
wrap_int32(2**31, 2) # -2147483646
wrap_int32(2**31, 3) # -2147483645
# ...
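If the wrap-around is unwanted, one option is to do the arithmetic in a wider type and check the result before narrowing it again. The following is a rough sketch under stated assumptions: `checked_add` is a hypothetical helper (neither NumPy nor pandas provides it), and it only works for integer dtypes narrower than int64 and for a `value` that itself fits in int64.

```python
import numpy as np

def checked_add(arr, value):
    """Add `value` to an integer array, raising instead of wrapping (hypothetical helper)."""
    info = np.iinfo(arr.dtype)
    # Do the arithmetic in int64, which has room for any int32 + int32 result.
    wide = arr.astype("int64") + np.int64(value)
    # Reject results that would not fit back into the original dtype.
    if ((wide < info.min) | (wide > info.max)).any():
        raise OverflowError(f"result does not fit in {arr.dtype}")
    return wide.astype(arr.dtype)

a = np.array([2**31 - 2], dtype="int32")
print(checked_add(a, 1))  # still within int32, so no error
```

For a Series, the same helper could be applied to `s.to_numpy()`. Adding 1 to an array holding 2**31-1 raises OverflowError here instead of returning -2147483648.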

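"I have a list containing very large integers that I want to convert into a pandas column with a specific dtype. For example, if the list contains 2**31, which is outside the range of the int32 dtype, converting it to dtype int32 throws an OverflowError, which tells me to use another dtype or to handle the number some other way beforehand."

Maybe you should consider opening an issue so that arithmetic operations performed by pandas raise an error on overflow. As a workaround (or maybe a solution?) for your use case, you can try to catch, upstream, the integers that fall outside the int32 range: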

iint32 = np.iinfo(np.int32)

lst = [100, 1234567890000, -1e19, 2**31, 2**31-1, -350]

out = [i for i in lst if iint32.min <= i and i <= iint32.max]
# [100, 2147483647, -350]
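The filter above drops the offending values silently. If you would rather see which values were rejected, a variant could return both groups. This is a small sketch, and `split_by_int_range` is a hypothetical name introduced only for illustration:

```python
import numpy as np

def split_by_int_range(values, dtype="int32"):
    """Split values into those that fit `dtype` and those that do not (hypothetical helper)."""
    info = np.iinfo(np.dtype(dtype))
    in_range, out_of_range = [], []
    for v in values:
        # Keep the value only if it lies within the dtype's representable range.
        if info.min <= v <= info.max:
            in_range.append(v)
        else:
            out_of_range.append(v)
    return in_range, out_of_range

lst = [100, 1234567890000, -1e19, 2**31, 2**31 - 1, -350]
ok, rejected = split_by_int_range(lst)
print(ok)        # values that are safe to store as int32
print(rejected)  # values that need another dtype or special handling
```

The rejected list can then be logged or used to decide whether the column needs a wider dtype such as int64.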

Archived:	2 years ago
Views:	596
Last recorded:	2 years ago
