How to find out `DataFrame.to_numpy` did not create a copy

Mar*_*tin 7 python numpy pandas

The pandas.DataFrame.to_numpy method has a copy argument with the following documentation:

copy : bool, default False

Whether to ensure that the returned value is not a view on another array. Note that copy=False does not ensure that to_numpy() is no-copy. Rather, copy=True ensures that a copy is made, even if not strictly necessary.

Playing around a bit, it seems that calling to_numpy on data that is both contiguous in memory and of a single dtype returns a view. But how do I check whether the resulting numpy array shares memory with the data frame it was created from, without modifying the data?

Example of memory sharing:

import pandas as pd
import numpy as np

# some data frame that I expect not to be copied
frame = pd.DataFrame(np.arange(144).reshape(12,12))
array = frame.to_numpy()
array[:] = 0
print(frame)
# Prints:
#     0  1  2  3  4  5  6  7  8  9  10  11
# 0   0  0  0  0  0  0  0  0  0  0   0   0
# 1   0  0  0  0  0  0  0  0  0  0   0   0
# 2   0  0  0  0  0  0  0  0  0  0   0   0
# 3   0  0  0  0  0  0  0  0  0  0   0   0
# 4   0  0  0  0  0  0  0  0  0  0   0   0
# 5   0  0  0  0  0  0  0  0  0  0   0   0
# 6   0  0  0  0  0  0  0  0  0  0   0   0
# 7   0  0  0  0  0  0  0  0  0  0   0   0
# 8   0  0  0  0  0  0  0  0  0  0   0   0
# 9   0  0  0  0  0  0  0  0  0  0   0   0
# 10  0  0  0  0  0  0  0  0  0  0   0   0
# 11  0  0  0  0  0  0  0  0  0  0   0   0

Example not sharing memory:

import pandas as pd
import numpy as np

# some data frame that I expect to be copied
types = [int, str, float]
frame = pd.DataFrame({
    i: [types[i%len(types)](value) for value in col]
    for i, col in enumerate(np.arange(144).reshape(12,12).T)
})
array = frame.to_numpy()
array[:] = 0
print(frame)
# Prints:
#     0   1     2   3   4     5   6   7      8    9    10     11
# 0    0  12  24.0  36  48  60.0  72  84   96.0  108  120  132.0
# 1    1  13  25.0  37  49  61.0  73  85   97.0  109  121  133.0
# 2    2  14  26.0  38  50  62.0  74  86   98.0  110  122  134.0
# 3    3  15  27.0  39  51  63.0  75  87   99.0  111  123  135.0
# 4    4  16  28.0  40  52  64.0  76  88  100.0  112  124  136.0
# 5    5  17  29.0  41  53  65.0  77  89  101.0  113  125  137.0
# 6    6  18  30.0  42  54  66.0  78  90  102.0  114  126  138.0
# 7    7  19  31.0  43  55  67.0  79  91  103.0  115  127  139.0
# 8    8  20  32.0  44  56  68.0  80  92  104.0  116  128  140.0
# 9    9  21  33.0  45  57  69.0  81  93  105.0  117  129  141.0
# 10  10  22  34.0  46  58  70.0  82  94  106.0  118  130  142.0
# 11  11  23  35.0  47  59  71.0  83  95  107.0  119  131  143.0

ywb*_*aek 5

You can use numpy.shares_memory:

# Your first example
print(np.shares_memory(array, frame))  # True, they are sharing memory

# Your second example (array2 and frame2 are the array and frame from that example)
print(np.shares_memory(array2, frame2))  # False, they are not sharing memory

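There is also numpy.may_share_memory, which is faster but can only be used to make sure that things do not share memory (because it only checks whether the memory bounds overlap), so strictly speaking it does not answer the question. Read this for the differences.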
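
To illustrate the difference between the two functions, here is a small sketch using plain numpy arrays only. The names base, evens and odds are made up for this illustration. The two strided views interleave in the same buffer, so their bounds overlap although no element is actually shared:

```python
import numpy as np

base = np.arange(10)
evens = base[0::2]  # view on elements 0, 2, 4, 6, 8
odds = base[1::2]   # view on elements 1, 3, 5, 7, 9

# Bounds check only: the two views span overlapping address ranges.
print(np.may_share_memory(evens, odds))  # True (a false positive)

# Exact check: no single element belongs to both views.
print(np.shares_memory(evens, odds))     # False

# A real copy lives in a separate buffer, so even the bounds check says no.
print(np.may_share_memory(base, base.copy()))  # False
```

So a False from may_share_memory is conclusive, while a True only means "maybe".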

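Be careful when using these numpy functions with pandas data structures: np.shares_memory(frame, frame) returns True for the first example and False for the second, probably because the data frame's __array__ method creates a copy behind the scenes in the second example.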
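
One possible way around that caveat is to avoid passing the data frame itself to numpy and instead compare the results of two to_numpy() calls. This is only a heuristic sketch, not something from the pandas documentation: the helper name to_numpy_is_view is hypothetical, and it assumes that to_numpy() behaves the same way (view or copy) on every call for an unchanged frame. Nothing is written to the data.

```python
import numpy as np
import pandas as pd


def to_numpy_is_view(frame: pd.DataFrame) -> bool:
    """Heuristic (hypothetical helper): call to_numpy() twice and compare.

    If both results are views on the frame's internal buffer they share
    memory; if each call allocates a fresh copy they do not.
    """
    return bool(np.shares_memory(frame.to_numpy(), frame.to_numpy()))


# Single dtype: expected to come back as a view.
homogeneous = pd.DataFrame(np.arange(144).reshape(12, 12))

# Mixed dtypes: to_numpy() has to build a new object array on each call.
mixed = pd.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"], "c": [1.0, 2.0, 3.0]})

print(to_numpy_is_view(homogeneous))  # True
print(to_numpy_is_view(mixed))        # False

# copy=True always forces a separate buffer, even for the homogeneous frame.
print(np.shares_memory(homogeneous.to_numpy(copy=True), homogeneous.to_numpy()))  # False
```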