Installing pandas in Docker Alpine

dat*_*den 22 python numpy pandas docker alpine-linux

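I am having a really hard time trying to install a stable data science package configuration in Docker. This should be easier with such mainstream, relevant tools.

The following is the Dockerfile that used to work. It is a bit of a hack: pandas is removed from the core package list and installed separately, pinned to pandas<0.21.0, because higher versions allegedly conflict with numpy.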

FROM alpine:3.6

ENV PACKAGES="\
    dumb-init \
    musl \
    libc6-compat \
    linux-headers \
    build-base \
    bash \
    git \
    ca-certificates \
    freetype \
    libgfortran \
    libgcc \
    libstdc++ \
    openblas \
    tcl \
    tk \
    libssl1.0 \
    "

ENV PYTHON_PACKAGES="\
    numpy \
    matplotlib \
    scipy \
    scikit-learn \
    nltk \
    " 

RUN apk add --no-cache --virtual build-dependencies python3 \
    && apk add --virtual build-runtime \
    build-base python3-dev openblas-dev freetype-dev pkgconfig gfortran \
    && ln -s /usr/include/locale.h /usr/include/xlocale.h \
    && python3 -m ensurepip \
    && rm -r /usr/lib/python*/ensurepip \
    && pip3 install --upgrade pip setuptools \
    && ln -sf /usr/bin/python3 /usr/bin/python \
    && ln -sf pip3 /usr/bin/pip \
    && rm -r /root/.cache \
    && pip install --no-cache-dir $PYTHON_PACKAGES \
    # pandas is installed here, separately from the other packages
    && pip3 install 'pandas<0.21.0' \
    && apk del build-runtime \
    && apk add --no-cache --virtual build-dependencies $PACKAGES \
    && rm -rf /var/cache/apk/*

# set working directory
WORKDIR /usr/src/app

# add and install requirements (packages other than the data science ones go here)
COPY ./requirements.txt /usr/src/app/requirements.txt
RUN pip install -r requirements.txt

# add entrypoint.sh
COPY ./entrypoint.sh /usr/src/app/entrypoint.sh

RUN chmod +x /usr/src/app/entrypoint.sh

# add app
COPY . /usr/src/app

# run server
CMD ["/usr/src/app/entrypoint.sh"]

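The above configuration used to work. What happens now is that the build does go through, but importing pandas fails with the following error: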

ImportError: Missing required dependencies ['numpy']

Since numpy 1.16.1 is installed, I don't know which numpy pandas is trying to find...
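One way to narrow this down is to import numpy directly instead of going through pandas. As far as I can tell, pandas reports "Missing required dependencies" whenever `import numpy` fails inside its own import, so the message can hide the underlying error (for example a broken build). The following is a minimal diagnostic sketch, not a fix; `describe_module` is a hypothetical helper name introduced only for this illustration.

```python
import importlib
import sys


def describe_module(name):
    """Try to import a module and report its version and location.

    Returns a dict with "ok" set to False and the error text when the
    import fails, so the real cause is visible instead of being masked.
    """
    try:
        module = importlib.import_module(name)
    except ImportError as exc:
        return {"name": name, "ok": False, "error": str(exc)}
    return {
        "name": name,
        "ok": True,
        "version": getattr(module, "__version__", "unknown"),
        "path": getattr(module, "__file__", "unknown"),
    }


if __name__ == "__main__":
    # Shows which interpreter is running, then what each import resolves to.
    print("interpreter:", sys.executable)
    for module_name in ("numpy", "pandas"):
        print(describe_module(module_name))
```

Running this inside the container with the same `python` that the application uses should show whether numpy itself imports cleanly and from which site-packages directory.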

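Does anyone know how to get a stable solution for this?

NOTE: A solution that pulls at least the packages mentioned above from a turnkey Docker image for data science into the Dockerfile above would also be very welcome.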


Edit 1:

If I move the installation of the data science packages into requirements.txt, as suggested in the comments, like so:

requirements.txt

(...)
numpy==1.16.1 # or numpy==1.16.0
scikit-learn==0.20.2
scipy==1.2.1
nltk==3.4   
pandas==0.24.1 # or pandas== 0.23.4
matplotlib==3.0.2 
(...)

Dockerfile:

# add and install requirements
COPY ./requirements.txt /usr/src/app/requirements.txt
RUN pip install -r requirements.txt

it breaks again on pandas, complaining about numpy:

Collecting numpy==1.16.1 (from -r requirements.txt (line 61))
  Downloading https://files.pythonhosted.org/packages/2b/26/07472b0de91851b6656cbc86e2f0d5d3a3128e7580f23295ef58b6862d6c/numpy-1.16.1.zip (5.1MB)
Collecting scikit-learn==0.20.2 (from -r requirements.txt (line 62))
  Downloading https://files.pythonhosted.org/packages/49/0e/8312ac2d7f38537361b943c8cde4b16dadcc9389760bb855323b67bac091/scikit-learn-0.20.2.tar.gz (10.3MB)
Collecting scipy==1.2.1 (from -r requirements.txt (line 63))
  Downloading https://files.pythonhosted.org/packages/a9/b4/5598a706697d1e2929eaf7fe68898ef4bea76e4950b9efbe1ef396b8813a/scipy-1.2.1.tar.gz (23.1MB)
Collecting nltk==3.4 (from -r requirements.txt (line 64))
  Downloading https://files.pythonhosted.org/packages/6f/ed/9c755d357d33bc1931e157f537721efb5b88d2c583fe593cc09603076cc3/nltk-3.4.zip (1.4MB)
Collecting pandas==0.24.1 (from -r requirements.txt (line 65))
  Downloading https://files.pythonhosted.org/packages/81/fd/b1f17f7dc914047cd1df9d6813b944ee446973baafe8106e4458bfb68884/pandas-0.24.1.tar.gz (11.8MB)
    Complete output from command python setup.py egg_info:
    Traceback (most recent call last):
      File "/usr/local/lib/python3.7/site-packages/pkg_resources/__init__.py", line 359, in get_provider
        module = sys.modules[moduleOrReq]
    KeyError: 'numpy'

    During handling of the above exception, another exception occurred:

    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/tmp/pip-install-_e5z6o6_/pandas/setup.py", line 732, in <module>
        ext_modules=maybe_cythonize(extensions, compiler_directives=directives),
      File "/tmp/pip-install-_e5z6o6_/pandas/setup.py", line 475, in maybe_cythonize
        numpy_incl = pkg_resources.resource_filename('numpy', 'core/include')
      File "/usr/local/lib/python3.7/site-packages/pkg_resources/__init__.py", line 1144, in resource_filename
        return get_provider(package_or_requirement).get_resource_filename(
      File "/usr/local/lib/python3.7/site-packages/pkg_resources/__init__.py", line 361, in get_provider
        __import__(moduleOrReq)
    ModuleNotFoundError: No module named 'numpy'

Command "python setup.py egg_info" failed with error code 1 in /tmp/pip-install-_e5z6o6_/pandas/
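The traceback shows pandas' setup.py looking up numpy while pip is still collecting packages, so numpy has to be installed before pandas is even processed. One possible workaround is to install in two passes. The sketch below only splits the requirement lines into a "build first" group and the rest; `requirement_name` and `split_requirements` are hypothetical helpers, and the assumption that numpy is the only package needing this treatment is mine, not something confirmed above.

```python
import re


def requirement_name(line):
    """Return the lowercased project name of a requirements line, or None.

    Comments are stripped first. Lines that do not start with a project
    name (blank lines, pip options such as "-r other.txt") yield None.
    """
    spec = line.split("#", 1)[0].strip()
    match = re.match(r"[A-Za-z0-9][A-Za-z0-9._-]*", spec)
    return match.group(0).lower() if match else None


def split_requirements(lines, build_first=("numpy",)):
    """Split requirement lines into (install first, install afterwards).

    Note: lines without a project name are dropped by this sketch.
    """
    first, rest = [], []
    for line in lines:
        name = requirement_name(line)
        if name is None:
            continue
        spec = line.split("#", 1)[0].strip()
        if name in build_first:
            first.append(spec)
        else:
            rest.append(spec)
    return first, rest


if __name__ == "__main__":
    sample = [
        "numpy==1.16.1 # or numpy==1.16.0",
        "scikit-learn==0.20.2",
        "pandas==0.24.1 # or pandas== 0.23.4",
    ]
    first, rest = split_requirements(sample)
    print("pass 1:", first)
    print("pass 2:", rest)
```

Each group would then be passed to its own `pip install` invocation (or kept in two separate requirements files), so that numpy is fully installed before the second pass starts.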

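Edit 2:

This seems to be an open pandas issue. For details, see:

pandas-dev github

"Unfortunately, this means that a requirements.txt file is insufficient for setting up a new environment with pandas installed (like in a docker container)."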

  **ImportError**:

  IMPORTANT: PLEASE READ THIS FOR ADVICE ON HOW TO SOLVE THIS ISSUE!

  Importing the multiarray numpy extension module failed.  Most
  likely you are trying to import a failed build of numpy.
  Here is how to proceed:
  - If you're working with a numpy git repository, try `git clean -xdf`
    (removes all files not under version control) and rebuild numpy.
  - If you are simply trying to use the numpy version that you have installed:
    your installation is broken - please reinstall numpy.
  - If you have already reinstalled and that did not fix the problem, then:
    1. Check that you are using the Python you expect (you're using /usr/local/bin/python),
       and that you have no directories in your PATH or PYTHONPATH that can
       interfere with the Python and numpy versions you're trying to use.
    2. If (1) looks fine, you can open a new issue at
       https://github.com/numpy/numpy/issues.  Please include details on:
       - how you installed Python
       - how you installed numpy
       - your operating system
       - whether or not you have multiple versions of Python installed
       - if you built from source, your compiler versions and ideally a build log

Edit 3:

requirements.txt ---> https://pastebin.com/0icnx0iu

val*_*ano 13

If you are not bound to Alpine 3.6, using Alpine 3.7 (or later) should work.

On Alpine 3.6, installing matplotlib failed with the following:

Collecting matplotlib
  Downloading https://files.pythonhosted.org/packages/26/04/8b381d5b166508cc258632b225adbafec49bbe69aa9a4fa1f1b461428313/matplotlib-3.0.3.tar.gz (36.6MB)
    Complete output from command python setup.py egg_info:
    Download error on https://pypi.org/simple/numpy/: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:833) -- Some packages may not be found!
    Couldn't find index page for 'numpy' (maybe misspelled?)
    Download error on https://pypi.org/simple/: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:833) -- Some packages may not be found!
    No local packages or working download links found for numpy>=1.10.0

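On Alpine 3.7, however, it worked. This may be due to a numpy versioning issue (see here), but I am not able to tell for sure. Once past that problem, the packages were built and installed successfully. It took a good while, about 30 minutes: the prebuilt manylinux wheels target glibc and are not compatible with Alpine's musl libc, so every package installed with pip has to be built from source.

Note one important change: you should remove the build-runtime virtual packages (apk del build-runtime) only after pip install. Also, if applicable, you could replace numpy 1.16.1 with 1.16.2, which is the shipped version (otherwise 1.16.2 will be uninstalled and 1.16.1 built from source, further increasing the build time). I have not tried this, though.

For reference, this is my slightly modified Dockerfile and the docker build output.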

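Note:

Alpine is usually chosen as a base to minimize image size (Alpine is also very slick, but it suffers from compatibility issues with mainstream Linux applications because of glibc / musl). Having to build Python packages from source defeats that purpose: you get a very bloated image, 900MB before any cleanup, which also takes ages to build. The image can be compacted considerably by removing all the intermediate compilation artifacts, build dependencies and so on, but still.

If you cannot get the Python package versions you need to work on Alpine without building them from source, I would suggest trying other small and more compatible base images, such as debian-slim or even ubuntu.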

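Edit:

Following "Edit 3" with the additional requirements, here are the updated Dockerfile and docker build output. The following packages were added to satisfy build dependencies: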

postgresql-dev libffi-dev libressl-dev libxml2 libxml2-dev libxslt libxslt-dev libjpeg-turbo-dev zlib-dev

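For packages that failed to build because of a specific missing header, I used Alpine's package contents search to locate the missing package. For cffi specifically, the ffi.h header was missing, which requires the libffi-dev package: https://pkgs.alpinelinux.org/contents?file=ffi.h&path=&name=&branch=v3.7

Alternatively, when a package build failure is not very clear, you can refer to the installation instructions of the specific package, for example Pillow.

The new image size is 1.04GB before any compaction. To cut it down a bit, you could remove the Python and pip caches: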
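The lookup above can be repeated for any other missing header by changing the file name in the query string. A small sketch that rebuilds the same search URL follows; `contents_search_url` is a hypothetical helper, and the parameter names are simply the ones visible in the link above.

```python
from urllib.parse import urlencode


def contents_search_url(filename, branch="v3.7"):
    """Build an Alpine package contents search URL for a given file name.

    The query parameters mirror the ffi.h link above; "path" and "name"
    are left empty, exactly as in that link.
    """
    query = urlencode(
        {"file": filename, "path": "", "name": "", "branch": branch}
    )
    return "https://pkgs.alpinelinux.org/contents?" + query


if __name__ == "__main__":
    # Reproduces the ffi.h search used for cffi.
    print(contents_search_url("ffi.h"))
```

Opening the printed URL in a browser lists the packages that ship the file, which is how libffi-dev was identified for ffi.h.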

RUN apk del build-runtime && \
    find -type d -name __pycache__ -prune -exec rm -rf {} \; && \
    rm -rf ~/.cache/pip

The image size could be brought down to 661MB when using docker build --squash.
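If the cleanup has to run somewhere the shell tools are not available, roughly the same `__pycache__` removal can be sketched in Python. This is an illustrative equivalent under my own assumptions, not what was used for the sizes quoted above; `remove_pycache` is a hypothetical helper name.

```python
import os
import shutil


def remove_pycache(root):
    """Delete every __pycache__ directory below root and return their paths."""
    removed = []
    for dirpath, dirnames, _filenames in os.walk(root):
        if "__pycache__" in dirnames:
            target = os.path.join(dirpath, "__pycache__")
            shutil.rmtree(target)
            # Do not descend into the directory that was just deleted
            # (this plays the role of -prune in the find command).
            dirnames.remove("__pycache__")
            removed.append(target)
    return removed


if __name__ == "__main__":
    # Cleans below the current working directory only.
    for path in remove_pycache(os.getcwd()):
        print("removed", path)
```

Like `find` without an explicit path, the example entry point only cleans below the current working directory; pass a different root to cover the site-packages directory.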


小智 5

Try adding this to your requirements.txt file:

numpy==1.16.0
pandas==0.23.4

I have been facing the same error since yesterday, and this change solved it for me.