Unable to create a version for prediction in Cloud AI Platform using a custom container


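Because of certain VPC restrictions, I am forced to use a custom container to serve predictions for a model trained with TensorFlow. As the documentation requires, I created an HTTP server using TensorFlow Serving. The Dockerfile used to build the image is as follows: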

FROM tensorflow/serving:2.3.0-gpu

# copy the model file
ENV MODEL_NAME=my_model
COPY my_model /models/my_model

where my_model contains the saved_model inside a folder named 1/.

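I then pushed the container image to Artifact Registry and created a Model. To create a Version, I selected Custom Container in the Cloud Console UI and added the path to the container image. I set both the prediction route and the health route to /v1/models/my_model:predict and changed the port to 8501. I also selected a single compute node with machine type n1-standard-16 and 1 P100 GPU, and left scaling set to Auto scaling.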

After clicking Save, I can see the TensorFlow server starting, and the logs show the following messages:

Successfully loaded servable version {name: my_model version: 1}

Running gRPC ModelServer at 0.0.0.0:8500

Exporting HTTP/REST API at:localhost:8501

NET_LOG: Entering the event loop

However, after about 20-25 minutes, version creation stops and throws the following error:

Error: model server never became ready. Please validate that your model file or container configuration are valid.

I can't figure out why this is happening. I can run the same Docker image on my local machine and successfully get predictions by hitting the endpoint it creates: http://localhost:8501/v1/models/my_model:predict
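For reference, the local smoke test described above can be sketched in Python roughly as follows. This is only a minimal illustration, not the exact command that was run: the helper names `build_predict_request` and `predict` are made up for this example, and the `{"instances": ...}` body assumes the model accepts TensorFlow Serving's row-format REST input.

```python
import json
import urllib.request


def build_predict_request(instances, model_name="my_model",
                          host="localhost", port=8501):
    """Build (but do not send) a TensorFlow Serving REST predict request.

    Hypothetical helper for illustration; the defaults mirror the URL
    quoted in the question.
    """
    url = f"http://{host}:{port}/v1/models/{model_name}:predict"
    # Row format: one entry in "instances" per input example (assumed).
    body = json.dumps({"instances": instances}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predict(instances, **kwargs):
    """Send the request to a running container and return the JSON reply."""
    request = build_predict_request(instances, **kwargs)
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the container running locally, something like `predict([[1.0, 2.0, 3.0]])` would exercise the endpoint; the input shape here is a placeholder and has to match the model's actual signature.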

Any help with this would be greatly appreciated.

Solution

Answering this myself after working with the Google Cloud Support team to track down the error.

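It turns out that the port I was creating the Version on conflicted with the Kubernetes deployment on the Cloud AI Platform side. I therefore changed the Dockerfile to the following and was able to run Online Predictions successfully on both Classic AI Platform and Unified AI Platform: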

FROM tensorflow/serving:2.3.0-gpu

# Set where models should be stored in the container
ENV MODEL_BASE_PATH=/models
RUN mkdir -p ${MODEL_BASE_PATH}

# copy the model file
ENV MODEL_NAME=my_model
COPY my_model /models/my_model

EXPOSE 5000

EXPOSE 8080

CMD ["tensorflow_model_server","--rest_api_port=8080","--port=5000","--model_name=my_model","--model_base_path=/models/my_model"]
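The port configured on the Version has to match the REST port the container now listens on (8080). As a rough sketch of how that might look when creating the version through the API instead of the Console, the snippet below builds a request body. The field names (`container`, `routes`, `machineType`, `acceleratorConfig`) follow my reading of the AI Platform Prediction v1 Version resource and should be checked against the current reference. The image path is a placeholder, and the health route shown in the usage note is an assumption, since the answer does not say which health route was used.

```python
import json


def build_version_body(name, image, rest_port, predict_route, health_route):
    """Assemble a Version resource body for a custom container.

    Hypothetical helper; field names are assumed from the AI Platform
    Prediction v1 REST reference and are not confirmed by the post.
    """
    return {
        "name": name,
        # Machine type and GPU taken from the question.
        "machineType": "n1-standard-16",
        "acceleratorConfig": {"count": 1, "type": "NVIDIA_TESLA_P100"},
        "container": {
            "image": image,
            # Must match --rest_api_port in the Dockerfile's CMD.
            "ports": [{"containerPort": rest_port}],
        },
        "routes": {"predict": predict_route, "health": health_route},
    }


if __name__ == "__main__":
    body = build_version_body(
        name="v1",
        image="REGION-docker.pkg.dev/PROJECT/REPO/IMAGE:TAG",  # placeholder
        rest_port=8080,
        predict_route="/v1/models/my_model:predict",
        health_route="/v1/models/my_model",  # assumed, not from the post
    )
    print(json.dumps(body, indent=2))
```

Keeping the port in a single argument makes it harder for the Dockerfile and the Version configuration to drift apart, which is the mismatch this answer describes.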

Have you tried using a different health path? I believe /v1/models/my_model:predict uses HTTP POST, but health checks usually use HTTP GET.

Your health check path may need to be an HTTP GET endpoint.

EDIT: According to the docs at https://www.tensorflow.org/tfx/serving/api_rest, you can test with a plain GET request as your health endpoint.
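One way to try this locally is to issue a GET against TensorFlow Serving's model status endpoint, `/v1/models/<model_name>`, and look for a version in the `AVAILABLE` state. The sketch below assumes that endpoint and its `model_version_status` response field as described in the REST API docs linked above. The helper names are made up for illustration, and the default port is TensorFlow Serving's usual REST port, so pass `port=8080` for a container started with `--rest_api_port=8080`.

```python
import json
import urllib.request


def model_status_url(model_name="my_model", host="localhost", port=8501):
    """URL of the TensorFlow Serving model status endpoint (answers GET)."""
    return f"http://{host}:{port}/v1/models/{model_name}"


def is_available(status):
    """Return True if any version in a status response is AVAILABLE."""
    versions = status.get("model_version_status", [])
    return any(v.get("state") == "AVAILABLE" for v in versions)


def check_health(**kwargs):
    """GET the status endpoint of a running server and report readiness."""
    # urlopen sends a GET when no request body is supplied.
    with urllib.request.urlopen(model_status_url(**kwargs), timeout=10) as resp:
        return resp.status == 200 and is_available(json.loads(resp.read()))
```

If `check_health()` returns True against the local container, the same path is a reasonable candidate for the health route, since it responds to GET where the `:predict` route expects POST.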
