好多的鱼



Self-hosting a vLLM service for VS Code

Posted on 2026-04-21


Environment setup

pip install uv
pip install modelscope

First attempt: Qwen3.5-27B

VLLM_USE_MODELSCOPE=true vllm serve Qwen/Qwen3.5-27B --port 18000 --tensor-parallel-size 4 --max-model-len 262144 --reasoning-parser qwen3

Test request:

curl http://localhost:18000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3.5-27B",
    "messages": [
      {"role": "user", "content": "你好,请简单介绍一下你自己。"}
    ],
    "max_tokens": 1000
  }'
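The same request can be sent from Python. The following is a minimal sketch using only the standard library. The helper names `build_chat_payload` and `send_chat` are illustrative, and the model name assumes the server was started without `--served-model-name`, so the model is served under the name passed to `vllm serve`.

```python
import json
import urllib.request


def build_chat_payload(model: str, prompt: str, max_tokens: int = 1000) -> dict:
    """Build the JSON body for an OpenAI-compatible /v1/chat/completions call."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }


def send_chat(base_url: str, payload: dict, timeout: float = 120.0) -> dict:
    """POST the payload and return the decoded JSON response."""
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


payload = build_chat_payload("Qwen/Qwen3.5-27B", "你好,请简单介绍一下你自己。")
# Uncomment once the server started above is listening on port 18000:
# reply = send_chat("http://localhost:18000", payload)
# print(reply["choices"][0]["message"]["content"])
```

Keeping payload construction separate from the network call makes it easy to inspect the request body before a server is available.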

OOM problem

GPU memory blew up, so I switched to the FP8 build and tuned the parameters:

VLLM_USE_MODELSCOPE=true vllm serve Qwen/Qwen3.5-27B-FP8 --port 18000 --host 0.0.0.0 --tensor-parallel-size 4 --max-model-len 262144 --gpu-memory-utilization 0.95 --reasoning-parser qwen3 --trust-remote-code
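A rough back-of-the-envelope estimate shows why the FP8 build eases memory pressure. This sketch only counts weights and assumes they shard evenly across tensor-parallel ranks. It ignores the KV cache, activations, and any layers an FP8 checkpoint may keep in higher precision, and the 27e9 parameter count is simply read off the model name.

```python
GIB = 1024 ** 3


def weight_gib_per_gpu(num_params: float, bytes_per_param: float, tensor_parallel: int) -> float:
    """Approximate weight memory per GPU, assuming weights shard evenly."""
    return num_params * bytes_per_param / tensor_parallel / GIB


# 27e9 parameters is an assumption taken from the model name.
bf16 = weight_gib_per_gpu(27e9, 2, 4)  # 16-bit weights, 4-way tensor parallel
fp8 = weight_gib_per_gpu(27e9, 1, 4)   # 8-bit weights, 4-way tensor parallel
print(f"BF16: {bf16:.1f} GiB per GPU, FP8: {fp8:.1f} GiB per GPU")
```

Whatever the weights do not use, up to the `--gpu-memory-utilization` budget, is what remains for the KV cache, so halving the weight footprint matters most at long context lengths.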

Trying the Qwen 122B standard model

Working around the memory problem with 4 GPUs plus --cpu-offload-gb:

SAFETENSORS_FAST_GPU=1 vllm serve MiniMax-M2.5 --trust-remote-code --tensor-parallel-size 4 --enable-auto-tool-choice --tool-call-parser minimax_m2 --reasoning-parser minimax_m2_append_think --gpu-memory-utilization 0.3 --cpu-offload-gb 150

Result: it crashed.

An earlier attempt that also got nowhere:

SAFETENSORS_FAST_GPU=1 vllm serve \
MiniMaxAI/MiniMax-M2.5 --trust-remote-code \
--tensor-parallel-size 4 \
--enable-auto-tool-choice --tool-call-parser minimax_m2 \
--reasoning-parser minimax_m2_append_think

2-GPU attempt

CUDA_VISIBLE_DEVICES=2,3 PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True VLLM_USE_MODELSCOPE=true vllm serve Qwen/Qwen3.5-122B-A10B \
--port 8000 \
--tensor-parallel-size 2 \
--max-model-len 65536 \
--reasoning-parser qwen3 \
--gpu-memory-utilization 0.8

Retrying with 4 GPUs

VLLM_USE_MODELSCOPE=true vllm serve Qwen/Qwen3.5-122B-A10B --port 18001 --tensor-parallel-size 4 --max-model-len 65536 --reasoning-parser qwen3 --gpu-memory-utilization 0.5 --cpu-offload-gb 200

Conclusion: in testing, the 122B model did not run correctly under ollama either. Loading took 22 minutes and the second response raised an error. I abandoned this path.

Qwen3.5-35B-A3B attempt

VLLM_USE_MODELSCOPE=true vllm serve Qwen/Qwen3.5-35B-A3B --port 18001 --tensor-parallel-size 4 --max-model-len 262144 --reasoning-parser qwen3 --gpu-memory-utilization 0.8 --cpu-offload-gb 30

2-GPU version

# Key change: --tensor-parallel-size goes from 4 to 2
CUDA_VISIBLE_DEVICES=2,3 VLLM_USE_MODELSCOPE=true vllm serve Qwen/Qwen3.5-35B-A3B \
--port 18001 \
--tensor-parallel-size 2 \
--max-model-len 262144 \
--reasoning-parser qwen3 \
--gpu-memory-utilization 0.8

The same command with CPU offload added:

CUDA_VISIBLE_DEVICES=2,3 VLLM_USE_MODELSCOPE=true vllm serve Qwen/Qwen3.5-35B-A3B --port 18001 --tensor-parallel-size 2 --max-model-len 262144 --reasoning-parser qwen3 --gpu-memory-utilization 0.8 --cpu-offload-gb 30

Triton implementation problem

VLLM_FLASH_ATTN_TRITON_IMPL=False CUDA_VISIBLE_DEVICES=2,3 VLLM_USE_MODELSCOPE=true vllm serve Qwen/Qwen3.5-35B-A3B --port 18001 --tensor-parallel-size 2 --max-model-len 262144 --reasoning-parser qwen3 --gpu-memory-utilization 0.5 --cpu-offload-gb 30

CPU offloading failed, so I removed the offload step.

4-GPU version

VLLM_FLASH_ATTN_TRITON_IMPL=False CUDA_VISIBLE_DEVICES=1,2,3,4 VLLM_USE_MODELSCOPE=true vllm serve Qwen/Qwen3.5-35B-A3B --port 18001 --tensor-parallel-size 4 --max-model-len 262144 --reasoning-parser qwen3 --gpu-memory-utilization 0.8

Tracking down zombie processes

ps -o pid,ppid,cmd -p 2779757

The 0.8B version runs

After switching to 0.8B, it finally came up!

After adjusting the parameters:

VLLM_DISABLE_CUDA_GRAPH=True VLLM_USE_MODELSCOPE=true CUDA_VISIBLE_DEVICES=0,1,2,3 vllm serve Qwen/Qwen3.5-0.8B --port 18001 --tensor-parallel-size 4 --max-model-len 262144 --reasoning-parser qwen3 --gpu-memory-utilization 0.6 --enforce-eager

27B-FP8: the final setup

VLLM_DISABLE_CUDA_GRAPH=True VLLM_USE_MODELSCOPE=true CUDA_VISIBLE_DEVICES=0,1,2,3 vllm serve Qwen/Qwen3.5-27B-FP8 --port 18001 --tensor-parallel-size 4 --max-model-len 65536 --reasoning-parser qwen3 --gpu-memory-utilization 0.5 --enforce-eager

Final tuned configuration

Additional environment variables

export VLLM_TORCH_COMPILE=1
export PYTORCH_ALLOC_CONF=expandable_segments:True

2-GPU deployment parameters (27B-FP8)

CUDA_VISIBLE_DEVICES=1,2 vllm serve Qwen/Qwen3.5-27B-FP8 \
--port 18002 \
--tensor-parallel-size 2 \
--max-model-len 262144 \
--reasoning-parser qwen3 \
--enable-chunked-prefill \
--max-num-batched-tokens 8192 \
--gpu-memory-utilization 0.95 \
--kv-cache-dtype fp8 \
--max-num-seqs 16

2-GPU deployment parameters (Qwen3-Coder-30B-A3B-Instruct-FP8)

CUDA_VISIBLE_DEVICES=0,3 VLLM_USE_MODELSCOPE=true vllm serve Qwen/Qwen3-Coder-30B-A3B-Instruct-FP8 \
--host 0.0.0.0 \
--port 18001 \
--tensor-parallel-size 2 \
--dtype auto \
--reasoning-parser qwen3 \
--enable-chunked-prefill \
--max-num-batched-tokens 8192 \
--kv-cache-dtype fp8 \
--max-num-seqs 16 \
--gpu-memory-utilization 0.8 \
--max-model-len 32768 \
--served-model-name qwen3-coder-30b-A3B-Instruct-FP8

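Summary

  1. 122B model: abandoned; loading takes too long and it is unstable.
  2. 35B model: hits a Triton implementation problem and needs parameter adjustments.
  3. 27B-FP8: the final stable setup; works well on 2 GPUs.
  4. 0.8B model: quick to bring up, good for debugging.

Key tuning parameters:

  • --gpu-memory-utilization 0.8-0.95: adjust to the available GPU memory
  • --enable-chunked-prefill: improves batching performance
  • --kv-cache-dtype fp8: reduces GPU memory use
  • VLLM_TORCH_COMPILE=1: enables compile-time optimization
  • VLLM_DISABLE_CUDA_GRAPH=True: CUDA Graph has to be disabled in some scenarios
  • --enforce-eager: forces eager mode for better compatibility, but testing showed it cuts throughput drastically (down to 18.4 token/s), so I removed it once debugging was done.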

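Measures for speeding up vLLM

When deploying a vLLM service, raising inference speed (token/s) is a key goal alongside keeping the model stable. The following measures proved effective in this deployment:

1. Enable Torch Compile

Setting the environment variable VLLM_TORCH_COMPILE=1 lets vLLM use the compilation support in PyTorch 2.0+ to optimize the model's operators and cut kernel launch overhead. In my tests on the 27B model this gave a 10%-15% speedup.

2. Use an FP8-quantized model

Compared with BF16 or FP16, an FP8 model uses less GPU memory and also runs faster. For example, Qwen3.5-27B-FP8 is about 20% faster than the full-precision version on the same hardware, with very little accuracy loss.

3. Tune GPU memory utilization

  • Setting --gpu-memory-utilization 0.95 maximizes GPU memory use and allows larger batches, which raises throughput.
  • Note: too high a value can cause OOM; fine-tune it to the actual GPU memory.

4. Enable chunked prefill

--enable-chunked-prefill splits long prompts into chunks so that a single request does not hog GPU memory, which allows more concurrent requests and raises overall throughput.

5. Adjust batching parameters

  • --max-num-batched-tokens: caps the total number of tokens processed per batch, to avoid running out of GPU memory.
  • --max-num-seqs: caps the number of concurrent sequences, trading latency against throughput.
  • Raising these values moderately, within what GPU memory allows, increases concurrency.

6. Use an FP8 KV cache

--kv-cache-dtype fp8 quantizes the KV cache to FP8, which substantially reduces GPU memory use and so allows larger batches or longer contexts, indirectly improving inference speed.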
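To make the saving concrete, here is a small sizing sketch. It assumes standard full attention, where every layer stores one key and one value vector per KV head for each token. The layer count, head count, and head dimension below are hypothetical values chosen for illustration, not the real configuration of any Qwen model, and models with hybrid attention layers will differ.

```python
def kv_cache_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int, bytes_per_value: int) -> int:
    """Bytes of KV cache one token occupies: a key and a value vector per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


# Hypothetical model shape, for illustration only.
LAYERS, KV_HEADS, HEAD_DIM = 64, 8, 128

fp16_per_token = kv_cache_bytes_per_token(LAYERS, KV_HEADS, HEAD_DIM, 2)
fp8_per_token = kv_cache_bytes_per_token(LAYERS, KV_HEADS, HEAD_DIM, 1)

context = 262144  # the --max-model-len used in the commands above
for name, per_token in (("fp16", fp16_per_token), ("fp8", fp8_per_token)):
    gib = per_token * context / 1024 ** 3
    print(f"{name}: {gib:.1f} GiB for one full-length sequence (summed over all GPUs)")
```

Under these assumptions, one full-length sequence needs half as much cache in FP8, so the same memory budget holds roughly twice as many tokens.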

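7. Remove eager mode

--enforce-eager forces PyTorch eager mode. It is more compatible but severely reduces inference speed (measured: down to 18.4 token/s). Once debugging is done, be sure to remove it so that graph-mode acceleration is enabled.

8. Multi-GPU parallelism

  • Set --tensor-parallel-size to match the number of available GPUs so the model weights are distributed evenly.
  • For small models (such as 0.8B), 4-way parallelism can add communication overhead, so pick the GPU count to suit the model size. I ended up using 2 GPUs, which performed about the same in practice. With 4 GPUs the KV cache was only 9% utilized; cutting down to 2 GPUs raised that to 27%, and the two freed GPUs now host a code-completion model.

9. Monitoring and tuning

  • Use nvidia-smi to monitor GPU memory usage and GPU utilization.

With these measures, and without sacrificing stability, vLLM on 4 GPUs reaches about 50 token/s for a single user and over 200 token/s across multiple users.
On 2 GPUs: 32 token/s for a single user, 263 token/s with 8 users.
Peak load test: 10 concurrent requests, 332 token/s.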
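Dividing the aggregate figures by the number of users shows how well throughput scales. This is a small worked calculation on the numbers above. It assumes throughput is shared evenly between users and that the 10-concurrency figure was also measured on the 2-GPU setup, which the measurements do not state explicitly.

```python
def per_user_rate(total_tokens_per_s: float, users: int) -> float:
    """Average generation rate each user sees if throughput is shared evenly."""
    return total_tokens_per_s / users


# (concurrent users, aggregate token/s) taken from the figures above
measurements = [(1, 32.0), (8, 263.0), (10, 332.0)]

for users, total in measurements:
    rate = per_user_rate(total, users)
    print(f"{users:>2} users: {total:6.1f} token/s total, {rate:.1f} token/s per user")
```

The per-user rate stays close to the single-user rate at every concurrency level, which suggests the server was not yet saturated at 10 concurrent requests.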


2026.4.27 addendum, tool-calling parameters:
--enable-auto-tool-choice --tool-call-parser hermes
With these added, custom MCP servers can be configured in chatBOX.
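With tool calling enabled, a client can advertise functions in the OpenAI-compatible `tools` field of a chat request. The sketch below only builds such a request body. The `get_weather` tool is hypothetical, and the model name is an assumption borrowed from the `--served-model-name` used earlier, since the addendum does not say which server received the new flags.

```python
import json


def build_tool_request(model: str, question: str) -> dict:
    """Chat request that advertises one function the model may choose to call."""
    weather_tool = {
        "type": "function",
        "function": {
            "name": "get_weather",  # hypothetical tool, for illustration only
            "description": "Look up the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }
    return {
        "model": model,
        "messages": [{"role": "user", "content": question}],
        "tools": [weather_tool],
        "tool_choice": "auto",
    }


request_body = build_tool_request(
    "qwen3-coder-30b-A3B-Instruct-FP8",  # assumed served model name
    "What is the weather in Tianjin?",
)
print(json.dumps(request_body, indent=2))
```

If the model decides to call the tool, the response carries a `tool_calls` entry instead of plain text, and the client is expected to run the function and send the result back in a follow-up message.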

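Cannot find "Block insecure private network"

Posted on 2026-01-06

A camera's web tool asked me to turn off "Block insecure private network",
but the option could not be found in Edge.

Fix: make a copy of the Edge shortcut, open its properties, and append the following to the Target field after a space: --disable-web-security --user-data-dir="C:\EdgeDevData" (any directory works).
Then double-click that shortcut to launch Edge, and the page loads normally.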

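Building the qtxlsx library

Posted on 2026-01-04

I was writing a small tool for my own use to save time, but WPS and Excel turned out to be incompatible, so QAxObject could not export Excel files correctly.
So I switched to a qtxlsx library.
The original QtXlsxWriter has not been maintained for a long time, so I used a fork:

https://gitcode.com/gh_mirrors/qx/QXlsx/tree/master/QXlsx

Create a build directory under QXlsx\QXlsx and edit CMakeLists.txt.
Below

project(QXlsx
    VERSION 1.5.0
    LANGUAGES CXX
)

add

set(CMAKE_INCLUDE_CURRENT_DIR ON)
set(QT_DIR "D:/programming_environment/Qt/Qt5.12.10/5.12.10/msvc2017_64/lib/cmake/Qt5")
set(Qt5_DIR "D:/programming_environment/Qt/Qt5.12.10/5.12.10/msvc2017_64/lib/cmake/Qt5")
set(CMAKE_AUTOMOC ON)

Note that these paths must point to the directory containing Qt5Config.cmake and qt5-config.cmake.
Then open "x64 Native Tools Command Prompt for VS Insiders" from the Visual Studio folder in the Start menu, change to the build directory created above, and run cmake ..
This generates the .sln file. Open it with VS 2026 and build the library directly; the library files end up in QXlsx\QXlsx\build\Debug.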

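Setting up a static website served by Nginx

Posted on 2025-11-24

Guide to deploying a static website with Nginx

Overview

This guide shows how to deploy a static website with Nginx on port 23456, a more secure and higher-performance option than Python's built-in server.

Installing Nginx

Ubuntu/Debian:

sudo apt update
sudo apt install nginx -y

Configuring the site

Create the site directory

sudo mkdir -p /var/www/mysite
sudo chown -R $USER:$USER /var/www/mysite

Add the site files
Copy your index.html and other static files into /var/www/mysite/.

Create the Nginx configuration file

sudo nano /etc/nginx/sites-available/mysite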

Add the following configuration:

server {
    listen 23456;
    server_name _;
    root /var/www/mysite;
    index index.html;

    location / {
        try_files $uri $uri/ =404;
    }
}
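When the same layout is needed for several sites or ports, the server block can be generated rather than typed by hand. The sketch below is one possible way to do that. `render_server_block` is a hypothetical helper that simply mirrors the configuration above, and writing the result to /etc/nginx/sites-available/ is left to the reader.

```python
def render_server_block(port: int, root: str, index: str = "index.html") -> str:
    """Return an nginx server block for a static site, mirroring the config above."""
    return (
        "server {\n"
        f"    listen {port};\n"
        "    server_name _;\n"
        f"    root {root};\n"
        f"    index {index};\n"
        "\n"
        "    location / {\n"
        "        try_files $uri $uri/ =404;\n"
        "    }\n"
        "}\n"
    )


config_text = render_server_block(23456, "/var/www/mysite")
print(config_text)
```

Generating the port in one place also keeps the listen directive, the firewall rule, and the verification URL from drifting apart.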

Enable the site

sudo ln -s /etc/nginx/sites-available/mysite /etc/nginx/sites-enabled/
sudo nginx -t                # test the configuration
sudo systemctl reload nginx  # reload

Firewall settings

Open port 23456:

sudo ufw allow 23456/tcp
sudo ufw reload

Verifying the service

curl -I http://localhost:23456

Visit http://<your-server-IP>:23456 to view the site.

Installing and running TensorRT-LLM on WSL

Posted on 2025-11-17

The installation guide is at https://nvidia.github.io/TensorRT-LLM/latest/installation/linux.html

conda create -n trt python=3.12

conda activate trt

Note that installing CUDA on WSL has its own procedure:

wget https://developer.download.nvidia.com/compute/cuda/repos/wsl-ubuntu/x86_64/cuda-wsl-ubuntu.pin
sudo mv cuda-wsl-ubuntu.pin /etc/apt/preferences.d/cuda-repository-pin-600
wget https://developer.download.nvidia.com/compute/cuda/12.9.1/local_installers/cuda-repo-wsl-ubuntu-12-9-local_12.9.1-1_amd64.deb
sudo dpkg -i cuda-repo-wsl-ubuntu-12-9-local_12.9.1-1_amd64.deb
sudo cp /var/cuda-repo-wsl-ubuntu-12-9-local/cuda-*-keyring.gpg /usr/share/keyrings/
sudo apt-get update
sudo apt-get -y install cuda-toolkit-12-9

Then add the following to .bashrc in your home directory:

export PATH=/usr/local/cuda-12/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda-12/lib64:$LD_LIBRARY_PATH
export CUDA_HOME=/usr/local/cuda-12

conda activate trt

Now the actual installation begins. First come PyTorch and the required libraries, a download of a bit over 1 GB:

pip3 install torch==2.7.1 torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128

sudo apt-get -y install libopenmpi-dev

# Optional step: Only required for disagg-serving
sudo apt-get -y install libzmq3-dev

Then install TensorRT-LLM:

pip3 install --upgrade pip setuptools && pip3 install tensorrt_llm

It hung...

pip3 install tensorrt_llm
Collecting tensorrt_llm
Using cached tensorrt_llm-1.0.0.tar.gz (1.6 kB)
Installing build dependencies ... done
Getting requirements to build wheel ... done
Preparing metadata (pyproject.toml) ... -

After a good ten-plus minutes it resumed on its own,
only to stall again at this step:

Collecting tensorrt_cu12_libs==10.11.0.33 (from tensorrt_cu12==10.11.0.33->tensorrt~=10.11.0->tensorrt_llm)
Downloading tensorrt_cu12_libs-10.11.0.33.tar.gz (709 bytes)
Installing build dependencies ... done
Getting requirements to build wheel ... done
Preparing metadata (pyproject.toml) ... -

After about 10 minutes it continued.
It finally finished after installing the long list of packages below, with no further trouble:

      Successfully uninstalled fsspec-2025.9.0
Successfully installed StrEnum-0.4.15 accelerate-1.11.0 aenum-3.1.16 aiohappyeyeballs-2.6.1 aiohttp-3.13.2 aiosignal-1.4.0 annotated-types-0.7.0 antlr4-python3-runtime-4.9.3 anyio-4.11.0 attrs-25.4.0 backoff-2.2.1 blake3-1.0.8 blobfile-3.1.0 build-1.3.0 certifi-2025.10.5 cffi-2.0.0 charset_normalizer-3.4.4 click-8.3.0 click_option_group-0.5.9 colored-2.3.1 contourpy-1.3.3 cuda-bindings-12.9.4 cuda-pathfinder-1.3.2 cuda-python-12.9.4 cycler-0.12.1 datasets-3.1.0 diffusers-0.35.2 dill-0.3.8 distro-1.9.0 einops-0.8.1 etcd3-0.12.0 evaluate-0.4.6 fastapi-0.115.4 flashinfer-python-0.2.5 fonttools-4.60.1 frozenlist-1.8.0 fsspec-2024.9.0 grpcio-1.76.0 h11-0.16.0 h5py-3.12.1 hf-xet-1.2.0 httpcore-1.0.9 httpx-0.28.1 huggingface-hub-0.36.0 idna-3.11 importlib_metadata-8.7.0 jiter-0.11.1 kiwisolver-1.4.9 lark-1.3.1 llguidance-0.7.29 lxml-6.0.2 markdown-it-py-4.0.0 matplotlib-3.10.7 mdurl-0.1.2 meson-1.9.1 ml_dtypes-0.5.3 mpi4py-4.1.1 multidict-6.7.0 multiprocess-0.70.16 ninja-1.13.0 numpy-1.26.4 nvidia-ml-py-12.575.51 nvidia-modelopt-0.33.1 nvidia-modelopt-core-0.33.1 nvtx-0.2.13 omegaconf-2.3.0 onnx-1.19.1 onnx_graphsurgeon-0.5.8 openai-2.7.1 opencv-python-headless-4.11.0.86 optimum-2.0.0 ordered-set-4.1.0 packaging-25.0 pandas-2.3.3 peft-0.17.1 pillow-10.3.0 polygraphy-0.49.26 propcache-0.4.1 protobuf-6.33.0 psutil-7.1.3 pulp-3.3.0 pyarrow-22.0.0 pycparser-2.23 pycryptodomex-3.23.0 pydantic-2.12.4 pydantic-core-2.41.5 pydantic-settings-2.11.0 pygments-2.19.2 pynvml-12.0.0 pyparsing-3.2.5 pyproject_hooks-1.2.0 python-dateutil-2.9.0.post0 python-dotenv-1.2.1 pytz-2025.2 pyyaml-6.0.3 pyzmq-27.1.0 regex-2025.11.3 requests-2.32.5 rich-14.2.0 safetensors-0.6.2 scipy-1.16.3 sentencepiece-0.2.1 setuptools-79.0.1 six-1.17.0 sniffio-1.3.1 soundfile-0.13.1 starlette-0.41.3 tenacity-9.1.2 tensorrt-10.11.0.33 tensorrt_cu12-10.11.0.33 tensorrt_cu12_bindings-10.11.0.33 tensorrt_cu12_libs-10.11.0.33 tensorrt_llm-1.0.0 tiktoken-0.12.0 tokenizers-0.21.4 torchprofile-0.0.4 tqdm-4.67.1 
transformers-4.53.1 typing-inspection-0.4.2 tzdata-2025.2 urllib3-2.5.0 uvicorn-0.38.0 xgrammar-0.1.21 xxhash-3.6.0 yarl-1.22.0 zipp-3.23.0

Now write a program to try it out, using the example from the TensorRT-LLM docs:
···
from tensorrt_llm import LLM, SamplingParams


def main():

    # Model could accept HF model name, a path to local HF model,
    # or TensorRT Model Optimizer's quantized checkpoints like nvidia/Llama-3.1-8B-Instruct-FP8 on HF.
    llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")

    # Sample prompts.
    prompts = [
        "Hello, my name is",
        "The capital of France is",
        "The future of AI is",
    ]

    # Create a sampling params.
    sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

    for output in llm.generate(prompts, sampling_params):
        print(
            f"Prompt: {output.prompt!r}, Generated text: {output.outputs[0].text!r}"
        )

    # Got output like
    # Prompt: 'Hello, my name is', Generated text: '\n\nJane Smith. I am a student pursuing my degree in Computer Science at [university]. I enjoy learning new things, especially technology and programming'
    # Prompt: 'The president of the United States is', Generated text: 'likely to nominate a new Supreme Court justice to fill the seat vacated by the death of Antonin Scalia. The Senate should vote to confirm the'
    # Prompt: 'The capital of France is', Generated text: 'Paris.'
    # Prompt: 'The future of AI is', Generated text: 'an exciting time for us. We are constantly researching, developing, and improving our platform to create the most advanced and efficient model available. We are'


if __name__ == '__main__':
    main()

···

After a long wait, the run failed with an error: huggingface.co could not be reached.

Traceback (most recent call last):
File "/home/bobo/miniforge3/envs/trt/lib/python3.12/site-packages/urllib3/connection.py", line 198, in _new_conn
sock = connection.create_connection(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/bobo/miniforge3/envs/trt/lib/python3.12/site-packages/urllib3/util/connection.py", line 85, in create_connection
raise err
File "/home/bobo/miniforge3/envs/trt/lib/python3.12/site-packages/urllib3/util/connection.py", line 73, in create_connection
sock.connect(sa)
OSError: [Errno 101] Network is unreachable

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "/home/bobo/miniforge3/envs/trt/lib/python3.12/site-packages/urllib3/connectionpool.py", line 787, in urlopen
response = self._make_request(
^^^^^^^^^^^^^^^^^^^
File "/home/bobo/miniforge3/envs/trt/lib/python3.12/site-packages/urllib3/connectionpool.py", line 488, in _make_request
raise new_e
File "/home/bobo/miniforge3/envs/trt/lib/python3.12/site-packages/urllib3/connectionpool.py", line 464, in _make_request
self._validate_conn(conn)
File "/home/bobo/miniforge3/envs/trt/lib/python3.12/site-packages/urllib3/connectionpool.py", line 1093, in _validate_conn
conn.connect()
File "/home/bobo/miniforge3/envs/trt/lib/python3.12/site-packages/urllib3/connection.py", line 753, in connect
self.sock = sock = self._new_conn()
^^^^^^^^^^^^^^^^
File "/home/bobo/miniforge3/envs/trt/lib/python3.12/site-packages/urllib3/connection.py", line 213, in _new_conn
raise NewConnectionError(
urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPSConnection object at 0x716f0fd40b30>: Failed to establish a new connection: [Errno 101] Network is unreachable

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "/home/bobo/miniforge3/envs/trt/lib/python3.12/site-packages/requests/adapters.py", line 644, in send
resp = conn.urlopen(
^^^^^^^^^^^^^
File "/home/bobo/miniforge3/envs/trt/lib/python3.12/site-packages/urllib3/connectionpool.py", line 841, in urlopen
retries = retries.increment(
^^^^^^^^^^^^^^^^^^
File "/home/bobo/miniforge3/envs/trt/lib/python3.12/site-packages/urllib3/util/retry.py", line 519, in increment
raise MaxRetryError(_pool, url, reason) from reason # type: ignore[arg-type]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='huggingface.co', port=443): Max retries exceeded with url: /api/models/TinyLlama/TinyLlama-1.1B-Chat-v1.0/revision/main (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x716f0fd40b30>: Failed to establish a new connection: [Errno 101] Network is unreachable'))
···
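There is a lot more after this, but it can be ignored.
To reach Hugging Face from mainland China, I ran
export HF_ENDPOINT=https://hf-mirror.com
and re-ran the script, which still failed. Following the hint in the error message, I upgraded transformers:

pip install transformers -U
and added
export HF_HUB_BASE_URL=https://hf-mirror.com
It still failed.

After a lot of fiddling, I found that
trtllm-serve "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
unexpectedly managed to download the model.
So I went straight back to the test program: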
···
(trt) bobo@DESKTOP-K65EUBR:~/test_trt$ python test_trt.py
<frozen importlib._bootstrap_external>:1301: FutureWarning: The cuda.cuda module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.driver module instead.
<frozen importlib._bootstrap_external>:1301: FutureWarning: The cuda.cudart module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.runtime module instead.
[2025-11-06 21:27:24] INFO config.py:54: PyTorch version 2.7.1+cu128 available.
/home/bobo/miniforge3/envs/trt/lib/python3.12/site-packages/modelopt/torch/utils/import_utils.py:32: UserWarning: Failed to import huggingface plugin due to: AttributeError("module 'transformers.modeling_utils' has no attribute 'Conv1D'"). You may ignore this warning if you do not need this plugin.
warnings.warn(
/home/bobo/miniforge3/envs/trt/lib/python3.12/site-packages/modelopt/torch/__init__.py:36: UserWarning: transformers version 4.57.1 is incompatible with nvidia-modelopt and may cause issues. Please install recommended version with `pip install nvidia-modelopt[hf]` if working with HF models.
_warnings.warn(
2025-11-06 21:27:28,299 - INFO - flashinfer.jit: Prebuilt kernels not found, using JIT backend
[TensorRT-LLM] TensorRT LLM version: 1.0.0
[11/06/2025-21:27:28] [TRT-LLM] [I] Using LLM with PyTorch backend
[11/06/2025-21:27:28] [TRT-LLM] [W] Using default gpus_per_node: 1
[11/06/2025-21:27:28] [TRT-LLM] [I] Set nccl_plugin to None.
[11/06/2025-21:27:28] [TRT-LLM] [I] neither checkpoint_format nor checkpoint_loader were provided, checkpoint_format will be set to HF.
/home/bobo/miniforge3/envs/trt/lib/python3.12/site-packages/torch/utils/cpp_extension.py:2356: UserWarning: TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'].
warnings.warn(
rank 0 using MpiPoolSession to spawn MPI processes
<frozen importlib._bootstrap_external>:1301: FutureWarning: The cuda.cuda module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.driver module instead.
<frozen importlib._bootstrap_external>:1301: FutureWarning: The cuda.cudart module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.runtime module instead.
[2025-11-06 21:27:36] INFO config.py:54: PyTorch version 2.7.1+cu128 available.
Multiple distributions found for package optimum. Picked distribution: optimum
/home/bobo/miniforge3/envs/trt/lib/python3.12/site-packages/modelopt/torch/utils/import_utils.py:32: UserWarning: Failed to import huggingface plugin due to: AttributeError("module 'transformers.modeling_utils' has no attribute 'Conv1D'"). You may ignore this warning if you do not need this plugin.
warnings.warn(
/home/bobo/miniforge3/envs/trt/lib/python3.12/site-packages/modelopt/torch/__init__.py:36: UserWarning: transformers version 4.57.1 is incompatible with nvidia-modelopt and may cause issues. Please install recommended version with `pip install nvidia-modelopt[hf]` if working with HF models.
_warnings.warn(
2025-11-06 21:27:41,023 - INFO - flashinfer.jit: Prebuilt kernels not found, using JIT backend
[TensorRT-LLM] TensorRT LLM version: 1.0.0
/home/bobo/miniforge3/envs/trt/lib/python3.12/site-packages/torch/utils/cpp_extension.py:2356: UserWarning: TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'].
warnings.warn(
[TensorRT-LLM][INFO] Refreshed the MPI local session
`torch_dtype` is deprecated! Use `dtype` instead!
Loading safetensors weights in parallel: 100%|██████████| 1/1 [00:00<00:00, 104.27it/s]
Loading weights: 100%|██████████| 449/449 [00:00<00:00, 572.52it/s]
Model init total -- 3.57s
[TensorRT-LLM][INFO] Number of tokens per block: 32.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 0.18 GiB for max tokens in paged KV cache (8352).
2025-11-06 21:27:47,676 - INFO - flashinfer.jit: Loading JIT ops: norm
/home/bobo/miniforge3/envs/trt/lib/python3.12/site-packages/torch/utils/cpp_extension.py:2356: UserWarning: TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'].
warnings.warn(
^CTraceback (most recent call last):
File "/home/bobo/test_trt/test_trt.py", line 33, in <module>
main()
File "/home/bobo/test_trt/test_trt.py", line 8, in main
llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/bobo/miniforge3/envs/trt/lib/python3.12/site-packages/tensorrt_llm/llmapi/llm.py", line 1125, in __init__
super().__init__(model, tokenizer, tokenizer_mode, skip_tokenizer_init,
File "/home/bobo/miniforge3/envs/trt/lib/python3.12/site-packages/tensorrt_llm/llmapi/llm.py", line 942, in __init__
super().__init__(model,
File "/home/bobo/miniforge3/envs/trt/lib/python3.12/site-packages/tensorrt_llm/llmapi/llm.py", line 214, in __init__
self._build_model()
File "/home/bobo/miniforge3/envs/trt/lib/python3.12/site-packages/tensorrt_llm/llmapi/llm.py", line 1072, in _build_model
self._executor = self._executor_cls.create(
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/bobo/miniforge3/envs/trt/lib/python3.12/site-packages/tensorrt_llm/executor/executor.py", line 423, in create
return GenerationExecutorProxy(
^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/bobo/miniforge3/envs/trt/lib/python3.12/site-packages/tensorrt_llm/executor/proxy.py", line 105, in __init__
self._start_executor_workers(worker_kwargs)
File "/home/bobo/miniforge3/envs/trt/lib/python3.12/site-packages/tensorrt_llm/executor/proxy.py", line 319, in _start_executor_workers
if self.worker_init_status_queue.poll(1):
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/bobo/miniforge3/envs/trt/lib/python3.12/site-packages/tensorrt_llm/executor/ipc.py", line 110, in poll
events = dict(self.poller.poll(timeout=timeout * 1000))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/bobo/miniforge3/envs/trt/lib/python3.12/site-packages/zmq/sugar/poll.py", line 106, in poll
return zmq_poll(self.sockets, timeout=timeout)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "zmq/backend/cython/_zmq.py", line 1680, in zmq.backend.cython._zmq.zmq_poll
File "zmq/backend/cython/_zmq.py", line 179, in zmq.backend.cython._zmq._check_rc
KeyboardInterrupt
^CException ignored in: <module 'threading' from '/home/bobo/miniforge3/envs/trt/lib/python3.12/threading.py'>
Traceback (most recent call last):
File "/home/bobo/miniforge3/envs/trt/lib/python3.12/threading.py", line 1594, in _shutdown
atexit_call()
File "/home/bobo/miniforge3/envs/trt/lib/python3.12/site-packages/mpi4py/futures/_core.py", line 172, in join_threads
thread.join()
File "/home/bobo/miniforge3/envs/trt/lib/python3.12/threading.py", line 1149, in join
self._wait_for_tstate_lock()
File "/home/bobo/miniforge3/envs/trt/lib/python3.12/threading.py", line 1169, in _wait_for_tstate_lock
if lock.acquire(block, timeout):
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
KeyboardInterrupt:
[11/06/2025-21:30:34] [TRT-LLM] [E] Failed to send object: None
^C^C^CException ignored in atexit callback: <function shutdown_compile_workers at 0x7707940725c0>
Traceback (most recent call last):
File "/home/bobo/miniforge3/envs/trt/lib/python3.12/site-packages/torch/_inductor/async_compile.py", line 113, in shutdown_compile_workers
pool.shutdown()
File "/home/bobo/miniforge3/envs/trt/lib/python3.12/site-packages/torch/_inductor/compile_worker/subproc_pool.py", line 239, in shutdown
self.process.wait(300)
File "/home/bobo/miniforge3/envs/trt/lib/python3.12/subprocess.py", line 1277, in wait
self._wait(timeout=sigint_timeout)
File "/home/bobo/miniforge3/envs/trt/lib/python3.12/subprocess.py", line 2047, in _wait
time.sleep(delay)
KeyboardInterrupt:
···
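After noticing the hang, I set one more variable:
export TORCH_CUDA_ARCH_LIST="8.6;8.9"

The value comes from:
python -c "import torch; print(torch.cuda.get_device_capability())"
It printed (8, 9), so I appended 8.9 to the list and ran the script again.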
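The conversion from the reported capability tuple to the variable's value can be scripted. This is a minimal sketch; `arch_list_entry` and `build_arch_list` are illustrative helper names, and the commented lines show how the input could be collected with PyTorch on a machine that has a GPU.

```python
def arch_list_entry(capability: tuple) -> str:
    """Turn a (major, minor) compute capability into the 'major.minor' form."""
    major, minor = capability
    return f"{major}.{minor}"


def build_arch_list(capabilities) -> str:
    """Join unique capabilities into a TORCH_CUDA_ARCH_LIST value, lowest first."""
    unique = sorted(set(capabilities))
    return ";".join(arch_list_entry(cap) for cap in unique)


# On a machine with PyTorch and a GPU, the input could come from:
#   import torch
#   capabilities = [torch.cuda.get_device_capability(i)
#                   for i in range(torch.cuda.device_count())]
capabilities = [(8, 9)]  # the value reported above
print(f'export TORCH_CUDA_ARCH_LIST="{build_arch_list(capabilities)}"')
```

Listing only the architectures actually present keeps JIT compilation from building kernels for GPUs the machine does not have.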

(trt) bobo@DESKTOP-K65EUBR:~/test_trt$ python test_trt.py
<frozen importlib._bootstrap_external>:1301: FutureWarning: The cuda.cuda module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.driver module instead.
<frozen importlib._bootstrap_external>:1301: FutureWarning: The cuda.cudart module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.runtime module instead.
[2025-11-06 21:30:46] INFO config.py:54: PyTorch version 2.7.1+cu128 available.
/home/bobo/miniforge3/envs/trt/lib/python3.12/site-packages/modelopt/torch/utils/import_utils.py:32: UserWarning: Failed to import huggingface plugin due to: AttributeError("module 'transformers.modeling_utils' has no attribute 'Conv1D'"). You may ignore this warning if you do not need this plugin.
warnings.warn(
/home/bobo/miniforge3/envs/trt/lib/python3.12/site-packages/modelopt/torch/__init__.py:36: UserWarning: transformers version 4.57.1 is incompatible with nvidia-modelopt and may cause issues. Please install recommended version with `pip install nvidia-modelopt[hf]` if working with HF models.
_warnings.warn(
2025-11-06 21:30:50,629 - INFO - flashinfer.jit: Prebuilt kernels not found, using JIT backend
[TensorRT-LLM] TensorRT LLM version: 1.0.0
[11/06/2025-21:30:50] [TRT-LLM] [I] Using LLM with PyTorch backend
[11/06/2025-21:30:50] [TRT-LLM] [W] Using default gpus_per_node: 1
[11/06/2025-21:30:50] [TRT-LLM] [I] Set nccl_plugin to None.
[11/06/2025-21:30:50] [TRT-LLM] [I] neither checkpoint_format nor checkpoint_loader were provided, checkpoint_format will be set to HF.
rank 0 using MpiPoolSession to spawn MPI processes
<frozen importlib._bootstrap_external>:1301: FutureWarning: The cuda.cuda module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.driver module instead.
<frozen importlib._bootstrap_external>:1301: FutureWarning: The cuda.cudart module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.runtime module instead.
[2025-11-06 21:31:03] INFO config.py:54: PyTorch version 2.7.1+cu128 available.
Multiple distributions found for package optimum. Picked distribution: optimum
/home/bobo/miniforge3/envs/trt/lib/python3.12/site-packages/modelopt/torch/utils/import_utils.py:32: UserWarning: Failed to import huggingface plugin due to: AttributeError("module 'transformers.modeling_utils' has no attribute 'Conv1D'"). You may ignore this warning if you do not need this plugin.
warnings.warn(
/home/bobo/miniforge3/envs/trt/lib/python3.12/site-packages/modelopt/torch/__init__.py:36: UserWarning: transformers version 4.57.1 is incompatible with nvidia-modelopt and may cause issues. Please install recommended version with `pip install nvidia-modelopt[hf]` if working with HF models.
_warnings.warn(
2025-11-06 21:31:07,496 - INFO - flashinfer.jit: Prebuilt kernels not found, using JIT backend
[TensorRT-LLM] TensorRT LLM version: 1.0.0
[TensorRT-LLM][INFO] Refreshed the MPI local session
`torch_dtype` is deprecated! Use `dtype` instead!
Loading safetensors weights in parallel: 100%|██████████| 1/1 [00:00<00:00, 61.60it/s]
Loading weights: 100%|██████████| 449/449 [00:00<00:00, 557.09it/s]
Model init total -- 2.23s
[TensorRT-LLM][INFO] Number of tokens per block: 32.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 0.18 GiB for max tokens in paged KV cache (8352).
2025-11-06 21:31:12,625 - INFO - flashinfer.jit: Loading JIT ops: norm
2025-11-06 21:31:48,328 - INFO - flashinfer.jit: Finished loading JIT ops: norm
[TensorRT-LLM][INFO] Number of tokens per block: 32.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 17.72 GiB for max tokens in paged KV cache (844512).
Processed requests: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 3.15it/s]
Prompt: 'Hello, my name is', Generated text: '[Your Name] and I am a [Your Position] at [Your Company]. I am writing to express my interest in the [Job Title] position at'
Prompt: 'The capital of France is', Generated text: 'Paris.\n\n2. B. C. The capital of Canada is Ottawa.\n\n3. A. C. The capital of Australia is Can'
Prompt: 'The future of AI is', Generated text: "bright, and it's not just for big companies. Small businesses can also benefit from AI technology. Here are some ways:\n\n1."

It ran through without problems.

Now try running it as a service:

trtllm-serve "TinyLlama/TinyLlama-1.1B-Chat-v1.0"

Wait until it shows:

INFO: Uvicorn running on http://localhost:8000 (Press CTRL+C to quit)
/home/bobo/miniforge3/envs/trt/lib/python3.12/site-packages/pydantic/_internal/_generate_schema.py:2249: UnsupportedFieldAttributeWarning: The 'validation_alias' attribute with value 'max_tokens' was provided to the Field() function, which has no effect in the context it was used. 'validation_alias' is field-specific metadata, and can only be attached to a model field using Annotated metadata or by assignment. This may have happened because an Annotated type alias using the type statement was used, or if the Field() function was attached to a single member of a union type.
warnings.warn(
INFO: 127.0.0.1:54364 - "POST /v1/chat/completions HTTP/1.1" 200 OK

Then open another WSL bash session and run:

curl -X POST http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -H "Accept: application/json" -d '{
"model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
"messages":[{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Where is New York? Tell me in a single sentence."}],
"max_tokens": 32,
"temperature": 0
}'

The reply: "New York is a city in the northeastern United States, located on the eastern coast of the state of New York."

{"id":"chatcmpl-31b02f6ab4854863909850ab9688d8b1","object":"chat.completion","created":1762436547,"model":"TinyLlama/TinyLlama-1.1B-Chat-v1.0","choices":[{"index":0,"message":{"role":"assistant","content":"New York is a city in the northeastern United States, located on the eastern coast of the state of New York.","reasoning_content":null,"tool_calls":[]},"logprobs":null,"finish_reason":"stop","stop_reason":null,"disaggregated_params":null}],"usage":{"prompt_tokens":43,"total_tokens":70,"completion_tokens":27},"prompt_token_ids":null}
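For scripted checks it is handy to pull just the answer and the token counts out of such a response. The sketch below parses a trimmed copy of the response above with the standard `json` module; `summarize_completion` is an illustrative helper name, and the fields it does not use have been left out of the sample.

```python
import json

# Trimmed copy of the response shown above; unused fields are omitted.
RESPONSE_TEXT = """
{
  "id": "chatcmpl-31b02f6ab4854863909850ab9688d8b1",
  "model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "New York is a city in the northeastern United States, located on the eastern coast of the state of New York."
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {"prompt_tokens": 43, "total_tokens": 70, "completion_tokens": 27}
}
"""


def summarize_completion(response_text: str) -> dict:
    """Extract the assistant text, stop reason, and token counts."""
    data = json.loads(response_text)
    choice = data["choices"][0]
    return {
        "text": choice["message"]["content"],
        "finish_reason": choice["finish_reason"],
        "completion_tokens": data["usage"]["completion_tokens"],
    }


summary = summarize_completion(RESPONSE_TEXT)
print(summary["finish_reason"], summary["completion_tokens"])
```

A `finish_reason` of `stop` with 27 completion tokens, below the requested `max_tokens` of 32, indicates the model ended the sentence on its own rather than being cut off.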

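I tried a few more prompts. It answers, but its comprehension is limited and it cannot speak Chinese.
Time to switch, this time to DeepSeek-R1-Distill-Qwen-1.5B:

trtllm-serve "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"

It crashed, printing a long list of hints on how to control GPU memory usage: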

Please refer to the TensorRT LLM documentation for information on how to control the memory usage through TensorRT LLM configuration options. Possible options include:
Model: reduce max_num_tokens and/or shard the model weights across GPUs by enabling pipeline and/or tensor parallelism
Sampler: reduce max_seq_len and/or max_attention_window_size
Initial KV cache (temporary for KV cache size estimation): reduce max_num_tokens
Drafter: reduce max_seq_len and/or max_draft_len
Additional executor resources (temporary for KV cache size estimation): reduce max_num_tokens
Model resources created during usage: reduce max_num_tokens
KV cache: reduce free_gpu_memory_fraction
Additional executor resources: reduce max_num_tokens

trtllm-serve "meta-llama/Llama-3.2-1B"
This one requires an access token, so I gave up on it.

Verification complete.

Backing up and migrating a conda environment

Posted on 2025-11-06

Using an environment named test as the example.
Add a new environments directory:
conda config --add envs_dirs D:\conda\envs
Check it:
conda config --show envs_dirs

Export:
conda env export > test_env.yml

Deactivate the environment, then remove it:
conda remove -n test --all

Recreate the environment:
conda env create -n test -f test_env.yml
Verify:
conda info --envs

Migrating the package cache
Edit the .condarc file, or run directly:
conda config --add pkgs_dirs D:\conda\pkgs
conda config --remove pkgs_dirs C:\Users\xxxx\.conda\pkgs
The file contents after the change:

channels:
  - conda-forge
envs_dirs:
  - D:\conda\envs
pkgs_dirs:
  - D:\conda\pkgs

Adding a proxy for Docker

Posted on 2025-05-17

Create a daemon.json file under /etc/docker with the following contents:

{
  "dns": ["8.8.8.8", "8.8.4.4"],
  "registry-mirrors": [
    "https://docker.m.daocloud.io/",
    "https://huecker.io/",
    "https://dockerhub.timeweb.cloud",
    "https://noohub.ru/",
    "https://dockerproxy.com",
    "https://docker.mirrors.ustc.edu.cn",
    "https://docker.nju.edu.cn",
    "https://xx4bwyg2.mirror.aliyuncs.com",
    "http://f1361db2.m.daocloud.io",
    "https://registry.docker-cn.com",
    "http://hub-mirror.c.163.com"
  ],
  "runtimes": {
    "nvidia": {
      "path": "nvidia-container-runtime",
      "runtimeArgs": []
    }
  }
}


Then run sudo systemctl restart docker

Test it:
docker pull busybox
The pull succeeds.
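Because a syntax error in daemon.json keeps the Docker daemon from starting, it can be worth validating the file before restarting the service. The following is a minimal sketch using the standard `json` module. `check_daemon_config` is a hypothetical helper, and it only checks the `registry-mirrors` field used above.

```python
import json


def check_daemon_config(text: str) -> list:
    """Parse daemon.json text and return the configured registry mirrors."""
    config = json.loads(text)  # raises json.JSONDecodeError on malformed JSON
    mirrors = config.get("registry-mirrors", [])
    if not isinstance(mirrors, list) or not all(isinstance(m, str) for m in mirrors):
        raise ValueError("registry-mirrors must be a list of URL strings")
    return mirrors


sample = '{"dns": ["8.8.8.8"], "registry-mirrors": ["https://docker.m.daocloud.io/"]}'
print(check_daemon_config(sample))
# To check the real file:
#   check_daemon_config(open("/etc/docker/daemon.json").read())
```

Running the check before `systemctl restart docker` turns a failed daemon start into an ordinary Python error message that points at the offending line.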

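33. An AI's answer on which AI algorithms are outdated

Posted on 2025-04-04

Q: If time is limited and some outdated algorithms have to be dropped, please rank all the algorithms you know by which to drop first.

When time is limited, certain algorithms can be dropped or postponed in the following order of priority. The ranking is based on how often each is used in practice, how good the alternatives are, and how foundational it is (from drop-first to keep-if-possible):

First priority: algorithms that can be dropped

  1. Fully outdated algorithms with strong replacements
    • Single-layer perceptron: superseded by the multilayer perceptron (MLP) and deep learning; limited teaching value.
    • Apriori (association rule mining): computationally inefficient; FP-Growth or deep learning (such as sequence models) is more efficient.
    • LVQ (learning vector quantization): a prototype-based clustering method with very few real applications; K-Means is more general.
    • ID3 decision tree: supports only discrete features; superseded by C4.5 and CART (which support continuous features).

  2. Algorithms with narrow application scope
    • SOM (self-organizing map): an unsupervised neural network used mainly for visualization; deep learning (such as t-SNE, UMAP) is more common.
    • Traditional Bayesian network structure learning: computationally complex and dependent on domain knowledge; in practice probabilistic graphical model toolkits (such as PyMC3) are more common.
    • K-Medoids: similar to K-Means, but sensitive to noise and computationally expensive; learn K-Means first.

Second priority: drop selectively, depending on the field

  3. Algorithms with strong replacements in specific fields
    • LDA (linear discriminant analysis): for dimensionality reduction and classification, PCA is more general, unless class separation should guide the reduction.
    • Traditional genetic algorithms: for optimization problems, gradient descent or Bayesian optimization is more efficient, unless you study evolutionary computation.
    • HMM (hidden Markov model): replaced by RNN/Transformer in speech recognition, but still potentially useful in bioinformatics.
    • Traditional Markov chains: for simple sequence modeling, RNN/LSTM is more powerful.

  4. High-complexity, rarely used algorithms
    • SMO algorithm (the original SVM implementation): in practice, call a library directly (such as sklearn.svm.SVC); there is no need to implement it by hand.
    • Traditional kernel methods (such as custom kernel functions): the kernel trick has become less important since the rise of deep learning.

Core algorithms to keep if at all possible

  5. Foundational, widely used algorithms
    • Logistic regression: the basis of classification tasks, with strong interpretability.
    • Linear SVM (LinearSVC): efficient for high-dimensional classification; common in industry.
    • Decision tree (CART): the basis of random forests and GBDT.
    • KNN: the representative lazy learner; suited to real-time scenarios.
    • Random forest and gradient boosting (XGBoost/LightGBM): the mainstream models for structured data.
    • K-Means and PCA: basic tools of unsupervised learning.
    • Neural network fundamentals (MLP, CNN, RNN): the core of deep learning.

  6. Key algorithms for specific scenarios
    • Naive Bayes: efficient for text classification and small-data scenarios.
    • Collaborative filtering: the classic method for recommender systems.
    • ARIMA: time-series analysis (keep it if you work in that area).

Summary of recommendations

  • Drop: single-layer perceptron, Apriori, LVQ, ID3, SOM.
  • Drop selectively: LDA, traditional genetic algorithms, HMM (unless relevant to your field), traditional kernel methods.
  • Keep the core: logistic regression, SVM, decision trees, ensemble learning (RF/GBDT), KNN, PCA, neural networks.
  • Practice first: apply them quickly through libraries (such as Scikit-learn), then dig into the theory.

Note: even for the algorithms you drop, you should still understand their basic ideas (such as "association rule mining" or "evolutionary computation") so you can catch up quickly when needed.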

SIP debugging tools

Posted on 2025-03-17

sipp
wireshark

Essential tools for a fresh Linux install

Posted on 2025-03-08

Applications:
sudo apt install conky guake terminator
Environment:
VS Code, WeChat, WPS

© 2026 qinsibo
津ICP备19000975号-1