Installing and Running TensorRT-LLM on WSL

The installation guide is at https://nvidia.github.io/TensorRT-LLM/latest/installation/linux.html

First, create and activate a conda environment:

```
conda create -n trt python=3.12
conda activate trt
```

Note that installing CUDA on WSL has its own dedicated steps:

```
wget https://developer.download.nvidia.com/compute/cuda/repos/wsl-ubuntu/x86_64/cuda-wsl-ubuntu.pin
sudo mv cuda-wsl-ubuntu.pin /etc/apt/preferences.d/cuda-repository-pin-600
wget https://developer.download.nvidia.com/compute/cuda/12.9.1/local_installers/cuda-repo-wsl-ubuntu-12-9-local_12.9.1-1_amd64.deb
sudo dpkg -i cuda-repo-wsl-ubuntu-12-9-local_12.9.1-1_amd64.deb
sudo cp /var/cuda-repo-wsl-ubuntu-12-9-local/cuda-*-keyring.gpg /usr/share/keyrings/
sudo apt-get update
sudo apt-get -y install cuda-toolkit-12-9
```


Then add the following to ~/.bashrc in the home directory:

```
export PATH=/usr/local/cuda-12/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda-12/lib64:$LD_LIBRARY_PATH
export CUDA_HOME=/usr/local/cuda-12

conda activate trt
```
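
After reloading the shell configuration, a quick sanity check that the toolkit is actually visible (assuming the paths above; on WSL the driver itself comes from the Windows side):

```
source ~/.bashrc
nvcc --version   # should report release 12.9 if the toolkit installed correctly
nvidia-smi       # should list the GPU exposed through the WSL driver
```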

Now for the installation proper. First, PyTorch and the required libraries, which is well over a gigabyte of downloads:

```
pip3 install torch==2.7.1 torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128

sudo apt-get -y install libopenmpi-dev

# Optional step: Only required for disagg-serving
sudo apt-get -y install libzmq3-dev
```
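
Before moving on, it is worth a quick check that this torch build actually sees the GPU:

```
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
```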


Then install TensorRT-LLM itself:

```
pip3 install --upgrade pip setuptools && pip3 install tensorrt_llm
```

And it hung…

```
pip3 install tensorrt_llm
Collecting tensorrt_llm
Using cached tensorrt_llm-1.0.0.tar.gz (1.6 kB)
Installing build dependencies ... done
Getting requirements to build wheel ... done
Preparing metadata (pyproject.toml) ... -
```

After sitting there for a good ten-plus minutes, it resumed scrolling on its own,
only to get stuck again on this line:

```
Collecting tensorrt_cu12_libs==10.11.0.33 (from tensorrt_cu12==10.11.0.33->tensorrt~=10.11.0->tensorrt_llm)
Downloading tensorrt_cu12_libs-10.11.0.33.tar.gz (709 bytes)
Installing build dependencies ... done
Getting requirements to build wheel ... done
Preparing metadata (pyproject.toml) ... -
```

About ten minutes later it continued, and in the end it finished smoothly, installing this whole pile of packages:

```
      Successfully uninstalled fsspec-2025.9.0
Successfully installed StrEnum-0.4.15 accelerate-1.11.0 aenum-3.1.16 aiohappyeyeballs-2.6.1 aiohttp-3.13.2 aiosignal-1.4.0 annotated-types-0.7.0 antlr4-python3-runtime-4.9.3 anyio-4.11.0 attrs-25.4.0 backoff-2.2.1 blake3-1.0.8 blobfile-3.1.0 build-1.3.0 certifi-2025.10.5 cffi-2.0.0 charset_normalizer-3.4.4 click-8.3.0 click_option_group-0.5.9 colored-2.3.1 contourpy-1.3.3 cuda-bindings-12.9.4 cuda-pathfinder-1.3.2 cuda-python-12.9.4 cycler-0.12.1 datasets-3.1.0 diffusers-0.35.2 dill-0.3.8 distro-1.9.0 einops-0.8.1 etcd3-0.12.0 evaluate-0.4.6 fastapi-0.115.4 flashinfer-python-0.2.5 fonttools-4.60.1 frozenlist-1.8.0 fsspec-2024.9.0 grpcio-1.76.0 h11-0.16.0 h5py-3.12.1 hf-xet-1.2.0 httpcore-1.0.9 httpx-0.28.1 huggingface-hub-0.36.0 idna-3.11 importlib_metadata-8.7.0 jiter-0.11.1 kiwisolver-1.4.9 lark-1.3.1 llguidance-0.7.29 lxml-6.0.2 markdown-it-py-4.0.0 matplotlib-3.10.7 mdurl-0.1.2 meson-1.9.1 ml_dtypes-0.5.3 mpi4py-4.1.1 multidict-6.7.0 multiprocess-0.70.16 ninja-1.13.0 numpy-1.26.4 nvidia-ml-py-12.575.51 nvidia-modelopt-0.33.1 nvidia-modelopt-core-0.33.1 nvtx-0.2.13 omegaconf-2.3.0 onnx-1.19.1 onnx_graphsurgeon-0.5.8 openai-2.7.1 opencv-python-headless-4.11.0.86 optimum-2.0.0 ordered-set-4.1.0 packaging-25.0 pandas-2.3.3 peft-0.17.1 pillow-10.3.0 polygraphy-0.49.26 propcache-0.4.1 protobuf-6.33.0 psutil-7.1.3 pulp-3.3.0 pyarrow-22.0.0 pycparser-2.23 pycryptodomex-3.23.0 pydantic-2.12.4 pydantic-core-2.41.5 pydantic-settings-2.11.0 pygments-2.19.2 pynvml-12.0.0 pyparsing-3.2.5 pyproject_hooks-1.2.0 python-dateutil-2.9.0.post0 python-dotenv-1.2.1 pytz-2025.2 pyyaml-6.0.3 pyzmq-27.1.0 regex-2025.11.3 requests-2.32.5 rich-14.2.0 safetensors-0.6.2 scipy-1.16.3 sentencepiece-0.2.1 setuptools-79.0.1 six-1.17.0 sniffio-1.3.1 soundfile-0.13.1 starlette-0.41.3 tenacity-9.1.2 tensorrt-10.11.0.33 tensorrt_cu12-10.11.0.33 tensorrt_cu12_bindings-10.11.0.33 tensorrt_cu12_libs-10.11.0.33 tensorrt_llm-1.0.0 tiktoken-0.12.0 tokenizers-0.21.4 torchprofile-0.0.4 tqdm-4.67.1 transformers-4.53.1 typing-inspection-0.4.2 tzdata-2025.2 urllib3-2.5.0 uvicorn-0.38.0 xgrammar-0.1.21 xxhash-3.6.0 yarl-1.22.0 zipp-3.23.0

```

Let's try it with a quick program, using the example from the TensorRT-LLM docs:
```
from tensorrt_llm import LLM, SamplingParams


def main():
    # Model could accept HF model name, a path to local HF model,
    # or TensorRT Model Optimizer's quantized checkpoints like nvidia/Llama-3.1-8B-Instruct-FP8 on HF.
    llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")

    # Sample prompts.
    prompts = [
        "Hello, my name is",
        "The capital of France is",
        "The future of AI is",
    ]

    # Create a sampling params.
    sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

    for output in llm.generate(prompts, sampling_params):
        print(
            f"Prompt: {output.prompt!r}, Generated text: {output.outputs[0].text!r}"
        )

    # Got output like
    # Prompt: 'Hello, my name is', Generated text: '\n\nJane Smith. I am a student pursuing my degree in Computer Science at [university]. I enjoy learning new things, especially technology and programming'
    # Prompt: 'The president of the United States is', Generated text: 'likely to nominate a new Supreme Court justice to fill the seat vacated by the death of Antonin Scalia. The Senate should vote to confirm the'
    # Prompt: 'The capital of France is', Generated text: 'Paris.'
    # Prompt: 'The future of AI is', Generated text: 'an exciting time for us. We are constantly researching, developing, and improving our platform to create the most advanced and efficient model available. We are'


if __name__ == '__main__':
    main()
```

Running it, after a long wait, it threw an error: huggingface.co could not be reached.

```
Traceback (most recent call last):
File "/home/bobo/miniforge3/envs/trt/lib/python3.12/site-packages/urllib3/connection.py", line 198, in _new_conn
sock = connection.create_connection(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/bobo/miniforge3/envs/trt/lib/python3.12/site-packages/urllib3/util/connection.py", line 85, in create_connection
raise err
File "/home/bobo/miniforge3/envs/trt/lib/python3.12/site-packages/urllib3/util/connection.py", line 73, in create_connection
sock.connect(sa)
OSError: [Errno 101] Network is unreachable

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "/home/bobo/miniforge3/envs/trt/lib/python3.12/site-packages/urllib3/connectionpool.py", line 787, in urlopen
response = self._make_request(
^^^^^^^^^^^^^^^^^^^
File "/home/bobo/miniforge3/envs/trt/lib/python3.12/site-packages/urllib3/connectionpool.py", line 488, in _make_request
raise new_e
File "/home/bobo/miniforge3/envs/trt/lib/python3.12/site-packages/urllib3/connectionpool.py", line 464, in _make_request
self._validate_conn(conn)
File "/home/bobo/miniforge3/envs/trt/lib/python3.12/site-packages/urllib3/connectionpool.py", line 1093, in _validate_conn
conn.connect()
File "/home/bobo/miniforge3/envs/trt/lib/python3.12/site-packages/urllib3/connection.py", line 753, in connect
self.sock = sock = self._new_conn()
^^^^^^^^^^^^^^^^
File "/home/bobo/miniforge3/envs/trt/lib/python3.12/site-packages/urllib3/connection.py", line 213, in _new_conn
raise NewConnectionError(
urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPSConnection object at 0x716f0fd40b30>: Failed to establish a new connection: [Errno 101] Network is unreachable

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "/home/bobo/miniforge3/envs/trt/lib/python3.12/site-packages/requests/adapters.py", line 644, in send
resp = conn.urlopen(
^^^^^^^^^^^^^
File "/home/bobo/miniforge3/envs/trt/lib/python3.12/site-packages/urllib3/connectionpool.py", line 841, in urlopen
retries = retries.increment(
^^^^^^^^^^^^^^^^^^
File "/home/bobo/miniforge3/envs/trt/lib/python3.12/site-packages/urllib3/util/retry.py", line 519, in increment
raise MaxRetryError(_pool, url, reason) from reason # type: ignore[arg-type]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='huggingface.co', port=443): Max retries exceeded with url: /api/models/TinyLlama/TinyLlama-1.1B-Chat-v1.0/revision/main (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x716f0fd40b30>: Failed to establish a new connection: [Errno 101] Network is unreachable'))
```
There was a pile of further output after that, which can be ignored.
Since this is mainland China, to reach HF I set the mirror endpoint:

export HF_ENDPOINT=https://hf-mirror.com

Rerunning after that still failed, so, as the error message suggested, I upgraded transformers:

pip install transformers -U

and added

export HF_HUB_BASE_URL=https://hf-mirror.com

Running it again, it still errored.

After a lot of fiddling around, I found that

```
trtllm-serve "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
```

actually managed to download the model.
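For what it's worth, another way around the connectivity problem is to pre-download the weights through the mirror and point LLM() at the local directory instead of the hub name; a sketch, with an arbitrary target path:

```
huggingface-cli download TinyLlama/TinyLlama-1.1B-Chat-v1.0 --local-dir ./TinyLlama-1.1B-Chat-v1.0
```

The example's own comment notes that LLM(model=...) also accepts a local HF model path, so LLM(model="./TinyLlama-1.1B-Chat-v1.0") would then skip the hub lookup entirely.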
I then went straight back to the test program:
```
(trt) bobo@DESKTOP-K65EUBR:~/test_trt$ python test_trt.py
<frozen importlib._bootstrap_external>:1301: FutureWarning: The cuda.cuda module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.driver module instead.
<frozen importlib._bootstrap_external>:1301: FutureWarning: The cuda.cudart module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.runtime module instead.
[2025-11-06 21:27:24] INFO config.py:54: PyTorch version 2.7.1+cu128 available.
/home/bobo/miniforge3/envs/trt/lib/python3.12/site-packages/modelopt/torch/utils/import_utils.py:32: UserWarning: Failed to import huggingface plugin due to: AttributeError("module 'transformers.modeling_utils' has no attribute 'Conv1D'"). You may ignore this warning if you do not need this plugin.
warnings.warn(
/home/bobo/miniforge3/envs/trt/lib/python3.12/site-packages/modelopt/torch/__init__.py:36: UserWarning: transformers version 4.57.1 is incompatible with nvidia-modelopt and may cause issues. Please install recommended version with `pip install nvidia-modelopt[hf]` if working with HF models.
_warnings.warn(
2025-11-06 21:27:28,299 - INFO - flashinfer.jit: Prebuilt kernels not found, using JIT backend
[TensorRT-LLM] TensorRT LLM version: 1.0.0
[11/06/2025-21:27:28] [TRT-LLM] [I] Using LLM with PyTorch backend
[11/06/2025-21:27:28] [TRT-LLM] [W] Using default gpus_per_node: 1
[11/06/2025-21:27:28] [TRT-LLM] [I] Set nccl_plugin to None.
[11/06/2025-21:27:28] [TRT-LLM] [I] neither checkpoint_format nor checkpoint_loader were provided, checkpoint_format will be set to HF.
/home/bobo/miniforge3/envs/trt/lib/python3.12/site-packages/torch/utils/cpp_extension.py:2356: UserWarning: TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'].
warnings.warn(
rank 0 using MpiPoolSession to spawn MPI processes
<frozen importlib._bootstrap_external>:1301: FutureWarning: The cuda.cuda module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.driver module instead.
<frozen importlib._bootstrap_external>:1301: FutureWarning: The cuda.cudart module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.runtime module instead.
[2025-11-06 21:27:36] INFO config.py:54: PyTorch version 2.7.1+cu128 available.
Multiple distributions found for package optimum. Picked distribution: optimum
/home/bobo/miniforge3/envs/trt/lib/python3.12/site-packages/modelopt/torch/utils/import_utils.py:32: UserWarning: Failed to import huggingface plugin due to: AttributeError("module 'transformers.modeling_utils' has no attribute 'Conv1D'"). You may ignore this warning if you do not need this plugin.
warnings.warn(
/home/bobo/miniforge3/envs/trt/lib/python3.12/site-packages/modelopt/torch/__init__.py:36: UserWarning: transformers version 4.57.1 is incompatible with nvidia-modelopt and may cause issues. Please install recommended version with `pip install nvidia-modelopt[hf]` if working with HF models.
_warnings.warn(
2025-11-06 21:27:41,023 - INFO - flashinfer.jit: Prebuilt kernels not found, using JIT backend
[TensorRT-LLM] TensorRT LLM version: 1.0.0
/home/bobo/miniforge3/envs/trt/lib/python3.12/site-packages/torch/utils/cpp_extension.py:2356: UserWarning: TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'].
warnings.warn(
[TensorRT-LLM][INFO] Refreshed the MPI local session
`torch_dtype` is deprecated! Use `dtype` instead!
Loading safetensors weights in parallel: 100%|██████████| 1/1 [00:00<00:00, 104.27it/s]
Loading weights: 100%|██████████| 449/449 [00:00<00:00, 572.52it/s]
Model init total -- 3.57s
[TensorRT-LLM][INFO] Number of tokens per block: 32.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 0.18 GiB for max tokens in paged KV cache (8352).
2025-11-06 21:27:47,676 - INFO - flashinfer.jit: Loading JIT ops: norm
/home/bobo/miniforge3/envs/trt/lib/python3.12/site-packages/torch/utils/cpp_extension.py:2356: UserWarning: TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'].
warnings.warn(
^CTraceback (most recent call last):
File "/home/bobo/test_trt/test_trt.py", line 33, in <module>
main()
File "/home/bobo/test_trt/test_trt.py", line 8, in main
llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/bobo/miniforge3/envs/trt/lib/python3.12/site-packages/tensorrt_llm/llmapi/llm.py", line 1125, in __init__
super().__init__(model, tokenizer, tokenizer_mode, skip_tokenizer_init,
File "/home/bobo/miniforge3/envs/trt/lib/python3.12/site-packages/tensorrt_llm/llmapi/llm.py", line 942, in __init__
super().__init__(model,
File "/home/bobo/miniforge3/envs/trt/lib/python3.12/site-packages/tensorrt_llm/llmapi/llm.py", line 214, in __init__
self._build_model()
File "/home/bobo/miniforge3/envs/trt/lib/python3.12/site-packages/tensorrt_llm/llmapi/llm.py", line 1072, in _build_model
self._executor = self._executor_cls.create(
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/bobo/miniforge3/envs/trt/lib/python3.12/site-packages/tensorrt_llm/executor/executor.py", line 423, in create
return GenerationExecutorProxy(
^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/bobo/miniforge3/envs/trt/lib/python3.12/site-packages/tensorrt_llm/executor/proxy.py", line 105, in __init__
self._start_executor_workers(worker_kwargs)
File "/home/bobo/miniforge3/envs/trt/lib/python3.12/site-packages/tensorrt_llm/executor/proxy.py", line 319, in _start_executor_workers
if self.worker_init_status_queue.poll(1):
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/bobo/miniforge3/envs/trt/lib/python3.12/site-packages/tensorrt_llm/executor/ipc.py", line 110, in poll
events = dict(self.poller.poll(timeout=timeout * 1000))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/bobo/miniforge3/envs/trt/lib/python3.12/site-packages/zmq/sugar/poll.py", line 106, in poll
return zmq_poll(self.sockets, timeout=timeout)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "zmq/backend/cython/_zmq.py", line 1680, in zmq.backend.cython._zmq.zmq_poll
File "zmq/backend/cython/_zmq.py", line 179, in zmq.backend.cython._zmq._check_rc
KeyboardInterrupt
^CException ignored in: <module 'threading' from '/home/bobo/miniforge3/envs/trt/lib/python3.12/threading.py'>
Traceback (most recent call last):
File "/home/bobo/miniforge3/envs/trt/lib/python3.12/threading.py", line 1594, in _shutdown
atexit_call()
File "/home/bobo/miniforge3/envs/trt/lib/python3.12/site-packages/mpi4py/futures/_core.py", line 172, in join_threads
thread.join()
File "/home/bobo/miniforge3/envs/trt/lib/python3.12/threading.py", line 1149, in join
self._wait_for_tstate_lock()
File "/home/bobo/miniforge3/envs/trt/lib/python3.12/threading.py", line 1169, in _wait_for_tstate_lock
if lock.acquire(block, timeout):
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
KeyboardInterrupt:
[11/06/2025-21:30:34] [TRT-LLM] [E] Failed to send object: None
^C^C^CException ignored in atexit callback: <function shutdown_compile_workers at 0x7707940725c0>
Traceback (most recent call last):
File "/home/bobo/miniforge3/envs/trt/lib/python3.12/site-packages/torch/_inductor/async_compile.py", line 113, in shutdown_compile_workers
pool.shutdown()
File "/home/bobo/miniforge3/envs/trt/lib/python3.12/site-packages/torch/_inductor/compile_worker/subproc_pool.py", line 239, in shutdown
self.process.wait(300)
File "/home/bobo/miniforge3/envs/trt/lib/python3.12/subprocess.py", line 1277, in wait
self._wait(timeout=sigint_timeout)
File "/home/bobo/miniforge3/envs/trt/lib/python3.12/subprocess.py", line 2047, in _wait
time.sleep(delay)
KeyboardInterrupt:
```
Once I saw where it was stuck, I added another environment variable:
export TORCH_CUDA_ARCH_LIST="8.6;8.9"

The value came from:

python -c "import torch; print(torch.cuda.get_device_capability())"

which returned (8, 9), so I appended 8.9 to the list and ran the script again.
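The same value can also be derived in one line rather than by hand (a small sketch; it only reflects the single GPU that torch currently sees):

```
export TORCH_CUDA_ARCH_LIST="$(python -c 'import torch; print(".".join(map(str, torch.cuda.get_device_capability())))')"
```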

```
(trt) bobo@DESKTOP-K65EUBR:~/test_trt$ python test_trt.py
<frozen importlib._bootstrap_external>:1301: FutureWarning: The cuda.cuda module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.driver module instead.
<frozen importlib._bootstrap_external>:1301: FutureWarning: The cuda.cudart module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.runtime module instead.
[2025-11-06 21:30:46] INFO config.py:54: PyTorch version 2.7.1+cu128 available.
/home/bobo/miniforge3/envs/trt/lib/python3.12/site-packages/modelopt/torch/utils/import_utils.py:32: UserWarning: Failed to import huggingface plugin due to: AttributeError("module 'transformers.modeling_utils' has no attribute 'Conv1D'"). You may ignore this warning if you do not need this plugin.
warnings.warn(
/home/bobo/miniforge3/envs/trt/lib/python3.12/site-packages/modelopt/torch/__init__.py:36: UserWarning: transformers version 4.57.1 is incompatible with nvidia-modelopt and may cause issues. Please install recommended version with `pip install nvidia-modelopt[hf]` if working with HF models.
_warnings.warn(
2025-11-06 21:30:50,629 - INFO - flashinfer.jit: Prebuilt kernels not found, using JIT backend
[TensorRT-LLM] TensorRT LLM version: 1.0.0
[11/06/2025-21:30:50] [TRT-LLM] [I] Using LLM with PyTorch backend
[11/06/2025-21:30:50] [TRT-LLM] [W] Using default gpus_per_node: 1
[11/06/2025-21:30:50] [TRT-LLM] [I] Set nccl_plugin to None.
[11/06/2025-21:30:50] [TRT-LLM] [I] neither checkpoint_format nor checkpoint_loader were provided, checkpoint_format will be set to HF.
rank 0 using MpiPoolSession to spawn MPI processes
<frozen importlib._bootstrap_external>:1301: FutureWarning: The cuda.cuda module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.driver module instead.
<frozen importlib._bootstrap_external>:1301: FutureWarning: The cuda.cudart module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.runtime module instead.
[2025-11-06 21:31:03] INFO config.py:54: PyTorch version 2.7.1+cu128 available.
Multiple distributions found for package optimum. Picked distribution: optimum
/home/bobo/miniforge3/envs/trt/lib/python3.12/site-packages/modelopt/torch/utils/import_utils.py:32: UserWarning: Failed to import huggingface plugin due to: AttributeError("module 'transformers.modeling_utils' has no attribute 'Conv1D'"). You may ignore this warning if you do not need this plugin.
warnings.warn(
/home/bobo/miniforge3/envs/trt/lib/python3.12/site-packages/modelopt/torch/__init__.py:36: UserWarning: transformers version 4.57.1 is incompatible with nvidia-modelopt and may cause issues. Please install recommended version with `pip install nvidia-modelopt[hf]` if working with HF models.
_warnings.warn(
2025-11-06 21:31:07,496 - INFO - flashinfer.jit: Prebuilt kernels not found, using JIT backend
[TensorRT-LLM] TensorRT LLM version: 1.0.0
[TensorRT-LLM][INFO] Refreshed the MPI local session
`torch_dtype` is deprecated! Use `dtype` instead!
Loading safetensors weights in parallel: 100%|██████████| 1/1 [00:00<00:00, 61.60it/s]
Loading weights: 100%|██████████| 449/449 [00:00<00:00, 557.09it/s]
Model init total -- 2.23s
[TensorRT-LLM][INFO] Number of tokens per block: 32.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 0.18 GiB for max tokens in paged KV cache (8352).
2025-11-06 21:31:12,625 - INFO - flashinfer.jit: Loading JIT ops: norm
2025-11-06 21:31:48,328 - INFO - flashinfer.jit: Finished loading JIT ops: norm
[TensorRT-LLM][INFO] Number of tokens per block: 32.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 17.72 GiB for max tokens in paged KV cache (844512).
Processed requests: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 3.15it/s]
Prompt: 'Hello, my name is', Generated text: '[Your Name] and I am a [Your Position] at [Your Company]. I am writing to express my interest in the [Job Title] position at'
Prompt: 'The capital of France is', Generated text: 'Paris.\n\n2. B. C. The capital of Canada is Ottawa.\n\n3. A. C. The capital of Australia is Can'
Prompt: 'The future of AI is', Generated text: "bright, and it's not just for big companies. Small businesses can also benefit from AI technology. Here are some ways:\n\n1."

```

It ran through cleanly.

Now let's try running it as a service:

```
trtllm-serve "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
```
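By default the server listens on localhost:8000; if a different host or port is needed, flags along these lines should work (names may vary by version, check trtllm-serve --help):

```
trtllm-serve "TinyLlama/TinyLlama-1.1B-Chat-v1.0" --host 0.0.0.0 --port 8000
```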

Wait until it prints:

```
INFO: Uvicorn running on http://localhost:8000 (Press CTRL+C to quit)
/home/bobo/miniforge3/envs/trt/lib/python3.12/site-packages/pydantic/_internal/_generate_schema.py:2249: UnsupportedFieldAttributeWarning: The 'validation_alias' attribute with value 'max_tokens' was provided to the Field() function, which has no effect in the context it was used. 'validation_alias' is field-specific metadata, and can only be attached to a model field using Annotated metadata or by assignment. This may have happened because an Annotated type alias using the type statement was used, or if the Field() function was attached to a single member of a union type.
warnings.warn(
INFO: 127.0.0.1:54364 - "POST /v1/chat/completions HTTP/1.1" 200 OK
```

Then open another WSL bash session and run:

```
curl -X POST http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -H "Accept: application/json" -d '{
  "model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
  "messages": [{"role": "system", "content": "You are a helpful assistant."},
               {"role": "user", "content": "Where is New York? Tell me in a single sentence."}],
  "max_tokens": 32,
  "temperature": 0
}'
```

The reply came back: "New York is a city in the northeastern United States, located on the eastern coast of the state of New York."

```
{"id":"chatcmpl-31b02f6ab4854863909850ab9688d8b1","object":"chat.completion","created":1762436547,"model":"TinyLlama/TinyLlama-1.1B-Chat-v1.0","choices":[{"index":0,"message":{"role":"assistant","content":"New York is a city in the northeastern United States, located on the eastern coast of the state of New York.","reasoning_content":null,"tool_calls":[]},"logprobs":null,"finish_reason":"stop","stop_reason":null,"disaggregated_params":null}],"usage":{"prompt_tokens":43,"total_tokens":70,"completion_tokens":27},"prompt_token_ids":null}(trt)
```
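
Since trtllm-serve exposes an OpenAI-compatible API, the same request can also be made from Python with the openai client that pip pulled in earlier; a minimal sketch (the api_key is a placeholder, the local server does not check it):

```
from openai import OpenAI

# Point the client at the local trtllm-serve endpoint.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")

resp = client.chat.completions.create(
    model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Where is New York? Tell me in a single sentence."},
    ],
    max_tokens=32,
    temperature=0,
)
print(resp.choices[0].message.content)
```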


I tried a few more prompts: it can answer, but its comprehension is limited and it cannot handle Chinese.
Time to switch models, to DeepSeek-R1-Distill-Qwen-1.5B:

```
trtllm-serve "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"
```

It crashed, printing a pile of hints about how to rein in GPU memory usage:

```
Please refer to the TensorRT LLM documentation for information on how to control the memory usage through TensorRT LLM configuration options. Possible options include:
Model: reduce max_num_tokens and/or shard the model weights across GPUs by enabling pipeline and/or tensor parallelism
Sampler: reduce max_seq_len and/or max_attention_window_size
Initial KV cache (temporary for KV cache size estimation): reduce max_num_tokens
Drafter: reduce max_seq_len and/or max_draft_len
Additional executor resources (temporary for KV cache size estimation): reduce max_num_tokens
Model resources created during usage: reduce max_num_tokens
KV cache: reduce free_gpu_memory_fraction
Additional executor resources: reduce max_num_tokens
```
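
Based on those hints, the obvious first things to try would be lowering the KV cache fraction and capping the sequence length when launching the server; a sketch, with flag names that should be double-checked against trtllm-serve --help for the installed version:

```
trtllm-serve "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B" \
  --kv_cache_free_gpu_memory_fraction 0.5 \
  --max_seq_len 4096
```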

```
trtllm-serve "meta-llama/Llama-3.2-1B"
```

This one asks for an HF access token, so I gave up.
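Gated models like this one need a Hugging Face access token; the usual route, not tried here, would be something like:

```
huggingface-cli login   # or: export HF_TOKEN=<your token>
```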

Verification complete.