The installation guide is at https://nvidia.github.io/TensorRT-LLM/latest/installation/linux.html
Note that installing CUDA under WSL has its own dedicated steps: the GPU driver stays on the Windows side, and inside WSL you only install the CUDA toolkit from NVIDIA's WSL-Ubuntu instructions.
Then add the following to ~/.bashrc in the home directory:
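(The usual contents here are the CUDA paths; assuming the toolkit sits at the default /usr/local/cuda location, something along these lines:)

export PATH=/usr/local/cuda/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH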
Now the installation proper. First PyTorch and the required libraries, which is a download of a little over a gigabyte:

pip3 install torch==2.7.1 torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128
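An optional but cheap sanity check that this CUDA build of PyTorch actually sees the GPU:

python3 -c "import torch; print(torch.__version__, torch.cuda.is_available())"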
Then install TensorRT-LLM:

pip3 install --upgrade pip setuptools && pip3 install tensorrt_llm
And then it hung…

pip3 install tensorrt_llm

After sitting stuck for a good ten-plus minutes, it resumed scrolling on its own.
Then it got stuck on this line:

Collecting tensorrt_cu12_libs==10.11.0.33 (from tensorrt_cu12==10.11.0.33->tensorrt~=10.11.0->tensorrt_llm)

About ten minutes later it moved on again.
In the end it pulled in a big pile of packages and finished without further trouble:

Successfully uninstalled fsspec-2025.9.0
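Before moving on, a one-line import check confirms the package is usable (the printed version should match what pip just installed):

python3 -c "import tensorrt_llm; print(tensorrt_llm.__version__)"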
Time to write a program and try it, using the example TensorRT-LLM provides:
```python
from tensorrt_llm import LLM, SamplingParams


def main():
    # Model could accept HF model name, a path to local HF model,
    # or TensorRT Model Optimizer's quantized checkpoints like nvidia/Llama-3.1-8B-Instruct-FP8 on HF.
    llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")

    # Sample prompts.
    prompts = [
        "Hello, my name is",
        "The capital of France is",
        "The future of AI is",
    ]

    # Create a sampling params.
    sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

    for output in llm.generate(prompts, sampling_params):
        print(
            f"Prompt: {output.prompt!r}, Generated text: {output.outputs[0].text!r}"
        )

    # Got output like
    # Prompt: 'Hello, my name is', Generated text: '\n\nJane Smith. I am a student pursuing my degree in Computer Science at [university]. I enjoy learning new things, especially technology and programming'
    # Prompt: 'The president of the United States is', Generated text: 'likely to nominate a new Supreme Court justice to fill the seat vacated by the death of Antonin Scalia. The Senate should vote to confirm the'
    # Prompt: 'The capital of France is', Generated text: 'Paris.'
    # Prompt: 'The future of AI is', Generated text: 'an exciting time for us. We are constantly researching, developing, and improving our platform to create the most advanced and efficient model available. We are'


if __name__ == "__main__":
    main()
```
Running it, after a long wait, ended in an error: huggingface.co could not be reached.
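A common workaround when huggingface.co is unreachable is to point the Hugging Face downloader at a mirror (or route it through a proxy) before rerunning, for example:

export HF_ENDPOINT=https://hf-mirror.com

Once the model weights can actually be downloaded, the run looks like this: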
(trt) bobo@DESKTOP-K65EUBR:~/test_trt$ python test_trt.py
[2025-11-06 21:30:46] INFO config.py:54: PyTorch version 2.7.1+cu128 available.
/home/bobo/miniforge3/envs/trt/lib/python3.12/site-packages/modelopt/torch/utils/import_utils.py:32: UserWarning: Failed to import huggingface plugin due to: AttributeError("module 'transformers.modeling_utils' has no attribute 'Conv1D'"). You may ignore this warning if you do not need this plugin.
warnings.warn(
/home/bobo/miniforge3/envs/trt/lib/python3.12/site-packages/modelopt/torch/__init__.py:36: UserWarning: transformers version 4.57.1 is incompatible with nvidia-modelopt and may cause issues. Please install recommended version with pip install nvidia-modelopt[hf] if working with HF models.
warnings.warn(
2025-11-06 21:30:50,629 - INFO - flashinfer.jit: Prebuilt kernels not found, using JIT backend
[TensorRT-LLM] TensorRT LLM version: 1.0.0
[11/06/2025-21:30:50] [TRT-LLM] [I] Using LLM with PyTorch backend
[11/06/2025-21:30:50] [TRT-LLM] [W] Using default gpus_per_node: 1
[11/06/2025-21:30:50] [TRT-LLM] [I] Set nccl_plugin to None.
[11/06/2025-21:30:50] [TRT-LLM] [I] neither checkpoint_format nor checkpoint_loader were provided, checkpoint_format will be set to HF.
rank 0 using MpiPoolSession to spawn MPI processes
[2025-11-06 21:31:03] INFO config.py:54: PyTorch version 2.7.1+cu128 available.
Multiple distributions found for package optimum. Picked distribution: optimum
/home/bobo/miniforge3/envs/trt/lib/python3.12/site-packages/modelopt/torch/utils/import_utils.py:32: UserWarning: Failed to import huggingface plugin due to: AttributeError("module 'transformers.modeling_utils' has no attribute 'Conv1D'"). You may ignore this warning if you do not need this plugin.
warnings.warn(
/home/bobo/miniforge3/envs/trt/lib/python3.12/site-packages/modelopt/torch/__init__.py:36: UserWarning: transformers version 4.57.1 is incompatible with nvidia-modelopt and may cause issues. Please install recommended version with pip install nvidia-modelopt[hf] if working with HF models.
warnings.warn(
2025-11-06 21:31:07,496 - INFO - flashinfer.jit: Prebuilt kernels not found, using JIT backend
[TensorRT-LLM] TensorRT LLM version: 1.0.0
[TensorRT-LLM][INFO] Refreshed the MPI local session
torch_dtype is deprecated! Use dtype instead!
Loading safetensors weights in parallel: 100%|██████████| 1/1 [00:00<00:00, 61.60it/s]
Loading weights: 100%|██████████| 449/449 [00:00<00:00, 557.09it/s]
Model init total – 2.23s
[TensorRT-LLM][INFO] Number of tokens per block: 32.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 0.18 GiB for max tokens in paged KV cache (8352).
2025-11-06 21:31:12,625 - INFO - flashinfer.jit: Loading JIT ops: norm
2025-11-06 21:31:48,328 - INFO - flashinfer.jit: Finished loading JIT ops: norm
[TensorRT-LLM][INFO] Number of tokens per block: 32.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 17.72 GiB for max tokens in paged KV cache (844512).
Processed requests: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 3.15it/s]
Prompt: 'Hello, my name is', Generated text: '[Your Name] and I am a [Your Position] at [Your Company]. I am writing to express my interest in the [Job Title] position at'
Prompt: 'The capital of France is', Generated text: 'Paris.\n\n2. B. C. The capital of Canada is Ottawa.\n\n3. A. C. The capital of Australia is Can'
Prompt: 'The future of AI is', Generated text: "bright, and it's not just for big companies. Small businesses can also benefit from AI technology. Here are some ways:\n\n1."
It ran through cleanly. Next, serve the model:

trtllm-serve "TinyLlama/TinyLlama-1.1B-Chat-v1.0"

Wait until it shows:
INFO: Uvicorn running on http://localhost:8000 (Press CTRL+C to quit)
/home/bobo/miniforge3/envs/trt/lib/python3.12/site-packages/pydantic/_internal/_generate_schema.py:2249: UnsupportedFieldAttributeWarning: The 'validation_alias' attribute with value 'max_tokens' was provided to the Field() function, which has no effect in the context it was used. 'validation_alias' is field-specific metadata, and can only be attached to a model field using Annotated metadata or by assignment. This may have happened because an Annotated type alias using the type statement was used, or if the Field() function was attached to a single member of a union type.
warnings.warn(
INFO: 127.0.0.1:54364 - "POST /v1/chat/completions HTTP/1.1" 200 OK
Then open another WSL bash and enter:
curl -X POST http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -H "Accept: application/json" -d '{
  "model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
  "messages": [{"role": "system", "content": "You are a helpful assistant."},
               {"role": "user", "content": "Where is New York? Tell me in a single sentence."}],
  "max_tokens": 32,
  "temperature": 0
}'
Got the reply: "New York is a city in the northeastern United States, located on the eastern coast of the state of New York."
{"id":"chatcmpl-31b02f6ab4854863909850ab9688d8b1","object":"chat.completion","created":1762436547,"model":"TinyLlama/TinyLlama-1.1B-Chat-v1.0","choices":[{"index":0,"message":{"role":"assistant","content":"New York is a city in the northeastern United States, located on the eastern coast of the state of New York.","reasoning_content":null,"tool_calls":[]},"logprobs":null,"finish_reason":"stop","stop_reason":null,"disaggregated_params":null}],"usage":{"prompt_tokens":43,"total_tokens":70,"completion_tokens":27},"prompt_token_ids":null}(trt)
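Since trtllm-serve speaks the OpenAI-compatible API, the same request can also be made from Python with the openai client. A minimal sketch (the api_key value is just a placeholder; the local server doesn't check it):

```python
# pip install openai
from openai import OpenAI

# Point the client at the local trtllm-serve endpoint.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

resp = client.chat.completions.create(
    model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Where is New York? Tell me in a single sentence."},
    ],
    max_tokens=32,
    temperature=0,
)
print(resp.choices[0].message.content)
```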
trtllm-serve "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"
This one died, spitting out a pile of advice on how to rein in GPU memory usage:
Please refer to the TensorRT LLM documentation for information on how to control the memory usage through TensorRT LLM configuration options. Possible options include:
Model: reduce max_num_tokens and/or shard the model weights across GPUs by enabling pipeline and/or tensor parallelism
Sampler: reduce max_seq_len and/or max_attention_window_size
Initial KV cache (temporary for KV cache size estimation): reduce max_num_tokens
Drafter: reduce max_seq_len and/or max_draft_len
Additional executor resources (temporary for KV cache size estimation): reduce max_num_tokens
Model resources created during usage: reduce max_num_tokens
KV cache: reduce free_gpu_memory_fraction
Additional executor resources: reduce max_num_tokens
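Those knobs are all arguments of the LLM API, so one way to experiment is to load the model from Python with tighter limits. A rough sketch (values are illustrative, and argument names follow the TensorRT-LLM LLM API docs rather than anything verified on this machine):

```python
from tensorrt_llm import LLM
from tensorrt_llm.llmapi import KvCacheConfig

# Cap how much of the free GPU memory the paged KV cache may claim,
# and keep sequence/batch limits small so activations stay modest.
llm = LLM(
    model="deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B",
    kv_cache_config=KvCacheConfig(free_gpu_memory_fraction=0.5),
    max_seq_len=2048,
    max_num_tokens=2048,
    max_batch_size=8,
)
```

trtllm-serve exposes equivalent options; trtllm-serve --help lists them.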
trtllm-serve "meta-llama/Llama-3.2-1B"
This one needs a Hugging Face access token, so I gave up on it.
Verification complete.