The installation guide is at https://nvidia.github.io/TensorRT-LLM/latest/installation/linux.html
Note that installing CUDA under WSL has its own dedicated steps: the GPU driver stays on the Windows side, and inside WSL you only install the CUDA toolkit from NVIDIA's WSL-Ubuntu instructions.
Then add the following to ~/.bashrc in the home directory:
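(The usual contents here are the CUDA paths; assuming the toolkit sits at the default /usr/local/cuda location, something along these lines:)

export PATH=/usr/local/cuda/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH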
Now the installation proper. First PyTorch and the required libraries, which is a download of a little over a gigabyte:

pip3 install torch==2.7.1 torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128
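An optional but cheap sanity check that this CUDA build of PyTorch actually sees the GPU:

python3 -c "import torch; print(torch.__version__, torch.cuda.is_available())"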
Then install TensorRT-LLM:

pip3 install --upgrade pip setuptools && pip3 install tensorrt_llm
And then it hung…

pip3 install tensorrt_llm

After sitting stuck for a good ten-plus minutes, it resumed scrolling on its own.
Then it got stuck on this line:

Collecting tensorrt_cu12_libs==10.11.0.33 (from tensorrt_cu12==10.11.0.33->tensorrt~=10.11.0->tensorrt_llm)

About ten minutes later it moved on again.
In the end it pulled in a big pile of packages and finished without further trouble:

Successfully uninstalled fsspec-2025.9.0
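Before moving on, a one-line import check confirms the package is usable (the printed version should match what pip just installed):

python3 -c "import tensorrt_llm; print(tensorrt_llm.__version__)"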
Time to write a program and try it, using the example TensorRT-LLM provides:
```python
from tensorrt_llm import LLM, SamplingParams


def main():
    # Model could accept HF model name, a path to local HF model,
    # or TensorRT Model Optimizer's quantized checkpoints like nvidia/Llama-3.1-8B-Instruct-FP8 on HF.
    llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")

    # Sample prompts.
    prompts = [
        "Hello, my name is",
        "The capital of France is",
        "The future of AI is",
    ]

    # Create a sampling params.
    sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

    for output in llm.generate(prompts, sampling_params):
        print(
            f"Prompt: {output.prompt!r}, Generated text: {output.outputs[0].text!r}"
        )

    # Got output like
    # Prompt: 'Hello, my name is', Generated text: '\n\nJane Smith. I am a student pursuing my degree in Computer Science at [university]. I enjoy learning new things, especially technology and programming'
    # Prompt: 'The president of the United States is', Generated text: 'likely to nominate a new Supreme Court justice to fill the seat vacated by the death of Antonin Scalia. The Senate should vote to confirm the'
    # Prompt: 'The capital of France is', Generated text: 'Paris.'
    # Prompt: 'The future of AI is', Generated text: 'an exciting time for us. We are constantly researching, developing, and improving our platform to create the most advanced and efficient model available. We are'


if __name__ == "__main__":
    main()
```
Running it, after a long wait, ended in an error: huggingface.co could not be reached.
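A common workaround when huggingface.co is unreachable is to point the Hugging Face downloader at a mirror (or route it through a proxy) before rerunning, for example:

export HF_ENDPOINT=https://hf-mirror.com

Once the model weights can actually be downloaded, the run looks like this: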
(trt) bobo@DESKTOP-K65EUBR:~/test_trt$ python test_trt.py
[2025-11-06 21:30:46] INFO config.py:54: PyTorch version 2.7.1+cu128 available.
/home/bobo/miniforge3/envs/trt/lib/python3.12/site-packages/modelopt/torch/utils/import_utils.py:32: UserWarning: Failed to import huggingface plugin due to: AttributeError("module 'transformers.modeling_utils' has no attribute 'Conv1D'"). You may ignore this warning if you do not need this plugin.
warnings.warn(
/home/bobo/miniforge3/envs/trt/lib/python3.12/site-packages/modelopt/torch/__init__.py:36: UserWarning: transformers version 4.57.1 is incompatible with nvidia-modelopt and may cause issues. Please install recommended version with pip install nvidia-modelopt[hf] if working with HF models.
warnings.warn(
2025-11-06 21:30:50,629 - INFO - flashinfer.jit: Prebuilt kernels not found, using JIT backend
[TensorRT-LLM] TensorRT LLM version: 1.0.0
[11/06/2025-21:30:50] [TRT-LLM] [I] Using LLM with PyTorch backend
[11/06/2025-21:30:50] [TRT-LLM] [W] Using default gpus_per_node: 1
[11/06/2025-21:30:50] [TRT-LLM] [I] Set nccl_plugin to None.
[11/06/2025-21:30:50] [TRT-LLM] [I] neither checkpoint_format nor checkpoint_loader were provided, checkpoint_format will be set to HF.
rank 0 using MpiPoolSession to spawn MPI processes
[2025-11-06 21:31:03] INFO config.py:54: PyTorch version 2.7.1+cu128 available.
Multiple distributions found for package optimum. Picked distribution: optimum
/home/bobo/miniforge3/envs/trt/lib/python3.12/site-packages/modelopt/torch/utils/import_utils.py:32: UserWarning: Failed to import huggingface plugin due to: AttributeError("module 'transformers.modeling_utils' has no attribute 'Conv1D'"). You may ignore this warning if you do not need this plugin.
warnings.warn(
/home/bobo/miniforge3/envs/trt/lib/python3.12/site-packages/modelopt/torch/__init__.py:36: UserWarning: transformers version 4.57.1 is incompatible with nvidia-modelopt and may cause issues. Please install recommended version with pip install nvidia-modelopt[hf] if working with HF models.
warnings.warn(
2025-11-06 21:31:07,496 - INFO - flashinfer.jit: Prebuilt kernels not found, using JIT backend
[TensorRT-LLM] TensorRT LLM version: 1.0.0
[TensorRT-LLM][INFO] Refreshed the MPI local session
torch_dtype is deprecated! Use dtype instead!
Loading safetensors weights in parallel: 100%|██████████| 1/1 [00:00<00:00, 61.60it/s]
Loading weights: 100%|██████████| 449/449 [00:00<00:00, 557.09it/s]
Model init total – 2.23s
[TensorRT-LLM][INFO] Number of tokens per block: 32.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 0.18 GiB for max tokens in paged KV cache (8352).
2025-11-06 21:31:12,625 - INFO - flashinfer.jit: Loading JIT ops: norm
2025-11-06 21:31:48,328 - INFO - flashinfer.jit: Finished loading JIT ops: norm
[TensorRT-LLM][INFO] Number of tokens per block: 32.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 17.72 GiB for max tokens in paged KV cache (844512).
Processed requests: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 3.15it/s]
Prompt: 'Hello, my name is', Generated text: '[Your Name] and I am a [Your Position] at [Your Company]. I am writing to express my interest in the [Job Title] position at'
Prompt: 'The capital of France is', Generated text: 'Paris.\n\n2. B. C. The capital of Canada is Ottawa.\n\n3. A. C. The capital of Australia is Can'
Prompt: 'The future of AI is', Generated text: "bright, and it's not just for big companies. Small businesses can also benefit from AI technology. Here are some ways:\n\n1."
It ran through cleanly. Next, serve the model:

trtllm-serve "TinyLlama/TinyLlama-1.1B-Chat-v1.0"

Wait until it shows:
INFO: Uvicorn running on http://localhost:8000 (Press CTRL+C to quit)
/home/bobo/miniforge3/envs/trt/lib/python3.12/site-packages/pydantic/_internal/_generate_schema.py:2249: UnsupportedFieldAttributeWarning: The 'validation_alias' attribute with value 'max_tokens' was provided to the Field() function, which has no effect in the context it was used. 'validation_alias' is field-specific metadata, and can only be attached to a model field using Annotated metadata or by assignment. This may have happened because an Annotated type alias using the type statement was used, or if the Field() function was attached to a single member of a union type.
warnings.warn(
INFO: 127.0.0.1:54364 - "POST /v1/chat/completions HTTP/1.1" 200 OK
Then open another WSL bash and enter:
curl -X POST http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -H "Accept: application/json" -d '{
  "model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
  "messages": [{"role": "system", "content": "You are a helpful assistant."},
               {"role": "user", "content": "Where is New York? Tell me in a single sentence."}],
  "max_tokens": 32,
  "temperature": 0
}'
Got the reply: "New York is a city in the northeastern United States, located on the eastern coast of the state of New York."
{"id":"chatcmpl-31b02f6ab4854863909850ab9688d8b1","object":"chat.completion","created":1762436547,"model":"TinyLlama/TinyLlama-1.1B-Chat-v1.0","choices":[{"index":0,"message":{"role":"assistant","content":"New York is a city in the northeastern United States, located on the eastern coast of the state of New York.","reasoning_content":null,"tool_calls":[]},"logprobs":null,"finish_reason":"stop","stop_reason":null,"disaggregated_params":null}],"usage":{"prompt_tokens":43,"total_tokens":70,"completion_tokens":27},"prompt_token_ids":null}(trt)
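Since trtllm-serve speaks the OpenAI-compatible API, the same request can also be made from Python with the openai client. A minimal sketch (the api_key value is just a placeholder; the local server doesn't check it):

```python
# pip install openai
from openai import OpenAI

# Point the client at the local trtllm-serve endpoint.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

resp = client.chat.completions.create(
    model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Where is New York? Tell me in a single sentence."},
    ],
    max_tokens=32,
    temperature=0,
)
print(resp.choices[0].message.content)
```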
trtllm-serve "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"
This one died, spitting out a pile of advice on how to rein in GPU memory usage:
Please refer to the TensorRT LLM documentation for information on how to control the memory usage through TensorRT LLM configuration options. Possible options include:
Model: reduce max_num_tokens and/or shard the model weights across GPUs by enabling pipeline and/or tensor parallelism
Sampler: reduce max_seq_len and/or max_attention_window_size
Initial KV cache (temporary for KV cache size estimation): reduce max_num_tokens
Drafter: reduce max_seq_len and/or max_draft_len
Additional executor resources (temporary for KV cache size estimation): reduce max_num_tokens
Model resources created during usage: reduce max_num_tokens
KV cache: reduce free_gpu_memory_fraction
Additional executor resources: reduce max_num_tokens
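Those knobs are all arguments of the LLM API, so one way to experiment is to load the model from Python with tighter limits. A rough sketch (values are illustrative, and argument names follow the TensorRT-LLM LLM API docs rather than anything verified on this machine):

```python
from tensorrt_llm import LLM
from tensorrt_llm.llmapi import KvCacheConfig

# Cap how much of the free GPU memory the paged KV cache may claim,
# and keep sequence/batch limits small so activations stay modest.
llm = LLM(
    model="deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B",
    kv_cache_config=KvCacheConfig(free_gpu_memory_fraction=0.5),
    max_seq_len=2048,
    max_num_tokens=2048,
    max_batch_size=8,
)
```

trtllm-serve exposes equivalent options; trtllm-serve --help lists them.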
trtllm-serve "meta-llama/Llama-3.2-1B"
This one needs a Hugging Face access token, so I gave up on it.
Verification complete.