SOTA low-bit LLM quantization (INT8/FP8/INT4/FP4/NF4) & sparsity; leading model compression techniques on TensorFlow, PyTorch, and ONNX Runtime
An innovative library for efficient LLM inference via low-bit quantization
A simple pipeline for INT8 quantization based on TensorRT.
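A minimal sketch of such a pipeline, assuming the TensorRT Python API; `model.onnx` is a hypothetical input file, and the calibrator assignment is a placeholder (see the calibrator sketch further below):

```python
# Sketch: build an INT8 TensorRT engine from an ONNX model.
# Assumes tensorrt is installed; "model.onnx" is a hypothetical path.
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)

with open("model.onnx", "rb") as f:
    if not parser.parse(f.read()):
        raise RuntimeError(parser.get_error(0))

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.INT8)
# config.int8_calibrator = MyCalibrator(...)  # an IInt8EntropyCalibrator2
#                                             # instance is required here

engine_bytes = builder.build_serialized_network(network, config)
with open("model_int8.engine", "wb") as f:
    f.write(engine_bytes)
```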
👀 Apply YOLOv8, exported to ONNX or TensorRT (FP16, INT8), to a real-time camera feed
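A minimal sketch of the ONNX path, assuming the ultralytics and opencv-python packages; "yolov8n.pt" is the standard nano checkpoint, downloaded automatically if missing:

```python
# Sketch: export YOLOv8 to ONNX and run it on a webcam stream.
import cv2
from ultralytics import YOLO

YOLO("yolov8n.pt").export(format="onnx")   # writes yolov8n.onnx
model = YOLO("yolov8n.onnx")               # ultralytics runs ONNX directly

cap = cv2.VideoCapture(0)                  # default camera
while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break
    results = model(frame)                 # detect on one frame
    cv2.imshow("yolov8", results[0].plot())  # draw boxes and labels
    if cv2.waitKey(1) & 0xFF == ord("q"):  # quit on 'q'
        break
cap.release()
cv2.destroyAllWindows()
```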
NCNN + INT8 + YOLOv4 model quantization and real-time inference
#Large Language Model# A LLaMA2-7B chatbot with memory running on CPU, optimized using smooth quantization, 4-bit quantization, or Intel® Extension for PyTorch with bfloat16.
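A minimal sketch of the bfloat16 path, assuming transformers and intel_extension_for_pytorch are installed; the model name "meta-llama/Llama-2-7b-chat-hf" requires access approval on Hugging Face:

```python
# Sketch: CPU inference with Intel Extension for PyTorch in bfloat16.
import torch
import intel_extension_for_pytorch as ipex
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "meta-llama/Llama-2-7b-chat-hf"    # gated model on Hugging Face
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.bfloat16)
model = ipex.optimize(model.eval(), dtype=torch.bfloat16)  # CPU kernel opts

inputs = tok("Hello, how are you?", return_tensors="pt")
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=64)
print(tok.decode(out[0], skip_special_tokens=True))
```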
INT8 calibrator for an ONNX model with dynamic batch_size at the input and an NMS module at the output. C++ implementation.
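The repository itself is C++; for consistency with the other sketches here, this is the Python analogue of a TensorRT entropy calibrator. It assumes tensorrt and pycuda are installed, and `batches` is a hypothetical iterable of float32 numpy arrays:

```python
# Sketch: TensorRT INT8 entropy calibrator feeding batches from host memory.
import tensorrt as trt
import pycuda.driver as cuda
import pycuda.autoinit  # noqa: F401  (creates a CUDA context)

class EntropyCalibrator(trt.IInt8EntropyCalibrator2):
    def __init__(self, batches, batch_size, cache_file="calib.cache"):
        super().__init__()
        self.batches = iter(batches)    # hypothetical calibration data
        self.batch_size = batch_size
        self.cache_file = cache_file
        self.device_mem = None

    def get_batch_size(self):
        return self.batch_size

    def get_batch(self, names):
        try:
            batch = next(self.batches)  # next calibration batch
        except StopIteration:
            return None                 # no more data: calibration ends
        if self.device_mem is None:
            self.device_mem = cuda.mem_alloc(batch.nbytes)
        cuda.memcpy_htod(self.device_mem, batch)
        return [int(self.device_mem)]

    def read_calibration_cache(self):
        try:
            with open(self.cache_file, "rb") as f:
                return f.read()         # reuse a previous calibration run
        except FileNotFoundError:
            return None

    def write_calibration_cache(self, cache):
        with open(self.cache_file, "wb") as f:
            f.write(cache)
```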
#Large Language Model# LLM-Lora-PEFT_accumulate explores optimizations for Large Language Models (LLMs) using PEFT, LoRA, and QLoRA. Contribute experiments and implementations to enhance LLM efficiency. Join discussions and...
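A minimal sketch of attaching a LoRA adapter with the peft library; it assumes peft and transformers are installed and uses GPT-2 purely for illustration:

```python
# Sketch: wrap a base model with a LoRA adapter via peft.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2")
config = LoraConfig(
    r=8,                        # low-rank dimension
    lora_alpha=16,              # scaling factor
    target_modules=["c_attn"],  # GPT-2's fused attention projection
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, config)
model.print_trainable_parameters()  # only adapter weights are trainable
```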
Quantization examples for PTQ (post-training quantization) and QAT (quantization-aware training)
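A minimal eager-mode sketch of both flows with torch.ao.quantization, assuming a toy model; random tensors stand in for a real calibration loader:

```python
# Sketch: PTQ and QAT with PyTorch eager-mode quantization.
import torch
import torch.nn as nn
import torch.ao.quantization as tq

class TinyNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.quant = tq.QuantStub()      # float -> int8 boundary
        self.fc = nn.Linear(16, 4)
        self.dequant = tq.DeQuantStub()  # int8 -> float boundary

    def forward(self, x):
        return self.dequant(self.fc(self.quant(x)))

# --- PTQ: observe activations on calibration data, then convert ---
m = TinyNet().eval()
m.qconfig = tq.get_default_qconfig("fbgemm")
tq.prepare(m, inplace=True)
for _ in range(8):                       # stand-in for a calibration loader
    m(torch.randn(1, 16))
ptq_model = tq.convert(m)

# --- QAT: train with fake-quant inserted, then convert ---
m = TinyNet().train()
m.qconfig = tq.get_default_qat_qconfig("fbgemm")
tq.prepare_qat(m, inplace=True)
# ... fine-tune m here ...
qat_model = tq.convert(m.eval())
```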
VB.NET API wrapper for the chatllm.cpp LLM-inference library