OpenVINO Local Pipelines
OpenVINO™ is an open-source toolkit for optimizing and deploying AI inference. The OpenVINO™ Runtime can run inference on various hardware devices and can help boost deep learning performance in computer vision, automatic speech recognition, natural language processing, and other common tasks.
OpenVINO models can be run locally through the HuggingFacePipeline class. To deploy a model with OpenVINO, you can specify the backend="openvino" parameter to trigger OpenVINO as the backend inference framework.
To use it, you should have the optimum-intel Python package with OpenVINO accelerator support installed:
%pip install --upgrade-strategy eager "optimum[openvino,nncf]" --quiet
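You can quickly confirm that optimum-intel with OpenVINO support is available by importing the model class it provides (a minimal sanity check):
# Raises ImportError if optimum-intel with OpenVINO support is missing.
from optimum.intel.openvino import OVModelForCausalLM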
Model Loading
Models can be loaded by specifying the model parameters using the from_model_id method. If you have an Intel GPU, you can specify model_kwargs={"device": "GPU"} to run inference on it, as shown in the second example below.
from langchain_community.llms.huggingface_pipeline import HuggingFacePipeline
ov_config = {"PERFORMANCE_HINT": "LATENCY", "NUM_STREAMS": "1", "CACHE_DIR": ""}
ov_llm = HuggingFacePipeline.from_model_id(
    model_id="gpt2",
    task="text-generation",
    backend="openvino",
    model_kwargs={"device": "CPU", "ov_config": ov_config},
    pipeline_kwargs={"max_new_tokens": 10},
)
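For example, the same call can target an Intel GPU by changing the device string (a minimal sketch; assumes the OpenVINO GPU plugin and Intel GPU drivers are installed):
# Identical to the call above, but runs inference on an Intel GPU.
ov_llm_gpu = HuggingFacePipeline.from_model_id(
    model_id="gpt2",
    task="text-generation",
    backend="openvino",
    model_kwargs={"device": "GPU", "ov_config": ov_config},
    pipeline_kwargs={"max_new_tokens": 10},
)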
They can also be loaded by passing in an existing optimum-intel pipeline directly:
from optimum.intel.openvino import OVModelForCausalLM
from transformers import AutoTokenizer, pipeline
model_id = "gpt2"
device = "CPU"
tokenizer = AutoTokenizer.from_pretrained(model_id)
ov_model = OVModelForCausalLM.from_pretrained(
    model_id, device=device, ov_config=ov_config
)
ov_pipe = pipeline(
    "text-generation", model=ov_model, tokenizer=tokenizer, max_new_tokens=10
)
hf = HuggingFacePipeline(pipeline=ov_pipe)
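The wrapped pipeline can then be invoked directly like any other LangChain LLM (a minimal sketch; the prompt text is arbitrary):
# Generate a short completion from the OpenVINO-backed pipeline.
print(hf.invoke("Deep learning is"))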
Create Chain
With the model loaded into memory, you can compose it with a prompt to form a chain.
from langchain.prompts import PromptTemplate
template = """Question: {question}
Answer: Let's think step by step."""
prompt = PromptTemplate.from_template(template)
chain = prompt | ov_llm
question = "What is electroencephalography?"
print(chain.invoke({"question": question}))
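Since the chain is a standard LCEL Runnable, it also supports batching several inputs in one call (a minimal sketch with illustrative questions):
# Run several questions through the same chain in a single call.
questions = [
    {"question": "What is electroencephalography?"},
    {"question": "What is a convolutional neural network?"},
]
for output in chain.batch(questions):
    print(output)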
Inference with local OpenVINO model
It is possible to export your model to the OpenVINO IR format with the CLI and load the model from a local folder.
!optimum-cli export openvino --model gpt2 ov_model_dir
It is recommended to apply 8-bit or 4-bit weight quantization to reduce inference latency and model footprint using --weight-format:
!optimum-cli export openvino --model gpt2 --weight-format int8 ov_model_dir # for 8-bit quantization
!optimum-cli export openvino --model gpt2 --weight-format int4 ov_model_dir # for 4-bit quantization
ov_llm = HuggingFacePipeline.from_model_id(
    model_id="ov_model_dir",
    task="text-generation",
    backend="openvino",
    model_kwargs={"device": "CPU", "ov_config": ov_config},
    pipeline_kwargs={"max_new_tokens": 10},
)
ov_chain = prompt | ov_llm
question = "What is electroencephalography?"
print(ov_chain.invoke({"question": question}))
You can get additional inference speed improvements with Dynamic Quantization of activations and KV-cache quantization. These options can be enabled with ov_config as follows:
ov_config = {
    "KV_CACHE_PRECISION": "u8",
    "DYNAMIC_QUANTIZATION_GROUP_SIZE": "32",
    "PERFORMANCE_HINT": "LATENCY",
    "NUM_STREAMS": "1",
    "CACHE_DIR": "",
}
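The updated dictionary is then passed through model_kwargs exactly as before (a minimal sketch reusing the local ov_model_dir exported above):
# Reload the model with dynamic quantization and KV-cache quantization enabled.
ov_llm = HuggingFacePipeline.from_model_id(
    model_id="ov_model_dir",
    task="text-generation",
    backend="openvino",
    model_kwargs={"device": "CPU", "ov_config": ov_config},
    pipeline_kwargs={"max_new_tokens": 10},
)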
For more information, refer to the OpenVINO LLM guide.