
OpenVINO Local Pipelines

OpenVINO™ is an open-source toolkit for optimizing and deploying AI inference. The OpenVINO™ Runtime can run inference on a range of hardware devices and helps boost deep learning performance in computer vision, automatic speech recognition, natural language processing, and other common tasks.

OpenVINO models can be run locally through the HuggingFacePipeline class. To deploy a model with OpenVINO, specify backend="openvino" to select OpenVINO as the backend inference framework.

To use it, you should have the optimum-intel Python package with OpenVINO support installed.

%pip install --upgrade-strategy eager "optimum[openvino,nncf]" --quiet

Model Loading

Models can be loaded by specifying the model parameters using the from_model_id method.

If you have an Intel GPU, you can specify model_kwargs={"device": "GPU"} to run inference on it.

from langchain_community.llms.huggingface_pipeline import HuggingFacePipeline

ov_config = {"PERFORMANCE_HINT": "LATENCY", "NUM_STREAMS": "1", "CACHE_DIR": ""}

ov_llm = HuggingFacePipeline.from_model_id(
    model_id="gpt2",
    task="text-generation",
    backend="openvino",
    model_kwargs={"device": "CPU", "ov_config": ov_config},
    pipeline_kwargs={"max_new_tokens": 10},
)
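
If you want to target an Intel GPU instead, a minimal sketch of the same call looks like the following (this assumes the OpenVINO GPU plugin and drivers are available on your machine):

# Sketch: same loading call as above, but running inference on an Intel GPU.
ov_llm_gpu = HuggingFacePipeline.from_model_id(
    model_id="gpt2",
    task="text-generation",
    backend="openvino",
    model_kwargs={"device": "GPU", "ov_config": ov_config},
    pipeline_kwargs={"max_new_tokens": 10},
)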

They can also be loaded by passing in an existing optimum-intel pipeline directly:

from optimum.intel.openvino import OVModelForCausalLM
from transformers import AutoTokenizer, pipeline

model_id = "gpt2"
device = "CPU"
tokenizer = AutoTokenizer.from_pretrained(model_id)
ov_model = OVModelForCausalLM.from_pretrained(
    model_id, device=device, ov_config=ov_config
)
ov_pipe = pipeline(
    "text-generation", model=ov_model, tokenizer=tokenizer, max_new_tokens=10
)
hf = HuggingFacePipeline(pipeline=ov_pipe)
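
As a quick check, the wrapped pipeline can be invoked directly. This is a minimal sketch; the prompt text is just an example:

# Sketch: call the wrapped pipeline with a sample prompt.
print(hf.invoke("What is electroencephalography?"))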

Create Chain

With the model loaded into memory, you can compose it with a prompt to form a chain.

from langchain.prompts import PromptTemplate

template = """Question: {question}

Answer: Let's think step by step."""
prompt = PromptTemplate.from_template(template)

chain = prompt | ov_llm

question = "What is electroencephalography?"

print(chain.invoke({"question": question}))

Inference with local OpenVINO model

It is possible to export your model to the OpenVINO IR format with the CLI and load the model from a local folder.

!optimum-cli export openvino --model gpt2 ov_model_dir

It is recommended to apply 8-bit or 4-bit weight quantization to reduce inference latency and model footprint, using --weight-format:

!optimum-cli export openvino --model gpt2 --weight-format int8 ov_model_dir  # for 8-bit quantization

!optimum-cli export openvino --model gpt2 --weight-format int4 ov_model_dir  # for 4-bit quantization

ov_llm = HuggingFacePipeline.from_model_id(
    model_id="ov_model_dir",
    task="text-generation",
    backend="openvino",
    model_kwargs={"device": "CPU", "ov_config": ov_config},
    pipeline_kwargs={"max_new_tokens": 10},
)

ov_chain = prompt | ov_llm

question = "What is electroencephalography?"

print(ov_chain.invoke({"question": question}))

You can get an additional inference speed improvement with dynamic quantization of activations and KV-cache quantization. These options can be enabled with ov_config as follows:

ov_config = {
    "KV_CACHE_PRECISION": "u8",
    "DYNAMIC_QUANTIZATION_GROUP_SIZE": "32",
    "PERFORMANCE_HINT": "LATENCY",
    "NUM_STREAMS": "1",
    "CACHE_DIR": "",
}
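
For the updated settings to take effect, pass this ov_config when loading the model, just as before. A minimal sketch, reusing the local ov_model_dir export from above:

# Sketch: reload the locally exported model with the quantization-enabled ov_config.
ov_llm = HuggingFacePipeline.from_model_id(
    model_id="ov_model_dir",
    task="text-generation",
    backend="openvino",
    model_kwargs={"device": "CPU", "ov_config": ov_config},
    pipeline_kwargs={"max_new_tokens": 10},
)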

For more information, refer to the OpenVINO LLM guide.

