Natural Language Processing (NLP): Inference with BERT-Large (Uncased)
Table of Contents
- Overview
- Prerequisites
- Install Required Packages
- Quick Start: Keras Mixed Precision
- Deploying with TensorFlow Serving (bfloat16 Auto Mixed Precision)
- Client Inference (REST)
- Optional: Graph Freezing for Additional Performance
- Key Validation Steps
Overview
BERT (Bidirectional Encoder Representations from Transformers) is a transformer model pretrained on large English text corpora using self-supervised objectives. This example shows how to run BERT-Large (uncased, Hugging Face) for sequence (binary) classification on Intel Xeon processors with AMX acceleration using bfloat16 (BF16) mixed precision.
Prerequisites
- Intel Xeon 4th Gen (or newer) with AMX
bfloat16support - Docker (for TensorFlow Serving deployment)
- Python environment with
pip - Internet access to download model weights
Install Required Packages
Pinned versions are shown below for reproducibility.
pip install tensorflow==2.20.0
pip install tf-keras==2.20.1 --no-deps
pip install transformers==4.49.0
Quick Start: Keras Mixed Precision
To reuse the standard FP32 (pretrained) BERT-Large model while executing layers in BF16 on AMX, enable a mixed_bfloat16 policy BEFORE creating/loading the model. This keeps model weights in FP32 for stability while executing math (matmul, attention, feed‑forward) in BF16 on AMX-capable Intel Xeon processors. Note that this approach to enable auto-mixed precision can be used for any Keras model.
import tensorflow as tf
from transformers import TFBertForSequenceClassification, BertTokenizer
tf.keras.mixed_precision.set_global_policy("mixed_bfloat16")
model_name = "bert-large-uncased"
tokenizer = BertTokenizer.from_pretrained(model_name)
model = TFBertForSequenceClassification.from_pretrained(model_name, num_labels=2)
text = "This is a great movie!"
inputs = tokenizer(text, return_tensors="tf", padding=True, truncation=True)
outputs = model(
inputs["input_ids"],
attention_mask=inputs["attention_mask"],
training=False
)
logits = tf.cast(outputs.logits, tf.float32) # ensure float32 for downstream usage
pred = tf.argmax(logits, axis=-1)
print("Logits:", logits.numpy())
print("Predicted class:", pred.numpy())
Notes:
- Set
ONEDNN_VERBOSE=1to confirm AMX usage (look forbrg_matmul ... amx). - Revert to full FP32 by removing the policy or setting mixed_precision to float32.
Deploying with TensorFlow Serving (BF16 Auto Mixed Precision)
Export the Model (SavedModel, FP32 weights)
Note: We don’t need to explicitly enable
bfloat16mixed precision with Keras while exporting the model, because the--mixed_precision=bfloat16flag passed when starting the inference server handles that automatically (see Start the Server (Enable bfloat16) below).
Create export_bert_large.py:
import tensorflow as tf
from transformers import TFBertForSequenceClassification, BertTokenizer
model_name = "bert-large-uncased"
tokenizer = BertTokenizer.from_pretrained(model_name)
model = TFBertForSequenceClassification.from_pretrained(model_name, num_labels=2)
@tf.function(input_signature=[
tf.TensorSpec([None, None], tf.int32, name="input_ids"),
tf.TensorSpec([None, None], tf.int32, name="attention_mask")
])
def serving_fn(input_ids, attention_mask):
out = model(input_ids, attention_mask=attention_mask, training=False)
logits = tf.cast(out.logits, tf.float32)
return {"logits": tf.identity(logits, name="float32_output")}
output_model_path = "/tmp/bert_large_hf/1" # versioned directory
tf.saved_model.save(model, output_model_path, signatures={"serving_default": serving_fn})
print("Exported to:", output_model_path)
Run:
python export_bert_large.py
Pull TensorFlow Serving
Pull the official TensorFlow Serving CPU image:
docker pull tensorflow/serving
Reference setup guide: https://github.com/tensorflow/serving?tab=readme-ov-file#set-up
Start the Server (Enable BF16)
TensorFlow Serving (CPU) currently supports bfloat16 mixed precision (fp16 not yet enabled for CPU).
docker run -t --rm \
-p 8501:8501 \
-v /tmp/bert_large_hf:/models/bert_large_hf \
-e MODEL_NAME=bert_large_hf \
-e ONEDNN_VERBOSE=1 \
tensorflow/serving --mixed_precision=bfloat16
Sample log indicators:
auto_mixed_precision_onednn_bfloat16graph optimizerbrg_matmulwithamxandsrc_bf16/wei_bf16
2025-09-24 22:45:48.953387: I external/org_tensorflow/tensorflow/cc/saved_model/loader.cc:220] Running initialization op on SavedModel bundle at path: /models/bert_large_hf/1
I0000 00:00:1758753949.448357 905 auto_mixed_precision.cc:2335] Running auto_mixed_precision_onednn_bfloat16 graph optimizer
I0000 00:00:1758753949.450298 905 auto_mixed_precision.cc:1511] No allowlist ops found, nothing to do
2025-09-24 22:45:49.465449: I external/org_tensorflow/tensorflow/cc/saved_model/loader.cc:471] SavedModel load for tags { serve }; Status: success: OK. Took 3153824 microseconds.
2025-09-24 22:45:50.229979: I tensorflow_serving/model_servers/server.cc:428] Running gRPC ModelServer at 0.0.0.0:8500 ...
2025-09-24 22:45:50.292271: I tensorflow_serving/model_servers/server.cc:449] Exporting HTTP/REST API at:localhost:8501 ...
Troubleshooting 403:
- Ensure URL model name matches MODEL_NAME.
- Check container logs: docker logs
. - Disable proxies: export no_proxy=localhost,127.0.0.1.
Client Inference (REST)
Install:
pip install requests==2.32.5 numpy==2.3.3 transformers==4.49.0 tensorflow==2.20.0
pip install tf-keras==2.20.1 --no-deps
Create infer_bert_large.py:
import requests, json, numpy as np, tensorflow as tf
from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained("bert-large-uncased")
text = "I love this product!"
inputs = tokenizer(
text,
return_tensors="tf",
max_length=128,
padding="max_length",
truncation=True
)
payload = {
"instances": [{
"input_ids": inputs["input_ids"][0].numpy().tolist(),
"attention_mask": inputs["attention_mask"][0].numpy().tolist()
}]
}
resp = requests.post(
"http://127.0.0.1:8501/v1/models/bert_large_hf:predict",
data=json.dumps(payload),
headers={"content-type": "application/json"},
proxies={"http": None, "https": None}
)
if resp.status_code == 200:
preds = np.array(resp.json()["predictions"])
print("Inference successful!")
logits = preds[0]
probs = tf.nn.softmax(logits).numpy()
print("Logits:", logits)
print("Probabilities:", probs)
else:
print("Error:", resp.text)
Run:
python infer_bert_large.py
Expected Logs on the Server
I0000 00:00:1758754431.951130 3797 auto_mixed_precision.cc:2335] Running auto_mixed_precision_onednn_bfloat16 graph optimizer
I0000 00:00:1758754431.978969 3797 auto_mixed_precision.cc:2263] Converted 1837/4155 nodes to bfloat16 precision using 2 cast(s) to bfloat16 (excluding Const and Variable casts)
Expected Logs on the Client
Logits and probabilities for binary classification (untrained weights will give random results):
Inference successful!
Logits: [-0.123 0.456]
Probabilities: [0.412 0.588]
Optional: Graph Freezing for Additional Performance
Freeze variables to constants for a lean inference graph (removes variable-loading overhead).
Script (public reference): https://raw.githubusercontent.com/oneapi-src/oneAPI-samples/master/AI-and-Analytics/Features-and-Functionality/IntelTensorFlow_InferenceOptimization/scripts/freeze_optimize_v2.py
Example:
python freeze_optimize_v2.py \
--input_saved_model_dir=/tmp/bert_large_hf/1 \
--output_saved_model_dir=/tmp/bert_large_hf_frozen/1
Run this after exporting the SavedModel (server side).
Key Validation Steps
- Functional: REST returns logits JSON
- Precision: Logs show
auto_mixed_precision_onednn_bfloat16 - AMX:
ONEDNN_VERBOSElines includeamxandbf16datatypes - Rollback: Remove
--mixed_precisionflag; delete policy in Keras path
Summary
Enabled BF16 mixed precision on Xeon with minimal code change, deployed via TensorFlow Serving, verified AMX acceleration, and optionally optimized the model by freezing the graph.