Skip to the content.

Natural Language Processing (NLP): Inference with BERT-Large (Uncased)

Table of Contents

Overview

BERT (Bidirectional Encoder Representations from Transformers) is a transformer model pretrained on large English text corpora using self-supervised objectives. This example shows how to run BERT-Large (uncased, Hugging Face) for sequence (binary) classification on Intel Xeon processors with AMX acceleration using bfloat16 (BF16) mixed precision.

Prerequisites

Install Required Packages

Pinned versions are shown below for reproducibility.

pip install tensorflow==2.20.0
pip install tf-keras==2.20.1 --no-deps
pip install transformers==4.49.0

Quick Start: Keras Mixed Precision

To reuse the standard FP32 (pretrained) BERT-Large model while executing layers in BF16 on AMX, enable a mixed_bfloat16 policy BEFORE creating/loading the model. This keeps model weights in FP32 for stability while executing math (matmul, attention, feed‑forward) in BF16 on AMX-capable Intel Xeon processors. Note that this approach to enable auto-mixed precision can be used for any Keras model.

import tensorflow as tf
from transformers import TFBertForSequenceClassification, BertTokenizer

tf.keras.mixed_precision.set_global_policy("mixed_bfloat16")

model_name = "bert-large-uncased"
tokenizer = BertTokenizer.from_pretrained(model_name)
model = TFBertForSequenceClassification.from_pretrained(model_name, num_labels=2)

text = "This is a great movie!"
inputs = tokenizer(text, return_tensors="tf", padding=True, truncation=True)

outputs = model(
    inputs["input_ids"],
    attention_mask=inputs["attention_mask"],
    training=False
)

logits = tf.cast(outputs.logits, tf.float32)  # ensure float32 for downstream usage
pred = tf.argmax(logits, axis=-1)

print("Logits:", logits.numpy())
print("Predicted class:", pred.numpy())

Notes:

Deploying with TensorFlow Serving (BF16 Auto Mixed Precision)

Export the Model (SavedModel, FP32 weights)

Note: We don’t need to explicitly enable bfloat16 mixed precision with Keras while exporting the model, because the --mixed_precision=bfloat16 flag passed when starting the inference server handles that automatically (see Start the Server (Enable bfloat16) below).

Create export_bert_large.py:

import tensorflow as tf
from transformers import TFBertForSequenceClassification, BertTokenizer

model_name = "bert-large-uncased"
tokenizer = BertTokenizer.from_pretrained(model_name)
model = TFBertForSequenceClassification.from_pretrained(model_name, num_labels=2)

@tf.function(input_signature=[
    tf.TensorSpec([None, None], tf.int32, name="input_ids"),
    tf.TensorSpec([None, None], tf.int32, name="attention_mask")
])
def serving_fn(input_ids, attention_mask):
    out = model(input_ids, attention_mask=attention_mask, training=False)
    logits = tf.cast(out.logits, tf.float32)
    return {"logits": tf.identity(logits, name="float32_output")}

output_model_path = "/tmp/bert_large_hf/1"  # versioned directory
tf.saved_model.save(model, output_model_path, signatures={"serving_default": serving_fn})
print("Exported to:", output_model_path)

Run:

python export_bert_large.py

Pull TensorFlow Serving

Pull the official TensorFlow Serving CPU image:

docker pull tensorflow/serving

Reference setup guide: https://github.com/tensorflow/serving?tab=readme-ov-file#set-up

Start the Server (Enable BF16)

TensorFlow Serving (CPU) currently supports bfloat16 mixed precision (fp16 not yet enabled for CPU).

docker run -t --rm \
  -p 8501:8501 \
  -v /tmp/bert_large_hf:/models/bert_large_hf \
  -e MODEL_NAME=bert_large_hf \
  -e ONEDNN_VERBOSE=1 \
  tensorflow/serving --mixed_precision=bfloat16

Sample log indicators:

2025-09-24 22:45:48.953387: I external/org_tensorflow/tensorflow/cc/saved_model/loader.cc:220] Running initialization op on SavedModel bundle at path: /models/bert_large_hf/1
I0000 00:00:1758753949.448357     905 auto_mixed_precision.cc:2335] Running auto_mixed_precision_onednn_bfloat16 graph optimizer
I0000 00:00:1758753949.450298     905 auto_mixed_precision.cc:1511] No allowlist ops found, nothing to do
2025-09-24 22:45:49.465449: I external/org_tensorflow/tensorflow/cc/saved_model/loader.cc:471] SavedModel load for tags { serve }; Status: success: OK. Took 3153824 microseconds.
2025-09-24 22:45:50.229979: I tensorflow_serving/model_servers/server.cc:428] Running gRPC ModelServer at 0.0.0.0:8500 ...
2025-09-24 22:45:50.292271: I tensorflow_serving/model_servers/server.cc:449] Exporting HTTP/REST API at:localhost:8501 ...

Troubleshooting 403:

Client Inference (REST)

Install:

pip install requests==2.32.5 numpy==2.3.3 transformers==4.49.0 tensorflow==2.20.0
pip install tf-keras==2.20.1 --no-deps

Create infer_bert_large.py:

import requests, json, numpy as np, tensorflow as tf
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-large-uncased")
text = "I love this product!"
inputs = tokenizer(
    text,
    return_tensors="tf",
    max_length=128,
    padding="max_length",
    truncation=True
)

payload = {
    "instances": [{
        "input_ids": inputs["input_ids"][0].numpy().tolist(),
        "attention_mask": inputs["attention_mask"][0].numpy().tolist()
    }]
}

resp = requests.post(
    "http://127.0.0.1:8501/v1/models/bert_large_hf:predict",
    data=json.dumps(payload),
    headers={"content-type": "application/json"},
    proxies={"http": None, "https": None}
)

if resp.status_code == 200:
    preds = np.array(resp.json()["predictions"])
    print("Inference successful!")
    logits = preds[0]
    probs = tf.nn.softmax(logits).numpy()
    print("Logits:", logits)
    print("Probabilities:", probs)
else:
    print("Error:", resp.text)

Run:

python infer_bert_large.py

Expected Logs on the Server

I0000 00:00:1758754431.951130    3797 auto_mixed_precision.cc:2335] Running auto_mixed_precision_onednn_bfloat16 graph optimizer
I0000 00:00:1758754431.978969    3797 auto_mixed_precision.cc:2263] Converted 1837/4155 nodes to bfloat16 precision using 2 cast(s) to bfloat16 (excluding Const and Variable casts)

Expected Logs on the Client

Logits and probabilities for binary classification (untrained weights will give random results):

Inference successful!
Logits: [-0.123 0.456]
Probabilities: [0.412 0.588]

Optional: Graph Freezing for Additional Performance

Freeze variables to constants for a lean inference graph (removes variable-loading overhead).

Script (public reference): https://raw.githubusercontent.com/oneapi-src/oneAPI-samples/master/AI-and-Analytics/Features-and-Functionality/IntelTensorFlow_InferenceOptimization/scripts/freeze_optimize_v2.py

Example:

python freeze_optimize_v2.py \
  --input_saved_model_dir=/tmp/bert_large_hf/1 \
  --output_saved_model_dir=/tmp/bert_large_hf_frozen/1

Run this after exporting the SavedModel (server side).

Key Validation Steps

Summary

Enabled BF16 mixed precision on Xeon with minimal code change, deployed via TensorFlow Serving, verified AMX acceleration, and optionally optimized the model by freezing the graph.