Customized Operators

The public API for the extended XPU operators is provided through the itex.ops namespace. These extended operators deliver better performance than the corresponding stock public APIs.

itex.ops.AdamWithWeightDecayOptimizer

This optimizer implements the Adam algorithm with weight decay.

itex.ops.AdamWithWeightDecayOptimizer(
    weight_decay=0.001, learning_rate=0.001, beta_1=0.9, beta_2=0.999,
    epsilon=1e-07, name='Adam', **kwargs
)

This is an implementation of the AdamW optimizer described in “Decoupled Weight Decay Regularization” by Loshchilov & Hutter (pdf). This Python API itex.ops.AdamWithWeightDecayOptimizer replaces tfa.optimizers.AdamW.

For example:

import tensorflow as tf
import intel_extension_for_tensorflow as itex

step = tf.Variable(0, trainable=False)
schedule = tf.optimizers.schedules.PiecewiseConstantDecay(
    [10000, 15000], [1e-0, 1e-1, 1e-2])
# lr and wd can be a function or a tensor
lr = 1e-1 * schedule(step)
wd = lambda: 1e-4 * schedule(step)

# ...

optimizer = itex.ops.AdamWithWeightDecayOptimizer(learning_rate=lr, weight_decay=wd)
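
As with the tfa.optimizers.AdamW it replaces, the optimizer can also be passed directly to model.compile. A minimal sketch, with a hypothetical toy model as a placeholder:

import tensorflow as tf
import intel_extension_for_tensorflow as itex

# Hypothetical placeholder model for illustration only.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(20,)),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(10),
])

optimizer = itex.ops.AdamWithWeightDecayOptimizer(
    learning_rate=1e-3, weight_decay=1e-4)

# Compile and train as with any tf.keras optimizer.
model.compile(
    optimizer=optimizer,
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=['accuracy'])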

itex.ops.LAMBOptimizer

This optimizer implements the LAMB algorithm.

itex.ops.LAMBOptimizer(
    learning_rate=0.001, beta_1=0.9, beta_2=0.999,
    epsilon=1e-06, weight_decay=0.001, name='LAMB', **kwargs
)

This is an implementation of the LAMB optimizer described in “Large Batch Optimization for Deep Learning: Training BERT in 76 minutes” (pdf). This Python API itex.ops.LAMBOptimizer replaces tfa.optimizers.LAMB.

For example:

import tensorflow as tf
import intel_extension_for_tensorflow as itex

step = tf.Variable(0, trainable=False)
schedule = tf.optimizers.schedules.PiecewiseConstantDecay(
    [10000, 15000], [1e-0, 1e-1, 1e-2])
# lr and wd can be a function or a tensor
lr = 1e-1 * schedule(step)
wd = lambda: 1e-4 * schedule(step)

# ...

optimizer = itex.ops.LAMBOptimizer(learning_rate=lr, weight_decay=wd)
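
The optimizer also follows the usual tf.keras optimizer interface (as does the tfa.optimizers.LAMB it replaces), so it can be used with apply_gradients in a custom training step. A minimal sketch with placeholder variables and a placeholder loss:

import tensorflow as tf
import intel_extension_for_tensorflow as itex

optimizer = itex.ops.LAMBOptimizer(learning_rate=1e-3, weight_decay=1e-2)

# Hypothetical toy regression problem.
w = tf.Variable(tf.random.normal([8, 1]))
x = tf.random.normal([64, 8])
y = tf.random.normal([64, 1])

with tf.GradientTape() as tape:
    loss = tf.reduce_mean(tf.square(tf.matmul(x, w) - y))
grads = tape.gradient(loss, [w])
optimizer.apply_gradients(zip(grads, [w]))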

itex.ops.LayerNormalization

Layer normalization layer (Ba et al., 2016).

itex.ops.LayerNormalization(
    axis=-1, epsilon=0.001, center=True, scale=True,
    beta_initializer='zeros', gamma_initializer='ones',
    beta_regularizer=None, gamma_regularizer=None, beta_constraint=None,
    gamma_constraint=None, **kwargs
)

Normalizes the activations of the previous layer for each example in a batch independently, rather than across the batch as in Batch Normalization. It applies a transformation that keeps the mean activation within each example close to 0 and the activation standard deviation close to 1. This Python API itex.ops.LayerNormalization replaces tf.keras.layers.LayerNormalization.

For example:

>>> import numpy as np
>>> import tensorflow as tf
>>> import intel_extension_for_tensorflow as itex
>>> data = tf.constant(np.arange(10).reshape(5, 2) * 10, dtype=tf.float32)
>>> layer = itex.ops.LayerNormalization(axis=1)
>>> output = layer(data, training=False)
>>> print(output)
tf.Tensor(
[[-0.99998  0.99998]
 [-0.99998  0.99998]
 [-0.99998  0.99998]
 [-0.99998  0.99998]
 [-0.99998  0.99998]], shape=(5, 2), dtype=float32)
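
As a quick check of the behavior described above, the per-example mean of the normalized output is close to 0 and the per-example standard deviation close to 1. A minimal sketch:

import numpy as np
import tensorflow as tf
import intel_extension_for_tensorflow as itex

data = tf.constant(np.arange(10).reshape(5, 2) * 10, dtype=tf.float32)
output = itex.ops.LayerNormalization(axis=1)(data, training=False)

# Each row (example) is normalized independently along axis=1, so the
# per-row mean is ~0 and the per-row standard deviation is ~1.
print(tf.reduce_mean(output, axis=1))      # ~[0, 0, 0, 0, 0]
print(tf.math.reduce_std(output, axis=1))  # ~[1, 1, 1, 1, 1]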

itex.ops.GroupNormalization

Group normalization layer (Wu & He, 2018).

itex.ops.GroupNormalization(
        groups=32,
        axis=-1,
        epsilon=1e-3,
        center=True,
        scale=True,
        beta_initializer="zeros",
        gamma_initializer="ones",
        beta_regularizer=None,
        gamma_regularizer=None,
        beta_constraint=None,
        gamma_constraint=None,
        **kwargs
)

Group Normalization divides the channels into groups and computes the mean and variance within each group for normalization. Empirically, its accuracy is more stable than batch normalization across a wide range of small batch sizes, provided the learning rate is adjusted linearly with the batch size. This Python API itex.ops.GroupNormalization replaces tf.keras.layers.GroupNormalization. Note that ITEX provides a faster GPU implementation for 4D input with axis=-1; all other cases use the same implementation as the original Keras layer.

For example:

>>> import tensorflow as tf
>>> import intel_extension_for_tensorflow as itex
>>> data = tf.random.normal((1, 8, 8, 32))
>>> layer = itex.ops.GroupNormalization(axis=-1)
>>> output = layer(data)
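
The groups argument controls how the channel axis is partitioned; it must evenly divide the number of channels. A minimal sketch with a non-default group count (the shapes here are hypothetical):

import tensorflow as tf
import intel_extension_for_tensorflow as itex

# 4D NHWC input with 32 channels on the last axis, which is the case the
# ITEX GPU kernel accelerates (axis=-1).
data = tf.random.normal((1, 8, 8, 32))

# groups must evenly divide the channel count; here each group of
# 32 / 8 = 4 channels is normalized with its own mean and variance.
layer = itex.ops.GroupNormalization(groups=8, axis=-1)
output = layer(data)
print(output.shape)  # (1, 8, 8, 32)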

itex.ops.gelu

Applies the Gaussian error linear unit (GELU) activation function.

itex.ops.gelu(
    features, approximate=False, name=None
)

The Gaussian error linear unit (GELU) computes x * P(X <= x), where P(X) ~ N(0, 1). The GELU nonlinearity weights inputs by their value, rather than gating inputs by their sign as in ReLU. This Python API itex.ops.gelu replaces tf.nn.gelu.

For example:

>>> import tensorflow as tf
>>> import intel_extension_for_tensorflow as itex
>>> x = tf.constant([-3.0, -1.0, 0.0, 1.0, 3.0], dtype=tf.float32)
>>> y = itex.ops.gelu(x)
>>> y.numpy()
array([-0.00404969, -0.15865526,  0.        ,  0.8413447 ,  2.9959502 ],
      dtype=float32)
>>> y = itex.ops.gelu(x, approximate=True)
>>> y.numpy()
array([-0.00363725, -0.158808  ,  0.        ,  0.841192  ,  2.9963627 ],
      dtype=float32)
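
For reference, the exact GELU can be written in closed form as x * 0.5 * (1 + erf(x / sqrt(2))). The sketch below recomputes the values above from that definition; it illustrates what the operator evaluates, not its internal kernel:

import math
import tensorflow as tf
import intel_extension_for_tensorflow as itex

x = tf.constant([-3.0, -1.0, 0.0, 1.0, 3.0], dtype=tf.float32)

# Exact GELU: x * P(X <= x) with X ~ N(0, 1), i.e. x times the Gaussian CDF.
manual = x * 0.5 * (1.0 + tf.math.erf(x / math.sqrt(2.0)))
print(manual.numpy())            # matches itex.ops.gelu(x)
print(itex.ops.gelu(x).numpy())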

itex.ops.ItexLSTM

Long Short-Term Memory layer (first proposed in Hochreiter & Schmidhuber, 1997). This Python API itex.ops.ItexLSTM is semantically the same as tf.keras.layers.LSTM.

itex.ops.ItexLSTM(
    200, activation='tanh',
    recurrent_activation='sigmoid',
    use_bias=True,
    kernel_initializer='glorot_uniform',
    recurrent_initializer='orthogonal',
    bias_initializer='zeros', **kwargs
)

Based on the available runtime hardware and constraints, this layer chooses between different implementations (ITEX-based or a fallback to stock TensorFlow) to maximize performance.
If a GPU is available and all arguments to the layer meet the requirements of the ITEX kernel (see below for details), the layer uses the fast Intel® Extension for TensorFlow* implementation. The requirements for the ITEX implementation are:

  1. activation == tanh

  2. recurrent_activation == sigmoid

  3. use_bias is True

  4. Inputs, if masking is used, are strictly right-padded.

  5. Eager execution is enabled in the outermost context.

For example:

>>> import tensorflow as tf
>>> import intel_extension_for_tensorflow as itex
>>> inputs = tf.random.normal([32, 10, 8])
>>> lstm = itex.ops.ItexLSTM(4)
>>> output = lstm(inputs)
>>> print(output.shape)
(32, 4)
>>> lstm = itex.ops.ItexLSTM(4, return_sequences=True, return_state=True)
>>> whole_seq_output, final_memory_state, final_carry_state = lstm(inputs)
>>> print(whole_seq_output.shape)
(32, 10, 4)
>>> print(final_memory_state.shape)
(32, 4)
>>> print(final_carry_state.shape)
(32, 4)
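
Since itex.ops.ItexLSTM is semantically the same as tf.keras.layers.LSTM, it can also be dropped into a Keras model like any recurrent layer. A minimal sketch with a hypothetical sequence-classification model:

import tensorflow as tf
import intel_extension_for_tensorflow as itex

# Sequence classifier sketch: batches of 10 timesteps with 8 features each.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(10, 8)),
    # Default tanh/sigmoid activations and use_bias=True satisfy the
    # requirements listed above for the fast ITEX path.
    itex.ops.ItexLSTM(32),
    tf.keras.layers.Dense(1, activation='sigmoid'),
])
model.compile(optimizer='adam', loss='binary_crossentropy')
model.summary()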