Container Affinity and Anti-Affinity

Introduction

Some policies allow the user to give hints about how particular containers should be co-located within a node. In particular, these hints express whether containers should be located ‘close’ to each other or ‘far away’ from each other, in a hardware topology sense.

Since these hints are always interpreted by a particular policy implementation, the exact definitions of ‘close’ and ‘far’ are also somewhat policy-specific. However, as a general rule of thumb, containers running

on CPUs within the same NUMA nodes are considered ‘close’ to each other,
on CPUs within different NUMA nodes in the same socket are ‘farther’, and
on CPUs within different sockets are ‘far’ from each other.
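
The rule of thumb above can be sketched in a few lines of Python. This is only an illustration: CpuLocation and proximity are hypothetical names, not part of any policy, and the sketch assumes that a NUMA node ID identifies a node uniquely within its socket. Real policies are free to define distance differently.

```python
from typing import NamedTuple


class CpuLocation(NamedTuple):
    """Hypothetical description of where a CPU sits in the hardware topology."""
    socket: int
    numa_node: int


def proximity(a: CpuLocation, b: CpuLocation) -> str:
    """Classify two CPU locations using the rule of thumb from the text."""
    if a.socket != b.socket:
        return "far"        # different sockets
    if a.numa_node != b.numa_node:
        return "farther"    # same socket, different NUMA nodes
    return "close"          # same NUMA node
```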

These hints are expressed by container affinity annotations on the Pod. There are two types of affinities:

affinity (or positive affinity): causes affected containers to pull each other closer
anti-affinity (or negative affinity): causes affected containers to push each other further away

Policies try to place a container

close to those the container has affinity towards
far from those the container has anti-affinity towards.

Affinity Annotation Syntax

Affinities are defined with the cri-resource-manager.intel.com/affinity annotation and anti-affinities with the cri-resource-manager.intel.com/anti-affinity annotation. Both are specified in the metadata section of the Pod YAML, under annotations, as a dictionary in which each key is the name of the container within the Pod to which the annotation belongs.

metadata:
  annotations:
    cri-resource-manager.intel.com/affinity: |
      container1:
        - scope:
            key: key-ref
            operator: op
            values:
            - value1
            ...
            - valueN
          match:
            key: key-ref
            operator: op
            values:
            - value1
            ...
            - valueN
          weight: w

An anti-affinity is defined similarly but using cri-resource-manager.intel.com/anti-affinity as the annotation key.

metadata:
  annotations:
    cri-resource-manager.intel.com/anti-affinity: |
      container1:
        - scope:
            key: key-ref
            operator: op
            values:
            - value1
            ...
            - valueN
          match:
            key: key-ref
            operator: op
            values:
            - value1
            ...
            - valueN
          weight: w

Affinity Semantics

An affinity consists of three parts:

scope expression: defines which containers this affinity is evaluated against
match expression: defines which containers (within the scope) the affinity applies to
weight: defines how strong a pull or a push the affinity causes

Affinities are also sometimes referred to as positive affinities while anti-affinities are referred to as negative affinities. The reason for this is that the only difference between the two is that affinities have a positive weight while anti-affinities have a negative weight.

The scope of an affinity defines the bounding set of containers the affinity can apply to. The match expression is evaluated against the containers in scope and selects the containers the affinity actually has an effect on. The weight specifies whether the effect is a pull or a push: positive weights cause a pull, negative weights cause a push. The weight also specifies how strong the pull or push is. This is useful when the policy has to compromise because an optimal placement is not possible. The weight then acts as a way to express preferences or priorities among the possible compromises: the heavier the weight, the stronger the pull or push and the higher the probability that it will be honored, if this is possible at all.

The scope can be omitted from an affinity, in which case it implies Pod scope: the scope of all containers that belong to the same Pod as the container for which the affinity is defined.

The weight can also be omitted in which case it defaults to -1 for anti-affinities and +1 for affinities. Weights are currently limited to the range [-1000,1000].

Both the affinity scope and the match expression select containers, so they are syntactically identical: both are expressions. An expression consists of three parts:

key: specifies what metadata to pick from a container for evaluation
operation (op): specifies what logical operation the expression evaluates
values: a set of strings to evaluate the value of the key against

The supported keys are:

for pods:
- name
- namespace
- qosclass
- labels/<label-key>
- id
- uid
for containers:
- pod/<pod-key>
- name
- namespace
- qosclass
- labels/<label-key>
- tags/<tag-key>
- id

Essentially, an expression defines a logical operation of the form (key op values). Evaluating this expression takes the value of the key and applies the operation to it and the given values, producing a boolean true/false result. Currently the following operations are supported:

Equals: equality, true if the value of key equals the single item in values
NotEqual: inequality, true if the value of key is not equal to the single item in values
In: membership, true if the value of key equals any among values
NotIn: negated membership, true if the value of key is not equal to any among values
Exists: true if the given key exists with any value
NotExists: true if the given key does not exist
AlwaysTrue: always evaluates to true, can be used to denote node-global scope (all containers)
Matches: true if the value of key matches the globbing pattern in values
MatchesNot: true if the value of key does not match the globbing pattern in values
MatchesAny: true if the value of key matches any of the globbing patterns in values
MatchesNone: true if the value of key does not match any of the globbing patterns in values
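
The operator list above can be read as a small evaluator. The following Python sketch is one possible reading, not the actual implementation. It assumes a container is represented as a flat dictionary from key to string value (a hypothetical simplification), that a missing key compares as the empty string, that single-value operators reject any other number of values, and that Python's fnmatch globbing is close enough to the globbing dialect the policy uses.

```python
from fnmatch import fnmatchcase

# Operators documented as comparing against "the single item in values".
SINGLE_VALUE_OPS = {"Equals", "NotEqual", "Matches", "MatchesNot"}


def evaluate(expr, container):
    """Evaluate (key op values) against a flat {key: value} container dict."""
    op = expr["operator"]
    values = expr.get("values", [])
    key = expr.get("key", "")
    exists = key in container
    value = container.get(key, "")  # assumption: missing key acts as ""

    if op in SINGLE_VALUE_OPS and len(values) != 1:
        raise ValueError(f"{op} expects exactly one value")  # assumption

    if op == "AlwaysTrue":
        return True
    if op == "Exists":
        return exists
    if op == "NotExists":
        return not exists
    if op == "Equals":
        return value == values[0]
    if op == "NotEqual":
        return value != values[0]
    if op == "In":
        return value in values
    if op == "NotIn":
        return value not in values
    if op == "Matches":
        return fnmatchcase(value, values[0])
    if op == "MatchesNot":
        return not fnmatchcase(value, values[0])
    if op == "MatchesAny":
        return any(fnmatchcase(value, pattern) for pattern in values)
    if op == "MatchesNone":
        return not any(fnmatchcase(value, pattern) for pattern in values)
    raise ValueError(f"unknown operator: {op}")
```

For example, with a container described as {"name": "sheep"}, the expression {"key": "name", "operator": "Equals", "values": ["sheep"]} evaluates to true in this sketch.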

The effective affinity between containers C_1 and C_2, A(C_1, C_2), is the sum of the weights of all pairwise in-scope matching affinities W(C_1, C_2). To put it another way, evaluating an affinity for a container C_1 is done by first using the scope (expression) to determine which containers are in the scope of the affinity. Then, for each in-scope container C_2 for which the match expression evaluates to true, the weight of the affinity is added to the effective affinity A(C_1, C_2).

Note that currently (for the topology-aware policy) this evaluation is asymmetric: A(C_1, C_2) and A(C_2, C_1) can and will be different unless the affinity annotations are crafted to prevent this (by making them fully symmetric). Moreover, A(C_1, C_2) is calculated and taken into consideration during resource allocation for C_1, while A(C_2, C_1) is calculated and taken into account during resource allocation for C_2. This might be changed in a future version.
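
The summation and its asymmetry can be illustrated with a short Python sketch. Everything here is hypothetical scaffolding: Affinity, effective_affinities, named and anywhere are illustrative names, scope and match are modeled as plain predicates rather than parsed expressions, and containers are reduced to a dictionary with a name.

```python
from typing import Callable, Dict, List, NamedTuple

Container = Dict[str, str]                 # e.g. {"name": "peter"}
Predicate = Callable[[Container], bool]    # stands in for an expression


class Affinity(NamedTuple):
    scope: Predicate   # which containers the affinity is evaluated against
    match: Predicate   # which in-scope containers it applies to
    weight: int        # positive pulls, negative pushes


def effective_affinities(affinities: List[Affinity],
                         others: List[Container]) -> Dict[str, int]:
    """Return A(C_1, C_2) for every matching C_2, keyed by C_2's name.

    `affinities` are the affinities defined for C_1; `others` are the
    candidate containers C_2.
    """
    totals: Dict[str, int] = {}
    for affinity in affinities:
        for other in others:
            if affinity.scope(other) and affinity.match(other):
                name = other["name"]
                totals[name] = totals.get(name, 0) + affinity.weight
    return totals


def named(*names: str) -> Predicate:
    """Predicate that is true for containers with one of the given names."""
    return lambda container: container["name"] in names


def anywhere(_container: Container) -> bool:
    """Predicate that puts every container in scope."""
    return True


peter = [
    Affinity(anywhere, named("sheep"), 5),
    Affinity(anywhere, named("wolf"), -5),
    Affinity(anywhere, named("sheep", "wolf"), 2),
]
others = [{"name": "sheep"}, {"name": "wolf"}, {"name": "dog"}]

print(effective_affinities(peter, others))             # {'sheep': 7, 'wolf': -3}
# sheep defines no affinities of its own, so A(sheep, peter) has no entry:
print(effective_affinities([], [{"name": "peter"}]))   # {}
```

In this sketch A(peter, sheep) is 7 while A(sheep, peter) is absent (effectively 0), which is the asymmetry described above.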

Currently affinity expressions lack support for boolean operators (and, or, not). Sometimes this limitation can be overcome by using joint keys, especially with matching operators. The joint key syntax allows joining the value of several keys with a separator into a single value. A joint key can be specified in a simple or full format:

simple: <colon-separated-subkeys>, this is equivalent to :::<colon-separated-subkeys>
full: <ksep><vsep><ksep-separated-keylist>

A joint key evaluates to the values of all the <ksep>-separated subkeys joined by <vsep>. A non-existent subkey evaluates to the empty string. For instance the joint key

:pod/qosclass:pod/name:name

evaluates to

<qosclass>:<pod name>:<container name>

For existence operators, a joint key is considered to exist if any of its subkeys exists.
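
As a rough illustration of how a joint key is evaluated, the Python sketch below joins the values of subkeys that have already been split out, and reports existence as described above. It deliberately does not parse the separator prefix of a joint key. The function name joint_key_value and the flat dictionary used to represent a container are hypothetical, and Python's fnmatch is only assumed to approximate the globbing used by the matching operators.

```python
from fnmatch import fnmatchcase


def joint_key_value(container, subkeys, vsep=":"):
    """Return (joined value, exists) for a joint key.

    A non-existent subkey contributes the empty string; the joint key
    exists if any of its subkeys exists.
    """
    values = [container.get(subkey, "") for subkey in subkeys]
    exists = any(subkey in container for subkey in subkeys)
    return vsep.join(values), exists


container = {"pod/qosclass": "Guaranteed", "pod/name": "farm", "name": "peter"}
value, exists = joint_key_value(container, ["pod/qosclass", "pod/name", "name"])
print(value)                                     # Guaranteed:farm:peter
# One glob over the joined value acts like "qosclass AND container name":
print(fnmatchcase(value, "Guaranteed:*:peter"))  # True
```

Matching a single pattern against the joined value is how a joint key can stand in for a missing boolean "and" between two conditions.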

Examples

Put the container peter close to the container sheep but far away from the container wolf.

metadata:
  annotations:
    cri-resource-manager.intel.com/affinity: |
      peter:
      - match:
          key: name
          operator: Equals
          values:
          - sheep
        weight: 5
    cri-resource-manager.intel.com/anti-affinity: |
      peter:
      - match:
          key: name
          operator: Equals
          values:
          - wolf
        weight: 5

Shorthand Notation

There is an alternative shorthand syntax for what is considered to be the most common case: defining affinities between containers within the same pod. With this notation one only needs to give the names of the containers, as in the example below.

  annotations:
    cri-resource-manager.intel.com/affinity: |
      container3: [ container1 ]
    cri-resource-manager.intel.com/anti-affinity: |
      container3: [ container2 ]
      container4: [ container2, container3 ]

This shorthand notation defines:

container3 having
- affinity (weight 1) to container1
- anti-affinity (weight -1) to container2
container4 having
- anti-affinity (weight -1) to container2, and container3

The equivalent annotation in full syntax would be

metadata:
  annotations:
    cri-resource-manager.intel.com/affinity: |+
      container3:
      - match:
          key: labels/io.kubernetes.container.name
          operator: In
          values:
          - container1
    cri-resource-manager.intel.com/anti-affinity: |+
      container3:
      - match:
          key: labels/io.kubernetes.container.name
          operator: In
          values:
          - container2
      container4:
      - match:
          key: labels/io.kubernetes.container.name
          operator: In
          values:
          - container2
          - container3
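
The mapping from shorthand to full syntax shown above is mechanical, which the following Python sketch tries to make explicit. It is an illustration only: expand_shorthand is a hypothetical helper, and it works on already parsed annotation values (plain dictionaries and lists) rather than on YAML text. Scope and weight are left out, so the defaults described earlier (Pod scope, weight +1 or -1) would apply.

```python
def expand_shorthand(shorthand):
    """Expand {container: [names]} into the full, match-only form."""
    return {
        container: [
            {
                "match": {
                    "key": "labels/io.kubernetes.container.name",
                    "operator": "In",
                    "values": list(names),
                }
            }
        ]
        for container, names in shorthand.items()
    }


anti_affinity = {
    "container3": ["container2"],
    "container4": ["container2", "container3"],
}
full = expand_shorthand(anti_affinity)
print(full["container4"][0]["match"]["values"])  # ['container2', 'container3']
```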