RESTful API
Version Info
- GET /rest/v1/version
Get XPU Manager version information
- Status Codes:
200 OK – OK
500 Internal Server Error – Error
- Response JSON Object:
level_zero_version (string) – Underlying level-zero lib version
xpum_version (string) – XPUM version
xpum_version_git (string) – The git commit hash of this build
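As a minimal sketch of calling this endpoint, the snippet below builds the GET request and reads the three documented fields from a response body. The base URL and the sample body are hypothetical placeholders, and the request is only constructed, not sent; authentication and TLS settings depend on your deployment and are not shown.

```python
import json
import urllib.request

# Hypothetical address; replace with the host and port of your REST service.
BASE_URL = "https://xpum.example:30000"


def build_version_request(base_url=BASE_URL):
    """Build (but do not send) the GET request for the version endpoint."""
    return urllib.request.Request(f"{base_url}/rest/v1/version", method="GET")


def parse_version(body):
    """Return the three documented version fields from a JSON response body."""
    data = json.loads(body)
    keys = ("xpum_version", "xpum_version_git", "level_zero_version")
    return {key: data.get(key) for key in keys}


# Made-up response body, for illustration only.
sample_body = json.dumps({
    "xpum_version": "1.0.0",
    "xpum_version_git": "abc1234",
    "level_zero_version": "1.8.0",
})
print(parse_version(sample_body))
```

To send the request, pass the object returned by `build_version_request()` to `urllib.request.urlopen` and feed the decoded body to `parse_version`.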
Devices
- GET /rest/v1/devices
Get device list
- Status Codes:
200 OK – OK
500 Internal Server Error – Error
- Response JSON Object:
devices[].@odata.id (string) – Link to device detail properties
devices[].device_id (integer) – Device id
devices[].device_name (string) – Device name
devices[].device_type (string) – Device type; currently only GPU
devices[].pci_bdf_address (string) – The PCI bdf address of device
devices[].pci_device_id (string) – The PCI device id of device
devices[].uuid (string) – Device uuid
devices[].vendor_name (string) – Vendor name
- GET /rest/v1/devices/{deviceId}
Get device properties
- Parameters:
deviceId (integer) – Device id
- Request JSON Object:
password (string) – Password for Redfish authentication
username (string) – Username for Redfish authentication
- Status Codes:
200 OK – OK
400 Bad Request – Error
404 Not Found – Device not found
500 Internal Server Error – Error
- Response JSON Object:
amc_firmware_name (string) – The AMC firmware name of device
amc_firmware_version (string) – The AMC firmware version of device
core_clock_rate_mhz (string) – Clock rate for device core, in MHz
device_id (integer) – Device id
device_name (string) – Device name
device_stepping (string) – The stepping of device
device_type (string) – Device type
driver_version (string) – The driver version
firmware_name (string) – The GFX firmware name of device
firmware_version (string) – The GFX firmware version of device
gfx_data_firmware_name (string) – The GFX_DATA firmware name of device
gfx_data_firmware_version (string) – The GFX_DATA firmware version of device
gfx_firmware_status (string) – The GFX firmware status
gfx_pscbin_firmware_name (string) – The PSC firmware name of device
gfx_pscbin_firmware_version (string) – The PSC firmware version of device
health.@odata.id (string) – Link to detail info
kernel_version (string) – Linux kernel version
max_command_queue_priority (string) – Maximum priority for command queues. Higher value is higher priority
max_hardware_contexts (string) – Maximum number of logical hardware contexts
max_mem_alloc_size_byte (string) – The total allocatable memory, in bytes
memory_bus_width (string) – Memory bus width
memory_ecc_state (string) – The state of memory ecc
memory_free_size_byte (string) – The free memory, in bytes
memory_physical_size_byte (string) – Device physical memory size, in bytes
number_of_eus (string) – The number of EUs
number_of_eus_per_sub_slice (string) – Maximum number of EUs per sub-slice
number_of_media_engines (string) – The number of media engines
number_of_media_enh_engines (string) – The number of media enhancement engines
number_of_memory_channels (string) – Number of memory channels
number_of_slices (string) – Maximum number of slices
number_of_sub_slices_per_slice (string) – Maximum number of sub-slices per slice
number_of_threads_per_eu (string) – Maximum number of threads per EU
number_of_tiles (string) – The number of tiles
pci_bdf_address (string) – The PCI bdf address of device
pci_device_id (string) – The PCI device id of device
pci_slot (string) – PCI slot of device
pci_sub_device_id (string) – The PCI sub device id of device
pci_vendor_id (string) – The PCI vendor id of device
pcie_generation (string) – PCIe generation
pcie_max_link_width (string) – PCIe max link width
physical_eu_simd_width (string) – The physical EU simd width
serial_number (string) – Serial number
sku_type (string) – The type of SKU
socket_id (string) – Socket id of the OAM GPU
topology.@odata.id (string) – Link to detail info
uuid (string) – Device uuid
vendor_name (string) – Vendor name
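Most quantities in this response, such as memory sizes and unit counts, are documented as strings. The sketch below converts a few of them to integers for arithmetic. It assumes those strings hold plain decimal digits, which the reference does not state, so anything that does not parse is returned as None. The sample body is made up.

```python
import json

# Fields documented as strings that this sketch assumes contain decimal digits.
NUMERIC_STRING_FIELDS = (
    "memory_physical_size_byte",
    "memory_free_size_byte",
    "max_mem_alloc_size_byte",
    "number_of_tiles",
    "number_of_eus",
)


def to_int_or_none(value):
    """Convert a value such as "16384" to int; return None if it is not numeric."""
    try:
        return int(str(value).strip())
    except ValueError:
        return None


def numeric_properties(body):
    """Parse a device-properties response body and convert selected fields."""
    props = json.loads(body)
    return {name: to_int_or_none(props.get(name)) for name in NUMERIC_STRING_FIELDS}


# Made-up response fragment, for illustration only.
sample_body = json.dumps({
    "memory_physical_size_byte": "17179869184",
    "memory_free_size_byte": "17000000000",
    "number_of_tiles": "2",
    "number_of_eus": "unknown",
})
converted = numeric_properties(sample_body)
used_bytes = converted["memory_physical_size_byte"] - converted["memory_free_size_byte"]
print(converted, used_bytes)
```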
- GET /rest/v1/devices/amcversions
Get AMC firmware versions
- Request JSON Object:
password (string) – Password for Redfish authentication
username (string) – Username for Redfish authentication
- Status Codes:
200 OK – OK
500 Internal Server Error – Error
- Response JSON Object:
amc_fw_version[] (string) – AMC versions
error (string) – Error message
Diagnostics
- POST /rest/v1/devices/{deviceId}/diagnostics
Run diagnostics for device
- Parameters:
deviceId (integer) – Device id
- Request JSON Object:
level (integer) – The level for diagnostics to run
- Status Codes:
201 Created – OK
400 Bad Request – Bad Request, for example invalid level
404 Not Found – Device not found
500 Internal Server Error – Error
- GET /rest/v1/devices/{deviceId}/diagnostics
Get diagnostics result for device
- Parameters:
deviceId (integer) – Device id
- Status Codes:
200 OK – OK
404 Not Found – Device not found
500 Internal Server Error – Error
- Response JSON Object:
component_count (integer) – Component count
component_list[].component_type (string) – Component type
component_list[].finished (boolean) – Finished or not
component_list[].message (string) – Result message
component_list[].result (string) – Result status
device_id (integer) – Device id
end_time (string) – End time
finished (boolean) – Finished or not
level (integer) – The level for diagnostics to run
message (string) – Result message
result (string) – Result status
start_time (string) – Start time
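Because the result carries a `finished` flag, a client would presumably start a run with the POST endpoint and then poll the GET endpoint until the flag is true. The sketch below shows such a loop. The `fetch_result` callable is a hypothetical stand-in for whatever code performs the GET and returns the parsed JSON object, and the polling interval and limit are arbitrary choices.

```python
import time


def wait_for_diagnostics(fetch_result, poll_seconds=1.0, max_polls=60, sleep=time.sleep):
    """Call fetch_result() until the returned object reports finished == True."""
    for _ in range(max_polls):
        result = fetch_result()
        if result.get("finished"):
            return result
        sleep(poll_seconds)
    raise TimeoutError("diagnostics did not finish within the polling limit")


def component_results(result):
    """Map each component_type to its result status string."""
    return {
        item["component_type"]: item["result"]
        for item in result.get("component_list", [])
    }


# Simulated sequence of GET responses; the field values are placeholders.
fake_responses = iter([
    {"device_id": 0, "finished": False, "component_list": []},
    {
        "device_id": 0,
        "finished": True,
        "component_list": [
            {"component_type": "EXAMPLE_COMPONENT", "finished": True,
             "message": "placeholder", "result": "EXAMPLE_STATUS"},
        ],
    },
])
final = wait_for_diagnostics(lambda: next(fake_responses), sleep=lambda seconds: None)
print(component_results(final))
```

Injecting `sleep` keeps the loop testable without real delays; in production the default `time.sleep` applies.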
- POST /rest/v1/groups/{groupId}/diagnostics
Run diagnostics for group
- Parameters:
groupId (integer) – Group id
- Request JSON Object:
level (integer) – The level for diagnostics to run
- Status Codes:
201 Created – OK
400 Bad Request – Bad Request, for example invalid level
404 Not Found – Group not found
500 Internal Server Error – Error
- GET /rest/v1/groups/{groupId}/diagnostics
Get diagnostics result for group
- Parameters:
groupId (integer) – Group id
- Status Codes:
200 OK – OK
404 Not Found – Group not found
500 Internal Server Error – Error
- Response JSON Object:
device_count (integer) – Device count
device_list[].component_count (integer) – Component count
device_list[].component_list[].component_type (string) – Component type
device_list[].component_list[].finished (boolean) – Finished or not
device_list[].component_list[].message (string) – Result message
device_list[].component_list[].result (string) – Result status
device_list[].device_id (integer) – Device id
device_list[].end_time (string) – End time
device_list[].finished (boolean) – Finished or not
device_list[].level (integer) – The level for diagnostics to run
device_list[].message (string) – Result message
device_list[].result (string) – Result status
device_list[].start_time (string) – Start time
finished (boolean) – Finished or not
group_id (integer) – Group id
Health
- GET /rest/v1/devices/{deviceId}/health
Get health for device
- Parameters:
deviceId (integer) – Device id
- Status Codes:
200 OK – OK
404 Not Found – Device not found
500 Internal Server Error – Error
- Response JSON Object:
core_temperature.custom_threshold (integer) – The custom threshold for health, in degrees Celsius
core_temperature.description (string) – The description for health
core_temperature.shutdown_threshold (integer) – The shutdown threshold for health, in degrees Celsius
core_temperature.status (integer) – The status for health
core_temperature.throttle_threshold (integer) – The throttle threshold for health, in degrees Celsius
device_id (integer) – Device id
frequency.description (string) – The description for health
frequency.status (integer) – The status for health
memory.description (string) – The description for health
memory.status (integer) – The status for health
memory_temperature.custom_threshold (integer) – The custom threshold for health, in degrees Celsius
memory_temperature.description (string) – The description for health
memory_temperature.shutdown_threshold (integer) – The shutdown threshold for health, in degrees Celsius
memory_temperature.status (integer) – The status for health
memory_temperature.throttle_threshold (integer) – The throttle threshold for health, in degrees Celsius
power.custom_threshold (integer) – The custom threshold in watts for health
power.description (string) – The description for health
power.status (integer) – The status for health
power.throttle_threshold (integer) – The throttle threshold in watts for health
xe_link_port.description (string) – The description for health
xe_link_port.status (integer) – The status for health
- GET /rest/v1/groups/{groupId}/health
Get health for group
- Parameters:
groupId (integer) – Group id
- Status Codes:
200 OK – OK
404 Not Found – Group not found
500 Internal Server Error – Error
- Response JSON Object:
device_count (integer) – Device count
device_list[].core_temperature.custom_threshold (integer) – The custom threshold for health, in degrees Celsius
device_list[].core_temperature.description (string) – The description for health
device_list[].core_temperature.shutdown_threshold (integer) – The shutdown threshold for health, in degrees Celsius
device_list[].core_temperature.status (integer) – The status for health
device_list[].core_temperature.throttle_threshold (integer) – The throttle threshold for health, in degrees Celsius
device_list[].device_id (integer) – Device id
device_list[].frequency.description (string) – The description for health
device_list[].frequency.status (integer) – The status for health
device_list[].memory.description (string) – The description for health
device_list[].memory.status (integer) – The status for health
device_list[].memory_temperature.custom_threshold (integer) – The custom threshold for health, in degrees Celsius
device_list[].memory_temperature.description (string) – The description for health
device_list[].memory_temperature.shutdown_threshold (integer) – The shutdown threshold for health, in degrees Celsius
device_list[].memory_temperature.status (integer) – The status for health
device_list[].memory_temperature.throttle_threshold (integer) – The throttle threshold for health, in degrees Celsius
device_list[].power.custom_threshold (integer) – The custom threshold in watts for health
device_list[].power.description (string) – The description for health
device_list[].power.status (integer) – The status for health
device_list[].power.throttle_threshold (integer) – The throttle threshold in watts for health
device_list[].xe_link_port.description (string) – The description for health
device_list[].xe_link_port.status (integer) – The status for health
group_id (integer) – Group id
- GET /rest/v1/devices/{deviceId}/health/{healthType}
Get a specific health type for a device. The response JSON object contains only the requested health type.
- Parameters:
deviceId (integer) – Device id
healthType (string) – Health type: one of coreTemperature, memoryTemperature, power, memory, xeLinkPort or frequency
- Status Codes:
200 OK – OK
404 Not Found – Device not found or health type not supported
500 Internal Server Error – Error
- Response JSON Object:
core_temperature.custom_threshold (integer) – The custom threshold for health, in degrees Celsius
core_temperature.description (string) – The description for health
core_temperature.shutdown_threshold (integer) – The shutdown threshold for health, in degrees Celsius
core_temperature.status (integer) – The status for health
core_temperature.throttle_threshold (integer) – The throttle threshold for health, in degrees Celsius
device_id (integer) – Device id
frequency.description (string) – The description for health
frequency.status (integer) – The status for health
memory.description (string) – The description for health
memory.status (integer) – The status for health
memory_temperature.custom_threshold (integer) – The custom threshold for health, in degrees Celsius
memory_temperature.description (string) – The description for health
memory_temperature.shutdown_threshold (integer) – The shutdown threshold for health, in degrees Celsius
memory_temperature.status (integer) – The status for health
memory_temperature.throttle_threshold (integer) – The throttle threshold for health, in degrees Celsius
power.custom_threshold (integer) – The custom threshold in watts for health
power.description (string) – The description for health
power.status (integer) – The status for health
power.throttle_threshold (integer) – The throttle threshold in watts for health
xe_link_port.description (string) – The description for health
xe_link_port.status (integer) – The status for health
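The healthType path parameter is written in camelCase (for example xeLinkPort), while the response fields use snake_case (xe_link_port). The sketch below assumes the response key is simply the snake_case form of the path parameter, which matches the field names listed here but is not stated explicitly. The sample response values are made up.

```python
import re

HEALTH_TYPES = (
    "coreTemperature", "memoryTemperature", "power",
    "memory", "xeLinkPort", "frequency",
)


def response_key(health_type):
    """Convert a camelCase health type to the assumed snake_case response key."""
    if health_type not in HEALTH_TYPES:
        raise ValueError(f"unsupported health type: {health_type}")
    return re.sub(r"(?<!^)(?=[A-Z])", "_", health_type).lower()


def extract_health(response, health_type):
    """Return the health object for the requested type, or None if it is absent."""
    return response.get(response_key(health_type))


# Made-up response for GET .../health/xeLinkPort, for illustration only.
sample_response = {
    "device_id": 0,
    "xe_link_port": {"description": "placeholder text", "status": 1},
}
print(extract_health(sample_response, "xeLinkPort"))
```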
- PUT /rest/v1/devices/{deviceId}/health/{healthType}
Set health config for device
- Parameters:
deviceId (integer) – Device id
healthType (string) – Health type: only coreTemperature, memoryTemperature or power
- Request JSON Object:
custom_threshold (integer) – The custom threshold: degrees Celsius for coreTemperature and memoryTemperature, watts for power
- Status Codes:
200 OK – OK
400 Bad Request – Bad Request, for example invalid threshold
404 Not Found – Device not found or health type not supported
500 Internal Server Error – Error
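A request to set a threshold could be assembled as in the sketch below, which rejects health types that the PUT endpoint does not accept before anything is sent. The base URL is a hypothetical placeholder, the request is only built, and authentication is left out.

```python
import json
import urllib.request

# Health types this endpoint accepts, per the parameter description above.
SETTABLE_HEALTH_TYPES = ("coreTemperature", "memoryTemperature", "power")


def build_threshold_request(base_url, device_id, health_type, custom_threshold):
    """Build (but do not send) the PUT request that sets a custom threshold."""
    if health_type not in SETTABLE_HEALTH_TYPES:
        raise ValueError(f"threshold cannot be set for health type: {health_type}")
    body = json.dumps({"custom_threshold": int(custom_threshold)}).encode("utf-8")
    url = f"{base_url}/rest/v1/devices/{int(device_id)}/health/{health_type}"
    return urllib.request.Request(
        url,
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


# Hypothetical host; 85 is an arbitrary example value in degrees Celsius.
request = build_threshold_request("https://xpum.example:30000", 0, "coreTemperature", 85)
print(request.get_method(), request.full_url, request.data)
```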
- GET /rest/v1/groups/{groupId}/health/{healthType}
Get a specific health type for a group. The response JSON object contains only the requested health type.
- Parameters:
groupId (integer) – Group id
healthType (string) – Health type: one of coreTemperature, memoryTemperature, power, memory, xeLinkPort or frequency
- Status Codes:
200 OK – OK
404 Not Found – Group not found or health type not supported
500 Internal Server Error – Error
- Response JSON Object:
device_count (integer) – Device count
device_list[].core_temperature.custom_threshold (integer) – The custom threshold for health, in degrees Celsius
device_list[].core_temperature.description (string) – The description for health
device_list[].core_temperature.shutdown_threshold (integer) – The shutdown threshold for health, in degrees Celsius
device_list[].core_temperature.status (integer) – The status for health
device_list[].core_temperature.throttle_threshold (integer) – The throttle threshold for health, in degrees Celsius
device_list[].device_id (integer) – Device id
device_list[].frequency.description (string) – The description for health
device_list[].frequency.status (integer) – The status for health
device_list[].memory.description (string) – The description for health
device_list[].memory.status (integer) – The status for health
device_list[].memory_temperature.custom_threshold (integer) – The custom threshold for health, in degrees Celsius
device_list[].memory_temperature.description (string) – The description for health
device_list[].memory_temperature.shutdown_threshold (integer) – The shutdown threshold for health, in degrees Celsius
device_list[].memory_temperature.status (integer) – The status for health
device_list[].memory_temperature.throttle_threshold (integer) – The throttle threshold for health, in degrees Celsius
device_list[].power.custom_threshold (integer) – The custom threshold in watts for health
device_list[].power.description (string) – The description for health
device_list[].power.status (integer) – The status for health
device_list[].power.throttle_threshold (integer) – The throttle threshold in watts for health
device_list[].xe_link_port.description (string) – The description for health
device_list[].xe_link_port.status (integer) – The status for health
group_id (integer) – Group id
- PUT /rest/v1/groups/{groupId}/health/{healthType}
Set health config for group
- Parameters:
groupId (integer) – Group id
healthType (string) – Health type: only coreTemperature, memoryTemperature or power
- Request JSON Object:
custom_threshold (integer) – The custom threshold: degrees Celsius for coreTemperature and memoryTemperature, watts for power
- Status Codes:
200 OK – OK
400 Bad Request – Bad Request, for example invalid threshold
404 Not Found – Group not found or health type not supported
500 Internal Server Error – Error
Policy
- GET /rest/v1/policy
Get all policies for all devices
- Status Codes:
200 OK – OK
400 Bad Request – Request Error
500 Internal Server Error – Internal Error
- Response JSON Object:
[].device_id (integer) – Device id
[].policy_list[].action.throttle_device_frequency_max (integer) – The throttle_device_frequency_max value; applies only to the XPUM_POLICY_ACTION_TYPE_THROTTLE_DEVICE action type.
[].policy_list[].action.throttle_device_frequency_min (integer) – The throttle_device_frequency_min value; applies only to the XPUM_POLICY_ACTION_TYPE_THROTTLE_DEVICE action type.
[].policy_list[].action.type (string) – Policy action type. Supported types: XPUM_POLICY_ACTION_TYPE_THROTTLE_DEVICE, XPUM_POLICY_ACTION_TYPE_NULL
[].policy_list[].condition.threshold (integer) – The threshold for policy
[].policy_list[].condition.type (string) – Policy condition type. Supported types: XPUM_POLICY_CONDITION_TYPE_GREATER, XPUM_POLICY_CONDITION_TYPE_LESS, XPUM_POLICY_CONDITION_TYPE_WHEN_OCCUR
[].policy_list[].device_id (integer) – Device id
[].policy_list[].notify_callback_url (string) – Policy notify callback url
[].policy_list[].type (string) – Policy type. Supported types: XPUM_POLICY_TYPE_GPU_TEMPERATURE, XPUM_POLICY_TYPE_GPU_MEMORY_TEMPERATURE, XPUM_POLICY_TYPE_GPU_POWER, XPUM_POLICY_TYPE_RAS_ERROR_CAT_RESET, XPUM_POLICY_TYPE_RAS_ERROR_CAT_PROGRAMMING_ERRORS, XPUM_POLICY_TYPE_RAS_ERROR_CAT_DRIVER_ERRORS, XPUM_POLICY_TYPE_RAS_ERROR_CAT_CACHE_ERRORS_CORRECTABLE, XPUM_POLICY_TYPE_RAS_ERROR_CAT_CACHE_ERRORS_UNCORRECTABLE, XPUM_POLICY_TYPE_GPU_MISSING, XPUM_POLICY_TYPE_GPU_THROTTLE
- GET /rest/v1/devices/{deviceId}/policy
Get all policies for a device
- Parameters:
deviceId (integer) – Device id
- Status Codes:
200 OK – OK
500 Internal Server Error – Error
- Response JSON Object:
[].device_id (integer) – Device id
[].policy_list[].action.throttle_device_frequency_max (integer) – The throttle_device_frequency_max value; applies only to the XPUM_POLICY_ACTION_TYPE_THROTTLE_DEVICE action type.
[].policy_list[].action.throttle_device_frequency_min (integer) – The throttle_device_frequency_min value; applies only to the XPUM_POLICY_ACTION_TYPE_THROTTLE_DEVICE action type.
[].policy_list[].action.type (string) – Policy action type. Supported types: XPUM_POLICY_ACTION_TYPE_THROTTLE_DEVICE, XPUM_POLICY_ACTION_TYPE_NULL
[].policy_list[].condition.threshold (integer) – The threshold for policy
[].policy_list[].condition.type (string) – Policy condition type. Supported types: XPUM_POLICY_CONDITION_TYPE_GREATER, XPUM_POLICY_CONDITION_TYPE_LESS, XPUM_POLICY_CONDITION_TYPE_WHEN_OCCUR
[].policy_list[].device_id (integer) – Device id
[].policy_list[].notify_callback_url (string) – Policy notify callback url
[].policy_list[].type (string) – Policy type. Supported types: XPUM_POLICY_TYPE_GPU_TEMPERATURE, XPUM_POLICY_TYPE_GPU_MEMORY_TEMPERATURE, XPUM_POLICY_TYPE_GPU_POWER, XPUM_POLICY_TYPE_RAS_ERROR_CAT_RESET, XPUM_POLICY_TYPE_RAS_ERROR_CAT_PROGRAMMING_ERRORS, XPUM_POLICY_TYPE_RAS_ERROR_CAT_DRIVER_ERRORS, XPUM_POLICY_TYPE_RAS_ERROR_CAT_CACHE_ERRORS_CORRECTABLE, XPUM_POLICY_TYPE_RAS_ERROR_CAT_CACHE_ERRORS_UNCORRECTABLE, XPUM_POLICY_TYPE_GPU_MISSING, XPUM_POLICY_TYPE_GPU_THROTTLE
- POST /rest/v1/devices/{deviceId}/policy
Set a policy for a device.
- Parameters:
deviceId (integer) – Device id
- Request JSON Object:
action.throttle_device_frequency_max (integer) – The throttle_device_frequency_max value; applies only to the XPUM_POLICY_ACTION_TYPE_THROTTLE_DEVICE action type.
action.throttle_device_frequency_min (integer) – The throttle_device_frequency_min value; applies only to the XPUM_POLICY_ACTION_TYPE_THROTTLE_DEVICE action type.
action.type (string) – Policy action type. Supported types: XPUM_POLICY_ACTION_TYPE_THROTTLE_DEVICE, XPUM_POLICY_ACTION_TYPE_NULL
condition.threshold (integer) – The threshold for policy
condition.type (string) – Policy condition type. Supported types: XPUM_POLICY_CONDITION_TYPE_GREATER, XPUM_POLICY_CONDITION_TYPE_LESS, XPUM_POLICY_CONDITION_TYPE_WHEN_OCCUR
device_id (integer) – Device id
notify_callback_url (string) – Policy notify callback url
type (string) – Policy type. Supported types: XPUM_POLICY_TYPE_GPU_TEMPERATURE, XPUM_POLICY_TYPE_GPU_MEMORY_TEMPERATURE, XPUM_POLICY_TYPE_GPU_POWER, XPUM_POLICY_TYPE_RAS_ERROR_CAT_RESET, XPUM_POLICY_TYPE_RAS_ERROR_CAT_PROGRAMMING_ERRORS, XPUM_POLICY_TYPE_RAS_ERROR_CAT_DRIVER_ERRORS, XPUM_POLICY_TYPE_RAS_ERROR_CAT_CACHE_ERRORS_CORRECTABLE, XPUM_POLICY_TYPE_RAS_ERROR_CAT_CACHE_ERRORS_UNCORRECTABLE, XPUM_POLICY_TYPE_GPU_MISSING, XPUM_POLICY_TYPE_GPU_THROTTLE
- Status Codes:
200 OK – OK
400 Bad Request – Request Error
500 Internal Server Error – Internal Error
- Response JSON Object:
message (string) – Success or error message
status (integer) – Status code; 0 indicates success, any other value indicates an error.
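The request body for setting a policy could be assembled as in the sketch below. It assumes the dotted field names (action.type, condition.threshold) denote nested JSON objects, and it includes the frequency fields only for the throttle action. Whether a threshold is required for every condition type is not stated in this reference, so the sketch leaves it optional. The function and parameter names are hypothetical.

```python
POLICY_TYPES = (
    "XPUM_POLICY_TYPE_GPU_TEMPERATURE",
    "XPUM_POLICY_TYPE_GPU_MEMORY_TEMPERATURE",
    "XPUM_POLICY_TYPE_GPU_POWER",
    "XPUM_POLICY_TYPE_RAS_ERROR_CAT_RESET",
    "XPUM_POLICY_TYPE_RAS_ERROR_CAT_PROGRAMMING_ERRORS",
    "XPUM_POLICY_TYPE_RAS_ERROR_CAT_DRIVER_ERRORS",
    "XPUM_POLICY_TYPE_RAS_ERROR_CAT_CACHE_ERRORS_CORRECTABLE",
    "XPUM_POLICY_TYPE_RAS_ERROR_CAT_CACHE_ERRORS_UNCORRECTABLE",
    "XPUM_POLICY_TYPE_GPU_MISSING",
    "XPUM_POLICY_TYPE_GPU_THROTTLE",
)
CONDITION_TYPES = (
    "XPUM_POLICY_CONDITION_TYPE_GREATER",
    "XPUM_POLICY_CONDITION_TYPE_LESS",
    "XPUM_POLICY_CONDITION_TYPE_WHEN_OCCUR",
)
THROTTLE = "XPUM_POLICY_ACTION_TYPE_THROTTLE_DEVICE"
ACTION_TYPES = (THROTTLE, "XPUM_POLICY_ACTION_TYPE_NULL")


def build_policy(device_id, policy_type, condition_type, threshold=None,
                 action_type="XPUM_POLICY_ACTION_TYPE_NULL",
                 frequency_min=None, frequency_max=None, notify_callback_url=None):
    """Return a policy request body as a dict, validating the enumerated types."""
    if policy_type not in POLICY_TYPES:
        raise ValueError(f"unsupported policy type: {policy_type}")
    if condition_type not in CONDITION_TYPES:
        raise ValueError(f"unsupported condition type: {condition_type}")
    if action_type not in ACTION_TYPES:
        raise ValueError(f"unsupported action type: {action_type}")

    condition = {"type": condition_type}
    if threshold is not None:
        condition["threshold"] = threshold

    action = {"type": action_type}
    if action_type == THROTTLE:
        # The frequency fields apply only to the throttle action.
        action["throttle_device_frequency_min"] = frequency_min
        action["throttle_device_frequency_max"] = frequency_max

    policy = {"device_id": device_id, "type": policy_type,
              "condition": condition, "action": action}
    if notify_callback_url:
        policy["notify_callback_url"] = notify_callback_url
    return policy


# Example values are arbitrary.
policy = build_policy(
    0, "XPUM_POLICY_TYPE_GPU_TEMPERATURE", "XPUM_POLICY_CONDITION_TYPE_GREATER",
    threshold=90, action_type=THROTTLE, frequency_min=300, frequency_max=1000,
)
print(policy)
```

Serializing the returned dict with `json.dumps` gives a body that can be sent with the POST request.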
- DELETE /rest/v1/devices/{deviceId}/policy
Delete a policy for a device. The policy type must be set.
- Parameters:
deviceId (integer) – Device id
- Request JSON Object:
action.throttle_device_frequency_max (integer) – The throttle_device_frequency_max value; applies only to the XPUM_POLICY_ACTION_TYPE_THROTTLE_DEVICE action type.
action.throttle_device_frequency_min (integer) – The throttle_device_frequency_min value; applies only to the XPUM_POLICY_ACTION_TYPE_THROTTLE_DEVICE action type.
action.type (string) – Policy action type. Supported types: XPUM_POLICY_ACTION_TYPE_THROTTLE_DEVICE, XPUM_POLICY_ACTION_TYPE_NULL
condition.threshold (integer) – The threshold for policy
condition.type (string) – Policy condition type. Supported types: XPUM_POLICY_CONDITION_TYPE_GREATER, XPUM_POLICY_CONDITION_TYPE_LESS, XPUM_POLICY_CONDITION_TYPE_WHEN_OCCUR
device_id (integer) – Device id
notify_callback_url (string) – Policy notify callback url
type (string) – Policy type. Supported types: XPUM_POLICY_TYPE_GPU_TEMPERATURE, XPUM_POLICY_TYPE_GPU_MEMORY_TEMPERATURE, XPUM_POLICY_TYPE_GPU_POWER, XPUM_POLICY_TYPE_RAS_ERROR_CAT_RESET, XPUM_POLICY_TYPE_RAS_ERROR_CAT_PROGRAMMING_ERRORS, XPUM_POLICY_TYPE_RAS_ERROR_CAT_DRIVER_ERRORS, XPUM_POLICY_TYPE_RAS_ERROR_CAT_CACHE_ERRORS_CORRECTABLE, XPUM_POLICY_TYPE_RAS_ERROR_CAT_CACHE_ERRORS_UNCORRECTABLE, XPUM_POLICY_TYPE_GPU_MISSING, XPUM_POLICY_TYPE_GPU_THROTTLE
- Status Codes:
200 OK – OK
400 Bad Request – Request Error
500 Internal Server Error – Internal Error
- Response JSON Object:
message (string) – Success or error message
status (integer) – Status code; 0 indicates success, any other value indicates an error.
- GET /rest/v1/groups/{groupId}/policy
Get all policies for a group
- Parameters:
groupId (integer) – Group id
- Status Codes:
200 OK – OK
400 Bad Request – Request Error
500 Internal Server Error – Internal Error
- Response JSON Object:
[].device_id (integer) – Device id
[].policy_list[].action.throttle_device_frequency_max (integer) – The throttle_device_frequency_max value; applies only to the XPUM_POLICY_ACTION_TYPE_THROTTLE_DEVICE action type.
[].policy_list[].action.throttle_device_frequency_min (integer) – The throttle_device_frequency_min value; applies only to the XPUM_POLICY_ACTION_TYPE_THROTTLE_DEVICE action type.
[].policy_list[].action.type (string) – Policy action type. Supported types: XPUM_POLICY_ACTION_TYPE_THROTTLE_DEVICE, XPUM_POLICY_ACTION_TYPE_NULL
[].policy_list[].condition.threshold (integer) – The threshold for policy
[].policy_list[].condition.type (string) – Policy condition type. Supported types: XPUM_POLICY_CONDITION_TYPE_GREATER, XPUM_POLICY_CONDITION_TYPE_LESS, XPUM_POLICY_CONDITION_TYPE_WHEN_OCCUR
[].policy_list[].device_id (integer) – Device id
[].policy_list[].notify_callback_url (string) – Policy notify callback url
[].policy_list[].type (string) – Policy type. Supported types: XPUM_POLICY_TYPE_GPU_TEMPERATURE, XPUM_POLICY_TYPE_GPU_MEMORY_TEMPERATURE, XPUM_POLICY_TYPE_GPU_POWER, XPUM_POLICY_TYPE_RAS_ERROR_CAT_RESET, XPUM_POLICY_TYPE_RAS_ERROR_CAT_PROGRAMMING_ERRORS, XPUM_POLICY_TYPE_RAS_ERROR_CAT_DRIVER_ERRORS, XPUM_POLICY_TYPE_RAS_ERROR_CAT_CACHE_ERRORS_CORRECTABLE, XPUM_POLICY_TYPE_RAS_ERROR_CAT_CACHE_ERRORS_UNCORRECTABLE, XPUM_POLICY_TYPE_GPU_MISSING, XPUM_POLICY_TYPE_GPU_THROTTLE
- POST /rest/v1/groups/{groupId}/policy
Set a policy for a group.
- Parameters:
groupId (integer) – Group id
- Request JSON Object:
action.throttle_device_frequency_max (integer) – The throttle_device_frequency_max value; applies only to the XPUM_POLICY_ACTION_TYPE_THROTTLE_DEVICE action type.
action.throttle_device_frequency_min (integer) – The throttle_device_frequency_min value; applies only to the XPUM_POLICY_ACTION_TYPE_THROTTLE_DEVICE action type.
action.type (string) – Policy action type. Supported types: XPUM_POLICY_ACTION_TYPE_THROTTLE_DEVICE, XPUM_POLICY_ACTION_TYPE_NULL
condition.threshold (integer) – The threshold for policy
condition.type (string) – Policy condition type. Supported types: XPUM_POLICY_CONDITION_TYPE_GREATER, XPUM_POLICY_CONDITION_TYPE_LESS, XPUM_POLICY_CONDITION_TYPE_WHEN_OCCUR
device_id (integer) – Device id
notify_callback_url (string) – Policy notify callback url
type (string) – Policy type. Supported types: XPUM_POLICY_TYPE_GPU_TEMPERATURE, XPUM_POLICY_TYPE_GPU_MEMORY_TEMPERATURE, XPUM_POLICY_TYPE_GPU_POWER, XPUM_POLICY_TYPE_RAS_ERROR_CAT_RESET, XPUM_POLICY_TYPE_RAS_ERROR_CAT_PROGRAMMING_ERRORS, XPUM_POLICY_TYPE_RAS_ERROR_CAT_DRIVER_ERRORS, XPUM_POLICY_TYPE_RAS_ERROR_CAT_CACHE_ERRORS_CORRECTABLE, XPUM_POLICY_TYPE_RAS_ERROR_CAT_CACHE_ERRORS_UNCORRECTABLE, XPUM_POLICY_TYPE_GPU_MISSING, XPUM_POLICY_TYPE_GPU_THROTTLE
- Status Codes:
200 OK – OK
400 Bad Request – Request Error
500 Internal Server Error – Internal Error
- Response JSON Object:
message (string) – Success or error message
status (integer) – Status code; 0 indicates success, any other value indicates an error.
- DELETE /rest/v1/groups/{groupId}/policy
Delete a policy for a group. The policy type must be set.
- Parameters:
groupId (integer) – Group id
- Request JSON Object:
action.throttle_device_frequency_max (integer) – The throttle_device_frequency_max value; applies only to the XPUM_POLICY_ACTION_TYPE_THROTTLE_DEVICE action type.
action.throttle_device_frequency_min (integer) – The throttle_device_frequency_min value; applies only to the XPUM_POLICY_ACTION_TYPE_THROTTLE_DEVICE action type.
action.type (string) – Policy action type. Supported types: XPUM_POLICY_ACTION_TYPE_THROTTLE_DEVICE, XPUM_POLICY_ACTION_TYPE_NULL
condition.threshold (integer) – The threshold for policy
condition.type (string) – Policy condition type. Supported types: XPUM_POLICY_CONDITION_TYPE_GREATER, XPUM_POLICY_CONDITION_TYPE_LESS, XPUM_POLICY_CONDITION_TYPE_WHEN_OCCUR
device_id (integer) – Device id
notify_callback_url (string) – Policy notify callback url
type (string) – Policy type. Supported types: XPUM_POLICY_TYPE_GPU_TEMPERATURE, XPUM_POLICY_TYPE_GPU_MEMORY_TEMPERATURE, XPUM_POLICY_TYPE_GPU_POWER, XPUM_POLICY_TYPE_RAS_ERROR_CAT_RESET, XPUM_POLICY_TYPE_RAS_ERROR_CAT_PROGRAMMING_ERRORS, XPUM_POLICY_TYPE_RAS_ERROR_CAT_DRIVER_ERRORS, XPUM_POLICY_TYPE_RAS_ERROR_CAT_CACHE_ERRORS_CORRECTABLE, XPUM_POLICY_TYPE_RAS_ERROR_CAT_CACHE_ERRORS_UNCORRECTABLE, XPUM_POLICY_TYPE_GPU_MISSING, XPUM_POLICY_TYPE_GPU_THROTTLE
- Status Codes:
200 OK – OK
400 Bad Request – Request Error
500 Internal Server Error – Internal Error
- Response JSON Object:
message (string) – Success or error message
status (integer) – Status code; 0 indicates success, any other value indicates an error.
Group Management
- POST /rest/v1/groups
Create a new group
- Request JSON Object:
group_name (string) – The name for the group to be created (required)
- Status Codes:
200 OK – OK
500 Internal Server Error – Error
- Response JSON Object:
device_id_list[] (integer) – The ids of the devices that belong to this group
group_id (integer) – The id of the group
group_name (string) – The name of the group
- GET /rest/v1/groups
Get all groups
- Status Codes:
200 OK – OK
500 Internal Server Error – Error
- Response JSON Object:
group_list[].device_id_list[] (integer) – The ids of the devices that belong to this group
group_list[].group_id (integer) – The id of the group
group_list[].group_name (string) – The name of the group
- GET /rest/v1/groups/{groupId}
Get information of a group
- Parameters:
groupId (integer) – Group id
- Status Codes:
200 OK – OK
400 Bad Request – Error
- Response JSON Object:
device_id_list[] (integer) – The ids of the devices that belong to this group
group_id (integer) – The id of the group
group_name (string) – The name of the group
- POST /rest/v1/groups/{groupId}
Modify a group
- Parameters:
groupId (integer) – Group id
- Request JSON Object:
device_id_add[] (integer) – The ids of the devices to add to this group
device_id_remove[] (integer) – The ids of the devices to remove from this group
- Status Codes:
200 OK – OK
400 Bad Request – Error
- Response JSON Object:
fail_to_add[].device_id (integer) – The id of a device that failed to be added to the group
fail_to_add[].error_msg (string) – Error message
fail_to_remove[].device_id (integer) – The id of a device that failed to be removed from the group
fail_to_remove[].error_msg (string) – Error message
group_info.device_id_list[] (integer) – The ids of the devices that belong to this group
group_info.group_id (integer) – The id of the group
group_info.group_name (string) – The name of the group
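A modify call can partially fail, so a client should inspect fail_to_add and fail_to_remove as well as group_info. The sketch below builds the request body and collects the per-device failures from a response. The helper names are hypothetical and the sample response is made up.

```python
def build_group_update(add=(), remove=()):
    """Return the request body for POST /rest/v1/groups/{groupId}."""
    return {
        "device_id_add": [int(device_id) for device_id in add],
        "device_id_remove": [int(device_id) for device_id in remove],
    }


def collect_failures(response):
    """Return a list of (operation, device_id, error_msg) tuples from a response."""
    failures = []
    for operation, key in (("add", "fail_to_add"), ("remove", "fail_to_remove")):
        for item in response.get(key, []):
            failures.append((operation, item["device_id"], item["error_msg"]))
    return failures


body = build_group_update(add=[0, 1], remove=[2])

# Made-up response in which adding device 1 failed.
sample_response = {
    "fail_to_add": [{"device_id": 1, "error_msg": "placeholder error"}],
    "fail_to_remove": [],
    "group_info": {"group_id": 1, "group_name": "example", "device_id_list": [0]},
}
print(body, collect_failures(sample_response))
```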
- DELETE /rest/v1/groups/{groupId}
Delete a group
- Parameters:
groupId (integer) – Group id
- Status Codes:
200 OK – OK
400 Bad Request – Error
Firmware Flash
- POST /rest/v1/devices/{deviceId}/updatefw
Run firmware flash on single device or single card
- Parameters:
deviceId (integer) – Device id
- Request JSON Object:
file (string) – The path of firmware binary file to flash (required)
firmware_name (string) – Firmware name, options are: GFX, GFX_DATA, GFX_CODE_DATA, GFX_PSCBIN
force (boolean) – Force GFX firmware update. This parameter only works for GFX firmware.
- Status Codes:
200 OK – OK
400 Bad Request – Bad Request
500 Internal Server Error – Error
- Response JSON Object:
error (string) – Error message
result (string) – The result of the query
- GET /rest/v1/devices/{deviceId}/firmware
Get firmware flash state of single device
- Parameters:
deviceId (integer) – Device id
- Request JSON Object:
firmware_name (string) – Firmware name, options are: GFX, GFX_DATA, GFX_CODE_DATA, GFX_PSCBIN (required)
- Status Codes:
200 OK – OK
400 Bad Request – Bad Request
500 Internal Server Error – Error
- Response JSON Object:
error (string) – Error message
result (string) – Firmware flash state, OK/FAILED/ONGOING
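Flashing a single device is a two-step exchange: a POST starts the flash, and a GET reports the state as OK, FAILED or ONGOING. The sketch below builds both requests without sending them and classifies a state response. Note that the state query carries a JSON body on a GET request, as documented above. The base URL and the firmware file path are hypothetical placeholders.

```python
import json
import urllib.request

DEVICE_FIRMWARE_NAMES = ("GFX", "GFX_DATA", "GFX_CODE_DATA", "GFX_PSCBIN")


def _json_request(url, method, payload):
    """Build a request with a JSON body; nothing is sent."""
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        method=method,
        headers={"Content-Type": "application/json"},
    )


def build_flash_request(base_url, device_id, firmware_name, file_path, force=False):
    """Build the POST request that starts a firmware flash on one device."""
    if firmware_name not in DEVICE_FIRMWARE_NAMES:
        raise ValueError(f"unsupported firmware name: {firmware_name}")
    payload = {"file": file_path, "firmware_name": firmware_name}
    if force:
        # Per the field description, force only has an effect for GFX firmware.
        payload["force"] = True
    url = f"{base_url}/rest/v1/devices/{int(device_id)}/updatefw"
    return _json_request(url, "POST", payload)


def build_state_request(base_url, device_id, firmware_name):
    """Build the GET request that queries the flash state of one device."""
    if firmware_name not in DEVICE_FIRMWARE_NAMES:
        raise ValueError(f"unsupported firmware name: {firmware_name}")
    url = f"{base_url}/rest/v1/devices/{int(device_id)}/firmware"
    return _json_request(url, "GET", {"firmware_name": firmware_name})


def flash_is_done(state_response):
    """True once the documented state is OK or FAILED rather than ONGOING."""
    return state_response.get("result") in ("OK", "FAILED")


flash = build_flash_request("https://xpum.example:30000", 0, "GFX", "/path/to/firmware.bin")
state = build_state_request("https://xpum.example:30000", 0, "GFX")
print(flash.get_method(), flash.full_url)
print(state.get_method(), state.full_url)
```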
- POST /rest/v1/devices/updatefw
Run firmware flash on all devices
- Request JSON Object:
file (string) – The path of firmware binary file to flash (required)
firmware_name (string) – Firmware name, options are: AMC
password (string) – Password for Redfish authentication
username (string) – Username for Redfish authentication
- Status Codes:
200 OK – OK
400 Bad Request – Bad Request
500 Internal Server Error – Error
- Response JSON Object:
error (string) – Error message
result (string) – The result of the query
- GET /rest/v1/devices/firmware
Get firmware flash state of all devices
- Request JSON Object:
firmware_name (string) – Firmware name, options are: AMC (required)
- Status Codes:
200 OK – OK
400 Bad Request – Bad Request
500 Internal Server Error – Error
- Response JSON Object:
error (string) – Error message
result (string) – Firmware flash state, OK/FAILED/ONGOING
Agent Setting
- GET /rest/v1/agentSettings
Get XPUM settings
- Status Codes:
200 OK – OK
400 Bad Request – Error
- Response JSON Object:
sample_interval (integer) – Agent sample interval, in milliseconds, options are [100, 200, 500, 1000]
- POST /rest/v1/agentSettings
Modify XPUM settings
- Request JSON Object:
sample_interval (integer) – Agent sample interval, in milliseconds, options are [100, 200, 500, 1000]
- Status Codes:
200 OK – OK
400 Bad Request – Error
- Response JSON Object:
sample_interval (integer) – Agent sample interval, in milliseconds, options are [100, 200, 500, 1000]
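Since sample_interval accepts only four values, a client can reject anything else before calling the service. The following sketch validates the value and builds the POST request without sending it. The base URL is a hypothetical placeholder.

```python
import json
import urllib.request

# Allowed intervals in milliseconds, as listed in the field description.
ALLOWED_SAMPLE_INTERVALS_MS = (100, 200, 500, 1000)


def build_agent_settings_request(base_url, sample_interval):
    """Build (but do not send) the POST request that changes the sample interval."""
    if sample_interval not in ALLOWED_SAMPLE_INTERVALS_MS:
        raise ValueError(
            f"sample_interval must be one of {ALLOWED_SAMPLE_INTERVALS_MS}, "
            f"got {sample_interval!r}"
        )
    body = json.dumps({"sample_interval": int(sample_interval)}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/rest/v1/agentSettings",
        data=body,
        method="POST",
        headers={"Content-Type": "application/json"},
    )


request = build_agent_settings_request("https://xpum.example:30000", 500)
print(request.get_method(), request.full_url, request.data)
```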
Statistics
- GET /rest/v1/devices/{deviceId}/stats
Get statistics by device
- Parameters:
deviceId (integer) – Device id
- Status Codes:
200 OK – OK
500 Internal Server Error – Error
- Response JSON Object:
begin (string) – The time of last query
device_id (integer) – Device id
device_level[].avg (integer) – The average value since last query
device_level[].max (integer) – The max value since last query
device_level[].metrics_type (string) – The metric type
device_level[].min (integer) – The min value since last query
device_level[].value (integer) – The current value
end (string) – The time of this query
engine_util (any) – Engine utilizations
fabric_throughput[].avg (number) – The average value since last query
fabric_throughput[].max (number) – The max value since last query
fabric_throughput[].min (number) – The min value since last query
fabric_throughput[].name (string) – Fabric throughput name
fabric_throughput[].value (number) – The current value
tile_level[].data_list[].avg (integer) – The average value since last query
tile_level[].data_list[].max (integer) – The max value since last query
tile_level[].data_list[].metrics_type (string) – The metric type
tile_level[].data_list[].min (integer) – The min value since last query
tile_level[].data_list[].value (integer) – The current value
tile_level[].engine_util (any) – Engine utilizations
tile_level[].tile_id (integer) – The tile this data belongs to
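The response above nests metrics at two levels, so a client often wants them flattened. The sketch below collects the current value of every metric into one dictionary keyed by tile id and metric type. The helper name `flatten_current_values` and the metric names `METRIC_A` and `METRIC_B` are placeholders, not real metric types, and the sample payload is made up.

```python
def flatten_current_values(stats):
    """Map (tile_id, metrics_type) -> current value.

    Device-level metrics use None as the tile id.
    """
    values = {}
    for metric in stats.get("device_level", []):
        values[(None, metric["metrics_type"])] = metric["value"]
    for tile in stats.get("tile_level", []):
        for metric in tile.get("data_list", []):
            values[(tile["tile_id"], metric["metrics_type"])] = metric["value"]
    return values


# Made-up payload following the field names documented above.
sample = {
    "device_id": 0,
    "device_level": [
        {"metrics_type": "METRIC_A", "value": 40, "min": 10, "max": 90, "avg": 50},
    ],
    "tile_level": [
        {"tile_id": 0, "data_list": [{"metrics_type": "METRIC_B", "value": 7}]},
        {"tile_id": 1, "data_list": [{"metrics_type": "METRIC_B", "value": 9}]},
    ],
}
current = flatten_current_values(sample)
```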
- GET /rest/v1/groups/{groupId}/stats
Get statistics by group
- Parameters:
groupId (integer) – Group id
- Status Codes:
200 OK – OK
400 Bad Request – Error
500 Internal Server Error – Error
- Response JSON Object:
datas[].begin (string) – The time of last query
datas[].device_id (integer) – Device id
datas[].device_level[].avg (integer) – The average value since last query
datas[].device_level[].max (integer) – The max value since last query
datas[].device_level[].metrics_type (string) – The metric type
datas[].device_level[].min (integer) – The min value since last query
datas[].device_level[].value (integer) – The current value
datas[].end (string) – The time of this query
datas[].engine_util (any) – Engine utilizations
datas[].fabric_throughput[].avg (number) – The average value since last query
datas[].fabric_throughput[].max (number) – The max value since last query
datas[].fabric_throughput[].min (number) – The min value since last query
datas[].fabric_throughput[].name (string) – Fabric throughput name
datas[].fabric_throughput[].value (number) – The current value
datas[].tile_level[].data_list[].avg (integer) – The average value since last query
datas[].tile_level[].data_list[].max (integer) – The max value since last query
datas[].tile_level[].data_list[].metrics_type (string) – The metric type
datas[].tile_level[].data_list[].min (integer) – The min value since last query
datas[].tile_level[].data_list[].value (integer) – The current value
datas[].tile_level[].engine_util (any) – Engine utilizations
datas[].tile_level[].tile_id (integer) – The tile this data belongs to
group_id (integer) – Group id
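Group statistics wrap one per-device record in each `datas[]` entry. As a sketch, the function below extracts the `max` value of one device-level metric for every device in the group. The helper name `peak_by_device`, the metric name `METRIC_A`, and the sample payload are illustrative assumptions.

```python
def peak_by_device(group_stats, metrics_type):
    """Map device_id -> max value of one device-level metric."""
    peaks = {}
    for entry in group_stats.get("datas", []):
        for metric in entry.get("device_level", []):
            if metric["metrics_type"] == metrics_type:
                peaks[entry["device_id"]] = metric["max"]
    return peaks


# Made-up payload following the field names documented above.
sample_group = {
    "group_id": 1,
    "datas": [
        {"device_id": 0, "device_level": [{"metrics_type": "METRIC_A", "max": 80}]},
        {"device_id": 1, "device_level": [{"metrics_type": "METRIC_A", "max": 95}]},
    ],
}
peaks = peak_by_device(sample_group, "METRIC_A")
```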
Config
- PUT /rest/v1/devices/{deviceId}/standby
Set standby mode for device
- Parameters:
deviceId (integer) – Device id
- Request JSON Object:
standby_mode (string) – The standby mode: never, default
tile_id (integer) – The tile id
- Status Codes:
200 OK – OK
500 Internal Server Error – Error
- PUT /rest/v1/devices/{deviceId}/powerlimit
Set power limit for device
- Parameters:
deviceId (integer) – Device id
- Request JSON Object:
power_limit (integer) – The power limit value
- Status Codes:
200 OK – OK
500 Internal Server Error – Error
- PUT /rest/v1/devices/{deviceId}/frequencyrange
Set frequency range for device
- Parameters:
deviceId (integer) – Device id
- Request JSON Object:
max_frequency (integer) – The max frequency value
min_frequency (integer) – The min frequency value
tile_id (integer) – The tile id
- Status Codes:
200 OK – OK
500 Internal Server Error – Error
- PUT /rest/v1/devices/{deviceId}/scheduler
Set scheduler mode for device
- Parameters:
deviceId (integer) – Device id
- Request JSON Object:
scheduler_mode (string) – The scheduler mode: timeout, timeslice, exclusive and debug
scheduler_timeslice_interval (integer) – The interval for timeslice mode
scheduler_timeslice_yield_timeout (integer) – The yield timeout for timeslice mode
scheduler_watchdog_timeout (integer) – The watchdog timeout for timeout mode
tile_id (integer) – The tile id
- Status Codes:
200 OK – OK
500 Internal Server Error – Error
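Each timeout field in the scheduler request is described as belonging to one mode, so a client can assemble the body per mode. The sketch below assumes that fields for other modes may simply be omitted, which this reference does not confirm; the helper name `scheduler_body` and the numeric values are hypothetical, and units are not stated in this reference.

```python
def scheduler_body(tile_id, mode, *, watchdog_timeout=None,
                   timeslice_interval=None, timeslice_yield_timeout=None):
    """Build the JSON body for PUT /rest/v1/devices/{deviceId}/scheduler."""
    body = {"tile_id": tile_id, "scheduler_mode": mode}
    if mode == "timeout":
        if watchdog_timeout is None:
            raise ValueError("timeout mode needs watchdog_timeout")
        body["scheduler_watchdog_timeout"] = watchdog_timeout
    elif mode == "timeslice":
        if timeslice_interval is None or timeslice_yield_timeout is None:
            raise ValueError("timeslice mode needs interval and yield timeout")
        body["scheduler_timeslice_interval"] = timeslice_interval
        body["scheduler_timeslice_yield_timeout"] = timeslice_yield_timeout
    elif mode not in ("exclusive", "debug"):
        raise ValueError(f"unknown scheduler mode: {mode!r}")
    return body


# Hypothetical values for illustration only.
timeslice = scheduler_body(0, "timeslice",
                           timeslice_interval=5000,
                           timeslice_yield_timeout=640000)
exclusive = scheduler_body(0, "exclusive")
```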
- GET /rest/v1/devices/{deviceId}/config
Get all configuration for device
- Parameters:
deviceId (integer) – Device id
- Request JSON Object:
tile_id (integer) – The tile id
- Status Codes:
200 OK – OK
500 Internal Server Error – Error
- Response JSON Object:
deviceId (integer) – Device id
memory_ecc_current_state (string) – The current state of memory ECC
memory_ecc_pending_state (string) – The pending state of memory ECC
power_limit (integer) – The power limit value
power_vaild_range (string) – The valid range of the power limit
tileConfigData.gpu_frequency_valid_options (string) – The valid GPU frequency options
tileConfigData.max_frequency (integer) – The max frequency value
tileConfigData.min_frequency (integer) – The min frequency value
tileConfigData.scheduler_mode (string) – The scheduler mode: timeout, timeslice, exclusive and debug
tileConfigData.scheduler_timeslice_interval (integer) – The interval for timeslice mode
tileConfigData.scheduler_timeslice_yield_timeout (integer) – The yield timeout for timeslice mode
tileConfigData.scheduler_watchdog_timeout (integer) – The watchdog timeout for timeout mode
tileConfigData.standby_mode (string) – The standby mode: never, default
tileConfigData.standby_mode_valid_options (string) – The valid standby mode options
tileConfigData.tileId (string) – Tile id
- PUT /rest/v1/devices/{deviceId}/performancefactor
Set performance factor for device
- Parameters:
deviceId (integer) – Device id
- Request JSON Object:
engine (string) – Engine name
factor (number) – Performance factor
tile_id (integer) – The tile id
- Status Codes:
200 OK – OK
500 Internal Server Error – Error
- PUT /rest/v1/devices/{deviceId}/reset
Reset the device
- Parameters:
deviceId (integer) – Device id
- Status Codes:
200 OK – OK
500 Internal Server Error – Error
- PUT /rest/v1/devices/{deviceId}/ppr
Apply PPR to the device
- Parameters:
deviceId (integer) – Device id
- Status Codes:
200 OK – OK
500 Internal Server Error – Error
- PUT /rest/v1/devices/{deviceId}/portenabled
Set port enabled for device
- Parameters:
deviceId (integer) – Device id
- Request JSON Object:
enabled (integer) – Port state: 1 to enable, 0 to disable
port (integer) – The port number
tile_id (integer) – The tile id
- Status Codes:
200 OK – OK
500 Internal Server Error – Error
- PUT /rest/v1/devices/{deviceId}/portbeaconing
Set port beaconing for device
- Parameters:
deviceId (integer) – Device id
- Request JSON Object:
beaconing (integer) – Beaconing state: 1 for on, 0 for off
port (integer) – The port number
tile_id (integer) – The tile id
- Status Codes:
200 OK – OK
500 Internal Server Error – Error
- PUT /rest/v1/devices/{deviceId}/memoryecc
Set memory ECC state for device
- Parameters:
deviceId (integer) – Device id
- Request JSON Object:
enabled (integer) – Memory ECC state: 1 to enable, 0 to disable
- Status Codes:
200 OK – OK
500 Internal Server Error – Error
Topology
- GET /rest/v1/devices/{deviceId}/topology
Get device topology
- Parameters:
deviceId (integer) – Device id
- Status Codes:
200 OK – OK
500 Internal Server Error – Error
- Response JSON Object:
affinity_localcpulist (string) – Local CPU list
affinity_localcpus (string) – Local CPUs
device_id (integer) – Device id
switch_count (integer) – Device parent switch count
switch_list[] (string) – List of switch device paths
- GET /rest/v1/topology
Export node topology
- Status Codes:
200 OK – OK
500 Internal Server Error – Error
- Response JSON Object:
length (integer) – XML buffer length
xmlstring (string) – XML string of node topology
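Because the node topology arrives as XML embedded in a JSON string, a client has to parse it in a second step. The sketch below does that with the standard library and counts elements by tag. The helper name `count_topology_tags` and the sample XML are made up; the real schema is not described in this reference.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def count_topology_tags(response):
    """Parse the xmlstring field and count elements by tag name."""
    root = ET.fromstring(response["xmlstring"])
    return Counter(element.tag for element in root.iter())


# Made-up XML; the actual topology document will look different.
xml_text = "<topology><object type='Machine'><object type='Package'/></object></topology>"
sample_topology = {"length": len(xml_text), "xmlstring": xml_text}
tag_counts = count_topology_tags(sample_topology)
```

Whether `length` counts characters, bytes, or a terminating byte is not stated here, so the sketch does not rely on it.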
- GET /rest/v1/topology/xelink
Get xelink topology
- Status Codes:
200 OK – OK
500 Internal Server Error – Error
- Response JSON Object:
link_type (string) – Link type
local_cpu_affinity (string) – Local CPU affinity
local_device_id (integer) – Local device id
local_numa_index (integer) – Local NUMA node index
local_on_subdevice (boolean) – Whether the xelink port is located on a sub-device
local_subdevice_id (integer) – Local sub-device id
port_list[] (integer) – List of ports linked to the remote device
remote_device_id (integer) – Remote device id
remote_subdevice_id (integer) – Remote sub-device id
ps
- GET /rest/v1/devices/{deviceId}/ps
Get per-process utilization of a device
- Parameters:
deviceId (integer) – Device id
- Status Codes:
200 OK – OK
500 Internal Server Error – Error
- Response JSON Object:
utils[].device_id (integer) – Device ID
utils[].mem_size (integer) – Memory size
utils[].process_id (integer) – Process ID
utils[].process_name (string) – Process Name
utils[].shared_mem_size (integer) – Shared memory size
- GET /rest/v1/ps
Get per-process utilization of all devices
- Status Codes:
200 OK – OK
500 Internal Server Error – Error
- Response JSON Object:
utils[].device_id (integer) – Device ID
utils[].mem_size (integer) – Memory size
utils[].process_id (integer) – Process ID
utils[].process_name (string) – Process Name
utils[].shared_mem_size (integer) – Shared memory size
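A process can appear once per device in `utils[]`, so summing per process is a common follow-up. The sketch below totals `mem_size` for each process across devices. The helper name `mem_by_process`, the process names, and the sizes are made up, and the unit of `mem_size` is not stated in this reference.

```python
from collections import defaultdict


def mem_by_process(ps_response):
    """Sum mem_size over all devices for each (process_id, process_name)."""
    totals = defaultdict(int)
    for entry in ps_response.get("utils", []):
        key = (entry["process_id"], entry["process_name"])
        totals[key] += entry["mem_size"]
    return dict(totals)


# Made-up payload following the field names documented above.
sample_ps = {
    "utils": [
        {"device_id": 0, "process_id": 100, "process_name": "app_a",
         "mem_size": 512, "shared_mem_size": 0},
        {"device_id": 1, "process_id": 100, "process_name": "app_a",
         "mem_size": 256, "shared_mem_size": 0},
        {"device_id": 0, "process_id": 200, "process_name": "app_b",
         "mem_size": 64, "shared_mem_size": 8},
    ]
}
mem_totals = mem_by_process(sample_ps)
```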
Dump Raw Data
- POST /rest/v1/dump
Start a raw data dump task
- Request JSON Object:
device_id (integer) – The device to dump raw data from (required)
metrics_type_list[] (string) – The metrics type to dump, options are: XPUM_DUMP_GPU_UTILIZATION, XPUM_DUMP_POWER, XPUM_DUMP_GPU_FREQUENCY, XPUM_DUMP_GPU_CORE_TEMPERATURE, XPUM_DUMP_MEMORY_TEMPERATURE, XPUM_DUMP_MEMORY_UTILIZATION, XPUM_DUMP_MEMORY_READ_THROUGHPUT, XPUM_DUMP_MEMORY_WRITE_THROUGHPUT, XPUM_DUMP_ENERGY, XPUM_DUMP_EU_ACTIVE, XPUM_DUMP_EU_STALL, XPUM_DUMP_EU_IDLE, XPUM_DUMP_RAS_ERROR_CAT_RESET, XPUM_DUMP_RAS_ERROR_CAT_PROGRAMMING_ERRORS, XPUM_DUMP_RAS_ERROR_CAT_DRIVER_ERRORS, XPUM_DUMP_RAS_ERROR_CAT_CACHE_ERRORS_CORRECTABLE, XPUM_DUMP_RAS_ERROR_CAT_CACHE_ERRORS_UNCORRECTABLE, XPUM_DUMP_MEMORY_BANDWIDTH, XPUM_DUMP_MEMORY_USED, XPUM_DUMP_PCIE_READ_THROUGHPUT, XPUM_DUMP_PCIE_WRITE_THROUGHPUT, XPUM_DUMP_COMPUTE_XE_LINK_THROUGHPUT, XPUM_DUMP_COMPUTE_ENGINE_UTILIZATION, XPUM_DUMP_RENDER_ENGINE_UTILIZATION, XPUM_DUMP_DECODE_ENGINE_UTILIZATION, XPUM_DUMP_ENCODE_ENGINE_UTILIZATION, XPUM_DUMP_COPY_ENGINE_UTILIZATION, XPUM_DUMP_MEDIA_ENHANCEMENT_ENGINE_UTILIZATION, XPUM_DUMP_3D_ENGINE_UTILIZATION, XPUM_DUMP_RAS_ERROR_CAT_NON_COMPUTE_ERRORS_CORRECTABLE, XPUM_DUMP_RAS_ERROR_CAT_NON_COMPUTE_ERRORS_UNCORRECTABLE, XPUM_DUMP_COMPUTE_ENGINE_GROUP_UTILIZATION, XPUM_DUMP_RENDER_ENGINE_GROUP_UTILIZATION, XPUM_DUMP_MEDIA_ENGINE_GROUP_UTILIZATION, XPUM_DUMP_COPY_ENGINE_GROUP_UTILIZATION, XPUM_DUMP_FREQUENCY_THROTTLE_REASON_GPU, XPUM_DUMP_MEDIA_ENGINE_FREQUENCY
show_date (boolean) – Controls timestamp format in dumps: ‘1’ includes full date and time, ‘0’ (default) includes only time.
tile_id (integer) – The tile to dump raw data from
- Status Codes:
200 OK – OK
400 Bad Request – Bad Request
500 Internal Server Error – Internal Error
- Response JSON Object:
dump_file_path (string) – The path to the file of dumped data
task_id (integer) – The task id
- GET /rest/v1/dump
List all raw data dump tasks
- Status Codes:
200 OK – OK
500 Internal Server Error – Internal Error
- Response JSON Object:
dump_task_ids[] (integer) – The id list of all tasks
- DELETE /rest/v1/dump/{taskId}
Stop a raw data dump task
- Parameters:
taskId (integer) – The raw data dump task id
- Status Codes:
200 OK – OK
404 Not Found – Task not found
500 Internal Server Error – Internal Error
- Response JSON Object:
dump_file_path (string) – The path to the file of dumped data
task_id (integer) – The task id
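The dump endpoints form a start and stop pair: POST returns a `task_id`, and DELETE on that id stops the task. The sketch below builds both requests without sending them. The base URL `https://xpum.example`, the helper names, the `Content-Type` header, and the task id `3` are assumptions for illustration.

```python
import json
import urllib.request


def build_start_dump(base_url, device_id, metrics_type_list, tile_id=None):
    """Build the POST request that starts a raw data dump task."""
    body = {"device_id": device_id,
            "metrics_type_list": list(metrics_type_list)}
    if tile_id is not None:  # tile_id is not marked as required
        body["tile_id"] = tile_id
    return urllib.request.Request(
        base_url.rstrip("/") + "/rest/v1/dump",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},  # assumed header
        method="POST",
    )


def build_stop_dump(base_url, task_id):
    """Build the DELETE request that stops a raw data dump task."""
    url = base_url.rstrip("/") + "/rest/v1/dump/" + str(int(task_id))
    return urllib.request.Request(url, method="DELETE")


# Hypothetical host and task id; metric names come from the list above.
start_req = build_start_dump(
    "https://xpum.example", 0,
    ["XPUM_DUMP_GPU_UTILIZATION", "XPUM_DUMP_POWER"],
)
stop_req = build_stop_dump("https://xpum.example", 3)
```

In practice the `task_id` passed to `build_stop_dump` would come from the JSON response to the start request.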
Sensor
- GET /rest/v1/sensor
Get sensor reading
- Status Codes:
200 OK – OK
500 Internal Server Error – Error
- Response JSON Object:
sensor_reading[].amc_index (number) – AMC index
sensor_reading[].sensor_high (number) – High bound of sensor reading
sensor_reading[].sensor_low (number) – Low bound of sensor reading
sensor_reading[].sensor_name (string) – Sensor name
sensor_reading[].sensor_unit (string) – Sensor unit
sensor_reading[].value (number) – Sensor reading value
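One possible use of the sensor response is to flag readings that fall outside their bounds. The sketch below assumes that `sensor_low` and `sensor_high` describe the acceptable range for `value`, which this reference does not state explicitly. The helper name `out_of_range`, the sensor names, and the numbers are made up.

```python
def out_of_range(sensor_response):
    """Return names of sensors whose value lies outside [low, high]."""
    flagged = []
    for reading in sensor_response.get("sensor_reading", []):
        low, high = reading["sensor_low"], reading["sensor_high"]
        if not (low <= reading["value"] <= high):
            flagged.append(reading["sensor_name"])
    return flagged


# Made-up payload following the field names documented above.
sample_sensors = {
    "sensor_reading": [
        {"amc_index": 0, "sensor_name": "sensor_a", "sensor_unit": "unit",
         "sensor_low": 0.0, "sensor_high": 100.0, "value": 42.5},
        {"amc_index": 0, "sensor_name": "sensor_b", "sensor_unit": "unit",
         "sensor_low": 10.0, "sensor_high": 14.0, "value": 15.2},
    ]
}
flagged = out_of_range(sample_sensors)
```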
vgpu
- GET /rest/v1/vgpu/precheck
Check if BIOS settings are ready to create virtual GPUs
- Status Codes:
200 OK – OK
500 Internal Server Error – Error
- Response JSON Object:
iommu_message (string) – IOMMU message
iommu_status (string) – IOMMU status
sriov_message (string) – SR-IOV message
stiov_status (string) – SR-IOV status
vmx_flag (string) – VMX Flag Check
vmx_message (string) – VMX flag message
- GET /rest/v1/devices/{deviceId}/vgpustats
Get statistics data of all virtual GPUs
- Status Codes:
200 OK – OK
500 Internal Server Error – Error
- Response JSON Object:
vf_list[].bdf_address (string) – BDF Address
vf_list[].metric_list[].metric_type (integer) – Metric Type
vf_list[].metric_list[].scale (integer) – Scale
vf_list[].metric_list[].value (integer) – Value