Troubleshooting¶
This guide covers troubleshooting steps for common issues and errors.
Build Errors¶
When building the Docker containers, if the build fails with an error message like:
ERROR: failed to solve: golang:1.21.3-alpine3.17: failed to do request: Head "https://registry-1.docker.io/v2/library/golang/manifests/1.19-alpine3.16": dial tcp: lookup registry-1.docker.io on 127.0.0.53:53: read udp 127.0.0.1:35314->127.0.0.53:53: i/o timeout
make: *** [Makefile:106: docker-job-repository] Error 1
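there could be an issue with the system's Docker download rate limit. The i/o timeout on the registry-1.docker.io lookup can also indicate a DNS or network problem, so verify that the system can reach Docker Hub before retrying the build.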
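If the failure is transient, rerunning the build after a short wait may be enough. The sketch below wraps any command in a simple retry loop. The helper name retry_cmd, the attempt count, and the delay are illustrative assumptions and are not part of the AiCSD Makefile.

```shell
# retry_cmd is a hypothetical helper, not part of the AiCSD Makefile.
# Usage: retry_cmd <max_attempts> <delay_seconds> <command> [args...]
retry_cmd() {
  max_attempts=$1
  delay=$2
  shift 2
  attempt=1
  # Run the command until it succeeds or the attempts are exhausted.
  until "$@"; do
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "giving up after $attempt attempt(s): $*" >&2
      return 1
    fi
    echo "attempt $attempt failed; retrying in ${delay}s" >&2
    sleep "$delay"
    attempt=$((attempt + 1))
  done
  return 0
}
```

For example, `retry_cmd 3 30 make docker-job-repository` would try the failing target from the error above up to three times, waiting 30 seconds between attempts.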
Job Status¶
The table below lists the status and error messages associated with jobs.
Job Status | Description | Next Steps |
---|---|---|
Complete | Job has been successfully processed. | No next steps. |
Incomplete | Job is still being processed. | Check the logs of the container for the service that owns the job. If a container has stopped running, restart that container. Then call the retry endpoint for the service. To learn about retry endpoints, see Retry Logic. |
No Pipeline Found | No tasks have a job selector that matches with the provided image. | Ensure that the desired task has the appropriate job selector. To modify a task in the Web UI, follow the instructions for Update Task. |
Pipeline Error | The pipeline processing the image failed. | Check the job Error Details column or QC flags field for details about the error. See View Jobs in Web UI. Also, check the log for the pipeline container. If the error is permissions-based, the logs may say permission denied or failed to write file. In this case, on the Ubuntu system, run make files or sudo make files. |
Transmission Error | The job could not be transferred to the message bus. | Check the logs of the service in the Error Details column. See View Jobs in Web UI. If necessary, call the retry endpoint for the appropriate service, restart the pipeline and/or restart the Redis container. To learn about retry endpoints, see Retry Logic. |
File Error (Input) | The input file could not be transferred from the File Sender OEM to the File Receiver Gateway. | Check the logs of the File Sender OEM. If the owner of the job is still the File Sender OEM, then call the retry endpoint using instructions in Retry Logic. |
File Error (Output) | The output file(s) could not be transferred from the File Sender Gateway to the File Receiver OEM. | To learn how to check the error details for the failing output file(s), refer to the section See Output File Status Information. Check the logs of the File Receiver OEM and the File Sender Gateway for the job. If the owner of the job is still the File Receiver OEM, call the retry endpoint for the File Sender Gateway and File Receiver OEM using the instructions in Retry Logic. |
View Jobs in Web UI¶
To see job-level errors in the Web UI:
- Choose the View Jobs tab.
- See the columns Job Status, Pipeline Status, and Error Details. The Error Details column displays detailed information about the error.
- Compare the contents of these columns with the Job Status table above.
Example
The Error Details column uses this format: (<service name>): <error>.
The <service name> is the service to investigate.
The <error> is the cause of the failure.
Example:
(data-organizer): no tasks could be matched to the input file name
For information about how to view a container's logs, see Check Logs.
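When Error Details values are copied out of the Web UI, the format described above can be split mechanically. The following is a minimal sketch using sed; the function names error_service and error_message are hypothetical and are not part of AiCSD.

```shell
# error_service and error_message are hypothetical helpers that split an
# Error Details value of the form "(<service name>): <error>".

# Print the service name found between the leading parentheses.
error_service() {
  printf '%s\n' "$1" | sed -n 's/^(\([^)]*\)): .*$/\1/p'
}

# Print the error text that follows "(<service name>): ".
error_message() {
  printf '%s\n' "$1" | sed -n 's/^([^)]*): \(.*\)$/\1/p'
}

# Example with the value shown above:
# error_service '(data-organizer): no tasks could be matched to the input file name'
# prints: data-organizer
```

Both functions print nothing when the input does not match the expected format.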
Output File Status¶
The output file status messages are listed in the table below.
File Status | Description |
---|---|
File Complete | The file is done processing. |
File Incomplete | The file is still processing. |
File Invalid | The file could not be validated (e.g. hostname mismatch or file not found). |
File Transmission Failed | The file could not be transmitted from the File Sender Gateway to the File Receiver OEM. |
File Write Failed | The file could not be written to the OEM system. |
File Archival Failed | The file on the gateway could not be archived by the File Sender Gateway. |
See Output File Status Information¶
For information about the output file status:
- Click the View Jobs tab to see the Jobs page for the specific job.
- Click Output Files. The output file view indicates which files failed.
- Hover over the file status and file name, as shown in Figures 2 and 3, to see the specific error that occurred.
Check the Error Details column to determine the service that failed to process the file. From there, trace the logs back to see the context of the error. If necessary, restart the affected containers or call the appropriate retry endpoints. See Retry Logic.
Check Logs¶
The following sections describe how to check logs.
Note
Run the instructions on the correct system, Gateway or OEM.
Portainer¶
Use Portainer to view the status of AiCSD containers.
On the Gateway system:
- Open a terminal and run:
make run-portainer
- Navigate to Portainer.
Note
The link to Portainer (above) refers to http://localhost:9000. If viewing from the OEM system, use the IP address of the Gateway system instead of localhost.
- Log in and navigate to the containers view. This view shows all the running and stopped containers.
- For the desired container, under the Quick actions column, click the paper icon to open the logs for the selected container.
Command Line¶
Use command line instructions to view the status of AiCSD containers.
On the Gateway or OEM system:
- Open a terminal and run:
docker ps
This command lists all the running containers.
- Select the CONTAINER ID and use that to run:
docker logs <container_id>
This command displays the logs for the selected container.
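Container logs can be long, so it may help to filter them for likely failures. The sketch below is one way to do that. The helper name filter_problem_lines is hypothetical, the level=ERROR pattern assumes the key=value log format shown elsewhere in this guide, and the other two patterns are the permission messages mentioned in the Job Status table.

```shell
# filter_problem_lines is a hypothetical helper. It reads log text on stdin
# and keeps only lines that look like failures (case-insensitive match).
filter_problem_lines() {
  grep -i -E 'level=error|permission denied|failed to write file'
}

# Example (requires Docker and a real container ID):
# docker logs <container_id> 2>&1 | filter_problem_lines
```

The `2>&1` redirect is there because docker logs sends the container's stderr stream to stderr, which a plain pipe would not pass to the filter.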
Export Logs (Gateway)¶
To export logs from all services (AiCSD and EdgeX), run:
make logs
Retry Logic¶
Every service has a retry endpoint, which is documented in that service's API documentation. See Retry API Table.
A service will process jobs it owns when:
- Retry endpoints are called.
- The service starts up.
Retry API Table¶
Each service has its own retry endpoint, which is shown in the API documentation for that service.
Microservice Retry Endpoint | Port |
---|---|
data-organizer | 59781 |
file-sender-oem | 59782 |
file-receiver-gateway | 59783 |
task-launcher | 59785 |
file-sender-gateway | 59786 |
file-receiver-oem | 59787 |
Retry¶
On the Gateway or OEM system, use curl to call the retry endpoint from a Linux terminal:
curl -X POST http://<hostname>:<port-number>/api/v1/retry -H 'Content-Type: application/json' -d <payload>
Note
For retry endpoints that do not have a payload, remove -H 'Content-Type: application/json' -d <payload> from the command above.
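To avoid looking up ports by hand, the Retry API Table can be encoded in a small lookup. This is only a sketch: retry_port and retry_url are hypothetical helper names, and the ports are copied from the table above.

```shell
# retry_port and retry_url are hypothetical helpers built from the
# Retry API Table; they are not part of AiCSD.

# Print the retry port for a service name, or fail for an unknown name.
retry_port() {
  case "$1" in
    data-organizer)        echo 59781 ;;
    file-sender-oem)       echo 59782 ;;
    file-receiver-gateway) echo 59783 ;;
    task-launcher)         echo 59785 ;;
    file-sender-gateway)   echo 59786 ;;
    file-receiver-oem)     echo 59787 ;;
    *) echo "unknown service: $1" >&2; return 1 ;;
  esac
}

# Usage: retry_url <hostname> <service>
retry_url() {
  port=$(retry_port "$2") || return 1
  echo "http://$1:$port/api/v1/retry"
}
```

For an endpoint that takes no payload, the call could then look like `curl -X POST "$(retry_url localhost data-organizer)"`. Check the API documentation of the service to confirm whether a payload is required.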
Retry Timeout Window¶
The task launcher has a retry timeout window. This window should be greater than the time it takes to process an input file.
Change this value in the task-launcher configuration.toml before the service is started or in Consul.
If the change is made in the configuration.toml, the task-launcher service must be rebuilt:
- Run:
make docker-task-launcher
- Restart the service.
If the change is made in Consul, the task-launcher service must be restarted.
Security¶
Regenerating SSH Keys (Two System Only)¶
Any time the SSH keys are regenerated on a two-system setup, the public key must be copied again from the Gateway system to the OEM system. Once the key is copied over to the OEM, the OEM services must be rebuilt so that they have a copy of the new key.
Consul¶
With the implementation of the new security features, Consul requires a security token for user authentication.
- Open Consul in a browser.
- Generate a token to access Consul and copy it:
make get-consul-acl-token
- Click Log in with a different token in Consul and paste the generated token.
- Once authenticated in Consul, it is possible to change configurations for any of the AiCSD services.
- For any settings (apart from Writable settings), the affected service must be restarted in order for the change to be applied.
Server Misbehaving (Two System Setup)¶
If the logs contain an error similar to:
Post "https://<system_name>:8443/job-repository/api/v1/job": dial tcp: lookup <system_name> on 127.0.0.11:53: server misbehaving
it may be necessary to stop the services, clean up, and ensure that the hostname can be resolved properly.
If the hostname is not set properly, it is possible to modify the /etc/hosts file to add the IP address and server name.
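Before editing /etc/hosts, it can be useful to check whether the server name is already listed. The sketch below scans a hosts-format file (an IP address followed by one or more names per line) for a given name. The helper name hosts_has_entry is hypothetical.

```shell
# hosts_has_entry is a hypothetical helper.
# Usage: hosts_has_entry <hosts_file> <hostname>
# Succeeds when any non-comment line of the file lists the hostname.
hosts_has_entry() {
  awk -v name="$2" '
    { sub(/#.*/, "") }   # drop comments before inspecting the fields
    { for (i = 2; i <= NF; i++) if ($i == name) found = 1 }
    END { exit (found ? 0 : 1) }
  ' "$1"
}
```

For example, `hosts_has_entry /etc/hosts <system_name> || echo "no entry found"` reports a missing entry. On Linux, `getent hosts <system_name>` additionally shows what the system resolver returns for the name.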
OEM Services Hang¶
If any of the OEM services (e.g., app-file-watcher) have an error that looks like:
level=INFO ts=2023-06-13T19:10:25.315559848Z app=app-file-watcher source=methods.go:138 msg="using Unix Domain Socket at unix:///tmp/edgex/secrets/spiffe/public/api.sock"
then the services may have hung.
Here are some potential fixes for this issue:
- Check that systems are on the same network by verifying that the Gateway system can ping the IP address of the OEM system.
- Check that the public key edgex-res/id_rsa.pub on the Gateway and the edgex-res/remote/sshd-remote/authorized_keys file on the OEM are the same.
- Verify that the TUNNEL_HOST: <OEM_IP_ADDRESS> address in the docker-compose-edgex-spiffe-spire.yml matches the OEM system.
- Ensure that both the oem-ssh-proxy on the Gateway and the remote-spire-agent on the OEM have the appropriate logs as described in System Setup.
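Because the two key files live on different systems, one way to compare them is to print a digest on each system and compare the values by eye. The sketch below does that. The helper name key_digest is hypothetical, and it assumes each key line starts with the key type followed by the key material, with no options prefix.

```shell
# key_digest is a hypothetical helper. It prints a SHA-256 digest of the key
# type and key material (the first two fields) of each line with at least two
# fields, so a differing trailing comment does not change the result.
key_digest() {
  awk 'NF >= 2 { print $1, $2 }' "$1" | sha256sum | awk '{ print $1 }'
}

# On the Gateway: key_digest edgex-res/id_rsa.pub
# On the OEM:     key_digest edgex-res/remote/sshd-remote/authorized_keys
```

If the two digests differ, the keys do not match and the public key should be copied to the OEM again.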
Golang Tool Issues¶
If you try to run Golang tools such as make auto-verify-clients or make client-update and run into mockery: command not found, there is most likely an issue with environment variables. A similar issue, go-test-report: not found, can occur when running make test-report.
The fix can be found below:
Note
You will need sudo privileges to implement this fix.
- At the bottom of the $HOME/.bashrc file, add the following commands:
export GOROOT=/usr/local/go
export GOPATH=$HOME/go
export PATH=$GOPATH/bin:$GOROOT/bin:$PATH:$HOME/bin
Note
You can edit the file with sudo nano $HOME/.bashrc
- Then run the following command to reload the terminal and set the environment variables:
source $HOME/.bashrc
- Try running make auto-verify-clients, make client-update, or make test-report again.
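To confirm that the environment change took effect before rerunning the make targets, a quick PATH check may help. This is a minimal sketch; check_tools is a hypothetical helper name.

```shell
# check_tools is a hypothetical helper. It reports every named command that
# cannot be found on PATH and fails if any is missing.
check_tools() {
  missing=0
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "not found on PATH: $tool" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

Running `check_tools go mockery go-test-report` in a new terminal should print nothing if all three tools can be found.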
Dashboards¶
Error or Issue | Possible Solution |
---|---|
Dashboards Tab: This site can't be reached and localhost refused to connect. | To make sure monitoring is properly set up, refer to the Monitoring Overview. For further issues, see Check Logs above. |
Other related issues | If experiencing additional dashboard issues, refer to the Tools Troubleshooting documentation. |
Next up¶
To understand how images are processed, see General Information.
To stop and clean up the services, see Tear Down.
BSD 3-Clause License: See License.