Monitor and troubleshoot the agent
This section describes some techniques for monitoring and troubleshooting agents, including how to verify that they are communicating with xMatters and how to troubleshoot configuration issues so you can ensure that your steps and integrations are processed successfully.
Status update alerts notify selected users when communication from an agent has not been detected for more than five minutes. This can happen when the agent service stops running, the host system is turned off or disconnected from the network, or when there is another event such as a network outage.
To configure status update recipients, see Set status alert recipients.
You can use an internal workflow's built-in subscription form to create a subscription to notify you when a specific integration encounters a processing error. This can happen if there's a syntax error in your script or your integration received unexpected inputs.
To subscribe to notifications when your integration encounters a processing error, see Subscribe to integration alerts.
Troubleshooting
If your agent is not appearing in the list of agents or appears to be disconnected, try the following troubleshooting steps.
Once you have installed an agent, it may take a few minutes before it appears in the list of installed agents in the web user interface.
Ensure that the host system is turned on, connected to the network, and that the agent service is running. The agent service is configured to restart automatically when the host computer is rebooted.
If the service fails to restart after the host computer is rebooted, check that the user or service account that initially installed the agent is still active. If the user has been deactivated or deleted, authentication requests between the agent and xMatters will fail, preventing the agent from starting.
For more information about starting the agent service, see Stop, start or restart an agent.
It is recommended to use the most up-to-date version of the agent. To see if there is a newer version available, log in to the xMatters user interface and view the status of the agent, as described in View details about an agent.
To update the agent, see Update an agent to the latest version.
An agent can experience connectivity issues when the clock on the host system becomes inaccurate. This is especially likely when an agent is installed on a virtual machine that experiences time drift. To prevent the agent from disconnecting, use a service such as NTP to keep the host system's clock synchronized with internet time.
If a previously-connected agent becomes disconnected even though the service is running and appears healthy, ensure that the time on the host system is accurate. System clock errors appear as a warning in the logs with the following structure:
2017-04-12 16:08:45,608 5422 [qtp726379593-159363] com.cloudagent.WebSocketConfiguration WARN --- An otherwise valid token failed the timestamp check with expiry=1492013171 domain="example.company.com",trace="d9ab8e88-aba1-41e4-a6a8-0d7214ca3a11"
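If you suspect clock drift, check the host clock before digging further. For example, on a Linux host that uses systemd, commands similar to the following show whether the clock is synchronized and turn on NTP synchronization (the exact commands depend on your distribution and time service):
timedatectl status
sudo timedatectl set-ntp true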
For SSL communication to work between xMatters and the agent, the JRE might need a local copy of the calling system's public certificate if the server's certificate is self-signed or if there is a problem with the server's commercial certificate. If this is the case, the logs include an error similar to the following when a workflow is triggered:
Script failed with message: JavaObject[org.springframework.web.client.ResourceAccessException: I/O error on POST request for "<url>": PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target; nested exception is javax.net.ssl.SSLHandshakeException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target (org.springframework.web.client.ResourceAccessException)]
The agent's copy of the certificate is kept in a "trust store"; in most cases, this is a file called cacerts stored in the Java JDK folder. The location should look similar to ...\xa\jre\windows\jdk-11.0.4\lib\security depending on your operating system and Java version.
Before adding a new certificate, be sure to back up the existing trust store.
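For example, from the trust store directory identified above, you can make a simple copy before running keytool (cacerts.bak is just an illustrative name):
- Linux: cp cacerts cacerts.bak
- Windows: copy cacerts cacerts.bak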
To add the certificate to the trust store, open a command window, navigate to Java's jre\bin folder, and then run the following command:
keytool -importcert -keystore ../lib/security/cacerts -storepass changeit -file /temp/somecert.cer -alias somecert
- Java's /jre/lib/security/cacerts file is the JRE's default trust store.
- changeit is the default password for any JRE's trust store.
- -file is followed by the path to the certificate that you want to add to the trust store. The certificate should be in either DER (binary) format or X.509 Base-64 encoded text format. Some versions of the JRE may also trust PKCS12-formatted certificates in "compatibility" mode.
- -alias allows you to specify an alias for the certificate, which makes it more convenient to list and manipulate the certificates in the store.
You can view certificates in the store with the keytool -list command. To see only a single certificate, use the -alias argument. For more detail, use -v:
keytool -list -keystore ../lib/security/cacerts -storepass changeit -alias somecert -v
To get a copy of the certificate, you can ask the administrator of the calling server, or simply navigate to the server's URL and use the browser's certificate export tool.
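If OpenSSL is available, you can also retrieve the certificate from the command line. For example, the following captures the server's leaf certificate in DER format (the host name and output path are placeholders):
openssl s_client -connect example.company.com:443 -showcerts </dev/null | openssl x509 -outform DER -out /temp/somecert.cer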
If your host system uses a proxy server to connect with xMatters, you can define its proxy configuration settings during installation of the agent. To update existing proxy configuration settings, you'll need to re-install the agent.
If the user that installed the agent has been deleted, the agent cannot authenticate and will fail to start. You can resolve this by changing which user the agent uses to authenticate:
- Add another user as a Registered User on the failing agent.
- On the Agents screen, click the Available tab and copy the values for XMATTERS_KEY and API_KEY.
- Log into the server where the agent is installed and edit or add the following values in the xagent-service.conf file, inserting the key values as indicated:
- wrapper.java.additional.5 = -Dwebsocket.secret=<XMATTERS_KEY VALUE>
- wrapper.java.additional.6 = -Downer.api.key=<API_KEY VALUE>
- Save the file and restart the agent.
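For example, on a Linux installation that uses systemd you might restart the service with a command like the following (the service name here is an assumption and may differ on your installation; see Stop, start or restart an agent for the exact steps):
sudo systemctl restart xmatters-xa
On Windows, restart the agent service from the Services console.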
If the above troubleshooting steps do not work, you can restart the agent service. For more information, see Stop, start or restart an agent.
When you configure a Flow Designer trigger to run on an agent, the step provides a trigger URL belonging to the agent. When you send a request to this URL, the agent queues the data locally and then forwards it to your xMatters instance. The request appears in the flow's Activity Panel, and xMatters triggers the step.
In some cases, you may receive a "202" response from the agent, but the request does not appear in the Activity Panel and the step is not triggered. This may indicate an issue with authentication: the agent is successfully receiving the request but the authentication is failing when the request reaches xMatters.
To troubleshoot this issue, try sending a test request directly to xMatters instead of via the agent. To do this, modify the request URL as follows:
- Change http to https.
- Replace the agent's IP address and port number with the name of your xMatters instance.
For example, if your URL looks like:
http://10.2.3.4:8081/api/integration/1/functions/...
Change it to:
https://my.instance.com/api/integration/1/functions/...
Send your request again, and you should receive a "202" response indicating that xMatters received the data for processing. If the problem is related to authentication, you will instead receive a "401" response. Double-check that you've configured the authentication settings and entered the details correctly, then try again.
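For example, you can send the test request with curl. The command below uses HTTP Basic authentication as one option; the URL, user name, password, and payload are placeholders, so substitute the full trigger URL and whatever authentication method your trigger is configured for:
curl -X POST -u myUser:myPassword -H "Content-Type: application/json" -d '{"test": "payload"}' "https://my.instance.com/api/integration/1/functions/..."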
If you're troubleshooting agent activity for a specific step in Flow Designer, try the following:
- Determine which agents the step is waiting for. This information is available on the Run Location page within the step configuration.
- Confirm that the associated agents are available and running properly, and use the troubleshooting steps above to address any issues.
- If the agent is disconnected but can be restored, check the selected agents and fix any connection issues.
- If the agent is disconnected and can't be restored, the job will eventually time out. You can configure new agents in the step's Run Location tab to handle new requests, but this will not impact jobs that are already queued or in progress.
- If the agents are running, look for the following signs:
- Errors with loading the integration: try updating the agent or, if the problem persists, contact Support.
- Slow running steps: review the execution or communication logs to see how many jobs are being processed. In most cases you can simply wait for the job to finish, but you might also want to look into ways to improve step performance, such as modifying the scripts.
- Many jobs executing quickly: you are likely experiencing an event flood, and may need to add more agents to improve flood handling.
- In all cases, you can try selecting new agents for the step to run on, but this will not affect jobs that are already queued.
More information
If the above actions don't fix the issues, take a look at the monitoring endpoints and system logs for more information.
You can monitor the status of the agent service on the host system by accessing some built-in URLs. You can access these URLs by typing them into your web browser or by making a GET request from a system that supports RESTful web requests. The URLs return information such as whether the agent is connected, the time of the last successful connection to xMatters, the version of the agent, and the number of jobs queued for processing.
The URL structure is in the format http://<localhost>:8081/<endpoint>, where <localhost> is the IP address of the host system and <endpoint> is the name of the monitoring endpoint you want to access.
Monitoring endpoints are configured to run on port 8081 by default. This value can be changed by editing the SERVER_PORT setting in the xa.conf configuration file on Linux installations, and by setting a global SERVER_PORT environment variable on Windows installations.
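For example, to move the monitoring endpoints to port 9090 (an illustrative value):
- Linux: add or change export SERVER_PORT=9090 in xa.conf, then restart the agent service.
- Windows: set a system-level SERVER_PORT environment variable to 9090 (for example, through System Properties or with setx /M SERVER_PORT 9090), then restart the agent service.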
The following sections describe endpoints that are available for monitoring the xMatters Agent.
/health
Returns information about the connection between the agent and the xMatters server. Example: http://localhost:8081/health
agent
status | "Up" if the agent is connected to the integration service; "Down" if the agent is not connected for some reason (for example, network issues). |
connected | True if the agent is currently connected to the xMatters server, false otherwise. |
lastConnectedTime | Timestamp of the most recent time a connection between the agent and the xMatters server was created. |
lastHeartbeatTime | Timestamp of the most recent time the agent sent a packet to the xMatters server. |
diskSpace
status | "Up" if the agent is connected to the integration service; "Down" if the agent is not connected for some reason (for example, network issues). |
total | Total disk space, in bytes, on the system where the agent is installed. |
free | Free disk space, in bytes, on the system where the agent is installed. |
threshold | Free disk space left, in bytes, before the health indicator changes to "Down". |
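For example, you can query the health endpoint with curl, either on the host itself or from another machine (replace localhost with the agent host's IP address when querying remotely). The response is a JSON document containing the fields described above:
curl http://localhost:8081/health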
/info
Returns information about the agent. Example: http://localhost:8081/info
version | Version of the agent. |
/metrics
Each endpoint returns information about local jobs or threads in the agent. Endpoint names are case-sensitive.
Example: http://localhost:8081/metrics/xa.maxNumOfWorkers
/metrics/xa.maxNumOfWorkers | The maximum number of threads that can be used by the agent. |
/metrics/xa.numOfWorkers | The number of threads currently in use by the agent. |
/metrics/xa.numOfActiveJobs | The number of jobs currently being processed. |
/metrics/xa.numOfJobsInWaiting | The number of processing jobs queued locally. |
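For example, to check how many jobs are waiting in the local queue (note the case-sensitive name):
curl http://localhost:8081/metrics/xa.numOfJobsInWaiting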
The agent logs information in a couple of different files.
For information on the general status of the agent, look in the agent-communication-xmatters.log file, found in the following locations:
- Linux: /var/log/xmatters/xmatters-xa/agent-communication-xmatters.log
- Windows: C:\Program Files\xa\xalog\agent-communication-xmatters.log
For Windows installations only:
- If the service has trouble starting up or shutting down, review the C:\Program Files\xa\servicelog\services.log file.
- If you encounter any issues with installation of upgrades, see the msilog file, found in the installation folder.
(For Linux installations, refer to the agent-communication-xmatters.log file.)
For information on your scripts, check the log files for each script, which include the request and response and any console.log statements in the script. They are named agent-<scriptName>.log (where <scriptName> is the name of your script), and can be found in the following locations:
- Linux: /var/log/xmatters/xmatters-xa/agent-myScript.log
- Windows: C:\Program Files\xa\xalog\agent-myScript.log
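For example, on a Linux host you can watch a script's log in real time while triggering the flow (agent-myScript.log is a placeholder for your script's log file):
tail -f /var/log/xmatters/xmatters-xa/agent-myScript.log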
You can adjust the options within the agent configuration file to control how long the log files are kept before being removed, and how much disk space the log files can take up. This allows you to specify how large a footprint the agent can occupy on your system, and helps you to better allocate your system resources.
The log settings are set at the agent level, but are calculated per file and affect the main agent log (agent-communication-xmatters.log), as well as each Flow Designer and integration log (the agent-<scriptName>.log and <integrationName>.log files). This means that the overall footprint required for the agent depends on two related factors:
- How many different integrations or flows you have, since each one requires its own set of log files.
- How frequently an integration runs and/or how verbose the logs are for each run. Integrations that do not run very often and those that are very simple will maintain a smaller footprint and will not require as much space.
By default, log files rotate on a daily basis, and are retained for 30 days, though size constraints can affect both of these settings as described in the following table.
LOG_HISTORY | How long, in days, log files are retained (default 30). |
LOG_FILE_SIZE | The maximum size of an individual log file (default 10MB). If this is exceeded, the agent will compress the file into an archive and create a new file for that date. For particularly active or busy integrations, or integrations where the logging is expansive (such as those with larger HTTP requests and/or response bodies), this can cause files to rotate more than once per day, resulting in more than 30 log files for a 30-day period. |
LOG_TOTAL | The maximum total size allowed for each set of log files (default 500MB). When the limit is reached, the agent will begin removing log files, beginning with the oldest first. |
To configure the log file options:
- On Linux systems, open the xa.conf file, and add or change the following settings:
# Number of days of log file history to keep
export LOG_HISTORY=30
# Max size of the current log file; e.g. 1000B / 10KB / 100MB / 1GB
export LOG_FILE_SIZE=10MB
# Total size allowed for all logs; e.g. 1000B / 10KB / 100MB / 1GB
export LOG_TOTAL=1GB
- On Windows systems, open the xagent-service.conf file, and add or change the following settings, replacing XX with sequence numbers not already used by other wrapper.java.additional entries:
wrapper.java.additional.XX = -DLOG_HISTORY=30
wrapper.java.additional.XX = -DLOG_FILE_SIZE=10MB
wrapper.java.additional.XX = -DLOG_TOTAL=1GB
Any changes you make to these settings will be overwritten when upgrading the agent. On Linux systems, overrides are copied to a backup file, but you will need to manually restore the values.