TMC Recovery mechanism in case of subsystem failure
This guide provides instructions to recover TMC Low when it enters the FAULT observation state.
The recovery steps involve issuing command-line instructions that can be executed from any Python runtime environment or script.
TMC Low Auto Recovery
Overview
The Telescope Monitoring and Control (TMC) Low system supports an auto-recovery mechanism to handle failures
that occur during execution of the AssignResources and Configure commands.
When TMC detects a failure of either command, it attempts to recover the affected subsystems automatically, depending on their
observation states.
—
Auto Recovery Scenarios
Pre-Requisite: Recovery is only possible after a command failure if the subsystems have either reached their target ObsState or successfully rolled back to their previous ObsState.
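The prerequisite can be read as a simple per-subsystem check. The sketch below is illustrative only and is not TMC code: the function name `is_recoverable` and the dictionary layout are assumptions. It treats a subsystem as recoverable when its current ObsState equals either the target ObsState of the failed command or the ObsState it held before the command.

```python
def is_recoverable(current, target, previous):
    """Return True if every subsystem reached its target ObsState or rolled back.

    Each argument is a dict mapping a subsystem name to an ObsState name.
    (Hypothetical helper; names and data layout are assumptions.)
    """
    return all(
        state in (target[name], previous[name])
        for name, state in current.items()
    )


# Configure failed: CSP rolled back to IDLE, SDP and MCCS reached READY.
previous = {"CSP": "IDLE", "SDP": "IDLE", "MCCS": "IDLE"}
target = {"CSP": "READY", "SDP": "READY", "MCCS": "READY"}

rolled_back = {"CSP": "IDLE", "SDP": "READY", "MCCS": "READY"}
print(is_recoverable(rolled_back, target, previous))  # True

# A subsystem still in CONFIGURING has neither completed nor rolled back.
stuck = {"CSP": "CONFIGURING", "SDP": "READY", "MCCS": "READY"}
print(is_recoverable(stuck, target, previous))  # False
```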
1. Configure Command Failure — Subsystems in Recoverable State
If the Configure command fails for any reason and the subsystems are in the following states:
| CSP Obs State | SDP Obs State | MCCS Obs State | Auto Recovery Action |
|---|---|---|---|
| | | | |
Example
Scenario: The Configure command fails due to a timeout in MCCS configuration.
Subsystem States:
- CSP -> Transitions -> IDLE
- SDP -> Transitions -> READY
- MCCS -> Transitions -> READY
Action Taken: TMC automatically issues End on SDP and MCCS subarrays, returning all subsystems (and the TMC Subarray) to the IDLE state.
—
2. Successive Configure Command Failure — All Subsystems in READY State
If a successive Configure command fails for any reason, and all subsystems remain in the READY state:
| CSP Obs State | SDP Obs State | MCCS Obs State | Auto Recovery Action |
|---|---|---|---|
| | | | |
Example
Scenario: The first Configure command succeeds. The next Configure command fails for SDP due to invalid configuration parameters.
Subsystem States:
- CSP -> Transitions -> READY
- SDP -> Transitions -> READY
- MCCS -> Transitions -> READY
Action Taken: TMC re-invokes Configure on SDP with the previous successful configuration data. Once successful, the TMC Subarray ObsState transitions back to READY.
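The idea of falling back to the previous successful configuration can be sketched as follows. This is a minimal illustration under stated assumptions, not TMC's implementation: the class `ConfigureWithFallback`, the callable `send_configure`, and the `scan_type` and `invalid` keys are all hypothetical.

```python
class ConfigureWithFallback:
    """Remember the last configuration that was applied successfully."""

    def __init__(self, send_configure):
        # send_configure is a hypothetical callable that raises on failure.
        self._send_configure = send_configure
        self._last_good = None

    def configure(self, config):
        """Apply config; on failure, re-apply the last good one.

        Returns True if config was applied, False if the fallback was used.
        """
        try:
            self._send_configure(config)
        except Exception:
            if self._last_good is None:
                raise  # first Configure failed, nothing to fall back to
            # Successive failure: restore the previous successful configuration.
            self._send_configure(self._last_good)
            return False
        self._last_good = config
        return True


applied = []  # records every configuration the fake subsystem accepted


def fake_sdp_configure(config):
    # Stand-in for a real subsystem call; rejects configs flagged as invalid.
    if config.get("invalid"):
        raise ValueError("invalid configuration parameters")
    applied.append(config)


sdp = ConfigureWithFallback(fake_sdp_configure)
first_ok = sdp.configure({"scan_type": "science"})
second_ok = sdp.configure({"scan_type": "calibration", "invalid": True})
print(first_ok, second_ok, applied[-1])
```

After the second call the fake subsystem holds the first configuration again, which mirrors the subarray returning to READY with its earlier settings.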
—
3. AssignResources Command Failure — Subsystems in Recoverable State
If the AssignResources command fails for any reason and the subsystems are in the following states:
| CSP Obs State | SDP Obs State | MCCS Obs State | Auto Recovery Action |
|---|---|---|---|
| | | | TMC invokes the ReleaseAllResources command on MCCS and SDP subarrays to bring them back to |
Example
Scenario: The AssignResources command fails due to a timeout in MCCS configuration.
Subsystem States:
- CSP -> Current ObsState -> EMPTY
- SDP -> Transitions -> IDLE
- MCCS -> Transitions -> IDLE
Action Taken: TMC automatically issues ReleaseAllResources on SDP and MCCS subarrays, returning all subsystems (and the TMC Subarray) to the EMPTY state.
—
Summary
| Condition | Subsystem States | Recovery Action |
|---|---|---|
| First | CSP: | Invoke |
| Successive | CSP: | Re-invoke |
| | CSP: | Invoke |
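The three scenarios can be condensed into a small decision function. The sketch below is derived only from the worked examples in this guide and is not TMC's actual logic: the function name `select_recovery_action`, its parameters, and the rule of targeting every subarray found in READY (or IDLE) are assumptions.

```python
def select_recovery_action(failed_command, obs_states, failed_on=(),
                           has_previous_config=False):
    """Return (command, target_subsystems) for the recovery step, or None.

    obs_states maps subsystem names ("CSP", "SDP", "MCCS") to ObsState names.
    failed_on lists the subsystems on which the command failed.
    """
    if failed_command == "Configure":
        all_ready = all(state == "READY" for state in obs_states.values())
        if has_previous_config and all_ready:
            # Successive Configure failure: re-apply the last good configuration.
            return ("Configure", sorted(failed_on))
        # First Configure failure: End the subarrays that reached READY.
        ready = sorted(n for n, s in obs_states.items() if s == "READY")
        return ("End", ready) if ready else None
    if failed_command == "AssignResources":
        # Release resources on the subarrays that reached IDLE.
        idle = sorted(n for n, s in obs_states.items() if s == "IDLE")
        return ("ReleaseAllResources", idle) if idle else None
    return None


print(select_recovery_action(
    "Configure", {"CSP": "IDLE", "SDP": "READY", "MCCS": "READY"}))
print(select_recovery_action(
    "Configure", {"CSP": "READY", "SDP": "READY", "MCCS": "READY"},
    failed_on=["SDP"], has_previous_config=True))
print(select_recovery_action(
    "AssignResources", {"CSP": "EMPTY", "SDP": "IDLE", "MCCS": "IDLE"}))
```

The three calls correspond to the three examples above and yield End on MCCS and SDP, Configure on SDP, and ReleaseAllResources on MCCS and SDP respectively.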
—
TMC Low in FAULT ObsState
TMC does not remain stuck in a transitional observation state such as RESOURCING or CONFIGURING.
Instead, it moves to the FAULT observation state in the following scenarios.
To recover from the FAULT observation state, follow the recovery steps listed for each scenario.
| Scenario | Steps to recover |
|---|---|
| | |
TMC Low not recovering from FAULT obsState
If the Restart() command fails to transition the TMC Low to the EMPTY observation state, please follow these steps:
1. Inspect all TMC Low leaf nodes: Manually visit each leaf node within the TMC Low hierarchy.
2. Identify the faulty subsystem: Check the obsState of each node to locate any subsystem that is not in the expected state.
3. Manually reset the faulty subsystem: Attempt to bring the identified faulty subsystem to the EMPTY observation state by applying corrective actions or issuing necessary commands.
4. Re-invoke Restart() on the TMC Low Subarray Node: After all subsystems are in a recoverable state, issue the Restart() command on the TMC Low Subarray Node to transition the system back to the EMPTY obsState.
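The inspection and restart steps above can be scripted from a Python session. The sketch below is a hedged outline, not an official procedure: the helper names are hypothetical, the assumption that EMPTY is the expected state for every inspected node is taken from the steps above, and the device names passed to `connect` must be replaced with the names used in your deployment. It relies only on the PyTango calls `tango.DeviceProxy` and `DeviceProxy.command_inout`.

```python
def find_unexpected(proxies, expected=("EMPTY",)):
    """Return {device_name: obsState name} for nodes not in an expected state."""
    unexpected = {}
    for name, proxy in proxies.items():
        state = proxy.obsState
        # An enum member exposes .name; any other value is compared via str(),
        # so a plain integer would need mapping to a state name first.
        state_name = getattr(state, "name", str(state))
        if state_name not in expected:
            unexpected[name] = state_name
    return unexpected


def restart_when_ready(subarray_node, inspected_nodes):
    """Invoke Restart on the subarray node only if no inspected node is unexpected.

    Returns the unexpected nodes (empty dict when Restart was issued).
    """
    faulty = find_unexpected(inspected_nodes)
    if faulty:
        return faulty  # reset these subsystems manually, then call again
    subarray_node.command_inout("Restart")
    return {}


def connect(device_names):
    """Create Tango proxies; needs PyTango and a reachable Tango database."""
    import tango  # imported lazily so the helpers above work without PyTango

    return {name: tango.DeviceProxy(name) for name in device_names}
```

A typical session would call `connect` with the leaf node device names, inspect the result of `find_unexpected`, reset any reported subsystem by hand, and then call `restart_when_ready` with the TMC Low Subarray Node proxy.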