Typical Failure Scenarios

An AtScale cluster responds to different failures in different ways. This topic describes the behavior of the system under several common failure scenarios.

Scenario: Loss of physical Host 1

Description: Host 1 (see Figure 1 in Architecture of an AtScale Cluster) physically fails.

Actions:

  • The AtScale DBMS on Host 2 detects the loss of the master through the AtScale Coordinator and automatically promotes itself to master.
  • The internal load balancer routes database traffic to the Host 2 AtScale DBMS.
  • The Host 2 engine detects the loss of Host 1.
  • The external load balancer detects the loss of Host 1 and routes all traffic to Host 2 services (engine and Design Center).
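The sequence above can be sketched as a small model. This is an illustration only, not AtScale's implementation: the Host class, the elect_master and route_targets helpers, and the host names are all hypothetical, and the promotion rule (first healthy host wins) is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Host:
    name: str
    healthy: bool = True


def elect_master(current_master, hosts):
    """Keep the current master while it is healthy; otherwise promote the
    first healthy host (assumed rule, for illustration only)."""
    by_name = {h.name: h for h in hosts}
    if current_master in by_name and by_name[current_master].healthy:
        return current_master
    for h in hosts:
        if h.healthy:
            return h.name
    return None


def route_targets(hosts):
    """Hosts a load balancer would send traffic to: the healthy ones."""
    return [h.name for h in hosts if h.healthy]


hosts = [Host("host1"), Host("host2")]
master = "host1"

hosts[0].healthy = False              # Host 1 physically fails
master = elect_master(master, hosts)  # the Host 2 DBMS takes over as master
targets = route_targets(hosts)        # load balancers stop routing to Host 1
```

After the failure, both the database master and the routing targets point at Host 2 only.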

Scenario: The AtScale engine cannot reach the Coordinator service due to network degradation

Description: The Host 1 AtScale engine is unable to contact the AtScale Coordinator service due to a degraded network.

Actions:

  • The Host 1 engine detects loss of communication with the AtScale Coordinator and shuts itself down. The Host 1 engine announces its departure to the other AtScale engine.
  • The Host 2 engine receives notice that the Host 1 engine is shutting down. The external load balancer detects the loss of the Host 1 engine and routes traffic only to the Host 2 engine.
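The self-shutdown behavior described above resembles a watchdog pattern. The sketch below is a minimal model under stated assumptions: the EngineWatchdog class, the 5-second timeout, and the notification mechanism are hypothetical and do not correspond to documented AtScale settings. Time is passed in explicitly so that the example is deterministic.

```python
class EngineWatchdog:
    """Hypothetical watchdog: stops the engine when the Coordinator is unreachable."""

    def __init__(self, name, timeout_s, notify_peers):
        self.name = name
        self.timeout_s = timeout_s        # assumed grace period, not an AtScale setting
        self.notify_peers = notify_peers  # callables that deliver the departure notice
        self.last_contact = 0.0
        self.running = True

    def coordinator_contact(self, now):
        """Record a successful exchange with the Coordinator."""
        self.last_contact = now

    def tick(self, now):
        """Shut down and announce departure if the Coordinator has been silent too long."""
        if self.running and now - self.last_contact > self.timeout_s:
            for notify in self.notify_peers:
                notify(f"{self.name} is shutting down")
            self.running = False
        return self.running


notices = []  # stands in for the Host 2 engine's inbox
watchdog = EngineWatchdog("host1-engine", timeout_s=5.0, notify_peers=[notices.append])
watchdog.coordinator_contact(now=0.0)
still_up = watchdog.tick(now=3.0)    # within the grace period
after_loss = watchdog.tick(now=6.0)  # Coordinator silent for longer than the timeout
```

The engine announces its departure exactly once, at the moment it stops; later ticks do not repeat the notice.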

Scenario: The AtScale engine runs out of memory

Description: A series of unconstrained queries causes the Host 1 AtScale engine to run out of memory and crash.

Actions:

  • AtScale Service Control restarts the Host 1 engine process. If the restart happens quickly enough, no other actions are taken. However, if the restart happens slowly, the following actions may occur:

    • The Host 2 engine detects the loss of the Host 1 engine.
    • The external load balancer detects the loss of the Host 1 engine and routes traffic only to the Host 2 engine.
    • Once the Host 1 engine is running again, the external load balancer detects its presence and resumes routing traffic to both the Host 1 and Host 2 engines.
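Whether a restart counts as quick or slow depends on how the external load balancer is configured to probe the engines. The sketch below assumes a common policy that the source does not specify: a host is ejected after a fixed number of consecutive failed health checks and readmitted on the first successful one. The routing_timeline function, the check times, and the threshold are hypothetical.

```python
def routing_timeline(down_from, down_until, check_times, failures_to_eject=2):
    """Return (time, targets) for each health check of a two-engine pool.

    Host 1 is down in the interval [down_from, down_until). It is ejected after
    `failures_to_eject` consecutive failed checks and readmitted on the first
    successful check (assumed policy, for illustration only).
    """
    timeline, misses = [], 0
    for t in check_times:
        host1_up = not (down_from <= t < down_until)
        misses = 0 if host1_up else misses + 1
        if misses < failures_to_eject:
            targets = ["host1", "host2"]
        else:
            targets = ["host2"]
        timeline.append((t, targets))
    return timeline


# Fast restart: the engine is back before any health check notices.
fast = routing_timeline(down_from=1, down_until=3, check_times=[0, 5, 10])

# Slow restart: Host 1 misses enough checks to be ejected, then returns.
slow = routing_timeline(down_from=1, down_until=18, check_times=[0, 5, 10, 15, 20])
```

In the fast case, Host 1 never leaves the pool. In the slow case, traffic goes only to Host 2 for the checks at 10 and 15, and both hosts receive traffic again from the check at 20.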

Scenario: AtScale engines on Host 1 and Host 2 can communicate with the AtScale Coordinator service but not with each other

Description: The network conditions between the AtScale engines on Host 1 and Host 2 are degraded, preventing timely communication between Host 1 and Host 2.

Actions:

  • The newer host shuts itself down.
  • The external load balancer detects the loss of the newer host and routes traffic only to the older host.
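The tie-break above can be expressed as a single rule. The sketch assumes that "newest" means the engine that joined the cluster most recently, which the source does not define; the engine_to_shut_down function, the engine names, and the join timestamps are hypothetical.

```python
def engine_to_shut_down(join_times):
    """Return the engine with the latest join time (assumed meaning of 'newest')."""
    return max(join_times, key=join_times.get)


# Hypothetical join timestamps, in seconds since some common epoch.
join_times = {"host1-engine": 100, "host2-engine": 250}

leaving = engine_to_shut_down(join_times)
survivors = [name for name in join_times if name != leaving]
```

Because both engines can still reach the Coordinator, each can apply the same rule to the same membership data and reach the same decision without talking to the other.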

Scenario: Loss of physical Host 3 (AtScale Coordinator)

Description: Host 3 physically fails.

Actions:

  • The Coordinators on Hosts 1 and 2 detect the loss of Host 3 (the AtScale Coordinator host) and adapt the Coordinator cluster to operate without the Host 3 Coordinator.
  • During this process, the AtScale engines on Hosts 1 and 2 may restart themselves to ensure that the engine cluster is re-formed in a stable state. This avoids a split-brain scenario, in which the engines act independently of one another.
  • The Coordinator on Host 3 may come back online and rejoin the other Coordinators at any time without any impact.
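One way to see why a three-member Coordinator cluster can keep operating after losing Host 3 is the majority rule used by many coordination services. The source does not state that the AtScale Coordinator uses majority quorum, so treat the has_quorum function below as an assumption made for illustration.

```python
def has_quorum(alive, total):
    """True if the surviving members form a strict majority (assumed rule)."""
    return alive > total // 2


after_host3_loss = has_quorum(alive=2, total=3)  # Hosts 1 and 2 remain
after_second_loss = has_quorum(alive=1, total=3) # only one Coordinator remains
```

Under this assumption, two of three Coordinators still form a majority, while a single survivor does not.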