Deploying the Upgrade

These are the activities performed at deployment time. You will need access to a number of resources (Summit, TTS, BTS) at the sites, so make sure you have the necessary credentials.

Attention

As the person running the deployment, you have absolute control over the system to complete this process. No one should do anything without your express consent.

Important

Upgrading systems that control hardware, especially the camera CCD, cold, cryo and vacuum systems, must be done with care and should be coordinated with the hardware/software experts for those systems.

  1. Send all CSCs to the OFFLINE state
    • Go to the LOVE interface for the specific site and use any of the ScriptQueues to run the system_wide_shutdown.py script (under STANDARD). This sends all CSCs to the OFFLINE state.

    • If CSCs do not transition to OFFLINE with system_wide_shutdown.py, try running set_summary_state.py. An example configuration would be:

      data:
      - [ESS:118, OFFLINE]
      
    • WARNING: Some CSCs do not report OFFLINE; they instead show STANDBY as the last state seen. To confirm that they are indeed OFFLINE, check for heartbeats using Chronograf.

    • It is recommended to use LOVE for this, but if it’s not working, Nublado is a good fallback.

    • An overall status view is available from LOVE in the Summary state view (Summit, TTS, BTS).

    • You can also consult these dashboards on Chronograf. The names are the same across sites.

      • Heartbeats

      • AT Summary State Monitor

      • MT Summary State Monitor

      • Envsys Summary State Monitor

      • Calibration Systems Summary State Monitor

      • Observatory Systems Summary State Monitor

    • The Watcher MUST come down FIRST, to avoid a flurry of alarms going off.

    • The ScriptQueues MUST come down last.
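
The ordering rules above (Watcher first, ScriptQueues last) can be sketched as a small helper that splits a list of CSCs into batches and renders each batch in the `data:` layout of the set_summary_state.py example. This is a minimal sketch, not part of the official tooling: the helper names are hypothetical, the CSC names other than ESS:118 are illustrative, and it assumes each batch is submitted as a separate script run, one after another.

```python
# Sketch only: helper names are hypothetical; CSC names other than
# ESS:118 are illustrative assumptions.


def shutdown_batches(cscs):
    """Split CSC names into ordered batches: Watcher, everything else, ScriptQueues."""
    watcher = [c for c in cscs if c.split(":")[0] == "Watcher"]
    queues = [c for c in cscs if c.split(":")[0] == "ScriptQueue"]
    others = [c for c in cscs if c not in watcher and c not in queues]
    # Drop empty batches so callers never submit an empty configuration.
    return [batch for batch in (watcher, others, queues) if batch]


def to_config(batch, state="OFFLINE"):
    """Render one batch in the `data:` layout used by the example configuration."""
    lines = ["data:"]
    lines += [f"- [{name}, {state}]" for name in batch]
    return "\n".join(lines)


if __name__ == "__main__":
    for batch in shutdown_batches(["ESS:118", "ScriptQueue:1", "Watcher", "ATDome"]):
        print(to_config(batch))
        print()
```

Keeping the batches separate avoids relying on how a single script run orders its targets, which this page does not specify.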

  2. Clean up still-running CSCs/systems

    • To shut down the cameras, log into the mcm machines and stop the bridges using sudo systemctl stop (Summit, TTS, BTS).

    • You can work with the system principals to shut down the services.

    • Notify the camera upgrade team that the system is ready for Stage 1.

    • Shut down and clean up bare metal deployments (Summit only).

    • Clean up Kubernetes deployments:
      • Scripts are in lsst-ts/k8s-admin.

      • Ensure the correct cluster is set, then run:

        ./cleanup_all
        
      • To clean up Nublado:

        ./cleanup_nublado
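
Because the cleanup scripts act on whichever cluster is currently selected, a guard that checks the cluster before running them can prevent cleaning up the wrong site. The sketch below assumes the cluster is selected through the active kubectl context and that it is run from a checkout of lsst-ts/k8s-admin; the function names and the context name in the usage note are hypothetical.

```python
import subprocess


def current_context(run=subprocess.run):
    """Return the name of the active kubectl context."""
    result = run(
        ["kubectl", "config", "current-context"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


def guarded_cleanup(expected_context, script="./cleanup_all", run=subprocess.run):
    """Run `script` only if kubectl points at `expected_context`.

    `run` is injectable so the guard can be exercised without a cluster.
    """
    active = current_context(run)
    if active != expected_context:
        raise RuntimeError(
            f"kubectl context is {active!r}, expected {expected_context!r}; aborting"
        )
    return run([script], check=True)
```

For example, `guarded_cleanup("my-site-cluster")` would run ./cleanup_all only when the active context is the (hypothetical) `my-site-cluster`, and `guarded_cleanup("my-site-cluster", script="./cleanup_nublado")` does the same for Nublado.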
        
  3. With everything shut down, the configurations need to be updated before deployment starts.

    • Ensure the Phalanx branch (lsst-sqre/phalanx) contains all the necessary updates, then create a PR and merge it.

    • All other configuration repositories should have the necessary commits already on branches and pushed to GitHub.

    • Update configuration repositories on bare metal machine deployments (Summit only).

      • Unlike shutdown, only the T&S systems are handled here. DM and Camera are handled by the system principals.

      • Also, only certain T&S systems are handled here; the rest need to be coordinated with the system principals.

  4. Once all configurations are in place, deployment of the new system can begin.
    • Be patient with container pulling; this applies to everything containerized here.

    1. Update ESS Controllers (Summit only)

    2. Log in to the site-specific ArgoCD UI to sync the relevant applications:

      • Start by syncing science-platform.

      • Sync nublado.

      • Sync sasquatch if necessary, but check first, in case there are configuration changes that we don’t want to apply just yet.

      • Sync T&S applications, all under the telescope ArgoCD project. While the order doesn’t matter in principle, it is a good idea to start with a small application (like control-system-test). It is also useful to update LOVE before the rest of the control system applications, so that the state of the different CSCs can be monitored from the summary state view.

    3. Start up Camera Services (Summit, TTS, BTS).

      • This is handled by the Camera team for a Cycle upgrade, but it is done by the deployment team for a system restart.

    4. Use the site-specific Slack channel (Summit, TTS, BTS) to notify the people doing the camera upgrade that they can proceed to Stage 2.

    5. Start up Services on Bare Metal Deployments (Summit only).
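
The ArgoCD sync order described in sub-step 2 can be written down as a small planning helper, which is handy as a checklist before clicking through the UI. This is only a sketch: the function name is hypothetical, and the application names `love` and `auxtel` are assumptions about how applications are named in the telescope project.

```python
def plan_sync_order(telescope_apps, sync_sasquatch=False,
                    first="control-system-test", monitor="love"):
    """Return application names in the suggested sync order.

    `monitor` is the assumed ArgoCD name of the LOVE application.
    """
    platform = ["science-platform", "nublado"]
    if sync_sasquatch:
        # Only after checking for configuration changes that should wait.
        platform.append("sasquatch")
    # A small application first, then LOVE, so CSC states can be watched.
    early = [app for app in (first, monitor) if app in telescope_apps]
    rest = [app for app in telescope_apps if app not in early]
    return platform + early + rest


if __name__ == "__main__":
    for app in plan_sync_order(["auxtel", "love", "control-system-test"]):
        print(app)
```

The helper only produces an ordered list; the sync itself is still performed in the ArgoCD UI as described above.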

  5. Once the deployment steps have been executed, the system should be monitored to see if all CSCs come up into STANDBY

    • Some CSCs (ScriptQueues) should come up ENABLED.

    • Report any issues directly to the system principals (DMs are OK).

    • This step is complete when either all CSCs are in STANDBY/OFFLINE or the CSCs with issues cannot be fixed in a reasonable amount of time (~30 minutes).

    • If leaving this step with CSCs in non-working order, make sure to report that on the site-specific Slack channel.
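
The completion rule for this step can be sketched as a check over the last reported state of each CSC. This is a minimal sketch under stated assumptions: the function names are hypothetical, the state mapping would have to be filled in by hand or from a monitoring source not covered here, and CSC names other than ESS:118 are illustrative.

```python
import time

# States accepted for ordinary CSCs after deployment.
EXPECTED_DEFAULT = {"STANDBY", "OFFLINE"}


def csc_issues(states, enabled_prefixes=("ScriptQueue",)):
    """Return the CSCs whose reported state differs from the expected one."""
    issues = {}
    for name, state in states.items():
        if name.split(":")[0] in enabled_prefixes:
            ok = state == "ENABLED"  # ScriptQueues should come up ENABLED
        else:
            ok = state in EXPECTED_DEFAULT
        if not ok:
            issues[name] = state
    return issues


def step_finished(states, started_at, now=None, limit_s=30 * 60):
    """True when no CSC has issues or the ~30 minute budget is used up."""
    now = time.monotonic() if now is None else now
    return not csc_issues(states) or (now - started_at) >= limit_s
```

When `step_finished` returns True while `csc_issues` is still non-empty, the remaining entries are the ones to report on the Slack channel.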

  6. Some CSCs need to be ENABLED (Summit, TTS, BTS).

  7. If integration testing will not follow, use the site-specific Slack channel to tell users that they can use Nublado again.

Site Specific Variations