Deploying the Upgrade
These are the activities performed when the time for deployment comes around. You will need access to a number of resources (Summit, TTS, BTS) at the sites, so be sure that you have the credentials to access them.
Attention
As the person running the deployment, you have absolute control over the system to complete this process. No one should do anything without your express consent.
Important
Upgrading systems that control hardware, especially the camera CCD, cold, cryo, and vacuum systems, needs to be done with care and should be coordinated with the hardware/software experts for those systems.
- Send all CSCs to OFFLINE state

Go to the LOVE interface for the specific site and use any of the ScriptQueues to run the system_wide_shutdown.py script (under STANDARD). This will send all CSC systems to the OFFLINE state.

If CSCs do not transition to OFFLINE with system_wide_shutdown.py, try running set_summary_state.py. An example configuration would be:

    data:
      - [ESS:118, OFFLINE]

WARNING: Not all CSCs report OFFLINE; these will instead report STANDBY as the last state seen. To check that they are indeed OFFLINE, check for heartbeats using Chronograf.

It is recommended to use LOVE for this, but if it's not working, Nublado is a good fallback.
An overall status view is available from LOVE in the Summary state view (Summit, TTS, BTS).
You can also consult these dashboards on Chronograf; the names are the same across sites:
Heartbeats
AT Summary State Monitor
MT Summary State Monitor
Envsys Summary State Monitor
Calibration Systems Summary State Monitor
Observatory Systems Summary State Monitor
The Watcher MUST come down FIRST, to avoid a flurry of alarms going off.
The ScriptQueues MUST come down last.
- Clean up still-running CSCs/systems
To shut down the cameras, log into the mcm machines and stop the bridges using sudo systemctl stop (Summit, TTS, BTS). One can work with the system principals to shut down the services.
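A sketch of what this might look like on one of the mcm hosts; the unit name is a placeholder, so list the services first to confirm the actual bridge unit names for the site:

    # List running services and look for the camera bridge units
    systemctl list-units --type=service | grep -i bridge

    # Stop each bridge unit found above (the name here is a placeholder)
    sudo systemctl stop <bridge-unit>.service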
Notify the camera upgrade team that the system is ready for Stage 1.
Shut down and clean up bare metal deployments (Summit only).
- Clean up Kubernetes deployments:
Scripts are in lsst-ts/k8s-admin.
Ensure the correct cluster is set, then run:
./cleanup_all
To clean up Nublado:
./cleanup_nublado
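A minimal sketch of this cleanup, assuming kubectl is configured with a context for the site (the context name is a placeholder):

    # Confirm kubectl is pointing at the intended cluster before cleaning up
    kubectl config current-context
    # Switch contexts if necessary (placeholder name)
    kubectl config use-context <site-cluster>

    # From a checkout of lsst-ts/k8s-admin
    ./cleanup_all
    ./cleanup_nublado    # only if Nublado also needs to be cleaned up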
- With everything shut down, the configurations need to be updated before deployment starts.
Ensure the Phalanx branch (lsst-sqre/phalanx) contains all the necessary updates, then create a PR and merge it.
All other configuration repositories should have the necessary commits already on branches and pushed to GitHub.
Update configuration repositories on bare metal machine deployments (Summit only); a sketch is shown below.
Unlike shutdown, only the T&S systems are handled here; DM and Camera are handled by the system principals.
Also, only certain T&S systems are handled here; the rest need to be coordinated with the system principals.
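A minimal sketch of updating one of the configuration repositories on a bare metal host; the repository path and branch name are placeholders for the site-specific values:

    # Placeholder path to a configuration repository on the bare metal machine
    cd /path/to/config/repo

    # Fetch and check out the branch prepared for this upgrade, then update it
    git fetch --all
    git checkout <cycle-branch>
    git pull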
- Once all configurations are in place, deployment of the new system can begin.
Be patient with container pulling (goes for everything containerized here).
Update ESS Controllers (Summit only)
Log into the site-specific ArgoCD UI to sync the relevant applications:

- Start by syncing science-platform.
- Sync nublado.
- Sync sasquatch if necessary, but check first, in case there are configuration changes that we don't want to apply just yet.
- Sync the T&S applications, all under the telescope ArgoCD project. While the order doesn't matter in principle, it is a good idea to start with a small application (like control-system-test). It is also useful to update LOVE before the rest of the control system applications, as we can monitor the state of the different CSCs from the summary state view.
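If the UI is not an option, the same syncs can in principle be done from the command line; this is a sketch assuming the argocd CLI is installed and logged in to the site's ArgoCD instance, and the application names are illustrative (use the names shown in the ArgoCD UI):

    # Science platform pieces first
    argocd app sync science-platform
    argocd app sync nublado

    # Then a small T&S application as a first test, followed by LOVE
    argocd app sync control-system-test
    argocd app sync love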
Startup Camera Services (Summit, TTS, BTS).
This is handled by the Camera team for a Cycle upgrade, but it is done by the deployment team for a system restart.
Use the site-specific Slack channel (Summit, TTS, BTS) to notify the people doing the camera upgrade that they can proceed to Stage 2.
Startup Services on Bare Metal Deployments (Summit only).
Once the deployment steps have been executed, the system should be monitored to see if all CSCs come up into STANDBY. Some CSCs (ScriptQueues) should come up ENABLED.
Report any issues directly to the system principals (DMs are OK).
This step is completed when either all CSCs are in STANDBY/OFFLINE or CSCs with issues cannot be fixed in a reasonable amount of time (~30 minutes).
If leaving this step with CSCs in non-working order, make sure to report that on the site-specific Slack channel.
If not carrying on with integration testing, announce via the site-specific Slack channel that folks can use Nublado again.