Who really is Dr. DRE (@dr-dre)?¶
To enhance our team's effectiveness, we have implemented a weekly rotation system. Each week, a designated team member takes responsibility for managing routine and unexpected operations. This document provides an overview of the on-call responsibilities and includes important links to resources that will support you in this role.
Where to find the rotation¶
The rotation schedules can be found on our Jira Team Operations page (requires DFINITY Jira access, which can be obtained through Okta). There are two schedules, both following the same round-robin system:
- DRE Alerts: Handles automatic paging related to our infrastructure.
- DRE Ops Rotation: Determines who will act as @dr-dre (our Slack handle).
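As a rough illustration of how a weekly round-robin resolves to a person, here is a minimal sketch. The roster, the anchor date, and the `on_call_for` helper are all hypothetical; the authoritative schedules are the ones in Jira.

```python
from datetime import date

def on_call_for(day: date, members: list[str], anchor: date) -> str:
    """Return the member on call for the week containing `day`.

    `anchor` is the Monday on which members[0] started a rotation week.
    """
    weeks_elapsed = (day - anchor).days // 7
    return members[weeks_elapsed % len(members)]

# Hypothetical roster and anchor date (2024-01-01 was a Monday).
MEMBERS = ["alice", "bob", "carol"]
ANCHOR = date(2024, 1, 1)

print(on_call_for(date(2024, 1, 1), MEMBERS, ANCHOR))   # alice
print(on_call_for(date(2024, 1, 10), MEMBERS, ANCHOR))  # bob
print(on_call_for(date(2024, 1, 22), MEMBERS, ANCHOR))  # alice
```

Because both schedules follow the same round-robin system, the same calculation would apply to each; whether they share the same ordering or starting week is not something this sketch assumes.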
Why are there two schedules?
The two-schedule system was designed to separate responsibilities and ensure balance.
- DRE Alerts focuses on managing infrastructure alerts and operates only during working hours, as we don’t adhere to any strict SLA/SLO requirements.
- DRE Ops Rotation handles Slack pings and general team operations.
I am not getting paged for alerts?
We use the Jira cloud app for on-call and rotations.
To set it up, follow the document on Notion.
Regular activities¶
As Dr. DRE, your role for the week involves taking on several responsibilities. These include, but are not limited to:
1. Follow through on the IC OS release process¶
The release process is documented in detail here. In short:
- Follow the schedule presented on the rollout dashboard. If problems arise, diagnose using the low-level statuses from Airflow (the dashboard also links directly to the problem task in Airflow).
- Cut a new GuestOS & HostOS release on Thursday, and create any additional feature builds listed in the spreadsheet, as well as any security hotfixes.
- Ensure team engineers review the release notes through Friday.
- Ensure the release controller submits GuestOS & HostOS version elect proposals on Friday, and not earlier, so that community and DFINITY voters have sufficient time to review and vote without rushing.
- An in-depth explanation of the release process can be found on Notion.
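The weekly timing described above can be sketched as a small date calculation. This is only an illustration of the Thursday and Friday milestones; the function names and dictionary keys are hypothetical and are not part of the rollout dashboard or the release controller.

```python
from datetime import date, timedelta

def release_milestones(day_in_week: date) -> dict[str, date]:
    """Return the release milestones for the week containing `day_in_week`."""
    monday = day_in_week - timedelta(days=day_in_week.weekday())
    thursday = monday + timedelta(days=3)
    friday = monday + timedelta(days=4)
    return {
        "cut_release": thursday,             # cut GuestOS & HostOS release
        "release_notes_review_until": friday,
        "submit_elect_proposals": friday,    # Friday, not earlier
    }

def may_submit_elect_proposals(today: date) -> bool:
    """True once the Friday of the current week has been reached."""
    return today >= release_milestones(today)["submit_elect_proposals"]

week = release_milestones(date(2024, 1, 3))  # a Wednesday
print(week["cut_release"])                   # 2024-01-04
print(week["submit_elect_proposals"])        # 2024-01-05
print(may_submit_elect_proposals(date(2024, 1, 4)))  # False
```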
2. Review alerts for our clusters¶
- All alerts that our clusters send are aggregated in our Jira ops board.
- Heartbeats are present here.
What should I do if there are alerts?
- It's not expected that every alert can be resolved immediately or by a single team member.
- The key objective is to maintain the stability of our clusters.
- Evaluate the alert based on its severity and the affected cluster to determine if further action is required.
- Escalate or address issues as needed to ensure operations continue smoothly.
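One way to picture the "severity plus affected cluster" evaluation is the sketch below. The severity levels, the cluster names, and the three outcomes are assumptions made for illustration only; they are not the team's actual alert taxonomy or policy.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Alert:
    name: str
    severity: str  # hypothetical levels: "critical", "warning", "info"
    cluster: str   # hypothetical cluster name

# Assumption: clusters whose stability matters most to the organization.
HIGH_IMPACT_CLUSTERS = {"prod"}

def triage(alert: Alert) -> str:
    """Suggest a next step: "escalate", "address", or "monitor"."""
    if alert.severity == "critical" and alert.cluster in HIGH_IMPACT_CLUSTERS:
        return "escalate"
    if alert.severity in {"critical", "warning"}:
        return "address"
    return "monitor"

print(triage(Alert("DiskFull", "critical", "prod")))     # escalate
print(triage(Alert("DiskFull", "critical", "staging")))  # address
print(triage(Alert("CertExpiry", "info", "prod")))       # monitor
```

The point of the sketch is only that the decision depends on both inputs together; judgment still applies to anything the rules do not cover.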
3. Handle all notifications and answer all questions asked in the team's Slack channels¶
- #eng-dre: General channel for activities
- #eng-release: Questions related to release process
- #eng-release-bots: Automations send important notifications to this channel, which you must handle
- #eng-observability: Questions related to our observability
But I don't know the answers to all questions
- It’s perfectly fine not to have all the answers.
- Take the initiative to investigate the issue and see how you can assist.
- If you’re unable to resolve the question, redirect it to the appropriate team member.
- The primary goal is to support the organization and relieve pressure on the rest of the team during your on-call week.
4. Submit requested proposals¶
All requested proposals must:
- Be registered as a ticket under the DRE Ops Rotation queue
- Include clear requirements and expected outcomes
- Be followed through in a timely manner based on priority
Typical types of requested proposals are:
- Help in on-boarding or off-boarding of datacenters and node providers
- Firewall rule modifications
- Node rewards adjustment proposals (see Handoff operations below)
- Any other requested proposals
Tooling
For all regular ops we have sufficient tooling implemented in our dre tool. For new proposals and specific scenarios, it is your responsibility to add them to the tooling as new use cases arise.
5. Submit proposals conventionally submitted once a week¶
- Replace dead nodes
- Mainnet topology proposals, such as dre network --heal --optimize --ensure-operator-nodes-unassigned --ensure-operator-nodes-assigned --remove-cordoned-nodes or a subset of these operations. The operations are still not polished enough to be run automatically.
- Provider reward adjustment proposals, if any are needed that week. Please ask in #eng-dre if you don't know if any are needed.
Please register proposals as tickets under the DRE Ops Rotation queue, so adoption and progress can be tracked, and context can be observed by your teammates.
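Since the topology command may be run with only a subset of its operations, a small helper can assemble the command line and reject anything that is not one of the flags listed above. This is a sketch, not part of the dre tool: `build_network_command` is a hypothetical helper, and it only builds and prints the command rather than executing it.

```python
import shlex

# Flags copied from the command shown above; no other flags are assumed.
KNOWN_FLAGS = (
    "--heal",
    "--optimize",
    "--ensure-operator-nodes-unassigned",
    "--ensure-operator-nodes-assigned",
    "--remove-cordoned-nodes",
)

def build_network_command(selected: list[str]) -> list[str]:
    """Return the argv for `dre network` with a chosen subset of operations."""
    if not selected:
        raise ValueError("select at least one operation")
    unknown = [flag for flag in selected if flag not in KNOWN_FLAGS]
    if unknown:
        raise ValueError(f"unknown flags: {unknown}")
    # Emit flags in the same order as the full command, regardless of input order.
    return ["dre", "network", *[f for f in KNOWN_FLAGS if f in selected]]

cmd = build_network_command(["--remove-cordoned-nodes", "--heal"])
print(shlex.join(cmd))  # dre network --heal --remove-cordoned-nodes
```

Printing the command first makes it easy to paste the exact invocation into the tracking ticket before running it by hand.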
6. Monitor status and health of CI¶
- Weekly dependency upgrade jobs:
  - A GitHub Action runs weekly to automatically upgrade dependencies.
  - Dependabot also issues PRs regularly.
  - While some weeks result in straightforward updates, others may require manual intervention due to API changes or other breaking updates.
  - Review and address any issues with the generated pull request.
  - Ensure the fixes are implemented and attempt to merge the PR into the repository.
  - Maintaining compatibility between the IC repo and our repo reduces friction and ensures our tooling operates smoothly.
7. Drive progress on the DRE Ops Rotation task queue¶
Our DRE Ops rotation dashboard lets you view the queue. The queue exists to keep track of work falling under the Dr. DRE umbrella that may span multiple days or weeks. It contains a list of child tickets that you need to work on.
Tend to the queue at least once a day. Read and heed the guidelines in the umbrella epic. Here is a brief summary (which is not a substitute for reading the guidelines):
- Record (as tickets of type task) multi-day work under the umbrella of the DRE Ops Rotation, with the task queue ticket as the new ticket's epic.
- Drive progress on tasks that are not blocked.
- Mark blocked tasks as blocked.
- Record completion of tasks.
- Provide enough context there for your teammates to pick up ongoing work the week after.
- Move tickets that change scope out of the queue and into their own epic or project.
8. Handoff operations¶
- If there are any pending tasks or unresolved operations, it is your responsibility to inform the next on-call team member.
- Provide clear details on what needs to be addressed and any context that might help them pick up where you left off.
- Pass on information about any requested node rewards adjustments to the next on-call team member.
The DRE Ops Rotation dashboard is an invaluable aid in getting yourself in context as well as providing context to your teammates.