Who really is Dr. DRE (`@dr-dre`)?¶
To enhance our team's effectiveness, we have implemented a weekly rotation system. Each week, a designated team member takes responsibility for managing routine and unexpected operations. This document provides an overview of the on-call responsibilities and includes important links to resources that will support you in this role.
Where to find the rotation¶
The rotation schedules can be found on our Jira Team Operations page. There are two schedules, both following the same round-robin system:
- DRE Alerts: Handles automatic paging related to our infrastructure.
- DRE Ops Rotation: Determines who will act as `@dr-dre` (our Slack handle).
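As a rough illustration of how a weekly round-robin rotation assigns people, here is a minimal Python sketch. The roster, the start date, and the function name `on_call_for` are hypothetical and exist only for this example; the authoritative schedule is the one in Jira.

```python
from datetime import date


def on_call_for(day: date, members: list[str], rotation_start: date) -> str:
    """Return the member on call for the week that contains `day`.

    Weeks are counted in 7-day blocks starting at `rotation_start`,
    and the roster wraps around once everyone has had a turn.
    """
    if not members:
        raise ValueError("rotation needs at least one member")
    weeks_elapsed = (day - rotation_start).days // 7
    return members[weeks_elapsed % len(members)]


# Hypothetical roster and start date (a Monday), for illustration only.
members = ["alice", "bob", "carol"]
rotation_start = date(2024, 1, 1)

print(on_call_for(date(2024, 1, 3), members, rotation_start))   # week 1
print(on_call_for(date(2024, 1, 8), members, rotation_start))   # week 2
print(on_call_for(date(2024, 1, 22), members, rotation_start))  # wraps to the first member
```

With a three-person roster, each person is on call every third week; adding or removing a member changes only the length of the cycle.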
Why are there two schedules?
The two-schedule system was designed to separate responsibilities and ensure balance.
- DRE Alerts focuses on managing infrastructure alerts and operates only during working hours, as we don’t adhere to any strict SLA/SLO requirements.
- DRE Ops Rotation handles Slack pings and general team operations.
Why am I not getting paged for alerts?
We use the Jira cloud app for on-call and rotations.
To set it up, follow the document on Notion.
Regular activities¶
As Dr. DRE, your role for the week involves taking on several responsibilities. These include, but are not limited to:
1. Follow through the release process¶
The release process is documented here. In short:
- Follow the schedule presented on the rollout dashboard.
- Follow the statuses visible in Airflow.
- Vote on the proposals being submitted by the automation.
- Cut a new release on Thursday and create any additional feature builds.
- If needed, create ordinary hotfixes or security hotfixes for that week.
- An in-depth explanation of the release process can be found on Notion.
2. Review alerts for our clusters¶
- All alerts sent by our clusters are aggregated in our Jira ops board.
- Heartbeats are present here.
What should I do if there are alerts?
- It's not expected that every alert can be resolved immediately or by a single team member.
- The key objective is to maintain the stability of our clusters.
- Evaluate the alert based on its severity and the affected cluster to determine if further action is required.
- Escalate or address issues as needed to ensure operations continue smoothly.
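The evaluation step above can be sketched as a small decision function. This is only an illustration of weighing severity against the affected cluster: the severity levels, the cluster names, and the three outcomes (`escalate`, `investigate`, `monitor`) are assumptions made for the example, not values taken from our alerting setup.

```python
from dataclasses import dataclass

# Hypothetical severity levels and cluster names, for illustration only.
SEVERITY_RANK = {"info": 0, "warning": 1, "critical": 2}
PRODUCTION_CLUSTERS = {"production"}


@dataclass
class Alert:
    name: str
    severity: str
    cluster: str


def triage(alert: Alert) -> str:
    """Suggest a next step based on severity and the affected cluster."""
    # Unknown severities are treated as critical so they are never ignored.
    rank = SEVERITY_RANK.get(alert.severity, SEVERITY_RANK["critical"])
    in_production = alert.cluster in PRODUCTION_CLUSTERS

    if rank >= 2 and in_production:
        return "escalate"      # stability of a production cluster is at risk
    if rank >= 2 or (rank == 1 and in_production):
        return "investigate"   # needs attention, but not necessarily right now
    return "monitor"           # keep an eye on it and note it for handover


print(triage(Alert("disk-full", "critical", "production")))
print(triage(Alert("disk-full", "critical", "staging")))
print(triage(Alert("slow-scrape", "warning", "staging")))
```

In practice the judgment call stays with the person on call; a table like this is mainly useful for making the default reaction explicit.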
3. Answer all questions asked in the team's Slack channels¶
- `#eng-dre`: General channel for activities
- `#eng-release`: Questions related to the release process
- `#eng-observability`: Questions related to our observability
But I don't know the answers to all questions
- It’s perfectly fine not to have all the answers.
- Take the initiative to investigate the issue and see how you can assist.
- If you’re unable to resolve the question, redirect it to the appropriate team member.
- The primary goal is to support the organization and relieve pressure on the rest of the team during your on-call week.
4. Submit requested proposals¶
- Replace dead nodes
- Help with on-boarding or off-boarding of datacenters and node providers
- Firewall rule modifications
- Any other requested proposals
Tooling
For all regular ops we have sufficient tooling implemented in our `dre` tool. For new proposals and specific scenarios, it is your responsibility to add them to the tooling as new use cases arise.
5. Monitor status and health of CI¶
- Weekly dependency upgrade job:
    - A GitHub Action runs weekly to automatically upgrade dependencies.
    - While some weeks result in straightforward updates, others may require manual intervention due to API changes or other breaking updates.
    - Your responsibility:
        - Review and address any issues with the generated pull request.
        - Ensure the fixes are implemented and attempt to merge the PR into the repository.
        - Maintaining compatibility between the IC repo and our repo reduces friction and ensures our tooling operates smoothly.
- Mainnet topology proposals, such as `dre network --heal --optimize --ensure-operator-nodes-unassigned --ensure-operator-nodes-assigned --remove-cordoned-nodes` or a subset of these operations. The operations are still not polished enough to be run automatically.
- Provider reward adjustment proposals, if any are needed that week. Please ask in `#eng-dre` if you don't know if any are needed.
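Since the topology command in the list above can be run with only a subset of its operations, a small helper can assemble the command line and catch mistyped flags before anything runs. The sketch below is hypothetical: the helper name `build_network_command` is made up, the flag list is copied from the example command and assumed to be complete, and the flags are emitted in the same order as in that example. It only prints the command and does not execute `dre`.

```python
import shlex

# Flags copied from the example command above; assumed here to be the full set.
KNOWN_OPERATIONS = (
    "--heal",
    "--optimize",
    "--ensure-operator-nodes-unassigned",
    "--ensure-operator-nodes-assigned",
    "--remove-cordoned-nodes",
)


def build_network_command(operations: list[str]) -> list[str]:
    """Build the argv for `dre network` from a chosen subset of operations."""
    if not operations:
        raise ValueError("select at least one operation")
    unknown = [op for op in operations if op not in KNOWN_OPERATIONS]
    if unknown:
        raise ValueError(f"unknown operations: {', '.join(unknown)}")
    # Keep the order used in the example command and drop duplicates.
    selected = [op for op in KNOWN_OPERATIONS if op in operations]
    return ["dre", "network", *selected]


cmd = build_network_command(["--remove-cordoned-nodes", "--heal"])
print(shlex.join(cmd))  # prints: dre network --heal --remove-cordoned-nodes
```

Building the command as a list rather than a string keeps each flag as a separate argument, which avoids quoting problems if the list is later passed to a process runner.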
6. Handover operations¶
- If there are any pending tasks or unresolved operations, it is your responsibility to inform the next on-call team member.
- Provide clear details on what needs to be addressed and any context that might help them pick up where you left off.