The primary aim of IT operations is to run a trustworthy IT ecosystem. From the point of view of your customer, you want to do such a good job that they don’t even notice IT. For older organizations this can be a challenge due to the existence of hundreds, if not thousands, of legacy systems that have been deployed over the decades. You may face daunting technical debt in these systems – poor quality data, overly complex or poorly written source code, systems with inadequate automated regression tests (if any), different versions of the same system, several systems offering similar functionality, numerous technology platforms, systems and technologies for which you have insufficient expertise, and more.
This article is organized into the following topics:
- The Lean IT operations mindset
- The IT Operations process
- Teaming strategies
- IT operations success factors
The Lean IT Operations Mindset
The Disciplined Agile (DA) toolkit describes strategies for how an organization’s IT group can support a lean enterprise. An important part of this is to have an effective IT operations strategy, and to do that the people involved need to have what we call a “lean IT operations mindset.” The philosophies behind such a mindset include:
- Run a trustworthy IT ecosystem. At a high level the goal is to “keep the lights on.” At a detailed level anyone responsible for IT operations wants to run an IT ecosystem that is sufficiently secure, resilient, available, performant, usable, and environmentally friendly. Part of running a trustworthy ecosystem is monitoring running services so as to identify and hopefully avoid potential problems before they occur. For some systems, and perhaps for your IT ecosystem as a whole, you may have service level agreements (SLAs) in place with your end users that guarantee a minimum level of trustworthiness.
- Focus on the strategic (long-term) over the tactical (short-term). Anyone responsible for IT operations needs to understand the trade-off between the long-term implications of a decision and its short-term conveniences. A classic example right now is the preference of people building microservices to use what they believe to be the best technologies for each service. This makes a lot of sense from the narrow viewpoint of that service, and it often proves to be incredibly convenient, and fun, for the developers because they get to work with new technologies. However, from an operational point of view you end up with a mishmash of technologies that must be operated and evolved over time, resulting in a potential maintenance nightmare. Yes, you will still make some short-term decisions, but you should do so intelligently. Too great a focus on the long term results in a stagnant IT ecosystem; too great a focus on short-term decisions results in operations teams who spend all their time fighting fires. The long-term technical vision for your organization is developed by your Enterprise Architecture efforts and the long-term business vision comes from your Product Management activities.
- Streamline the overall flow of work. This arguably should be part of everyone’s mindset, but it is particularly important for people doing IT operations work. IT operations has traditionally been a bottleneck in many organizations, often as a result of the need to run a trustworthy ecosystem and to focus on long-term considerations. However, it isn’t just operational work that we need to streamline, but the overall flow of work into, within, and out of IT operations. In this case we need a disciplined approach to DevOps that takes all aspects of the development-operations lifecycle into account, including the support of multiple development lifecycles (not just continuous delivery), the release management process, and the operational aspects of data management. Of course, streamlining the flow of work goes beyond development-operations and is an important goal of your organization’s continuous improvement strategy.
- Help end users succeed. An important goal of people performing operations activities is to ensure that your end users are successfully using your IT systems. It doesn’t matter how well your systems are built, or how trustworthy they are, if your end users are unable or unwilling to use them effectively. End users are going to need help, so you need to be prepared to provide a support function.
- Standardization without stagnation. The more standardized your IT ecosystem is, the easier it will be to run, to release new functionality into, and to find and fix problems if they should arise. However, too much standardization can lead to stagnation, where it becomes very difficult to evolve your ecosystem. You will need to work very closely with people performing enterprise architecture and product management activities to ensure that you understand the long-term vision and are working towards it.
- Regulate releases into production. Most DevOps strategies reflect the viewpoint of a single product team. But what about the viewpoint of your overall IT ecosystem, which may comprise hundreds of products? An interesting question to ask is: what is the work-in-progress (WIP) limit for releases across your overall ecosystem? In other words, what rate of change can your infrastructure, and your stakeholder community, bear? In the Disciplined Agile (DA) toolkit this philosophy is an important driver of the Release Management process blade. Furthermore, some regulatory compliance regimes call for a separation of concerns pertaining to release management: the people building a product are not allowed to release the product into production, so someone else must make that decision and do the work (even if “the work” is merely pressing a button to run a script).
- Sufficient documentation. Yes, there will be some documentation maintained about your IT ecosystem. Hopefully this documentation is concise, accurate, and high-level. Common documentation includes one or more overviews of your infrastructure, release procedures (even if fully automated, there’s still some overview documentation and training), and high-level views of critical aspects of your infrastructure including security, data architecture, and network architecture. Organizations that operate in regulated industries will of course need to comply with the documentation requirements of the appropriate regulations. When infrastructure components are discoverable and self-documenting there is less need for external documentation, but there is still a need. Any documentation that you do create should be maintained under configuration management (CM) control.
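The “regulate releases into production” philosophy above can be made concrete with a small sketch. The following is a minimal illustration, not part of the DA toolkit: it assumes a hypothetical ecosystem-wide `ReleaseGate` that enforces a WIP limit on concurrent releases and a simple separation-of-duties check. The class names, fields, and messages are all invented for this example.

```python
from dataclasses import dataclass, field
from typing import Set, Tuple


@dataclass
class ReleaseRequest:
    """Hypothetical record of one request to release a product."""
    product: str
    built_by: str      # person who built the change
    approved_by: str   # person who decided to release it


@dataclass
class ReleaseGate:
    """Hypothetical ecosystem-wide gate for releases into production."""
    wip_limit: int                                   # max releases in flight
    in_flight: Set[str] = field(default_factory=set)  # products being released

    def try_start(self, request: ReleaseRequest) -> Tuple[bool, str]:
        # Separation of duties: a builder may not approve their own release.
        if request.built_by == request.approved_by:
            return False, "builder cannot approve their own release"
        # One release at a time per product.
        if request.product in self.in_flight:
            return False, "product already has a release in flight"
        # Ecosystem-wide WIP limit on concurrent releases.
        if len(self.in_flight) >= self.wip_limit:
            return False, "release WIP limit reached"
        self.in_flight.add(request.product)
        return True, "release started"

    def finish(self, product: str) -> None:
        # Free a WIP slot once the release completes or is rolled back.
        self.in_flight.discard(product)


if __name__ == "__main__":
    gate = ReleaseGate(wip_limit=1)
    print(gate.try_start(ReleaseRequest("billing", "ana", "raj")))
    print(gate.try_start(ReleaseRequest("search", "li", "raj")))
    gate.finish("billing")
    print(gate.try_start(ReleaseRequest("search", "li", "raj")))
```

In practice the right WIP limit depends on the rate of change your infrastructure and stakeholders can bear, so treat the number as something to tune through experimentation rather than a fixed constant.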
The IT Operations Process
Some methods choose to prescribe a single approach, such as capturing architectural requirements in the form of epics or pre-building “architectural runways,” but the Disciplined Agile (DA) toolkit promotes an adaptive, context-sensitive strategy. DA does this via its goal-driven approach, which indicates the decision points that you need to consider, a range of techniques or strategies for you to address each decision point, and the advantages and disadvantages of each technique. In this section we present the goal diagram for the IT Operations process blade and overview its decision points.
Figure 1 overviews the potential activities associated with Disciplined Agile IT Operations.
Figure 1. The Operations process blade.
The decision points that you need to consider for IT Operations are:
- Run solutions. The reason why your IT operations efforts exist is to run your organization’s solutions in production.
- Manage infrastructure. Your IT ecosystem is made up of the solutions that you build and buy as well as the infrastructure (hardware, software, network, cloud, and so on) that those solutions run on. This infrastructure must be managed (and evolved).
- Manage configurations. You need to understand the configuration of your IT ecosystem, including dependencies between various aspects of it, to support impact analysis of any potential changes. Traditional strategies are centered around manual maintenance of configuration and dependency metadata, a risky and expensive proposition at best. Agile strategies focus on deriving or generating the required metadata from development tools, particularly from agile management tools such as VersionOne or the Atlassian Suite, or from executable test specifications.
- Evolve infrastructure. You will evolve your IT infrastructure over time, upgrading databases, operating systems, hardware components, network components, and many more. Due to the significant coupling of your solutions to your infrastructure, and infrastructure components to other aspects of your infrastructure, this can be a risky endeavor (hence the need to identify the potential impact of any change before making it).
- Mitigate disasters. Disciplined organizations will plan for operational disasters. Potential disasters include servers going down, network connectivity going down, power outages, failed solution deployments, failed infrastructure deployments, natural disasters such as fires and floods, terrorist attacks, and many more. Furthermore, it is one thing to have disaster mitigation plans in place; it is another to know whether they actually work. Disciplined organizations will run through disaster scenarios to verify how well their mitigation strategies work in practice. This can be done on a scheduled basis at first, evolving into unscheduled or “random” problems (via something like Chaos Monkey) and eventually even full-fledged disaster scenarios.
- Govern IT operations. As with other process blades, the activities of IT Operations must be governed effectively. Operational governance is part of your organization’s overall IT Governance and Control efforts.
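The impact analysis mentioned under “Manage configurations” and “Evolve infrastructure” can be sketched as a walk over dependency metadata. This is a minimal illustration under stated assumptions: the `DEPENDS_ON` map and its component names are hypothetical sample data, and in a real ecosystem the metadata would be derived from your tooling rather than written by hand.

```python
from collections import deque
from typing import Dict, List, Set

# Hypothetical dependency metadata: each key depends on the items in its list.
DEPENDS_ON: Dict[str, List[str]] = {
    "billing-service": ["orders-db", "auth-service"],
    "auth-service": ["users-db"],
    "reporting": ["orders-db"],
    "orders-db": ["linux-hosts"],
    "users-db": ["linux-hosts"],
}


def impacted_by(component: str, depends_on: Dict[str, List[str]]) -> Set[str]:
    """Return everything that directly or transitively depends on component."""
    # Invert the edges so we can walk from a component to its dependents.
    dependents: Dict[str, List[str]] = {}
    for item, requirements in depends_on.items():
        for requirement in requirements:
            dependents.setdefault(requirement, []).append(item)

    # Breadth-first walk; the impacted set doubles as the visited set,
    # so cyclic dependencies do not cause an endless loop.
    impacted: Set[str] = set()
    queue = deque([component])
    while queue:
        current = queue.popleft()
        for dependent in dependents.get(current, []):
            if dependent not in impacted:
                impacted.add(dependent)
                queue.append(dependent)
    return impacted


if __name__ == "__main__":
    # Which components might be affected by upgrading the users database?
    print(sorted(impacted_by("users-db", DEPENDS_ON)))
```

A query like this is only as good as the metadata behind it, which is why generating that metadata automatically tends to be less risky than maintaining it manually.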
Teaming Strategies
Your organization will need to organize the people involved with IT operations as appropriate for your situation. In this section we share three common patterns for doing so:
- A traditional strategy
- A DevOps strategy for small organizations
- A DevOps strategy for large organizations
First, let’s explore the traditional approach to organizing an operations team. This is depicted in Figure 2. With this strategy the development teams and the operations team(s) are kept separate, often because the skillsets are perceived to be distinct and sometimes because of a strict interpretation of separation of duties requirements in regulations such as PCI-DSS. A release manager, and in larger organizations a release management team, is responsible for shepherding releases of new functionality into production. The operations team is responsible for running the solutions in production, for maintaining and evolving the IT infrastructure, for monitoring running systems, and for addressing problems as best they can. There is often an IT support team, not shown in Figure 2, helping end users.
Figure 2. A traditional approach to operations.
The small-organization DevOps teaming strategy is depicted in Figure 3. This works well in organizations with a handful of systems, where each system is being evolved by a solution delivery team, and where a “you build it, you run it” approach has been adopted by the delivery teams. In this case the delivery teams themselves are responsible for developing, releasing, operating, and (very likely) supporting their solution.
Figure 3. A DevOps approach to operations in small organizations.
You can see in Figure 3 that there is someone in the role of “DevOps Engineer,” a specialist role. This role is common when organizations are either new to DevOps or are very small. In small organizations a DevOps Engineer is typically a “jack-of-all-DevOps-trades” who takes on the responsibilities of several of the roles in Figure 4 (Release Manager/Coordinator, Database Administrator, Toolsmith, and sometimes even Operations Engineer). As your organization grows larger you’ll find that these specialist roles emerge and that the DevOps Engineer role goes away. In organizations new to DevOps you’ll often see more senior developers, particularly those on teams following the Continuous Delivery:Lean or Continuous Delivery:Agile lifecycles, given the title of DevOps Engineer.
The large-organization DevOps teaming strategy is depicted in Figure 4. As organizations grow they realize that the “you build it, you run it” philosophy at the team level doesn’t scale very well by itself. The developers on a given team can and should operate and support the specific functionality that they are responsible for, but that functionality is hosted within your overall enterprise ecosystem. Because the ecosystem is shared, some issues are common across teams and are better handled at the strategic level. For example, this may include:
- Having a Release Manager/Coordinator to coordinate and guide the deployment efforts across dozens or even hundreds of teams
- Operations engineers to operate and evolve common infrastructure such as the network, your servers, and your external services (e.g. the cloud)
- Disaster planning, simulation, and mitigation
- Managing the enterprise ecosystem configuration
Figure 4. A DevOps approach to operations in large organizations.
We’ve seen all three of these approaches, and combinations thereof, work quite well in practice. As usual, context counts – different situations require different teaming strategies.
IT Operations Success Factors
Successful Operations efforts balance several competing factors:
- Strategic (long term) versus tactical (short term). There is a fine balance between ensuring operational safety and enabling the evolution of operational systems.
- Operations needs versus organizational needs. You want to not only optimize the flow of operational work but do so within the context of your larger organization. Context counts.
- Standardization versus evolution. To reduce the overall cost and risk associated with operations, and to simultaneously make it easier for development teams to test and release changes into production, you want to standardize as much of your IT infrastructure as possible. Yet your infrastructure cannot be allowed to stagnate; it must safely evolve over time. Hence the need to work with your Enterprise Architecture efforts to envision the future and to run experiments so as to learn how to evolve towards that vision.
- Team DevOps versus organizational efficiency. The DevOps philosophy of “you build it, you run it” is very attractive to individual delivery teams, and it certainly makes sense for smaller organizations. But for organizations with dozens, hundreds, or even thousands of delivery teams working in parallel, your costs and risks quickly skyrocket. These organizations soon realize that a flexible operations/infrastructure team, one that helps delivery teams leverage common infrastructure and guidance, will help to optimize the overall workflow across your Disciplined Agile Enterprise (DAE). Follow the Pragmatism principle.