So before I can show you what I am doing to fix our SCOM environment, I need to tell you a little bit about what we did wrong and what issues we are seeing because of it.
SCOM uses Management Packs for everything it does.
Management Packs (MP) contain predefined monitoring settings that enable agents to monitor a specific service or application in Operations Manager 2007. These predefined settings include discovery information that allows management servers to automatically detect and begin monitoring objects, a knowledge base that contains error and troubleshooting information, alerts, and reports.
In short, MPs are collections of rules and alerts for a specific object. Many MPs are provided by Microsoft for monitoring things like the Windows Server 2008 OS or SQL Server. Some third-party vendors provide management packs as well. For example, we have a lot of Dell servers in our environment, so we installed the Dell server MP. This MP monitors physical system health and alerts us when a failure has occurred.
When you first install SCOM, it installs around 40 default MPs. These MPs are a core part of SCOM. You then have the option to install additional MPs. These are application specific: Windows, SQL, Exchange, etc. It is a best practice to install only the MPs you need, and to install them one at a time in a controlled manner, so that each one can be configured and overridden before the next goes in.
In my environment we have 140 MPs installed, with very few overrides configured. Anything with a critical status generates an email to the system admins, which produces a large amount of garbage email.
The second big issue we have is overrides. It is best practice to never place an override in the Default Management Pack (which is where SCOM will put it unless you pick somewhere else). Instead, each management pack should have its own custom-built override pack. For example, if the MP is "Active Directory Server 2008 (Monitoring)", you would create a custom override MP named "Active Directory Server 2008 (Monitoring) - Override". Like most items in your environment, naming conventions are key here. If you don't come up with a convention for naming custom MPs and stick to it, you will end up with a mess.
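To make the naming rule concrete, here is a tiny sketch of it in Python. It is not SCOM code and does not talk to SCOM at all. The function name and the " - Override" suffix constant are just my own illustration of the convention described above.

```python
# Hypothetical helper: builds the override pack name for a management pack,
# following the "<MP name> - Override" convention. Nothing here is a SCOM API.
OVERRIDE_SUFFIX = " - Override"


def override_pack_name(source_mp_name: str) -> str:
    """Return the name of the override pack that belongs to source_mp_name."""
    name = source_mp_name.strip()
    if not name:
        raise ValueError("management pack name must not be empty")
    # Avoid producing "... - Override - Override" if the suffix is already there.
    if name.endswith(OVERRIDE_SUFFIX):
        return name
    return name + OVERRIDE_SUFFIX


if __name__ == "__main__":
    print(override_pack_name("Active Directory Server 2008 (Monitoring)"))
    # prints: Active Directory Server 2008 (Monitoring) - Override
```

The point is only that the override pack name should be derivable from the source MP name with no judgment calls, so anyone on the team ends up with the same name.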
In my environment the overrides we do have defined are all over the place. Some are in "Default Management Pack", some are in "Test" or "Test MP", and very few follow any naming scheme.
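Finding the strays is basically a list comparison. Here is a rough Python sketch of the idea, assuming you have already exported two things from SCOM by whatever means you prefer: the names of the installed MPs, and a list of overrides with the pack each one is stored in. The function name, the data layout, and the sample override names are all made up for illustration.

```python
# Hypothetical audit: flag overrides stored anywhere other than
# "<installed MP name> - Override". Inputs are plain Python data,
# not anything read from SCOM directly.
OVERRIDE_SUFFIX = " - Override"


def misplaced_overrides(overrides, installed_mps):
    """Return the (override_name, pack_name) pairs stored in a non-conforming pack.

    overrides:     iterable of (override_name, pack_name) tuples
    installed_mps: iterable of installed management pack names
    """
    valid_packs = {mp + OVERRIDE_SUFFIX for mp in installed_mps}
    return [(name, pack) for name, pack in overrides if pack not in valid_packs]


if __name__ == "__main__":
    installed = ["Active Directory Server 2008 (Monitoring)"]
    # Sample data only; the override names are invented.
    found = [
        ("Disable browser service alert", "Default Management Pack"),
        ("Raise SAM error threshold", "Test MP"),
        ("Tune GC check", "Active Directory Server 2008 (Monitoring) - Override"),
    ]
    for name, pack in misplaced_overrides(found, installed):
        print(f"{name!r} is stored in {pack!r}")
```

Anything this reports is a candidate for moving into the proper override pack.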
Unfortunately, before I can fix the email flooding issue, I need to fix the poor location and naming scheme of our custom MPs.
So I have been tasked at work with cleaning up our SCOM environment. I have been spending a lot of time trying to organize it and understand how SCOM works. Wikipedia defines SCOM as:
System Center Operations Manager is a cross-platform data center management system for operating systems and hypervisors. It uses a single interface that shows state, health and performance information of computer systems. It also provides alerts generated according to some availability, performance, configuration or security situation being identified. It works with Microsoft Windows Server and Unix-based hosts.
In short, it's a health management tool. It sends out email alerts, based on what you define, to notify you of issues or pending issues within your environment. It will allow you to move from being a reactive department to a proactive one.
When configured right, this tool can make your job much easier, but when it is not configured right... well, you end up with what I have.
Currently I receive between 200 and 500 emails a day on "critical" issues within my environment.
"Critical" issues like:
Alert: Computer Browser Service Stopped Resolution state: New
Alert: Miscellaneous SAM Errors Resolution state: New
Alert: DC is both a Global Catalog and the Infrastructure Update master Resolution
Now don't get me wrong, I do get emails about actual critical issues within my environment, but when they are buried under emails like the ones above, it is nearly impossible for me to react in a timely manner.
So I'm going to start posting the tips and tricks I have found for managing SCOM on here. Hopefully some of this stuff will help someone down the line, or even myself when I need to come back and reference something.
Hope you enjoy the adventure.