Gathering of application metrics helps to identify incidents like brute force attacks, login/logout patterns, and unusual spikes in activity. Key metrics to monitor include:
- Authentication attempts (successful/failed logins)
- Transaction volumes and patterns (e.g. orders, payments)
- API call rates and response times
- User session metrics
- Resource utilization
Example: An e-commerce application normally processes 100 orders per hour. A sudden spike to 1000 orders per hour could indicate either:
- A legitimate event (unannounced marketing campaign, viral social media post)
- A security incident (automated bulk purchase bots, credential stuffing attack)
By monitoring these basic metrics, teams can quickly investigate abnormal patterns and determine if they represent security incidents requiring response.
Risk:Attacks on an application are not recognized.
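The baseline comparison from the e-commerce example can be sketched as follows. This is a minimal sketch, assuming hourly counts are already collected. The function name `is_spike` and the factor of 3 are illustrative choices, not a recommended threshold. The same check works for failed-login counts when spotting brute force attempts.

```python
from statistics import mean


def is_spike(history, current, factor=3.0):
    """Return True if `current` exceeds `factor` times the mean of `history`."""
    if not history:
        raise ValueError("history must contain at least one value")
    baseline = mean(history)
    return current > factor * baseline


# Hypothetical hourly order counts around the normal 100 orders per hour.
normal_hours = [95, 102, 99, 104, 100]
print(is_spike(normal_hours, 110))   # False: normal fluctuation
print(is_spike(normal_hours, 1000))  # True: tenfold spike
```

A fixed multiple of the mean is the simplest possible baseline. It ignores daily and weekly seasonality, so real deployments usually compare against the same hour on previous days instead.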
Cloud providers often provide insight into budgets. A budget threshold is set, and an alarm is triggered when it is reached.
Risk:Not getting notified about reaching the end of the budget (e.g. due to a denial of service) creates unexpected costs.
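Cloud budget services typically offer this alarming natively. The sketch below only illustrates the logic, assuming month-to-date spend is available. It also alarms on a linear forecast, so a cost surge (e.g. from a denial of service) is reported before the budget is actually used up. The function names and the 80 % threshold are hypothetical.

```python
def forecast_month_end(spend_so_far, day_of_month, days_in_month):
    """Linearly extrapolate month-to-date spend to the end of the month."""
    daily_rate = spend_so_far / day_of_month
    return daily_rate * days_in_month


def budget_alarm(spend_so_far, day_of_month, days_in_month, budget, threshold=0.8):
    """Return an alarm message if actual spend or forecast crosses a limit, else None."""
    forecast = forecast_month_end(spend_so_far, day_of_month, days_in_month)
    if spend_so_far >= threshold * budget:
        return f"ALARM: actual spend {spend_so_far:.2f} reached {threshold:.0%} of budget"
    if forecast >= budget:
        return f"WARNING: forecast {forecast:.2f} exceeds budget {budget:.2f}"
    return None


# 600 spent after 10 of 30 days projects to 1800, above a budget of 1000.
print(budget_alarm(600, 10, 30, budget=1000))
```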
Gathering of system metrics helps to identify incidents and especially bottlenecks, such as in CPU, memory, and hard disk usage.
Risk:Without simple metrics, analysis of incidents is hard. If an application uses a lot of CPU from time to time, it is hard for a developer to find the source with Linux commands alone.
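A minimal sketch of gathering basic system metrics with only the Python standard library, assuming a Unix-like host (`os.getloadavg()` is not available on Windows). Memory usage is omitted because the standard library has no portable call for it; third-party libraries such as psutil provide it. In practice an agent collects such values periodically and ships them to the monitoring system.

```python
import os
import shutil


def system_snapshot(path="/"):
    """Collect CPU load and disk usage for the filesystem containing `path`."""
    load1, load5, load15 = os.getloadavg()  # Unix-only
    disk = shutil.disk_usage(path)
    return {
        "cpu_count": os.cpu_count(),
        "load_avg_1m": load1,
        "load_avg_5m": load5,
        "load_avg_15m": load15,
        "disk_used_percent": 100 * disk.used / disk.total,
    }


print(system_snapshot())
```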
Thresholds for metrics are set. When a threshold is reached, an alarm is sent out that should receive attention according to its criticality.
Risk:Incidents are discovered after they happened.
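Threshold evaluation with a criticality per metric can be sketched as below. The metric names, threshold values, and criticality levels are hypothetical; alerting tools normally express the same idea as alert rules. Sorting puts the most critical alarms first so they get attention first.

```python
# Hypothetical thresholds: metric name -> (threshold, criticality)
THRESHOLDS = {
    "cpu_percent": (90.0, "critical"),
    "memory_percent": (85.0, "high"),
    "disk_percent": (80.0, "medium"),
}
SEVERITY_ORDER = {"critical": 0, "high": 1, "medium": 2, "low": 3}


def evaluate(metrics, thresholds=THRESHOLDS):
    """Return (metric, value, criticality) for every metric at or above its threshold."""
    alarms = [
        (name, value, thresholds[name][1])
        for name, value in metrics.items()
        if name in thresholds and value >= thresholds[name][0]
    ]
    # Most critical alarms first.
    return sorted(alarms, key=lambda alarm: SEVERITY_ORDER[alarm[2]])


print(evaluate({"cpu_percent": 95.0, "memory_percent": 40.0, "disk_percent": 82.5}))
```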
Cost budgets are implemented. An alert threshold is set, and an alert is sent out when it is reached. Ideally, a second threshold acts as a hard limit so that costs cannot go any higher.
Risk:Not monitoring costs might lead to unexpectedly high resource consumption and a high invoice.
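The two-threshold budget can be sketched as a small decision function. This is a hypothetical controller: what "stop" means (scaling down, disabling billable resources) depends on the platform and is not shown. The class name and values are illustrative.

```python
from dataclasses import dataclass


@dataclass
class CostBudget:
    alert_at: float    # first threshold: notify the team
    hard_limit: float  # second threshold: prevent further spending

    def __post_init__(self):
        if self.alert_at >= self.hard_limit:
            raise ValueError("alert_at must be below hard_limit")

    def check(self, current_cost):
        """Return the action a hypothetical budget controller would take."""
        if current_cost >= self.hard_limit:
            return "stop"
        if current_cost >= self.alert_at:
            return "alert"
        return "ok"


budget = CostBudget(alert_at=800.0, hard_limit=1000.0)
for cost in (500.0, 850.0, 1000.0):
    print(cost, budget.check(cost))
```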
Metrics are visualized in real time in a user-friendly way.
Risk:Metrics that are not visualized see only limited use.
Advanced metrics are gathered in relation to availability and stability, for example unplanned downtimes per year.
Risk:Trends and advanced attacks are not detected.
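Unplanned downtime per year translates directly into an availability figure: availability = (hours per year − downtime hours) / hours per year. For example, 8.76 hours of unplanned downtime in a 365-day year corresponds to 99.9 % availability. A minimal sketch, assuming downtime per incident is recorded in hours (the incident values are hypothetical):

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years


def availability_percent(unplanned_downtime_hours):
    """Availability over one year, given the total unplanned downtime in hours."""
    return 100 * (HOURS_PER_YEAR - unplanned_downtime_hours) / HOURS_PER_YEAR


# Hypothetical unplanned downtime incidents recorded over one year, in hours.
incidents = [2.5, 4.0, 2.26]
print(f"{availability_percent(sum(incidents)):.2f} %")  # 99.90 %
```

Tracking this value year over year makes stability trends visible.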
Gathering of system calls.
Risk:Trends and attacks in system events (system calls) are not detected.
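Once system calls are gathered (for example with auditd, Falco, or an eBPF-based tracer), one simple analysis is to compare them against a per-process baseline of expected calls. The sketch below assumes events have already been parsed into `(process, syscall)` tuples; that format and the baseline are hypothetical. A process without a baseline has all of its calls reported.

```python
from collections import Counter


def unexpected_syscalls(events, baseline):
    """Count (process, syscall) pairs that are not in the process's baseline.

    events:   iterable of (process_name, syscall_name) tuples (hypothetical format)
    baseline: dict mapping process_name -> set of expected syscall names
    """
    return Counter(
        (proc, call) for proc, call in events
        if call not in baseline.get(proc, set())
    )


baseline = {"nginx": {"accept4", "read", "write", "close", "epoll_wait"}}
events = [
    ("nginx", "read"), ("nginx", "write"),
    # A web server spawning new programs at runtime deserves a closer look.
    ("nginx", "execve"), ("nginx", "execve"),
]
print(unexpected_syscalls(events, baseline))
```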
Deactivation of unused metrics helps to free resources.
Risk:Gathering unused metrics consumes a lot of resources.
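Finding candidates for deactivation can be sketched as below, assuming the monitoring system can report when each metric was last queried by a dashboard or alert rule. Whether that data is available depends on the tool, so the input dict is hypothetical, as is the 30-day cut-off.

```python
from datetime import datetime, timedelta


def unused_metrics(last_queried, now, max_age=timedelta(days=30)):
    """Return metric names not queried within `max_age`, or never queried.

    last_queried: dict metric name -> datetime of last query, or None if never queried
    """
    return sorted(
        name for name, ts in last_queried.items()
        if ts is None or now - ts > max_age
    )


now = datetime(2024, 6, 1)
last_queried = {
    "http_requests_total": datetime(2024, 5, 31),
    "legacy_queue_depth": datetime(2024, 1, 15),
    "debug_cache_hits": None,
}
print(unused_metrics(last_queried, now))
```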
Meaningful grouping of metrics helps to speed up analysis.
Risk:The analysis of metrics takes a long time.
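One common way to group metrics meaningfully is by a label such as the owning service. A minimal sketch, assuming each series carries a `labels` dict (the series shown are hypothetical). Series without the label land in a separate group so they are not lost.

```python
from collections import defaultdict


def group_by_label(series, label):
    """Group metric names by the value of one label (e.g. 'service' or 'team')."""
    groups = defaultdict(list)
    for s in series:
        groups[s["labels"].get(label, "unlabeled")].append(s["name"])
    return dict(groups)


series = [
    {"name": "http_requests_total", "labels": {"service": "checkout"}},
    {"name": "http_errors_total", "labels": {"service": "checkout"}},
    {"name": "login_failures_total", "labels": {"service": "auth"}},
    {"name": "node_cpu_seconds_total", "labels": {}},
]
print(group_by_label(series, "service"))
```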
By defining target groups for incidents, people only receive alarms for incidents they are in charge of.
Risk:People become desensitized to incident alarm messages and ignore them, as they are not responsible for reacting to them.
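In practice, routing is usually configured in the alerting tool itself (for example the routing tree of Prometheus Alertmanager). The sketch below only shows the idea, assuming each alert carries a `service` field; the team names and mapping are hypothetical. A default team ensures no alert is silently dropped.

```python
# Hypothetical mapping: service -> responsible team
ROUTES = {
    "checkout": "team-shop",
    "auth": "team-identity",
}
DEFAULT_TEAM = "team-ops"  # fallback so unmatched alerts still reach someone


def route(alert):
    """Return the team that should receive this alert."""
    return ROUTES.get(alert.get("service"), DEFAULT_TEAM)


alerts = [
    {"name": "HighLoginFailures", "service": "auth"},
    {"name": "OrderSpike", "service": "checkout"},
    {"name": "DiskFull", "service": "batch"},
]
for alert in alerts:
    print(alert["name"], "->", route(alert))
```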
All defects from the dimension Test and Verification are instrumented.
Risk:People do not look at test results. Vulnerabilities are not recognized, even though they are detected by tools.
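Instrumenting defects means turning tool findings into metrics that dashboards and alerts can use. A minimal sketch, assuming findings are already normalized into dicts with `tool` and `severity` keys (hypothetical data). The output lines follow the Prometheus text exposition format; the metric name `security_findings` is illustrative.

```python
from collections import Counter


def defect_metrics(findings):
    """Count findings per (tool, severity) so they can be exported as metrics."""
    return Counter((f["tool"], f["severity"]) for f in findings)


# Hypothetical findings from test and verification tools.
findings = [
    {"tool": "sast", "severity": "high"},
    {"tool": "sast", "severity": "low"},
    {"tool": "sca", "severity": "high"},
    {"tool": "sast", "severity": "high"},
]
for (tool, severity), count in sorted(defect_metrics(findings).items()):
    print(f'security_findings{{tool="{tool}",severity="{severity}"}} {count}')
```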
Coverage and control metrics are used to show the effectiveness of the security program. Coverage is the degree to which a specific security control is applied to all resources of a specific target group. The control degree shows how security standards and security guidelines are actually applied. Examples include gathering information on anti-virus, anti-rootkits, patch management, server configuration, and vulnerability management.
Risk:The effectiveness of configuration, patch and vulnerability management is unknown.
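Coverage as defined above is the share of resources in a target group on which a control is applied. A minimal sketch, assuming an inventory that lists the controls per host (the inventory is hypothetical). An empty target group is reported as 0 % here, which is one possible convention.

```python
def coverage_percent(resources, control):
    """Percentage of resources in the target group on which `control` is applied."""
    if not resources:
        return 0.0
    covered = sum(1 for r in resources if control in r["controls"])
    return 100 * covered / len(resources)


# Hypothetical inventory of servers and the security controls applied to them.
servers = [
    {"host": "web-1", "controls": {"anti-virus", "patch-management"}},
    {"host": "web-2", "controls": {"patch-management"}},
    {"host": "db-1", "controls": {"anti-virus", "patch-management"}},
    {"host": "db-2", "controls": set()},
]
for control in ("anti-virus", "patch-management"):
    print(control, coverage_percent(servers, control))
```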
Gathering of defense metrics such as TCP/UDP sources makes it possible to estimate the geographic location of a request. For a Kubernetes cluster with an egress traffic filter (e.g. IP- or domain-based), an alert might be sent out for every violation. For ingress traffic, such alerting might not even be considered.
Risk:IDS/IPS systems like packet or application firewalls detect and prevent attacks. It is not known how many attacks have been detected and blocked.
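Alerting on every egress violation can be sketched with the standard `ipaddress` module. Enforcement itself would happen in the cluster (e.g. via network policies); this sketch only evaluates connection records, assuming a hypothetical flow-log format and an IP-based allowlist. The addresses are from documentation and private ranges.

```python
import ipaddress

# Hypothetical IP-based egress allowlist for the cluster.
ALLOWED_NETWORKS = [
    ipaddress.ip_network("10.0.0.0/8"),
    ipaddress.ip_network("192.0.2.0/24"),
]


def egress_violations(connections):
    """Return every outgoing connection whose destination is not allowlisted."""
    violations = []
    for conn in connections:
        dst = ipaddress.ip_address(conn["dst"])
        if not any(dst in net for net in ALLOWED_NETWORKS):
            violations.append(conn)
    return violations


connections = [
    {"pod": "api-7f9c", "dst": "10.1.2.3", "port": 5432},
    {"pod": "api-7f9c", "dst": "198.51.100.23", "port": 4444},
]
for v in egress_violations(connections):
    print(f"ALERT: {v['pod']} -> {v['dst']}:{v['port']} not allowed")
```

Counting the returned violations over time also answers how many connections were blocked.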
An internally accessible screen showing security-related dashboards helps to visualize incidents.
Risk:Security related information is discovered too late during an incident.
Gathering metrics during tests helps to identify programming errors.
Risk:Changes might cause high load due to programming errors.
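Metrics can be gathered inside a test, so a change that makes code unexpectedly expensive fails the pipeline. A minimal sketch using the standard `time` and `tracemalloc` modules; `build_report` and the time and memory budgets are hypothetical and would need tuning per project.

```python
import time
import tracemalloc


def measure(func, *args, **kwargs):
    """Run func and return (result, elapsed seconds, peak traced memory in bytes)."""
    tracemalloc.start()
    start = time.perf_counter()
    try:
        result = func(*args, **kwargs)
    finally:
        elapsed = time.perf_counter() - start
        _, peak = tracemalloc.get_traced_memory()
        tracemalloc.stop()
    return result, elapsed, peak


def build_report(n):
    # Hypothetical function under test.
    return [str(i) for i in range(n)]


result, elapsed, peak = measure(build_report, 10_000)
# Fail the test if the change makes the function unexpectedly expensive.
assert elapsed < 1.0, f"too slow: {elapsed:.3f}s"
assert peak < 50 * 1024 * 1024, f"too much memory: {peak} bytes"
print(f"{elapsed:.4f}s, peak {peak / 1024:.0f} KiB")
```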