Security Operations Center (SOC) - When detection is not enough
A Security Operations Center (SOC) is often composed of a team of dedicated analysts who hawkishly monitor everything happening within the infrastructure. The ultimate tool in their arsenal is a Security Information and Event Management (SIEM) system that aggregates logs from various sources: Endpoint Detection and Response (EDR), antivirus, firewalls, and cloud API calls. These logs are fed into a detection engine full of scenarios defined by the SOC team.
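To make "scenarios" concrete, here is a minimal sketch of what one such rule could look like once logs are normalized. The event fields (event_type, user) and the rule itself are illustrative, not tied to any particular SIEM.

```python
# Minimal sketch of a SIEM-style detection scenario (illustrative only).
# Assumes normalized log events as dicts with hypothetical fields
# such as "event_type" and "user".

from typing import Iterable


def failed_admin_logins(events: Iterable[dict], threshold: int = 5) -> list[dict]:
    """Flag users who repeatedly fail to log in to admin accounts."""
    failures: dict[str, int] = {}
    alerts = []
    for event in events:
        if event.get("event_type") == "login_failure" and event.get("user", "").startswith("admin"):
            failures[event["user"]] = failures.get(event["user"], 0) + 1
            if failures[event["user"]] == threshold:
                alerts.append({"rule": "repeated_admin_login_failures", "user": event["user"]})
    return alerts
```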
It sounds like the ideal setup. The north star that every company should pursue. But the reality is that many SOC teams I have encountered are either busy chasing their own tails or drowning in a sea of false positives.
When questioning such teams, one can always zero in on the same root cause: they only work on detection.
Allow me to explain.
Protection is the prerequisite for good detection
As outlined in the NIST Cybersecurity Framework, detection is the third function, following identification and protection. The case for identifying assets is obvious: we want maximum asset coverage so that we can detect every potentially compromised system. The more blind spots we have, the less effective our detection is.
Protection, however, is often viewed as an entirely separate endeavor, well outside the scope of the SOC team. That’s the fallacy I would like to address.
Left alone, the information system is an explosive cacophony of requests and actions: "anything that can happen, will happen," in a loose echo of Murphy's law. Users allowed to make AWS API calls to fetch production secrets will do just that; lax firewall rules will be taken advantage of for nasty FTP transfers; default Windows machines will flood the network with NTLM requests; and so on.
Trying to detect suspicious behavior in this commotion is a Sisyphean task. Wherever you look, there are exceptions to the rules you write. However cleverly you approach the problem, you get inundated with false positives that either burn the team out or let attackers hide in plain sight.
The root cause of it all is that SOC teams are trying to solve a multi-variable equation using a single knob: tweaking detection rules. The math is clear: it will not work. Yes, we need to tweak rules, but we also need to tweak the systems producing these logs to constrain their output.
Hardening systems, tightening firewall rules, whitelisting applications and containers, reducing privileges…all these actions limit the number of potential interactions in a system. They rationalize the system down to what is strictly required to conduct the business, implicitly defining the baseline of activity. And once we can confidently state what is "normal" within an infrastructure, then by definition everything else is suspicious, ergo, subject to detection. This gets rid of 90% of the noise and greatly simplifies the rules.
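As a sketch of what baseline-driven detection can look like, the snippet below assumes the network has already been constrained to a small set of documented flows and flags everything outside that allowlist. The baseline contents and the event shape (src, dst, dst_port) are assumptions for illustration.

```python
# Sketch: baseline-driven detection over network flow logs (illustrative).
# The baseline is an assumption: the set of flows the business actually needs.

ALLOWED_FLOWS = {
    ("app-server", "db-server", 5432),   # application -> database
    ("app-server", "cache", 6379),       # application -> cache
    ("bastion", "app-server", 22),       # admin access via the bastion only
}


def suspicious_flows(flow_events):
    """Anything outside the documented baseline is, by definition, worth a look."""
    for event in flow_events:
        flow = (event["src"], event["dst"], event["dst_port"])
        if flow not in ALLOWED_FLOWS:
            yield {"rule": "flow_outside_baseline", "flow": flow}
```

Note how small the rule is: the hard work happened upstream, when the firewall rules were tightened to match the baseline.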
If all admin actions are forced to go through a given tool (Terraform, Ansible, CyberArk, what have you), then detecting suspicious admin access amounts to detecting changes that happen outside those tools. That's it. Try doing that in an environment where every admin queries the infrastructure from their own laptop, their own IP address, running custom scripts under a generic account… Nearly impossible!
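A rough sketch of that idea against CloudTrail-style events: write calls whose identity is not the automation role, or whose user agent does not look like Terraform, get flagged. The field names (userIdentity, userAgent, readOnly) follow CloudTrail's general shape, but the role ARN and the matching logic are assumptions, not a production rule.

```python
# Sketch: flag infrastructure changes made outside the sanctioned tooling.
# Assumes CloudTrail-style events; the role ARN and userAgent check are illustrative.

SANCTIONED_ROLE = "arn:aws:iam::123456789012:role/terraform-ci"  # hypothetical CI role


def out_of_band_changes(events):
    for event in events:
        if event.get("readOnly", True):
            continue  # only changes matter here, not read calls
        identity = event.get("userIdentity", {}).get("arn", "")
        agent = event.get("userAgent", "")
        if SANCTIONED_ROLE not in identity or "Terraform" not in agent:
            yield {
                "rule": "change_outside_terraform",
                "eventName": event.get("eventName"),
                "identity": identity,
            }
```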
A SOC team cut off from the rest of the business, solely focused on detection rules and on investigating poorly calibrated alerts, will not be able to live up to its true potential. The team needs to tweak the infrastructure, change system and container configurations…all to force their logs to take a certain shape, a certain pattern. Doing so will give them vastly more potential for effective detection.
Aim for 10 alerts during the week, zero during the weekend. Anything more needs tweaking, but most of all, don't forget to take advantage of all the knobs ;)