Anyway, I recently came across a very interesting article from them and I wanted an excuse to reference it here and discuss some of the issues they raise.
So first off, go and read the NCSC article, then come back here!
My reading of that article is that it is mostly about logging for security purposes rather than for diagnostic purposes, which is a good and necessary thing of course, but logging can be much, much more than that.
In fact I'd go so far as to say, if you aren't doing good logging, how do you know that your network and systems are working properly?
It's also important to remember that logging is like gardening: it's not a 'fire and forget' task. You need to keep tuning it as your network and attack surface change. Beware that many companies will offer you AI-based logging systems that claim to avoid this necessity, though I have my doubts!
Security Information and Event Management (SIEM)
What we are talking about here is a very cut-down approach to Security Information and Event Management (SIEM). SIEM is a huge topic, and large companies make small fortunes advising on it and supplying big software suites to help with it.
However, many SMEs, and even some large companies, get by with much more straightforward approaches.
Commercial and Free Tools
I'm not planning to give a total rundown of what is a huge market, but here are a few packages to look at, some of which I've used and some I haven't:
- Splunk - aggregate and dashboard stuff in a very straightforward fashion. This also has installable agents for on-system data collection.
- Logstash/Kibana/Elasticsearch - this is the open-source equivalent of the above, but as with all open-source stuff you may find it needs more care and feeding than you might like.
- SolarWinds Security Event Manager (SEM) - commercial product. Not cheap! I've not actually used this but it's spoken well of (if you have $$$)
- Nagios - an open-source log management and analysis tool
- AlienVault OSSIM - I've not used this either but it looks interesting
What to log
My standard answer would be "it depends". Whilst I appreciate that isn't very helpful, here are some of the things I think should always be logged, and some "it depends" items too.
If one needs a quick fix, one can just log messages of a given level of severity or above (which does somewhat assume that all the clients properly label each message with a sensible severity, of course!).
- Authentication failures
- Authentication successes
- General 'out of resource' errors e.g. out of memory, storage, CPU etc.
- System Restarted messages of all sorts - bad actors installing malware will quite often have to force a reboot to complete their installation.
- Any routing state changes, be that for OSPF, RIP, BGP or whatever
- LACP State Changes
- Spanning-Tree State Changes
- Flapping ports
- Error-Disable activity messages on Switch Ports - very handy for detecting if someone has added an unauthorized unmanaged switch to the network!
- Ideally one would capture the whole device config on a regular basis and compare with the previous version. This is very handy in detecting unauthorised config changes (Also see Never Be The Last Person Who Touched The Firewall)
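As a rough sketch of the "capture the config and compare with the previous version" idea, the following uses Python's standard `difflib` module. How you fetch the config from the device is not shown, and the two snapshots here are made-up strings purely for illustration.

```python
import difflib


def config_diff(previous: str, current: str) -> list[str]:
    """Return unified-diff lines between two config snapshots (empty if identical)."""
    return list(
        difflib.unified_diff(
            previous.splitlines(),
            current.splitlines(),
            fromfile="previous",
            tofile="current",
            lineterm="",
        )
    )


# Hypothetical snapshots, e.g. yesterday's and today's saved configs.
old_cfg = "hostname sw1\ninterface Gi0/1\n description uplink\n"
new_cfg = "hostname sw1\ninterface Gi0/1\n description uplink\n no shutdown\n"

changes = config_diff(old_cfg, new_cfg)
for line in changes:
    print(line)
```

An empty result means nothing changed; anything else is worth a look, or an alert if nobody owns up to the change.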
Things to look for when analysing logs
- Same user logging in successfully or unsuccessfully across multiple devices at the same time - this might be legit, it might not.
- Any gaps in the logs (which could be a sign of an attacker deleting stuff)
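Both of the checks above can be sketched in a few lines of Python. This is a minimal illustration, assuming you have already parsed your logs into timestamps and into `(timestamp, user, device)` tuples; the function names, thresholds and sample data are all my own inventions, not part of any particular tool.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def find_gaps(timestamps, max_silence):
    """Return (start, end) pairs where consecutive entries are further apart than max_silence."""
    ordered = sorted(timestamps)
    return [(a, b) for a, b in zip(ordered, ordered[1:]) if b - a > max_silence]


def multi_device_logins(events, window):
    """events: iterable of (timestamp, user, device).
    Return {user: devices} for users seen on two or more devices within `window`."""
    by_user = defaultdict(list)
    for ts, user, device in events:
        by_user[user].append((ts, device))

    flagged = {}
    for user, seen in by_user.items():
        seen.sort()
        devices = set()
        for i, (ts, device) in enumerate(seen):
            for later_ts, later_device in seen[i + 1:]:
                if later_ts - ts > window:
                    break  # sorted by time, so nothing further can be in the window
                if later_device != device:
                    devices.update((device, later_device))
        if devices:
            flagged[user] = devices
    return flagged


# Made-up sample data for illustration.
t0 = datetime(2024, 1, 1, 9, 0, 0)
events = [
    (t0, "alice", "sw1"),
    (t0 + timedelta(seconds=20), "alice", "fw1"),
    (t0 + timedelta(hours=3), "bob", "sw1"),
]

gaps = find_gaps([ts for ts, _, _ in events], timedelta(hours=1))
suspects = multi_device_logins(events, timedelta(minutes=1))
```

Neither result proves anything on its own. A quiet device will have gaps, and an admin may legitimately log in to several boxes at once, so treat these as prompts to go and look.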
How to Log
You should get logging information off the box you are protecting ASAP to avoid it being over-written or compromised by a bad actor.
Preferably one should use a write-only channel such as syslog (although plain syslog doesn't have any form of integrity protection). The traditional way of doing this is just unencrypted UDP on port 514, but there is now (since 2009!) an IETF standard way of carrying syslog over TLS, as per RFC 5425, which provides protection against various forms of attack.
The NCSC have a "basic logging toolkit" to help with this [https://github.com/ukncsc/lme Logging Made Easy]
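To make the syslog-over-TLS option mentioned above a bit more concrete, here is a minimal sketch using only the Python standard library. It builds an RFC 5424 style message, wraps it in the octet-counted framing that RFC 5425 describes, and sends it over TLS to port 6514 (the port registered for syslog over TLS). The function names are my own, and the hostname, app name and server are placeholders; in practice you would more likely configure rsyslog or syslog-ng to do this for you.

```python
import socket
import ssl
from datetime import datetime, timezone


def format_rfc5424(facility, severity, hostname, app, msg, when=None):
    """Build one RFC 5424 syslog line; PRI is facility * 8 + severity."""
    when = when or datetime.now(timezone.utc)
    pri = facility * 8 + severity
    # The three "-" fields mark PROCID, MSGID and STRUCTURED-DATA as empty.
    return f"<{pri}>1 {when.isoformat()} {hostname} {app} - - - {msg}"


def frame_octet_counted(message):
    """RFC 5425 framing: the byte length, a space, then the message itself."""
    payload = message.encode("utf-8")
    return str(len(payload)).encode("ascii") + b" " + payload


def send_over_tls(frame, server, port=6514):
    """Send one framed message over TLS, verifying the server certificate."""
    context = ssl.create_default_context()
    with socket.create_connection((server, port), timeout=5) as raw:
        with context.wrap_socket(raw, server_hostname=server) as tls:
            tls.sendall(frame)


# Facility 4 is security/authorization, severity 4 is warning.
stamp = datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
line = format_rfc5424(4, 4, "sw1", "demo", "Authentication failure for admin", stamp)
frame = frame_octet_counted(line)
# send_over_tls(frame, "logs.example.com")  # placeholder server name
```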
Once you have all the log messages from across your estate in (approximately) the same place, and hopefully in a relatively sane format (syslog format is common and will save you time munging log formats later...), you can then look at how to process the logs.
This is inevitably tied up with the "how much to log" question, and the related second question, "How far back do you want to be able to go?", as obviously the more days of logs you want to keep, the more disk you need. On the plus side, each log entry is likely to be a pretty small line of text, so it's not like you are re-inventing YouTube or anything like that!
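A back-of-the-envelope calculation shows how retention turns into disk space. The figures below (200 messages per second, 150 bytes per message, 90 days) are assumptions picked purely for illustration, so substitute numbers measured on your own estate.

```python
def log_storage_bytes(messages_per_second, avg_message_bytes, retention_days):
    """Rough uncompressed storage needed to keep `retention_days` of logs."""
    seconds_per_day = 24 * 60 * 60
    return messages_per_second * avg_message_bytes * seconds_per_day * retention_days


# Assumed figures for illustration only: 200 msg/s, 150 bytes each, 90 days.
estimate = log_storage_bytes(200, 150, 90)
print(f"{estimate / 1e9:.1f} GB")
```

With those assumed numbers the answer comes out at roughly 233 GB before compression, and plain text logs generally compress well.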
As an example of how far you can go, a colleague of ours was the security manager for a large-ish gambling site, and he had a setup where he captured every byte of data flowing into and out of his network from the internet. He stored that for several weeks in a searchable form, just in case he needed to replay any suspicious traffic for investigation. Now I'm not suggesting you ought to go that far, but nonetheless keeping all your log messages for, say, several months would seem to be a reasonable starting point.
If you are trying to analyse an issue, it may be useful to start with a relatively wide time window, but only look for log messages of a relatively high severity. That may highlight a suspicious log message, at which point you can inspect a narrower time window around it more rigorously, looking at messages of any severity in that timeframe.
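The wide-then-narrow search described above might look something like this in Python. It is only a sketch: the `Entry` record, the function names, the default severity threshold and the five-minute margin are all assumptions of mine, and it presumes your logs are already parsed into memory.

```python
from datetime import datetime, timedelta
from typing import NamedTuple


class Entry(NamedTuple):
    when: datetime
    severity: int  # syslog numbering: 0 = emergency ... 7 = debug
    device: str
    text: str


def search(entries, start, end, max_severity=7):
    """Entries inside [start, end] whose severity number is <= max_severity."""
    return [e for e in entries if start <= e.when <= end and e.severity <= max_severity]


def drill_down(entries, start, end, coarse_severity=3, margin=timedelta(minutes=5)):
    """Pass 1: wide window, serious messages only. Pass 2: everything near each hit."""
    hits = search(entries, start, end, coarse_severity)
    return {hit: search(entries, hit.when - margin, hit.when + margin) for hit in hits}


# Made-up sample data for illustration.
t0 = datetime(2024, 1, 1, 9, 0)
entries = [
    Entry(t0, 6, "sw1", "link up"),
    Entry(t0 + timedelta(hours=2), 5, "sw1", "config changed"),
    Entry(t0 + timedelta(hours=2, minutes=1), 2, "sw1", "out of memory"),
    Entry(t0 + timedelta(hours=6), 6, "sw2", "link up"),
]

context = drill_down(entries, t0, t0 + timedelta(hours=8))
```

Here the first pass finds only the "out of memory" message, and the second pass pulls in the "config changed" notice from a minute earlier, which a severity filter alone would have hidden.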
A general point would be that one is trying to find patterns. Sometimes these will be repeated instances of the same message on the same device, or perhaps a cluster of similar messages around the same time across multiple devices. One item that we've seen recently showed a pattern where any given "incident" of one message type on a given switch was followed after about 10 seconds by a different set of message types on the same device. The obvious conclusion was that the latter was somehow a consequence of the former, presumably driven in part by some timeout settings in the various protocols running on the switch.
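One way to hunt for that kind of "A is followed by B" pattern is to count which message types tend to appear shortly after which others on the same device. The sketch below assumes events reduced to `(seconds, device, message_type)` tuples; the message type names and the 15 second window are invented for the example and are not taken from the incident described above.

```python
from collections import Counter


def followed_by(events, max_delay_seconds):
    """events: list of (seconds, device, message_type).
    Count (first_type, later_type) pairs seen on the same device
    within max_delay_seconds of each other."""
    pairs = Counter()
    ordered = sorted(events)
    for i, (t1, dev1, type1) in enumerate(ordered):
        for t2, dev2, type2 in ordered[i + 1:]:
            if t2 - t1 > max_delay_seconds:
                break  # sorted by time, so later events are even further away
            if dev2 == dev1 and type2 != type1:
                pairs[(type1, type2)] += 1
    return pairs


# Hypothetical message types, timestamps in seconds.
events = [
    (0, "sw1", "LINK_DOWN"),
    (10, "sw1", "STP_TOPOLOGY_CHANGE"),
    (300, "sw1", "LINK_DOWN"),
    (311, "sw1", "STP_TOPOLOGY_CHANGE"),
    (600, "sw2", "LINK_DOWN"),
]

pairs = followed_by(events, 15)
print(pairs.most_common(3))
```

A pair that turns up again and again is a hint, not a proof, that one message is a consequence of the other, but it tells you where to start reading the protocol timers.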
Another general idea is that it's worth having a regular inspection of logs for anything which is hogging the limelight; indeed, you could even make a simple tool which counts the top 10 log messages on a daily basis. These can be simple things like "flapping" edge ports which don't, of themselves, cause huge problems to the overall network, but may be consuming huge amounts of log space and possibly even Switch CPU resource, making it difficult to see the wood for the trees when a real problem comes along.
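The "top 10" tool really can be simple. Here is a minimal sketch using `collections.Counter`; it masks every run of digits so that messages differing only in a port number are counted together. That masking rule is a crude assumption of mine (it also blurs device names like sw1 and sw2), so adjust it to suit your own log format.

```python
import re
from collections import Counter


def top_messages(lines, n=10):
    """Count log lines after masking digits, so 'port 12' and 'port 7' group together."""
    counts = Counter(
        re.sub(r"\d+", "#", line.strip()) for line in lines if line.strip()
    )
    return counts.most_common(n)


# Made-up sample lines for illustration.
lines = [
    "sw1 port 12 link flap",
    "sw1 port 7 link flap",
    "sw1 port 12 link flap",
    "fw1 login failed for admin",
]

for message, count in top_messages(lines):
    print(count, message)
```

Run daily against the previous day's logs, something like this makes a flapping port obvious long before it drowns out a real problem.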
If you would like full confidence that your network is fit for use and that nothing untoward is happening on your network, please contact us on 0203 805 7795 and talk directly to one of the Layer3 Systems team.