The notes below are based on the LinkedIn course:
Monitoring Basics: Modelling your system
- USE Model
- RED Model
- DWR Model
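
As a rough illustration of the RED model (Rate, Errors, Duration), the sketch below computes the three numbers from a batch of request records. The `Request` class and `red_metrics` function are hypothetical names made up for these notes, and treating any status of 500 or above as an error is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Request:
    timestamp: float     # seconds since the epoch
    duration_ms: float   # how long the request took
    status: int          # HTTP status code


def red_metrics(requests, window_seconds):
    """Return request rate, error ratio and mean duration for one window."""
    total = len(requests)
    if total == 0:
        return {"rate_per_s": 0.0, "error_ratio": 0.0, "mean_duration_ms": 0.0}
    # Assumption: only 5xx responses count as errors.
    errors = sum(1 for r in requests if r.status >= 500)
    return {
        "rate_per_s": total / window_seconds,
        "error_ratio": errors / total,
        "mean_duration_ms": sum(r.duration_ms for r in requests) / total,
    }


if __name__ == "__main__":
    sample = [Request(0.0, 100.0, 200), Request(1.0, 300.0, 500)]
    print(red_metrics(sample, window_seconds=2))
```

The USE model (Utilization, Saturation, Errors) would follow the same shape, but measured per resource (CPU, disk, network) rather than per request.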
Check if a service is down or slow (uptime, page speed, visitor insights, transactions)
- Write a curl script
- PhantomJS app
- SteelCentral monitoring via application agents
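
The "down or slow" check can also be sketched in Python instead of a curl script. This is a minimal version using only the standard library; the function names, the 1-second "slow" threshold and the choice to treat only 5xx responses as "down" are assumptions, not something the course prescribes.

```python
import time
import urllib.error
import urllib.request


def classify(status, elapsed_s, slow_after_s=1.0):
    """Turn a status code and response time into 'down', 'slow' or 'up'."""
    # status is None when no HTTP response was received at all.
    if status is None or status >= 500:
        return "down"
    if elapsed_s > slow_after_s:
        return "slow"
    return "up"


def check(url, timeout_s=5.0, slow_after_s=1.0):
    """Fetch url once and return (state, elapsed_seconds)."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            status = resp.status
    except urllib.error.HTTPError as exc:
        # The server answered, but with an error status (4xx or 5xx).
        status = exc.code
    except (urllib.error.URLError, OSError):
        # DNS failure, refused connection, timeout and similar.
        status = None
    elapsed = time.monotonic() - start
    return classify(status, elapsed, slow_after_s), elapsed
```

Running `check("https://example.com/")` from a scheduler (cron, for instance) and recording the result over time gives a basic uptime and response-time series.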
- CPU Usage Monitoring
- Memory Usage Monitoring
- Disk I/O Monitoring
- inode & file handle usage
- Datadog agent
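
Without an agent, a few of the server numbers above can be read with the Python standard library. The sketch below assumes a Unix-like host (`os.statvfs` and `os.getloadavg` are not available on Windows) and covers load average, disk space and inode usage only. It does not measure memory or disk I/O throughput, and `snapshot` is a hypothetical name.

```python
import os
import shutil


def snapshot(path="/"):
    """Collect a small set of host metrics for the filesystem at path."""
    usage = shutil.disk_usage(path)   # total, used, free in bytes
    fs = os.statvfs(path)             # f_files = total inodes, f_ffree = free inodes

    try:
        load_1m = os.getloadavg()[0]  # 1-minute load average
    except (OSError, AttributeError):
        load_1m = None                # not obtainable on this platform

    disk_used_pct = None
    if usage.total:
        disk_used_pct = 100.0 * usage.used / usage.total

    inode_used_pct = None
    if fs.f_files:                    # some filesystems report 0 inodes
        inode_used_pct = 100.0 * (fs.f_files - fs.f_ffree) / fs.f_files

    return {
        "load_1m": load_1m,
        "cpu_count": os.cpu_count(),
        "disk_used_pct": disk_used_pct,
        "inode_used_pct": inode_used_pct,
    }


if __name__ == "__main__":
    print(snapshot("/"))
```

Comparing `load_1m` against `cpu_count` gives a crude saturation signal: a load that stays well above the core count suggests work is queuing for CPU.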
- New term: DevNetOps
- NetFlow Analysis: Feature by Cisco
- Packet Analysis: tcpdump, Wireshark, server based agents, or cloud packet aggregator like Gigamon
- API based events sent via application (for example new signups)
Problems: Garbage collection, Latency, Utilization, Heavy parsing, Thread management, Database locking, Connection leaks, Table scans, File handles, Timeouts, Disk queuing
- JMX Metrics (Java)
- WMI Metrics (.NET)
- Use proper log levels: INFO, WARN, DEBUG, FATAL
- Log keeping is expensive on storage
- Use JSON logs for easy parsing.
- Format your verbose logs.
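
One way to get JSON logs with proper levels is a custom formatter on top of Python's standard `logging` module. This is only a sketch: `JsonFormatter`, `make_logger` and the chosen field names are assumptions, not a standard schema.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record):
        payload = {
            "time": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        return json.dumps(payload)


def make_logger(name, stream=None):
    """Build a logger that writes JSON lines to stream (stderr by default)."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)   # DEBUG records are dropped to save storage
    logger.addHandler(handler)
    logger.propagate = False        # avoid duplicate output via the root logger
    return logger


if __name__ == "__main__":
    log = make_logger("signup")
    log.info("user %s signed up", "alice")
    log.debug("this line is filtered out at INFO level")
```

Setting the level to INFO in production and lowering it to DEBUG only while investigating keeps storage cost down, which ties in with the note that log keeping is expensive.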
- For problem detection: use a concise dashboard with simple numbers, event overlays, and percentile graphs
- All in one Context: performance, throughput & errors
- Don't combine metrics in one context if doing so would make the information misleading.
- Book on designing better visualisations: Envisioning Information by Edward Tufte
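
Percentile graphs are useful because an average hides slow outliers. The sketch below uses the nearest-rank definition of a percentile, which is one of several common definitions. The `percentile` function and the latency numbers are made up for illustration.

```python
import math


def percentile(values, p):
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in the range (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(p * len(ordered) / 100)   # 1-based rank
    return ordered[rank - 1]


if __name__ == "__main__":
    # Illustrative data: 98 fast requests and 2 very slow ones (milliseconds).
    latencies_ms = [10] * 98 + [1000, 2000]
    print("mean:", sum(latencies_ms) / len(latencies_ms))
    print("p50:", percentile(latencies_ms, 50))
    print("p99:", percentile(latencies_ms, 99))
```

With this data the median stays at the fast value while the 99th percentile exposes the slow tail, which is the kind of gap a percentile graph makes visible on a dashboard.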
- New Alert: Who & Why?
- Alerts can also be used for automatic remediation.
- Two common problems: False Negatives (too few alerts) & False Positives (too many alerts)
- Example 1: If the number of HTTP 500 responses exceeds 50 within a minute, alert me.
- Example 2: Receiving alerts at midnight isn’t actionable, so putting this information on a dashboard could be better.
- Actionable Alerts:
- Test your alerts with dummy data
- Instrument the real thing
- Send context with your alert
- Set budgets for alerts in your Agile Process
- Before creating automated alerts, validate them with the people who would act on them.
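
Example 1 (more than 50 HTTP 500 responses within a minute) can be sketched as a sliding-window counter that returns context along with the alert. The class name, the field names in the alert payload and the dummy timestamps are all assumptions for illustration. A real setup would normally express this rule in the monitoring tool itself.

```python
from collections import deque


class ErrorRateAlert:
    """Fire when more than `threshold` HTTP 500s occur within `window_s` seconds."""

    def __init__(self, threshold=50, window_s=60.0):
        self.threshold = threshold
        self.window_s = window_s
        self._events = deque()   # timestamps of recent 500 responses

    def record(self, timestamp, status):
        """Record one response; return an alert dict with context, or None."""
        if status == 500:
            self._events.append(timestamp)
        # Drop 500s that have fallen out of the time window.
        while self._events and self._events[0] <= timestamp - self.window_s:
            self._events.popleft()
        if len(self._events) > self.threshold:
            return {
                "reason": "too many HTTP 500 responses",
                "count": len(self._events),
                "threshold": self.threshold,
                "window_s": self.window_s,
                "first_seen": self._events[0],
                "last_seen": self._events[-1],
            }
        return None


if __name__ == "__main__":
    # Test the alert with dummy data: one 500 per second for 51 seconds.
    alert = ErrorRateAlert()
    for t in range(51):
        fired = alert.record(float(t), 500)
    print(fired)
```

Feeding the rule dummy data like this before wiring it to a pager is a cheap way to confirm it fires at the intended threshold and carries enough context to act on.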