In this article, we’re going to take a simple and straightforward approach to monitoring your Google Cloud Platform (GCP) virtual machines (VMs) and setting up alerts when needed. We’ll keep it “Keep It Simple, Stupid” (KISS) style, so no over-engineering or overthinking required!
Meet the Ops Agent
The Ops Agent is like your all-in-one superhero for monitoring. It handles both logging and metrics, replacing the need for separate agents like the legacy Stackdriver Logging agent and Stackdriver Monitoring agent. But before we dive into the details, here are a few important things to remember:
- Check the supported OS for the Ops Agent.
- Make sure your VM type is supported.
- Don’t forget to enable the Cloud Logging and Cloud Monitoring APIs, and assign the necessary roles to your VM’s service account (avoid using the default service account); see the Terraform sketch right after this list. The two roles you need are:
- roles/monitoring.metricWriter
- roles/logging.logWriter
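If you want to manage these prerequisites with Terraform too, a minimal sketch could look like the following (the service account name and the project_id variable are placeholders, not something from the original setup):
# Enable the Logging and Monitoring APIs
resource "google_project_service" "apis" {
  for_each = toset(["logging.googleapis.com", "monitoring.googleapis.com"])
  project  = var.project_id
  service  = each.value
}
# Dedicated service account for the VM (placeholder name)
resource "google_service_account" "vm_sa" {
  project      = var.project_id
  account_id   = "ops-agent-vm-sa"
  display_name = "Ops Agent VM service account"
}
# Grant the two roles the Ops Agent needs
resource "google_project_iam_member" "ops_agent_roles" {
  for_each = toset(["roles/monitoring.metricWriter", "roles/logging.logWriter"])
  project  = var.project_id
  role     = each.value
  member   = "serviceAccount:${google_service_account.vm_sa.email}"
}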
Easy Installation on Individual VMs
Let’s start with installing the Ops Agent on a single VM. You might be tempted to use your mouse and click around the GCP console, but we recommend using Infrastructure as Code (IaC) tools like Terraform or Pulumi for consistency.
- SSH into your VM.
- Run this simple command:
curl -sSO https://dl.google.com/cloudagents/add-google-cloud-ops-agent-repo.sh
sudo bash add-google-cloud-ops-agent-repo.sh --also-install
Alternatively, when creating your VM from the console, follow the steps shown in the screenshot to install the Ops Agent effortlessly.
Also keep an eye on the (not-so-new anymore) feature where the console generates the equivalent Terraform code based on what you select for your VM.
Keep It Simple with IaC
You can streamline the process using Infrastructure as Code. Here’s how:
- Define your VM configurations in Terraform or Pulumi.
- Specify the Ops Agent installation as part of your VM setup.
resource "google_compute_instance" "vm_instance" {
  project      = var.project_id
  name         = "ops-agent-vm"
  machine_type = "n1-standard-1" // shared-core types such as f1-micro don't support local SSDs
  zone         = "europe-west1-b"

  boot_disk {
    initialize_params {
      image = "debian-cloud/debian-11"
      labels = {
        my_label = "value"
      }
    }
  }

  // Local SSD disk
  scratch_disk {
    interface = "SCSI"
  }

  network_interface {
    network = "default"
    access_config {
      // Ephemeral public IP
    }
  }

  metadata = {
    foo = "bar"
  }

  metadata_startup_script = <<-EOF
    #!/bin/bash
    curl -sSO https://dl.google.com/cloudagents/add-google-cloud-ops-agent-repo.sh
    sudo bash add-google-cloud-ops-agent-repo.sh --also-install
  EOF

  service_account {
    // set "email" here to a dedicated service account instead of relying on the default one
    scopes = ["cloud-platform"]
  }
}
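The example assumes a project_id input variable is declared somewhere in your configuration; a minimal declaration would be:
variable "project_id" {
  type        = string
  description = "The GCP project to deploy the VM into"
}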
Fleet Installation: Label ’Em All
Imagine you have a bunch of VMs, and you want to equip them all with the Ops Agent for logging and monitoring. No need to manually visit each VM; just label ’em and let the magic happen with what we call an “agent policy.”
Using GCP’s GCloud Command Line
With GCP’s GCloud command-line tool, it’s as easy as pie:
gcloud beta compute instances \
  ops-agents policies update ops-agents-policy-safe-rollout \
  --group-labels=env=po,service=devtools \
  --zones=us-central1-c
Simply specify the group labels, and watch the Ops Agent roll out like clockwork.
Using Terraform: Simplify with IaC
If you prefer the magic of Infrastructure as Code (IaC), Terraform has your back:
module "agent_policy" {
  source  = "terraform-google-modules/cloud-operations/google//modules/agent-policy"
  version = "~> 0.2.3"

  project_id = "<PROJECT ID>"
  policy_id  = "ops-agents-policy"
  agent_rules = [
    {
      type               = "logging"
      version            = "current-major"
      package_state      = "installed"
      enable_autoupgrade = true
    },
    {
      type               = "metrics"
      version            = "current-major"
      package_state      = "installed"
      enable_autoupgrade = true
    },
  ]
  group_labels = [
    {
      env     = "po"
      service = "devtools"
    }
  ]
  os_types = [
    {
      short_name = "debian"
      version    = "10"
    },
  ]
}
Terraform does the heavy lifting, applying the Ops Agent rules consistently across your VM fleet.
Don’t Forget osconfig.googleapis.com
Before diving into agent policies, make sure to enable osconfig.googleapis.com, a crucial step in this process. It ensures smooth installation and maintenance of the Ops Agent.
https://cloud.google.com/stackdriver/docs/solutions/agents/ops-agent/managing-agent-policies
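If you already manage project services with Terraform, enabling this API follows the same pattern as the other APIs; a minimal sketch:
# Enable the OS Config API required by agent policies
resource "google_project_service" "osconfig" {
  project = var.project_id
  service = "osconfig.googleapis.com"
}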
Play It Cool with Ansible
For the Ansible enthusiasts out there, we’ve got a playbook ready for you. Create your VMs and then unleash the Ansible role:
# Installing the Ops Agent with a custom configuration
- hosts: all  # replace with your inventory
  become: true
  roles:
    - role: googlecloudplatform.google_cloud_ops_agents
      vars:
        agent_type: ops-agent
        version: 1.0.1
        main_config_file: ops_agent.yaml
Ansible takes the stage, making fleet installation a breeze.
https://github.com/GoogleCloudPlatform/google-cloud-ops-agents-ansible
Now that you’ve got the Ops Agent installed and running smoothly, it’s time to take the next step: setting up monitoring alerts and creating informative dashboards. Don’t worry, we’ll keep it easy and straightforward.
Your Ops Agent Playground
You’ve already installed the Ops Agent, which is great! But what if you want to keep an eye on your VMs’ performance? Well, that’s where monitoring alerts and dashboards come in.
The Dashboard Delight
Creating a dashboard in GCP is a piece of cake. You don’t need to write complex JSON; instead, you can do it right from the GCP console. Here’s how:
- Visit the GCP console.
- Navigate to “Monitoring” and click “Dashboard.”
- Click your way through the options to design your dashboard.
- When you’re done, save it as a JSON file.
Simple, right? Now, let’s move on to setting up alerts.
Alerting Awesomeness
Your VMs are like moody teenagers. They’re fine one moment, and then, boom! They’re using up all the memory. Time for a chat, but not just any chat — an alert!
You want to be notified when your VMs misbehave, right? No problem! We’ll set up alerts for memory, disk, and CPU usage.
Using Terraform (IaC for the Win)
We’ll use Terraform, the Infrastructure as Code tool, to create these alerts. Check out the simplicity:
resource "google_monitoring_alert_policy" "alert_policies" {
  project  = var.project_id
  for_each = var.alert_policies

  display_name = each.value.display_name
  user_labels  = {}

  conditions {
    display_name = each.value.display_name
    condition_threshold {
      filter = each.value.filter
      aggregations {
        alignment_period     = "60s"
        cross_series_reducer = "REDUCE_NONE"
        group_by_fields      = ["metadata.system_labels.name"]
        per_series_aligner   = "ALIGN_MEAN"
      }
      comparison = "COMPARISON_GT"
      duration   = "0s"
      trigger {
        percent = 100
      }
      threshold_value = each.value.threshold_value
    }
  }

  alert_strategy {
    auto_close = "604800s"
  }

  combiner = "OR"
  enabled  = true

  notification_channels = [google_monitoring_notification_channel.default.name]

  depends_on = [
    google_monitoring_notification_channel.default
  ]
}
And here’s where you define your alert policies:
variable "alert_policies" {
  type = map(object({
    display_name    = string
    filter          = string
    threshold_value = number
  }))
  default = {
    cpu_utilization = {
      display_name    = "CPU Usage Alert"
      filter          = "resource.type = \"gce_instance\" AND metric.type = \"compute.googleapis.com/instance/cpu/utilization\""
      threshold_value = 0.8
    },
    memory_utilization = {
      display_name    = "Memory Usage Alert"
      filter          = "resource.type = \"gce_instance\" AND metric.type = \"agent.googleapis.com/memory/percent_used\" AND metric.labels.state != \"free\""
      threshold_value = 80
    },
    disk_utilization = {
      display_name    = "Disk Usage Alert"
      filter          = "resource.type = \"gce_instance\" AND metric.type = \"agent.googleapis.com/disk/percent_used\" AND metric.labels.state != \"free\""
      threshold_value = 80
    }
  }
}
The Slack Secret Sauce
First, you need a way to send these alerts. We’re using Slack because it’s like texting for the cloud. But how do we keep the Slack token a secret? We have a secret weapon: Google Secret Manager! It’s like storing your secret sauce in a high-tech vault.
data "google_secret_manager_secret_version" "slack_bot_user_oauth_token" {
  secret  = "secret_me"
  project = var.project_id
}

resource "google_monitoring_notification_channel" "default" {
  project      = var.project_id
  display_name = "Notification and alerting"
  type         = "slack"
  labels = {
    "channel_name" = "#alerts-gcp"
  }
  sensitive_labels {
    auth_token = data.google_secret_manager_secret_version.slack_bot_user_oauth_token.secret_data
  }
}
Now, when those VMs start acting up, Slack will be your hotline.
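The data source above assumes a secret called secret_me already exists in Secret Manager. If you want to manage it with Terraform as well, a minimal sketch could look like this (the variable holding the token is a placeholder; mark it sensitive):
resource "google_secret_manager_secret" "slack_token" {
  project   = var.project_id
  secret_id = "secret_me"
  replication {
    auto {} # "automatic = true" on older provider versions
  }
}
resource "google_secret_manager_secret_version" "slack_token" {
  secret      = google_secret_manager_secret.slack_token.id
  secret_data = var.slack_bot_user_oauth_token # placeholder variable
}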
Dashboard Delights
Remember that dashboard you created earlier? It’s about to get even cooler. You want it to display your alerts, right? We’re going to mix some variables and create magic!
resource "google_monitoring_dashboard" "monitored_dash" {
  project = var.project_id
  dashboard_json = templatefile(
    "dash.jsontpl",
    {
      cpu_alert  = google_monitoring_alert_policy.alert_policies["cpu_utilization"].name
      memo_alert = google_monitoring_alert_policy.alert_policies["memory_utilization"].name
      disk_alert = google_monitoring_alert_policy.alert_policies["disk_utilization"].name
    }
  )
  depends_on = [
    google_monitoring_alert_policy.alert_policies
  ]
}
Imagine your dashboard as a canvas, and these alerts are your brush strokes. With each alert, your dashboard becomes a masterpiece!
{
  "dashboardFilters": [],
  "displayName": "GCE VM Instance Monitoring",
  "labels": {},
  "mosaicLayout": {
    "columns": 48,
    "tiles": [
      {
        "height": 19,
        "widget": {
          "alertChart": {
            "name": "${cpu_alert}"
          }
        },
        "width": 24
Terraform’s templatefile replaces ${cpu_alert} with the name of the CPU alert policy defined above, and it does the same for ${memo_alert} and ${disk_alert} in the other tiles.
Conclusion
You’ve leveled up your GCP monitoring game. Now, when your VMs throw a tantrum, you’ll get a friendly Slack message. Plus, your dashboard will tell you all about it in a visually appealing way. Keep having fun in the cloud, and stay tuned for more cloud adventures!
Everything mentioned here, like project names, service accounts, and VM names, is just for testing purposes and not actual production use.