Monitoring Your GCP VMs Made Easy: Using Ops Agent and Alerts

Geekette
6 min read · Jan 3, 2024

In this article, we’re going to take a simple and straightforward approach to monitoring your Google Cloud Platform (GCP) virtual machines (VMs) and setting up alerts when needed. We’ll keep it “Keep It Simple, Stupid” (KISS) style, so no over-engineering or overthinking required!

Meet the Ops Agent

The Ops Agent is like your all-in-one superhero for monitoring. It handles both logging and metrics, replacing the need for separate agents like the legacy Stackdriver Logging agent and Stackdriver Monitoring agent. But before we dive into the details, here are a couple of important things to remember: those legacy agents are deprecated, and the Ops Agent shouldn’t run side by side with them, so uninstall them first if they’re still on your VMs.

Easy Installation on Individual VMs

Let’s start with installing the Ops Agent on a single VM. You might be tempted to use your mouse and click around the GCP console, and that works for a quick test, but we recommend using Infrastructure as Code (IAC) tools like Terraform or Pulumi for consistency. For the quick manual route:

  1. SSH into your VM.
  2. Run these two commands to download and execute the installation script:
curl -sSO https://dl.google.com/cloudagents/add-google-cloud-ops-agent-repo.sh
sudo bash add-google-cloud-ops-agent-repo.sh --also-install
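
To confirm the agent came up cleanly, check its systemd units; the trailing wildcard also matches the logging and metrics subagents:

sudo systemctl status google-cloud-ops-agent"*"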

Alternatively, when creating your VM in the console, tick the option to install the Ops Agent (under the Observability section) and it’s done effortlessly.

Keep an eye also on the (not that new anymore) “Equivalent code” feature: the console generates Terraform code based on what you select for your VM.

Keep It Simple with IAC

You can streamline the process using Infrastructure as Code. Here’s how to do it:

  1. Define your VM configurations in Terraform or Pulumi.
  2. Specify the Ops Agent installation as part of your VM setup.
resource "google_compute_instance" "vm_instance" {

project = var.project_ID
name = "ops-agent-vm"
machine_type = "f1-micro"
zone = "europe-west1-b"

boot_disk {
initialize_params {
image = "debian-cloud/debian-11"
labels = {
my_label = "value"
}
}
}

// Local SSD disk
scratch_disk {
interface = "SCSI"
}

network_interface {
network = "default"

access_config {
// Ephemeral public IP
}
}

metadata = {
foo = "bar"
}

metadata_startup_script = <<-EOF
#!/bin/bash
curl -sSO https://dl.google.com/cloudagents/add-google-cloud-ops-agent-repo.sh
sudo bash add-google-cloud-ops-agent-repo.sh --also-install
EOF

service_account {
scopes = ["cloud-platform"]
}
}
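
From here it’s the usual Terraform routine; the project ID value is a placeholder for your own:

terraform init
terraform apply -var='project_id=<PROJECT ID>'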

Fleet Installation: Label ’Em All

Imagine you have a bunch of VMs, and you want to equip them all with the Ops Agent for logging and monitoring. No need to manually visit each VM; just label ’em and let the magic happen with what we call an “agent policy.”
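
Agent policies match VMs by their instance labels, so make sure your fleet carries them. Adding labels to an existing VM is a one-liner (the VM name and zone here are just examples):

gcloud compute instances add-labels ops-agent-vm \
    --zone=us-central1-c \
    --labels=env=po,service=devtools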

Using GCP’s GCloud Command Line

With GCP’s GCloud command-line tool, it’s as easy as pie:

gcloud beta compute instances ops-agents policies update ops-agents-policy-safe-rollout \
    --group-labels=env=po,service=devtools \
    --zones=us-central1-c

Simply specify the group labels, and watch the Ops Agent roll out like clockwork.
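
To double-check what a policy contains after the rollout, you can describe it:

gcloud beta compute instances ops-agents policies describe ops-agents-policy-safe-rollout \
    --project=<PROJECT ID>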

Using Terraform: Simplify with IAC

If you prefer the magic of Infrastructure as Code (IAC), Terraform has your back:

module "agent_policy" {
source = "terraform-google-modules/cloud-operations/google//modules/agent-policy"
version = "~> 0.2.3"
project_id = "<PROJECT ID>"
policy_id = "ops-agents-policy"
agent_rules = [
{
type = "logging"
version = "current-major"
package_state = "installed"
enable_autoupgrade = true
},
{
type = "metrics"
version = "current-major"
package_state = "installed"
enable_autoupgrade = true
},
]
group_labels = [
{
env = "po"
service = "devtools"
}
]
os_types = [
{
short_name = "debian"
version = "10"
},
]
}

Terraform does the heavy lifting, applying the Ops Agent rules consistently across your VM fleet.

Don’t Forget osconfig.googleapis.com

Before diving into agent policies, make sure the OS Config API (osconfig.googleapis.com) is enabled on your project. It’s a crucial step: agent policies rely on it to install and maintain the Ops Agent.
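
Enabling it is a one-liner (swap in your own project ID):

gcloud services enable osconfig.googleapis.com --project=<PROJECT ID>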

https://cloud.google.com/stackdriver/docs/solutions/agents/ops-agent/managing-agent-policies

Play It Cool with Ansible

For the Ansible enthusiasts out there, we’ve got a playbook ready for you. Create your VMs and then unleash the Ansible role:

# Installing the Ops Agent with a custom configuration
- hosts: all # replace with your inventory group
  become: true
  roles:
    - role: googlecloudplatform.google_cloud_ops_agents
      vars:
        agent_type: ops-agent
        version: 1.0.1
        main_config_file: ops_agent.yaml

Ansible takes the stage, making fleet installation a breeze.
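
If you don’t have the role locally yet, it’s published on Ansible Galaxy from the repo linked below; the inventory file and playbook name are placeholders for your own:

ansible-galaxy install googlecloudplatform.google_cloud_ops_agents
ansible-playbook -i inventory.ini install_ops_agent.yml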

https://github.com/GoogleCloudPlatform/google-cloud-ops-agents-ansible
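
Wondering what goes into ops_agent.yaml? It follows the Ops Agent’s own configuration schema. As a minimal sketch (the receiver and pipeline names are arbitrary labels you choose), here’s what hand-writing an equivalent config on a single Linux VM might look like, collecting syslog plus host metrics every 60 seconds:

sudo tee /etc/google-cloud-ops-agent/config.yaml > /dev/null <<'EOF'
logging:
  receivers:
    syslog:
      type: files
      include_paths:
        - /var/log/syslog
  service:
    pipelines:
      default_pipeline:
        receivers: [syslog]
metrics:
  receivers:
    hostmetrics:
      type: hostmetrics
      collection_interval: 60s
  service:
    pipelines:
      default_pipeline:
        receivers: [hostmetrics]
EOF
sudo systemctl restart google-cloud-ops-agent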

Now that you’ve got the Ops Agent installed and running smoothly, it’s time to take the next step: setting up monitoring alerts and creating informative dashboards. Don’t worry, we’ll keep it easy and straightforward.

Your Ops Agent Playground

You’ve already installed the Ops Agent, which is great! But what if you want to keep an eye on your VMs’ performance? Well, that’s where monitoring alerts and dashboards come in.

The Dashboard Delight

Creating a dashboard in GCP is a piece of cake. You don’t need to write complex JSON by hand; you can build one right from the GCP console. Here’s how:

  1. Visit the GCP console.
  2. Navigate to “Monitoring” and click “Dashboards.”
  3. Click your way through the options to design your dashboard.
  4. When you’re done, export its JSON definition so you can keep it in version control (see the gcloud sketch below).
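
If you’d rather script that last step, gcloud can list your dashboards and dump any one of them as JSON; the dashboard ID is a placeholder you’d copy from the list output:

gcloud monitoring dashboards list --format="value(name)"
gcloud monitoring dashboards describe <DASHBOARD_ID> \
    --project=<PROJECT ID> --format=json > dash.json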

Simple, right? Now, let’s move on to setting up alerts.

Alerting Awesomeness

Your VMs are like moody teenagers. They’re fine one moment, and then, boom! They’re using up all the memory. Time for a chat, but not just any chat — an alert!
You want to be notified when your VMs misbehave, right? No problem! We’ll set up alerts for memory, disk, and CPU usage.

Using Terraform (IAC for the Win)

We’ll use Terraform, the Infrastructure as Code tool, to create these alerts. Check out the simplicity:

resource "google_monitoring_alert_policy" "alert_policies" {
project = var.project_id
for_each = var.alert_policies
display_name = each.value.display_name
user_labels = {}
conditions {
display_name = each.value.display_name
condition_threshold {
filter = each.value.filter
aggregations {
alignment_period = "60s"
cross_series_reducer = "REDUCE_NONE"
group_by_fields = ["metadata.system_labels.name"]
per_series_aligner = "ALIGN_MEAN"
}
comparison = "COMPARISON_GT"
duration = "0s"
trigger {
percent = 100
}
threshold_value = each.value.threshold_value
}
}
alert_strategy {
auto_close = "604800s"
}
combiner = "OR"
enabled = true
notification_channels = [google_monitoring_notification_channel.default.name]
depends_on = [
google_monitoring_notification_channel.default
]
}

And here’s where you define your alert policies:

variable "alert_policies" {
type = map(object({
display_name = string
filter = string
threshold_value = number
}))
default = {
cpu_utilization = {
display_name = "CPU Usage Alert"
filter = "resource.type = \"gce_instance\" AND metric.type = \"compute.googleapis.com/instance/cpu/utilization\""
threshold_value = 0.8
},
memory_utilization = {
display_name = "Memory Usage Alert"
filter = "resource.type = \"gce_instance\" AND metric.type = \"agent.googleapis.com/memory/percent_used\" AND metric.labels.state != \"free\""
threshold_value = 80

},
disk_utilization = {
display_name = "Disk Usage Alert"
filter = "resource.type = \"gce_instance\" AND metric.type = \"agent.googleapis.com/disk/percent_used\" AND metric.labels.state != \"free\""
threshold_value = 80
}
}
}

The Slack Secret Sauce

First, you need a way to send these alerts. We’re using Slack because it’s like texting for the cloud. But how do we keep the Slack token a secret? We have a secret weapon: Google Secret Manager! It’s like storing your secret sauce in a high-tech vault.
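
If the secret doesn’t exist yet, create it and load the token in one shot; the secret name matches the data source below, and the token value is obviously a placeholder:

printf '%s' 'xoxb-your-slack-bot-token' | gcloud secrets create secret_me \
    --replication-policy=automatic \
    --data-file=-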

data "google_secret_manager_secret_version" "slack_bot_user_oauth_token" {
secret = "secret_me"
project = var.project_id
}
resource "google_monitoring_notification_channel" "default" {
project = var.project_id
display_name = "Notification and alerting"
type = "slack"
labels = {
"channel_name" = "#alerts-gcp"
}
sensitive_labels {
auth_token = data.google_secret_manager_secret_version.slack_bot_user_oauth_token.secret_data
}
}

Now, when those VMs start acting up, Slack will be your hotline.

Dashboard Delights

Remember that dashboard you created earlier? It’s about to get even cooler. You want it to display your alerts, right? We’re going to mix some variables and create magic!

resource "google_monitoring_dashboard" "monitored_dash" {
project = var.project_id
dashboard_json = templatefile(
"dash.jsontpl",
{
cpu_alert = google_monitoring_alert_policy.alert_policies["cpu_utilization"].name
memo_alert = google_monitoring_alert_policy.alert_policies["memory_utilization"].name
disk_alert = google_monitoring_alert_policy.alert_policies["disk_utilization"].name
}
)
depends_on = [
google_monitoring_alert_policy.alert_policies
]
}

Imagine your dashboard as a canvas, and these alerts are your brush strokes. With each alert, your dashboard becomes a masterpiece!

{
  "dashboardFilters": [],
  "displayName": "GCE VM Instance Monitoring",
  "labels": {},
  "mosaicLayout": {
    "columns": 48,
    "tiles": [
      {
        "height": 19,
        "widget": {
          "alertChart": {
            "name": "${cpu_alert}"
          }
        },
        "width": 24
      }
    ]
  }
}

When Terraform renders dash.jsontpl, templatefile replaces ${cpu_alert} with the CPU alert policy’s full resource name, and the same substitution happens for ${memo_alert} and ${disk_alert} in their own alertChart tiles (trimmed here for brevity).
Conclusion

You’ve leveled up your GCP monitoring game. Now, when your VMs throw a tantrum, you’ll get a friendly Slack message. Plus, your dashboard will tell you all about it in a visually appealing way. Keep having fun in the cloud, and stay tuned for more cloud adventures!

Everything mentioned here, like project names, service accounts, and VM names, is just for testing purposes and not actual production use.
