A Complete Guide to Google Cloud Observability

Your application is deployed. Your GKE cluster is humming, your databases are ready, and your functions are awaiting triggers. You’ve built the ship. But now you have to sail it, and sailing in the cloud can feel like navigating through a thick fog.

Is the application healthy? Are users getting errors? Why did that request take five seconds? Without the right tools, your sophisticated system is a complete black box. This is the problem observability solves: the art of understanding the internal state of your system by observing its external outputs.

In Google Cloud, observability is handled by the Cloud Operations suite (formerly Stackdriver). It’s not one tool, but a set of five powerful, interconnected services that act as the eyes, ears, and diagnostic scanners for your entire cloud environment.

Let’s step onto the bridge and learn how to investigate an issue from the first alert to the exact line of code, like a seasoned cloud captain.

The Red Alert: Cloud Monitoring

It’s 3 AM. You’re woken up by a notification on your phone. This is the start of our investigation, and it begins with Cloud Monitoring.

Cloud Monitoring is your ship’s bridge. It has all the live dials, gauges, and view-screens showing you the ship’s status right now. Its job is to watch your system’s performance metrics and tell you when something is wrong.

  • Metrics: Monitoring is built on metrics, which are time-series measurements of anything from VM CPU utilization to load balancer latency to the number of undelivered messages in a Pub/Sub subscription.
  • Dashboards: This is your main viewscreen, where you build custom charts and graphs to get an at-a-glance view of your application’s health.
  • Uptime Checks: These are sentinels that constantly check your public endpoints from around the world to ensure your application is reachable.
  • Alerting Policies: This is the red alert system. An alerting policy watches a metric and, when a condition is met (e.g., “5xx error rate > 5% for 5 minutes”), it sends a message via a notification channel (like PagerDuty, Slack, or email). A great alert also includes documentation—a link to a runbook telling the on-call engineer what to do.
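
A policy like the one above can also be defined in code. Here is a minimal sketch using the Python client library (google-cloud-monitoring); the project, metric filter, threshold, and runbook URL are all illustrative placeholders, and a real policy would also attach notification channels and tune aggregations so the condition truly expresses an error rate rather than a raw count:

  from google.cloud import monitoring_v3
  from google.protobuf import duration_pb2

  client = monitoring_v3.AlertPolicyServiceClient()

  policy = monitoring_v3.AlertPolicy(
      display_name="High 5xx error rate",
      combiner=monitoring_v3.AlertPolicy.ConditionCombinerType.AND,
      conditions=[
          monitoring_v3.AlertPolicy.Condition(
              display_name="5xx responses above threshold for 5 minutes",
              condition_threshold=monitoring_v3.AlertPolicy.Condition.MetricThreshold(
                  # Simplified placeholder filter: HTTPS load balancer requests
                  # whose response code class is 5xx.
                  filter=(
                      'metric.type="loadbalancing.googleapis.com/https/request_count" '
                      'AND metric.labels.response_code_class="500"'
                  ),
                  comparison=monitoring_v3.ComparisonType.COMPARISON_GT,
                  threshold_value=5,
                  duration=duration_pb2.Duration(seconds=300),  # must hold for 5 minutes
              ),
          )
      ],
      documentation=monitoring_v3.AlertPolicy.Documentation(
          content="Runbook: https://example.com/runbooks/5xx-errors",  # placeholder link
          mime_type="text/markdown",
      ),
  )

  created = client.create_alert_policy(
      name="projects/my-project",  # hypothetical project
      alert_policy=policy,
  )
  print("Created alert policy:", created.name)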

Our alert says the 5xx error rate is high. We know what is wrong, but we don’t know why. Our first step is to check the ship’s logbook.

The Investigation Begins: Cloud Logging

Cloud Logging is the centralized, searchable library for every log from every service in your project. Logs from most GCP services stream here automatically. For your VMs, you install the Ops Agent to collect logs and metrics.

Using the Logs Explorer, you quickly filter for logs with severity=ERROR from the last 15 minutes. You find hundreds of entries with the message: "Error: Request to payment-service timed out after 3000ms".
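
The same query can be run outside the console. Here is a minimal sketch using the Python client library (google-cloud-logging); it assumes Application Default Credentials are configured and simply prints matching entries:

  from datetime import datetime, timedelta, timezone
  from google.cloud import logging

  client = logging.Client()  # project and credentials come from the environment

  # ERROR-severity entries from the last 15 minutes, newest first.
  cutoff = (datetime.now(timezone.utc) - timedelta(minutes=15)).isoformat()
  log_filter = f'severity=ERROR AND timestamp>="{cutoff}"'

  for entry in client.list_entries(filter_=log_filter, order_by=logging.DESCENDING):
      print(entry.timestamp, entry.payload)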

Now we have a clue! It’s not our main application that’s failing; it’s failing because its downstream call to the payment-service is timing out. But the payment-service itself is a complex beast that calls three other microservices. Where is the actual delay?

Following the Trail: Cloud Trace

This is the perfect job for Cloud Trace, a distributed tracing system that acts like a GPS tracker for your requests.

Trace allows you to follow a single request as it travels through all the different services in your application. By instrumenting your code with a client library, Trace assigns a unique ID to each request and visualizes how much time was spent in each service call (each “span”).
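
What that instrumentation looks like depends on your stack. Here is a minimal sketch for a Python service using OpenTelemetry with the Cloud Trace exporter (opentelemetry-exporter-gcp-trace); the service and span names are made up for illustration:

  from opentelemetry import trace
  from opentelemetry.sdk.trace import TracerProvider
  from opentelemetry.sdk.trace.export import BatchSpanProcessor
  from opentelemetry.exporter.cloud_trace import CloudTraceSpanExporter

  # Register a tracer provider that exports finished spans to Cloud Trace.
  provider = TracerProvider()
  provider.add_span_processor(BatchSpanProcessor(CloudTraceSpanExporter()))
  trace.set_tracer_provider(provider)

  tracer = trace.get_tracer(__name__)

  def handle_payment(request_data):
      # Each "with" block becomes a span in the request's waterfall view.
      with tracer.start_as_current_span("payment-service.charge"):
          with tracer.start_as_current_span("fraud-detection-service.check"):
              ...  # the downstream call whose latency we want to see

In practice you would also propagate the trace context on outgoing HTTP calls (for example with OpenTelemetry's HTTP client instrumentation) so the spans from different services join into a single trace.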

You find a trace for a failed request and see a waterfall graph. It clearly shows the request spent 2,950ms of its 3,000ms timeout waiting for a response from the fraud-detection-service. We’ve found the slow service!

Zooming In on the Culprit: Cloud Profiler

We know the fraud-detection-service is the bottleneck, but why? Is it stuck in an inefficient loop? Is it struggling with memory allocation?

Cloud Profiler is the magnifying glass for your code’s performance. It continuously analyzes the CPU and memory usage of your applications in production with very low overhead. It helps you identify which specific functions or lines of code are the most resource-intensive.
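
Enabling Profiler is typically a few lines at startup. Here is a minimal sketch for a Python service using the google-cloud-profiler agent, with an illustrative service name and version:

  import googlecloudprofiler

  def main():
      try:
          googlecloudprofiler.start(
              service="fraud-detection-service",  # name shown in the Profiler UI
              service_version="1.0.0",
              verbose=0,  # 0 = errors only
          )
      except (ValueError, NotImplementedError) as exc:
          # Never let profiling problems take the service down.
          print(f"Profiler not started: {exc}")
      # ... start the application server here ...

  if __name__ == "__main__":
      main()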

You pull up the CPU profile for the fraud-detection-service from the last hour. The flame graph is a dead giveaway: a function called calculateRiskScoreV2 is consuming 95% of the CPU time. You’ve narrowed the problem down from a high-level alert to a single, specific function.

The Live Inspection: Cloud Debugger

The calculateRiskScoreV2 function is complex, and the bug only seems to happen with a specific type of user data that you can’t easily reproduce in your test environment. You need to see what’s happening inside that function in production, but you can’t just halt the service.

Cloud Debugger is your production X-ray. It lets you inspect the state of a live, running application without stopping or slowing it down for other users.
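
For a Python service, that means attaching the Debugger agent at startup; snapshots are then set from the console, not in code. A minimal sketch using the googleclouddebugger agent package, with illustrative module and version names:

  try:
      import googleclouddebugger
      googleclouddebugger.enable(
          module="fraud-detection-service",  # illustrative service name
          version="1.0.0",
      )
  except ImportError:
      # Agent not installed; the service still runs, just without snapshots.
      pass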

You connect Debugger to your GKE service and set a snapshot on a key line inside the calculateRiskScoreV2 function. The next time a request executes that line, Debugger captures the full call stack and the values of all local variables at that exact moment.

The snapshot reveals the issue: a variable that should contain a user ID is occasionally null, causing the risk-scoring algorithm to enter a massive, inefficient recalculation loop. You’ve found the root cause.

The Complete Observability Workflow

This investigative story shows how the Cloud Operations tools work together. For the exam, know which tool answers which question:

  1. Is something wrong?
    • Cloud Monitoring gives you the high-level picture with Alerts and Dashboards.
  2. What is the specific error?
    • Cloud Logging lets you dive into the details with the Logs Explorer to find the exact error message and context.
  3. Where is the latency in my distributed system?
    • Cloud Trace shows you the request’s journey across microservices to find the latency bottleneck.
  4. Which line of code is using the most CPU/memory?
    • Cloud Profiler analyzes your code’s performance to find the inefficient functions.
  5. What is the live state of my code at this exact moment?
    • Cloud Debugger lets you inspect variables and the call stack in production without halting the application.

Supporting Pillar: Archiving Logs with Sinks

While Cloud Logging is great for real-time analysis, storing all your logs there forever can be expensive. A sink is an automated rule for exporting logs to other destinations for compliance or long-term storage (a short creation example follows the list below):

  • Cloud Storage: For cheap, long-term archival.
  • BigQuery: For complex, SQL-based analysis.
  • Pub/Sub: For streaming logs to other tools like Splunk.
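
Sinks can be created in the console, with gcloud (see the quick reference at the end), or programmatically. Here is a minimal sketch with the Python client library (google-cloud-logging); the sink name, filter, and bucket are placeholders, and the sink's writer identity still needs write access to the destination bucket:

  from google.cloud import logging

  client = logging.Client()

  sink = client.sink(
      "error-archive-sink",                                     # placeholder name
      filter_="severity>=ERROR",                                # which entries to export
      destination="storage.googleapis.com/my-archive-bucket",   # bucket must already exist
  )
  sink.create()
  print("Created sink:", sink.name)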

Common Pitfalls & Best Practices

  • Pitfall: Not installing the Ops Agent on GCE VMs. You’ll be blind to what’s happening inside your instances.
  • Best Practice: Install the Ops Agent on all VMs as part of your standard build process.
  • Pitfall: Creating “noisy” alerts in Monitoring that lead to alert fatigue.
  • Best Practice: Tune your alert conditions carefully with appropriate durations. Always use the Documentation field to provide context and runbooks.
  • Pitfall: Forgetting to instrument your code for Trace and Profiler.
  • Best Practice: Add the necessary client libraries to your applications during development so that you have deep visibility when you need it most.
  • Pitfall: Storing all logs in Cloud Logging indefinitely.
  • Best Practice: Use Sinks to route logs to appropriate destinations. Keep logs in Cloud Logging for a short period (e.g., 30 days) and archive the rest to Cloud Storage.

Quick Reference Command Center

While much of the Cloud Operations suite is best used through the UI, here are some key commands.

  • Logging: read logs
    gcloud logging read "[FILTER]"
  • Logging: create a sink to Cloud Storage
    gcloud logging sinks create [SINK_NAME] storage.googleapis.com/my-bucket --log-filter="severity>=ERROR"
  • Monitoring: list notification channels
    gcloud alpha monitoring channels list
  • Monitoring: describe an uptime check
    gcloud alpha monitoring uptime-checks describe [CHECK_NAME]