feat: Add datadog metrics backend by markstory · Pull Request #703 · getsentry/taskbroker

markstory · 2026-06-12T20:09:59Z

We currently require applications to implement their own metrics backend. This has resulted in metrics from the taskworker runtime having inconsistent tags and metric names, which makes building dashboards and alerting for taskworkers tedious.

Having a more opinionated metrics backend will allow us to make metric names and tags that are important to observability structurally required. I've also included a prefixed metric shim which will allow us to switch metrics providers without gaps in observability.

Refs STREAM-816

We currently require applications to implement their own metrics backend. This has resulted in metrics from the taskworker runtime having inconsistent tags and metric names, which makes building dashboards and alerting for taskworkers tedious. Having a more opinionated metrics backend will allow us to make metric names and tags that are important to observability structurally required. I've also included a prefixed metric shim which will allow us to switch metrics providers without gaps in observability. Refs STREAM-816

linear-code · 2026-06-12T20:10:03Z

STREAM-816

cursor · 2026-06-12T20:12:00Z

+        try:
+            yield None
+        finally:
+            self._emit("timing", key, time.monotonic() - start, tags, sample_rate)


Timer sends seconds not milliseconds

High Severity

DatadogMetrics.timer emits elapsed wall time from time.monotonic() directly to DogStatsD timing, but that API expects values in milliseconds. Reported durations for RPC and task timers will be roughly three orders of magnitude too small, breaking latency dashboards and alerts.

^{Reviewed by Cursor Bugbot for commit fe820a1. Configure here.}

cursor · 2026-06-12T20:12:00Z

+
+
+def _rss_bytes() -> int:
+    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss


Memory metric uses peak RSS

Medium Severity

track_memory_usage derives its distribution from the difference in ru_maxrss, which is a process lifetime high-water mark, not current RSS. Memory freed inside the block does not lower the metric, and on Linux ru_maxrss is in kilobytes despite _rss_bytes implying bytes.

^{Reviewed by Cursor Bugbot for commit fe820a1. Configure here.}

sentry · 2026-06-12T20:12:00Z

+def _rss_bytes() -> int:
+    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss


Bug: The _rss_bytes() function returns kilobytes on Linux, not bytes, causing memory usage metrics to be underreported by a factor of 1024 in production.
_{Severity: HIGH}

Suggested Fix

To ensure the function _rss_bytes() returns bytes consistently across platforms, multiply the result of resource.getrusage(resource.RUSAGE_SELF).ru_maxrss by 1024 on Linux systems. This can be achieved by adding a platform check.

Prompt for AI Agent

Review the code at the location below. A potential bug has been identified by an AI agent. Verify if this is a real issue. If it is, propose a fix; if not, explain why it's not valid. Location: clients/python/src/taskbroker_client/metrics.py#L291-L292 Potential issue: The function `_rss_bytes()` returns memory usage from `resource.getrusage().ru_maxrss`. On Linux, which is the production environment, this value is in kilobytes. However, the function's name implies it returns bytes. This function is used by `track_memory_usage` to calculate memory changes, causing all memory usage distribution metrics to be reported with a value 1024 times smaller than the actual memory change in bytes. This will lead to incorrect memory monitoring and alerting.

_{Did we get this right? 👍 / 👎 to inform future reviews.}

This is consistent with the implementations we have in sentry and other applications though.

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

There are 3 total unresolved issues (including 2 from previous reviews).

^{❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.}

^{Reviewed by Cursor Bugbot for commit 2078b84. Configure here.}

sentry · 2026-06-12T20:42:46Z

+    using taskbroker-client, without depending on a host-application metrics
+    prefix.


Bug: The code relies on a private datadog library attribute _container_id to disable origin detection, which is fragile and may break silently in future versions.
_{Severity: LOW}

Suggested Fix

Instead of setting the private attribute _container_id, use the official public API provided by the DogStatsd constructor. Pass origin_detection_enabled=False during client initialization to ensure the setting is respected across library versions.

Prompt for AI Agent

Review the code at the location below. A potential bug has been identified by an AI agent. Verify if this is a real issue. If it is, propose a fix; if not, explain why it's not valid. Location: clients/python/src/taskbroker_client/metrics.py#L156-L157 Potential issue: To disable DogStatsd origin detection, the code sets a private attribute `client._container_id = None`. This relies on an internal implementation detail of the `datadog` library, which is not guaranteed to be stable across versions. A future update to the library could rename or remove this attribute, which would silently re-enable origin detection without raising an error. The library provides a stable, public API for this purpose which is not being used.

It should be aligned with the other methods.

markstory · 2026-06-12T20:46:26Z

        instance: str | None = None,
        tags: Tags | None = None,
-        sample_rate: float = 1,
+        sample_rate: float | None = None,


This default value has been silly for a while. Now is a good time to change this.

markstory · 2026-06-12T20:47:56Z

+def _rss_bytes() -> int:
+    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss


This is consistent with the implementations we have in sentry and other applications though.

sentry · 2026-06-12T20:48:34Z

+    client.distribution.assert_called_once()
+    args, _ = client.distribution.call_args
+    assert args[0] == "taskworker.mem"
+    assert isinstance(args[1], int)


Bug: The track_memory_usage function incorrectly uses ru_maxrss, a high-water mark for memory, to calculate memory deltas. This can cause test failures and makes the production metric unreliable.
_{Severity: MEDIUM}

Suggested Fix

To accurately measure memory usage within the context, replace resource.getrusage with a method that measures current memory usage, not a historical peak. For example, use a library like psutil to get the current resident set size (RSS) at the beginning and end of the block. The difference between these two values will provide a more accurate measurement of the memory consumed within that specific context.

Prompt for AI Agent

Review the code at the location below. A potential bug has been identified by an AI agent. Verify if this is a real issue. If it is, propose a fix; if not, explain why it's not valid. Location: clients/python/tests/test_metrics.py#L140 Potential issue: The function `track_memory_usage` calculates memory usage deltas using `resource.getrusage(resource.RUSAGE_SELF).ru_maxrss`. This value represents the peak memory usage (a high-water mark) of the process, not the current usage. In the test `test_track_memory_usage`, an assertion `args[1] > 0` expects a positive memory delta after an allocation. However, if previous tests have already set a high memory peak, the new allocation may not increase it, resulting in a delta of zero and a test failure. This also affects production, where the metric will likely report zero for most tasks, making it ineffective for its intended purpose of tracking memory usage per task.

markstory requested a review from a team as a code owner June 12, 2026 20:10

cursor Bot reviewed Jun 12, 2026

View reviewed changes

sentry Bot reviewed Jun 12, 2026

View reviewed changes

Port more configuration from sentry in

2078b84

cursor Bot reviewed Jun 12, 2026

View reviewed changes

Comment thread clients/python/src/taskbroker_client/metrics.py

sentry Bot reviewed Jun 12, 2026

View reviewed changes

Fix up default sample_rate for gauge()

0c255ff

It should be aligned with the other methods.

markstory commented Jun 12, 2026

View reviewed changes

sentry Bot reviewed Jun 12, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

feat: Add datadog metrics backend#703

feat: Add datadog metrics backend#703
markstory wants to merge 3 commits into
mainfrom
metrics-adapter

markstory commented Jun 12, 2026

Uh oh!

linear-code Bot commented Jun 12, 2026

Uh oh!

cursor Bot Jun 12, 2026

Uh oh!

cursor Bot Jun 12, 2026

Uh oh!

Uh oh!

sentry Bot Jun 12, 2026

Uh oh!

markstory Jun 12, 2026

Uh oh!

cursor Bot left a comment

Uh oh!

Uh oh!

Uh oh!

sentry Bot Jun 12, 2026

Uh oh!

markstory Jun 12, 2026

Uh oh!

markstory Jun 12, 2026

Uh oh!

sentry Bot Jun 12, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant



		def _rss_bytes() -> int:
		return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss

		using taskbroker-client, without depending on a host-application metrics
		prefix.

Uh oh!

Conversation

markstory commented Jun 12, 2026

Uh oh!

linear-code Bot commented Jun 12, 2026

Uh oh!

cursor Bot Jun 12, 2026

Choose a reason for hiding this comment

Timer sends seconds not milliseconds

Uh oh!

cursor Bot Jun 12, 2026

Choose a reason for hiding this comment

Memory metric uses peak RSS

Uh oh!

Uh oh!

sentry Bot Jun 12, 2026

Choose a reason for hiding this comment

Uh oh!

markstory Jun 12, 2026

Choose a reason for hiding this comment

Uh oh!

cursor Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

sentry Bot Jun 12, 2026

Choose a reason for hiding this comment

Uh oh!

markstory Jun 12, 2026

Choose a reason for hiding this comment

Uh oh!

markstory Jun 12, 2026

Choose a reason for hiding this comment

Uh oh!

sentry Bot Jun 12, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant