feat: Add datadog metrics backend#703
Conversation
We currently require applications to implement their own metrics backend. This has resulted in metrics from the taskworker runtime having inconsistent tags and metric names, which makes building dashboards and alerting for taskworkers tedious. Having a more opinionated metrics backend will allow us to make metric names and tags that are important to observability structurally required. I've also included a prefixed metric shim which will allow us to switch metrics providers without gaps in observability. Refs STREAM-816
| try: | ||
| yield None | ||
| finally: | ||
| self._emit("timing", key, time.monotonic() - start, tags, sample_rate) |
There was a problem hiding this comment.
Timer sends seconds not milliseconds
High Severity
DatadogMetrics.timer emits elapsed wall time from time.monotonic() directly to DogStatsD timing, but that API expects values in milliseconds. Reported durations for RPC and task timers will be roughly three orders of magnitude too small, breaking latency dashboards and alerts.
Reviewed by Cursor Bugbot for commit fe820a1. Configure here.
|
|
||
|
|
||
| def _rss_bytes() -> int: | ||
| return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss |
There was a problem hiding this comment.
Memory metric uses peak RSS
Medium Severity
track_memory_usage derives its distribution from the difference in ru_maxrss, which is a process lifetime high-water mark, not current RSS. Memory freed inside the block does not lower the metric, and on Linux ru_maxrss is in kilobytes despite _rss_bytes implying bytes.
Reviewed by Cursor Bugbot for commit fe820a1. Configure here.
| def _rss_bytes() -> int: | ||
| return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss |
There was a problem hiding this comment.
Bug: The _rss_bytes() function returns kilobytes on Linux, not bytes, causing memory usage metrics to be underreported by a factor of 1024 in production.
Severity: HIGH
Suggested Fix
To ensure the function _rss_bytes() returns bytes consistently across platforms, multiply the result of resource.getrusage(resource.RUSAGE_SELF).ru_maxrss by 1024 on Linux systems. This can be achieved by adding a platform check.
Prompt for AI Agent
Review the code at the location below. A potential bug has been identified by an AI
agent. Verify if this is a real issue. If it is, propose a fix; if not, explain why it's
not valid.
Location: clients/python/src/taskbroker_client/metrics.py#L291-L292
Potential issue: The function `_rss_bytes()` returns memory usage from
`resource.getrusage().ru_maxrss`. On Linux, which is the production environment, this
value is in kilobytes. However, the function's name implies it returns bytes. This
function is used by `track_memory_usage` to calculate memory changes, causing all memory
usage distribution metrics to be reported with a value 1024 times smaller than the
actual memory change in bytes. This will lead to incorrect memory monitoring and
alerting.
Did we get this right? 👍 / 👎 to inform future reviews.
There was a problem hiding this comment.
This is consistent with the implementations we have in sentry and other applications though.
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
There are 3 total unresolved issues (including 2 from previous reviews).
❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.
Reviewed by Cursor Bugbot for commit 2078b84. Configure here.
| using taskbroker-client, without depending on a host-application metrics | ||
| prefix. |
There was a problem hiding this comment.
Bug: The code relies on a private datadog library attribute _container_id to disable origin detection, which is fragile and may break silently in future versions.
Severity: LOW
Suggested Fix
Instead of setting the private attribute _container_id, use the official public API provided by the DogStatsd constructor. Pass origin_detection_enabled=False during client initialization to ensure the setting is respected across library versions.
Prompt for AI Agent
Review the code at the location below. A potential bug has been identified by an AI
agent. Verify if this is a real issue. If it is, propose a fix; if not, explain why it's
not valid.
Location: clients/python/src/taskbroker_client/metrics.py#L156-L157
Potential issue: To disable DogStatsd origin detection, the code sets a private
attribute `client._container_id = None`. This relies on an internal implementation
detail of the `datadog` library, which is not guaranteed to be stable across versions. A
future update to the library could rename or remove this attribute, which would silently
re-enable origin detection without raising an error. The library provides a stable,
public API for this purpose which is not being used.
It should be aligned with the other methods.
| instance: str | None = None, | ||
| tags: Tags | None = None, | ||
| sample_rate: float = 1, | ||
| sample_rate: float | None = None, |
There was a problem hiding this comment.
This default value has been silly for a while. Now is a good time to change this.
| def _rss_bytes() -> int: | ||
| return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss |
There was a problem hiding this comment.
This is consistent with the implementations we have in sentry and other applications though.
| client.distribution.assert_called_once() | ||
| args, _ = client.distribution.call_args | ||
| assert args[0] == "taskworker.mem" | ||
| assert isinstance(args[1], int) |
There was a problem hiding this comment.
Bug: The track_memory_usage function incorrectly uses ru_maxrss, a high-water mark for memory, to calculate memory deltas. This can cause test failures and makes the production metric unreliable.
Severity: MEDIUM
Suggested Fix
To accurately measure memory usage within the context, replace resource.getrusage with a method that measures current memory usage, not a historical peak. For example, use a library like psutil to get the current resident set size (RSS) at the beginning and end of the block. The difference between these two values will provide a more accurate measurement of the memory consumed within that specific context.
Prompt for AI Agent
Review the code at the location below. A potential bug has been identified by an AI
agent. Verify if this is a real issue. If it is, propose a fix; if not, explain why it's
not valid.
Location: clients/python/tests/test_metrics.py#L140
Potential issue: The function `track_memory_usage` calculates memory usage deltas using
`resource.getrusage(resource.RUSAGE_SELF).ru_maxrss`. This value represents the peak
memory usage (a high-water mark) of the process, not the current usage. In the test
`test_track_memory_usage`, an assertion `args[1] > 0` expects a positive memory delta
after an allocation. However, if previous tests have already set a high memory peak, the
new allocation may not increase it, resulting in a delta of zero and a test failure.
This also affects production, where the metric will likely report zero for most tasks,
making it ineffective for its intended purpose of tracking memory usage per task.


We currently require applications to implement their own metrics backend. This has resulted in metrics from the taskworker runtime having inconsistent tags and metric names, which makes building dashboards and alerting for taskworkers tedious.
Having a more opinionated metrics backend will allow us to make metric names and tags that are important to observability structurally required. I've also included a prefixed metric shim which will allow us to switch metrics providers without gaps in observability.
Refs STREAM-816