Skip to content

feat: Add datadog metrics backend#703

Open
markstory wants to merge 3 commits into
mainfrom
metrics-adapter
Open

feat: Add datadog metrics backend#703
markstory wants to merge 3 commits into
mainfrom
metrics-adapter

Conversation

@markstory

Copy link
Copy Markdown
Member

We currently require applications to implement their own metrics backend. This has resulted in metrics from the taskworker runtime having inconsistent tags and metric names, which makes building dashboards and alerting for taskworkers tedious.

Having a more opinionated metrics backend will allow us to make metric names and tags that are important to observability structurally required. I've also included a prefixed metric shim which will allow us to switch metrics providers without gaps in observability.

Refs STREAM-816

We currently require applications to implement their own metrics
backend. This has resulted in metrics from the taskworker runtime having
inconsistent tags and metric names, which makes building dashboards and
alerting for taskworkers tedious.

Having a more opinionated metrics backend will allow us to make metric
names and tags that are important to observability structurally
required. I've also included a prefixed metric shim which will allow us
to switch metrics providers without gaps in observability.

Refs STREAM-816
@markstory markstory requested a review from a team as a code owner June 12, 2026 20:10
@linear-code

linear-code Bot commented Jun 12, 2026

Copy link
Copy Markdown

STREAM-816

try:
yield None
finally:
self._emit("timing", key, time.monotonic() - start, tags, sample_rate)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Timer sends seconds not milliseconds

High Severity

DatadogMetrics.timer emits elapsed wall time from time.monotonic() directly to DogStatsD timing, but that API expects values in milliseconds. Reported durations for RPC and task timers will be roughly three orders of magnitude too small, breaking latency dashboards and alerts.

Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit fe820a1. Configure here.



def _rss_bytes() -> int:
return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Memory metric uses peak RSS

Medium Severity

track_memory_usage derives its distribution from the difference in ru_maxrss, which is a process lifetime high-water mark, not current RSS. Memory freed inside the block does not lower the metric, and on Linux ru_maxrss is in kilobytes despite _rss_bytes implying bytes.

Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit fe820a1. Configure here.

Comment thread clients/python/src/taskbroker_client/metrics.py
Comment on lines +291 to +292
def _rss_bytes() -> int:
return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bug: The _rss_bytes() function returns kilobytes on Linux, not bytes, causing memory usage metrics to be underreported by a factor of 1024 in production.
Severity: HIGH

Suggested Fix

To ensure the function _rss_bytes() returns bytes consistently across platforms, multiply the result of resource.getrusage(resource.RUSAGE_SELF).ru_maxrss by 1024 on Linux systems. This can be achieved by adding a platform check.

Prompt for AI Agent
Review the code at the location below. A potential bug has been identified by an AI
agent. Verify if this is a real issue. If it is, propose a fix; if not, explain why it's
not valid.

Location: clients/python/src/taskbroker_client/metrics.py#L291-L292

Potential issue: The function `_rss_bytes()` returns memory usage from
`resource.getrusage().ru_maxrss`. On Linux, which is the production environment, this
value is in kilobytes. However, the function's name implies it returns bytes. This
function is used by `track_memory_usage` to calculate memory changes, causing all memory
usage distribution metrics to be reported with a value 1024 times smaller than the
actual memory change in bytes. This will lead to incorrect memory monitoring and
alerting.

Did we get this right? 👍 / 👎 to inform future reviews.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is consistent with the implementations we have in sentry and other applications though.

@cursor cursor Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

There are 3 total unresolved issues (including 2 from previous reviews).

Fix All in Cursor

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

Reviewed by Cursor Bugbot for commit 2078b84. Configure here.

Comment thread clients/python/src/taskbroker_client/metrics.py
Comment thread clients/python/src/taskbroker_client/metrics.py
Comment on lines +156 to +157
using taskbroker-client, without depending on a host-application metrics
prefix.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bug: The code relies on a private datadog library attribute _container_id to disable origin detection, which is fragile and may break silently in future versions.
Severity: LOW

Suggested Fix

Instead of setting the private attribute _container_id, use the official public API provided by the DogStatsd constructor. Pass origin_detection_enabled=False during client initialization to ensure the setting is respected across library versions.

Prompt for AI Agent
Review the code at the location below. A potential bug has been identified by an AI
agent. Verify if this is a real issue. If it is, propose a fix; if not, explain why it's
not valid.

Location: clients/python/src/taskbroker_client/metrics.py#L156-L157

Potential issue: To disable DogStatsd origin detection, the code sets a private
attribute `client._container_id = None`. This relies on an internal implementation
detail of the `datadog` library, which is not guaranteed to be stable across versions. A
future update to the library could rename or remove this attribute, which would silently
re-enable origin detection without raising an error. The library provides a stable,
public API for this purpose which is not being used.

It should be aligned with the other methods.
instance: str | None = None,
tags: Tags | None = None,
sample_rate: float = 1,
sample_rate: float | None = None,

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This default value has been silly for a while. Now is a good time to change this.

Comment on lines +291 to +292
def _rss_bytes() -> int:
return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is consistent with the implementations we have in sentry and other applications though.

client.distribution.assert_called_once()
args, _ = client.distribution.call_args
assert args[0] == "taskworker.mem"
assert isinstance(args[1], int)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bug: The track_memory_usage function incorrectly uses ru_maxrss, a high-water mark for memory, to calculate memory deltas. This can cause test failures and makes the production metric unreliable.
Severity: MEDIUM

Suggested Fix

To accurately measure memory usage within the context, replace resource.getrusage with a method that measures current memory usage, not a historical peak. For example, use a library like psutil to get the current resident set size (RSS) at the beginning and end of the block. The difference between these two values will provide a more accurate measurement of the memory consumed within that specific context.

Prompt for AI Agent
Review the code at the location below. A potential bug has been identified by an AI
agent. Verify if this is a real issue. If it is, propose a fix; if not, explain why it's
not valid.

Location: clients/python/tests/test_metrics.py#L140

Potential issue: The function `track_memory_usage` calculates memory usage deltas using
`resource.getrusage(resource.RUSAGE_SELF).ru_maxrss`. This value represents the peak
memory usage (a high-water mark) of the process, not the current usage. In the test
`test_track_memory_usage`, an assertion `args[1] > 0` expects a positive memory delta
after an allocation. However, if previous tests have already set a high memory peak, the
new allocation may not increase it, resulting in a delta of zero and a test failure.
This also affects production, where the metric will likely report zero for most tasks,
making it ineffective for its intended purpose of tracking memory usage per task.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant