Encode disallowed characters around already percent-encoded triples#151
Encode disallowed characters around already percent-encoded triples#151gaoflow wants to merge 1 commit into
Conversation
| _GEN_DELIMS: t.Final[str] = ":/?#[]@" | ||
| _SUB_DELIMS: t.Final[str] = "!$&'()*+,;=" | ||
| _RESERVED_CHARACTERS: t.Final[str] = f"{_GEN_DELIMS}{_SUB_DELIMS}" | ||
| _PERCENT_ENCODED: t.Final["re.Pattern[str]"] = re.compile("%[0-9A-Fa-f]{2}") |
There was a problem hiding this comment.
| _PERCENT_ENCODED: t.Final["re.Pattern[str]"] = re.compile("%[0-9A-Fa-f]{2}") | |
| _PERCENT_ENCODED: t.Final[re.Pattern[str]] = re.compile("%[0-9A-Fa-f]{2}") |
Furthermore, this does not check for whether or not the percent encoding is a valid percent encoding. This just checks if it looks like one. Text that doesn't have a valid percent encoding could still be confused here.
There was a problem hiding this comment.
Applied the annotation suggestion, thanks.
On the validity point: you're right that the regex only matches things shaped like a pct-triple rather than verifying intent, but that look-alike pass-through is what RFC 6570 specifies for reserved ({+}) and fragment ({#}) expansion. §3.2.3 copies pct-encoded triples through unchanged, and the pct-encoded = "%" HEXDIG HEXDIG production is purely syntactic, so an already-encoded %41 is meant to be passed as-is. The library already did this for a bare "%41"; the bug here is that a following disallowed character caused the whole tail (triple included) to be re-encoded, which contradicts that same rule. So the change narrows behavior back to the RFC rather than adding new look-alike matching.
Strict validation (encode %XX that isn't a 'real' encoding) would be a deliberate divergence from reserved-expansion semantics and its own change - happy to open a separate issue if you'd want to explore that.
Reserved ("+") and fragment ("#") expansion must preserve already valid
percent-encoded triples (RFC 6570 Section 3.2.3). The current implementation
does this by returning the value untouched as soon as `urllib.parse.unquote`
detects any triple, which also skips encoding every other disallowed
character in the value.
As a result `URITemplate("{+v}").expand(v="a b%20c")` produced "a b%20c"
(space left raw) instead of "a%20b%20c". This is a regression from the fix
for issue python-hyper#99, which addressed the opposite problem of double-encoding.
Quote the value segment by segment instead: pass valid percent-encoded
triples through verbatim and percent-encode everything between them. Output
is unchanged for values with no triples or with only triples, so existing
behavior (and all current tests) is preserved.
c37149c to
7ece9a1
Compare
Problem
Reserved (
{+var}) and fragment ({#var}) expansion is supposed to preserve already valid percent-encoded triples while still percent-encoding any other disallowed characters (RFC 6570 Section 3.2.3).Operator._only_quote_unquoted_charactersdoes this with an all-or-nothing check: ifurllib.parse.unquote(value) != value(i.e. any triple is present) it returns the value untouched, which also skips encoding every other disallowed character.This is a regression introduced by the fix for #99, which addressed the opposite problem (valid triples being double-encoded).
Fix
Quote the value segment by segment: pass valid percent-encoded triples through verbatim and percent-encode the text between them. Output is identical for values with no triples or with only triples, so existing behavior is preserved — all current tests still pass, and the new case is now correct:
Tests
Added
test_reserved_expansion_encodes_around_pct_triples. It fails on the current code (a b%20c→a%20b%20c) and passes with the fix; the full suite (59 tests) is green.black,isort, andmypyare clean.