Skip to content

Implement "Followup improvements for ext/uri" RFC - WHATWG URL building#22268

Open
kocsismate wants to merge 2 commits into
php:masterfrom
kocsismate:uri-followup4
Open

Implement "Followup improvements for ext/uri" RFC - WHATWG URL building#22268
kocsismate wants to merge 2 commits into
php:masterfrom
kocsismate:uri-followup4

Conversation

@kocsismate

Copy link
Copy Markdown
Member

@kocsismate kocsismate requested a review from TimWolla as a code owner June 10, 2026 15:10
@kocsismate kocsismate changed the title IImplement "Followup improvements for ext/uri" RFC - WHATWG URL building Implement "Followup improvements for ext/uri" RFC - WHATWG URL building Jun 10, 2026
@kocsismate kocsismate marked this pull request as draft June 10, 2026 16:12
goto failure;
}

if (lexbor_base_url != NULL) {

@kocsismate kocsismate Jun 25, 2026

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's a big shame is that apparently it's not possible to properly use the builder with a base URL :(

  • if we try to add the base URL after the input URL is built, then legitimate relative URLs are rejected (e.g. /foo + https://example.com), because /foo is not a valid URL on its own
  • If we try to build the input URL with the base URL in the same time, then the parsing algorithm must be used (currently, only setters are used with a hack on line 821). Then the complication is to find some URL component that is suitable for parsing:
    • for special, full URLs: scheme + host is needed at least (e.g. https://example.com)
    • for non-special full URLs: scheme is needed at least (e.g. https://)
    • for relative URLs: the path is needed at least (e.g. /foo/bar)

And in the 2nd case, the question arises if it's ok to parse only the minimally required components and then set the rest of the components, or the whole input URL must be built and parsed all at once.

@kocsismate kocsismate marked this pull request as ready for review June 25, 2026 08:19
@kocsismate kocsismate requested a review from ndossche June 25, 2026 13:32
@kocsismate

Copy link
Copy Markdown
Member Author

May I have a review soon so that this can potentially be included into alpha 2 at least? :)

Comment thread ext/uri/php_uri.c Outdated
return c >= '0' && c <= '9';
}

ZEND_ATTRIBUTE_NONNULL zend_result php_uri_parser_whatwg_validate_scheme(const zend_string *scheme)

@kocsismate kocsismate Jul 1, 2026

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's possible to get rid of this validation if we don't want to reimplement the related WHATWG URL checks. The rest of the components surely cannot be validated early during the setter calls because they are mostly context-sensitive.

Comment thread ext/uri/uri_parser_whatwg.c Outdated
Comment thread ext/uri/uri_parser_whatwg.c Outdated
* The URL is initialized as LXB_URL_SCHEMEL_TYPE__UNDEF but this would prevent the scheme to be updated
* in case of non-special schemes due to https://github.com/php/php-src/blob/27d7b799c0a13578ee0506b428b8ddc209ffb010/ext/lexbor/lexbor/url/url.c#L1402
*/
if (!php_uri_parser_whatwg_is_special_scheme(scheme)) {

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll try to check if it's something that we can omit somehow...

Comment thread ext/uri/php_uri.c
Comment thread ext/uri/uri_parser_whatwg.c Outdated
Comment thread ext/uri/uri_parser_whatwg.c Outdated

const zend_string *str = Z_STR_P(scheme);

switch (ZSTR_LEN(str)) {

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

zend_string_equals_literal_ci already has a length check built-in. The compiler can factor this out for the common cases. Therefore, you can skip checking this length explicitly, and just do one giant return zend_string_equals_literal_ci(str, "ws") || zend_string_equals_literal_ci(str, "ftp") || ...;
The compiler is smart enough to optimise this, and then the code becomes (imo) a bit cleaner

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The original logic remained, but I could no longer use zend_string_equals_literal_ci() because newline and tab characters had to be stripped from the scheme per the spec.

@TimWolla TimWolla left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Had a very superficial first look.

Comment thread ext/uri/uri_parser_whatwg.c Outdated
Comment thread ext/uri/uri_parser_whatwg.c Outdated
return true;
}

ZEND_ATTRIBUTE_NONNULL zend_result php_uri_parser_whatwg_validate_port(const zend_long port)

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The port is disregarded by the port setter if the URL cannot have a port (e.g. there is no scheme, see the "ext/uri/tests/whatwg/builder/port_success_empty_opaque_host.phpt" test at https://github.com/php/php-src/pull/22268/changes#diff-38d974ee6050457f68d4650d100c80b641befeb781332f7a05822ec1d2d7b87d), so it's a question whether this validation can be implemented. :(

Related WHATWG behavior:

A URL cannot have a username/password/port if its host is null or the empty string, or its scheme is "file".

However, the above concept is only part of some setters, setting a username/password/port for missing/empty hosts is rejected during parsing. So I guess the Builder should adhere to parsing rules, and therefore invalid ports, as well as setting username/password/port should be rejected via our custom checks (similar to the ones that the RFC3986 Builder used).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants