Hyperlink API¶
Hyperlink provides Pythonic URL parsing, construction, and rendering.
Usage is straightforward:
>>> import hyperlink
>>> url = hyperlink.parse(u'http://github.com/mahmoud/hyperlink?utm_source=docs')
>>> url.host
u'github.com'
>>> secure_url = url.replace(scheme=u'https')
>>> secure_url.get('utm_source')[0]
u'docs'
Hyperlink’s API centers on the DecodedURL type, which wraps the lower-level URL. Both can be returned by the parse() convenience function.
Creation¶
Before you can work with URLs, you must create URLs.
Parsing Text¶
If you already have a textual URL, the easiest way to get URL objects is with the parse() function:
hyperlink.parse(url, decoded=True, lazy=False)
Automatically turn text into a structured URL object.
>>> url = parse(u"https://github.com/python-hyper/hyperlink")
>>> print(url.to_text())
https://github.com/python-hyper/hyperlink
Parameters:
- url – A text string representation of a URL.
- decoded – Whether or not to return a DecodedURL, which automatically handles all encoding/decoding/quoting/unquoting for all the various accessors of parts of the URL, or a URL, which has the same API, but requires handling of special characters for different parts of the URL.
- lazy – In the case of decoded=True, this controls whether the URL is decoded immediately or as accessed. The default, lazy=False, checks all encoded parts of the URL for decodability.
New in version 18.0.0.
By default, parse() returns an instance of DecodedURL, a URL type that handles all encoding for you by wrapping the lower-level URL.
DecodedURL¶
class hyperlink.DecodedURL(url=URL.from_text(u''), lazy=False, query_plus_is_space=None)
DecodedURL is a type designed to act as a higher-level interface to URL and the recommended type for most operations. By analogy, DecodedURL is the unicode to URL’s bytes.
DecodedURL automatically handles encoding and decoding all its components, such that all inputs and outputs are in a maximally-decoded state. Note that this means, for some special cases, a URL may not “roundtrip” character-for-character, but this is considered a good tradeoff for the safety of automatic encoding.
Otherwise, DecodedURL has almost exactly the same API as URL.
Where applicable, a UTF-8 encoding is presumed. Be advised that some interactions can raise UnicodeEncodeErrors and UnicodeDecodeErrors, just like when working with bytestrings. Examples of such interactions include handling query strings encoding binary data, and paths containing segments with special characters encoded with codecs other than UTF-8.
Parameters:
- url – A URL object to wrap.
- lazy – Set to True to avoid pre-decoding all parts of the URL to check for validity. Defaults to False.
- query_plus_is_space – Whether + characters in the query string should be treated as spaces when decoding. If unspecified, the default is taken from the scheme.
Note
The DecodedURL initializer takes a URL object, not URL components as the URL constructor does. To programmatically construct a DecodedURL, you can use this pattern:
>>> print(DecodedURL().replace(scheme=u'https',
... host=u'pypi.org', path=(u'projects', u'hyperlink')).to_text())
https://pypi.org/projects/hyperlink
New in version 18.0.0.
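The remark that a maximally-decoded URL may not “roundtrip” character-for-character can be seen with a small experiment. The sketch below uses only the standard library's urllib.parse, not hyperlink, and the sample string is made up for illustration: decoding and then re-encoding yields an equivalent string that is not identical to the input.

```python
from urllib.parse import quote, unquote

# Hypothetical input in which "A" is needlessly percent-encoded as %41.
encoded = "caf%C3%A9%41"

decoded = unquote(encoded)    # every percent-escape becomes text
reencoded = quote(decoded)    # only characters that need escaping are escaped

print(encoded)
print(reencoded)
```

Both strings decode to the same text, which is the sense in which the decoded form is safe to work with even though the original spelling is lost.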
The Encoded URL¶
The lower-level URL looks very similar to the DecodedURL, but does not handle all encoding cases for you. Use with caution.
Note
URL is also available as an alias, hyperlink.EncodedURL, for more explicit usage.
class hyperlink.URL(scheme=None, host=None, path=(), query=(), fragment=u'', port=None, rooted=None, userinfo=u'', uses_netloc=None)
From blogs to billboards, URLs are so common that it’s easy to overlook their complexity and power. With hyperlink’s URL type, working with URLs doesn’t have to be hard.
URLs are made of many parts. Most of these parts are officially named in RFC 3986 and this diagram may prove handy in identifying them:
  foo://user:pass@example.com:8042/over/there?name=ferret#nose
  \_/   \_______/ \_________/ \__/\_________/ \_________/ \__/
   |        |          |        |      |           |        |
scheme  userinfo     host     port   path       query   fragment
While from_text() is used for parsing whole URLs, the URL constructor builds a URL from the individual components, like so:
>>> from hyperlink import URL
>>> url = URL(scheme=u'https', host=u'example.com', path=[u'hello', u'world'])
>>> print(url.to_text())
https://example.com/hello/world
The constructor runs basic type checks. All strings are expected to be text (str in Python 3, unicode in Python 2). All arguments are optional, defaulting to appropriately empty values. A full list of constructor arguments is below.
Parameters:
- scheme – The text name of the scheme.
- host – The host portion of the network location.
- port – The port part of the network location. If None or no port is passed, the port will default to the default port of the scheme, if it is known. See the SCHEME_PORT_MAP and register_default_port() for more info.
- path – A tuple of strings representing the slash-separated parts of the path, each percent-encoded.
- query – The query parameters, as a dictionary or as a sequence of percent-encoded key-value pairs.
- fragment – The fragment part of the URL.
- rooted – A rooted URL is one which indicates an absolute path. This is True on any URL that includes a host, or any relative URL that starts with a slash.
- userinfo – The username or colon-separated username:password pair.
- uses_netloc – Indicates whether :// (the “netloc separator”) will appear to separate the scheme from the path in cases where no host is present. Setting this to True is a non-spec-compliant affordance for the common practice of having URIs that are not URLs (cannot have a ‘host’ part) but nevertheless use the common :// idiom that most people associate with URLs; e.g. message: URIs like message://message-id being equivalent to message:message-id. This may be inferred based on the scheme depending on whether register_scheme() has been used to register the scheme and should not be passed directly unless you know the scheme works like this and you know it has not been registered.
All of these parts are also exposed as read-only attributes of URL instances, along with several useful methods.
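To make the component names from the diagram concrete, the sketch below splits the same RFC 3986 example using only the standard library's urllib.parse rather than hyperlink itself. It is an orientation aid under that assumption: the standard library keeps the path and query as single strings, so the `segments` and `pairs` variables (names chosen here for illustration) approximate the tuple-of-segments and sequence-of-pairs shapes described in the parameter list.

```python
from urllib.parse import parse_qsl, urlsplit

parts = urlsplit("foo://user:pass@example.com:8042/over/there?name=ferret#nose")

scheme = parts.scheme
userinfo = "%s:%s" % (parts.username, parts.password)
host = parts.hostname
port = parts.port
# Split the single path string into its slash-separated segments.
segments = tuple(parts.path.lstrip("/").split("/"))
# Split the single query string into (key, value) pairs.
pairs = tuple(parse_qsl(parts.query, keep_blank_values=True))
fragment = parts.fragment

print(scheme, userinfo, host, port, segments, pairs, fragment)
```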
classmethod URL.from_text(text)
Whereas the URL constructor is useful for constructing URLs from parts, from_text() supports parsing whole URLs from their string form:
>>> URL.from_text(u'http://example.com')
URL.from_text(u'http://example.com')
>>> URL.from_text(u'?a=b&x=y')
URL.from_text(u'?a=b&x=y')
As you can see above, it’s also used as the repr() of URL objects. It is the natural counterpart to to_text(). This method only accepts text, so be sure to decode those bytestrings.
Parameters: text – A valid URL string.
Returns: The structured object version of the parsed string.
Return type: URL
Note
Somewhat unexpectedly, URLs are a far more permissive format than most would assume. Many strings which don’t look like URLs are still valid URLs. As a result, this method only raises URLParseError on invalid port and IPv6 values in the host portion of the URL.
Transformation¶
Once a URL is created, some of the most common tasks are to transform it into other URLs and text.
URL.to_text(with_password=False)
Render this URL to its textual representation.
By default, the URL text will not include a password, if one is set. RFC 3986 considers using URLs to represent such sensitive information as deprecated. Quoting from RFC 3986, section 3.2.1:
“Applications should not render as clear text any data after the first colon (“:”) character found within a userinfo subcomponent unless the data after the colon is the empty string (indicating no password).”
Parameters:
- with_password – Whether or not to include the password in the URL text. Defaults to False.
Returns: The serialized textual representation of this URL, such as u"http://example.com/some/path?some=query".
Return type: Text
The natural counterpart to URL.from_text().
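The RFC 3986 rule quoted above amounts to dropping whatever follows the first colon of the userinfo before display. The following is a minimal sketch of that rule; `hide_password` is a hypothetical helper, not part of hyperlink, and keeping the trailing colon in the output is an assumption of this sketch rather than a documented detail of to_text().

```python
def hide_password(userinfo):
    """Drop everything after the first colon in a userinfo string.

    Illustrative helper only. The colon itself is kept, so the result
    still signals that a password was present.
    """
    name, colon, _password = userinfo.partition(":")
    return name + colon


print(hide_password("user:pass"))
print(hide_password("user"))
```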
URL.to_uri()
Make a new URL instance with all non-ASCII characters appropriately percent-encoded. This is useful to do in preparation for sending a URL over a network protocol.
For example:
>>> URL.from_text(u'https://ايران.com/foo⇧bar/').to_uri()
URL.from_text(u'https://xn--mgba3a4fra.com/foo%E2%87%A7bar/')
Returns: A new instance with its path segments, query parameters, and hostname encoded, so that they are all in the standard US-ASCII range.
Return type: URL
URL.to_iri()
Make a new URL instance with all but a few reserved characters decoded into human-readable format.
Percent-encoded Unicode and IDNA-encoded hostnames are decoded, like so:
>>> url = URL.from_text(u'https://xn--mgba3a4fra.example.com/foo%E2%87%A7bar/')
>>> print(url.to_iri().to_text())
https://ايران.example.com/foo⇧bar/
Note
As a general Python issue, “narrow” (UCS-2) builds of Python may not be able to fully decode certain URLs, and in those cases, this method will return a best-effort, partially-decoded URL which is still valid. This issue does not affect any Python builds 3.4+.
Returns: A new instance with its path segments, query parameters, and hostname decoded for display purposes.
Return type: URL
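The two conversions above combine IDNA encoding for the hostname with UTF-8 percent-encoding for the other parts. The sketch below reproduces both directions using only the standard library (the built-in "idna" codec and urllib.parse); the helper names and the sample host are hypothetical, and this is not how hyperlink implements to_uri() or to_iri(). In particular, `quote` with `safe=""` also escapes reserved ASCII characters, and `unquote` decodes every escape, whereas to_iri() is described as leaving a few reserved characters encoded.

```python
from urllib.parse import quote, unquote


def host_to_ascii(host):
    # The built-in "idna" codec turns each non-ASCII label into Punycode.
    return host.encode("idna").decode("ascii")


def host_to_unicode(host):
    # Reverse direction: Punycode labels back to readable text.
    return host.encode("ascii").decode("idna")


def segment_to_ascii(segment):
    # Percent-encode the UTF-8 bytes of anything outside the unreserved set.
    return quote(segment, safe="")


def segment_to_unicode(segment):
    # Decode percent-escapes, interpreting the bytes as UTF-8.
    return unquote(segment)


print(host_to_ascii("b\u00fccher.example"))
print(segment_to_ascii("foo\u21e7bar"))
```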
URL.replace(scheme=Sentinel('_UNSET'), host=Sentinel('_UNSET'), path=Sentinel('_UNSET'), query=Sentinel('_UNSET'), fragment=Sentinel('_UNSET'), port=Sentinel('_UNSET'), rooted=Sentinel('_UNSET'), userinfo=Sentinel('_UNSET'), uses_netloc=Sentinel('_UNSET'))
URL objects are immutable, which means that attributes are designed to be set only once, at construction. Instead of modifying an existing URL, one simply creates a copy with the desired changes.
If any of the following arguments is omitted, it defaults to the value on the current URL.
Parameters:
- scheme – The text name of the scheme.
- host – The host portion of the network location.
- path – A tuple of strings representing the slash-separated parts of the path.
- query – The query parameters, as a dictionary or as a sequence of key-value pairs.
- fragment – The fragment part of the URL.
- port – The port part of the network location.
- rooted – Whether or not the path begins with a slash.
- userinfo – The username or colon-separated username:password pair.
- uses_netloc – Indicates whether :// (the “netloc separator”) will appear to separate the scheme from the path in cases where no host is present. Setting this to True is a non-spec-compliant affordance for the common practice of having URIs that are not URLs (cannot have a ‘host’ part) but nevertheless use the common :// idiom that most people associate with URLs; e.g. message: URIs like message://message-id being equivalent to message:message-id. This may be inferred based on the scheme depending on whether register_scheme() has been used to register the scheme and should not be passed directly unless you know the scheme works like this and you know it has not been registered.
Returns: A copy of the current URL, with new values for parameters passed.
Return type: URL
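The copy-instead-of-mutate pattern described for replace() is the same one the standard library offers for named tuples. The sketch below is an analogy only: `Link` is a hypothetical stand-in type with three of the fields listed above, not hyperlink's URL class.

```python
from typing import NamedTuple, Tuple


class Link(NamedTuple):
    """Hypothetical immutable record with a few URL-like fields."""
    scheme: str
    host: str
    path: Tuple[str, ...] = ()


original = Link("http", "example.com", ("docs",))

# _replace builds a new tuple; omitted fields keep their current values.
secure = original._replace(scheme="https")

print(original)
print(secure)
```

Because the original value is never modified, it can be shared freely between callers without defensive copying.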
URL.normalize(scheme=True, host=True, path=True, query=True, fragment=True, userinfo=True, percents=True)
Return a new URL object with several standard normalizations applied:
- Decode unreserved characters (RFC 3986 2.3)
- Uppercase remaining percent-encoded octets (RFC 3986 2.1)
- Convert scheme and host casing to lowercase (RFC 3986 3.2.2)
- Resolve any “.” and “..” references in the path (RFC 3986 6.2.2.3)
- Ensure an ending slash on URLs with an empty path (RFC 3986 6.2.3)
- Encode any stray percent signs (%) in percent-encoded fields (path, query, fragment, userinfo) (RFC 3986 2.4)
All are applied by default, but normalizations can be disabled per-part by passing False for that part’s corresponding name.
Parameters:
- scheme – Convert the scheme to lowercase
- host – Convert the host to lowercase
- path – Normalize the path (see above for details)
- query – Normalize the query string
- fragment – Normalize the fragment
- userinfo – Normalize the userinfo
- percents – Encode isolated percent signs for any percent-encoded fields which are being normalized (defaults to True).
>>> url = URL.from_text(u'Http://example.COM/a/../b/./c%2f?%61%')
>>> print(url.normalize().to_text())
http://example.com/b/c%2F?a%25
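To show how the listed rules combine on the example above, here is a rough standard-library sketch of four of them: case folding, escape normalization, stray-percent encoding, and dot-segment resolution. All function names are hypothetical and this is not hyperlink's implementation; it works on already-split parts and skips the trailing-slash and empty-path rules.

```python
import re

UNRESERVED = set(
    "abcdefghijklmnopqrstuvwxyz"
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "0123456789-._~"
)
_ESCAPE = re.compile(r"%([0-9A-Fa-f]{2})")
_STRAY_PERCENT = re.compile(r"%(?![0-9A-Fa-f]{2})")


def normalize_escapes(text):
    """Decode unreserved escapes, uppercase the rest, encode stray '%'."""
    def fix(match):
        char = chr(int(match.group(1), 16))
        if char in UNRESERVED:
            return char
        return "%" + match.group(1).upper()

    # Encode lone percent signs first so they become valid escapes.
    text = _STRAY_PERCENT.sub("%25", text)
    return _ESCAPE.sub(fix, text)


def resolve_dots(segments):
    """Resolve '.' and '..' segments in a sequence of path segments."""
    resolved = []
    for segment in segments:
        if segment == ".":
            continue
        if segment == "..":
            if resolved:
                resolved.pop()
            continue
        resolved.append(segment)
    return tuple(resolved)


def normalize_parts(scheme, host, segments, query):
    """Apply the sketched rules to pre-split URL parts."""
    return (
        scheme.lower(),
        host.lower(),
        tuple(normalize_escapes(s) for s in resolve_dots(segments)),
        normalize_escapes(query),
    )


print(normalize_parts("Http", "example.COM", ("a", "..", "b", ".", "c%2f"), "%61%"))
```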
Query Parameters¶
CRUD operations on the query string multimap.
URL.get(name)
Get a list of values for the given query parameter, name:
>>> url = URL.from_text(u'?x=1&x=2')
>>> url.get('x')
[u'1', u'2']
>>> url.get('y')
[]
If the given name is not set, an empty list is returned. A list is always returned, and this method raises no exceptions.
Parameters: name – The name of the query parameter to get.
Returns: A list of all the values associated with the key, in string form.
Return type: List[Optional[Text]]
URL.add(name, value=None)
Make a new URL instance with a given query argument, name, added to it with the value value, like so:
>>> URL.from_text(u'https://example.com/?x=y').add(u'x')
URL.from_text(u'https://example.com/?x=y&x')
>>> URL.from_text(u'https://example.com/?x=y').add(u'x', u'z')
URL.from_text(u'https://example.com/?x=y&x=z')
Parameters:
- name – The name of the query parameter to add. The part before the =.
- value – The value of the query parameter to add. The part after the =. Defaults to None, meaning no value.
Returns: A new URL instance with the parameter added.
Return type: URL
URL.set(name, value=None)
Make a new URL instance with the query parameter name set to value. All existing occurrences, if any, are replaced by the single name-value pair.
>>> URL.from_text(u'https://example.com/?x=y').set(u'x')
URL.from_text(u'https://example.com/?x')
>>> URL.from_text(u'https://example.com/?x=y').set(u'x', u'z')
URL.from_text(u'https://example.com/?x=z')
Parameters:
- name – The name of the query parameter to set. The part before the =.
- value – The value of the query parameter to set. The part after the =. Defaults to None, meaning no value.
Returns: A new URL instance with the parameter set.
Return type: URL
URL.remove(name, value=Sentinel('_UNSET'), limit=None)
Make a new URL instance with occurrences of the query parameter name removed, or, if value is set, parameters matching name and value. No exception is raised if the parameter is not already set.
Parameters:
- name – The name of the query parameter to remove.
- value – Optional value to additionally filter on. Setting this removes query parameters which match both name and value.
- limit – Optional maximum number of parameters to remove.
Returns: A new URL instance with the parameter removed.
Return type: URL
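The four query operations treat the query as an ordered multimap: a tuple of (name, value) pairs in which a name may repeat and a value may be None. The sketch below models that with plain tuples to show the semantics; the `q_*` functions are hypothetical, not hyperlink's methods, and where `q_set` places the new pair (appended at the end) is an assumption of this sketch.

```python
def q_get(query, name):
    """Return every value stored under name (an empty list if none)."""
    return [value for key, value in query if key == name]


def q_add(query, name, value=None):
    """Return a new tuple with one more pair; existing pairs are kept."""
    return query + ((name, value),)


def q_remove(query, name):
    """Return a new tuple without any pair stored under name."""
    return tuple(pair for pair in query if pair[0] != name)


def q_set(query, name, value=None):
    """Replace all pairs under name with a single pair.

    Assumption: the new pair is appended after the old ones are dropped.
    """
    return q_remove(query, name) + ((name, value),)


query = (("x", "1"), ("x", "2"), ("flag", None))

print(q_get(query, "x"))
print(q_set(query, "x", "9"))
print(q_remove(query, "x"))
```

Each function returns a new tuple and leaves its input untouched, mirroring the way the methods above return new URL instances.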
Attributes¶
URLs have many parts, and URL objects have many attributes to represent them.
URL.absolute
Whether or not the URL is “absolute”. Absolute URLs are complete enough to resolve to a network resource without being relative to a base URI.
>>> URL.from_text(u'http://wikipedia.org/').absolute
True
>>> URL.from_text(u'?a=b&c=d').absolute
False
Absolute URLs must have both a scheme and a host set.
URL.scheme
The scheme is a string, and the first part of an absolute URL, the part before the first colon, and the part which defines the semantics of the rest of the URL. Examples include “http”, “https”, “ssh”, “file”, “mailto”, and many others. See register_scheme() for more info.
URL.host
The host is a string, and the second standard part of an absolute URL. When present, a valid host must be a domain name, or an IP (v4 or v6). It occurs before the first slash, or the second colon, if a port is provided.
URL.port
The port is an integer that is commonly used in connecting to the host, and almost never appears without it.
When not present in the original URL, this attribute defaults to the scheme’s default port. If the scheme’s default port is not known, and the port is not provided, this attribute will be set to None.
>>> URL.from_text(u'http://example.com/pa/th').port
80
>>> URL.from_text(u'foo://example.com/pa/th').port
>>> URL.from_text(u'foo://example.com:8042/pa/th').port
8042
Note
Per the standard, when the port is the same as the scheme’s default port, it will be omitted in the text URL.
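The defaulting and omission rules for ports can be summarized in a few lines. The sketch below is a hypothetical model, not hyperlink's code: `DEFAULT_PORTS` is an illustrative two-entry table and the function names are made up.

```python
# Illustrative subset of scheme defaults; hyperlink ships its own table.
DEFAULT_PORTS = {"http": 80, "https": 443}


def effective_port(scheme, port=None):
    """Return the explicit port, else the scheme default, else None."""
    if port is not None:
        return port
    return DEFAULT_PORTS.get(scheme)


def netloc(scheme, host, port=None):
    """Render host[:port], omitting a port equal to the scheme default."""
    port = effective_port(scheme, port)
    if port is None or port == DEFAULT_PORTS.get(scheme):
        return host
    return "%s:%d" % (host, port)


print(effective_port("http"), effective_port("foo"), effective_port("foo", 8042))
print(netloc("http", "example.com", 80), netloc("http", "example.com", 8080))
```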
URL.path
A tuple of strings, created by splitting the slash-separated hierarchical path. Started by the first slash after the host, terminated by a “?”, which indicates the start of the query string.
URL.query
Tuple of pairs, created by splitting the ampersand-separated mapping of keys and optional values representing non-hierarchical data used to identify the resource. Keys are always strings. Values are strings when present, or None when missing.
For more operations on the mapping, see get(), add(), set(), and remove().
URL.fragment
A string, the last part of the URL, indicated by the first “#” after the path or query. Enables indirect identification of a secondary resource, like an anchor within an HTML page.
URL.userinfo
The colon-separated string forming the username-password combination.
URL.rooted
Whether or not the path starts with a forward slash (/).
This is taken from the terminology in the BNF grammar, specifically the “path-rootless” rule, since “absolute path” and “absolute URI” are somewhat ambiguous. path does not contain the implicit prefixed "/" since that is somewhat awkward to work with.
Low-level functions¶
A couple of notable helpers used by the URL type.
class hyperlink.URLParseError
Exception inheriting from ValueError, raised when failing to parse a URL. Mostly raised on invalid ports and IPv6 addresses.
hyperlink.register_scheme(text, uses_netloc=True, default_port=None, query_plus_is_space=True)
Registers new scheme information, resulting in correct port and slash behavior from the URL object. There are dozens of standard schemes preregistered, so this function is mostly meant for proprietary internal customizations or stopgaps on missing standards information. If a scheme seems to be missing, please file an issue!
Parameters:
- text – A string representation of the scheme. (the ‘http’ in ‘http://hatnote.com’)
- uses_netloc – Does the scheme support specifying a network host? For instance, “http” does, “mailto” does not. Defaults to True.
- default_port – The default port, if any, for netloc-using schemes.
- query_plus_is_space – If true, a “+” in the query string should be decoded as a space by DecodedURL.