| Title: | Structural Analysis and Pattern Discovery in URL Datasets |
|---|---|
| Description: | Offers tools for parsing and analyzing URL datasets, extracting key components and identifying common patterns. It aids in examining website architecture and identifying SEO issues, helping users optimize web presence and content strategy. |
| Authors: | Marek Prokop [aut, cre] |
| Maintainer: | Marek Prokop <[email protected]> |
| License: | MIT + file LICENSE |
| Version: | 0.1.0 |
| Built: | 2026-05-16 06:43:36 UTC |
| Source: | https://github.com/marekprokop/urlexplorer |
Count fragments in URLs
count_fragments(url, sort = FALSE, name = "n")
| url | A character vector of URLs. |
| sort | Logical indicating whether to sort the output by count. Defaults to FALSE. |
| name | The name of the column containing the counts. Defaults to 'n'. |
A tibble with each fragment and its count.
count_fragments(c("http://example.com#top", "http://example.com#bottom"))
Count different hosts found in URLs
count_hosts(url, sort = FALSE, name = "n")
| url | A character vector of URLs. |
| sort | Logical indicating whether to sort the output by count. Defaults to FALSE. |
| name | The name of the column containing the counts. Defaults to 'n'. |
A tibble with each host and its count.
count_hosts(c("http://example.com", "http://www.example.com"))
Count different parameter names in query strings
count_param_names(query, sort = FALSE, name = "n")
| query | A character vector of query strings. |
| sort | Logical indicating whether to sort the output by count. Defaults to FALSE. |
| name | The name of the column containing the counts. Defaults to 'n'. |
A tibble with each parameter name and how often it occurs.
count_param_names(c("param1=value1&param2=value2", "param3=value3"))
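The counting described above can be approximated in base R. This is a minimal sketch under stated assumptions, not the package's implementation: `count_param_names_sketch` and the `param_name` column are hypothetical names, and the result is a plain data frame rather than a tibble. It also shows how `sort` and `name` arguments of this kind typically affect the output.

```r
# Hypothetical base-R approximation; not the code used by urlexplorer.
count_param_names_sketch <- function(query, sort = FALSE, name = "n") {
  # Split every query string into "key=value" pairs.
  pairs <- unlist(strsplit(query, "&", fixed = TRUE))
  # Keep only the part before the first "=".
  keys <- sub("=.*$", "", pairs)
  counts <- table(keys)
  out <- data.frame(
    param_name = names(counts),   # assumed column name
    n = as.integer(counts),
    stringsAsFactors = FALSE
  )
  # Optionally order rows from most to least frequent.
  if (sort) out <- out[order(-out$n), , drop = FALSE]
  # Rename the count column as requested.
  names(out)[2] <- name
  rownames(out) <- NULL
  out
}

res <- count_param_names_sketch(
  c("param1=value1&param2=value2", "param1=value3"),
  sort = TRUE,
  name = "count"
)
print(res)
```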
Count different values for a specified parameter across query strings
count_param_values(query, param_name, sort = FALSE, name = "n")
| query | A character vector of query strings. |
| param_name | The name of the parameter whose values to count. |
| sort | Logical indicating whether to sort the output by count. Defaults to FALSE. |
| name | The name of the column containing the counts. Defaults to 'n'. |
A tibble with each value of the specified parameter and how often it occurs.
count_param_values(c("param1=value1&param2=value2", "param1=value3"), "param1")
Count occurrences of specific path segments at a given index
count_path_segments(path, segment_index, sort = FALSE, name = "n")
| path | A character vector of paths. |
| segment_index | Index of the segment to count. |
| sort | Logical indicating whether to sort the output by count. Defaults to FALSE. |
| name | The name of the column containing the counts. Defaults to 'n'. |
A tibble with each segment at the specified index and how often it occurs.
count_path_segments(c("/path/to/resource", "/path/to/shop"), 2)
Count different paths found in URLs
count_paths(url, sort = FALSE, name = "n")
| url | A character vector of URLs. |
| sort | Logical indicating whether to sort the output by count. Defaults to FALSE. |
| name | The name of the column containing the counts. Defaults to 'n'. |
A tibble with each path and its count.
count_paths(c("http://example.com/index", "http://example.com/home"))
Count different port numbers used in URLs
count_ports(url, sort = FALSE, name = "n")
| url | A character vector of URLs. |
| sort | Logical indicating whether to sort the output by count. Defaults to FALSE. |
| name | The name of the column containing the counts. Defaults to 'n'. |
A tibble with each port and how many times it occurs.
count_ports(c("http://example.com:8080", "http://example.com:80"))
Count the occurrence of query strings in URLs
count_queries(url, sort = FALSE, name = "n")
| url | A character vector of URLs. |
| sort | Logical indicating whether to sort the output by count. Defaults to FALSE. |
| name | The name of the column containing the counts. Defaults to 'n'. |
A tibble with each query string and how often it occurs.
count_queries(c("http://example.com?query1=value1", "http://example.com?query2=value2"))
Count different schemes used in URLs
count_schemes(url, sort = FALSE, name = "n")
| url | A character vector of URLs. |
| sort | Logical indicating whether to sort the output by count. Defaults to FALSE. |
| name | The name of the column containing the counts. Defaults to 'n'. |
A tibble with each scheme and its count.
count_schemes(c("http://example.com", "https://example.com"))
Count occurrences of userinfo in URLs
count_userinfos(url, sort = FALSE, name = "n")
| url | A character vector of URLs. |
| sort | Logical indicating whether to sort the output by count. Defaults to FALSE. |
| name | The name of the column containing the counts. Defaults to 'n'. |
A tibble listing userinfos and how often each occurs.
count_userinfos(c("http://user:[email protected]", "http://example.com"))
This function parses each input URL or path and extracts the file extension, if present. It is particularly useful for identifying the type of files referenced in URLs.
extract_file_extension(url)
| url | A character vector of URLs or paths from which to extract file extensions. |
A character vector with the file extension for each URL or path. Extensions are returned without the dot (e.g., "jpg" instead of ".jpg"), and URLs or paths without extensions will return NA.
extract_file_extension(c("http://example.com/image.jpg", "https://example.com/archive.zip", "http://example.com/"))
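The return rules above (extension without the dot, NA when there is none) can be illustrated with a short base-R sketch. It is an approximation under assumptions, not the package's code: `file_extension_sketch` is a hypothetical helper, and treating only alphanumeric text after the final dot of the last path segment as an extension is a choice made here for illustration.

```r
# Hypothetical base-R approximation; not the code used by urlexplorer.
file_extension_sketch <- function(x) {
  # Drop the query string and fragment so they are not mistaken for an extension.
  x <- sub("[?#].*$", "", x)
  # Drop "scheme://host" so a bare host such as example.com yields no extension.
  x <- sub("^[A-Za-z][A-Za-z0-9+.-]*://[^/]*", "", x)
  # Keep only the last path segment.
  last <- sub("^.*/", "", x)
  # Capture the alphanumeric text after the final dot, if any.
  hits <- regmatches(last, regexec("\\.([A-Za-z0-9]+)$", last))
  vapply(
    hits,
    function(hit) if (length(hit) == 2) hit[2] else NA_character_,
    character(1)
  )
}

ext <- file_extension_sketch(c(
  "http://example.com/image.jpg",
  "https://example.com/archive.zip",
  "http://example.com/"
))
print(ext)
```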
Extract the fragment from URL
extract_fragment(url)
| url | A character vector of URLs. |
A character vector containing the fragment from each URL, if present.
extract_fragment(c("http://example.com/#sec1", "http://example.com/#sec2"))
Extract the host from URL
extract_host(url)
| url | A character vector of URLs. |
A character vector containing the host from each URL.
extract_host(c("https://example.com", "http://www.example.com"))
Extract the value of a specified parameter from the query string
extract_param_value(query, param_name)
| query | A character vector of query strings. |
| param_name | The name of the parameter to extract values for. |
A character vector containing the value of the specified parameter from each query string.
extract_param_value(c("param1=val1&param2=val2", "param1=val3"), "param1")
Extract the path from URL
extract_path(url)
| url | A character vector of URLs. |
A character vector containing the path from each URL.
extract_path(c("http://example.com/", "http://example.com/path/to/resource"))
Extract a specific segment from a path
extract_path_segment(path, segment_index)
| path | A character vector of paths. |
| segment_index | The index of the segment to extract. |
A character vector containing the specified segment from each path.
extract_path_segment(c("/path/to/resource", "/another/path/"), 2)
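The reference does not spell out how segments are numbered. The base-R sketch below shows one plausible reading, offered as an assumption rather than the package's documented behavior: the leading slash is ignored, segments are counted from 1, and a path with too few segments yields NA. `path_segment_sketch` is a hypothetical name.

```r
# Hypothetical base-R approximation; not the code used by urlexplorer.
path_segment_sketch <- function(path, segment_index) {
  # Remove the leading slash so the first segment has index 1 (assumed convention).
  segments <- strsplit(sub("^/", "", path), "/", fixed = TRUE)
  vapply(
    segments,
    function(s) if (length(s) >= segment_index) s[segment_index] else NA_character_,
    character(1)
  )
}

second <- path_segment_sketch(c("/path/to/resource", "/another/path/"), 2)
fourth <- path_segment_sketch(c("/path/to/resource", "/another/path/"), 4)
print(second)
print(fourth)
```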
Extract the port number from URL
extract_port(url)
| url | A character vector of URLs. |
A character vector containing the port number from each URL, if specified.
extract_port(c("http://example.com:8080"))
Extract the query from URL
extract_query(url)
| url | A character vector of URLs. |
A character vector containing the query string from each URL.
extract_query(c("http://example.com?query1=value1&query2=value2", "http://example.com?query1=value3"))
Extract the scheme from URL
extract_scheme(url)
| url | A character vector of URLs. |
A character vector containing the scheme from each URL.
extract_scheme(c("http://example.com", "https://example.com"))
Extract userinfo from URL
extract_userinfo(url)
| url | A character vector of URLs. |
A character vector containing the userinfo from each URL, if present.
extract_userinfo(c("http://user:[email protected]"))
Split host into subdomains and domain
split_host(host)
| host | A character vector of hostnames to be split. |
A tibble with one row per hostname and columns for the top-level domain, domain, and subdomains. The number of columns matches the number of components in the hosts, and the columns are named tld, domain, subdomain_1, subdomain_2, etc.
split_host(c("subdomain.example.com"))
split_host(c("subdomain2.subdomain1.example.com", "example.com"))
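The column naming above can be sketched in base R. This is an illustration under assumptions, not the package's implementation: `split_host_sketch` is a hypothetical helper that returns a plain data frame, treats the last dot-separated label as the TLD (so multi-label suffixes such as co.uk are not handled), and numbers subdomains outward from the domain.

```r
# Hypothetical base-R approximation; not the code used by urlexplorer.
split_host_sketch <- function(host) {
  # Split on dots and reverse, so the TLD comes first.
  parts <- lapply(strsplit(host, ".", fixed = TRUE), rev)
  width <- max(lengths(parts))
  # Pad shorter hosts with NA so every row has the same number of columns.
  padded <- lapply(parts, function(p) c(p, rep(NA_character_, width - length(p))))
  out <- as.data.frame(do.call(rbind, padded), stringsAsFactors = FALSE)
  # Build tld, domain, subdomain_1, subdomain_2, ... and keep as many as needed.
  labels <- c("tld", "domain", sprintf("subdomain_%d", seq_len(max(width - 2, 0))))
  names(out) <- labels[seq_len(width)]
  out
}

hosts <- split_host_sketch(c("subdomain2.subdomain1.example.com", "example.com"))
print(hosts)
```

Hosts with fewer components than the longest one get NA in the unused subdomain columns in this sketch.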
Split path into segments
split_path(path)
| path | A character vector of paths to be split. |
A tibble with one row per path and columns for each segment separated by '/'.
split_path(c("/path/to/resource"))
Split query into parameters
split_query(query)
| query | A character vector of query strings to be split. |
A tibble with one row per query string and one column per parameter, with the parameter names used as column names.
split_query(c("param1=value1&param2=value2"))
Split URL into its constituent parts
split_url(url)
| url | A character vector of URLs to be split. |
A tibble with one row per URL and columns for each component: scheme, host, port, userinfo, path, query, and fragment.
split_url(c("https://example.com/path?query=arg#frag"))
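To make the seven components concrete, the sketch below decomposes a URL in base R using the generic regular expression from RFC 3986, Appendix B, and then splits the authority into userinfo, host, and port. It is a hedged illustration, not the package's parser: `split_url_sketch` and `rfc3986_pattern` are hypothetical names, the result is a plain data frame, and the handling of absent components is an assumption because this reference does not state how the package represents them.

```r
# Hypothetical base-R approximation; not the code used by urlexplorer.
# Generic URI expression from RFC 3986, Appendix B, anchored at both ends.
rfc3986_pattern <- "^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\\?([^#]*))?(#(.*))?$"

split_url_sketch <- function(url) {
  hits <- regmatches(url, regexec(rfc3986_pattern, url))
  # Element 1 is the whole match, so capture group i sits at position i + 1.
  group <- function(i) vapply(hits, function(h) h[i + 1], character(1))

  authority <- group(4)
  has_userinfo <- grepl("@", authority, fixed = TRUE)
  host_port <- sub("^.*@", "", authority)
  has_port <- grepl(":[0-9]+$", host_port)

  data.frame(
    scheme = group(2),
    host = sub(":[0-9]*$", "", host_port),
    port = ifelse(has_port, sub("^.*:", "", host_port), NA_character_),
    userinfo = ifelse(has_userinfo, sub("@.*$", "", authority), NA_character_),
    path = group(5),
    query = group(7),
    fragment = group(9),
    stringsAsFactors = FALSE
  )
}

parts <- split_url_sketch("https://user@example.com:8080/path?query=arg#frag")
print(parts)
```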
Sample web site URLs
websitepages
A data frame with 1,000 rows and 1 column:
Page URL
...
Synthetic data