A URL is designed to be a universal address for a resource on the web. To ensure this universality, the URL standard permits only a limited subset of ASCII characters: unaccented Latin letters, digits, and a handful of punctuation marks. When a URL contains non-ASCII characters, such as accented letters (é), symbols (©), or characters from non-Latin scripts (你好), they must be translated into a format that all systems can understand. While modern browsers and search engines are adept at handling these characters, their presence can still lead to issues with crawling, sharing, and analytics.

Think of it as writing an address for an international package. Using the local language for the street name might work, but translating it into a universally recognized format ensures it won’t get lost in translation by an older sorting machine. For URLs, this translation process is called ‘percent-encoding.’ For a broader look at URL best practices, see our guide on on-page SEO.


Percent-Encoding: The Universal Language of URLs

When a browser encounters a non-ASCII character in a URL, it converts the character to its UTF-8 bytes and writes each byte as a percent sign followed by two hexadecimal digits. For example, the URL `https://example.com/caffè` becomes `https://example.com/caff%C3%A8`, because è is stored as the two bytes C3 and A8 in UTF-8. While Google is generally smart enough to understand that these two URLs point to the same content, this can still cause problems:

  • Crawling and Linking Issues: Some older crawlers or external systems may not handle the encoding correctly, leading to broken links or crawl errors.
  • Sharing and Usability: A long, encoded URL can be intimidating for users to copy, paste, or share.
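
To see the encoding for yourself, here is a minimal Python sketch using the standard library's `urllib.parse` module. The helper name `encode_path` is our own, chosen for illustration; it only shows how a path such as `/caffè` turns into its percent-encoded form and back.

```python
from urllib.parse import quote, unquote


def encode_path(path: str) -> str:
    """Percent-encode a URL path as UTF-8, leaving '/' untouched."""
    return quote(path, safe="/")


encoded = encode_path("/caffè")
print(encoded)           # /caff%C3%A8
print(unquote(encoded))  # /caffè
```

Note how a single accented letter adds six characters to the URL, which is why encoded URLs quickly become hard to read and share.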

For a deep dive into this topic, this guide from Ahrefs on URL structure is an excellent resource.

A Step-by-Step Guide to Cleaning Your URLs

The best practice is to avoid non-ASCII characters in your URL slugs from the start. If they already exist, you should standardize them. For Google’s perspective on this, their guide on URL structure is a must-read.

Example: Fixing a Non-ASCII URL

Before: `https://example.com/caffè`

After: `https://example.com/caffe` (and a 301 redirect from the old URL to the new one)
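
One way to generate the ASCII version of a slug is sketched below in Python. This is an illustration under stated assumptions, not a definitive implementation: the function name `ascii_slug` is hypothetical, and the approach (Unicode decomposition, then dropping anything that is not ASCII) only works for accented Latin letters. Characters with no ASCII base letter, such as 你好, are simply removed, so those slugs need a manual transliteration or translation instead. The 301 redirect itself is configured in your server or CMS and is not shown here.

```python
import re
import unicodedata


def ascii_slug(text: str) -> str:
    """Return a lowercase, hyphen-separated, ASCII-only slug."""
    # Split accented letters into base letter + combining mark (è -> e + grave).
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop every character that has no ASCII representation, including the marks.
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    # Collapse runs of anything that is not a letter or digit into one hyphen.
    return re.sub(r"[^a-z0-9]+", "-", ascii_only.lower()).strip("-")


print(ascii_slug("caffè"))         # caffe
print(ascii_slug("Crème Brûlée"))  # creme-brulee
```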

For more on this topic, see our guide on URL parameters.

Frequently Asked Questions

What is ASCII?

ASCII (American Standard Code for Information Interchange) is a character encoding standard that covers 128 characters: the unaccented letters of the English alphabet, digits, basic punctuation, and a set of control codes. URLs were originally designed to use only a subset of these characters, which is why characters from other languages, and even some ASCII characters such as spaces, need to be encoded.

Does this apply to domain names?

Yes. Domain names with non-ASCII characters are called Internationalized Domain Names (IDNs). They are converted into an ASCII-only format called Punycode (e.g., `xn--…`) to be compatible with the DNS. While modern browsers can display them correctly, they can still cause issues with some systems.
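
As a rough illustration, Python's built-in `idna` codec can show what the Punycode form of a hostname looks like. The hostname `café.example` and the helper name `to_ascii_hostname` are assumptions made for this example. Keep in mind that the built-in codec follows the older IDNA 2003 rules, so treat this as a sketch rather than a reference conversion.

```python
def to_ascii_hostname(hostname: str) -> str:
    """Convert an internationalized hostname to its ASCII (Punycode) form."""
    return hostname.encode("idna").decode("ascii")


print(to_ascii_hostname("café.example"))      # xn--caf-dma.example
print(b"xn--caf-dma.example".decode("idna"))  # café.example
```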

How can I find all the non-ASCII URLs on my site?

The most effective way is to use a website crawler like Creeper. It will scan your site and identify any internal links that contain non-ASCII characters, allowing you to update them to their properly encoded versions.
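
If you already have a list of URLs, for example from a sitemap export, a quick check along these lines can flag candidates for review. This is a simplified sketch and not how Creeper or any other crawler works internally; the function names and sample URLs are hypothetical. It flags both raw non-ASCII characters and percent-encoded bytes of 0x80 and above, which is how non-ASCII characters appear once encoded.

```python
import re

# Matches %80 through %FF, i.e. percent-encoded bytes outside the ASCII range.
ENCODED_HIGH_BYTE = re.compile(r"%[89A-Fa-f][0-9A-Fa-f]")


def has_non_ascii(url: str) -> bool:
    """True if the URL contains raw or percent-encoded non-ASCII characters."""
    return (not url.isascii()) or bool(ENCODED_HIGH_BYTE.search(url))


def find_non_ascii_urls(urls: list[str]) -> list[str]:
    """Return the URLs from the list that need attention, in original order."""
    return [url for url in urls if has_non_ascii(url)]


sample = [
    "https://example.com/caffe",
    "https://example.com/caffè",
    "https://example.com/caff%C3%A8",
    "https://example.com/my%20page",
]
print(find_non_ascii_urls(sample))
# ['https://example.com/caffè', 'https://example.com/caff%C3%A8']
```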

Ready to untangle your URLs? Start your Creeper audit today and ensure your URL structure is clean and effective.