Specifying Language in XHTML and HTML Content

An Overview of the W3C Internationalization Best Practices


Who Should Define the Language of Their HTML Content?

According to the W3C, everyone who writes HTML 4.01 or XHTML 1.0 documents should define the language of their Web content. Even if you assume that 100% of your audience is English-speaking, you need to define the language of your Web pages so that the user agents your readers rely on can correctly interpret the content.

More and more applications can use information about the natural language of the content (in other words, the human language it is written in, not the programming language) to present that content in ways that work for the user.

For example, the language information might be used for:

  • language specific searches
    Many search engines let you restrict results to a specific language. Right now they determine the language primarily by analyzing the page and the site the document is on, but if you define the language correctly in the metadata, your documents will show up more accurately in these searches.
  • spoken language accents in aural browsers
    Screen readers use the language metadata to choose the right pronunciation and accent when they read the page aloud.
  • translation tools
    Automated translation services need the base language of the content to provide accurate translations.
  • print style sheets
    The language that the HTML document is written in can affect how the page will print.
  • other style sheet properties
    For example, the CSS pseudo-element :first-letter needs to know the language of the content to correctly decide what constitutes the first letter. This may seem obvious, but some languages treat certain letter combinations as a single letter: the CSS specification notes that in Dutch a word-initial "ij" should be styled as one unit, which a browser can only do if it knows the content is Dutch.
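
As a rough illustration of language-sensitive styling, the sketch below uses the CSS :lang() selector so that quotations pick up quote marks that match the language of the surrounding text. The page text and the particular quote marks are assumptions made for this example; they are not taken from the W3C document.

```html
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<html lang="en">
<head>
<meta http-equiv="Content-type" content="text/html;charset=UTF-8">
<title>Language-sensitive quotes (example)</title>
<style type="text/css">
  /* Quote marks are selected by the language of each q element */
  q:lang(en) { quotes: "“" "”"; }
  q:lang(fr) { quotes: "«" "»"; }
</style>
</head>
<body>
<!-- Inherits lang="en" from the html element -->
<p>She said <q>hello</q>.</p>
<!-- Overrides the page language for this paragraph only -->
<p lang="fr">Elle a dit <q>bonjour</q>.</p>
</body>
</html>
```

Without the lang attributes, neither rule would have a declared language to match, so the browser would fall back to its default quote marks.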

How Do You Specify the Language in an XHTML or HTML Document?

First, remember that the language declaration is meant to cover the page as a whole. If you write Web pages with multiple languages on them, you should define the base language of the page, and then mark each passage in another language with its own language attribute on the element that contains it.

Then, you need to know that declaring the language is separate from declaring the character encoding of the document. Your server might define the character encoding automatically, but to be sure, it's a good idea to include the following meta tag in your XHTML and HTML documents:

<meta http-equiv="Content-type" content="text/html;charset=UTF-8" />

Finally, the language declaration does not specify the text direction. For example, in languages such as Hebrew and Arabic, text is displayed and read from right to left, but numbers and text from other scripts are displayed (and read) from left to right. To set the text direction for the entire page, use the dir attribute on the html element.
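
A minimal sketch of a right-to-left page, assuming a Hebrew document (the title and paragraph text are placeholders invented for this example). Note that lang and dir are set independently on the html element:

```html
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<html lang="he" dir="rtl">
<head>
<!-- Character encoding is declared separately from language and direction -->
<meta http-equiv="Content-type" content="text/html;charset=UTF-8">
<title>דוגמה</title>
</head>
<body>
<!-- lang="he" declares the language; dir="rtl" sets the base text direction -->
<p>שלום עולם</p>
</body>
</html>
```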

There are several ways you can define the language of your document in HTML and XHTML:

  • Attribute on the html tag
    Setting the lang attribute on the html tag will define the language for the entire document. This attribute can then be overridden within the HTML for content areas that contain text in another language.
  • Meta data
    You can use a "Content-language" meta tag or a Dublin Core "language" meta tag to define the language used by the document.
  • HTTP headers
    You can set up your Web server to send the language information in the HTTP headers.
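
The meta data and HTTP header options above might look something like the following sketch. The value en-US is only an example, and DC.language is assumed here as the conventional name for the Dublin Core language element in HTML meta tags:

```html
<head>
<!-- "Content-language" meta tag -->
<meta http-equiv="Content-Language" content="en-US" />
<!-- Dublin Core "language" meta tag (name assumed: DC.language) -->
<meta name="DC.language" content="en-US" />
</head>
<!-- The corresponding HTTP response header, sent by the Web server
     rather than written in the page, would read:
     Content-Language: en-US -->
```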

What is the Best Way to Define the Language of the Web Page Content?

According to the W3C best practices document, the best way to define the language of your HTML and XHTML documents is with an attribute on the html tag. They say:

Always declare the default language for text in the page using attributes on the html tag, unless the document contains content aimed at speakers of more than one language.

To define the language in an HTML 4.01 document:

<html lang="en-US">

This defines the language as being U.S. English.

If you're writing XHTML that is delivered as type "text/html", you should use both the lang attribute and the xml:lang attribute:

<html lang="en-US" xml:lang="en-US" xmlns="http://www.w3.org/1999/xhtml">

And if you're serving XHTML pages as XML (as you would with XHTML 1.1 documents), use the xml:lang attribute alone:

<html xml:lang="en-US" xmlns="http://www.w3.org/1999/xhtml">

What to Do with Bilingual Web Pages

The lang and xml:lang attributes do not allow you to assign multiple languages to a document. So if you're writing a Web page with multiple languages, you have two options:

  1. Define a primary language with the lang attribute, and then call out the secondary language(s) with lang attributes on elements in the document.
  2. Leave the lang attribute off the html element and declare the language on each division of the document instead.

To define the language of a section of the document, add the lang attribute to the appropriate element, such as a div element:

<div lang="fr-CA" xml:lang="fr-CA">
 Canadian French content...
</div>
<div lang="en-CA" xml:lang="en-CA">
 Canadian English content...
</div>
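
For comparison, the first option (a primary language on the html tag, with other languages called out on individual elements) could be sketched like this. The page text is invented for illustration, and the example assumes XHTML delivered as text/html, so both lang and xml:lang are used:

```html
<html lang="en-CA" xml:lang="en-CA" xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-type" content="text/html;charset=UTF-8" />
<title>Welcome</title>
</head>
<body>
<!-- Inherits the primary language, en-CA, from the html element -->
<p>Welcome to our site.</p>
<!-- Overrides the primary language for this one paragraph -->
<p lang="fr-CA" xml:lang="fr-CA">Bienvenue sur notre site.</p>
</body>
</html>
```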

©2014 About.com. All rights reserved.