Converting HTML to XHTML

Learn how to convert an HTML Web page to XHTML with this detailed example.

In our Introducing XHTML article, we took a look at how XHTML differs from regular HTML 4. In this article, you'll learn how to convert an HTML 4 Web page to fully standards-compliant XHTML 1.0 by working through a practical example.

The HTML 4 page

Take a look at the page we're going to convert. This page validates to HTML 4.01 Transitional. The source markup looks like this:


<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
"http://www.w3.org/TR/html4/loose.dtd">
<HTML>
  <HEAD>
    <TITLE>My cat called Lucky</TITLE>
    <META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=utf-8">
  </HEAD>
  <BODY>

    <A NAME="top"> </A>

    <H1>My cat called Lucky</H1>

    I have a cat called Lucky. She is black & white, and nearly
    twelve years old.<P>
    
    I found her through a pet rescue service. She didn't like her
    old home because it had a big scary dog in it that used to
    frighten her. When I first got her she was very scared and
    hid under the table for a whole week! Nowadays she is still
    a bit jittery but much more relaxed.<P>

    Here is a picture of Lucky in the garden.<P>

    <IMG SRC="images/lucky-being-stroked.jpg" ALT="Lucky" WIDTH=400
    HEIGHT=300 BORDER=0>

    <BR><BR>
    
    She is very good at catching mice. She also catches birds,
    which can be a problem. Now that she has a collar and bell,
    though, she catches fewer birds.<P>

    <H2>Email Lucky!</H2>

    Use the form below to send Lucky an email. You never know -
    she might even reply, if she's not too busy!<P>

    <FORM METHOD="post" ACTION="mailform.cgi">
      Your email: <INPUT TYPE="text" NAME="email"><P>
      Your message: <TEXTAREA NAME="message" COLS=40 ROWS=8>
      </TEXTAREA><P>
      Do you have a cat?
        <INPUT TYPE="radio" NAME="haveCat" VALUE="yes" checked>Yes
        <INPUT TYPE="radio" NAME="haveCat" VALUE="no">No<P>
      <INPUT TYPE="submit" NAME="Send" VALUE="Send Email">
    </FORM>

    <P><A HREF="#top">Top of page</A>

  </BODY>
</HTML>

As you can see, it's a Web page about my cat. It's a simple page, but it contains a lot of markup that needs to be changed if the page is going to be valid XHTML 1.0.

Changing tags to lowercase

Our first task is to change all those uppercase tags to lowercase. XHTML is case-sensitive, and requires all element and attribute names to be written in lowercase. Here's how our markup looks with lowercase tags:


<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
"http://www.w3.org/TR/html4/loose.dtd">
<html>
  <head>
    <title>My cat called Lucky</title>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
  </head>
  <body>

    <a name="top"> </a>

    <h1>My cat called Lucky</h1>

    I have a cat called Lucky. She is black & white, and nearly
    twelve years old.<p>
    
    I found her through a pet rescue service. She didn't like her
    old home because it had a big scary dog in it that used to
    frighten her. When I first got her she was very scared and
    hid under the table for a whole week! Nowadays she is still
    a bit jittery but much more relaxed.<p>

    Here is a picture of Lucky in the garden.<p>

    <img src="images/lucky-being-stroked.jpg" alt="Lucky" width=400
    height=300 border=0>

    <br><br>
    
    She is very good at catching mice. She also catches birds,
    which can be a problem. Now that she has a collar and bell,
    though, she catches fewer birds.<p>

    <h2>Email Lucky!</h2>

    Use the form below to send Lucky an email. You never know -
    she might even reply, if she's not too busy!<p>

    <form method="post" action="mailform.cgi">
      Your email: <input type="text" name="email"><p>
      Your message: <textarea name="message" cols=40 rows=8>
      </textarea><p>
      Do you have a cat?
        <input type="radio" name="haveCat" value="yes" checked>Yes
        <input type="radio" name="haveCat" value="no">No<p>
      <input type="submit" name="Send" value="Send Email">
    </form>

    <p><a href="#top">Top of page</a>

  </body>
</html>

Notice that we don't need to change the values of attributes ("Lucky", "haveCat" and so on) to lowercase. Also notice that we made html lowercase in the DOCTYPE declaration at the top of the page (but left the other parts of the declaration untouched).
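
One caveat worth sketching: the values of enumerated attributes, such as type on input or method on form, are a different matter. XHTML 1.0 defines these values in lowercase and treats them as case-sensitive, so an uppercase value that HTML 4 tolerated would need lowercasing too. Our page already uses lowercase for these values; the uppercase line below is a hypothetical illustration, not taken from the page:

```html
<!-- HTML 4: enumerated values are case-insensitive, so this is accepted -->
<INPUT TYPE="TEXT" NAME="email">

<!-- Lowercased for XHTML: the element name, the attribute names and the
     enumerated value "text" all change; the free-form value "email"
     can stay as it is. (Closing the element comes later in the conversion.) -->
<input type="text" name="email">
```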

Quoting attribute values and expanding attributes

All attribute values need to be quoted in XHTML, even if they're numeric. For example:


Incorrect: <img ... border=0>
Correct: <img ... border="0">

In addition, XHTML doesn't allow you to use attribute names without their values; such attributes need to be expanded:


Incorrect: <input type="radio" ... checked>
Correct: <input type="radio" ... checked="checked">

Some browsers, such as Safari, actively refuse to recognise so-called "minimised" attributes (attribute names without values) if the document type is XHTML.
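
The same expansion applies to the other attributes that HTML lets you minimise. As a rough illustration, here are a few of them written out in full; the field names and option values are made up for this example and don't appear in our page:

```html
<!-- Hypothetical form fields showing expanded attributes -->
<select name="favouriteToys" multiple="multiple">
  <option value="mouse" selected="selected">Toy mouse</option>
  <option value="string">Piece of string</option>
</select>

<textarea name="notes" cols="40" rows="4" readonly="readonly">Lucky likes tuna.</textarea>

<button type="submit" disabled="disabled">Send</button>
```

In each case the attribute's value is simply its own name.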

After going through our HTML page and correcting these issues, we're left with the following markup:


<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
"http://www.w3.org/TR/html4/loose.dtd">
<html>
  <head>
    <title>My cat called Lucky</title>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
  </head>
  <body>

    <a name="top"> </a>

    <h1>My cat called Lucky</h1>

    I have a cat called Lucky. She is black & white, and nearly
    twelve years old.<p>
    
    I found her through a pet rescue service. She didn't like her
    old home because it had a big scary dog in it that used to
    frighten her. When I first got her she was very scared and
    hid under the table for a whole week! Nowadays she is still
    a bit jittery but much more relaxed.<p>

    Here is a picture of Lucky in the garden.<p>

    <img src="images/lucky-being-stroked.jpg" alt="Lucky" width="400"
    height="300" border="0">

    <br><br>
    
    She is very good at catching mice. She also catches birds,
    which can be a problem. Now that she has a collar and bell,
    though, she catches fewer birds.<p>

    <h2>Email Lucky!</h2>

    Use the form below to send Lucky an email. You never know -
    she might even reply, if she's not too busy!<p>

    <form method="post" action="mailform.cgi">
      Your email: <input type="text" name="email"><p>
      Your message: <textarea name="message" cols="40" rows="8">
      </textarea><p>
      Do you have a cat?
        <input type="radio" name="haveCat" value="yes" checked="checked">Yes
        <input type="radio" name="haveCat" value="no">No<p>
      <input type="submit" name="Send" value="Send Email">
    </form>

    <p><a href="#top">Top of page</a>

  </body>
</html>

Note that all these changes still leave us with a perfectly valid HTML 4.01 page. XHTML is largely backward-compatible with HTML.

Making the document well-formed

Our HTML Transitional page isn't well-formed. Because XHTML is an application of XML, every XHTML document (Strict or Transitional) must be well-formed, so we'll need to make a few changes to the markup's structure.

Closing open elements

For an XHTML document to be well-formed, every element in it must be closed. Elements that hold content need a closing tag: </p>, </b> and so on. Empty elements (those that can't contain any content, such as br and img) can instead be closed by placing a slash (/) before the > at the end of the tag, for example <br />.

Although you can just write <br/> (without the space before the slash), it's a good idea to put the space in to avoid confusing some HTML browsers.
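
As a quick sketch, this is how the slash shortcut looks on the empty elements you're most likely to meet. The file names here (styles.css, paw.gif) are placeholders rather than files from our page:

```html
<!-- Empty elements closed with the slash shortcut -->
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<link rel="stylesheet" type="text/css" href="styles.css" />
<img src="paw.gif" alt="A paw print" />
<br />
<hr />
<input type="text" name="email" />

<!-- Elements that can hold content keep their closing tag,
     even when they happen to be empty -->
<textarea name="message" cols="40" rows="8"></textarea>
```

The shortcut is best kept for elements that can never contain content. For an element such as p or textarea that merely happens to be empty, writing out the closing tag is the safer choice for HTML browsers.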

Nesting inline elements inside block elements

Strict documents, whether HTML or XHTML, don't allow inline elements such as a, img and input, or bare text, to sit directly inside the body or form elements. Instead, they must be nested inside block-level elements such as p or div. This means that we need to wrap our text, as well as any bare inline elements, in <p></p> tags.
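
To illustrate the rule, here's a small sketch of a bare image and caption, first as only a Transitional DTD would accept them and then wrapped for a Strict DTD. It uses a div rather than a p to show that other block-level containers work too; the caption text is invented for the example:

```html
<!-- Accepted by a Transitional DTD only: inline content directly inside body -->
<body>
  <img src="images/lucky-being-stroked.jpg" alt="Lucky" />
  Lucky in the garden
</body>

<!-- Accepted by a Strict DTD too: the same content inside a block-level div -->
<body>
  <div>
    <img src="images/lucky-being-stroked.jpg" alt="Lucky" />
    Lucky in the garden
  </div>
</body>
```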

So let's go through our HTML document and fix up all those unclosed elements and non-nested inline elements. Here's the result:


<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
"http://www.w3.org/TR/html4/loose.dtd">
<html>
  <head>
    <title>My cat called Lucky</title>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
  </head>
  <body>

    <p><a name="top"> </a></p>

    <h1>My cat called Lucky</h1>

    <p>I have a cat called Lucky. She is black & white, and nearly
    twelve years old.</p>
    
    <p>I found her through a pet rescue service. She didn't like her
    old home because it had a big scary dog in it that used to
    frighten her. When I first got her she was very scared and
    hid under the table for a whole week! Nowadays she is still
    a bit jittery but much more relaxed.</p>

    <p>Here is a picture of Lucky in the garden.</p>

    <p><img src="images/lucky-being-stroked.jpg" alt="Lucky" width="400"
    height="300" border="0" /></p>
    
    <p>She is very good at catching mice. She also catches birds,
    which can be a problem. Now that she has a collar and bell,
    though, she catches fewer birds.</p>

    <h2>Email Lucky!</h2>

    <p>Use the form below to send Lucky an email. You never know -
    she might even reply, if she's not too busy!</p>

    <form method="post" action="mailform.cgi">
      <p>Your email: <input type="text" name="email" /></p>
      <p>Your message: <textarea name="message" cols="40" rows="8">
      </textarea></p>
      <p>Do you have a cat?
        <input type="radio" name="haveCat" value="yes" checked="checked" />Yes
        <input type="radio" name="haveCat" value="no" />No</p>
      <p><input type="submit" name="Send" value="Send Email" /></p>
    </form>

    <p><a href="#top">Top of page</a></p>

  </body>
</html>

Notice that we removed the <br /><br /> after the img element. Apart from being invalid XHTML — br is an inline element, so it can't be placed directly in the page body without being wrapped in a block element — the line breaks were no longer necessary once we'd correctly wrapped our img in a block-level p element.

That's better. We've closed all our elements, either by placing a closing tag after each opening tag, or by using the slash (/) shortcut to close empty elements. In addition, all inline elements are properly encased in block-level elements — in this case, p elements.

Removing presentational markup

Generally speaking, XHTML encourages you to use CSS to describe the look of your pages, rather than embedding presentation within the markup. This means that attributes such as align, size and border should be replaced with CSS equivalents; such attributes are deprecated in XHTML 1.0 Transitional and aren't allowed at all in XHTML 1.0 Strict. (The width and height attributes of img are not deprecated, so we can leave them in place.) Let's change our img element accordingly, from:


    <p><img src="images/lucky-being-stroked.jpg" alt="Lucky" width="400"
    height="300" border="0" /></p>

to:


    <p><img src="images/lucky-being-stroked.jpg" alt="Lucky" width="400"
    height="300" style="border: none;" /></p>

In a real-world situation, it'd be a good idea to move the above inline CSS to a separate style sheet, and place a class or id attribute on the img element so that you can select it from within the CSS. If possible, keep all presentational aspects out of your markup.
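
As a sketch of that approach, the rule could live in an external style sheet that the page links to from its head. The file name styles.css and the class name photo are assumptions made for this example:

```html
<!-- In the page's head: link to the external style sheet -->
<link rel="stylesheet" type="text/css" href="styles.css" />

<!-- In the page's body: the img carries a class instead of inline CSS -->
<p><img src="images/lucky-being-stroked.jpg" alt="Lucky" width="400"
height="300" class="photo" /></p>

<!-- styles.css would then contain a rule along these lines:

     img.photo { border: none; }
-->
```

The markup now says only that the image is a photo; how photos look is decided in one place and can be changed without touching the page.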

Changing name to id and encoding ampersands

Nearly there. We just need to make a couple more minor changes to turn our markup into valid XHTML.

First of all, using the name attribute to identify fragments (sections of markup to link to within the page) is deprecated in XHTML. The id attribute should be used instead. This means that we need to rewrite our #top fragment:


    <a name="top"> </a>

as:


    <a id="top"> </a>

Using name is still OK in other situations, such as form fields. You only need to change name to id when defining fragments that you link to with <a href="#...">.

Don't forget that, unlike the name attribute, ids must be unique; you can't have more than one element with the same id in the page.
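
Because id can be applied to almost any element, one possible alternative is to drop the empty anchor and give the id to an element that's already at the top of the page. This is a sketch of that variation rather than the approach used in the rest of this article; very old browsers may only jump to named anchors, so test it against the browsers you need to support:

```html
<!-- The heading itself becomes the link target; no empty anchor is needed -->
<h1 id="top">My cat called Lucky</h1>

<!-- ...rest of the page... -->

<p><a href="#top">Top of page</a></p>
```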

Secondly, we have a single bare ampersand in our markup. This is not allowed in XHTML; all ampersands must be encoded. So we need to change:


<p>I have a cat called Lucky. She is black & white, and nearly
    twelve years old.</p>

to:


<p>I have a cat called Lucky. She is black &amp; white, and nearly
    twelve years old.</p>
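
The same rule applies inside attribute values, which is easy to overlook in URLs that carry query strings. The script name and parameters below are hypothetical and serve only to show the encoding:

```html
<!-- Incorrect: bare ampersand between the query parameters -->
<a href="photos.cgi?cat=lucky&size=large">More photos</a>

<!-- Correct: the ampersand is encoded, and the browser still requests
     photos.cgi?cat=lucky&size=large -->
<a href="photos.cgi?cat=lucky&amp;size=large">More photos</a>
```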

Changing the document type

Excellent! All our markup now meets the requirements of XHTML 1.0 Strict. The final step is to change the page's document type from HTML 4.01 Transitional to XHTML 1.0 Strict, so that validators check the page against the right DTD. The DOCTYPE for XHTML 1.0 Strict is:


<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
     "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">

In addition, we need to add an xmlns namespace declaration to the opening html tag. XHTML 1.0 requires this declaration, which tells XML processors that the page's elements belong to the XHTML namespace:


<html xmlns="http://www.w3.org/1999/xhtml">

So our final XHTML 1.0 Strict markup looks like this:


<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
          "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <title>My cat called Lucky</title>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
  </head>
  <body>

    <p><a id="top"> </a></p>

    <h1>My cat called Lucky</h1>

    <p>I have a cat called Lucky. She is black &amp; white, and nearly
    twelve years old.</p>
    
    <p>I found her through a pet rescue service. She didn't like her
    old home because it had a big scary dog in it that used to
    frighten her. When I first got her she was very scared and
    hid under the table for a whole week! Nowadays she is still
    a bit jittery but much more relaxed.</p>

    <p>Here is a picture of Lucky in the garden.</p>

    <p><img src="images/lucky-being-stroked.jpg" alt="Lucky" width="400"
    height="300" style="border: none;" /></p>
    
    <p>She is very good at catching mice. She also catches birds,
    which can be a problem. Now that she has a collar and bell,
    though, she catches fewer birds.</p>

    <h2>Email Lucky!</h2>

    <p>Use the form below to send Lucky an email. You never know -
    she might even reply, if she's not too busy!</p>

    <form method="post" action="mailform.cgi">
      <p>Your email: <input type="text" name="email" /></p>
      <p>Your message: <textarea name="message" cols="40" rows="8">
      </textarea></p>
      <p>Do you have a cat?
        <input type="radio" name="haveCat" value="yes" checked="checked" />Yes
        <input type="radio" name="haveCat" value="no" />No</p>
      <p><input type="submit" name="Send" value="Send Email" /></p>
    </form>

    <p><a href="#top">Top of page</a></p>

  </body>
</html>

View the finished XHTML page in all its glory!

As you can see, converting an HTML 4 page to XHTML can be fairly time-consuming, though the process is straightforward. If you're converting a lot of pages, you might find tools such as HTML Tidy helpful, as they can convert HTML to XHTML automatically.

Responses to this article

6 responses (oldest first):

02-Mar-10 11:34
Hi Matt - This is a great tutorial for beginners - thanks a lot.

The link towards the end of the tutorial - "finished XHTML page" is pointing to the original html file. You may want to correct this

Best Regards,
Girish
03-Mar-10 16:28
Hi Girish,

So it is! Well spotted. I'll get that fixed up.

Thanks!
Matt
02-Nov-10 03:31
"This means that attributes such as width, height, align, size and border should be replaced with CSS equivalents"

You got a little confused here. What you say is true concerning align and border - and that's exactly why they're deprecated in xhtml1/html4.
'size' on the other is not a valid attribute, you made that up.
Your point is however not valid concerning the attributes 'width' and 'height': they are not just representational mark-up, they give information about the image in use. This is why these attributes are not deprecated. You might want to correct that in your article.

Besides, avoiding representational mark-up was just as much a goal for html4, so this chapter doesn't really belong in here anyway.
07-Nov-10 22:03
@elioxar: Thanks for your feedback. Good point about width and height - they're not deprecated. I've updated the article.

"'size' on the other is not a valid attribute, you made that up."

No I didn't: http://www.w3schools.com/tags/tag_font.asp
08-Nov-10 01:17
After my comment I googled the topic. Apparently I shouldn't have said you got confused - as it was the w3c who got confused. html4/xhtml is inconsistent about width and height.
10-Nov-10 19:55
@elioxar: Yeah I think I've seen conflicting advice from the W3C too. I believe your basic point about width and height is valid though, since width and height are (usually) intrinsic to the image, rather than being presentational.
