HTML Data Types


Note on SGML Data Types

This site does not refer to the standard SGML data types as used in the HTML DTD, but mixes that information with the prose for the definitions listed here. If you need the syntax rules as defined by the DTD, consult the DTD itself.

Case Sensitivity

In HTML, tags and attributes are case-insensitive. Attribute values, however, can be case-sensitive, case-insensitive, or case-neutral. URIs, for instance, are case-sensitive while list types are not. Number values are case-neutral.


Basic HTML Data Types

[CN] CHARACTER

A [character] value is a single character from the document character set. A character may also be referenced by its character entity (escape code).

[CS] TEXT

Basically, [text] stands for text. It can take any and all characters from the document character set and may include character entities (escape codes).

[CS] NAME

[name] values may include capital letters (A-Z), small letters (a-z), digits (0-9), hyphens (-), periods (.), underscores (_), and colons (:). However, they must begin with a letter.
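The [name] rule can be checked mechanically. Here is a minimal sketch in Python; the function name `is_valid_name` is made up for illustration, and the pattern also accepts digits after the first character, which the HTML 4 specification allows.

```python
import re

# A letter first, then any mix of letters, digits, hyphens,
# periods, underscores, and colons.
NAME_PATTERN = re.compile(r"[A-Za-z][A-Za-z0-9._:-]*")


def is_valid_name(value: str) -> bool:
    """Return True if value follows the [name] rules described above."""
    return NAME_PATTERN.fullmatch(value) is not None


if __name__ == "__main__":
    for candidate in ("URI", "section-1.2", "1st", ""):
        print(repr(candidate), is_valid_name(candidate))
```

Because [name] is case-sensitive, "URI" and "uri" are both valid but are different names.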

[CN] NUMBER

[number] values may include any positive integer and zero (0) unless further restricted.


HTML Data Types defined by RFC and IANA documentation

[CS] URI

[URI] values are defined by RFC2396, and include relative and absolute URIs.

URIs include URLs, so even if you've never heard of URIs, this data type should be nothing radically new.

URI=Uniform Resource Identifier
URL=Uniform Resource Locator.

The Basics of the Absolute URL

A URL is the address to a file, given as the protocol followed by the path. The protocol is the method, or "language", used to fetch the file and is separated from the path by a colon (:).

Common Internet Protocols:
http:
HyperText Transfer Protocol, designed specifically for the World Wide Web
ftp:
File Transfer Protocol
gopher:
Gopher, a menu-based document retrieval protocol that predates the World Wide Web
file:
used to access local files (on your computer)
mailto:
initiates a new email message to given address

All protocols except 'mailto' are followed by two slashes after the colon.

ex: http://

HTTP and FTP Paths

The first part of the path for an http or ftp URL is the location of the machine (server) that has the file on its system. A number (the IP address) designates each individual server; however, these numbers are rarely used in URLs. Instead, domain names associated with the servers are used. Domain names are like nicknames--the same server may be called by different names, and a domain name can also be set up to represent a specific directory on the server.

Domain names are case insensitive:
fantasai.tripod.com is the same as Fantasai.Tripod.COM

Following the domain name is the path to the file itself. Paths are given UNIX-style, with forward slashes separating the directories. If a filename is not given, the server typically looks for an index file (e.g. index.html) to return; if none is found, it may return a list of the files in the directory, depending on how the server is configured.

Paths are case sensitive:
https://fantasai.tripod.com/UTF-8/contents.htm is not the same as https://fantasai.tripod.com/utf-8/contents.htm or https://fantasai.tripod.com/UTF-8/contents.HTM
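To make the case rules concrete, here is a rough sketch using Python's standard `urllib.parse` module. The helper name `same_location` is hypothetical, and the comparison looks only at protocol, server, and path (ports and query strings are ignored for simplicity).

```python
from urllib.parse import urlsplit


def same_location(url_a: str, url_b: str) -> bool:
    """Compare two URLs: protocol and server name ignore case, the path does not."""
    a, b = urlsplit(url_a), urlsplit(url_b)
    # urlsplit lowercases the scheme, and .hostname is the lowercased host,
    # so only the path comparison below is case-sensitive.
    return (a.scheme, a.hostname, a.path) == (b.scheme, b.hostname, b.path)


if __name__ == "__main__":
    base = "https://fantasai.tripod.com/UTF-8/contents.htm"
    print(same_location(base, "https://Fantasai.Tripod.COM/UTF-8/contents.htm"))
    print(same_location(base, "https://fantasai.tripod.com/utf-8/contents.htm"))
```

The first comparison differs only in the capitalization of the domain name; the second differs in the capitalization of the path.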

Mailto:

Mailto URLs are simply the protocol followed by the email address:
mailto:fantasai@yahoo.com

Relative URLs

Relative URLs give the location of a file relative to a base URL. (The base URL is the location of the source anchor's file unless overridden by <BASE>.) One of the nice things about relative URLs is that you don't have to change them every time you move your files from place to place. This allows you to check your files for broken links on your hard drive before uploading them.

Directory tree: UTF-8/ holds Appendix/ (holding datatype.html and index.html), HTML/ (holding Links/, which holds a.html), and contents.html.

Relative URLs work like a set of point to point directions. You start in the base URL directory. If the destination file is in the same directory, you simply specify the filename. So if I wanted to go from this file (datatype.htm) to the index of this directory, I would simply specify a URL of "index.html".

If you're going down a directory to get to the destination file, you specify the directory and then the filename. For example, a link from my main page (contents.htm) to this page (datatype.htm) would specify a URL of "Appendix/datatype.html".

To go up a directory level, you specify "../". Therefore, to link back up to my main page from here, I would specify "../contents.html".

You can also combine these for relationships that go up and down several directory levels in the tree. A URL from here to the tag entry for hypertext links would be "../HTML4/Links/a.html". This goes up one directory level into "UTF-8/", down into "HTML4/", down from there into "Links/", and then picks out "a.html" from that directory.
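The point-to-point directions above can be tried out with `urljoin` from Python's standard library, which resolves a relative URL against a base URL. This is only a sketch: the base URL is this page's address as given elsewhere in this document, and the helper name `resolve` is made up for the example.

```python
from urllib.parse import urljoin

# Base URL: the address of this page, taken from the fragment example in this document.
BASE = "https://fantasai.tripod.com/UTF-8/Appendix/datatype.htm"


def resolve(relative: str, base: str = BASE) -> str:
    """Resolve a relative URL against the base URL."""
    return urljoin(base, relative)


if __name__ == "__main__":
    for relative in ("index.html", "../contents.html", "../HTML4/Links/a.html"):
        print(relative, "->", resolve(relative))
```

Each "../" removes one directory from the end of the base path before the rest of the relative URL is appended.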

Fragment Identifiers

Fragment identifiers can be used to specify a specific element, or part of the document, as the destination anchor. The element must be named to function as an anchor--either by the name attribute of the anchor tag (<A>) or by the id attribute on any element. To refer to this named element, a URL first designates the file by either an absolute URL or a relative one, followed by a hash (#), then the identifier.

ex: "https://fantasai.tripod.com/UTF-8/Appendix/datatype.htm#URI" refers to the heading of the entry for URIs (on this page).
From "a.html", I can also use "../../Appendix/datatype.htm#URI".

From within the same file, no file needs to be specified. "#URI" from here will take you back up to the URI header.
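As a small illustration of how the file part and the fragment identifier separate, here is a sketch using `urljoin` and `urldefrag` from Python's standard library. The function name `split_fragment` is an assumption made for the example.

```python
from urllib.parse import urldefrag, urljoin


def split_fragment(reference: str, base: str) -> tuple:
    """Resolve reference against base, then split off the fragment identifier."""
    absolute = urljoin(base, reference)
    document, fragment = urldefrag(absolute)
    return document, fragment


if __name__ == "__main__":
    here = "https://fantasai.tripod.com/UTF-8/Appendix/datatype.htm"
    # A bare "#URI" keeps the current file and only sets the fragment.
    print(split_fragment("#URI", here))
```

A reference that consists only of a hash and an identifier resolves to the base document itself, which is why no file needs to be specified within the same page.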

[CI] CONTENT-TYPE

[content-type] values are media types as defined by RFC2045 and RFC2046. Do not confuse them with media types as used in HTML, which are different. (While the RFC uses "media type" in its definition, I will be using "content type" to avoid confusion.)

The Basic Format of a Content-Type

There are five discrete top-level content types. These are followed by a slash (/) and a subtype. Example: text/html, where text is the top-level media type, and html is the subtype.
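A content type can be split into its two halves with very little code. The sketch below assumes a bare "type/subtype" value with no parameters (such as a charset); the function name `split_content_type` is hypothetical. Since [content-type] is case-insensitive, both halves are lowercased.

```python
def split_content_type(value: str) -> tuple:
    """Split 'type/subtype' into (top-level type, subtype), lowercased."""
    top_level, slash, subtype = value.strip().partition("/")
    if not slash or not top_level or not subtype:
        raise ValueError(f"not a type/subtype pair: {value!r}")
    return top_level.lower(), subtype.lower()


if __name__ == "__main__":
    print(split_content_type("text/html"))
    print(split_content_type("Text/HTML"))
```

Both calls give the same result, because the comparison ignores case.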

The Five Discrete Top-Level Content Types:
text, image, audio, video, application
Common Content Types (with common extensions in parentheses):

A complete list of registered MIME types.

[CI] LANGUAGE CODE

[language code] values are language codes as defined in RFC1766. I had quite a time tracking down a copy--the original server doesn't exist anymore, but the W3C was kind enough to update their resource links for HTML 4.01.

Basic Format of a Language Code

A language code consists of two parts: the primary language tag, optionally followed by a hyphen (-) and a hyphen-separated series of subtags. Example: "en" is the code for English. To be more specific, you can also specify "en-US", indicating the US variant of English.

The primary tag uses a language code from ISO 639. It may also take the values "i" and "x", whose uses are defined in RFC1766.
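The primary-tag-plus-subtags structure can be sketched as follows. The function name `split_language_code` is made up for illustration; the code only splits on hyphens and does not check the tags against ISO 639.

```python
def split_language_code(code: str) -> tuple:
    """Split a language code into (primary tag, list of subtags), lowercased."""
    primary, *subtags = code.lower().split("-")
    if not primary or not all(subtags):
        raise ValueError(f"empty tag in language code: {code!r}")
    return primary, subtags


if __name__ == "__main__":
    print(split_language_code("en"))
    print(split_language_code("en-US"))
```

Lowercasing reflects that [language code] values are case-insensitive, so "en-US" and "EN-us" are treated alike.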

The subtag can be used to indicate:

CHARACTER SET

[character set] values are character sets from the IANA character set registry.


HTML Data Types defined in the HTML Specification & by W3C

[CS] DATETIME

[datetime] values use the ISO date format (ISO 8601). Since the document is not readily available (ya have to pay -_-;;), the HTML specification covers the format it uses:

The ISO Date Format

YYYY-MM-DDThh:mm:ssTZD or [year]-[month]-[date]T[hour]:[minute]:[second][time zone designator]

YYYY [year]
Four digits--this date format is Y2K compliant, Y3K compliant, Y4K compliant, etc. So we have 8000 years until there's a problem. :)
MM [month]
Two digits--add a zero for single digits (January becomes 01).
DD [date]
Two digits--add a zero for single digits (7 becomes 07).
T
This letter separates the date from the time. It must be capitalized.
hh [hour]
Twenty-four hour clock, two digits--add a zero for single digits (4am becomes 04; 4pm becomes 16).
mm [minute]
Two digits--add a zero for single digits.
ss [seconds]
Two digits--add a zero for single digits.
TZD [time zone designator]
Time Zone Description = Time Zone Designator
UTC (Coordinated Universal Time), a.k.a. GMT (Greenwich Mean Time) = Z
Time zones ahead of UTC (to the east) = +hh:mm, i.e. +[hours]:[minutes] ahead of UTC
Time zones behind UTC (to the west) = -hh:mm, i.e. -[hours]:[minutes] behind UTC

Examples:
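As one illustration, the format described above can be checked and read with a short Python sketch. The function name `parse_html_datetime` and the sample value are made up for this example; the pattern accepts only the full YYYY-MM-DDThh:mm:ssTZD form with a capital T.

```python
import re
from datetime import datetime, timedelta, timezone

DATETIME_PATTERN = re.compile(
    r"([0-9]{4})-([0-9]{2})-([0-9]{2})"   # YYYY-MM-DD
    r"T([0-9]{2}):([0-9]{2}):([0-9]{2})"  # Thh:mm:ss (capital T only)
    r"(Z|[+-][0-9]{2}:[0-9]{2})"          # TZD: Z, +hh:mm, or -hh:mm
)


def parse_html_datetime(value: str) -> datetime:
    """Parse a YYYY-MM-DDThh:mm:ssTZD string into a timezone-aware datetime."""
    match = DATETIME_PATTERN.fullmatch(value)
    if match is None:
        raise ValueError(f"not in YYYY-MM-DDThh:mm:ssTZD form: {value!r}")
    year, month, day, hour, minute, second = (int(g) for g in match.groups()[:6])
    tzd = match.group(7)
    if tzd == "Z":
        zone = timezone.utc
    else:
        sign = 1 if tzd[0] == "+" else -1
        zone = timezone(sign * timedelta(hours=int(tzd[1:3]), minutes=int(tzd[4:6])))
    return datetime(year, month, day, hour, minute, second, tzinfo=zone)


if __name__ == "__main__":
    # 4:05 pm on 7 January 2000, five hours behind UTC (sample value).
    print(parse_html_datetime("2000-01-07T16:05:00-05:00"))
```

A lowercase "t" or a missing time zone designator is rejected with a ValueError.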

[CI] RGB TRIPLET

RGB Triplets

RGB triplets code for a color. Each one is a six-digit number in hexadecimal form, written after a hash (#).
Each pair of digits represents a color:

The first two represent Red. --> R
The next two represent Green. --> G
The last two represent Blue. --> B

The higher the number for the color, the more it is shown, and the brighter it is. For example, #FF0000 would be bright red, while #660000 would be a really deep, dark red. What if you wanted to make your color lighter, say, pink? You would add some green and blue to lighten it. Just don't add more green and blue than you have red, or you'll get greenish-blue.
If all the colors are equal, you'll get a shade of gray. So #FFFFFF would be the brightest gray (white), and #000000 would be the darkest gray (black).
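To see the three pairs of digits pulled apart, here is a minimal Python sketch. The function names `parse_rgb_triplet` and `is_gray` are hypothetical; the leading hash is treated as optional here, which is an assumption of the example rather than a rule of the data type.

```python
import re

TRIPLET_PATTERN = re.compile(r"#?([0-9A-Fa-f]{2})([0-9A-Fa-f]{2})([0-9A-Fa-f]{2})")


def parse_rgb_triplet(triplet: str) -> tuple:
    """Return (red, green, blue) as decimal numbers from 0 to 255."""
    match = TRIPLET_PATTERN.fullmatch(triplet)
    if match is None:
        raise ValueError(f"not an RGB triplet: {triplet!r}")
    red, green, blue = (int(pair, 16) for pair in match.groups())
    return red, green, blue


def is_gray(triplet: str) -> bool:
    """A color is a shade of gray when all three components are equal."""
    red, green, blue = parse_rgb_triplet(triplet)
    return red == green == blue


if __name__ == "__main__":
    print(parse_rgb_triplet("#FF0000"))  # bright red
    print(parse_rgb_triplet("#660000"))  # deep, dark red
    print(is_gray("#FFFFFF"))
```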

Don't really understand any of this? Check out BigNoseBird's page about COLORS.

The Hexadecimal System

The hexadecimal number system has sixteen digits (0 - F), rather than ten digits (0 - 9) like the decimal system we normally use. Therefore a one in the "ten's" place now represents sixteen. 10 in hexadecimal is the same as saying 16 in decimal, and F stands for fifteen. Most of the time, the three colors' numbers are written separately in decimal form, and can take values from 0 to 255. In the hexadecimal system, this would correspond to 0 - FF.

Converting from decimal to hexadecimal:

Be sure to convert each color by itself.

Now convert the next color.

Once you have all three numbers converted, write them one right after the other in the order Red, Green, Blue. This is your RGB triplet.

Decimal-Hexadecimal Digit Conversions
Decimal = Hexadecimal
1 = 1
2 = 2
3 = 3
4 = 4
5 = 5
6 = 6
7 = 7
8 = 8
9 = 9
10 = A
11 = B
12 = C
13 = D
14 = E
15 = F
16 = 10
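The decimal-to-hexadecimal conversion can also be written out in code. The sketch below uses one common approach (divide by sixteen, then look up each digit), which is not necessarily the method intended above; the names `component_to_hex` and `to_rgb_triplet` are made up for the example.

```python
HEX_DIGITS = "0123456789ABCDEF"


def component_to_hex(value: int) -> str:
    """Convert one color component (0 to 255) to two hexadecimal digits."""
    if not 0 <= value <= 255:
        raise ValueError(f"component out of range: {value}")
    # The quotient is the digit in the "sixteens" place, the remainder the ones.
    sixteens, ones = divmod(value, 16)
    return HEX_DIGITS[sixteens] + HEX_DIGITS[ones]


def to_rgb_triplet(red: int, green: int, blue: int) -> str:
    """Convert each color by itself, then join them as Red, Green, Blue."""
    return "#" + "".join(component_to_hex(c) for c in (red, green, blue))


if __name__ == "__main__":
    print(to_rgb_triplet(255, 0, 0))
    print(to_rgb_triplet(192, 192, 192))
```

For example, 192 divided by 16 is 12 with remainder 0, so its two digits are C and 0.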

[CI] 16COLOR NAME

[16color name] values can take one of sixteen color names corresponding with RGB triplets as defined below:

Color Name RGB Triplet
Black #000000
Gray #808080
Silver #C0C0C0
White #FFFFFF
Maroon #800000
Red #FF0000
Olive #808000
Yellow #FFFF00
Green #008000
Lime #00FF00
Teal #008080
Aqua #00FFFF
Navy #000080
Blue #0000FF
Purple #800080
Fuchsia #FF00FF
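The table above maps directly onto a lookup. Here is a small Python sketch; the function name `resolve_color` is hypothetical, and passing unrecognized values through unchanged is a design choice of the example. Names are lowercased before lookup because [16color name] is case-insensitive.

```python
COLOR_NAMES = {
    "black": "#000000", "gray": "#808080", "silver": "#C0C0C0", "white": "#FFFFFF",
    "maroon": "#800000", "red": "#FF0000", "olive": "#808000", "yellow": "#FFFF00",
    "green": "#008000", "lime": "#00FF00", "teal": "#008080", "aqua": "#00FFFF",
    "navy": "#000080", "blue": "#0000FF", "purple": "#800080", "fuchsia": "#FF00FF",
}


def resolve_color(value: str) -> str:
    """Return the RGB triplet for a color name; other values pass through unchanged."""
    return COLOR_NAMES.get(value.lower(), value)


if __name__ == "__main__":
    print(resolve_color("Teal"))
    print(resolve_color("#123456"))
```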

[CI] LINK TYPE LIST

Alternate
Indicates an alternate version of the current document (such as in a different language or for a different medium).
Stylesheet
Indicates an external style sheet. Can be used with "alternate" for an "alternate stylesheet."
Script
Indicates an external script. Not all browsers support this; I know I have had trouble with MSIE, and had to resort to the <SCRIPT> element instead. (For those of you about to defend Microsoft's browser, I have version THREE! So don't explain that it works fine in 5.0 or whatever.)
Start
Indicates the first document in a series, of which the current document is a part.
Next
Indicates the next document in the series.
Prev
Indicates the previous document in the series.
Contents
Indicates a table of contents for the current document or collection of documents.
Index
Indicates an index for the current document or collection of documents.
Glossary
Indicates a glossary for terms used in the current document or collection of documents.
Copyright
Indicates a copyright statement for the current document or collection of documents.
Chapter
Indicates a chapter in the current document or collection of documents. (Can be used, for instance, in the table of contents.)
Section
Indicates a section in the current document or collection of documents.
Subsection
Indicates a subsection in the current document or collection of documents.
Appendix
Indicates an appendix to the current document or collection of documents.
Help
Indicates a help section.
Bookmark
Indicates a bookmark, which is "a key entry point within an extended document."

You can also define your own link types by using a meta data profile. I have no idea what a profile is, so here is the specification entry: HTML 4.0 - Meta Data Profiles. It's in the section on the META tag, so there will be references to that tag and its attributes.

[CI] MEDIA TYPE

screen
For non-paged computer screens
tty
For media such as teletypes and terminals, with limited displays using fixed-width font.
tv
For television-type media (low resolution, color)
projection
For projectors
handheld
For handheld devices (small screen, monochrome, bitmapped graphics, limited bandwidth). A Palm Pilot would fit here.
print
For paged media (printed materials, print preview).
braille
For braille devices (tactile).
aural
For speech browsers.
all
For all media.