Media type and character set encoding correctly identified

Content can only be processed correctly if its media type and character set encoding are correctly identified.

The media type (MIME) of each piece of content must be correctly identified through the HTTP parameter Content-Type. The value for this field can be provided in two ways:

  • Content-Type field of the HTTP header: the HTTP header is supplied by a Web server and it defines a set of characteristics of the content before it is downloaded.
  • Meta tag Content-Type: can be included in the source code of a page. Example:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />

Attention: These two sources of information must be consistent with each other.
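
As a rough illustration of this consistency requirement, the following Python sketch compares the charset declared in an HTTP header value with the one declared in a page's meta tag. The names charset_of, MetaContentTypeFinder and sources_agree are hypothetical helpers introduced only for this example, and the sketch assumes that a page without a Content-Type meta tag cannot contradict the header.

```python
from email.message import Message
from html.parser import HTMLParser


def charset_of(content_type):
    """Return the lowercased charset parameter of a Content-Type value, or None."""
    message = Message()
    message["Content-Type"] = content_type
    return message.get_content_charset()


class MetaContentTypeFinder(HTMLParser):
    """Remember the content attribute of the first Content-Type meta tag."""

    def __init__(self):
        super().__init__()
        self.content = None

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        http_equiv = (attributes.get("http-equiv") or "").lower()
        if tag == "meta" and http_equiv == "content-type" and self.content is None:
            self.content = attributes.get("content")


def sources_agree(header_value, html):
    """Return True when the header and the meta tag declare the same charset."""
    finder = MetaContentTypeFinder()
    finder.feed(html)
    if finder.content is None:
        # Assumption: with no meta tag there is nothing to contradict the header.
        return True
    return charset_of(header_value) == charset_of(finder.content)


page = (
    '<html><head><meta http-equiv="Content-Type" '
    'content="text/html; charset=utf-8" /></head><body></body></html>'
)
consistent = sources_agree("text/html; charset=UTF-8", page)
inconsistent = sources_agree("text/html; charset=iso-8859-1", page)
```

Here consistent is True, because charset names are compared case-insensitively, and inconsistent is False, which signals that one of the two declarations needs to be corrected.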

Identifying the correct character set encoding

The value for the Content-Type field has two components: the media type format (MIME) and the character set encoding (charset). The charset must be correctly identified to avoid, for instance, non-English characters being presented incorrectly (e.g. “codificaÃ§Ã£o” instead of “codificação”).
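
The two components, and the effect of a wrong charset, can be sketched in Python as follows. This is a minimal example that uses the standard library's email.message.Message to parse the value; split_content_type is a hypothetical helper name, and the garbled text is produced by deliberately decoding UTF-8 bytes as ISO-8859-1.

```python
from email.message import Message


def split_content_type(value):
    """Return (media type, charset) for a Content-Type value; charset may be None."""
    message = Message()
    message["Content-Type"] = value
    return message.get_content_type(), message.get_content_charset()


media_type, charset = split_content_type("text/html; charset=utf-8")

# The page was written in UTF-8 but is read as if it were ISO-8859-1:
# every accented character turns into two unrelated characters.
written = "codificação".encode("utf-8")
garbled = written.decode("iso-8859-1")
correct = written.decode(charset)
```

In this sketch media_type is "text/html" and charset is "utf-8"; decoding with the declared charset restores the original word, while the mismatched decoding does not.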

Problems caused by an inaccurate charset definition are hard to identify because different browsers react to them differently. When a page presents broken characters, it is most likely that the charset provided by the Content-Type field is wrong. The following simple procedure can help identify which of the two Content-Type values is correct, the one in the HTTP header or the one in the meta tag:

  1. Access the HTTP header (field Content-Type) of the page address to be validated and identify the defined charset. There are online tools that give access to the HTTP header of any page on the Web, such as Rex Swain’s HTTP Viewer or Web-sniffer. For instance, if the field Content-Type presents the value “text/html; charset=utf-8”, this means that the character set encoding used to write the page was UTF-8. According to the HTTP protocol, if a value is not presented for the charset, it should be assumed that the ISO-8859-1 encoding was used;
  2. Open the page using the Firefox browser, right-click, choose View Page Source and look for the meta tag Content-Type. Check whether the charset value supplied in the meta tag is the same as the one supplied in the HTTP header. If it is not, one of them is wrong;
  3. In Firefox, choose View -> Character Encoding and select, in turn, each of the values previously obtained. Inspect the text carefully for errors each time, paying special attention to non-English or accented characters, such as ç, ã or á;
  4. If the page presents errors with both charsets, you will probably have no option but to guess which charset was used to write the page. Try several character encodings in Firefox and choose the one that gives the best results. In any case, you should ask the author of the page to fix it.
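
Parts of the procedure above can also be approximated with a script. The Python sketch below is only an illustration under stated assumptions: fetch and try_charsets are hypothetical helper names, fetch needs network access and is not called in the example, and a successful decode does not prove that a charset is right (ISO-8859-1, for instance, accepts any byte sequence), so the decoded text still has to be inspected by a person.

```python
import urllib.request


def fetch(url):
    """Step 1: return (Content-Type header value, raw body bytes) for a URL."""
    with urllib.request.urlopen(url) as response:
        return response.headers.get("Content-Type"), response.read()


def try_charsets(raw, candidates):
    """Steps 3 and 4: decode the same bytes with each candidate charset.

    Returns a dict mapping each charset to the decoded text, or to None
    when the bytes are not valid in that charset or the name is unknown.
    """
    results = {}
    for candidate in candidates:
        try:
            results[candidate] = raw.decode(candidate)
        except (UnicodeDecodeError, LookupError):
            results[candidate] = None
    return results


# Example without network access: bytes of a page written in ISO-8859-1.
raw = "codificação".encode("iso-8859-1")
results = try_charsets(raw, ["utf-8", "iso-8859-1"])
```

In this example the UTF-8 attempt fails outright and yields None, while the ISO-8859-1 attempt returns readable text, which suggests ISO-8859-1 as the value to check first in the browser.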