Unicode Web Apps

Are you trying to support foreign languages in your web application and finding ?? instead of 你好?  Do your customers get a bunch of funny vowels (Ãă) when they try to paste “smart” quotes into your content management system (a problem I termed Irritable Vowel Syndrome)?  The problem is broken Unicode support somewhere in your stack.
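Both symptoms are easy to reproduce in a few lines of Java. This is only a minimal sketch (the class and method names are made up for illustration): the first helper pushes text through a character set that cannot represent it, and the second writes UTF-8 bytes and reads them back as 8-bit Latin.

```java
import java.nio.charset.StandardCharsets;

class EncodingSymptoms {
    // Encoding into a charset that lacks the characters substitutes '?' for each one.
    static String throughLatin1(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.ISO_8859_1);
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    // Writing UTF-8 bytes and reading them back as Latin-1 turns one character into several.
    static String utf8ReadAsLatin1(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String greeting = "\u4F60\u597D";  // U+4F60 U+597D, the Chinese greeting
        String leftQuote = "\u201C";       // a left "smart" double quote
        System.out.println(throughLatin1(greeting));               // prints ??
        System.out.println(utf8ReadAsLatin1(leftQuote).length());  // prints 3
    }
}
```

The single smart quote comes back as three characters, the first of which is an accented vowel, which is exactly the funny-vowel effect described above.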

Given the prevalence and long history of Unicode, one may well be shocked that it is still not trivial to develop a web application that supports Unicode!  Let me show you how.

This is an update and extension to Learning Monk’s good post “How numerous palces [sic] do you have to set Character encoding for a site”.

So, the deal is that you need to ensure your entire web stack is configured to use Unicode.

The OS, VM and Language

Most operating systems support Unicode now, so this is rarely an issue.  If you are using something other than Windows or Linux, double check to be sure.

Java, on which this article is based, was built on Unicode.
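Being built on Unicode means Java strings are Unicode in memory. Bytes are another matter: every conversion between a string and bytes goes through a character set, and it is safest to name it. Here is a minimal sketch (the class name and the temporary file are just for illustration) that writes and reads a file with UTF-8 spelled out on both sides.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class ExplicitCharsetIo {
    // Write text to a temporary file as UTF-8 and read it back as UTF-8.
    static String roundTrip(String text) {
        try {
            Path file = Files.createTempFile("unicode-demo", ".txt");
            try {
                Files.write(file, text.getBytes(StandardCharsets.UTF_8));
                return new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
            } finally {
                Files.deleteIfExists(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String greeting = "\u4F60\u597D";  // U+4F60 U+597D
        System.out.println(greeting.equals(roundTrip(greeting)));  // prints true
    }
}
```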

The Data Store

At the bottom of your web stack is generally a persistent data store.  This is normally the first place where Unicode breaks down.

Files

Obviously, you must make sure that your files are in Unicode. If you’re storing data in plain files using Java, name UTF-8 explicitly when you read and write them; if you do not, Java uses its default character set, which is UTF-8 from Java 18 on but depends on the platform in older versions. There is one great and very stupid exception: Java properties files, which are read as 8-bit Latin (ISO-8859-1) when loaded from a stream.  Since most of us use properties files for internationalization, this becomes a huge problem.  As such, you must make sure to escape all non-Latin characters in your properties file.

For example, the classic Chinese greeting 你好 should be encoded thusly:

title.welcome=\u4F60\u597D

Most IDEs have some degree of support for converting characters.  For example, in Eclipse 3.7 you can simply paste Unicode text into a properties file and Eclipse will escape it.  For other situations, you can use this page to escape content (use the JavaScript escapes result).
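If you would rather script the escaping than rely on an IDE or a web page, it only takes a few lines. The sketch below is an illustration, not a polished tool: the class and method names are invented, and it escapes everything outside ASCII, which is more than strictly required but always safe. It also reads the escaped line back with java.util.Properties to show that the original text is recovered.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

class PropertiesEscapes {
    // Replace every non-ASCII UTF-16 code unit with its four-digit Unicode escape.
    static String escape(String text) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c < 0x80) {
                out.append(c);
            } else {
                out.append(String.format("\\u%04X", (int) c));
            }
        }
        return out.toString();
    }

    // Parse properties text and return one value; Properties decodes the escapes.
    static String read(String propertiesText, String key) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(propertiesText));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props.getProperty(key);
    }

    public static void main(String[] args) {
        String line = "title.welcome=" + escape("\u4F60\u597D");
        System.out.println(line);                         // the escaped line, as it belongs in the file
        System.out.println(read(line, "title.welcome"));  // the original greeting again
    }
}
```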

Eclipse

If you create files other than properties files (FreeMarker or Velocity templates, for example) inside Eclipse, you should make UTF-8 the default character encoding. Open Window->Preferences and select General/Workspace. Set the Text File Encoding to UTF-8.

The Database

Some databases do not support Unicode by default.  For example, in MySQL, which defaulted to 8-bit Latin (latin1) before version 8.0, one should create a database using:

CREATE DATABASE db CHARACTER SET utf8 COLLATE utf8_general_ci;

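Note that the COLLATE option specifies how text will be compared and sorted (in this example, case-insensitively).  Also note that MySQL’s utf8 stores at most three bytes per character, so it cannot hold characters outside the Basic Multilingual Plane (emoji, for example).  If you need those, use utf8mb4 with a matching collation such as utf8mb4_general_ci, available since MySQL 5.5.3.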

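If you want to upgrade existing tables to support Unicode you must use the ALTER TABLE syntax:

ALTER TABLE tbl CONVERT TO CHARACTER SET utf8 COLLATE utf8_general_ci;

CONVERT TO changes the existing text columns, whereas ALTER TABLE tbl DEFAULT CHARACTER SET only changes the default for columns added later.  Test this before you deploy it.  You need to make sure that all of the existing text is converted the way you expect it to be.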
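One thing worth testing for is text that MySQL’s utf8 character set cannot store at all. It holds at most three bytes per character, which rules out anything above U+FFFF, such as emoji. As a rough sketch (the class and method names are hypothetical), you could scan sample data for such characters before settling on utf8 versus utf8mb4:

```java
class FourByteCheck {
    // True when the text contains a code point above U+FFFF, which needs four bytes in UTF-8.
    static boolean needsFourByteUtf8(String text) {
        return text.codePoints().anyMatch(Character::isSupplementaryCodePoint);
    }

    public static void main(String[] args) {
        String greeting = "\u4F60\u597D";  // U+4F60 U+597D, both inside the Basic Multilingual Plane
        String emoji = "\uD83D\uDE00";     // U+1F600 as a surrogate pair
        System.out.println(needsFourByteUtf8(greeting));  // prints false
        System.out.println(needsFourByteUtf8(emoji));     // prints true
    }
}
```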

The Database Connection

Similarly, you must make sure your database connection is prepared to handle Unicode.  When using MySQL and JDBC use the following connection URL:

jdbc:mysql://localhost/db?useUnicode=true&characterEncoding=UTF-8

You may have additional parameters as well.  The important ones are useUnicode and characterEncoding.
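If you open connections from code rather than from a configured URL, the same two settings can be passed as connection properties. The following is a sketch under assumptions: the class and helper names, the user and the password are placeholders, and it relies on the MySQL driver accepting these keys as properties in the same way it accepts them in the URL.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;
import java.util.Properties;

class UnicodeConnection {
    // Collect the credentials plus the two Unicode settings from the URL above.
    static Properties unicodeProperties(String user, String password) {
        Properties props = new Properties();
        props.setProperty("user", user);
        props.setProperty("password", password);
        props.setProperty("useUnicode", "true");
        props.setProperty("characterEncoding", "UTF-8");
        return props;
    }

    // Open a connection; requires a running database and the MySQL driver on the classpath.
    static Connection open(String url, String user, String password) throws SQLException {
        return DriverManager.getConnection(url, unicodeProperties(user, password));
    }

    public static void main(String[] args) {
        // Only prints the settings; call open(...) once a database is available.
        Properties props = unicodeProperties("appuser", "secret");
        System.out.println(props.getProperty("characterEncoding"));  // prints UTF-8
    }
}
```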

The Web Application

You need to ensure that all components in your web application support Unicode as well.  Generally, this involves three major components.

Spring

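If you’re using the Spring Framework (which this tutorial assumes) you need to enable and configure the CharacterEncodingFilter in your application’s web.xml file.  Declare it before any other filters, since the request encoding has to be set before anything reads the request parameters.  Setting forceEncoding to true applies the encoding even when the request already declares one.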

<filter>
	<filter-name>CharacterEncodingFilter</filter-name>
	<filter-class>org.springframework.web.filter.CharacterEncodingFilter</filter-class>
	<init-param>
		<param-name>encoding</param-name>
		<param-value>UTF-8</param-value>
	</init-param>
	<init-param>
		<param-name>forceEncoding</param-name>
		<param-value>true</param-value>
	</init-param>
</filter>
<filter-mapping>
	<filter-name>CharacterEncodingFilter</filter-name>
	<url-pattern>/*</url-pattern>
</filter-mapping>

Application Server

You also need to ensure your application server is configured to use Unicode.  Tomcat 7 and earlier decode request URIs (including query string parameters) as ISO-8859-1 by default, while Tomcat 8 and later default to UTF-8.  Open your Tomcat’s server.xml file and add URIEncoding="UTF-8" to your Connector.  It should look something like this:

<Connector port="80" protocol="HTTP/1.1" redirectPort="8443" URIEncoding="UTF-8"/>

If you are using Eclipse’s WTP to develop your web application, open the Servers folder in your Package Explorer.  There you will find a folder for your Tomcat instance.  Open that to reveal the correct server.xml.
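To see why this setting matters, consider what happens to a percent-encoded query parameter. The sketch below does not use Tomcat at all; it only uses the JDK’s URLDecoder to decode the same bytes with two different character sets, which is roughly the choice URIEncoding makes for you. The class and method names are invented for the example.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

class QueryDecoding {
    // Decode a percent-encoded value using the named character set.
    static String decode(String encoded, String charsetName) {
        try {
            return URLDecoder.decode(encoded, charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        // The greeting U+4F60 U+597D as a browser would send it: six UTF-8 bytes.
        String sent = "%E4%BD%A0%E5%A5%BD";
        System.out.println(decode(sent, "UTF-8").length());       // prints 2
        System.out.println(decode(sent, "ISO-8859-1").length());  // prints 6
    }
}
```

Decoded as UTF-8 you get the two characters that were typed; decoded as 8-bit Latin you get six characters of garbage.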

The View

Finally, you must ensure that your view layer technology is configured to generate Unicode.  For JSPs, you need to add this declaration to your pages:

<%@page pageEncoding="UTF-8" contentType="text/html; charset=UTF-8"%>

This can be done manually or through the use of an automatically included JSP header, which is configured thusly in your web.xml file:

<jsp-config>
	<jsp-property-group>
		<url-pattern>*.jsp</url-pattern>
		<page-encoding>UTF-8</page-encoding>
		<include-prelude>/WEB-INF/jsp/config.jspf</include-prelude>
	</jsp-property-group>
</jsp-config>

If you are using FreeMarker, then you should set the default and output encoding to UTF-8:

<bean id="freemarkerConfiguration" class="org.springframework.ui.freemarker.FreeMarkerConfigurationFactoryBean"
	p:templateLoaderPath="/WEB-INF/ftl/"
	p:defaultEncoding="UTF-8">
	<property name="freemarkerSettings">
		<props>
			<prop key="default_encoding">UTF-8</prop>
			<prop key="output_encoding">UTF-8</prop>
		</props>
	</property>
</bean>

Spring’s View Resolver

Regardless of which technology you use, you should also configure the view resolver to use Unicode.  Here’s how you would do it for FreeMarker.  However, all of the other technologies would be configured the same way.

<bean class="org.springframework.web.servlet.view.freemarker.FreeMarkerViewResolver"
	p:contentType="text/html;charset=UTF-8"
	p:prefix="/view/" p:suffix=".ftl"/>

Generated HTML

Some suggest that the following meta-tag needs to be included on each HTML page:

<meta http-equiv='Content-Type' content='text/html; charset=UTF-8' />

I am not convinced that this is necessary.  I have never had to use it, and the browser should definitely look at the response header before it looks at meta-tags.  However, perhaps some have found it necessary in more obscure situations.  In short, don’t worry about it.
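For the curious, here is roughly what reading the character set from a Content-Type value involves. This is a deliberately simplified sketch with an invented class name; real browsers follow a more involved procedure, so treat it only as an illustration of the header carrying the encoding.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class ContentTypeCharset {
    // Return the charset named in a Content-Type value, or the fallback if none is usable.
    static Charset charsetOf(String contentType, Charset fallback) {
        for (String part : contentType.split(";")) {
            String p = part.trim();
            if (p.regionMatches(true, 0, "charset=", 0, 8)) {
                String name = p.substring(8).trim().replace("\"", "");
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    return fallback;  // illegal or unsupported charset name
                }
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        Charset fallback = StandardCharsets.ISO_8859_1;
        System.out.println(charsetOf("text/html; charset=UTF-8", fallback));  // prints UTF-8
        System.out.println(charsetOf("text/html", fallback));                 // prints ISO-8859-1
    }
}
```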

Conclusion

So, it is absolutely ridiculous that, in this day and age, so much work has to be done to get Unicode support in your web application.  Hopefully, this article will help.
