Internationalized GET parameters with Tomcat
One tiny, oft-overlooked detail when working with internationalized web apps is that the Content-Type header specifies the type and encoding of the body of the HTTP request.
Say you want to add search to your site, implemented as a form that submits using method="GET", and your target audience is Japanese. How do you deal with internationalized query parameters, considering that GET requests do not have a body?
Historically, query strings were expected to contain characters in the ISO-8859-1 character set, with anything but a subset of ASCII encoded using the %xx notation. In discussions a few years ago it was agreed that encoding international characters as UTF-8 was the best alternative. This is reflected in RFC 3986 (an IETF standard whose authors include Tim Berners-Lee of the W3C):
When a new URI scheme defines a component that represents textual data consisting of characters from the Universal Character Set [UCS], the data should first be encoded as octets according to the UTF-8 character encoding [STD63]; then only those octets that do not correspond to characters in the unreserved set should be percent-encoded. For example, the character A would be represented as "A", the character LATIN CAPITAL LETTER A WITH GRAVE would be represented as "%C3%80", and the character KATAKANA LETTER A would be represented as "%E3%82%A2".
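The three examples in that passage can be reproduced with the JDK's URLEncoder. This is only a sketch: the class and method names are made up for illustration, and URLEncoder implements HTML form encoding rather than RFC 3986 itself. The two agree for these characters, but URLEncoder writes a space as "+", for instance.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical demo class, not part of Tomcat or the original post.
public class PercentEncodeDemo {

    // Encode the text as UTF-8 octets, then percent-encode the octets
    // that are not safe to send as-is.
    public static String encode(String text) {
        try {
            return URLEncoder.encode(text, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is required to be present on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(encode("A"));       // A
        System.out.println(encode("\u00C0"));  // LATIN CAPITAL LETTER A WITH GRAVE: %C3%80
        System.out.println(encode("\u30A2"));  // KATAKANA LETTER A: %E3%82%A2
    }
}
```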
If you're running Tomcat, watch out! If your URL contains Unicode characters, the browser will properly encode them as UTF-8 and then do the %xx magic for each byte. Tomcat will treat each encoded byte in the UTF-8 representation as a single ISO-8859-1 character and make your Japanese users very unhappy.
Unfortunately, this is Tomcat's default behavior. Fortunately, there is a server configuration option that will force it to treat URLs as per RFC 3986. To enable it, edit server.xml and add the following attribute to the Connector element:
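To see the effect outside a container, the mismatch can be simulated with the JDK's URLDecoder. This is a sketch of the behavior described above, not Tomcat's actual code, and the class and method names are hypothetical. The repair method shows a common workaround for when the server configuration cannot be changed. It assumes the garbled string really came from decoding UTF-8 bytes as ISO-8859-1.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class that mimics the decoding mismatch.
public class MisdecodeDemo {

    // Percent-decode the raw value, interpreting the bytes in the given charset.
    public static String decode(String raw, String charset) {
        try {
            return URLDecoder.decode(raw, charset);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException(e);
        }
    }

    // Workaround: turn the wrongly decoded characters back into their
    // original bytes, then decode those bytes as UTF-8.
    public static String repair(String garbled) {
        byte[] originalBytes = garbled.getBytes(StandardCharsets.ISO_8859_1);
        return new String(originalBytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String raw = "%E3%82%A2"; // KATAKANA LETTER A as sent by the browser

        String wrong = decode(raw, "ISO-8859-1");
        String right = decode(raw, "UTF-8");

        System.out.println(wrong.length());               // 3 (one char per byte)
        System.out.println(right.length());               // 1
        System.out.println(repair(wrong).equals(right));  // true
    }
}
```

The workaround is fragile: if the container is later switched to decode URIs as UTF-8, the same repair step would corrupt values that were already correct.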
<Server ...>
  <Service ...>
    <Connector ... URIEncoding="UTF-8"/>
    ...
  </Service>
</Server>
BTW, I've tested this in Tomcat 5.5.12. I don't know whether it will work in previous versions.
7 Comments:
This is an old but very helpful post for me, thanks.
Very helpful, even years after the post. You rock!
Thank you, helped a lot (even today in 2009!).
Hi Guys,
This is regarding an issue I am facing while sending UTF-8 characters to a servlet with the GET method, directly from the browser.
I have applied the following settings:
1. Created a CharsetFilter, which sets the encoding of each request to UTF-8
2. Applied this filter in web.xml to all requests
3. In my servlet, while writing the response, I set the content type with response.setContentType to text/html;charset=utf-8
With the above settings, accented characters like ÀÁÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþÿ work correctly, but Chinese characters, Arabic characters, etc. do not.
However, if along with the above settings I change server.xml to have useBodyEncodingForURI="true" and/or URIEncoding="UTF-8" in the Connector tag, the Chinese and Arabic characters work fine but the accented characters no longer work :(.
I have tried all combinations of the settings mentioned, but somehow only one of the two situations works at a time.
Has anybody come across this problem? Any pointers will be great.
Thanks
Param
Thank you!
You saved my day!