Web tier: servlets, JSP, Web frameworks: 8859-1 to UTF-8 conversion and platform default charset

  1. Hi,
    I have a JSP with UTF-8 page encoding (declared in the HTML head). This JSP receives form data from an external website, and that data is encoded in ISO-8859-1. I call request.setCharacterEncoding("ISO-8859-1"); at the beginning, before reading any parameter. I am trying to convert these ISO-8859-1 values to UTF-8 before displaying them (or storing them in a database).
    Here are my configuration and code:
    Configuration:
    Application server: Resin on Windows
    JDK (JVM): jre1.5.0_11
    JVM default charset: UTF-8


    Code:
    System.out.println("Default charset:"+Charset.defaultCharset());
    Enumeration params = request.getParameterNames();
    while (params.hasMoreElements()) {
        String name = (String) params.nextElement();
        String value = request.getParameter(name);
        String value1 = new String(value.getBytes("UTF-8"));
        System.out.println("Name:" + name + "\t value:" + value1);
    %> <%=name%>
    Server Console Output:
    Default charset:UTF-8
    Name:subject value: Vincent ‘Sonny’ Pirozzi Jr., Merrimack


    Actual encoded text in originating HTML:
    Vincent ‘Sonny’ Pirozzi Jr., Merrimack

    Do you see anything I am doing wrong? My intention is to convert the ISO-8859-1 characters to UTF-8 on a JVM whose default charset is UTF-8.
  2. Hi,
    ISO-8859-1 is a subset of UTF-8. That is, UTF-8 is single-byte when it uses ONLY ISO-8859-1 characters, so on disk the bytes look exactly the same whether they are stored as ISO-8859-1 or UTF-8. For this reason, you never have to "convert" arbitrary UTF-8 into ISO-8859-1, because that's impossible. And you never have to convert ISO-8859-1 to UTF-8, because ISO-8859-1 is already UTF-8 (more accurately, a subset of UTF-8, but still UTF-8).

    Secondly, I don't know off the top of my head, but are those funky open/close quotes part of ISO-8859-1?

    Thirdly, is your terminal/console capable of printing characters like those quotes?

    Best regards,
    --j
  3. Sorry j, but you are wrong. If all your text is in English you would be right, but then every ISO-8859-X charset would be UTF-8 too. There are a lot of letters in ISO-8859-1 that take 16 bits (two bytes) in UTF-8. So you will often have to convert. --jso
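
The point made in the last reply can be checked with a few lines of Java. The following is only a sketch, not code from the thread: the class name CharsetDemo and its helper methods are made up for illustration, and it assumes Java 7 or later for StandardCharsets (on the JRE 1.5 mentioned in the question, Charset.forName("ISO-8859-1") and getBytes("UTF-8") would be needed instead). It compares byte lengths in both charsets, checks whether the curly quotes exist in ISO-8859-1 at all, and shows that encoding and decoding with the same charset, as the posted snippet does when the default charset is UTF-8, changes nothing.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class CharsetDemo {

    /** Number of bytes the text occupies when encoded in the given charset. */
    public static int encodedLength(String text, Charset charset) {
        return text.getBytes(charset).length;
    }

    /** True if every character of the text can be represented in ISO-8859-1. */
    public static boolean fitsLatin1(String text) {
        return StandardCharsets.ISO_8859_1.newEncoder().canEncode(text);
    }

    /** Transcodes raw bytes: decode them as ISO-8859-1, then re-encode as UTF-8. */
    public static byte[] latin1ToUtf8(byte[] latin1Bytes) {
        String decoded = new String(latin1Bytes, StandardCharsets.ISO_8859_1);
        return decoded.getBytes(StandardCharsets.UTF_8);
    }

    /** Encodes and decodes with the same charset, so the result equals the input. */
    public static String sameCharsetRoundTrip(String value) {
        byte[] bytes = value.getBytes(StandardCharsets.UTF_8);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String ascii = "Merrimack";
        String accented = "S\u00e5";          // "S" followed by a-ring (U+00E5)
        String quoted = "\u2018Sonny\u2019";  // curly single quotes

        // Pure ASCII: identical length in both charsets.
        System.out.println(encodedLength(ascii, StandardCharsets.ISO_8859_1)); // 9
        System.out.println(encodedLength(ascii, StandardCharsets.UTF_8));      // 9

        // U+00E5 is one byte in ISO-8859-1 but two bytes in UTF-8.
        System.out.println(encodedLength(accented, StandardCharsets.ISO_8859_1)); // 2
        System.out.println(encodedLength(accented, StandardCharsets.UTF_8));      // 3

        // The curly quotes are not part of ISO-8859-1.
        System.out.println(fitsLatin1(quoted)); // false

        // Same-charset round trip is a no-op.
        System.out.println(sameCharsetRoundTrip(quoted).equals(quoted)); // true

        // Real transcoding happens on bytes: 0xE5 becomes 0xC3 0xA5.
        System.out.println(latin1ToUtf8(new byte[] {(byte) 0xE5}).length); // 2
    }
}
```

Inside the JVM a String has no ISO-8859-1 or UTF-8 encoding of its own, so a conversion like this only matters where bytes are read or written (request decoding, response output, database driver). The curly quotes in the sample text sit at 0x91 and 0x92 in Windows-1252, so a form that claims ISO-8859-1 but delivers those characters is quite possibly sending Windows-1252; that is a guess about the external site, not something the thread confirms.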