General J2EE: Encoding converion through JDBC
- Posted by: sourav mazumder
- Posted on: June 06 2003 01:26 EDT
Ive some query on how JDBC is supposed to convert encoding of characters while updating/selecting multibyte strings to/from a Database supporting Unicode characters. The specific queries are like
1. Where from the JDBC driver is supposed to get the encoding of the target database instance/database column ? Is there something specifically mandated by JDBC specs or left for particular vendor implementation ? For example
- If from a Java client an Oracle DB, having UTF-8 as database character set, is being connected, how does JDBC driver know what is the database character set ?
- If from a Java Client a SQL Server 2000 DB instance is queried for a nvarchar field (SQL server 2000 always expects nvarchar characters to be in UCS-2), how JDBC driver will know that it has to send UCS-2 characters
- If from a Java Client a SQL Server 2000 DB instance is queried for a varchar field having encoding type same as the code page of the locale of the Windows on which the SQL server is running, how JDBC driver will know that it has to send multibyte characters corresponding to that code page
2. Can the encoding of the target database instance be specified through the JDBC URL ?
3. Can this encoding be specified through any JDBC API ? (Ive not found anything when I checked out) ?
4. How does the encoding of the OS on which the Java Client (which does the JDBC connection, select/update etc.) is running affect the JDBC query ?
5. How much of these issues are covered in JDBC specification and what are the thing left to the vendors implementation of JDBC ? Are all different type of JDBC drivers (type 1, type 2 , type 3 and type 4) supposed to work similar way in this context is any mandate from JDBC specification for this matter ?
At the high level, what Im looking for are the working principles for converting encoding of characters while updating/selecting multibyte strings to/from a Database based on JDBC specification (or absence of any such thing).
Thanks in advance,
Did you get the answers to your questions?
I am facing same queries about JDBC drivers.
Kindly let me know the answers if you know them now.
Me too facing the same questions while globalizing
It wld be really great if you can post your findings
on this topic here.
1: Java Strings are UTF-16 encoded. When you use JDBC to retrieve data from or insert into the database, the datab must be converted from UTF-16 to the database character set or the national character set and vice versa. The JDBC driver does not know what the database character set is. When data insertion/retrieval is attempted from Java, the database implicitly converts the data from/to UTF-16 to/from the native (or national) character set.
Therefore, setting the national character set (or the native character set) properly is essential to proper data conversion. The character set that is specified for the database should reflect the setting for the client operating system. Setting this paramter correctly (NLS_LANG, in case of Oracle), enables proper conversion from the client operating system code page to the database character set. When these settings are the same, the database assumes that the data being sent or received is encoded in the same character set as the database character set, so no validation or conversion is performed. This can lead to corrupt data if conversions are necessary.
Database vendors provide JDBC implementations of their drivers, where the PreparedStatement tries to convert the data to the database character set. During this conversion, data can be lost when the database character set is a subset of the OS national character set.
Note in case of Oracle, this convestion to and from the database character set depends upon which client you are using.
1.1: UCS-2 is a superset of UTF-16 and the vendor implementation knows it has to use UCS-2
1.2: Again, JDBC uses UTF-16 and PreparedStatement implementation of the vendor will know what encoding to use for NCHAR or NVARCHAR columns of the database. Since the codepage of the OS and the database, there will be no loss of data in conversion (since the two character sets match).
2 & 3. Not that I know of..
4. check out http://www.lc.leidenuniv.nl/awcourse/oracle/server.920/a96529/ch6.htm#1005683 for how it affects in Oracle.
5. I can answer that partly. The different type of drivers work differently. As mentioned before, most notable difference is where the conversion takes place (for example, when you use thin client, the database character set conversoin happens at the database level). As for what things are left to vendor's implementation, I guess you'd have to look at the JDBC API spec..
Forgot to mention, have a look at the following URL for a more generic discussion:
http://www.mail-archive.com/tomcat-user at jakarta dot apache dot org/msg134989.html