News: TSS Interop Blog: String encodings - Another thorn in interop

  1. Think about it - you have Java on the one hand using UTF-16. Thankfully, so does the CLR (.NET), which covers a decent majority of language interop cases. But what about other languages that use UTF-8, or even pure C/C++ with Favorite Multibyte String Library of the Week bolted on? In this post on TheServerSide Interoperability Blog, Scott Balmos takes a look at a thorn in interoperability’s side: the less-than-vaunted string encoder. What do you think? Should we need to "explicitly encode our string data into UTF-8 or elsewhere" if we want to be “completely” interoperable with other languages? http://tssblog.techtarget.com/index.php/interoperability/string-encodings-another-thorn-in-interop/
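
To make the UTF-16 versus UTF-8 mismatch concrete, here is a minimal Python sketch. The sample text and variable names are assumptions chosen for illustration; they do not come from the blog post.

```python
# Illustrative sample text (an assumption): "naive cafe" with two accented letters.
text = "na\u00efve caf\u00e9"

# The same ten characters produce different byte sequences per encoding.
utf8_bytes = text.encode("utf-8")       # accented letters take two bytes each
utf16_bytes = text.encode("utf-16-le")  # every character here takes two bytes

# Decoding with the encoding that produced the bytes restores the text.
roundtrip_utf8 = utf8_bytes.decode("utf-8")
roundtrip_utf16 = utf16_bytes.decode("utf-16-le")

# Decoding with the wrong encoding silently yields different text.
misread = utf8_bytes.decode("latin-1")
```

The receiver has to know which encoding the sender used; nothing in the bytes themselves says so, which is the interop problem the post describes.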

    Threaded Messages (5)

  2. And as far as I have seen so far (please correct me!), SOAP and friends do not really specify encodings to use when passing data that smells like a string.
    XML content declares its character encoding in the <?xml ?> declaration at the top of the file. http://www.w3.org/TR/2000/REC-xml-20001006#sec-guessing So any format that is based on XML will have no problem with interop, provided that the XML parser used is sufficiently compliant with the specification. (As long as they send text as text, and not as base64-encoded binary data - not a good idea, but it happens.) - Erwin
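
A small Python sketch of the point above, using the standard library's xml.etree.ElementTree. The helper name parse_name, the element name, and the sample text are assumptions for illustration only.

```python
import xml.etree.ElementTree as ET


def parse_name(encoding: str) -> str:
    """Serialize a tiny document in `encoding` and parse it back.

    Hypothetical helper: the parser is never told the encoding directly,
    it reads it from the XML declaration inside the bytes.
    """
    doc = (
        f'<?xml version="1.0" encoding="{encoding}"?>'
        "<name>Gr\u00fc\u00dfe</name>"
    )
    raw = doc.encode(encoding)  # bytes as they would travel on the wire
    return ET.fromstring(raw).text


# Different bytes on the wire, same text after parsing.
from_utf8 = parse_name("utf-8")
from_latin1 = parse_name("iso-8859-1")
```

As long as the declaration matches the bytes actually sent, both documents parse to the same string.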
  3. Data inside a SOAP variable

    True, Erwin. But that's for the XML data by itself. In the case of SOAP, I mean the encoding for any string datatypes in a SOAP message. The SOAP spec simply defines the string datatype as a general catch-all sequence of letters and numbers. But what letters and numbers are being used? We generally assume US-ASCII. What about international string data inside of it, though? Personally, I've seen string used alongside another message variable that defines the encoding, I've seen base64-encoded binary blobs, and (of course) others that just say screw it and use only ASCII. And yes, I have seen some rather dumb SOAP stacks that specify the XML encoding tag, and then end up using another encoding for string data. --S
  4. Not a problem

    As Erwin indicated, XML documents get their encoding from the XML prolog at the top of the file. The relevant specs are clear -- string data is to be encoded using the same encoding as the XML document it is contained within. Any SOAP stack which violates this rule is not compliant.
  5. Re: Not a problem

    Untrue. With HTTP as a transport, the Content-Type header's charset parameter will override the XML declaration's encoding, because the server or a proxy might have changed the encoding on the fly. Also, while XML documents default to UTF-8, HTTP defaults to ISO-8859-1, so without a specified encoding you can only guess. The full gory details can (and, judging from the blog post, *should*) be read here: http://diveintomark.org/archives/2004/02/13/xml-media-types Saying that "the SOAP specs are clear enough for producing interoperable SOAP stacks", btw, for me, sounds like saying "3.14 is a crystal clear expression of pi". :-)
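
The precedence described above can be sketched roughly as follows. This is a simplified illustration, not a compliant implementation: the function name effective_encoding is hypothetical, byte order marks and UTF-16 declarations are ignored, and the final UTF-8 fallback is an assumption standing in for the "you can only guess" case.

```python
import re
from email.message import Message
from typing import Optional

# Matches an encoding pseudo-attribute in an ASCII-compatible XML declaration.
_XML_DECL = re.compile(
    rb'^<\?xml[^>]*?encoding=["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)


def effective_encoding(content_type: Optional[str], body: bytes) -> str:
    """Pick the encoding to decode an XML payload received over HTTP."""
    # 1. An explicit charset parameter on the Content-Type header wins.
    if content_type:
        msg = Message()
        msg["Content-Type"] = content_type
        charset = msg.get_content_charset()  # lowercased, or None if absent
        if charset:
            return charset
    # 2. Otherwise fall back to the XML declaration, if there is one.
    match = _XML_DECL.match(body)
    if match:
        return match.group(1).decode("ascii").lower()
    # 3. Nothing declared anywhere: assume the XML default (a guess).
    return "utf-8"


body = b'<?xml version="1.0" encoding="UTF-8"?><a/>'
overridden = effective_encoding("text/xml; charset=ISO-8859-1", body)
declared = effective_encoding("application/xml", body)
guessed = effective_encoding(None, b"<a/>")
```

In the first call the header and the declaration disagree, and the header's value is the one returned, which is exactly the situation where a careless stack decodes the payload incorrectly.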
  6. Re: Data inside a SOAP variable

    True, Erwin. But that's for the XML data by itself. In the case of SOAP, I mean the encoding for any string datatypes in a SOAP message.
    There is no difference between "the encoding of the XML" and the encoding of "any string datatypes" in an XML document (and SOAP is XML). There is only "the encoding". That's one of the major points of XML: it's just a text file, not some binary format with different portions in different encodings. But John Vance is making an interesting point. I need to look up whether it is true that proxy servers are allowed to change encodings. I'm not sure they can - we're talking about HTTP, not about the hell that's called the SMTP protocol. But even if it is true, it can simply be avoided by specifying the content type as application/xml instead of text/xml; no proxy server is allowed to change binary content in an HTTP payload. - Erwin