XML & Web services: how to write file in utf-8

  1. how to write file in utf-8 (14 messages)

    I read about how you can use

    <%@page contentType="text/xml; charset=UTF-8" %>

    to instruct the JSP writer to output in UTF-8.

    I wonder how I can write a file in UTF-8. What's the default charset when you write a file using FileWriter? How can you change it?

  2. how to write file in utf-8

    Since this is in the XML and Web Services forum, I presume you are talking about creating an XML file. In an XML file, the XML declaration on the first line lets you state which encoding the file uses.

    For example:

    <?xml version="1.0" encoding="UTF-8" ?>
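
    The declaration only labels the file; the bytes still have to be written in that encoding. Below is a minimal sketch of writing the declaration through a UTF-8 writer and parsing the result back. The class XmlDeclarationSketch and its method names are hypothetical, and an in-memory buffer stands in for the file.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

// Hypothetical demo class; names are illustrative only.
class XmlDeclarationSketch {

    // Encodes a tiny XML document as UTF-8 bytes with a matching declaration.
    // Assumes text contains no markup characters such as '<' or '&'.
    static byte[] writeXml(String text) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (Writer writer = new OutputStreamWriter(bytes, StandardCharsets.UTF_8)) {
            writer.write("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n");
            writer.write("<greeting>" + text + "</greeting>\n");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Parses the bytes; the parser chooses its decoder from the declaration.
    static String readGreeting(byte[] xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        byte[] xml = writeXml("caf\u00e9");
        System.out.println(readGreeting(xml).equals("caf\u00e9")); // round trip preserved
    }
}
```

    If the writer used a different charset than the one named in the declaration, non-ASCII text would typically come back garbled or the parser would reject the input.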
  3. how to write file in utf-8

    When you wish to write UTF-8 content to an output stream, you should open an OutputStreamWriter with the charset parameter set to "UTF-8". So the complete code would be (buffering omitted for brevity):

    OutputStream out = new FileOutputStream(...);
    Writer writer = new OutputStreamWriter(out, "UTF-8");

    As for the default encoding, as far as I know it is determined by the "file.encoding" system property, which is an unsupported feature of Sun's VMs. The encoding is needed by some libraries (e.g., property files, security policy files, etc.) and should generally match the OS's native encoding. Changing it would probably be a bad idea (plus, it is unsupported, and may cease to work in a future version of the VM).

    Hope that helps.
    Gal
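
    The two lines above can be put into runnable form. This is a minimal sketch, assuming a temporary file and a hypothetical class name Utf8FileSketch. It uses StandardCharsets.UTF_8 (Java 7 and later) instead of the "UTF-8" string, and prints the platform default via Charset.defaultCharset() for comparison.

```java
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical demo class; names are illustrative only.
class Utf8FileSketch {

    // Writes text to the file as UTF-8, regardless of the platform default.
    static void writeUtf8(Path file, String text) throws IOException {
        try (Writer writer = new OutputStreamWriter(
                new FileOutputStream(file.toFile()), StandardCharsets.UTF_8)) {
            writer.write(text);
        }
    }

    // Writes text to a temporary file and returns the resulting size in bytes.
    static long utf8FileSize(String text) {
        try {
            Path file = Files.createTempFile("utf8-sketch", ".txt");
            try {
                writeUtf8(file, text);
                return Files.size(file);
            } finally {
                Files.delete(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("Default charset: " + Charset.defaultCharset());
        // Four chars, five bytes: the accented e takes two bytes in UTF-8.
        System.out.println("Bytes written: " + utf8FileSize("caf\u00e9"));
    }
}
```

    FileWriter, by contrast, encodes with the platform default charset unless a charset is passed explicitly, which its constructors only allow from Java 11 on.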
  4. thanks

    That's exactly what I was looking for.
  5. url.openStream() returns an InputStream, and I can wrap an InputStreamReader around that, but I don't know how to find out which charset is used.

    For example: I try to open a stream from one of my JSPs. How can I find out what charset the JSP is using?

    I think I'm missing something here; I think the charset is an HTTP header (?), but the URL is for any protocol (?).
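  6. Use url.openConnection() instead. This returns a URLConnection, from which you can get the input stream as well as the content type (URLConnection#getContentType). For HTTP, the charset is carried in the charset parameter of the Content-Type header, when the server declares one. Note that URLConnection#getContentEncoding returns the Content-Encoding header (compression such as gzip), not the charset.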

    Gal
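
    Extracting the charset from the content type could look roughly like this. It is a sketch, not a full header parser: the class ContentTypeCharsetSketch, the helper names, and the choice of a caller-supplied fallback charset are all assumptions for illustration.

```java
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.net.URL;
import java.net.URLConnection;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; names are illustrative only.
class ContentTypeCharsetSketch {

    // Returns the charset named in a Content-Type value such as
    // "text/xml; charset=UTF-8", or the fallback if none is usable.
    static Charset charsetOf(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        for (String part : contentType.split(";")) {
            String p = part.trim();
            if (p.regionMatches(true, 0, "charset=", 0, 8)) {
                String name = p.substring(8).trim().replace("\"", "");
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    return fallback; // unknown or malformed charset name
                }
            }
        }
        return fallback;
    }

    // Opens a reader on the URL, decoding with the declared charset if any.
    static Reader openReader(URL url, Charset fallback) throws IOException {
        URLConnection conn = url.openConnection();
        Charset charset = charsetOf(conn.getContentType(), fallback);
        return new InputStreamReader(conn.getInputStream(), charset);
    }

    public static void main(String[] args) {
        Charset fallback = StandardCharsets.ISO_8859_1;
        System.out.println(charsetOf("text/xml; charset=UTF-8", fallback)); // UTF-8
        System.out.println(charsetOf("text/html", fallback));               // ISO-8859-1
    }
}
```

    For protocols other than HTTP, getContentType may return null or a guessed value without a charset, so the fallback decides the decoding in those cases.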
  7. I was also wondering where you put the buffer: close to the file, or close to the code?

    Is the buffered stream the one you write to, or is it the last one in the chain?

    How about input?
  8. That's an interesting question. I think most things here are equal except the frequency of the char-to-byte conversion calls. If you don't use a buffered writer, there will be a separate character conversion call for every write (which presumably translates into more method invocations, more array copies, etc.). So my guess would be that placing a BufferedWriter at the top would yield the best performance.
    By the way, OutputStreamWriter buffers its output at the byte level anyway, so placing a BufferedOutputStream under it probably won't do much good.

    The same logic applies to InputStreamReaders. Unless you buffer around them, there would be more byte-to-char conversion calls and array copies.

    Gal
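
    The placement described above, with the buffer closest to the calling code, can be sketched for both output and input as follows. The class BufferPlacementSketch and its methods are hypothetical, and the temporary file is only there to make the example self-contained.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical demo class; names are illustrative only.
class BufferPlacementSketch {

    // Buffer on top: small write() calls are collected as chars first,
    // then passed to the encoder in large chunks.
    static void writeLines(File file, List<String> lines) throws IOException {
        try (BufferedWriter out = new BufferedWriter(
                new OutputStreamWriter(new FileOutputStream(file), StandardCharsets.UTF_8))) {
            for (String line : lines) {
                out.write(line);
                out.newLine();
            }
        }
    }

    // Same idea for input: the BufferedReader asks the decoder for big chunks.
    static List<String> readLines(File file) throws IOException {
        List<String> lines = new ArrayList<>();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(new FileInputStream(file), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                lines.add(line);
            }
        }
        return lines;
    }

    // Writes the lines to a temporary file and reads them back.
    static List<String> roundTrip(List<String> lines) {
        try {
            File file = File.createTempFile("buffer-sketch", ".txt");
            try {
                writeLines(file, lines);
                return readLines(file);
            } finally {
                file.delete();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("first", "gr\u00fc\u00dfe");
        System.out.println(roundTrip(lines).equals(lines)); // true
    }
}
```

    Closing the outermost writer or reader flushes and closes the whole chain, so only the top-level object needs to be managed by try-with-resources.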
  9. There is no OutputStreamWriter constructor with a buffer size parameter, and the default buffer size cannot be determined from code. Considering the write calls to the underlying output stream, placing a BufferedOutputStream under it may be worthwhile.
  10. The default buffer size is 8192 bytes. I agree that not being able to change it could theoretically be a problem, but I don't think it usually is. 8192 is probably big enough for most practical purposes.
    Anyway, the point of discussion was which is better: buffering the writer or the stream. Do you see a point in buffering both? Or is there a reason to buffer the stream instead of the writer?

    Gal
  11. The point is buffering both. The default buffer size of 8192 is BufferedWriter's. I can't find the default buffer size for OutputStreamWriter in the API docs or the code, so, being unsure about it, using a BufferedOutputStream (with a buffer larger than its default of 512) may be more efficient, though it only makes sense for a very large file.
  12. The default buffer size for OutputStreamWriter is 8192 in all current versions of Java.

    How does buffering both make sense? As far as I can tell all it can possibly do is waste memory on buffers. Consider this chain:
    BufferedWriter(bufferSize1)->OutputStreamWriter->BufferedOutputStream(bufferSize2)->FileOutputStream

    BufferedWriter only performs a write on the OutputStreamWriter every bufferSize1 characters, which then translates into about 2*bufferSize1 bytes sent to the BufferedOutputStream. If this is bigger than bufferSize2 then the BufferedOutputStream doesn't do anything. But why shouldn't it be bigger? Instead of assigning a buffer on both, just assign the BufferedWriter a sufficiently large buffer. This will achieve the same result and save the memory needed for the smaller buffer.

    Gal
  13. using 2 buffers

    I think the first buffer prevents many conversion calls, because it groups data in bigger chunks before the first writer converts the data.

    The second buffer prevents many I/O calls, because data is grouped into chunks equal to the buffer size before it is actually written to the file.

    The first one is necessary if the stream is written in little chunks (like writing XML from Java code). It's not interesting if you read data from another stream, because you use a buffer to hold whatever you read, so you can choose an appropriate byte buffer there.

    The second one is important if the file you write is big, because writing (actual file output) 3 chunks of 1 megabyte is probably faster than writing the same 3 megabytes as 384 blocks of 8 KB.

    Buffering input before converting it makes sense for big files, because you avoid many actual read calls by reading data in chunks. Buffering input after conversion makes sense if you want to minimize the number of conversion calls, because one single read would fill the entire buffer, so conversion would happen in big chunks. If you read the stream into a buffer in Java code, the buffer after conversion may become useless, because you already have a buffer in your code.

    Is this a correct interpretation of the mechanisms? Is this a complete description of how performance can be improved in all i/o situations?
  14. using 2 buffers

    Dieter,

    As far as I can tell, it doesn't really matter whether you buffer on the writer or on the stream as far as the sheer number of physical I/O calls is concerned. Assume the buffer size is 100. When you buffer on the writer, it will only produce write calls to the stream every 100 chars. When it does produce a call to the stream, it won't write every char separately. It writes them all in one chunk. Whether the stream is buffered or not will have no implication unless the size of the stream buffer is bigger than the size of the writer buffer. And like I mentioned in a previous post, there's no reason why it should be.
    The only difference between buffering on the writer and on the stream is the number of conversion calls: buffering on the writer also avoids many small conversion calls, and buffering on the stream doesn't.

    To conclude: I think buffering on the writer would be marginally better than buffering on the stream. Buffering on both will make absolutely no difference (except for wasting memory on the extra buffer).

    The same is true for readers. A buffered reader will already read from the stream in large chunks (~ the size of the buffer), so I don't see a reason why having a buffered stream beneath it should do any kind of good.

    Gal
  15. using 2 buffers

    BufferedOutputStream makes sense for a very large file, if the efficiency you are looking for is fewer write calls to the underlying FileOutputStream. The example below makes this clear.


    // Test.java
    import java.io.*;

    public class Test {

        public static void main(String[] args) throws Exception {
            BufferedWriter bw;

            System.out.println("Test 1: BufferedWriter Def buffer 8192");
            bw = new BufferedWriter(new OutputStreamWriter(new MyFileOutputStream("TestFile1")));
            doWrite(bw);

            System.out.println("Test 2: BufferedWriter buffer 16384");
            bw = new BufferedWriter(new OutputStreamWriter(new MyFileOutputStream("TestFile1")), 16384);
            doWrite(bw);

            System.out.println("Test 3: BufferedWriter Def buffer 8192 and BufferedOutputStream buffer 16384");
            bw = new BufferedWriter(new OutputStreamWriter(new BufferedOutputStream(new MyFileOutputStream("TestFile2"), 16384)));
            doWrite(bw);

            System.out.println("Test 4: BufferedWriter buffer 16384 and BufferedOutputStream buffer 16384");
            bw = new BufferedWriter(new OutputStreamWriter(new BufferedOutputStream(new MyFileOutputStream("TestFile2"), 16384)), 16384);
            doWrite(bw);
        }

        // Writes 10000 short strings, then flushes and closes the whole chain.
        static void doWrite(Writer wr) throws IOException {
            String s = "Time Waste";

            for (int i = 0; i < 10000; i++) {
                wr.write(s);
            }

            wr.flush();
            wr.close();
            System.out.println();
        }
    }



    // MyFileOutputStream.java
    import java.io.IOException;

    // FileOutputStream that logs every write call reaching the file.
    public class MyFileOutputStream extends java.io.FileOutputStream {
        private int i; // number of write calls so far

        public MyFileOutputStream(String name) throws java.io.FileNotFoundException {
            super(name);
        }

        @Override
        public void write(int b) throws IOException {
            super.write(b);
            msg(1);
        }

        @Override
        public void write(byte[] b) throws IOException {
            super.write(b);
            msg(b.length);
        }

        @Override
        public void write(byte[] b, int off, int len) throws IOException {
            super.write(b, off, len);
            msg(len);
        }

        // Prints the call number (W) and the number of bytes in that call (B).
        private void msg(int j) {
            System.out.print("W " + ++i + " B " + j + ", ");
        }
    }