Java - Translating bytes from Korean to UTF-8, what am I not getting here?
There is something incomplete in my understanding here. If I run the code below, I expect to see:
    TranslateTest:: start
     start_korean: (6) c0 af c8 f1 c8 c6
    expected_utf8: (6) c7 20 d7 6c d6 c8
       found_utf8: (6) c7 20 d7 6c d6 c8
    expected utf8 matches found? true
What I actually get is:
    TranslateTest:: start
     start_korean: (6) c0 af c8 f1 c8 c6
    expected_utf8: (6) c7 20 d7 6c d6 c8
       found_utf8: (9) ec 9c a0 ed 9d ac ed 9b 88
    expected utf8 matches found? false
I thought that creating a String, declaring its bytes to be x-windows-949, and then getting the bytes as UTF-8 would translate them from one encoding to the other. Apparently, I am not correct about this.
    public class TranslateTest {
        public static void main(String[] argv) {
            (new TranslateTest()).translate();
        }

        void translate() {
            System.out.println("TranslateTest:: start");
            try {
                // The pages below are linked from http://msdn.microsoft.com/en-us/goglobal/cc305154
                // Please ignore the lame bytesToHex helper method. I'm including it for completeness.

                // http://msdn.microsoft.com/en-us/goglobal/gg696909
                //
                // 0xC0AF = U+C720 = HANGUL SYLLABLE IEUNG YU

                // http://msdn.microsoft.com/en-us/goglobal/gg696960
                //
                // 0xC8F1 = U+D76C = HANGUL SYLLABLE HIEUH YI

                // http://msdn.microsoft.com/en-us/goglobal/gg696960
                //
                // 0xC8C6 = U+D6C8 = HANGUL SYLLABLE HIEUH U NIEUN

                byte[] start_korean = new byte[]
                    { (byte)0xC0, (byte)0xAF, (byte)0xC8, (byte)0xF1, (byte)0xC8, (byte)0xC6 };
                byte[] expected_utf8 = new byte[]
                    { (byte)0xC7, (byte)0x20, (byte)0xD7, (byte)0x6C, (byte)0xD6, (byte)0xC8 };

                // Decode the bytes as x-windows-949, then re-encode the String as UTF-8.
                String str = new String(start_korean, "x-windows-949");
                byte[] found_utf8 = str.getBytes("UTF8");
                boolean isEqual = java.util.Arrays.equals(expected_utf8, found_utf8);

                System.out.println(" start_korean: " + bytesToHex(start_korean));
                System.out.println("expected_utf8: " + bytesToHex(expected_utf8));
                System.out.println(" found_utf8: " + bytesToHex(found_utf8));
                System.out.println("expected utf8 matches found? " + isEqual);
            } catch (java.io.UnsupportedEncodingException uee) {
                System.err.println(uee.getMessage());
            }
        }

        public static String bytesToHex(byte[] b) {
            StringBuffer str = new StringBuffer("(" + b.length + ") ");
            for (int idx = 0; idx < b.length; idx++) {
                str.append(" " + byteToHex(b[idx]));
            }
            return str.toString();
        }

        public static String byteToHex(byte b) {
            // A negative byte is sign-extended to eight hex digits, so keep only the last two.
            String hex = Integer.toHexString(b);
            while (hex.length() < 2) hex = "0" + hex;
            if (hex.length() > 2) hex = hex.substring(hex.length() - 2);
            return hex;
        }
    }
Your problem is that the "expected UTF-8" values are the Unicode code points, not the UTF-8 encoding of those code points. I added this code:
    StringBuilder buf = new StringBuilder();
    for (int i = 0; i < str.length(); i++)
        buf.append(", ").append(Integer.toHexString(str.codePointAt(i)));
    System.out.println(" internal: " + buf.substring(2));
This prints the values you listed as expected: c720, d76c, d6c8.
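When these code points are UTF-8 encoded, they are rendered as the values you actually see: each of the three code points becomes three bytes, which is why you get nine bytes instead of six.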
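To make the difference concrete, here is a minimal sketch that encodes U+C720 by hand using the three-byte UTF-8 pattern (1110xxxx 10xxxxxx 10xxxxxx) and compares it with what the JDK produces. The class name Utf8ByHand and its helper methods are made up for this illustration, and the hand-rolled encoder only covers the U+0800..U+FFFF range, which is where the Hangul syllables live.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class Utf8ByHand {
    // Hypothetical helper: encodes one code point in U+0800..U+FFFF
    // using the three-byte UTF-8 form 1110xxxx 10xxxxxx 10xxxxxx.
    // (Surrogate values in that range are not valid input; not checked here.)
    static byte[] encodeThreeByte(int codePoint) {
        if (codePoint < 0x0800 || codePoint > 0xFFFF) {
            throw new IllegalArgumentException("not in the three-byte range");
        }
        return new byte[] {
            (byte) (0xE0 | (codePoint >> 12)),          // top 4 bits
            (byte) (0x80 | ((codePoint >> 6) & 0x3F)),  // middle 6 bits
            (byte) (0x80 | (codePoint & 0x3F))          // low 6 bits
        };
    }

    // Hypothetical helper: formats bytes as space-separated two-digit hex.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        int codePoint = 0xC720; // HANGUL SYLLABLE IEUNG YU
        byte[] byHand = encodeThreeByte(codePoint);
        byte[] byLibrary = new String(Character.toChars(codePoint))
                .getBytes(StandardCharsets.UTF_8);
        System.out.println("by hand:    " + hex(byHand));    // ec 9c a0
        System.out.println("by library: " + hex(byLibrary)); // ec 9c a0
        System.out.println("match? " + Arrays.equals(byHand, byLibrary));
    }
}
```

So the two bytes c7 20 are just the code point written in hex, while ec 9c a0 is that same code point after UTF-8 encoding.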
Use an online Unicode code converter to check this out. Enter the string c720 d76c d6c8 in the "mixed input" box and click "convert numbers hex code points".
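If you would rather check this locally than in a browser, something along these lines should do roughly the same job as the converter: it parses the hex code points, builds a String from them, and prints the UTF-8 bytes. The class name CodePointCheck and both helper methods are invented for this sketch.

```java
import java.nio.charset.StandardCharsets;

public class CodePointCheck {
    // Hypothetical helper: turns whitespace-separated hex code points,
    // e.g. "c720 d76c d6c8", into the String they represent.
    static String fromHexCodePoints(String input) {
        StringBuilder sb = new StringBuilder();
        for (String token : input.trim().split("\\s+")) {
            sb.appendCodePoint(Integer.parseInt(token, 16));
        }
        return sb.toString();
    }

    // Hypothetical helper: UTF-8 encodes a String and formats the bytes as hex.
    static String utf8Hex(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String s = fromHexCodePoints("c720 d76c d6c8");
        // prints: ec 9c a0 ed 9d ac ed 9b 88
        System.out.println(utf8Hex(s));
    }
}
```

The printed bytes are the same nine bytes as found_utf8 in the question, so the conversion from x-windows-949 to UTF-8 was already working; only the expected array was wrong.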