Saving a well-formed XML file as UTF-8 using VB6
In my last project, I had to read text from WinWord .doc files and then save it as a well-formed XML file for a Java application to read. Because these .doc files come from business departments around the world and contain English, French, German, Chinese, and so on, we had to save the XML file as UTF-8.
I will not explain how I read the WinWord files here. I just want to talk about what happened when I tried to save the XML with UTF-8 encoding, which confused me for a long time.
First, as a simple step, we save the XML using UTF-8 encoding. The code below shows how this works.
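Private Sub ToUtf8(ByVal s As String, ByVal FilePath As String)
    ' Needs a project reference to Microsoft ActiveX Data Objects,
    ' which provides ADODB.Stream and adSaveCreateOverWrite.
    Dim stmStr As ADODB.Stream
    Set stmStr = CreateObject("ADODB.Stream")
    stmStr.Open
    ' Encode the text as UTF-8 when it is written to the stream.
    stmStr.Charset = "utf-8"
    stmStr.WriteText s
    ' Overwrite the target file if it already exists.
    stmStr.SaveToFile FilePath, adSaveCreateOverWrite
    stmStr.Close
    Set stmStr = Nothing
End Sub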
Now we can save the XML with UTF-8 encoding. When we open this XML file with IE, it looks good. Unfortunately, the Java parser throws an exception when we try to parse it. The parser told me that the XML file is not well-formed.
What happened?
Let's see what happens when we call ToUtf8.
When we call ToUtf8 to write the encoded string to the XML file, it puts three bytes in front of the string. Their hex codes are EF, BB and BF, which is the UTF-8 byte order mark (BOM). The Java parser we used did not recognize these bytes and treated them as content before the start of the document. That is why the bug appears.
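To confirm that a file really starts with the bytes EF, BB and BF, a small check like the one below can help. This is only a sketch: HasUtf8Bom is a hypothetical helper name, not part of the original project.

```vb
' Hypothetical helper: returns True if the file starts with EF BB BF.
Private Function HasUtf8Bom(ByVal file_name As String) As Boolean
    Dim fileNum As Integer
    Dim head(1 To 3) As Byte

    fileNum = FreeFile
    Open file_name For Binary Access Read As #fileNum
    ' Only read the header if the file is long enough to contain it.
    If LOF(fileNum) >= 3 Then
        Get #fileNum, , head
        HasUtf8Bom = (head(1) = &HEF And head(2) = &HBB And head(3) = &HBF)
    End If
    Close #fileNum
End Function
```

For a file written by ToUtf8 as described above, this function would be expected to return True.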
How can we solve it?
My solution is a crude one: cut off those three bytes.
Private Sub cut_utf8(ByVal file_name As String)
    Dim fileNum As Integer
    Dim fileLen As Long
    Dim newLen As Long
    Dim i As Long              ' Long, so files larger than 32767 bytes do not overflow
    Dim LoadBytes() As Byte
    Dim OutBytes() As Byte

    ' Read the whole file into a byte array.
    fileNum = FreeFile
    Open file_name For Binary As #fileNum
    fileLen = LOF(fileNum)
    ' A file of three bytes or fewer has nothing to keep after the
    ' three leading bytes, so it is left untouched.
    If fileLen <= 3 Then
        Close #fileNum
        Exit Sub
    End If
    ReDim LoadBytes(1 To fileLen) As Byte
    Get #fileNum, , LoadBytes
    Close #fileNum

    ' Continue only if the file starts with EF BB BF.
    If LoadBytes(1) <> &HEF Or LoadBytes(2) <> &HBB Or LoadBytes(3) <> &HBF Then Exit Sub

    ' Copy everything after the first three bytes.
    newLen = fileLen - 3
    ReDim OutBytes(1 To newLen) As Byte
    For i = 1 To newLen
        OutBytes(i) = LoadBytes(i + 3)
    Next i

    ' Delete the old file first. Opening For Binary does not truncate,
    ' so the last three old bytes would otherwise remain at the end.
    Kill file_name
    fileNum = FreeFile
    Open file_name For Binary As #fileNum
    Put #fileNum, , OutBytes
    Close #fileNum
End Sub
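As an alternative to rewriting the file afterwards, the three leading bytes can be skipped before the file is saved at all. The sketch below is not the approach used in the project above. It relies on the commonly used trick of reopening the text stream as binary and copying from position 3 into a second stream. ToUtf8NoBom is a hypothetical name, the routine uses late binding so the ADO constants are declared as literals, and it assumes that s is not empty.

```vb
' Hypothetical variant of ToUtf8 that skips the three leading bytes
' (EF BB BF) before saving. Assumes s is not empty.
Private Sub ToUtf8NoBom(ByVal s As String, ByVal FilePath As String)
    ' ADO constant values, declared locally because of late binding.
    Const adTypeBinary As Long = 1
    Const adTypeText As Long = 2
    Const adSaveCreateOverWrite As Long = 2

    Dim stmText As Object
    Dim stmBin As Object

    ' Write the string into a text stream encoded as UTF-8.
    Set stmText = CreateObject("ADODB.Stream")
    stmText.Type = adTypeText
    stmText.Charset = "utf-8"
    stmText.Open
    stmText.WriteText s

    ' Treat the same stream as raw bytes and move past the first three.
    ' The Type property can only be changed while Position is 0.
    stmText.Position = 0
    stmText.Type = adTypeBinary
    stmText.Position = 3

    ' Copy the remaining bytes into a binary stream and save that one.
    Set stmBin = CreateObject("ADODB.Stream")
    stmBin.Type = adTypeBinary
    stmBin.Open
    stmText.CopyTo stmBin
    stmBin.SaveToFile FilePath, adSaveCreateOverWrite

    stmBin.Close
    stmText.Close
    Set stmBin = Nothing
    Set stmText = Nothing
End Sub
```

With this variant the file is written only once, so there is no need to call cut_utf8 afterwards.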