VBA - 将字符串转换为UNICODE

Tre*_*era 3 unicode vba cyrillic

我需要将字符串HTML从西里尔和拉丁符号的混合转换为UNICODE.

我尝试了以下方法:

Public HTML As String
    Sub HTMLsearch()

    GetHTML ("http://nfs.mobile.bg/pcgi/mobile.cgi?act=3&slink=6jkjov&f1=1")
    MsgBox HTML
    HTML = StrConv(HTML, vbUnicode)
    MsgBox HTML
End Sub

Function GetHTML(URL As String) As String
    With CreateObject("MSXML2.XMLHTTP")
        .Open "GET", URL, False
        .Send
        HTML = .ResponseText
    End With
End Function
Run Code Online (Sandbox Code Playgroud)

您可以看到StrConv之前和之后的内容.如果您想在文件中获取html,可以使用以下代码:

Public HTML As String
    Sub HTMLsearch()

    GetHTML ("http://nfs.mobile.bg/pcgi/mobile.cgi?act=3&slink=6jkjov&f1=1")

    Dim path As String

    path = ThisWorkbook.path & "\html.txt"
    Open path For Output As #1
    Print #1, HTML
    Close #1

    HTML = StrConv(HTML, vbUnicode)

    path = ThisWorkbook.path & "\htmlUNICODE.txt"
    Open path For Output As #1
    Print #1, HTML
    Close #1
End Sub

Function GetHTML(URL As String) As String
    With CreateObject("MSXML2.XMLHTTP")
        .Open "GET", URL, False
        .Send
        HTML = .ResponseText
    End With
End Function
Run Code Online (Sandbox Code Playgroud)

想法?

Tom*_*lak 10

VBA对Unicode的支持并不是那么好.

可以处理Unicode字符串,但是您将无法看到实际字符Debug.PrintMsgBox- 它们将在?那里显示.

您可以将控制面板>区域和语言>管理选项卡>"非Unicode程序的当前语言"设置为"俄语"切换到不同的代码页,这将允许您在VBA消息框中查看西里尔字母而不是问号.但这只是一种美化改变.


你真正的问题在于此处.

服务器(nfs.mobile.bg)将文档发送为Content-Type: text/html.没有关于字符编码的信息.这意味着接收器必须自己找出字符编码.

浏览器通过查看响应字节流并进行猜测来完成此操作.在您的情况下,<meta http-equiv="Content-Type" content="text/html; charset=windows-1251">HMTL源中存在有用的标记.因此,字节流应该被解释为Windows-1251,恰好是Windows中的西里尔语ANSI代码页.

所以,我们这里甚至没有Unicode!

在没有任何其他信息的情况下,对象的responseText属性XMLHTTP默认为us-ascii.西里尔字母表中的扩展字符不存在于ASCII中,因此它们将转换为实际的问号并丢失.这就是为什么你不能responseText用于任何事情.

但是,响应的原始字节仍然可用,在responseBody属性中,这是一个数组Byte.

在VBA中,您必须执行与浏览器相同的操作.您必须将字节流解释为特定字符集.该ADODB.Stream对象可以为您做到这一点,它也非常简单:

' reference: "Microsoft XML, v6.0" (or any other version)
' reference: "Microsoft ActiveX Data Objects 6.1 library" (or any other version)
Option Explicit

Sub HTMLsearch()
    Dim url As String, html As String

    url = "http://nfs.mobile.bg/pcgi/mobile.cgi?act=3&slink=6jkjov&f1=1"
    html = GetHTML(url, "Windows-1251")

    ' Cyrillic characters are supported in Office, so they will appear correctly
    ActiveDocument.Range.InsertAfter html
End Sub

Function GetHTML(Url As String, Optional Charset As String = "UTF-8") As String
    Dim request As New MSXML2.XMLHTTP
    Dim converter As New ADODB.stream

    ' fetch page
    request.Open "GET", Url, False
    request.send

    ' write raw bytes to the stream
    converter.Open
    converter.Type = adTypeBinary
    converter.Write request.responseBody

    ' switch the stream to text mode and set charset
    converter.Position = 0
    converter.Type = adTypeText
    converter.Charset = Charset

    ' read text characters from the stream, close the stream
    GetHTML = converter.ReadText
    converter.Close
End Function
Run Code Online (Sandbox Code Playgroud)

我一直在这里使用MS Word并HTMLsearch()正确调用将西里尔字符写入页面.尽管如此,它们仍然会出现?MsgBox我身上,但现在这纯粹是一个显示问题,这是因为VBA创建的UI无法处理Unicode.

  • @ Tomalak:我不相信我见过更好的答案和解释。非常感谢您的支持! (2认同)

ham*_*ish 5

我的生产订单数据来自许多国家。这是我能找到的唯一真正有效的 vba 函数。

Private Const CP_UTF8 = 65001

Private Declare Function MultiByteToWideChar Lib "kernel32" ( _
   ByVal CodePage As Long, ByVal dwFlags As Long, _
   ByVal lpMultiByteStr As Long, ByVal cchMultiByte As Long, _
   ByVal lpWideCharStr As Long, ByVal cchWideChar As Long) As Long


Public Function sUTF8ToUni(bySrc() As Byte) As String
   ' Converts a UTF-8 byte array to a Unicode string
   Dim lBytes As Long, lNC As Long, lRet As Long

   lBytes = UBound(bySrc) - LBound(bySrc) + 1
   lNC = lBytes
   sUTF8ToUni = String$(lNC, Chr(0))
   lRet = MultiByteToWideChar(CP_UTF8, 0, VarPtr(bySrc(LBound(bySrc))), lBytes, StrPtr(sUTF8ToUni), lNC)
   sUTF8ToUni = Left$(sUTF8ToUni, lRet)
End Function
Run Code Online (Sandbox Code Playgroud)

示例用法:

Dim sHTML As String
Dim bHTML() As Byte
bHTML = GetHTML("http://yoururlhere/myorderdata.php")
sHTML = sUTF8ToUni(bHTML)
sHTML = Mid(sHTML, 2)  'strip off Byte Order Mark: EF BB BF
Run Code Online (Sandbox Code Playgroud)