AfxGetFileEncoding

Paul Squires · April 09, 2020, 01:48:03 PM

Hi Jose,

I have attached, for your consideration, a new Afx function that attempts to detect the encoding of a file. The attachment is a project with sample files that shows the function in action. I have sometimes needed to know the actual encoding of a text file, not just whether it is Unicode (AfxIsFileUnicode).

Here is the function:

'//
'//   From the unicode.org FAQ:
'//
'//   00 00 FE FF      UTF-32, big-endian 
'//   FF FE 00 00      UTF-32, little-endian 
'//   FE FF            UTF-16, big-endian 
'//   FF FE            UTF-16, little-endian 
'//   EF BB BF         UTF-8 
'//
'//   Match the first x bytes of the file against the
'//   Byte-Order-Mark (BOM) lookup table
'//
private function AfxGetFileEncoding( byref wszFilename as wstring ) as Integer

   type _BOM_LOOKUP
      bom   as DWORD  
      nlen  as ulong
      ntype as Integer
   end type

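   '// define longest headers first (FF FE is a prefix of FF FE 00 00).
   '// Each DWORD is written so that, on a little-endian CPU, its bytes
   '// in memory appear in the same order as the BOM bytes in the file.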
   static BOMLOOK(...) as _BOM_LOOKUP = _
   {( &H0000FEFF, 4, NCP_UTF32    ), _
    ( &HFFFE0000, 4, NCP_UTF32BE  ), _
    ( &HBFBBEF,   3, NCP_UTF8     ), _
    ( &HFFFE,     2, NCP_UTF16BE  ), _
    ( &HFEFF,     2, NCP_UTF16    ), _
    ( 0,          0, NCP_ASCII    ) _
   }
   
   DIM as DWORD dwBytesRead 
   DIM as HANDLE hFile 
   
   dim as BYTE header(4)

   hFile = CreateFile( @wszFileName, GENERIC_READ, FILE_SHARE_READ, NULL, _
                       OPEN_EXISTING, FILE_FLAG_SEQUENTIAL_SCAN, NULL)
   
   dim as Integer nEncoding = NCP_ASCII   '// default to ASCII

   IF hFile <> INVALID_HANDLE_VALUE THEN 
      if ReadFile( hFile, @header(0), 4, @dwBytesRead, NULL ) <> 0 then
         for i as long = lbound(BOMLOOK) to ubound(BOMLOOK)
            if dwBytesRead >= BOMLOOK(i).nLen then
               if memcmp( @header(0), @BOMLOOK(i).bom, BOMLOOK(i).nlen ) = 0 then
                  nEncoding = BOMLOOK(i).ntype
                  exit for
               end if
            end if
         next
      end if
      '// always close the handle, including when a BOM was matched
      CloseHandle(hFile)
   end if

   return nEncoding
end function
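The constants in the lookup table look byte-reversed compared with the unicode.org list because memcmp compares the DWORD as it sits in memory. A minimal sketch of that layout, assuming a little-endian CPU (x86/x64); MemBytes is a hypothetical helper used only for illustration and is not part of Afx:

```freebasic
' Hypothetical helper: return the first n bytes of a DWORD, in memory order,
' as a hex string. Assumes a little-endian CPU (x86/x64).
function MemBytes( byval value as ulong, byval n as long ) as string
   dim as ubyte ptr p = cast( ubyte ptr, @value )
   dim as string s
   for i as long = 0 to n - 1
      if i > 0 then s &= " "
      s &= hex( p[i], 2 )
   next
   return s
end function

print MemBytes( &H0000FEFF, 4 )   ' FF FE 00 00   UTF-32, little-endian
print MemBytes( &HFFFE0000, 4 )   ' 00 00 FE FF   UTF-32, big-endian
print MemBytes( &HBFBBEF,   3 )   ' EF BB BF      UTF-8
print MemBytes( &HFFFE,     2 )   ' FE FF         UTF-16, big-endian
print MemBytes( &HFEFF,     2 )   ' FF FE         UTF-16, little-endian
```

Each printed sequence matches the file-order BOM bytes from the table in the header comment, which is why the comparison against the first bytes of the file works.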

Here is the example code (sample text files are also in the attachment):

#define unicode

#include once "Afx\AfxWin.inc"


'//
'// currently supported codepages
'//
#define NCP_ASCII     0
#define NCP_UTF8      1
#define NCP_UTF16     2
#define NCP_UTF16BE   3
#define NCP_UTF32     4
#define NCP_UTF32BE   5

#include once "AfxGetFileEncoding.inc"



' ========================================================================================
' MAIN PROGRAM ENTRY POINT
' ========================================================================================

' Test all of the sample files in the "samples" subfolder

DIM as HANDLE hSearch 
dim AS WIN32_FIND_DATA WFD 
 
dim as CWSTR wszFilename, wszFileType, wszPath
dim as Boolean IsUnicode

wszPath = AfxGetExePathName + "samples\"

hSearch = FindFirstFile( wszPath + "*.txt", @WFD )
IF hSearch <> INVALID_HANDLE_VALUE THEN
   DO
      IF (WFD.dwFileAttributes AND FILE_ATTRIBUTE_DIRECTORY) <> FILE_ATTRIBUTE_DIRECTORY THEN
         wszFilename = wszPath & WFD.cFileName

         select case AfxGetFileEncoding( wszFilename )
            case NCP_UTF8
               wszFileType = "NCP_UTF8":    IsUnicode = true
            case NCP_UTF16
               wszFileType = "NCP_UTF16":   IsUnicode = true
            case NCP_UTF16BE
               wszFileType = "NCP_UTF16BE": IsUnicode = true
            case NCP_UTF32
               wszFileType = "NCP_UTF32":   IsUnicode = true
            case NCP_UTF32BE
               wszFileType = "NCP_UTF32BE": IsUnicode = true
            case NCP_ASCII
               wszFileType = "NCP_ASCII"
               ' If no BOM exists, the file may still contain Unicode
               ' characters. We can test for that using AfxIsFileUnicode.
               ' We would only do this test when we need greater certainty
               ' that the file contains Unicode text. This is a more
               ' expensive test because the whole file has to be read into
               ' memory in order to be analyzed.
               ' Assign the result directly so that a True left over from
               ' a previous file is not carried into this one.
               IsUnicode = AfxIsFileUnicode( wszFilename )
                  
         end select

         ? "Encoding: "; wszFileType, "IsUnicode: "; IsUnicode, "Filename: "; AfxStrPathName( "NAME", wszFilename)

      END IF
   LOOP WHILE FindNextFile(hSearch, @WFD)
   FindClose(hSearch)
END IF
   
   
sleep

José Roca · April 09, 2020, 06:51:55 PM

I'm having timeout problems again.

If you need that function, I will include it.

However, you're wrong in the assumption that AfxIsFileUnicode has to analyze the whole file. It just reads and analyzes the first 1024 bytes.
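To illustrate what "only the first 1024 bytes" means in practice, here is a rough sketch of a bounded check. It is not the Afx source: it assumes the analysis is delegated to the Windows API IsTextUnicode (a heuristic aimed at UTF-16 text), and LooksLikeUtf16 is a hypothetical name chosen for this example.

```freebasic
#include once "windows.bi"

' Hypothetical helper, NOT the AfxIsFileUnicode implementation: read at most
' the first 1024 bytes of the file and let IsTextUnicode judge that sample.
function LooksLikeUtf16( byref wszFilename as wstring ) as boolean
   dim as ubyte buffer(0 to 1023)
   dim as DWORD dwBytesRead
   dim as boolean result = false

   dim as HANDLE hFile = CreateFileW( @wszFilename, GENERIC_READ, FILE_SHARE_READ, NULL, _
                                      OPEN_EXISTING, FILE_FLAG_SEQUENTIAL_SCAN, NULL )
   if hFile = INVALID_HANDLE_VALUE then return false

   ' A single bounded read: the cost does not grow with the file size
   if ReadFile( hFile, @buffer(0), 1024, @dwBytesRead, NULL ) <> 0 then
      if dwBytesRead > 0 then
         ' Passing NULL as the third argument asks for all available tests
         result = ( IsTextUnicode( @buffer(0), dwBytesRead, NULL ) <> 0 )
      end if
   end if

   CloseHandle( hFile )
   return result
end function
```

Because the read is capped at 1024 bytes, a check of this shape costs about the same for a small file as for a very large one.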

Paul Squires · April 09, 2020, 09:09:17 PM

Ah yes, right you are. I should have looked at the code more closely... only the first 1K is read, not the whole file.

Sorry that you are having timeout problems. I wish I knew what causes that.

Paul Squires · April 09, 2020, 09:54:26 PM

I went into the forum admin settings and changed the following server setting:

Seconds before an unused session timeout

I changed the value to 14400 because I saw that mentioned somewhere on the web. Maybe this will make a difference for you with that session timeout problem.

José Roca · April 10, 2020, 06:58:33 AM

It is a strange problem. Yesterday I was only able to connect once. Today it is working fine. And it only happens with this site.

Paul Squires · April 11, 2020, 10:35:14 AM

Thanks Jose, please let me know if accessing the forum continues to cause you problems. I'll try to search for more solutions if it does.

PlanetSquires Forums
