AfxGetFileEncoding

Started by Paul Squires, April 09, 2020, 01:18:03 PM

Paul Squires

Hi Jose,

I have attached for your consideration a new Afx function that attempts to detect the encoding of a file from its byte-order mark (BOM). The attachment is a project with sample files that shows the function in action. I have occasionally needed to know the specific encoding of a text file, not just whether it is unicode (AfxIsFileUnicode).

Here is the function:


'//
'//   From the unicode.org FAQ:
'//
'//   00 00 FE FF      UTF-32, big-endian
'//   FF FE 00 00      UTF-32, little-endian
'//   FE FF            UTF-16, big-endian
'//   FF FE            UTF-16, little-endian
'//   EF BB BF         UTF-8
'//
'//   Match the first x bytes of the file against the
'//   Byte-Order-Mark (BOM) lookup table
'//
private function AfxGetFileEncoding( byref wszFilename as wstring ) as Integer

   type _BOM_LOOKUP
      bom   as DWORD 
      nlen  as ulong
      ntype as Integer
   end type

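   '// Define longest headers first. Each BOM is stored as a DWORD and compared
   '// byte by byte with memcmp, so on little-endian x86/x64 the constants read
   '// as the reverse of the byte sequences listed above.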
   static BOMLOOK(...) as _BOM_LOOKUP = _
   {( &H0000FEFF, 4, NCP_UTF32    ), _
    ( &HFFFE0000, 4, NCP_UTF32BE  ), _
    ( &HBFBBEF,   3, NCP_UTF8     ), _
    ( &HFFFE,     2, NCP_UTF16BE  ), _
    ( &HFEFF,     2, NCP_UTF16    ), _
    ( 0,          0, NCP_ASCII    ) _
   }
   
   DIM as DWORD dwBytesRead
   DIM as HANDLE hFile
   
   dim as BYTE header(4)

   hFile = CreateFile( @wszFileName, GENERIC_READ, FILE_SHARE_READ, NULL, _
                       OPEN_EXISTING, FILE_FLAG_SEQUENTIAL_SCAN, NULL)
   
   dim as Integer nEncoding = NCP_ASCII   '// default to ASCII

   IF hFile <> INVALID_HANDLE_VALUE THEN
      if ReadFile( hFile, @header(0), 4, @dwBytesRead, NULL ) <> 0 then
         for i as long = lbound(BOMLOOK) to ubound(BOMLOOK)
            if dwBytesRead >= BOMLOOK(i).nLen then
               if memcmp( @header(0), @BOMLOOK(i).bom, BOMLOOK(i).nlen ) = 0 then
                  '// Record the match and leave the loop; returning here
                  '// would skip CloseHandle and leak the file handle.
                  nEncoding = BOMLOOK(i).ntype
                  exit for
               end if
            end if
         next
      end if
      CloseHandle(hFile)
   end if

   return nEncoding
end function



Here is the example code (sample text files are also in the attachment):


#define unicode

#include once "Afx\AfxWin.inc"


'//
'// currently supported codepages
'//
#define NCP_ASCII     0
#define NCP_UTF8      1
#define NCP_UTF16     2
#define NCP_UTF16BE   3
#define NCP_UTF32     4
#define NCP_UTF32BE   5

#include once "AfxGetFileEncoding.inc"



' ========================================================================================
' MAIN PROGRAM ENTRY POINT
' ========================================================================================

' Test all of the sample files in the "samples" subfolder

DIM as HANDLE hSearch
dim AS WIN32_FIND_DATA WFD

dim as CWSTR wszFilename, wszFileType, wszPath
dim as Boolean IsUnicode

wszPath = AfxGetExePathName + "samples\"

hSearch = FindFirstFile( wszPath + "*.txt", @WFD )
IF hSearch <> INVALID_HANDLE_VALUE THEN
   DO
      IF (WFD.dwFileAttributes AND FILE_ATTRIBUTE_DIRECTORY) <> FILE_ATTRIBUTE_DIRECTORY THEN
         wszFilename = wszPath & WFD.cFileName
         IsUnicode = false   ' reset for each file so an earlier result does not carry over

         select case AfxGetFileEncoding( wszFilename )
            case NCP_UTF8
               wszFileType = "NCP_UTF8":    IsUnicode = true
            case NCP_UTF16
               wszFileType = "NCP_UTF16":   IsUnicode = true
            case NCP_UTF16BE
               wszFileType = "NCP_UTF16BE": IsUnicode = true
            case NCP_UTF32
               wszFileType = "NCP_UTF32":   IsUnicode = true
            case NCP_UTF32BE
               wszFileType = "NCP_UTF32BE": IsUnicode = true
            case NCP_ASCII
               wszFileType = "NCP_ASCII"
               ' If no BOM exists then it is possible that the file still contains
               ' unicode characters. We can test for that using AfxIsFileUnicode.
               ' We would only run this test when we need greater certainty
               ' that the file contains unicode text. This is
               ' a more expensive test because the whole file has to be read into
               ' memory in order to be analyzed.
               if AfxIsFileUnicode( wszFilename ) then IsUnicode = true
                 
         end select

         ? "Encoding: "; wszFileType, "IsUnicode: "; IsUnicode, "Filename: "; AfxStrPathName( "NAME", wszFilename)

      END IF
   LOOP WHILE FindNextFile(hSearch, @WFD)
   FindClose(hSearch)
END IF
   
   
sleep





 
Paul Squires
PlanetSquires Software
WinFBE Editor and Visual Designer

José Roca

I'm having timeout problems again.

If you need that function, I will include it.

However, you're wrong in the assumption that AfxIsFileUnicode has to analyze the whole file. It just reads and analyzes the first 1024 bytes.
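
To illustrate what a test limited to the first 1024 bytes could look like, here is a minimal sketch. It is not the Afx source: the name SketchIsFileUnicode is hypothetical, and the use of the Win32 IsTextUnicode heuristic is an assumption made for the example. Note that IsTextUnicode targets UTF-16 text, so a BOM-less UTF-8 file would not be recognised by this sketch.

```freebasic
#include once "windows.bi"

' Hypothetical helper (not part of Afx): heuristic check on the first 1 KB only.
private function SketchIsFileUnicode( byref wszFilename as wstring ) as boolean
   dim as ubyte buffer(0 to 1023)
   dim as DWORD dwBytesRead
   dim as boolean bResult = false

   dim as HANDLE hFile = CreateFileW( @wszFilename, GENERIC_READ, FILE_SHARE_READ, NULL, _
                                      OPEN_EXISTING, FILE_FLAG_SEQUENTIAL_SCAN, NULL )
   if hFile = INVALID_HANDLE_VALUE then return false

   ' Read at most 1024 bytes, however large the file is.
   if ReadFile( hFile, @buffer(0), 1024, @dwBytesRead, NULL ) <> 0 then
      if dwBytesRead > 0 then
         ' A NULL third argument asks IsTextUnicode to apply all of its tests.
         bResult = ( IsTextUnicode( @buffer(0), dwBytesRead, NULL ) <> 0 )
      end if
   end if

   CloseHandle( hFile )
   return bResult
end function
```

Because only a fixed-size buffer is read, the cost of such a check stays constant regardless of file size.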

Paul Squires

Ah yes, right you are. I should have looked at the code more closely... only the first 1K is read, not the whole file, even when the file is larger than 1K.

Sorry that you are having timeout problems. I wish I knew what causes that.

Paul Squires

I went into the forum admin settings and changed the following server setting:

Seconds before an unused session timeout

I changed the value to 14400 because I saw that mentioned somewhere on the web. Maybe this will make a difference for you with that session timeout problem.

José Roca

It is a strange problem. Yesterday I was only able to connect once. Today it is working fine. And it only happens with this site.

Paul Squires

Thanks Jose, please let me know if accessing the forum continues to cause you problems. I'll try to search for more solutions if it does.