PlanetSquires Forums

Support Forums => WinFBX - Windows Framework for FreeBASIC => Topic started by: Paul Squires on April 09, 2020, 01:18:03 PM

Title: AfxGetFileEncoding
Post by: Paul Squires on April 09, 2020, 01:18:03 PM
Hi Jose,

I have attached for your consideration a new Afx function that attempts to detect the encoding of a file. The attachment is a project with sample files that shows the function in action. I have had occasion to need to know the actual encoding of a text file, not just whether it is unicode (AfxIsFileUnicode).

Here is the function:
Code:

'//
'//   From the unicode.org FAQ:
'//
'//   00 00 FE FF      UTF-32, big-endian
'//   FF FE 00 00      UTF-32, little-endian
'//   FE FF            UTF-16, big-endian
'//   FF FE            UTF-16, little-endian
'//   EF BB BF         UTF-8
'//
'//   Match the first x bytes of the file against the
'//   Byte-Order-Mark (BOM) lookup table
'//
private function AfxGetFileEncoding( byref wszFilename as wstring ) as Integer

   type _BOM_LOOKUP
      bom   as DWORD 
      nlen  as ulong
      ntype as Integer
   end type

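   '// define longest headers first so that FF FE 00 00 (UTF-32 LE)
   '// is not mistaken for FF FE (UTF-16 LE). Each DWORD constant is the
   '// little-endian memory image of the BOM bytes as they appear in the file.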
   static BOMLOOK(...) as _BOM_LOOKUP = _
   {( &H0000FEFF, 4, NCP_UTF32    ), _
    ( &HFFFE0000, 4, NCP_UTF32BE  ), _
    ( &HBFBBEF,   3, NCP_UTF8     ), _
    ( &HFFFE,     2, NCP_UTF16BE  ), _
    ( &HFEFF,     2, NCP_UTF16    ), _
    ( 0,          0, NCP_ASCII    ) _
   }
   
   DIM as DWORD dwBytesRead
   DIM as HANDLE hFile
   
   dim as BYTE header(4)

   hFile = CreateFile( @wszFileName, GENERIC_READ, FILE_SHARE_READ, NULL, _
                       OPEN_EXISTING, FILE_FLAG_SEQUENTIAL_SCAN, NULL)
   
   dim as Integer nEncoding = NCP_ASCII   '// default to ASCII when no BOM matches

   IF hFile <> INVALID_HANDLE_VALUE THEN
      if ReadFile( hFile, @header(0), 4, @dwBytesRead, NULL ) <> 0 then
         for i as long = lbound(BOMLOOK) to ubound(BOMLOOK)
            if dwBytesRead >= BOMLOOK(i).nLen then
               if memcmp( @header(0), @BOMLOOK(i).bom, BOMLOOK(i).nlen ) = 0 then
                  nEncoding = BOMLOOK(i).ntype
                  exit for   '// leave the loop instead of returning so the handle gets closed
               end if
            end if
         next
      end if
      CloseHandle(hFile)
   end if

   return nEncoding
end function
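
The "longest headers first" rule can be illustrated without any file I/O. The following is only a minimal sketch, not part of WinFBX: the ENC_* names and the DetectBom function are hypothetical and exist purely for illustration. It checks an in-memory buffer byte by byte, in file order, so it does not depend on how a DWORD is laid out in memory.

```freebasic
' Hypothetical stand-alone sketch (not the WinFBX implementation).
' ENC_* and DetectBom are illustrative names only.
enum
   ENC_ASCII = 0
   ENC_UTF8
   ENC_UTF16LE
   ENC_UTF16BE
   ENC_UTF32LE
   ENC_UTF32BE
end enum

' p points to the first bytes of the data, n is how many bytes are valid.
function DetectBom( byval p as ubyte ptr, byval n as long ) as long
   ' 4-byte signatures are tested first: FF FE 00 00 (UTF-32 LE)
   ' starts with FF FE, which is also the UTF-16 LE signature.
   if n >= 4 then
      if p[0] = &H00 andalso p[1] = &H00 andalso p[2] = &HFE andalso p[3] = &HFF then return ENC_UTF32BE
      if p[0] = &HFF andalso p[1] = &HFE andalso p[2] = &H00 andalso p[3] = &H00 then return ENC_UTF32LE
   end if
   if n >= 3 then
      if p[0] = &HEF andalso p[1] = &HBB andalso p[2] = &HBF then return ENC_UTF8
   end if
   if n >= 2 then
      if p[0] = &HFE andalso p[1] = &HFF then return ENC_UTF16BE
      if p[0] = &HFF andalso p[1] = &HFE then return ENC_UTF16LE
   end if
   return ENC_ASCII   ' no BOM found
end function

' Two buffers that share the same first two bytes
dim as ubyte utf32le(0 to 3) = { &HFF, &HFE, &H00, &H00 }
dim as ubyte utf16le(0 to 3) = { &HFF, &HFE, &H41, &H00 }   ' BOM followed by "A"

print DetectBom( @utf32le(0), 4 ) = ENC_UTF32LE   ' true: the 4-byte test wins
print DetectBom( @utf16le(0), 4 ) = ENC_UTF16LE   ' true: falls through to the 2-byte test
```

If the 2-byte tests came first, the first buffer would be reported as UTF-16 LE. Note that a UTF-16 LE file whose first character after the BOM is U+0000 produces the same four bytes as the UTF-32 LE signature, so that rare case stays ambiguous with any BOM-only check.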


Here is the example code (sample text files are also in the attachment):

Code:
#define unicode

#include once "Afx\AfxWin.inc"


'//
'// currently supported codepages
'//
#define NCP_ASCII     0
#define NCP_UTF8      1
#define NCP_UTF16     2
#define NCP_UTF16BE   3
#define NCP_UTF32     4
#define NCP_UTF32BE   5

#include once "AfxGetFileEncoding.inc"



' ========================================================================================
' MAIN PROGRAM ENTRY POINT
' ========================================================================================

' Test all of the sample files in the "samples" subfolder

DIM as HANDLE hSearch
dim AS WIN32_FIND_DATA WFD
 
dim as CWSTR wszFilename, wszFileType, wszPath
dim as Boolean IsUnicode

wszPath = AfxGetExePathName + "samples\"

hSearch = FindFirstFile( wszPath + "*.txt", @WFD )
IF hSearch <> INVALID_HANDLE_VALUE THEN
   DO
      IF (WFD.dwFileAttributes AND FILE_ATTRIBUTE_DIRECTORY) <> FILE_ATTRIBUTE_DIRECTORY THEN
         wszFilename = wszPath & WFD.cFileName
         IsUnicode = false   ' reset the flag so a previous file's result does not carry over

         select case AfxGetFileEncoding( wszFilename )
            case NCP_UTF8
               wszFileType = "NCP_UTF8":    IsUnicode = true
            case NCP_UTF16
               wszFileType = "NCP_UTF16":   IsUnicode = true
            case NCP_UTF16BE
               wszFileType = "NCP_UTF16BE": IsUnicode = true
            case NCP_UTF32
               wszFileType = "NCP_UTF32":   IsUnicode = true
            case NCP_UTF32BE
               wszFileType = "NCP_UTF32BE": IsUnicode = true
            case NCP_ASCII
               wszFileType = "NCP_ASCII"
               ' If no BOM exists then it is possible that the file still contains
               ' unicode characters. We can test for that using AfxIsFileUnicode.
               ' We would only do this test when we need greater certainty that
               ' the file contains unicode text. This is a more expensive test
               ' because the whole file has to be read into memory in order to
               ' be analyzed.
               if AfxIsFileUnicode( wszFilename ) then IsUnicode = true
                 
         end select

         ? "Encoding: "; wszFileType, "IsUnicode: "; IsUnicode, "Filename: "; AfxStrPathName( "NAME", wszFilename)

      END IF
   LOOP WHILE FindNextFile(hSearch, @WFD)
   FindClose(hSearch)
END IF
   
   
sleep




 
Title: Re: AfxGetFileEncoding
Post by: José Roca on April 09, 2020, 06:21:55 PM
I'm having timeout problems again.

If you need that function, I will include it.

However, you're wrong in the assumption that AfxIsFileUnicode has to analyze the whole file. It just reads and analyzes the first 1024 bytes.
Title: Re: AfxGetFileEncoding
Post by: Paul Squires on April 09, 2020, 08:39:17 PM
Ah yes, right you are. I should have looked at the code more closely... only the first 1K is read, not the whole file when it is larger than 1K.

Sorry that you are having timeout problems. I wish I knew what causes that.
Title: Re: AfxGetFileEncoding
Post by: Paul Squires on April 09, 2020, 09:24:26 PM
I went into the forum admin settings and changed the following server setting:

Seconds before an unused session timeout

I changed the value to 14400 because I saw that mentioned somewhere on the web. Maybe this will make a difference for you with that session timeout problem.
Title: Re: AfxGetFileEncoding
Post by: José Roca on April 10, 2020, 06:28:33 AM
It is a strange problem. Yesterday I was only able to connect once. Today it is working fine. And it only happens with this site.
Title: Re: AfxGetFileEncoding
Post by: Paul Squires on April 11, 2020, 10:05:14 AM
Thanks Jose, please let me know if accessing the forum continues to cause you problems. I'll try to search for more solutions if it does.