Hi Jose,
I have attached for your consideration a new Afx function that attempts to detect the encoding of a file from its Byte-Order-Mark. The attachment is a project with sample files that shows the function in action. I have sometimes needed to know the specific encoding of a text file, not just whether it is unicode (AfxIsFileUnicode).
Here is the function:
'//
'// From the unicode.org FAQ:
'//
'// 00 00 FE FF UTF-32, big-endian
'// FF FE 00 00 UTF-32, little-endian
'// FE FF UTF-16, big-endian
'// FF FE UTF-16, little-endian
'// EF BB BF UTF-8
'//
'// Match the first x bytes of the file against the
'// Byte-Order-Mark (BOM) lookup table
'//
private function AfxGetFileEncoding( byref wszFilename as wstring ) as Integer
   type _BOM_LOOKUP
      bom   as DWORD
      nlen  as ulong
      ntype as Integer
   end type
   '// Define longest headers first: the UTF-32 LE mark (FF FE 00 00) begins
   '// with the UTF-16 LE mark (FF FE). Each bom value is written as a
   '// little-endian DWORD so its bytes in memory match the bytes in the file.
   static BOMLOOK(...) as _BOM_LOOKUP = _
      {( &H0000FEFF, 4, NCP_UTF32   ), _
       ( &HFFFE0000, 4, NCP_UTF32BE ), _
       ( &HBFBBEF,   3, NCP_UTF8    ), _
       ( &HFFFE,     2, NCP_UTF16BE ), _
       ( &HFEFF,     2, NCP_UTF16   ), _
       ( 0,          0, NCP_ASCII   ) _
      }
   DIM as DWORD dwBytesRead
   DIM as HANDLE hFile
   dim as BYTE header(4)
   dim as Integer nEncoding = NCP_ASCII   '// default to ASCII
   hFile = CreateFile( @wszFileName, GENERIC_READ, FILE_SHARE_READ, NULL, _
                       OPEN_EXISTING, FILE_FLAG_SEQUENTIAL_SCAN, NULL)
   IF hFile <> INVALID_HANDLE_VALUE THEN
      if ReadFile( hFile, @header(0), 4, @dwBytesRead, NULL ) <> 0 then
         for i as long = lbound(BOMLOOK) to ubound(BOMLOOK)
            if dwBytesRead >= BOMLOOK(i).nLen then
               if memcmp( @header(0), @BOMLOOK(i).bom, BOMLOOK(i).nlen ) = 0 then
                  nEncoding = BOMLOOK(i).ntype
                  exit for
               end if
            end if
         next
      end if
      '// Close the handle on every path, including when a BOM was matched
      CloseHandle(hFile)
   end if
   return nEncoding
end function
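To illustrate why the table has to list the longest headers first, here is a minimal sketch that does the same matching on a buffer that is already in memory. DetectBomInBuffer is a hypothetical helper of mine, not part of Afx, and the NCP_* values are simply copied from the example project so that the snippet compiles on its own.

```freebasic
'// Standalone sketch: same BOM matching, but on an in-memory buffer.
'// DetectBomInBuffer is a hypothetical name, not an Afx function.
#define NCP_ASCII   0
#define NCP_UTF8    1
#define NCP_UTF16   2
#define NCP_UTF16BE 3
#define NCP_UTF32   4
#define NCP_UTF32BE 5

function DetectBomInBuffer( byval p as ubyte ptr, byval n as long ) as long
   '// 4-byte marks first: FF FE 00 00 (UTF-32 LE) starts with FF FE (UTF-16 LE)
   if n >= 4 then
      if p[0] = &HFF andalso p[1] = &HFE andalso p[2] = 0 andalso p[3] = 0 then return NCP_UTF32
      if p[0] = 0 andalso p[1] = 0 andalso p[2] = &HFE andalso p[3] = &HFF then return NCP_UTF32BE
   end if
   if n >= 3 then
      if p[0] = &HEF andalso p[1] = &HBB andalso p[2] = &HBF then return NCP_UTF8
   end if
   if n >= 2 then
      if p[0] = &HFF andalso p[1] = &HFE then return NCP_UTF16
      if p[0] = &HFE andalso p[1] = &HFF then return NCP_UTF16BE
   end if
   return NCP_ASCII   '// no BOM recognised
end function

dim as ubyte utf32le(0 to 3) = { &HFF, &HFE, &H00, &H00 }
print DetectBomInBuffer( @utf32le(0), 4 )   '// all four bytes seen: NCP_UTF32
print DetectBomInBuffer( @utf32le(0), 2 )   '// only two bytes seen: NCP_UTF16
```

The second call shows what would happen if the 2-byte marks were tested before the 4-byte ones: a UTF-32 little-endian file would be reported as UTF-16.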
Here is the example code (sample text files are also in the attachment):
#define unicode
#include once "Afx\AfxWin.inc"
'//
'// currently supported codepages
'//
#define NCP_ASCII 0
#define NCP_UTF8 1
#define NCP_UTF16 2
#define NCP_UTF16BE 3
#define NCP_UTF32 4
#define NCP_UTF32BE 5
#include once "AfxGetFileEncoding.inc"
' ========================================================================================
' MAIN PROGRAM ENTRY POINT
' ========================================================================================
' Test all of the sample files in the "samples" subfolder
DIM as HANDLE hSearch
dim AS WIN32_FIND_DATA WFD
dim as CWSTR wszFilename, wszFileType, wszPath
dim as Boolean IsUnicode
wszPath = AfxGetExePathName + "samples\"
hSearch = FindFirstFile( wszPath + "*.txt", @WFD )
IF hSearch <> INVALID_HANDLE_VALUE THEN
   DO
      IF (WFD.dwFileAttributes AND FILE_ATTRIBUTE_DIRECTORY) <> FILE_ATTRIBUTE_DIRECTORY THEN
         wszFilename = wszPath & WFD.cFileName
         IsUnicode = false   ' reset so the previous file's result does not carry over
         select case AfxGetFileEncoding( wszFilename )
            case NCP_UTF8
               wszFileType = "NCP_UTF8": IsUnicode = true
            case NCP_UTF16
               wszFileType = "NCP_UTF16": IsUnicode = true
            case NCP_UTF16BE
               wszFileType = "NCP_UTF16BE": IsUnicode = true
            case NCP_UTF32
               wszFileType = "NCP_UTF32": IsUnicode = true
            case NCP_UTF32BE
               wszFileType = "NCP_UTF32BE": IsUnicode = true
            case NCP_ASCII
               wszFileType = "NCP_ASCII"
               ' No BOM was found, but the file may still contain unicode
               ' characters. We can test for that using AfxIsFileUnicode.
               ' Only run this test when greater certainty is needed that
               ' the file contains unicode text. It is a more expensive
               ' test because the whole file has to be read into memory
               ' in order to be analyzed.
               if AfxIsFileUnicode( wszFilename ) then IsUnicode = true
         end select
         ? "Encoding: "; wszFileType, "IsUnicode: "; IsUnicode, "Filename: "; AfxStrPathName( "NAME", wszFilename)
      END IF
   LOOP WHILE FindNextFile(hSearch, @WFD)
   FindClose(hSearch)
END IF
sleep