Folks,
I have a client who needs me to process a large data file of records (125,000) of varying length averaging 3700 bytes. The problem is that the file was encoded with <LF> line endings instead of <CR><LF> line endings. The file is rather large (453 MB) and so I don't want to read it all into memory to fix the line endings. I have sent the file back to the client to get it "fixed" but I was wondering if anyone else has encountered this situation and if there is an easy method to parse these types of lines. Thanks.
Dave, see here:
http://www.powerbasic.com/support/forums/Forum7/HTML/003152.html
It is easy enough to adapt the routine to suit specifics.
Ian.
Hi Ian,
Thanks for the link. I compiled the code and have now automated the conversion.
Up to now, I have had this same "problem" with my server logs. I would open each log in WordPad, then save it as a text file to get it into the correct format.
As the post that Ian shared states, it requires reading the whole file into memory. If you still don't want to do that, this will work for you (you will need to edit the two file names of course):
#COMPILE EXE
FUNCTION PBMAIN () AS LONG
    LOCAL StrIn AS STRING
    OPEN "OriginalFile" FOR BINARY AS 1
    OPEN "NewFile" FOR BINARY AS 2
    WHILE NOT EOF(1)
        GET$ #1, 1, StrIn            ' read one byte at a time
        IF StrIn = $LF THEN          ' specific to this problem
            PUT$ #2, $CRLF
        ELSE
            PUT$ #2, StrIn
        END IF
    WEND
    CLOSE 1
    CLOSE 2
END FUNCTION
Of course, while it works as you need, this is just demo code. If you need to do it often, you could easily work it into a FF program and have the user select the file with a file browser, preview it, etc. BTW, 450 MB is not so big that you can't read it all at once; you could do the same thing with this code:
#COMPILE EXE
FUNCTION PBMAIN () AS LONG
    LOCAL StrIn AS STRING
    OPEN "OriginalFile" FOR BINARY AS 1
    GET$ 1, LOF(1), StrIn                ' read the whole file in
    CLOSE 1
    REPLACE $LF WITH $CRLF IN StrIn      ' specific to this problem
    OPEN "NewFile" FOR OUTPUT AS 1
    PRINT #1, StrIn;
    CLOSE 1
END FUNCTION
While not as versatile as the code Ian referenced, as it is written to your specific needs, it is much shorter.
David
Edited to remove the 'Any' keyword from the Replace command.
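For anyone reading later: the first program above makes one GET$ call per byte, which may be slow on a 453 MB file. Since every LF is converted regardless of its neighbours, the file can be processed in larger blocks without worrying about where a block ends. Here is a minimal sketch of that idea, written in Python rather than PowerBASIC so it can be tried anywhere. The function name and the 1 MB block size are my own assumptions, not anything from the posts above.

```python
def lf_to_crlf_stream(src_path, dst_path, chunk_size=1 << 20):
    """Copy src_path to dst_path, writing CR LF for every LF byte.

    Only one block of chunk_size bytes is held in memory at a time.
    Like the first program above, this converts EVERY LF, so it assumes
    the input contains no CR LF pairs already.
    """
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        while True:
            chunk = src.read(chunk_size)
            if not chunk:          # empty read means end of file
                break
            dst.write(chunk.replace(b"\n", b"\r\n"))
```

Because a single LF byte can never straddle a block boundary, no state has to be carried from one block to the next.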
This is from PBCC's sample folder:
'=============================================================================
'
' LF2CRLF : Convert a Unix-style linefeed-delimited (LF) text file to CR/LF
' format.
' Copyright (c) 2000-2011 PowerBASIC, Inc.
' All Rights Reserved.
'
'=============================================================================
#COMPILER PBCC 6
#COMPILE EXE
#DIM ALL
'-----------------------------------------------------------------------------
' Main application entry point...
'
FUNCTION PBMAIN() AS LONG
    LOCAL ix AS LONG
    LOCAL sFileName AS STRING
    LOCAL sText AS STRING
    sFileName = TRIM$(COMMAND$)
    IF ASC(sFileName, 1) = 34 THEN
        sFileName = MID$(sFileName, 2)
        ix = INSTR(sFileName, $DQ)
        IF ix THEN
            sFileName = LEFT$(sFileName, ix - 1)
        END IF
    END IF
    IF LEN(sFileName) = 0 OR sFileName = "/?" OR sFileName = "-?" THEN
        STDOUT
        STDOUT "LF2CRLF 1.1 Copyright (c) 2000-2011 PowerBASIC, Inc."
        STDOUT
        STDOUT "Purpose:"
        STDOUT " Convert a Unix-style text file to an ASCII text file."
        STDOUT " The existing file is overwritten by the new file."
        STDOUT
        STDOUT "Syntax:"
        STDOUT " LF2CRLF filename"
        STDOUT
        WAITKEY$
        EXIT FUNCTION
    END IF
    IF LEN(DIR$(sFileName)) = 0 THEN
        STDERR "Can't find file " + sFileName
        EXIT FUNCTION
    END IF
    OPEN sFileName FOR BINARY ACCESS READ LOCK WRITE AS #1
    GET$ 1, LOF(1), sText
    CLOSE #1
    IF ERR THEN
        STDERR "Error reading file " + sFileName
        EXIT FUNCTION
    END IF
    REPLACE $LF WITH $CRLF IN sText
    OPEN sFileName FOR BINARY ACCESS READ WRITE LOCK READ WRITE AS #1
    PUT$ 1, sText
    CLOSE #1
    IF ERR THEN
        STDERR "Error writing file " + sFileName
    END IF
END FUNCTION
Knuth,
Nice find. I never bought PBCC. That code, with the exception of the error checking and file name acquisition, is almost exactly like the second program in my post. They did open the results file as binary instead of as output, but the function is the same. The only reason to open the first one as binary instead of as input is to specify a single read of size LOF(1) (the whole file). You can't do that when the file is opened as input. Both ways of opening the file when writing can write an entire string in one go.
Quote from: David Kenny on April 19, 2014, 04:47:43 AM
BTW, 450MB is not so big you could do the same thing with this code:
#COMPILE EXE
FUNCTION PBMAIN () AS LONG
    LOCAL StrIn AS STRING
    OPEN "OriginalFile" FOR BINARY AS 1
    GET$ 1, LOF(1), StrIn                    ' read the whole file in
    CLOSE 1
    REPLACE ANY $LF WITH $CRLF IN StrIn      ' specific to this problem <----- This won't work
    OPEN "NewFile" FOR OUTPUT AS 1
    PRINT #1, StrIn;
    CLOSE 1
END FUNCTION
While not as versatile as the code Ian referenced, as it is written to your specific needs, it is much shorter.
David
PBCC won't do this with the 'Replace Any' statement because of this quote from PBCC Help:
" If you use the ANY option, within MainString, each occurrence of each character in MatchString will be replaced with the corresponding character in NewString. In this case, MatchString and NewString must be the same length, because there is a one-to-one correspondence between their characters."
@David Kenny - Thanks for the explanation of why they were opening as Binary, I was wondering about that.
The help files for PBWin and PBCC say the same thing about the 'Any' keyword. That was my mistake. My recollection, flawed of course, was that the 'Replace' function only replaces the first occurrence of the match string and the 'Any' keyword was needed to replace 'Any' occurrence.
Leave the 'Any' keyword out and it will work as intended.
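To picture the one-to-one character mapping the help text describes, here is a rough Python analog (not PowerBASIC, and not a claim about how REPLACE ANY is implemented). Python's str.maketrans with two string arguments also pairs characters by position and rejects strings of unequal length, which is the same reason a one-character $LF cannot be mapped to a two-character $CRLF this way. The helper name map_chars is hypothetical.

```python
def map_chars(text, match, new):
    """Replace each character of `match` found in `text` with the
    character at the same position in `new` (one-to-one mapping).

    str.maketrans raises ValueError when match and new differ in length.
    """
    return text.translate(str.maketrans(match, new))


# 'a' -> 'o' and 'n' -> 'm', character by character
print(map_chars("banana", "an", "om"))

# One character cannot be paired with two, so this is rejected
try:
    map_chars("line\n", "\n", "\r\n")
except ValueError as exc:
    print("rejected:", exc)
```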
Yes, your incorrect assumption actually sounds quite logical. Unfortunately, if you have mixed CRLF / LF (as some files I've dealt with have), then you will end up with something like CRCRLF. There are probably quite a few ways around this, but I think I first went through and changed all CRLFs to LF and then did the LF to CRLF part. You could of course do LF to CRLF and then do CRCRLF to CRLF - I think I'm starting to confuse myself now, so I'll shut up :-\
Don't forget Remove$... as in:
x$ = REMOVE$(x$, $CR)
Replace $LF with $CRLF in x$
As you said, quite a few ways to do this.
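The two-step idea above (strip every CR, then turn each LF into CRLF) can be sketched in Python like this. It is only an illustration of the technique, not a translation of anyone's actual code, and the function names are made up for the example. Note the assumption that a CR never appears on its own as meaningful data, since this removes every one of them.

```python
def naive_lf_to_crlf(data: bytes) -> bytes:
    """Blindly replace every LF; doubles the CR on lines already CRLF."""
    return data.replace(b"\n", b"\r\n")


def normalize_to_crlf(data: bytes) -> bytes:
    """Remove all CR bytes first, then replace every LF with CR LF."""
    return data.replace(b"\r", b"").replace(b"\n", b"\r\n")


mixed = b"one\r\ntwo\nthree\r\n"
print(naive_lf_to_crlf(mixed))    # shows the CR CR LF problem
print(normalize_to_crlf(mixed))   # every line ends in a single CR LF
```

Running the normalizing version twice gives the same result as running it once, so it is safe to apply to files that have already been fixed.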
Yeah, I think I like your way better, I might change my code.
Thanks for the tip
And while we're counting different ways to skin a cat: RegRepl hasn't been mentioned yet.
Quote from: Knuth Konrad on June 10, 2014, 12:09:13 PM
And while we're counting different ways to skin a cat: RegRepl hasn't been mentioned yet.
If MCM was here, he would have already mentioned it! :D
Quote: RegRepl hasn't been mentioned yet.
Yeah, I like that one too, but you must use a loop to get them all. Not a reason to not use it, but it would be nice if it supported an optional 'All' keyword. At least with RegRepl you are able to target just the lone $LF's (leaving the $CRLF's alone) to satisfy Grant's needs.
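As an illustration of the regex approach, here is a Python sketch that touches only line feeds not already preceded by a carriage return. This uses Python's re module, not PowerBASIC's RegRepl, whose pattern syntax may well differ; in particular the lookbehind used here is a Python feature I am not assuming RegRepl has. Python's sub replaces every match in one call, so no loop is needed in this version.

```python
import re

# An LF that is NOT immediately preceded by a CR
LONE_LF = re.compile(rb"(?<!\r)\n")


def fix_lone_lf(data: bytes) -> bytes:
    """Replace each lone LF with CR LF, leaving existing CR LF pairs alone."""
    return LONE_LF.sub(b"\r\n", data)


print(fix_lone_lf(b"one\r\ntwo\nthree\n"))
```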
While talking about the Replace, some of the results here are similar to a situation once where I found a kind of 'limitation' (for want of a better word) in the Replace statement (might have been the Remove function). I can't recall the exact details, but basically doing the Replace made a situation where replacing the findstring within the targetstring actually caused that findstring to come into being, and then Replace didn't Replace it - Replace wasn't recursive.
That probably sounds as clear as mud, so take the example where you want to replace any occurrences of "catdog" in a string, with "" and you have the string "catcatdogdog". If you remove catdog (or replace it with "") then you end up with catdog, which is what you were trying to get rid of in the first place!
In that case I think I looped for as long as INSTR still found the find string.
I'm not sure if the regex would end up with the same result as Replace.
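The "catdog" case and the loop fix can be reproduced in Python, whose str.replace is also a single left-to-right pass. This is just a sketch of the looping idea; remove_all is a made-up helper name. The last lines show that Python's own regex substitution is single-pass too, which says nothing either way about how PowerBASIC's RegRepl behaves.

```python
import re


def remove_all(text: str, find: str) -> str:
    """Keep removing `find` until no occurrence is left in `text`."""
    if not find:               # an empty find string would loop forever
        return text
    while find in text:
        text = text.replace(find, "")
    return text


s = "catcatdogdog"
print(s.replace("catdog", ""))      # single pass leaves a new "catdog"
print(remove_all(s, "catdog"))      # looping removes it completely
print(re.sub("catdog", "", s))      # Python's regex is single-pass as well
```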