Efficient way to handle Unix-type line endings

Started by David Chisholm, April 18, 2014, 10:07:46 PM


David Chisholm

Folks,

I have a client who needs me to process a large data file of 125,000 variable-length records averaging 3700 bytes each.  The problem is that the file was written with <LF> line endings instead of <CR><LF> line endings.  At 453 MB the file is rather large, so I don't want to read it all into memory to fix the line endings.  I have sent the file back to the client to get it "fixed", but I was wondering whether anyone else has encountered this situation and whether there is an easy way to parse these kinds of lines.  Thanks.
/Dave Chisholm

Ian Vincent


Cho Sing Kum


Hi Ian,

Thanks for the link. I compiled the code and the conversion is now automated.

Up to now, I have had this same "problem" with my server logs. I would open each one in WordPad and then save it as a text file to get it into the correct format.

David Kenny

As the post that Ian shared states, it requires reading the whole file into memory.  If you still don't want to do that, this will work for you (you will need to edit the two file names of course):

#COMPILE EXE
FUNCTION PBMAIN () AS LONG
    LOCAL StrIn AS STRING
    OPEN "OriginalFile" FOR BINARY AS 1
    OPEN "NewFile" FOR BINARY AS 2
    WHILE NOT EOF(1)
        GET$ #1, 1, StrIn
        IF StrIn = $LF THEN  ' specific to this problem
            PUT$ #2, $CRLF
        ELSE
            PUT$ #2, StrIn
        END IF
    WEND
    CLOSE 1
    CLOSE 2
END FUNCTION


Of course, while it works as you need, this is just demo code.  If you need to do it often, you could easily work it into a FF program and have the user select the file with a file browser, preview it, etc.  BTW, 450 MB is not so big; you could do the same thing by reading the whole file in at once with this code:

#COMPILE EXE
FUNCTION PBMAIN () AS LONG
    LOCAL StrIn AS STRING

    OPEN "OriginalFile" FOR BINARY AS 1
    GET$ 1, LOF(1), StrIn                           ' read the whole file in
    CLOSE 1

    REPLACE $LF WITH $CRLF IN StrIn    ' specific to this problem

    OPEN "NewFile" FOR OUTPUT AS 1
    PRINT#1, StrIn;
    CLOSE 1
END FUNCTION

While not as versatile as the code Ian referenced, as it is written to your specific needs, it is much shorter.

David

Edited to remove the 'Any' keyword from the Replace command.
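
For anyone who wants to stay out of memory but finds the one-byte-at-a-time loop in the first program slow, here is a minimal sketch of a middle road: read the file in blocks and run REPLACE on each block. The 1 MB block size and the %CHUNK_SIZE equate are my own assumptions, not something from the posts above, and "NewFile" is assumed not to exist yet (a file opened FOR BINARY is not truncated).

```powerbasic
#COMPILE EXE
#DIM ALL

%CHUNK_SIZE = 1048576   ' 1 MB per read; an arbitrary choice, tune as needed

FUNCTION PBMAIN () AS LONG
    LOCAL sChunk AS STRING

    OPEN "OriginalFile" FOR BINARY AS #1
    OPEN "NewFile" FOR BINARY AS #2
    DO
        GET$ #1, %CHUNK_SIZE, sChunk       ' returns fewer bytes near the end of the file
        IF LEN(sChunk) = 0 THEN EXIT DO    ' nothing left to read
        REPLACE $LF WITH $CRLF IN sChunk   ' LF is a single byte, so block edges cannot split it
        PUT$ #2, sChunk
    LOOP
    CLOSE #1
    CLOSE #2
END FUNCTION
```

Because the match string is a single byte, it does not matter where a block boundary falls. That would not hold for a multi-byte match such as $CRLF, which could straddle two blocks.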

Knuth Konrad

This is from PBCC's sample folder:


'=============================================================================
'
'  LF2CRLF : Convert a Unix-style linefeed-delimited (LF) text file to CR/LF
'            format.
'  Copyright (c) 2000-2011 PowerBASIC, Inc.
'  All Rights Reserved.
'
'=============================================================================

#COMPILER PBCC 6
#COMPILE EXE
#DIM ALL


'-----------------------------------------------------------------------------
' Main application entry point...
'
FUNCTION PBMAIN() AS LONG

    LOCAL ix AS LONG
    LOCAL sFileName AS STRING
    LOCAL sText AS STRING

    sFileName = TRIM$(COMMAND$)
    IF ASC(sFileName, 1) = 34 THEN
        sFileName = MID$(sFileName, 2)
        ix = INSTR(sFileName, $DQ)
        IF ix THEN
            sFileName = LEFT$(sFileName, ix - 1)
        END IF
    END IF

    IF LEN(sFileName) = 0 OR sFileName = "/?" OR sFileName = "-?" THEN
        STDOUT
        STDOUT "LF2CRLF 1.1  Copyright (c) 2000-2011  PowerBASIC, Inc."
        STDOUT
        STDOUT "Purpose:"
        STDOUT "  Convert a Unix-style text file to an ASCII text file."
        STDOUT "  The existing file is overwritten by the new file."
        STDOUT
        STDOUT "Syntax:"
        STDOUT "  LF2CRLF filename"
        STDOUT
        WAITKEY$
        EXIT FUNCTION
    END IF

    IF LEN(DIR$(sFileName)) = 0 THEN
        STDERR "Can't find file " + sFileName
        EXIT FUNCTION
    END IF

    OPEN sFileName FOR BINARY ACCESS READ LOCK WRITE AS #1
    GET$ 1, LOF(1), sText
    CLOSE #1

    IF ERR THEN
        STDERR "Error reading file " + sFileName
        EXIT FUNCTION
    END IF

    REPLACE $LF WITH $CRLF IN sText

    OPEN sFileName FOR BINARY ACCESS READ WRITE LOCK READ WRITE AS #1
    PUT$ 1, sText
    CLOSE #1

    IF ERR THEN
        STDERR "Error writing file " + sFileName
    END IF

END FUNCTION

David Kenny

Knuth,

Nice find.  I never bought PBCC.  That code, with the exception of the error checking and file name acquisition, is almost exactly like the second program in my post.  They did open the results file as binary instead of as output, but the function is the same.  The only reason to open the first one as binary instead of as input is to specify a single read of size LOF(1) (the whole file).  You can't do that when the file is opened for input.  Either way of opening the file for writing can write an entire string in one go.

Grant McIntosh


Quote from: David Kenny on April 19, 2014, 04:47:43 AM
  BTW, 450MB is not so big you could do the same thing with this code:

#COMPILE EXE
FUNCTION PBMAIN () AS LONG
    LOCAL StrIn AS STRING

    OPEN "OriginalFile" FOR BINARY AS 1
    GET$ 1, LOF(1), StrIn                           ' read the whole file in
    CLOSE 1

    REPLACE ANY $LF WITH $CRLF IN StrIn    ' specific to this problem  <----- This won't work

    OPEN "NewFile" FOR OUTPUT AS 1
    PRINT#1, StrIn;
    CLOSE 1
END FUNCTION

While not as versatile as the code Ian referenced, as it is written to your specific needs, it is much shorter.

David

PBCC won't do this with the 'Replace Any' statement because of this quote from PBCC Help -
" If you use the ANY option, within MainString, each occurrence of each character in MatchString will be replaced with the corresponding character in NewString. In this case, MatchString and NewString must be the same length, because there is a one-to-one correspondence between their characters."

@David Kenny - Thanks for the explanation of why they were opening as Binary, I was wondering about that.
PBCC6.04
PBWin10.04
FF3.70
Vista

David Kenny

The help files for PBWin and PBCC say the same thing about the 'Any' keyword.  That was my mistake.  My recollection, flawed of course, was that the 'Replace' function only replaces the first occurrence of the match string and the 'Any' keyword was needed to replace 'Any' occurrence.

Leave the 'Any' keyword out and it will work as intended.
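
To make the difference concrete, here is a small sketch of the two forms side by side. The sample strings are made up for illustration only.

```powerbasic
#COMPILE EXE
#DIM ALL

FUNCTION PBMAIN () AS LONG
    LOCAL s AS STRING

    ' With ANY: character-for-character mapping, so both strings are the same length.
    s = "a-b_c"
    REPLACE ANY "-_" WITH "  " IN s      ' "-" and "_" each become a space: "a b c"

    ' Without ANY: whole-string match, and every occurrence is replaced.
    s = "one" + $LF + "two" + $LF
    REPLACE $LF WITH $CRLF IN s          ' "one" + $CRLF + "two" + $CRLF
END FUNCTION
```

$LF is one character and $CRLF is two, which is why the ANY form cannot express this conversion.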

Grant McIntosh

Yes, your incorrect assumption actually sounds quite logical. Unfortunately, if you have mixed CRLF / LF endings (as some files I've dealt with have), then replacing LF with CRLF leaves you with something like CRCRLF. There are probably quite a few ways around this, but I think I first went through and changed all CRLFs to LF and then did the LF to CRLF part. You could of course do LF to CRLF and then do CRCRLF to CRLF  -  I think I'm starting to confuse myself now, so I'll shut up   :-\

David Kenny

Don't forget Remove$... as in:

x$ = REMOVE$(x$, $CR)
Replace $LF with $CRLF in x$


As you said, quite a few ways to do this.
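
Spelled out a little more, here is a sketch that shows the CRCRLF problem from the previous post next to the REMOVE$ fix. ToCRLF is a hypothetical helper name and the test string is made up. Note that REMOVE$ drops every CR, so this assumes the file has no CR characters you want to keep (a lone CR used as a line ending would simply vanish).

```powerbasic
#COMPILE EXE
#DIM ALL

' Hypothetical helper: normalise any mix of LF and CRLF endings to CRLF.
FUNCTION ToCRLF (BYVAL sText AS STRING) AS STRING
    sText = REMOVE$(sText, $CR)          ' CRLF becomes LF; existing LFs are untouched
    REPLACE $LF WITH $CRLF IN sText      ' every LF becomes CRLF
    FUNCTION = sText
END FUNCTION

FUNCTION PBMAIN () AS LONG
    LOCAL sMixed, sNaive AS STRING

    sMixed = "one" + $CRLF + "two" + $LF + "three"

    sNaive = sMixed
    REPLACE $LF WITH $CRLF IN sNaive     ' "one" + $CR + $CRLF + "two" + $CRLF + "three"

    sMixed = ToCRLF(sMixed)              ' "one" + $CRLF + "two" + $CRLF + "three"
END FUNCTION
```

Both steps work on single bytes, so the same two lines can also be applied block by block if the file is too big to hold in one string.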

Grant McIntosh

Yeah, I think I like your way better, I might change my code.

Thanks for the tip

Knuth Konrad

And while we're counting different ways to skin a cat: RegRepl hasn't been mentioned yet.

Paul Squires

Quote from: Knuth Konrad on June 10, 2014, 12:09:13 PM
And while we're counting different ways to skin a cat: RegRepl hasn't been mentioned yet.

If MCM was here, he would have already mentioned it!  :D
Paul Squires
PlanetSquires Software

David Kenny

Quote: RegRepl hasn't been mentioned yet.
Yeah, I like that one too, but you must use a loop to get them all.  Not a reason to not use it, but it would be nice if it supported an optional 'All' keyword.  At least with RegRepl you are able to target just the lone $LF's (leaving the $CRLF's alone) to satisfy Grant's needs.

Grant McIntosh

While talking about the Replace, some of the results here are similar to a situation once where I found a kind of 'limitation' (for want of a better word) in the Replace statement (might have been the Remove function). I can't recall the exact details, but basically doing the Replace made a situation where replacing the findstring within the targetstring actually caused that findstring to come into being, and then Replace didn't Replace it - Replace wasn't recursive.

That probably sounds as clear as mud, so take the example where you want to replace any occurrences of "catdog" in a string, with "" and you have the string "catcatdogdog".  If you remove catdog (or replace it with "") then you end up with catdog, which is what you were trying to get rid of in the first place!

In that case I think I looped for as long as Instr still found the find string.

I'm not sure if the regex would end up with the same result as Replace.
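
A minimal sketch of that loop, using the catdog example. RemoveAll is a hypothetical helper name, and the single-pass result in the comment is the behaviour described above rather than something taken from the help file.

```powerbasic
#COMPILE EXE
#DIM ALL

' Hypothetical helper: keep removing sFind until INSTR no longer finds it.
FUNCTION RemoveAll (BYVAL sText AS STRING, BYVAL sFind AS STRING) AS STRING
    IF LEN(sFind) THEN                       ' an empty find string has nothing to remove
        DO WHILE INSTR(sText, sFind) > 0
            sText = REMOVE$(sText, sFind)    ' each pass shortens sText, so the loop ends
        LOOP
    END IF
    FUNCTION = sText
END FUNCTION

FUNCTION PBMAIN () AS LONG
    LOCAL sOnce, sAll AS STRING

    sOnce = REMOVE$("catcatdogdog", "catdog")    ' single pass; as described above, leaves "catdog"
    sAll  = RemoveAll("catcatdogdog", "catdog")  ' repeats until nothing is found: ""
END FUNCTION
```

The same idea with REPLACE needs more care: if the replacement text itself contains the find string, a loop like this would never finish.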
