Difference between revisions of "CSV file reading and writing"

From BeebWiki
Jump to: navigation, search
m (1 revision)
m (Note about simplistic Excel solution.)
 
(5 intermediate revisions by 2 users not shown)
Line 20: Line 20:
  
 
==Records with no quote characters==
 
==Records with no quote characters==
If you know that no records contain ''"'' (quote) characters, or are willing to
+
If you know that no records contain ''"'' (quote) characters, or are willing
ignore them, the following code will read and write a CSV file.
+
to ignore them, the following code will read and write a CSV file.
  
     REM Read CSV file from in% to array$()
+
     REM Read CSV file from in% to array$(rmax%,fmax%)
     imax%=0:REPEAT
+
     rmax%=0:REPEAT
 
       s$=FNrd(in%)
 
       s$=FNrd(in%)
       FOR record%=0 TO rmax%-1
+
       FOR field%=0 TO fmax%
 
         i%=INSTR(s$+",",","):r$=LEFT$(s$,i%-1):s$=MID$(s$,i%+1)
 
         i%=INSTR(s$+",",","):r$=LEFT$(s$,i%-1):s$=MID$(s$,i%+1)
         IFLEFT$(r$,1)="""":IF RIGHT$(r$,1)="""":r$=MID$(r$,2,LENr$-2)
+
         IF LEFT$(r$,1)="""":IF RIGHT$(r$,1)="""":r$=MID$(r$,2,LENr$-2)
         array$(imax%,record%)=r$
+
         array$(rmax%,field%)=r$
 
       NEXT record%
 
       NEXT record%
       imax%=imax%+1
+
       rmax%=rmax%+1
 
     UNTIL EOF#in%
 
     UNTIL EOF#in%
 +
    rmax%=rmax%-1
 
      
 
      
     REM Write CSV file from array$() to out%
+
     REM Write CSV file from array$(rmax%,fmax%) to out%
     FOR item%=0 TO imax%-1
+
     FOR record%=0 TO rmax%
       FOR record%=0 TO rmax%-1
+
       FOR field%=0 TO fmax%
         s$=array$(item%,record%)
+
         s$=array$(record%,field%)
 
         IF INSTR(s$,",")<>0 THEN s$=""""+s$+""""
 
         IF INSTR(s$,",")<>0 THEN s$=""""+s$+""""
         PROCwr(out%,s$+",")
+
         PROCwr(out%,s$):IF field%<>fmax%:BPUT#out%,ASC","
      NEXT record%
+
      NEXT field%
      PTR#out%=PTR#out%-1:BPUT#out%,13
+
     NEXT record%
     NEXT item%
 
  
This code uses <code>'''imax%'''</code> to hold the maximum number of items
+
This code uses <code>'''fmax%'''</code> to hold the maximum field number
and <code>'''rmax%'''</code> to hold the maximum number of records, and
+
and <code>'''rmax%'''</code> to hold the maximum record number, and
 
<code>'''PROCwr()'''</code> to write a string to a file and
 
<code>'''PROCwr()'''</code> to write a string to a file and
 
<code>'''FNrd()'''</code> to read a terminated line from a file.
 
<code>'''FNrd()'''</code> to read a terminated line from a file.
Line 55: Line 55:
 
records containing ''"'' (quote) characters.
 
records containing ''"'' (quote) characters.
  
     REM Read CSV file from in% to array$()
+
     REM Read CSV file from in% to array$(rmax%,fmax%)
     imax%=0:REPEAT
+
     rmax%=0:REPEAT
 
       s$=FNrd(in%)
 
       s$=FNrd(in%)
       FOR record%=0 TO rmax%-1
+
       FOR field%=0 TO fmax%
 
         i%=INSTR(s$+",",","):r$=LEFT$(s$,i%-1):s$=MID$(s$,i%+1)
 
         i%=INSTR(s$+",",","):r$=LEFT$(s$,i%-1):s$=MID$(s$,i%+1)
         IFLEFT$(r$,1)="""":IF RIGHT$(r$,1)="""":r$=MID$(r$,2,LENr$-2)
+
         IF LEFT$(r$,1)="""":IF RIGHT$(r$,1)="""":r$=MID$(r$,2,LENr$-2)
 
         i%=0
 
         i%=0
 
         REPEAT
 
         REPEAT
           i%=INSTR(r$,"""",i%+1):IF i%:r$=LEFT$(r$,i%)+MID$(r$,i%+1)
+
           i%=INSTR(r$,"""""",i%+1):IF i%:r$=LEFT$(r$,i%)+MID$(r$,i%+1)
 
         UNTIL i%=0
 
         UNTIL i%=0
         array$(imax%,record%)=r$
+
         array$(rmax%,field%)=r$
       NEXT record%
+
       NEXT field%
       imax%=imax%+1
+
       rmax%=rmax%+1
 
     UNTIL EOF#in%
 
     UNTIL EOF#in%
 +
    rmax%=rmax%-1
 
      
 
      
     REM Write CSV file from array$() to out%
+
     REM Write CSV file from array$(rmax%,fmax%) to out%
     FOR item%=0 TO imax%-1
+
     FOR record%=0 TO rmax%
       FOR record%=0 TO rmax%-1
+
       FOR field%=0 TO fmax%
         r$=array$(item%,record%)
+
         r$=array$(record%,field%)
 
         i%=0
 
         i%=0
 
         REPEAT
 
         REPEAT
           i%=INSTR(r$,"""",i%+1):IF i%:r$=LEFT$(r$,i%)+MID$(r$,2,i%):i%=i%+1
+
           i%=INSTR(r$,"""",i%+1):IF i%:r$=LEFT$(r$,i%)+MID$(r$,i%):i%=i%+1
 
         UNTIL i%=0
 
         UNTIL i%=0
 
         IF INSTR(r$,",")<>0 THEN r$=""""+r$+""""
 
         IF INSTR(r$,",")<>0 THEN r$=""""+r$+""""
         PROCwr(out%,r$+",")
+
         PROCwr(out%,r$):IF field%<>fmax%:BPUT#out%,ASC","
      NEXT record%
+
      NEXT field%
      PTR#out%=PTR#out%-1:BPUT#out%,13
+
     NEXT record%
     NEXT item%
 
  
 
Note that this code is vunerable to ''String too long'' errors if escaping
 
Note that this code is vunerable to ''String too long'' errors if escaping
 
embedded characters result in the output string becoming too long.
 
embedded characters result in the output string becoming too long.
 +
 +
==Creating files that Excel will read==
 +
Microsoft Excel has several long-standing bugs where it mangles text fields
 +
where it thinks it looks like a numeric value. It strips leading zeros from
 +
strings of digits, it converts strings of digits with an 'e' character into
 +
an exponential value, and also strings that look like a fraction, and
 +
anything that looks like a date gets converted to an internal date value.
 +
 +
The simplest thing to do is output all data surrounded with <code>'''=""'''</code>,
 +
for example <code>'''="018118055"'''</code>. However, this significantly increases
 +
the size of the output file.
 +
 +
The following code will read and write CSV files if you need to deal with
 +
Excel loading files containing records that Excel thinks look like numbers,
 +
protecting vulnerable data to output the most compact output file. Note that
 +
even with fields protected in this manner, if you load the file into Excel
 +
'''and then save it out again''' Excel will save the file with anything that
 +
looks like a number converted into Excel's internal number format. There is
 +
absolutely nothing that can be done to protect against this.
 +
 +
    REM Read CSV file from in% to array$(rmax%,fmax%)
 +
    rmax%=0:REPEAT
 +
      s$=FNrd(in%)
 +
      FOR field%=0 TO fmax%
 +
        i%=INSTR(s$+",",","):r$=LEFT$(s$,i%-1):s$=MID$(s$,i%+1)
 +
        IF LEFT$(r$,2)="=""":r$=MID$(r$,2)
 +
        IF LEFT$(r$,1)="""":IF RIGHT$(r$,1)="""":r$=MID$(r$,2,LENr$-2)
 +
        i%=0
 +
        REPEAT
 +
          i%=INSTR(r$,"""""",i%+1):IF i%:r$=LEFT$(r$,i%)+MID$(r$,i%+1)
 +
        UNTIL i%=0
 +
        array$(rmax%,field%)=r$
 +
      NEXT field%
 +
      rmax%=rmax%+1
 +
    UNTIL EOF#in%
 +
    rmax%=rmax%-1
 +
   
 +
    REM Write CSV file from array$(rmax%,fmax%) to out%
 +
    FOR record%=0 TO rmax%
 +
      FOR field%=0 TO fmax%
 +
        r$=array$(record%,field%)
 +
        i%=0
 +
        REPEAT
 +
          i%=INSTR(r$,"""",i%+1):IF i%:r$=LEFT$(r$,i%)+MID$(r$,i%):i%=i%+1
 +
        UNTIL i%=0
 +
        q%=INSTR(r$,",")
 +
        IF q%=0:q%=(ASCr$=48)AND(INSTR(r$,"/")=0)      :REM leading zeros 00001
 +
        IF q%=0:q%=LENSTR$VALr$>8                      :REM long numbers 12345678901234
 +
        IF q%=0:IFVALr$:q%=INSTR(r$,"/")AND(ASCr$<>48)  :REM fractions 12/34
 +
        IF q%=0:q%=MID$(r$,3,1)=" "ANDMID$(r$,7,1)=" "  :REM dates xx XXX xxxx
 +
        IF q%:r$=""""+r$+"""":IFINSTR(r$,",")=0:r$="="+r$
 +
        PROCwr(out%,r$):IF field%<>fmax%:BPUT#out%,ASC","
 +
        IF INSTR(r$,",")<>0 THEN r$=""""+r$+""""
 +
        PROCwr(out%,r$):IF field%<>fmax%:BPUT#out%,ASC","
 +
      NEXT field%
 +
    NEXT record%
  
 
[[User:Jgharston|Jgharston]] 20:20, 29 August 2008 (BST)
 
[[User:Jgharston|Jgharston]] 20:20, 29 August 2008 (BST)
 +
[[User:Jgharston|Jgharston]] ([[User talk:Jgharston|talk]]) 00:18, 5 April 2017 (UTC)

Latest revision as of 16:58, 12 November 2017

Comma Separated Value or CSV files are a common format typically used to represent the contents of a spreadsheet or database (or any other 2D matrix of data cells); they can by read and written by programs such as Microsoft Excel and Microsoft Access. The main properties of CSV files are as follows:

  • Data rows are separated by end-of-line sequences.
  • Data columns are separated by commas.
  • Quotation marks must be escaped using the convention "".
  • Data cells containing commas or quotes must be enclosed in quotes.
  • Any data cell may optionally be enclosed in quotes.

Here is an example of the contents of a CSV file:

   1997,Ford,E350,"AC, ABS, moon roof",3000.00
   1999,Chevy,"Venture ""Extended Edition""",,4900.00

For more details see this Wikipedia article.

Records with no quote characters

If you know that no records contain " (quote) characters, or are willing to ignore them, the following code will read and write a CSV file.

   REM Read CSV file from in% to array$(rmax%,fmax%)
   rmax%=0:REPEAT
     s$=FNrd(in%)
     FOR field%=0 TO fmax%
       i%=INSTR(s$+",",","):r$=LEFT$(s$,i%-1):s$=MID$(s$,i%+1)
       IF LEFT$(r$,1)="""":IF RIGHT$(r$,1)="""":r$=MID$(r$,2,LENr$-2)
       array$(rmax%,field%)=r$
     NEXT record%
     rmax%=rmax%+1
   UNTIL EOF#in%
   rmax%=rmax%-1
   
   REM Write CSV file from array$(rmax%,fmax%) to out%
   FOR record%=0 TO rmax%
     FOR field%=0 TO fmax%
       s$=array$(record%,field%)
       IF INSTR(s$,",")<>0 THEN s$=""""+s$+""""
       PROCwr(out%,s$):IF field%<>fmax%:BPUT#out%,ASC","
     NEXT field%
   NEXT record%

This code uses fmax% to hold the maximum field number and rmax% to hold the maximum record number, and PROCwr() to write a string to a file and FNrd() to read a terminated line from a file. PROCwr() and FNrd() are in the StringIO library.

Records containing quote characters

The following code will read and write CSV files if you need to deal with records containing " (quote) characters.

   REM Read CSV file from in% to array$(rmax%,fmax%)
   rmax%=0:REPEAT
     s$=FNrd(in%)
     FOR field%=0 TO fmax%
       i%=INSTR(s$+",",","):r$=LEFT$(s$,i%-1):s$=MID$(s$,i%+1)
       IF LEFT$(r$,1)="""":IF RIGHT$(r$,1)="""":r$=MID$(r$,2,LENr$-2)
       i%=0
       REPEAT
         i%=INSTR(r$,"""""",i%+1):IF i%:r$=LEFT$(r$,i%)+MID$(r$,i%+1)
       UNTIL i%=0
       array$(rmax%,field%)=r$
     NEXT field%
     rmax%=rmax%+1
   UNTIL EOF#in%
   rmax%=rmax%-1
   
   REM Write CSV file from array$(rmax%,fmax%) to out%
   FOR record%=0 TO rmax%
     FOR field%=0 TO fmax%
       r$=array$(record%,field%)
       i%=0
       REPEAT
         i%=INSTR(r$,"""",i%+1):IF i%:r$=LEFT$(r$,i%)+MID$(r$,i%):i%=i%+1
       UNTIL i%=0
       IF INSTR(r$,",")<>0 THEN r$=""""+r$+""""
       PROCwr(out%,r$):IF field%<>fmax%:BPUT#out%,ASC","
     NEXT field%
   NEXT record%

Note that this code is vunerable to String too long errors if escaping embedded characters result in the output string becoming too long.

Creating files that Excel will read

Microsoft Excel has several long-standing bugs where it mangles text fields where it thinks it looks like a numeric value. It strips leading zeros from strings of digits, it converts strings of digits with an 'e' character into an exponential value, and also strings that look like a fraction, and anything that looks like a date gets converted to an internal date value.

The simplest thing to do is output all data surrounded with ="", for example ="018118055". However, this significantly increases the size of the output file.

The following code will read and write CSV files if you need to deal with Excel loading files containing records that Excel thinks look like numbers, protecting vulnerable data to output the most compact output file. Note that even with fields protected in this manner, if you load the file into Excel and then save it out again Excel will save the file with anything that looks like a number converted into Excel's internal number format. There is absolutely nothing that can be done to protect against this.

   REM Read CSV file from in% to array$(rmax%,fmax%)
   rmax%=0:REPEAT
     s$=FNrd(in%)
     FOR field%=0 TO fmax%
       i%=INSTR(s$+",",","):r$=LEFT$(s$,i%-1):s$=MID$(s$,i%+1)
       IF LEFT$(r$,2)="=""":r$=MID$(r$,2)
       IF LEFT$(r$,1)="""":IF RIGHT$(r$,1)="""":r$=MID$(r$,2,LENr$-2)
       i%=0
       REPEAT
         i%=INSTR(r$,"""""",i%+1):IF i%:r$=LEFT$(r$,i%)+MID$(r$,i%+1)
       UNTIL i%=0
       array$(rmax%,field%)=r$
     NEXT field%
     rmax%=rmax%+1
   UNTIL EOF#in%
   rmax%=rmax%-1
   
   REM Write CSV file from array$(rmax%,fmax%) to out%
   FOR record%=0 TO rmax%
     FOR field%=0 TO fmax%
       r$=array$(record%,field%)
       i%=0
       REPEAT
         i%=INSTR(r$,"""",i%+1):IF i%:r$=LEFT$(r$,i%)+MID$(r$,i%):i%=i%+1
       UNTIL i%=0
       q%=INSTR(r$,",")
       IF q%=0:q%=(ASCr$=48)AND(INSTR(r$,"/")=0)       :REM leading zeros 00001
       IF q%=0:q%=LENSTR$VALr$>8                       :REM long numbers 12345678901234
       IF q%=0:IFVALr$:q%=INSTR(r$,"/")AND(ASCr$<>48)  :REM fractions 12/34
       IF q%=0:q%=MID$(r$,3,1)=" "ANDMID$(r$,7,1)=" "  :REM dates xx XXX xxxx
       IF q%:r$=""""+r$+"""":IFINSTR(r$,",")=0:r$="="+r$
       PROCwr(out%,r$):IF field%<>fmax%:BPUT#out%,ASC","
       IF INSTR(r$,",")<>0 THEN r$=""""+r$+""""
       PROCwr(out%,r$):IF field%<>fmax%:BPUT#out%,ASC","
     NEXT field%
   NEXT record%

Jgharston 20:20, 29 August 2008 (BST) Jgharston (talk) 00:18, 5 April 2017 (UTC)