What is the fastest way to extract data from a huge text file?

per isakson - 2022-04-04T14:12:54+00:00
Question

I have a text file like this:

    1.0 IONOSPHERE MAPS GNSS                                      IONEX VERSION / TYPE
    ADDNEQ2 V5.3 AIUB 03-JUL-14 20:57                             PGM / RUN BY / DATE
    CODE'S GLOBAL IONOSPHERE MAPS FOR DAY 180, 2014               COMMENT
    Global ionosphere maps (GIM) are generated on a daily basis   DESCRIPTION
    (I don't want this part)
    . . . (skip 600 lines)
    1                                                             START OF TEC MAP
    2014 6 29 0 0 0                                               EPOCH OF CURRENT MAP
    87.5-180.0 180.0 5.0 450.0                                    LAT/LON1/LON2/DLON/H
    154 154 155 155 155 156 156 156 156 155 155 155 154 154 153 153
    152 151 150 149 148 147 146 145 145 144 143 142 141 140 139 139
    138 138 137 137 137 137 136 136 137 137 137 137 137 138 138 139
    139 139 140 140 141 142 142 143 143 144 145 145 146 147 147 148
    149 149 150 151 152 152 153 153 154
    85.0-180.0 180.0 5.0 450.0                                    LAT/LON1/LON2/DLON/H
    160 161 162 163 164 164 165 165 165 164 164 163 163 162 161 159
    158 157 155 153 151 149 147 145 143 141 139 138 136 134 133 132
    131 130 130 129 129 129 130 130 131 131 132 133 134 135 136 136
    137 138 139 139 140 140 141 142 142 143 144 145 146 146 148 149
    150 151 153 154 155 157 158 159 160
    . . .

I have to look up a specific value for a given latitude, longitude and time.

I have a function that uses fopen and fgetl for this search. The data have fixed spacing, so I use strcmp (string comparison) and isequal to find the value I want.

. . .

Let's say value = search(lat, lon, time) with lat = 85.0; lon = -175; time (UT) = 0. I first compare each line returned by fgetl with the string:

    2014 6 29 0 0 0 EPOCH OF CURRENT MAP

If it matches, I search for 85.0 in the following lines returned by fgetl:

    85.0-180.0 180.0 5.0 450.0 LAT/LON1/LON2/DLON/H

If that matches, I store all the related data in a vector:

    160 161 162 163 164 164 165 165 165 164 164 163 163 162 161 159
    158 157 155 153 151 149 147 145 143 141 139 138 136 134 133 132
    131 130 130 129 129 129 130 130 131 131 132 133 134 135 136 136
    137 138 139 139 140 140 141 142 142 143 144 145 146 146 148 149
    150 151 153 154 155 157 158 159 160

(in vector form), then get the value at the vector index that corresponds to the longitude (index = 2 in this example).

. . .

But I have to call this search function 250,000 times. It will take over 24 hours!

What can I do? I cannot change my computer. Thanks! I need your help!

PS: the text file is about 12,000 rows by 80 columns.

Expert Answer

Prashant Kumar answered . 2025-11-20

Now I'm done:
  • less than a tenth of a second to read and parse the sample file (with the file in the system cache)
  • less than a tenth of a millisecond to retrieve one value
  • the array ION takes about half a MB (497568 bytes). Convert ION to uint8 to save memory, if needed.
  • 62196 values retrieved from the sample file.
You add tests and comments!
 
 
    >> tic,ION = cssm();toc
    Elapsed time is 0.074765 seconds.
    >> sum(not(isnan(ION(:))))
    ans =
           62196
    >> whos ION
      Name       Size                Bytes  Class     Attributes

      ION       73x71x12            497568  double            

    >> ION(lon2ix(0),lat2ix(85),ut2ix(20))
    ans =
       164  

    >> tic,ION(lon2ix(0),lat2ix(85),ut2ix(20));toc
    Elapsed time is 0.000067 seconds.

compared to the original line-by-line search function:

    >> tic, [gim_tec] = sample_search_function( 20, 85, 0 ), toc
    gim_tec =
       164
    Elapsed time is 0.265756 seconds.

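where cssm is the function below (to run the lookups above, lat2ix, lon2ix and ut2ix must also be defined at the command line, with the same definitions as inside the function):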

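    function    ION = cssm()
    % Read the whole IONEX file once and return the TEC maps as ION(lon,lat,ut).
    str = fileread( 'c:\m\cssm\CODG1520.txt' );
    % One cell per map: the text between START OF TEC MAP and END OF TEC MAP.
    ca1 = regexp( str, '(?<=START OF TEC MAP).+?(?=END OF TEC MAP)', 'match' );
    ION = nan( 73, 71, 12 );    % 73 longitudes x 71 latitudes x 12 maps
    lat2ix  = @(lat) round((lat+87.5)/2.5)+1;
    lon2ix  = @(lon) round((lon+180)/5.0)+1; %#ok
    ut2ix   = @(ut)  round(ut/2)+1;
    for jj = 1 : length( ca1 )
        % Drop the rest of the START OF TEC MAP line, then split off the epoch line.
        buf = regexp( ca1{jj}, '\n', 'split', 'once' );
        buf = regexp( buf{2} , '\n', 'split', 'once' );
        ut  = textscan( buf{1}, '%*f%*f%*f%f%*[^\n]' );   % hour is the 4th number
        ut  = ut{1};
        % Each piece ends with the first 60 columns of the next header line.
        ca2 = regexp( buf{2}, 'LAT/LON1/LON2/DLON/H', 'split' );
        pos = ca2{1};
        for kk = 2 : length( ca2 )
            lat = textscan( pos,'%f%*[^\n]' );    % latitude is the first number
            lat = lat{1};
            num = sscanf( ca2{kk}(1:end-60), '%f' );   % TEC values of this latitude
            pos =  strtrim( ca2{kk}(end-60+1:end) );
            ION(:,lat2ix(lat),ut2ix(ut)) = num;
        end
    end
    end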
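
Since the question needs 250,000 lookups, it may help to retrieve them all in one indexing step rather than one call at a time. The following is only a sketch under stated assumptions: the index helpers are copied from cssm() so that they exist at the command line, the array ION is a stand-in with the same size as the real one, and the query vectors lon, lat and ut are hypothetical.

```matlab
% Index helpers, copied from cssm() so they are available outside the function.
lat2ix = @(lat) round((lat + 87.5) / 2.5) + 1;   % -87.5:2.5:87.5 -> 1..71
lon2ix = @(lon) round((lon + 180) / 5.0) + 1;    % -180:5:180     -> 1..73
ut2ix  = @(ut)  round(ut / 2) + 1;               % 0:2:22         -> 1..12

% Stand-in data (assumption): replace this line with ION = cssm();
ION = reshape(1 : 73*71*12, 73, 71, 12);

% Hypothetical query vectors; in practice they would hold all requests.
lon = [-175;   0;   180];
lat = [  85;  85; -87.5];
ut  = [   0;  20;    22];

% Convert the three subscripts to linear indices and read all values at once.
ix  = sub2ind(size(ION), lon2ix(lon), lat2ix(lat), ut2ix(ut));
tec = ION(ix);
```

Because the helpers use round, a request that does not fall exactly on a grid node returns the value of the nearest node; interpolating between nodes would need extra code.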

 

