[Discuss] 'C' string tokenizer for those who hate strtok

David Bronaugh dbronaugh at linuxboxen.org
Thu Jun 29 11:58:59 PDT 2006


p.willis at telus.net wrote:
> Quoting Paul Nienaber <phox at phox.ca>:
>
>> p.willis at telus.net wrote:
>>     
>>> Quoting p.willis at telus.net:
>>>
>>>> Quoting Paul Nienaber <phox at phox.ca>:
>>>>
>>>>> Buffer from nowhere?  POSIX has mandated strtok_r() for like... ever. 
>>>>> strtok() _is_ stupid.  It's also about one more line to use strchr() or
>>>>> one can use BSD strsep(), or whatever...
>>>>>
>>>>> ~p
>>>> Paul,
>>>>
>>>> It's fluff. It's a learning exercise for 'C' linked lists for beginners.
>>>>
>>>> It's entertainment...or would you rather read about partitioning hard
>>>> drives 200 more times.
>>>>
>>>> Peter
>>> I should also note that this technique is better than all 
>>> of the above mentioned tokenization routines in that it doesn't
>>> destroy the original data. It always makes me wonder about
>>> libraries when the actual 'man pages' say to avoid the routine
>>> if possible.
>>> (ie: strtok, strtok_r, and strsep all come with this warning)
>>>
>>> A second point regarding the storage is that the deallocation
>>> is obviated by this technique reducing memory leaks. But that's
>>> splitting hairs since free() also works [most of the time] for
>>> some of the other routines.
>> Yeah.  It's way better to allocate another buffer for every token,
>> rather than copying the whole thing and delimiting it, which of course
>> makes your technique "not better", because it incurs a whole pile more
>> calls to malloc() and friends...  </rant>  (but you were the one who
>> decided to use the word "better"...)
>>
>> I should come clean here and mention that I've taught C at UVic on at
>> least one occasion.  I won't even go into the C-specific issues here.
>>
>> ~p
>>     
>
> How is allocating 5 buffers any different than allocating one?
> I'd really like to be informed regarding malloc since most
> of the linux system uses it to allocate memory at anything 
> above the kernel level. What's wrong with malloc and friends?
>   
Nothing is wrong with malloc and friends, but they are general-purpose, and
general methods are usually slower than specific ones. If you want to look
at it in a quirky way, strtok et al. have their own "memory allocation",
using the provided string as a pool. That will always be faster than
calling malloc once per token.
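
As a rough sketch of what "using the provided string as a pool" means:
the helper below (split_in_place is a made-up name for illustration, not
something from this thread) tokenizes a writable buffer with strtok and
never calls malloc. Every token pointer it hands back points into the
original buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: split buf in place on spaces.
 * Stores up to max_tokens pointers into buf in tokens[] and returns
 * how many were stored. No malloc: each token lives inside buf. */
size_t split_in_place(char *buf, char *tokens[], size_t max_tokens)
{
    size_t count = 0;

    assert(buf != NULL && tokens != NULL);

    /* strtok overwrites each delimiter with '\0' and returns a pointer
     * into buf; passing NULL continues from where the last call stopped. */
    for (char *tok = strtok(buf, " ");
         tok != NULL && count < max_tokens;
         tok = strtok(NULL, " ")) {
        tokens[count++] = tok;
    }
    return count;
}
```

The price is the one already mentioned in this thread: buf is modified, so
the caller has to pass a writable copy if the original must survive.
strtok also keeps hidden state between calls; strtok_r is the reentrant
variant where POSIX is available.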

Using malloc as you do also uses more memory. On a typical 32-bit system,
at least 4 additional bytes are allocated before each chunk of memory that
malloc returns; they contain the size of the chunk.

Using a linked list as you do hugely bloats things as well.

So basically, if I have a string like "The quick brown fox jumped over 
the lazy dog", strtok will use n bytes of memory, where n is the length 
of the string in characters including the \0 at the end. Let m be the 
number of words in the string. Your copies take the same n bytes in total 
(each word trades its delimiter for its own \0), plus a 4-byte malloc 
header per word, so your method will use n + (m * 4) bytes of memory... 
just for the strings. Now we'll drag in the linked list overhead of 16 
bytes per chunk, including the pointer. So the total memory usage, if we 
give you the pointers for free, is n + (m * 20) bytes of memory.

In our "quick brown fox" example, there are 9 words, and the string is 
45 bytes long (44 characters plus the \0). strtok will use 45 bytes. Your 
method will use 45 + (9 * 20) bytes -- or 225 bytes of memory. That's 5 
times as much memory to do the same thing.
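
To make the arithmetic easy to check, here is a small sketch that
reproduces it. The two constants are just the figures assumed in this
post (4 bytes per malloc header, 16 bytes per list node); real allocators
and 64-bit systems will differ. The function names are invented for the
example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Assumed overheads taken from the estimate above, not measured values. */
enum { MALLOC_HEADER_BYTES = 4, LIST_NODE_BYTES = 16 };

/* Count space-separated words in s. */
size_t count_words(const char *s)
{
    size_t words = 0;

    assert(s != NULL);
    while (*s != '\0') {
        s += strspn(s, " ");      /* skip any run of delimiters */
        if (*s == '\0')
            break;
        words++;
        s += strcspn(s, " ");     /* skip over the word itself */
    }
    return words;
}

/* In-place tokenizing: just the string plus its terminating '\0'. */
size_t in_place_bytes(const char *s)
{
    return strlen(s) + 1;
}

/* One malloc'd copy plus one list node per word: n + (m * 20). */
size_t per_token_bytes(const char *s)
{
    return in_place_bytes(s)
         + count_words(s) * (MALLOC_HEADER_BYTES + LIST_NODE_BYTES);
}
```

For the example sentence, in_place_bytes gives 45 and per_token_bytes
gives 225, matching the numbers above.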

Now imagine tokenizing a 20 megabyte string...

David

