[Discuss] 'C' string tokenizer for those who hate strtok

p.willis at telus.net p.willis at telus.net
Thu Jun 29 13:02:15 PDT 2006


Quoting David Bronaugh <dbronaugh at linuxboxen.org>:

> p.willis at telus.net wrote:

> >
> > How is allocating 5 buffers any different than allocating one?
> > I'd really like to be informed regarding malloc since most
> > of the linux system uses it to allocate memory at anything 
> > above the kernel level. What's wrong with malloc and friends?
> >   
> Nothing is wrong with malloc and friends; but they are general. General 
> methods are usually slower than specific ones. If you want to look at it 
> a quirky way, strtok et al have their own "memory allocation", using the 
> provided string as a pool. This will always be higher performance than 
> using malloc.
> 
> Using malloc as you do also uses more memory. 4 additional bytes are 
> allocated before each chunk of memory allocated using malloc; they 
> contain the size of the chunk.
> 
> Using a linked list as you do hugely bloats things as well.
> 
> So basically, if I have a string like "The quick brown fox jumped over 
> the lazy dog", strtok will use n bytes of memory, where n is the length 
> of the string in characters including the \0 at the end. Let m be the 
> number of words in the string. Then your method will use n + (m * 4) 
> bytes of memory... just for the strings. Now we'll drag in the linked 
> list overhead of 16 bytes per chunk, including the pointer. So the total 
> memory usage, if we give you the pointers for free, is n + (m * 20) 
> bytes of memory.
> 
> In our "quick brown fox" example, there are 9 words, and the string is 
> 45 characters long. strtok will use 45 bytes. Your method will use 45 + 
> (9 * 20) bytes -- or 225 bytes of memory. That's 5 times as much memory 
> to do the same thing.
> 
> Now imagine tokenizing a 20 megabyte string...
> 
> David


Well, it's a given that I am using more memory.
I am, however, using more memory to be safe rather
than sorry.
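
As a rough check of the figures quoted above, here is a minimal sketch
that reproduces the arithmetic. The 4-byte malloc header and the 16-byte
list node are David's assumed numbers, not measured values, and the
function names are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Count space-separated words in s. */
static size_t count_words(const char *s)
{
    size_t words = 0;
    int in_word = 0;

    assert(s != NULL);
    for (; *s != '\0'; s++) {
        if (*s == ' ') {
            in_word = 0;
        } else if (!in_word) {
            in_word = 1;
            words++;
        }
    }
    return words;
}

/* In-place tokenizing (strtok style): the string plus its '\0'. */
static size_t bytes_in_place(const char *s)
{
    return strlen(s) + 1;
}

/* One malloc'd copy per token plus one list node per token.
 * per_chunk and per_node are the ASSUMED overheads (4 and 16 in the
 * quoted estimate); real allocators will differ. */
static size_t bytes_linked_list(const char *s, size_t per_chunk,
                                size_t per_node)
{
    return strlen(s) + 1 + count_words(s) * (per_chunk + per_node);
}
```

Calling bytes_in_place() and bytes_linked_list(s, 4, 16) on the
"quick brown fox" sentence gives the two totals being compared.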

I could understand your argument if I were coding something
on an MC6809 where I only get 256 bytes of working memory,
but let's be realistic here. If I were really worried about
memory usage I would be hand coding the program in hex,
and I sure wouldn't be running Linux as an OS.

'C' is a relatively low level language. The baseline
allocation of memory is just that: a baseline. Building
*safe* applications that provide robust and flexible
structures to work with is the responsibility of the
programmer.

For example, I prefer case B below over case A because
I can actually check the storage to see what type it is. Does
case B use more space? *hell yes*. Does the storage mechanism
in B provide more information for the runtime use of the data?
*** HELL YES ***


A.)

void *a=malloc(sizeof(float));

B.)

#define TYPE_FLOAT 3

typedef union __DATA_STORE
{
  void *data;
  float *float_data;
  int *int_data;
  /* etc: one pointer member per supported type */
}DATA_STORE;

typedef struct __STORAGE
{
   long data_type;
   void *data;
   DATA_STORE data_union;
   OTHER_INFORMATION useful_information;
   GPS_LOCATION locale;
}STORAGE;

STORAGE *store=(STORAGE *)malloc(sizeof(STORAGE));
store->data_type=TYPE_FLOAT;
store->data_union.data=malloc(sizeof(float));
/*fill out OTHER_INFORMATION ie: time, date, etc, ...*/
...
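
To make the "check the storage to see what type it is" point concrete,
here is a trimmed-down, compilable sketch of case B. It keeps only the
type tag and the union; OTHER_INFORMATION and GPS_LOCATION are left out
because they are not defined in this post. TYPE_INT and the three
store_* helpers are hypothetical names added for illustration.

```c
#include <assert.h>
#include <stdlib.h>

#define TYPE_INT   1   /* hypothetical tag value, not from the post */
#define TYPE_FLOAT 3   /* value used in the post */

typedef union {
    void  *data;
    float *float_data;
    int   *int_data;
} DATA_STORE;

typedef struct {
    long data_type;          /* says what data_union points at */
    DATA_STORE data_union;
} STORAGE;

/* Allocate a STORAGE holding one float; NULL on allocation failure. */
STORAGE *store_new_float(float value)
{
    STORAGE *store = malloc(sizeof *store);

    if (store == NULL)
        return NULL;
    store->data_type = TYPE_FLOAT;
    store->data_union.data = malloc(sizeof(float));
    if (store->data_union.data == NULL) {
        free(store);
        return NULL;
    }
    *(float *)store->data_union.data = value;
    return store;
}

/* Copy the float out only if the tag says it is one.
 * Returns 1 on success, 0 if store is NULL or holds another type. */
int store_get_float(const STORAGE *store, float *out)
{
    assert(out != NULL);
    if (store == NULL || store->data_type != TYPE_FLOAT)
        return 0;
    *out = *(const float *)store->data_union.data;
    return 1;
}

/* Release the payload and the wrapper. */
void store_free(STORAGE *store)
{
    if (store != NULL) {
        free(store->data_union.data);
        free(store);
    }
}
```

With a bare void * as in case A there is nothing for store_get_float()
to test; the caller simply has to know what the pointer refers to.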


I agree that most programmers making small apps
will go for case A or something less. But it sure is
nice to have all the information in the variable.
That's pretty much what's going on with the linked list
idea. There's even a note about writing a GetTokenByID
routine that uses STR.id, which is additional info
that couldn't be gleaned from a simple allocation.
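
A lookup along those lines might look like the sketch below. The STR
layout shown here is an assumption (the real struct is from earlier in
the thread and is not reproduced in this message), and str_prepend is a
hypothetical helper that exists only to build a list to search.

```c
#include <assert.h>
#include <stddef.h>

/* ASSUMED node layout: an id, the token text, and a next pointer. */
typedef struct str_node {
    int id;
    const char *token;
    struct str_node *next;
} STR;

/* Fill in a caller-owned node, put it at the front of the list,
 * and return the new head. */
STR *str_prepend(STR *head, STR *node, int id, const char *token)
{
    assert(node != NULL);
    node->id = id;
    node->token = token;
    node->next = head;
    return node;
}

/* Walk the list and return the first node whose id matches,
 * or NULL if no node has that id. */
const STR *GetTokenByID(const STR *head, int id)
{
    for (; head != NULL; head = head->next) {
        if (head->id == id)
            return head;
    }
    return NULL;
}
```

The search is linear in the number of tokens, which is the usual
trade-off for keeping the id inside each list node.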

The VARIANT type and class are one thing that Microsoft
got right. VARIANT is an example of wrapping storage into
an object.
Same idea here...

(Now I guess I'll get 'fan mail' from people who hate MS...)

Peter


