[Discuss] 'C' string tokenizer for those who hate strtok
p.willis at telus.net
Thu Jun 29 13:02:15 PDT 2006
Quoting David Bronaugh <dbronaugh at linuxboxen.org>:
> p.willis at telus.net wrote:
> >
> > How is allocating 5 buffers any different than allocating one?
> > I'd really like to be informed regarding malloc since most
> > of the linux system uses it to allocate memory at anything
> > above the kernel level. What's wrong with malloc and friends?
> >
> Nothing is wrong with malloc and friends; but they are general. General
> methods are usually slower than specific ones. If you want to look at it
> a quirky way, strtok et al have their own "memory allocation", using the
> provided string as a pool. This will always be higher performance than
> using malloc.
>
> Using malloc as you do also uses more memory. 4 additional bytes are
> allocated before each chunk of memory allocated using malloc; they
> contain the size of the chunk.
>
> Using a linked list as you do hugely bloats things as well.
>
> So basically, if I have a string like "The quick brown fox jumped over
> the lazy dog", strtok will use n bytes of memory, where n is the length
> of the string in characters including the \0 at the end. Let m be the
> number of words in the string. Then your method will use n + (m * 4)
> bytes of memory... just for the strings. Now we'll drag in the linked
> list overhead of 16 bytes per chunk, including the pointer. So the total
> memory usage, if we give you the pointers for free, is n + (m * 20)
> bytes of memory.
>
> In our "quick brown fox" example, there are 9 words, and the string is
> 45 characters long. strtok will use 45 bytes. Your method will use 45 +
> (9 * 20) bytes -- or 225 bytes of memory. That's 5 times as much memory
> to do the same thing.
>
> Now imagine tokenizing a 20 megabyte string...
>
> David
Well, it's a given that I am using more memory.
I am, however, using more memory to be safe rather
than sorry.
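For reference, the arithmetic in the quoted message can be reproduced with a short sketch. The 4-byte malloc header and the 16-byte list node are the figures quoted above, not measurements; real overhead depends on the malloc implementation and the pointer size. The function names are made up for illustration.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Figures taken from the quoted message, not measured: they vary by
 * malloc implementation and pointer size. */
#define MALLOC_HEADER_BYTES 4   /* size field before each chunk */
#define LIST_NODE_BYTES     16  /* per-token linked list overhead */

/* Count whitespace-separated words without modifying the string. */
static size_t count_words(const char *s)
{
    size_t words = 0;
    int in_word = 0;

    assert(s != NULL);
    for (; *s != '\0'; s++) {
        if (isspace((unsigned char)*s)) {
            in_word = 0;
        } else if (!in_word) {
            in_word = 1;
            words++;
        }
    }
    return words;
}

/* In-place tokenizing (strtok style): n bytes, terminator included. */
static size_t in_place_bytes(const char *s)
{
    return strlen(s) + 1;
}

/* One malloc'd copy plus one list node per token: n + (m * 20). */
static size_t list_bytes(const char *s)
{
    return in_place_bytes(s)
         + count_words(s) * (MALLOC_HEADER_BYTES + LIST_NODE_BYTES);
}
```

For "The quick brown fox jumped over the lazy dog" this gives 9 words, 45 bytes in place, and 225 bytes for the list, matching the numbers in the quote.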
I can understand your argument if I were coding something
on a MC6809 where I only get 256 bytes of working memory,
but let's be realistic here. If I were really worried about
memory usage I would be hand coding the program in hex
and I sure wouldn't be running Linux as an OS.
'C' is a relatively low level language. The baseline
allocation of memory is just that: a baseline. Writing
*safe* applications that provide robust and flexible
structures to work with is the responsibility of the
programmer.
For example, I prefer case B below over case A because
I can actually check the storage to see what type it is. Does
case B use more space? *hell yes*. Does the storage mechanism
in B provide more information for the runtime use of the data?
*** HELL YES ***
A.)
void *a=malloc(sizeof(float));
B.)
#define TYPE_FLOAT 3
/* One pointer, viewed as whichever type data_type says it is. */
typedef union __DATA_STORE
{
    void *data;
    float *float_data;
    int *int_data;
    etc *etc_data;      /* placeholder for further types */
}DATA_STORE;
typedef struct __STORAGE
{
    long data_type;     /* one of the TYPE_* tags */
    void *data;
    DATA_STORE data_union;
    OTHER_INFORMATION useful_information;
    GPS_LOCATION locale;
}STORAGE;
STORAGE *store=(STORAGE *)malloc(sizeof(STORAGE));
store->data_type=TYPE_FLOAT;
store->data_union.data=malloc(sizeof(float));
/*fill out OTHER_INFORMATION ie: time, date, etc, ...*/
...
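A cut-down version of case B that compiles on its own might look like the sketch below. It keeps only the type tag and the union; OTHER_INFORMATION and GPS_LOCATION are left out because they are not defined in this message. TYPE_INT and the storage_* function names are hypothetical, added only to show the runtime type check.

```c
#include <assert.h>
#include <stdlib.h>

#define TYPE_INT   2   /* hypothetical tag, not from the message */
#define TYPE_FLOAT 3

typedef struct {
    long data_type;          /* says which union member is valid */
    union {
        void  *data;
        float *float_data;
        int   *int_data;
    } u;
} Storage;

/* Allocate a Storage holding one float; returns NULL on failure. */
static Storage *storage_new_float(float value)
{
    Storage *s = malloc(sizeof *s);
    if (s == NULL)
        return NULL;
    s->u.float_data = malloc(sizeof *s->u.float_data);
    if (s->u.float_data == NULL) {
        free(s);
        return NULL;
    }
    *s->u.float_data = value;
    s->data_type = TYPE_FLOAT;
    return s;
}

/* Read the float only if the tag matches; returns 1 on success. */
static int storage_get_float(const Storage *s, float *out)
{
    if (s == NULL || out == NULL || s->data_type != TYPE_FLOAT)
        return 0;
    assert(s->u.float_data != NULL);
    *out = *s->u.float_data;
    return 1;
}

/* Release the payload through the member the tag names, then the node. */
static void storage_free(Storage *s)
{
    if (s == NULL)
        return;
    if (s->data_type == TYPE_FLOAT)
        free(s->u.float_data);
    else if (s->data_type == TYPE_INT)
        free(s->u.int_data);
    free(s);
}
```

The point of the tag is that storage_get_float can refuse a Storage that holds an int, which a bare void * from case A cannot do.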
I agree that most programmers making small apps
will go for case A or something less. But it sure is
nice to have all the information in the variable.
That's pretty much what's going on with the linked list
idea. There's even a note about adding a GetTokenByID
routine that uses STR.id, which is additional info
that couldn't be gleaned from a simple allocation.
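As a rough sketch of that idea: the STR structure itself is not shown in this message, so the Token layout below is hypothetical and only the id field is taken from the text. The tokenizer copies each token into its own node and leaves the source string untouched, and get_token_by_id stands in for the GetTokenByID routine mentioned above.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical node layout; only the id field comes from the message. */
typedef struct Token {
    int id;               /* position of the token, starting at 0 */
    char *text;           /* malloc'd copy of the token */
    struct Token *next;
} Token;

static void free_tokens(Token *head)
{
    while (head != NULL) {
        Token *next = head->next;
        free(head->text);
        free(head);
        head = next;
    }
}

/* Split src on any character in delims, leaving src untouched.
 * Returns the list head, or NULL if there are no tokens or malloc fails. */
static Token *tokenize(const char *src, const char *delims)
{
    Token *head = NULL;
    Token **tail = &head;
    int id = 0;

    assert(src != NULL && delims != NULL);
    while (*src != '\0') {
        src += strspn(src, delims);          /* skip delimiters */
        size_t len = strcspn(src, delims);   /* length of next token */
        if (len == 0)
            break;

        Token *t = malloc(sizeof *t);
        char *copy = malloc(len + 1);
        if (t == NULL || copy == NULL) {
            free(t);
            free(copy);
            free_tokens(head);
            return NULL;
        }
        memcpy(copy, src, len);
        copy[len] = '\0';
        t->id = id++;
        t->text = copy;
        t->next = NULL;
        *tail = t;                           /* append to the list */
        tail = &t->next;
        src += len;
    }
    return head;
}

/* Walk the list and return the text of the token with the given id. */
static const char *get_token_by_id(const Token *head, int id)
{
    for (; head != NULL; head = head->next)
        if (head->id == id)
            return head->text;
    return NULL;
}
```

The lookup is a linear walk, so it costs more than indexing an array, but each node carries its own id and an independent copy of the text.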
The VARIANT type and class are one thing that Microsoft
got right. VARIANT is an example of wrapping storage into
an object.
Same idea here...
(Now I guess I'll get 'fan mail' from people who hate MS...)
Peter