[Discuss] 'C' string tokenizer for those who hate strtok

pw p.willis at telus.net
Thu Jun 29 20:10:00 PDT 2006


Paul Nienaber wrote:

>>How is allocating 5 buffers any different than allocating one?
>>I'd really like to be informed regarding malloc since most
>>of the linux system uses it to allocate memory at anything 
>>above the kernel level. What's wrong with malloc and friends?
>>  
> 
> There's overhead.  Even when it's all done in userspace, the GNU
> implementation is nasty, and by allocating more little bits, you can end
> up making future calls to malloc() slower...
> 

Well, you have a point there. But I didn't design malloc
or the GNU implementation. The nice thing about open software
is we can go in and change that if we like. The point was to make a
better functioning tokenizer that can be extended like an object.

>>I think dynamically allocating memory for storage
>>is a pretty good idea. That's what programs *should* do.
>>Otherwise we end up with buffer overrun exploits, etc..
>>  
> 
> It is, but allocating a "chunk" instead of calling malloc() a gazillion
> times is far more efficient, especially when you're pretty much being
> handed a way of having the buffers neatly packed into the chunk.
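
For comparison, the single-chunk approach described in the quote could
look roughly like the sketch below. This is not code from the thread:
chunk_tokens and chunk_split are hypothetical names, and the layout (one
malloc holding the struct, a pointer table, and a copy of the string) is
only one way to pack it.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical container: one allocation holds everything. */
typedef struct chunk_tokens {
    size_t count;   /* number of tokens found */
    char **token;   /* token[0..count-1], pointing into the same chunk */
} chunk_tokens;

/* Split src on any character in delims.
 * Returns NULL if the allocation fails.
 * A single free() on the returned pointer releases everything. */
chunk_tokens *chunk_split(const char *src, const char *delims)
{
    assert(src != NULL && delims != NULL);

    size_t len = strlen(src);
    /* Worst case is one-character tokens separated by single delimiters,
     * which never exceeds len / 2 + 1 tokens. */
    size_t table_bytes = (len / 2 + 1) * sizeof(char *);

    chunk_tokens *t = malloc(sizeof *t + table_bytes + len + 1);
    if (t == NULL)
        return NULL;

    t->count = 0;
    t->token = (char **)(t + 1);                 /* table follows the struct */
    char *text = (char *)t->token + table_bytes; /* string copy follows the table */
    memcpy(text, src, len + 1);

    char *p = text;
    for (;;) {
        p += strspn(p, delims);     /* skip leading delimiters */
        if (*p == '\0')
            break;
        t->token[t->count++] = p;   /* a token starts here */
        p += strcspn(p, delims);    /* advance to the end of the token */
        if (*p == '\0')
            break;
        *p++ = '\0';                /* terminate it and keep going */
    }
    return t;
}
```

Calling chunk_split("a,b,,c", ",") yields three tokens, and the caller
frees the lot with one free().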

Well, that's really a matter of taste, and not very useful outside of the
aesthetic realm. If I'm getting an iterated data set back from
a routine, I personally prefer something a bit more graceful than
that. With the chunk approach, there's no way to get a count other than
iterating through the tokens once to get a total. The linked list approach
provides a count up front (TOKENS.count). We can also add other things to
the linked list, like STR.length, STR.Number_of_capital_letters, or
TOKENS.whatever_we_want.

How about:

int (*go_team)(int input, int *output);	/* as a member of TOKENS */
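
To make the idea concrete, here is a minimal sketch of a token list that
keeps its count up front and carries a function-pointer member. It is an
illustration under assumed names, not the tokenizer posted earlier in this
thread: the text and length fields, tokens_append, tokens_free, and
double_it are all hypothetical. Only TOKENS.count, STR.length, and go_team
come from the discussion above.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

typedef struct STR {
    struct STR *next_str;
    struct STR *prev_str;
    char *text;      /* private copy of the token */
    size_t length;   /* cached strlen(text) */
} STR;

typedef struct TOKENS {
    STR *head;
    STR *tail;
    size_t count;                            /* known without walking the list */
    int (*go_team)(int input, int *output);  /* extension hook */
} TOKENS;

/* Append a copy of s to the list. Returns 0 on success, -1 on failure. */
int tokens_append(TOKENS *t, const char *s)
{
    assert(t != NULL && s != NULL);

    size_t len = strlen(s);
    STR *node = malloc(sizeof *node);
    if (node == NULL)
        return -1;
    node->text = malloc(len + 1);
    if (node->text == NULL) {
        free(node);
        return -1;
    }
    memcpy(node->text, s, len + 1);
    node->length = len;
    node->next_str = NULL;
    node->prev_str = t->tail;

    if (t->tail != NULL)
        t->tail->next_str = node;
    else
        t->head = node;
    t->tail = node;
    t->count++;
    return 0;
}

/* Release every node and reset the list to empty. */
void tokens_free(TOKENS *t)
{
    STR *p = t->head;
    while (p != NULL) {
        STR *next = p->next_str;
        free(p->text);
        free(p);
        p = next;
    }
    t->head = t->tail = NULL;
    t->count = 0;
}

/* Example hook: any function with this signature can be plugged in. */
int double_it(int input, int *output)
{
    *output = input * 2;
    return 0;
}
```

A caller starts with TOKENS t = {0}; sets t.go_team = double_it; and
appends tokens as they are found. The cost, as noted in the quoted text,
is two malloc calls per token.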

>>As for C-Specific issues I'm not sure what you mean. 
>>Does my 'C' code have punctuation problems? :)
>>  
> 
> Ok, answers to that then: (not meant to be offensive)
> 
> Your use of void* is ugly:  Don't cast the return value of malloc(), and
> replace all those void*'s in your structs with pointers to the proper
> struct types.

You have a good point here, except that 'next' and 'prev' are
pointers to the same type of struct. This can be problematic
for some compilers. So, casting void* is more of a habit than
a necessity. Apparently it is not a problem with gcc.


typedef struct STR
{
	struct STR *next_str;	/* tag must match the one declared above */
	struct STR *prev_str;
} STR;

As you note, this is a clearer representation of
the list member.
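
A short sketch of how the typed pointers read in practice, with no void*
and no cast on malloc. The helper names str_insert_after and str_count are
hypothetical and only serve to exercise the struct.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

typedef struct STR STR;   /* forward declaration of the tag and typedef */
struct STR {
    STR *next_str;
    STR *prev_str;
};

/* Allocate a node and link it after `after` (NULL starts a new list).
 * Returns the new node, or NULL if malloc fails. */
STR *str_insert_after(STR *after)
{
    STR *node = malloc(sizeof *node);   /* void* converts implicitly in C */
    if (node == NULL)
        return NULL;

    node->prev_str = after;
    node->next_str = (after != NULL) ? after->next_str : NULL;
    if (node->next_str != NULL)
        node->next_str->prev_str = node;
    if (after != NULL)
        after->next_str = node;
    return node;
}

/* Walk forward from head, checking the back links along the way. */
size_t str_count(const STR *head)
{
    size_t n = 0;
    for (const STR *p = head; p != NULL; p = p->next_str) {
        assert(p->next_str == NULL || p->next_str->prev_str == p);
        n++;
    }
    return n;
}
```

Because next_str and prev_str have the real type, the compiler checks
every assignment, and the traversal needs no casts at all.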


> There is no reason to be using: typedef struct foo_ {} foo;
> instead of: typedef struct {} foo;
> Unless of course you're taking advantage of it to have a pointer to
> itself in there somewhere, as mentioned above.

I thought it best to leave all options open. Those STR.next
and STR.prev void pointers should be changed. They function the same
with the type cast, so it is a matter of aesthetics.

> Magic numbers:  What's with the extra 4 bytes you've allowed out the end
> of each token?  I didn't look much, but the only thing I saw it being
> used for is to store at least one '\0'.  Using memset there is also
> pointless... just use string[length_of_string + 1] = '\0';  (or you can
> use calloc)

The 4 bytes were there for testing purposes. I wanted to see what the
end of the allocation looked like. I originally thought to make the
same code handle 2-byte Unicode strings as well, but as you note, 4 bytes
is a bit excessive.

I have a habit of clearing allocated storage using memset before use.
This is in case someone unwittingly takes a new allocation and then uses
strcat on it. If the allocation contains crap to start with, the resulting
concatenated string ends up being junk. I have noticed that freshly
allocated buffers can contain such junk, which causes strange bugs until
found.
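
The hazard is that malloc returns memory with indeterminate contents, and
strcat first searches for the existing terminator before appending. A
sketch of two ways to guard against that follows. The helper names are
hypothetical. The first is equivalent to malloc plus memset; the second
writes only the one byte that strcat actually depends on.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Zero-filled buffer: same effect as malloc followed by
 * memset(buf, 0, capacity). Returns NULL on failure. */
char *new_string_buffer(size_t capacity)
{
    assert(capacity > 0);   /* need room for at least the '\0' */
    return calloc(capacity, 1);
}

/* Cheaper variant: only the first byte is set, which is enough to make
 * the buffer a valid empty string for strcat. The rest stays
 * uninitialized. Returns NULL on failure. */
char *new_string_buffer_fast(size_t capacity)
{
    assert(capacity > 0);
    char *buf = malloc(capacity);
    if (buf != NULL)
        buf[0] = '\0';
    return buf;
}
```

Either way, strcat(buf, "foo") on a fresh buffer then starts at offset 0
instead of wherever a stray zero byte happens to be.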

Peter


