[Discuss] 'C' string tokenizer for those who hate strtok
pw
p.willis at telus.net
Thu Jun 29 20:10:00 PDT 2006
Paul Nienaber wrote:
>>How is allocating 5 buffers any different than allocating one?
>>I'd really like to be informed regarding malloc since most
>>of the linux system uses it to allocate memory at anything
>>above the kernel level. What's wrong with malloc and friends?
>>
>
> There's overhead. Even when it's all done in userspace, the GNU
> implementation is nasty, and by allocating more little bits, you can end
> up making future calls to malloc() slower...
>
Well, you have a point there. But I didn't design malloc
or the GNU implementation. The nice thing about open software
is we can go in and change that if we like. The point was to make a
better functioning tokenizer that can be extended like an object.
>>I think dynamically allocating memory for storage
>>is a pretty good idea. That's what programs *should* do.
>>Otherwise we end up with buffer overrun exploits, etc..
>>
>
> It is, but allocating a "chunk" instead of calling malloc() a gazillion
> times is far more efficient, especially when you're pretty much being
> handed a way of having the buffers neatly packed into the chunk.
Well, that's really a matter of taste, and not very useful outside of
the aesthetic realm. If I'm getting an iterated data set back from
a routine, personally I prefer something a bit more graceful than
that. With the chunk approach, other than iterating through the tokens
once to get a total, there's no way to get a count. The linked list
approach provides a count up-front (TOKENS.count). We can also add
other things to the linked list like STR.length,
STR.Number_of_capital_letters, TOKENS.whatever_we_want,
how about:
int TOKENS.(*go_team)(int input, int *output);
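A minimal sketch of the idea, not the code from the original post: a doubly linked token list that keeps its count up front. Only TOKENS.count and STR.length come from the discussion above; the names token_node, token_list, token_list_append, token_list_split and token_list_free are hypothetical, and skipping empty tokens is an assumption.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical node type: one token plus links to its neighbours. */
typedef struct token_node {
    struct token_node *next;
    struct token_node *prev;
    char *text;     /* NUL-terminated copy of the token */
    size_t length;  /* cached strlen(text) */
} token_node;

/* Hypothetical list header: the count is available without a walk. */
typedef struct token_list {
    token_node *head;
    token_node *tail;
    size_t count;
} token_list;

/* Append a copy of src[0..len) to the list. Returns 0, or -1 on failure. */
static int token_list_append(token_list *list, const char *src, size_t len)
{
    assert(list != NULL && src != NULL);
    token_node *node = malloc(sizeof *node);
    if (node == NULL)
        return -1;
    node->text = malloc(len + 1);
    if (node->text == NULL) {
        free(node);
        return -1;
    }
    memcpy(node->text, src, len);
    node->text[len] = '\0';
    node->length = len;
    node->next = NULL;
    node->prev = list->tail;
    if (list->tail != NULL)
        list->tail->next = node;
    else
        list->head = node;
    list->tail = node;
    list->count++;
    return 0;
}

/* Split input on any character in delims. Empty tokens are skipped
 * (an assumption; the original tokenizer may behave differently). */
static int token_list_split(token_list *list, const char *input,
                            const char *delims)
{
    const char *p = input;
    while (*p != '\0') {
        p += strspn(p, delims);          /* skip leading delimiters */
        size_t len = strcspn(p, delims); /* length of the next token */
        if (len == 0)
            break;
        if (token_list_append(list, p, len) != 0)
            return -1;
        p += len;
    }
    return 0;
}

/* Release every node and reset the header to an empty list. */
static void token_list_free(token_list *list)
{
    token_node *node = list->head;
    while (node != NULL) {
        token_node *next = node->next;
        free(node->text);
        free(node);
        node = next;
    }
    list->head = NULL;
    list->tail = NULL;
    list->count = 0;
}
```

A caller would start from a zeroed header (token_list list = {0};), call token_list_split(), and read list.count directly. The cost is the one Paul describes: two malloc() calls per token instead of one chunk.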
>>As for C-Specific issues I'm not sure what you mean.
>>Does my 'C' code have punctuation problems? :)
>>
>
> Ok, answers to that then: (not meant to be offensive)
>
> Your use of void* is ugly: Don't cast the return value of malloc(), and
> replace all those void*'s in your structs with pointers to the proper
> struct types.
You have a good point here, except that 'next' and 'prev' are
pointers to the same type of struct. This can be problematic
for some compilers. So, casting void* is more of a habit than
a necessity. Apparently it is not a problem with gcc.
typedef struct STR
{
    struct STR *next_str;
    struct STR *prev_str;
} STR;
As you note, this is a clearer representation of
the list member.
> There is no reason to be using: typedef struct foo_ {} foo;
> instead of: typedef struct {} foo;
> Unless of course you're taking advantage of it to have a pointer to
> itself in there somewhere, as mentioned above.
I thought it best to leave all options open. Those STR.next
and STR.prev voids should be changed. They function the same
with the type cast. (aesthetics)
> Magic numbers: What's with the extra 4 bytes you've allowed out the end
> of each token? I didn't look much, but the only thing I saw it being
> used for is to store at least one '\0'. Using memset there is also
> pointless... just use string[length_of_string + 1] = '\0'; (or you can
> use calloc)
The 4 bytes were done for testing purposes. I wanted to see what the
end of the allocation was looking like. I originally thought to make the
same code do 2 byte unicode strings as well, but as you note, 4 bytes is
a bit excessive.
I have a habit of clearing allocated storage using memset before use.
This is in case someone unwittingly takes a new allocation and then uses
strcat. If the allocation contains crap to start with, the resulting
concatenated string ends up being junk. I have noticed, when allocating
buffers this can be the case, causing strange bugs until found.
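A minimal sketch of the strcat problem and one way around it, under the assumption that the buffer only ever holds a C string. The contents of a malloc() allocation are indeterminate, so strcat() may scan through garbage looking for a terminator. A zero-filled allocation from calloc() (or simply setting the first byte to '\0') gives it a valid empty string to start from. The helper names new_string_buffer and append_checked are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: returns a zero-filled buffer with room for
 * `capacity` characters plus the terminating NUL, or NULL on failure.
 * Because every byte is 0, the buffer is already a valid empty string. */
static char *new_string_buffer(size_t capacity)
{
    return calloc(capacity + 1, 1);
}

/* Hypothetical helper: append src to buf only if it fits in `capacity`
 * characters. Returns 0 on success, -1 if src would not fit. */
static int append_checked(char *buf, size_t capacity, const char *src)
{
    assert(buf != NULL && src != NULL);
    size_t used = strlen(buf); /* safe: buf always holds a terminated string */
    size_t add = strlen(src);
    if (add > capacity - used)
        return -1;
    memcpy(buf + used, src, add + 1); /* copies the NUL as well */
    return 0;
}
```

If zeroing the whole allocation is not wanted, buf[0] = '\0' after malloc() is enough for strcat() to behave. Clearing everything is the more defensive habit.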
Peter