[sv-bc] What is a token?

From: Bresticker, Shalom <shalom.bresticker_at_.....> Date: Thu Oct 26 2006 - 00:26:32 PDT

I came to this question in the context of macros (for example, where
macros can be substituted and where they cannot), but the question is
relevant elsewhere as well.

In 1364-2005, 3.1 says, 

"The types of lexical tokens in the language are as follows:

- White space
- Comment
- Operator
- Number
- String
- Identifier
- Keyword"

That does not seem to be complete. For example, a semicolon at the end
of a statement is a token, but we do not think of it as an operator.

And this says that a number is a single token, whereas 3.5.1 says that
a based integer constant is made up of up to three tokens!
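
To make the 3.5.1 reading concrete, here is a rough Python sketch (mine,
not from the LRM) that splits a based constant into the pieces 3.5.1
treats as separate tokens: an optional size, the base format, and the
digits. The regular expression and the name split_based_number are
illustrative assumptions, and the pattern does not try to validate
digits against the base.

```python
import re

# Assumed pattern for a based integer constant, following the
# "separate tokens" reading of 3.5.1. White space is tolerated
# between the three pieces, but not inside any one of them.
BASED_NUMBER = re.compile(
    r"""
    (?P<size>[1-9][0-9_]*)?                         # optional size constant
    \s*
    (?P<base>'[sS]?[bBoOdDhH])                      # apostrophe, optional s, base letter
    \s*
    (?P<digits>[0-9a-fA-FxXzZ?][0-9a-fA-FxXzZ?_]*)  # value digits
    """,
    re.VERBOSE,
)


def split_based_number(text):
    """Return the size/base/digits pieces that are present, in order."""
    match = BASED_NUMBER.fullmatch(text.strip())
    if match is None:
        raise ValueError(f"not a based integer constant: {text!r}")
    pieces = (match.group(name) for name in ("size", "base", "digits"))
    return [piece for piece in pieces if piece is not None]


print(split_based_number("8'hFF"))   # ['8', "'h", 'FF']
print(split_based_number("5 'D 3"))  # ['5', "'D", '3']
print(split_based_number("'b1010"))  # ["'b", '1010']
```

Under this reading, "Number" in the 3.1 list names a grammatical unit
rather than a single lexical token, which is exactly the mismatch I am
asking about.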

Tokens are not explicitly mentioned in many places in the LRM. One of
the few places is 19.3.1, where it says, 

"The text specified for macro text shall not be split across the
following lexical tokens:

- Comments
- Numbers
- Strings
- Identifiers
- Keywords
- Operators"

The intent seems to have been to say "all tokens except white space".
Would that be correct?

The question then comes up, what is a token and what is not a token?

At first, one might propose that tokens are delimited by the places
where white space is allowed. If white space is not allowed between two
pieces of text, then we have one token and not two.

The first sentence is probably correct. If white space is allowed, then
we have two tokens.

But I think the second sentence is not correct. I think we have cases
where white space is not allowed, but it is still two tokens and not
one. For example, a time literal is an unsigned or fixed point number
followed by a time unit. A space is not allowed between them. Yet I
think it is still two tokens?

Do we have an exhaustive list of where space is not allowed between
tokens?
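
One way to picture "two tokens, but no white space allowed between them"
is a lexer that records, for each token, whether it was glued to the
previous one, and leaves the adjacency check to a later stage. The
following Python sketch is only an illustration of that idea. The names
lex and is_time_literal, the token kinds, and the set of time units are
my assumptions, not anything taken from the LRM.

```python
import re

# Assumed set of time units, for illustration only.
TIME_UNITS = {"s", "ms", "us", "ns", "ps", "fs"}

TOKEN = re.compile(
    r"(?P<ws>\s+)"
    r"|(?P<number>\d+(?:\.\d+)?)"
    r"|(?P<word>[A-Za-z_]\w*)"
)


def lex(text):
    """Return (kind, text, glued) triples.

    glued is True when no white space separates the token from the
    token before it.
    """
    tokens = []
    pos = 0
    glued = False  # the first token has nothing to be glued to
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if match is None:
            raise ValueError(f"unexpected character at offset {pos}")
        pos = match.end()
        if match.lastgroup == "ws":
            glued = False
            continue
        tokens.append((match.lastgroup, match.group(), glued))
        glued = True
    return tokens


def is_time_literal(tokens):
    """Accept exactly: number, then a time unit glued to it."""
    if len(tokens) != 2:
        return False
    (kind1, _, _), (kind2, unit, glued) = tokens
    return kind1 == "number" and kind2 == "word" and unit in TIME_UNITS and glued


print(lex("10ns"))                    # [('number', '10', False), ('word', 'ns', True)]
print(is_time_literal(lex("10ns")))   # True
print(is_time_literal(lex("10 ns")))  # False
```

In this model the token stream for "10ns" and "10 ns" differs only in
the glued flag, so the no-space rule becomes a constraint on two tokens
rather than evidence that there is only one.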

I think tokenizing is also context specific.

I remember that in 1364, we discussed whether @* is one token or two,
and whether @(*) is one, two, or four. It was never formally settled,
but I think that de facto, the answers were two and four, respectively.
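
The practical difference between the two answers shows up as soon as
white space appears. Below is a small longest-match lexer sketch in
Python, run with two hypothetical operator tables: one where @, (, *
and ) are separate tokens, and one that additionally treats @* as a
single fused token. The tables and the name lex_punct are assumptions
made for illustration.

```python
def lex_punct(text, operators):
    """Longest-match lexer over a fixed operator table; white space is skipped."""
    by_length = sorted(operators, key=len, reverse=True)
    tokens = []
    pos = 0
    while pos < len(text):
        if text[pos].isspace():
            pos += 1
            continue
        for op in by_length:
            if text.startswith(op, pos):
                tokens.append(op)
                pos += len(op)
                break
        else:
            raise ValueError(f"unexpected character {text[pos]!r} at offset {pos}")
    return tokens


# Hypothetical table 1: every punctuation character is its own token.
SEPARATE = {"@", "(", ")", "*"}
# Hypothetical table 2: same, plus @* as one fused token.
FUSED = SEPARATE | {"@*"}

print(lex_punct("@*", SEPARATE))       # ['@', '*']
print(lex_punct("@ ( * )", SEPARATE))  # ['@', '(', '*', ')']
print(lex_punct("@*", FUSED))          # ['@*']
print(lex_punct("@ *", FUSED))         # ['@', '*']
```

With the separate-token table, "@*" and "@ *" produce the same token
stream. With the fused table they do not, so the grammar would have to
say explicitly whether "@ *" is legal.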

What about, for example, a (vw) entry in a UDP table (see Table 8-1)? 

Maybe you compiler people can publish a list of tokens so we can get
agreement and commonality on this?

And where are macro calls allowed? For any token, whatever a token is?
Only where white space delimiters are allowed? I don't think the latter
is correct.

For example, taking a time literal again, can it be written as `A`B,
where `A is a number and `B is a time unit? This is related to the
question of whether macros insert white space before and after them,
which I think we discussed. See also Mantis 1339.
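
To show why the white space question decides the `A`B case, here is a
toy macro expander in Python with a switch for padding each substituted
body with a space on both sides. It is only a sketch: the name expand,
the pad parameter, and the macro table are hypothetical, and it handles
neither macro arguments nor nested macros.

```python
import re

# A macro call: a backtick followed by a simple identifier.
MACRO_CALL = re.compile(r"`([A-Za-z_][A-Za-z0-9_$]*)")


def expand(text, macros, pad):
    """Replace each `NAME with its body.

    When pad is True, the body is surrounded by single spaces, modelling
    a preprocessor that inserts white space around substituted text.
    """
    def substitute(match):
        body = macros[match.group(1)]  # KeyError for an undefined macro
        return f" {body} " if pad else body

    return MACRO_CALL.sub(substitute, text)


# Hypothetical definitions: `define A 10 and `define B ns
macros = {"A": "10", "B": "ns"}

print(repr(expand("`A`B", macros, pad=False)))  # '10ns'
print(repr(expand("`A`B", macros, pad=True)))   # ' 10  ns '
```

Without padding, the result is the text 10ns, which could then be read
as a time literal. With padding, the number and the unit are separated
by white space, which a time literal does not allow.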

Thanks,

Shalom

Shalom Bresticker

Intel Jerusalem LAD DA

+972 2 589-6852

+972 54 721-1033

I don't represent Intel