Re: [sv-bc] hex number in string literal

From: Krishanu Debnath <krishanu_at_.....>
Date: Wed Mar 29 2006 - 05:18:57 PST

Greg Jaxon wrote:
> Krishanu Debnath wrote:
>>>>  >> Now consider this example: (note the embedded comments)
>>>>  >>
>>>>  >> module sample;
>>>>  >>
>>>>  >>     string s;
>>>>  >>
>>>>  >>     initial
>>>>  >>     begin
>>>>  >>         s = "\x41";  // this means now s is "A". ASCII value of A is 0x41.
>>>>  >>         $display("value of s %s \n", s);
>>>>  >>
>>>>  >>         s = "\x4142"; // does this mean s is "A42" ?
>>>>  >>         $display("value of s %s \n", s);
>>>>  >>
>>>>  >>         s = "\x41\x42"; // does this mean s is "AB" ?
>>>>  >>         $display("value of s %s \n", s);
>>>>  >>
>>>>  >>         s = "\x4"; // less than two characters followed by x, 
>>>>  >>                    // so it will be not treated as hex number.
>>>>  >>         $display("value of s %s \n", s);
>>>>  >>     end
>>>>  >> endmodule
>>>>  >>
>>>>  >> Does the above make sense?
> 
> The question makes a lot of sense.  The LRM is very far from definitive
> on this subject.  To be fair, the C standard where this syntax got started
> is a bit ambiguous, too.  Its BNF says the octal escapes are 1-3 digits
> and its hexadecimal escapes go for as long as they can.  But then C says
> two confusing extra constraints:  Each octal or hexadecimal escape sequence
> is the longest sequence of characters that can constitute the escape 
> sequence.
> [Subject, we assume, to the BNF 1-3 digit definition, so \177 is not \17
> followed by ascii "7"?]

Correct. [I will discuss everything in terms of strings, since SV has no
character literals; it treats the elements of a string as bytes.]

> Secondly they want the value of the octal or
> hexadecimal escape sequence to be in range of representable unsigned char
> or wchar_t data.
> 
> I think we should consider that the natural intent is for each escape
> sequence to produce exactly one character element of the string. 

That's true for C/C++. Consider this:

char *str = "\x000041";

The string above contains two characters: the character whose hex value is
0x000041, and the terminating NUL character. If the hex value does not fit in
an unsigned char (that is, it exceeds UCHAR_MAX), the behavior is undefined
[implementation-defined?].
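
To make the point above concrete, here is a minimal C sketch. The names
one_escape, one_escape_size and first_byte are purely illustrative. It relies
only on the C rule that a hex escape consumes every hex digit that follows it,
and it assumes a C11 compiler for static_assert.

```c
#include <assert.h>
#include <stddef.h>

/* All six hex digits belong to the single escape, so the array holds
   exactly two elements: the character 0x41 and the terminating NUL. */
static const char one_escape[] = "\x000041";

/* Checked at compile time: one escape gives one character, plus NUL. */
static_assert(sizeof one_escape == 2, "one character plus terminating NUL");

/* Size of the literal in bytes, terminating NUL included. */
size_t one_escape_size(void)
{
    return sizeof one_escape;
}

/* Value of the first element, read as an unsigned byte. */
unsigned first_byte(void)
{
    return (unsigned char)one_escape[0];
}
```

The leading zeros do not produce extra NUL bytes; they are simply part of the
one numeric value 0x41.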

> To that
> end, there should be lexical cues that indicate whether the characters
> are to be 8 or 16 bits wide.  Those cues have to inform the lexical scan
> which upper bound to use on the escape sequence length.

The C standard does address this: any non-hexadecimal character terminates the
hex escape sequence. Consider this:

#include <stdio.h>

int main(void)
{
     /* Seven hex digits form one escape; 'H' ends it. */
     const char *s = "\x0000041H";
     printf("s = %s\n", s);
     return 0;
}

's' contains three characters:

1> The character whose hex value is 0x0000041 (that is, 0x41).
2> 'H'. Because 'H' is not a valid hexadecimal digit, it does not belong to
    the hex escape sequence.
3> The terminating NUL character.

The C compiler I am using uses the ASCII encoding (note: an implementation is
free to choose any encoding), so the above program prints "s = AH".
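
As a related illustration, the sketch below shows how one would write "A42" in
C, given that "\x4142" is read as a single escape. It uses adjacent string
literals, since escapes are converted before the literals are concatenated.
The names stops_at_H, a42 and is_a42 are made up for this example, and it
assumes an ASCII execution character set and a C11 compiler.

```c
#include <assert.h>
#include <string.h>

/* 'H' is not a hex digit, so the escape stops after "41": the string is
   the byte 0x41 followed by 'H'. */
static const char stops_at_H[] = "\x41H";

/* Adjacent literals: the escape cannot extend across the boundary, so
   this is the byte 0x41 followed by the ordinary characters '4' and '2'. */
static const char a42[] = "\x41" "42";

/* Three characters plus the terminating NUL. */
static_assert(sizeof a42 == 4, "0x41, '4', '2', NUL");

/* Returns nonzero when a42 equals "A42" (true under ASCII, where 'A' is 0x41). */
int is_a42(void)
{
    return strcmp(a42, "A42") == 0;
}
```

The split-literal form is the usual C workaround when the character after a
hex escape happens to be a valid hex digit.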

> 
> I don't think SV has wchar_t strings (yet).  But, surely it is inevitable.

Probably. But I will refrain from commenting on wide character constants
in this thread, except to note that a hex escape sequence has the same
semantics for both the char and wchar_t types in C/C++. wchar_t just provides
a wider range so that you can accommodate locale-specific characters. Its main
intent is to make room for Asian characters.

> 
> Octal notation for 16 bit characters is awkward - should the two pad bits
> both go into the first digit, or one each in the first and fourth digits?
> (\177777 vs \377377) I say neither... octal can only represent 8 bits of
> a char, or 9 of a wchar.
> 
> Hex notation can represent 4 or 8 bits of a char, 4,8,12, or 16 of a wchar.
> These right align in the char (wchar).
> 
> When lexing a  char string, 1-2 hex digits may be escaped.
> When lexing a wchar string, 1-4 hex digits may be escaped.
> 
> Allowing one escape to create several character elements makes the literals
> harder to migrate upward from char to wchar.   Requiring one
> escape per char means you can replace \x by \x00 and get a
> sensible result.  This also spares us from endianness problems trying
> to convert long escape sequences into byte streams.

I am not sure I fully understand your point, but I can summarize what I have
in mind. The following are the two candidates for hex-escape sequence semantics.

1> Adopt the semantics of C/C++, i.e. the whole sequence of valid hex digits is
    consumed as part of the same escape sequence. Each hex-escape sequence
    specifies one character of the string. So if you want to specify a string
    having three bytes [0xAB, 0xCD, 0xEF], you need to write it as
    "\xAB\xCD\xEF". Any non-hex digit terminates the hex sequence.

    Now, what happens if the value specified by the hex sequence doesn't fit in
    the byte (8-bit) data type? Remember, in SV every string element needs to
    fit in a byte. Assuming that byte remains the string element data type,
    should it be an error if the hex-escape value doesn't fit in a byte?

    Note that in C this behavior depends entirely on the particular
    implementation, because C doesn't fix the value range of a string's
    element type, i.e. the range of the char type [unlike SV, where it is
    8 bits]. I have heard of a system where char is 32 bits long.

2> Make the semantics the same as those of the octal-escape sequence, as I
    proposed in my first post. My concern is: what is the point of allowing an
    arbitrarily long hex sequence when the standard specifically restricts the
    upper limit of the value [which is the maximum value a byte can hold, and
    the size of a byte is fixed too]? I already described the semantics in the
    example, but I think the number of allowed hex digits could be changed from
    exactly 2 to [1-2].
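
The two candidates can be compared with a small C sketch of the decoding step
a lexer might perform. This is only an illustration, not proposed LRM text.
The function names (hex_value, decode_hex_greedy, decode_hex_bounded) and the
return conventions are my own assumptions. Each function receives a pointer to
the characters that follow "\x".

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper: numeric value of one hex digit.
   The caller guarantees isxdigit(c). */
static unsigned hex_value(char c)
{
    if (c >= '0' && c <= '9') return (unsigned)(c - '0');
    if (c >= 'a' && c <= 'f') return (unsigned)(c - 'a') + 10u;
    return (unsigned)(c - 'A') + 10u;
}

/* Candidate 1 (C-like): consume every hex digit that follows "\x".
   Returns the number of digits consumed, 0 if there are none, or -1 if
   the value does not fit in an 8-bit byte. */
int decode_hex_greedy(const char *digits, unsigned char *out)
{
    unsigned value = 0;
    int n = 0;

    assert(digits != NULL && out != NULL);
    while (isxdigit((unsigned char)digits[n])) {
        value = value * 16u + hex_value(digits[n]);
        if (value > 0xFFu) return -1;   /* out of range for a byte */
        n++;
    }
    if (n > 0) *out = (unsigned char)value;
    return n;
}

/* Candidate 2 (octal-like): consume at most two hex digits; whatever
   follows is ordinary string text. Returns the number of digits consumed. */
int decode_hex_bounded(const char *digits, unsigned char *out)
{
    unsigned value = 0;
    int n = 0;

    assert(digits != NULL && out != NULL);
    while (n < 2 && isxdigit((unsigned char)digits[n])) {
        value = value * 16u + hex_value(digits[n]);
        n++;
    }
    if (n > 0) *out = (unsigned char)value;
    return n;
}
```

With the digits "4142", the greedy version reports an out-of-range error, while
the bounded version yields the byte 0x41 and leaves "42" as ordinary
characters. With the single digit "4", the bounded version yields the byte 0x04,
which corresponds to the [1-2] variant rather than the exactly-two rule.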

Let me know if you need any more clarification.

Krishanu

-- 
int main(void){char p[]="ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz.\
  \n",*q="kl BIcNBFr.NKEzjwCIxNJC";int i=sizeof p/2;char *strchr();int putchar(\
);while(*q){i+=strchr(p,*q++)-p;if(i>=(int)sizeof p)i-=sizeof p-1;putchar(p[i]\
);}return 0;}