Anomalous behavior while comparing a unicode character to a unicode character range


3

For some reason, I'm getting unexpected results in the range comparisons of unicode characters.

To summarize, in my minimized test code, ("\u1000".."\u1200") === "\u1100" is false, where I would expect it to be true -- while the same test against "\u1001" is true as expected. I find this utterly incomprehensible. The results of the < operator are also interesting -- they contradict ===.

The following code is a good minimal illustration:

# encoding: utf-8

require 'pp'

a = "\u1000"
b = "\u1200"

r = (a..b)

x = "\u1001"
y = "\u1100"

pp a, b, r, x, y

puts "a < x = #{a < x}"
puts "b > x = #{b > x}"

puts "a < y = #{a < y}"
puts "b > y = #{b > y}"

puts "r === x = #{r === x}"
puts "r === y = #{r === y}"

I would naively expect that both of the === operations would produce "true" here. However, the actual output of running this program is:

ruby 1.9.3p125 (2012-02-16 revision 34643) [x86_64-darwin11.3.0]
"\u1000"
"\u1200"
"\u1000".."\u1200"
"\u1001"
"\u1100"
a < x = true
b > x = true
a < y = true
b > y = true
r === x = true
r === y = false

Could someone enlighten me?

(Note I'm on 1.9.3 on Mac OS X, and I'm explicitly setting the encoding to utf-8.)

2012-04-04 22:35
by Perry
What does (a..b) mean? - Kit Ho 2012-04-04 23:43
@KitHo (a..b) produces a Range object which has a min of a and max of b and can be enumerated over or tested for inclusion - dbenhur 2012-04-04 23:50


3

ACTION: I've submitted this behavior as bug #6258 to ruby-lang.

There's something odd about the collation order in that range of characters:

irb(main):081:0> r.to_a.last.ord.to_s(16)
=> "1036"
irb(main):082:0> r.to_a.last.succ.ord.to_s(16)
=> "1000"
irb(main):083:0> r.min.ord.to_s(16)
=> "1000"
irb(main):084:0> r.max.ord.to_s(16)
=> "1200"

The min and max for the range are the expected values from your input, but if we turn the range into an array, the last element is "\u1036" and its successor is "\u1000". Under the covers, Range#=== must be enumerating the String#succ sequence rather than simply bounds-checking against min and max.

If we look at the source for Range#===, we see it dispatches to Range#include?. The Range#include? source shows special handling for strings: if the answer can be determined by string length alone, or all the involved strings are ASCII, we get simple bounds checks. Otherwise we dispatch to super, which means the question gets answered by Enumerable#include?, which enumerates using Range#each, which again has special handling for strings and dispatches to String#upto, which enumerates with String#succ.

String#succ has a bunch of special handling when the string contains characters for which is_alpha or is_digit holds (which should not be true for U+1036); otherwise it increments the final char using enc_succ_char. At this point I lose the trail, but presumably this calculates a successor using the encoding and collation information associated with the string.

BTW, as a workaround, you could use a range of integer ordinals and test against ordinals if you only care about single chars, e.g.:
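A minimal sketch of the difference, assuming Ruby 1.9 or later, where Range#cover? is available: cover? only compares the candidate against the two endpoints with <=> and never walks the String#succ sequence, so it is not affected by whatever succ does in the middle of the range. The variable names are just for illustration.

```ruby
# encoding: utf-8

r = ("\u1000".."\u1200")
y = "\u1100"

# Range#cover? compares y against the endpoints only; String#succ is never called.
covered = r.cover?(y)

# The same bounds check written out by hand with <=>.
manual = (r.begin <=> y) <= 0 && (y <=> r.end) <= 0

puts "r.cover?(y)         = #{covered}"
puts "manual bounds check = #{manual}"
```

Both lines should agree with what the < and > tests in the question report, since they use the same String#<=> comparison.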

r = (a.ord..b.ord)
r === x.ord
r === y.ord
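Wrapped up as a small helper, the workaround might look like the sketch below. The method name char_in_range? is made up for this example and is not part of Ruby; it assumes all three arguments are single-character strings.

```ruby
# encoding: utf-8

# Hypothetical helper (not a Ruby built-in): true when the code point of ch
# lies between the code points of lo and hi, inclusive.
def char_in_range?(ch, lo, hi)
  # An Integer range answers include? with a plain numeric bounds check.
  (lo.ord..hi.ord).include?(ch.ord)
end

puts char_in_range?("\u1001", "\u1000", "\u1200")
puts char_in_range?("\u1100", "\u1000", "\u1200")
puts char_in_range?("\u1201", "\u1000", "\u1200")
```

Because the range holds integers rather than strings, String#succ never enters the picture.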
2012-04-04 23:41
by dbenhur
So is the general point that the behavior of === is different from that for <=>? This seems... rather odd. I can understand if comparisons are somehow locale dependent or what have you, but not if range membership and <=> use a really different algorithm. I also can't find any documentation that might explain this at all.. - Perry 2012-04-04 23:52
Why would #=== be the same as #<=>? The former is the "Case Equality" operator and produces a bool; the latter is the special three-valued comparison operator. === says nothing about less/greater than, only equality, with special semantics that are handy in case conditions - dbenhur 2012-04-04 23:57
My point is that the culprit is String#succ. And there's a bug or surprise calculating "\u1036".succ with default encoding and locale collation - dbenhur 2012-04-04 23:58
I would naively assume that <, >, etc, which are defined in the String class documentation as being built on <=> (which see) would use the same collating sequence as ===. BTW, the only reference to the collating sequence in the documentation for String is an oblique hint in succ - Perry 2012-04-04 23:59
Also, I find essentially no documentation anywhere in the ruby docs on collating sequences and what influences them. If that is the cause of what is going on here, I'd appreciate pointers (if they exist).. - Perry 2012-04-05 00:00
I found this via Google: http://developer.mimer.com/collations/myanmar/MyanmarCollation.pdf. We're dealing with the intricacies of the Burmese language, which I don't claim to understand at all, but there is a special note with regard U1036, and the comment "It is slightly more complicated than this". I think the answer to this problem is "Unicode is complicated" - matt 2012-04-05 00:02
Okay, so far the summary seems to be "things operate unexpectedly weirdly because of unicode collating sequences". However, I'm still confused about why < and other <=> derived operators don't seem to follow collation ordering and === does. Also, some explanation of what collating order does when a range exceeds a single unicode script block would be useful. (Is any of this documented?) - Perry 2012-04-05 00:06
@Perry, I suspect < etc do follow collating order, the problem is that while "\u1036" is indeed less than "\u1037", 1037 is not the next character after 1036. Range uses #succ (alias of #next) to enumerate the values for #include? test, and finds what looks like a loop in the collation sequence giving a smaller set than one expects given the end points - dbenhur 2012-04-05 01:22
Reading the code for String#upto it's not clear to me how the enumeration ever finishes for "\u1000".upto("\u1037"). irb(main):116:0> c.ord.to_s(16) => "1035" irb(main):117:0> c.succ!.ord.to_s(16) => "1036" irb(main):118:0> c.succ!.ord.to_s(16) => "1000" irb(main):119:0> c.succ!.ord.to_s(16) => "1000" - dbenhur 2012-04-05 01:29
Well, at least I've learned one thing: this wasn't an instance of me being thick, the whole thing actually is a mess. (The code I was writing when I came across this has now been converted to compare integer codepoint numbers, as per your suggestion.) - Perry 2012-04-05 02:00
@Perry Yeah unicode is complicated and ruby is actually pretty new to it (go figure for a language invented and principally maintained in Japan). Glad I could help you get unstuck and not feel thick. :) BTW, I submitted this issue as a bug to ruby-lang - dbenhur 2012-04-05 02:05


2

Looks like Range doesn't mean what we think it means.

What I think is happening is that you're creating a Range that is trying to include letters, digits, and punctuation. Ruby is unable to do this and is not "understanding" that you essentially want an array of code points.

This is causing the Range#to_a method to fall apart:

("\u1000".."\u1099").to_a.size  #=> 55
("\u1100".."\u1199").to_a.size  #=> 154
("\u1200".."\u1299").to_a.size  #=> 73

The zinger is when you put all three together:

("\u1000".."\u1299").to_a.size  #=> 55

Ruby 1.8.7 works as expected, but only because, as Matt points out in the comments, "\u1000" there is just the literal "u1000": no Unicode.

The documentation comment in the String#succ C source shows that it doesn't just return the next code point:

Returns the successor to str. The successor is calculated by
incrementing characters starting from the rightmost alphanumeric (or
the rightmost character if there are no alphanumerics) in the
string. Incrementing a digit always results in another digit, and
incrementing a letter results in another letter of the same case.
Incrementing nonalphanumerics uses the underlying character set's
collating sequence.
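A few ASCII inputs make those rules easier to see. This is only an illustrative sketch; the inputs are the kind of cases the String#succ documentation itself uses.

```ruby
# Each input exercises one of the rules from the String#succ description.
inputs = ["az", "a9", "Zz", "***", "9"]

results = inputs.map { |s| [s, s.succ] }

results.each do |input, successor|
  puts "#{input.inspect}.succ = #{successor.inspect}"
end
# "az"  -> "ba"   rightmost letter wraps around and carries to the left
# "a9"  -> "b0"   a digit stays a digit, and the carry bumps the letter
# "Zz"  -> "AAa"  case is preserved and the string grows on overflow
# "***" -> "**+"  no alphanumerics, so only the last character is incremented
# "9"   -> "10"   a lone digit overflows into a longer string
```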

Range is doing something different than just next, next, next.

A Range over these characters follows the ASCII sequence:

('8'..'A').to_a
=> ["8", "9", ":", ";", "<", "=", ">", "?", "@", "A"]

But using #succ is totally different:

'8'.succ
=> '9'

'9'.succ
=> '10'  # if we were in a Range.to_a, this would be ":"
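The two behaviours can be put side by side in one script. As far as I can tell, string ranges are enumerated through String#upto, which treats a pair of single ASCII characters as a special case and simply steps through the character codes, so take the explanation in the comments as my reading rather than documented behaviour.

```ruby
# Enumerating the range: both endpoints are single ASCII characters,
# so the result walks the ASCII table from "8" to "A".
by_range = ("8".."A").to_a

# Calling succ by hand from the same starting point: the digit rules apply,
# so "9" rolls over to "10" instead of moving on to ":".
by_succ = ["8"]
3.times { by_succ << by_succ.last.succ }

puts by_range.inspect
puts by_succ.inspect
```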
2012-04-04 23:17
by joelparkerhenderson
What version of ruby and platform are you on? 1.9.3 seems to produce consistent results for me on this. (I don't think it can be encoding as I'm specifying utf-8 -- which are you using?) - Perry 2012-04-04 23:18
Okay, so you're having the same problem as me on 1.9.3 and 1.8.7 is behaving properly for you. This could be a bug, but it seems difficult to believe that a bug this bad could have survived into release.. - Perry 2012-04-04 23:24
r.to_a.size is 55 in 1.9.3 and 201 in 1.8.7. Why that is I don't know, something to do with strings and encodings in 1.9.3 I guess, but it could explain the difference - matt 2012-04-04 23:29
Of course, on 1.8.7 "\u1000" is just the literal "u1000" - no Unicode, so the range is just u1000, u1001, u1002 ... u1199, u1200 and everything works as expected - matt 2012-04-04 23:37
"\u1036".succ gives "\u1000" - dbenhur 2012-04-04 23:44