For some reason, I'm getting unexpected results in range comparisons of Unicode characters.
To summarize, in my minimized test code, ("\u1000".."\u1200") === "\u1100" is false, where I would expect it to be true -- while the same test against "\u1001" is true as expected. I find this utterly incomprehensible. The results of the < operator are also interesting -- they contradict ===.
The following code is a good minimal illustration:
# encoding: utf-8
require 'pp'
a = "\u1000"
b = "\u1200"
r = (a..b)
x = "\u1001"
y = "\u1100"
pp a, b, r, x, y
puts "a < x = #{a < x}"
puts "b > x = #{b > x}"
puts "a < y = #{a < y}"
puts "b > y = #{b > y}"
puts "r === x = #{r === x}"
puts "r === y = #{r === y}"
I would naively expect that both of the === operations would produce "true" here. However, the actual output of running this program is:
ruby 1.9.3p125 (2012-02-16 revision 34643) [x86_64-darwin11.3.0]
"\u1000"
"\u1200"
"\u1000".."\u1200"
"\u1001"
"\u1100"
a < x = true
b > x = true
a < y = true
b > y = true
r === x = true
r === y = false
Could someone enlighten me?
(Note I'm on 1.9.3 on Mac OS X, and I'm explicitly setting the encoding to utf-8.)
ACTION: I've submitted this behavior as bug #6258 to ruby-lang.
There's something odd about the collation order in that range of characters:
irb(main):081:0> r.to_a.last.ord.to_s(16)
=> "1036"
irb(main):082:0> r.to_a.last.succ.ord.to_s(16)
=> "1000"
irb(main):083:0> r.min.ord.to_s(16)
=> "1000"
irb(main):084:0> r.max.ord.to_s(16)
=> "1200"
The min and max for the range are the expected values from your input, but if we turn the range into an array, the last element is "\u1036" and its successor is "\u1000". Under the covers, Range#=== must be enumerating the String#succ sequence rather than doing simple bounds checking on min and max.
If we look at the source for Range#===, we see it dispatches to Range#include?. The Range#include? source shows special handling for strings: if the answer can be determined by string length alone, or all the involved strings are ASCII, we get simple bounds checks. Otherwise it dispatches to super, which means the question gets answered by Enumerable#include?. That enumerates using Range#each, which again has special handling for strings and dispatches to String#upto, which enumerates with String#succ.
String#succ has a bunch of special handling when the string contains is_alpha or is_digit characters (which should not be true for U+1036); otherwise it increments the final char using enc_succ_char. At this point I lose the trail, but presumably this calculates a successor using the encoding and collation information associated with the string.
BTW, as a workaround, you could use a range of integer ordinals and test against ordinals if you only care about single chars, e.g.:
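To make the chain above concrete, here is a rough Ruby sketch of what enumeration-based membership amounts to. It is not the actual C implementation: `member_by_succ?` and its `limit` parameter are made-up names for illustration, and the limit is only a guard so the sketch always terminates.

```ruby
# Hypothetical helper (not part of Ruby): decide membership by walking
# String#succ from the first endpoint, the way enumeration would,
# instead of comparing the candidate against the two endpoints.
def member_by_succ?(first, last, target, limit = 10_000)
  current = first
  limit.times do
    return true if current == target   # reached the candidate while walking
    return false if current == last    # hit the end without seeing it
    current = current.succ
  end
  false                                # gave up: the walk never reached `last`
end

puts member_by_succ?("a", "e", "c")   # true
puts member_by_succ?("a", "e", "z")   # false
```

The point is that the answer depends entirely on which strings the succ walk visits. If the walk skips the candidate or never reaches it, the result is false even though the candidate sorts between the endpoints.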
r = (a.ord..b.ord)
r === x.ord
r === y.ord
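A slightly fuller sketch of the same workaround, plus Range#cover?, which only compares the candidate against the two endpoints and never enumerates. The helper names `in_codepoint_range?` and `in_string_bounds?` are made up for this example.

```ruby
# encoding: utf-8

# Hypothetical helper: membership by code point, for single-character strings.
def in_codepoint_range?(char, first, last)
  (first.ord..last.ord).cover?(char.ord)
end

# Hypothetical helper: membership by endpoint comparison only.
# Range#cover? uses <=> against the endpoints and does not call String#succ.
def in_string_bounds?(str, first, last)
  (first..last).cover?(str)
end

puts in_codepoint_range?("\u1100", "\u1000", "\u1200")  # true
puts in_string_bounds?("\u1100", "\u1000", "\u1200")    # true
```

Note that cover? on a string range also accepts multi-character strings that sort between the endpoints, so the ordinal version is the stricter choice if you only want single characters.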
=== is different from that for <=>? This seems... rather odd. I can understand if comparisons are somehow locale dependent or what have you, but not if range membership and <=> use a really different algorithm. I also can't find any documentation that might explain this at all.. - Perry 2012-04-04 23:52
#=== be the same as #<=> ? The former is the "Case Equality" operator and produces bool, the latter the special three-value comparator operator. === says nothing about less/greater than, only equality, with special semantics handy for case conditions expected - dbenhur 2012-04-04 23:57
<, >, etc, which are defined in the String class documentation as being built on <=> (which see) would use the same collating sequence as ===. BTW, the only reference to the collating sequence in the documentation for String is an oblique hint in succ - Perry 2012-04-04 23:59
< and other <=> derived operators don't seem to follow collation ordering and === does. Also, some explanation of what collating order does when a range exceeds a single unicode script block would be useful. (Is any of this documented?) - Perry 2012-04-05 00:06
< etc do follow collating order, the problem is that while "\u1036" is indeed less than "\u1037", 1037 is not the next character after 1036. Range uses #succ (alias of #next) to enumerate the values for #include? test, and finds what looks like a loop in the collation sequence giving a smaller set than one expects given the end points - dbenhur 2012-04-05 01:22
Looks like Range doesn't mean what we think it means.
What I think is happening is that you're creating a Range that is trying to include letters, digits, and punctuation. Ruby is unable to do this and is not "understanding" that you essentially want an array of code points.
This is causing the Range#to_a method to fall apart:
("\u1000".."\u1099").to_a.size #=> 55
("\u1100".."\u1199").to_a.size #=> 154
("\u1200".."\u1299").to_a.size #=> 73
The zinger is when you put all three together:
("\u1000".."\u1299").to_a.size #=> 55
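If an array of code points is what you actually want, one way to get it is to build the range over integers and convert each one to a character. This is only a sketch; `codepoint_chars` is a made-up helper name.

```ruby
# encoding: utf-8

# Hypothetical helper: build one single-character UTF-8 string per code point,
# so String#succ is never involved.
def codepoint_chars(first_cp, last_cp)
  (first_cp..last_cp).map { |cp| cp.chr(Encoding::UTF_8) }
end

chars = codepoint_chars(0x1000, 0x1200)
puts chars.size                 # 513 (0x1200 - 0x1000 + 1)
puts chars.include?("\u1100")   # true
```

Because the range is over integers, the size is simply the distance between the endpoints and does not depend on how strings are incremented.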
Ruby 1.8.7 works as expected: as Matt points out in the comments, there "\u1000" is just the literal "u1000", because 1.8.7 has no Unicode support.
The documentation in the String#succ C source shows that it doesn't just return the next code point:
Returns the successor to <i>str</i>. The successor is calculated by
incrementing characters starting from the rightmost alphanumeric (or
the rightmost character if there are no alphanumerics) in the
string. Incrementing a digit always results in another digit, and
incrementing a letter results in another letter of the same case.
Incrementing nonalphanumerics uses the underlying character set's
collating sequence.
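A few ASCII cases that show those rules in action. The `succ_examples` helper is just a made-up wrapper for this illustration.

```ruby
# Hypothetical helper: pair each input with its String#succ result.
def succ_examples(inputs)
  inputs.map { |s| [s, s.succ] }
end

succ_examples(["az", "zz", "a9", "9", "***", "<<koala>>"]).each do |input, result|
  puts "#{input.inspect}.succ => #{result.inspect}"
end
# "az".succ => "ba"                 (letter wraps and carries left)
# "zz".succ => "aaa"                (carry grows the string)
# "a9".succ => "b0"                 (digit wraps, carry moves into the letter)
# "9".succ => "10"                  (digit carry grows the string)
# "***".succ => "**+"               (no alphanumerics: rightmost char is bumped)
# "<<koala>>".succ => "<<koalb>>"   (rightmost alphanumeric is bumped)
```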
Range is doing something different than just next, next, next.
A Range with these characters follows the ASCII sequence:
('8'..'A').to_a
=> ["8", "9", ":", ";", "<", "=", ">", "?", "@", "A"]
But using #succ is totally different:
'8'.succ
=> '9'
'9'.succ
=> '10' # if we were in a Range.to_a, this would be ":"
r.to_a.size is 55 in 1.9.3 and 201 in 1.8.7. Why that is I don't know, something to do with strings and encodings in 1.9.3 I guess, but it could explain the difference - matt 2012-04-04 23:29
"\u1000" is just the literal "u1000" - no Unicode, so the range is just u1000, u1001, u1002 ... u1199, u1200 and everything works as expected - matt 2012-04-04 23:37