Consider the following Ruby code analyzing a three-byte UTF-8 string:
#encoding: utf-8
s = "\x65\xCC\x81"
p [s.bytesize, s.length, s, s.encoding.name]
#=> [3, 2, "é", "UTF-8"]
As described on this page of mine, the above really is a two-character string: Latin lowercase e
followed by Combining Acute Accent. However, it looks like one character, and this matters when laying out fixed-width displays.
For example, look at the two entries for "moiré.svg" on this directory listing and notice how one of them has messed up the column alignment.
How can I calculate the 'monospace visual length' of a string in Ruby, which does not include any zero-width combining characters? (One valid technique might be a way to transform a Unicode string into its canonical representation, turning the above into "\xC3\xA9", which also looks like é but has a length of 1.)
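To make the normalization idea concrete, here is a minimal sketch. It assumes Ruby 2.2 or later, where String#unicode_normalize is built in. It only helps when a precomposed form of the character exists, so it is probably not a general answer on its own:

```ruby
# Sketch only: assumes Ruby 2.2+, where String#unicode_normalize is built in.
s = "e\u0301"                         # "e" + U+0301 COMBINING ACUTE ACCENT
composed = s.unicode_normalize(:nfc)  # compose into U+00E9 where possible

p [s.length, composed.length]                    #=> [2, 1]
p composed.bytes.map { |b| format("%02X", b) }   #=> ["C3", "A9"]

# Caveat: there is no precomposed "q with acute", so NFC leaves two characters.
q = "q\u0301".unicode_normalize(:nfc)
p q.length                                       #=> 2
```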
The unicode_utils gem may help:
http://unicode-utils.rubyforge.org/UnicodeUtils.html
There is a char_display_width method:
require "unicode_utils/char_display_width"
UnicodeUtils.char_display_width("別") # => 2
UnicodeUtils.char_display_width(0x308) # => 0
UnicodeUtils.char_display_width("a") # => 1
There is a string display_width method:
require "unicode_utils/display_width"
UnicodeUtils.display_width("別れ") # => 4
UnicodeUtils.display_width("12") # => 2
UnicodeUtils.display_width("a\u{308}") # => 1
Also look at each_grapheme.
(Thanks Michael Anderson for pointing out the additional methods)
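If you would rather not pull in a gem, newer Rubies can do something similar to each_grapheme on their own. A minimal sketch, assuming Ruby 2.5 or later for String#grapheme_clusters; note that it counts user-perceived characters only and, unlike display_width, does not treat characters such as 別 as two columns wide:

```ruby
# Sketch only: assumes Ruby 2.5+ for String#grapheme_clusters.
decomposed  = "moire\u0301.svg"  # "e" followed by U+0301 COMBINING ACUTE ACCENT
precomposed = "moir\u00E9.svg"   # single precomposed U+00E9

p decomposed.length                     #=> 10
p decomposed.grapheme_clusters.length   #=> 9
p precomposed.grapheme_clusters.length  #=> 9
```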
each_grapheme method may be more appropriate. http://unicode-utils.rubyforge.org/UnicodeUtils.html#method-c-each_graphem - Michael Anderson 2012-04-05 02:14
display_width that accepts a string rather than a character - Michael Anderson 2012-04-05 02:16
You could use a regex to get at the Unicode properties:
s = "\x65\xCC\x81"
# Count every character that is not a nonspacing mark (general category Mn)
count = s.each_char.count { |char| char !~ /\p{Mn}/ }
puts count #=> 1
This works in this case, but you'd have to work out which properties to exclude in a more robust solution.
Using the unicode_utils gem as suggested in @joelparkerhenderson's answer will probably be a better option, but I thought I'd include this for completeness.
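As a rough idea of what "more robust" might look like, here is a sketch that also skips enclosing marks (Me) and format characters (Cf, such as the zero-width joiner), plus a padding helper for column layout. The names visual_length and pad_to are made up for illustration, the choice of categories is my assumption rather than a complete list, and double-width East Asian characters are not handled at all:

```ruby
# Sketch only: treats nonspacing marks (Mn), enclosing marks (Me) and
# format characters (Cf, e.g. zero-width joiner) as zero-width.
# It does NOT handle double-width East Asian characters.
ZERO_WIDTH = /[\p{Mn}\p{Me}\p{Cf}]/

# visual_length and pad_to are hypothetical helper names for illustration.
def visual_length(str)
  str.each_char.count { |ch| ch !~ ZERO_WIDTH }
end

# Right-pad with spaces until the string occupies `width` visible columns.
def pad_to(str, width)
  str + " " * [width - visual_length(str), 0].max
end

decomposed  = "moire\u0301.svg"
precomposed = "moir\u00E9.svg"

p visual_length("e\u0301")      #=> 1
p visual_length("a\u200Db")     #=> 2
puts "[" + pad_to(decomposed, 12) + "]"
puts "[" + pad_to(precomposed, 12) + "]"
```

Both bracketed lines should come out the same visible width in a monospace terminal, even though the underlying strings differ in length.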
s.gsub(/\p{Mn}/,'').length not work correctly under some circumstance - Phrogz 2012-04-05 02:33
gsub interact with Unicode combining marks, e.g. whether the current behaviour is just an accident or whether it's deliberate, and how it might change in the future. I guess the moral is make sure you've got tests in place - matt 2012-04-05 02:45
I am far from being an expert in Ruby but this gives the following:
def length_utf8
count = 0
scan(/./mu) { count += 1 }
count
end
[3, 3, "é"]
- Ilia Frenkel 2012-04-05 01:46