Counting Unicode string length without combining marks

Consider the following Ruby code analyzing a three-byte UTF-8 string:

#encoding: utf-8
s = "\x65\xCC\x81"
p [s.bytesize, s.length, s, s.encoding.name]
#=> [3, 2, "é", "UTF-8"]

As described on this page of mine, the above really is a two-character string: Latin lowercase e followed by Combining Acute Accent. However, it renders as a single character, and this matters when laying out fixed-width displays.

For example, look at the two entries for "moiré.svg" on this directory listing and notice how one of them has messed up the column alignment.

How can I calculate the 'monospace visual length' of a string in Ruby, that is, a length that does not count any zero-width combining characters? (One valid technique might be a way to transform a Unicode string into its canonical composed representation, turning the above into "\xC3\xA9", which also looks like é but has a length of 1.)
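To illustrate the canonical-representation idea from the parenthetical above, here is a minimal sketch. It assumes Ruby 2.2 or later, where String#unicode_normalize exists; that method is not available in the Ruby 1.9 this question was written against. The sample strings are only illustrative.

```ruby
# Assumes Ruby 2.2+ (String#unicode_normalize is not in Ruby 1.9).
decomposed = "moire\u0301.svg"   # "e" followed by U+0301 COMBINING ACUTE ACCENT
composed   = decomposed.unicode_normalize(:nfc)

p decomposed.length  # => 10
p composed.length    # => 9

# NFC only shortens the string when a precomposed code point exists.
# There is no precomposed "q with acute", so the mark stays separate.
no_precomposed = "q\u0301".unicode_normalize(:nfc)
p no_precomposed.length  # => 2
```

As the last case shows, normalization alone is not a complete answer: any base character without a precomposed form still counts as two.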

ruby
unicode

2012-04-05 01:41
by Phrogz

Which version of Ruby do you have? I tried your example and got [3, 3, "é"] - Ilia Frenkel 2012-04-05 01:46

@IliaFrenkel The above refers to Ruby 1.9 with an encoding of UTF-8 for strings. I've edited the code to show the magic comment that would be required for a standalone script on any system where UTF-8 is not the default - Phrogz 2012-04-05 01:47

The unicode_utils gem may help:

http://unicode-utils.rubyforge.org/UnicodeUtils.html

There is a char_display_width method:

require "unicode_utils/char_display_width"
UnicodeUtils.char_display_width("別")  # => 2
UnicodeUtils.char_display_width(0x308) # => 0
UnicodeUtils.char_display_width("a")   # => 1

There is a string display_width method:

require "unicode_utils/display_width"
UnicodeUtils.display_width("別れ")       # => 4
UnicodeUtils.display_width("12")         # => 2
UnicodeUtils.display_width("a\u{308}")   # => 1

Also look at each_grapheme.

(Thanks Michael Anderson for pointing out the additional methods)

2012-04-05 02:05
by joelparkerhenderson

Just found this myself, but I think counting with the each_grapheme method may be more appropriate. http://unicode-utils.rubyforge.org/UnicodeUtils.html#method-c-each_graphem - Michael Anderson 2012-04-05 02:14

Or better yet, there is a display_width method that accepts a string rather than a single character - Michael Anderson 2012-04-05 02:16

You could use a regex to get at the Unicode properties:

s = "\x65\xCC\x81"
count = s.each_char.inject(0) do |c, char|
  c += 1 unless char=~/\p{Mn}/
  c
end

puts count #=> 1

This works in this case, but you'd have to work out which properties to exclude in a more robust solution.

Using the unicode_utils gem as suggested in @joelparkerhenderson's answer will probably be a better option, but I thought I'd include this for completeness.
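As a rough sketch of what a slightly more robust core-Ruby version might look like, the following excludes the whole Mark category (\p{M}, which covers Mn, Mc and Me) instead of only \p{Mn}. The helper names visual_length and pad_visual are hypothetical, made up for this example. It still assumes every remaining character is exactly one column wide, so it will be wrong for double-width East Asian characters, which is what the gem's display_width handles.

```ruby
# Hypothetical helper: counts characters, skipping every combining mark (\p{M}).
# Assumes each remaining character occupies exactly one column.
def visual_length(str)
  str.gsub(/\p{M}/, '').length
end

# Hypothetical helper: pads on the right by visual length, not String#length.
def pad_visual(str, width)
  str + ' ' * [width - visual_length(str), 0].max
end

p visual_length("e\u0301")                  # => 1
p visual_length("a\u20DD")                  # => 1 (U+20DD is an enclosing mark, category Me)
p "a\u20DD".gsub(/\p{Mn}/, '').length       # => 2 (\p{Mn} alone misses it)
p pad_visual("moire\u0301.svg", 12).length  # => 13 (10 code points + 3 spaces)
```

The padded string has 13 code points but occupies 12 columns, which is the point for the directory-listing case in the question.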

2012-04-05 02:29
by matt

I like this answer for its simplicity and for using only core Ruby. Is there any circumstance under which s.gsub(/\p{Mn}/,'').length would not work correctly? - Phrogz 2012-04-05 02:33

@Phrogz that seems to work, and is more concise than mine. I guess it depends on how things like gsub interact with Unicode combining marks, e.g. whether the current behaviour is just an accident or whether it's deliberate, and how it might change in the future. I guess the moral is make sure you've got tests in place - matt 2012-04-05 02:45

-1

I am far from being an expert in Ruby but this gives the following:

class String
  # Counts one match per UTF-8 character (code point), including combining marks.
  def length_utf8
    count = 0
    scan(/./mu) { count += 1 }
    count
  end
end

2012-04-05 01:55
by Ilia Frenkel

This also gives 2 for the string provided by @Phrogz - Jordan Running 2012-04-05 02:05
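For anyone wondering why: /./ matches one code point at a time, so the combining accent is counted separately. A possible variation, offered only as a sketch, is to scan with \X, which Ruby's regex engine matches as an extended grapheme cluster (a base character plus any combining marks attached to it).

```ruby
s = "e\u0301"   # "e" followed by U+0301 COMBINING ACUTE ACCENT

p s.scan(/./mu).length  # => 2 (one match per code point)
p s.scan(/\X/).length   # => 1 (one match per grapheme cluster)
```

Note that a grapheme count is still not a display width: a double-width character is one cluster but takes two columns.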