ホーム > perl

perlのアーカイブ

「danコガいはもう正規表現をblogに書くな」と言わせないでくれ

投稿者: chiba
2009/3/20 金曜日 1:00:13
perl

(タイトルはid:hasegawayosukeさんが言ってたよ)

ああ、まただよ…

「PHP使いはもう正規表現をblogに書くな」と言わせないでくれ

正規表現って、プログラミング言語間の差が少ないサブ言語なのに、なぜ「DAN」がつくとダメ正規表現ばかり登場するのか。うんざりだ。

飽きたので以下略。

簡潔に。(正規表現はdanさんのものからシングルクォートコンテキストにあわせてエスケープをしてあります)

<?php
$email = 'test@example.com' . "\n";
$re = '/^(?:(?:(?:(?:[a-zA-Z0-9_!#\$\%&\'*+\/=?\^`{}~|\-]+)(?:\.(?:[a-zA-Z0-9_!#\$\%&\'*+\/=?\^`{}~|\-]+))*)|(?:"(?:\\\\[^\r\n]|[^\\\\"])*")))\@(?:(?:(?:(?:[a-zA-Z0-9_!#\$\%&\'*+\/=?\^`{}~|\-]+)(?:\.(?:[a-zA-Z0-9_!#\$\%&\'*+\/=?\^`{}~|\-]+))*)|(?:\[(?:\\\\\S|[\x21-\x5a\x5e-\x7e])*\])))$/';
if (preg_match($re,$email)) {
    echo "valid";
} else {
    echo "invalid";
}
?>

これの結果が「valid」になる。当然rfc5322でdot-atomには改行は(CRであれLFであれ)許されていない。

対策はdanさん自身が^$でなくて\A\zを使おうで述べているとおり。

<?php
$email = 'test@example.com' . "\n";
$re = '/\A(?:(?:(?:(?:[a-zA-Z0-9_!#\$\%&\'*+\/=?\^`{}~|\-]+)(?:\.(?:[a-zA-Z0-9_!#\$\%&\'*+\/=?\^`{}~|\-]+))*)|(?:"(?:\\\\[^\r\n]|[^\\\\"])*")))\@(?:(?:(?:(?:[a-zA-Z0-9_!#\$\%&\'*+\/=?\^`{}~|\-]+)(?:\.(?:[a-zA-Z0-9_!#\$\%&\'*+\/=?\^`{}~|\-]+))*)|(?:\[(?:\\\\\S|[\x21-\x5a\x5e-\x7e])*\])))\z/';
if (preg_match($re,$email)) {
    echo "valid";
} else {
    echo "invalid";
}
?>

ただしjavascriptではmフラグをつけない限りは$は改行直前にはマッチしないので問題なかったりもする。

ところでこの正規表現には他にも問題が残っている。domain-literalで\\\Sにマッチするようになっているがこれはなんなのだろう。

<?php
$email = 'test@[127.0.0.1' . "\\\x1f]";
$re = '/\A(?:(?:(?:(?:[a-zA-Z0-9_!#\$\%&\'*+\/=?\^`{}~|\-]+)(?:\.(?:[a-zA-Z0-9_!#\$\%&\'*+\/=?\^`{}~|\-]+))*)|(?:"(?:\\\\[^\r\n]|[^\\\\"])*")))\@(?:(?:(?:(?:[a-zA-Z0-9_!#\$\%&\'*+\/=?\^`{}~|\-]+)(?:\.(?:[a-zA-Z0-9_!#\$\%&\'*+\/=?\^`{}~|\-]+))*)|(?:\[(?:\\\\\S|[\x21-\x5a\x5e-\x7e])*\])))\z/';
if (preg_match($re,$email)) {
    echo "valid";
} else {
    echo "invalid";
}
?>

おかげで上記のコードもvalidだ。なんてこった。domain-literalなんてそもそもはずしてもいいような気もするけど、対応するとしたらこれでいんじゃないだろうか。

<?php
$email = 'test@[127.0.0.1' . "\\\x1f]";
$re = '/\A(?:(?:(?:(?:[a-zA-Z0-9_!#\$\%&\'*+\/=?\^`{}~|\-]+)(?:\.(?:[a-zA-Z0-9_!#\$\%&\'*+\/=?\^`{}~|\-]+))*)|(?:"(?:\\\\[^\r\n]|[^\\\\"])*")))\@(?:(?:(?:(?:[a-zA-Z0-9_!#\$\%&\'*+\/=?\^`{}~|\-]+)(?:\.(?:[a-zA-Z0-9_!#\$\%&\'*+\/=?\^`{}~|\-]+))*)|(?:\[[\x21-\x5a\x5e-\x7e]*\])))\z/';
if (preg_match($re,$email)) {
    echo "valid";
} else {
    echo "invalid";
}
?>

\\\Sになんか意味があるんだったらすみません。知ってる人いたら教えてください。

追記1(2009/03/20 01:57): コード例の(html)エスケープがおかしかったのを直しました。
追記2(2009/03/20 05:36): \\\Sに関してdanさんより404 Blog Not Found:regexp – ‘test@[127.0.0.1’ . "\\\x1f]" はRFC2822準拠で説明いただきました。ちなみにRFC5322にはquoted-pairはないかとおもってたら、obs-dtextの中にしっかり生き残ってた…。

はてなブックマーク - 「danコガいはもう正規表現をblogに書くな」と言わせないでくれ

コメント: 0
トラックバック: 2

libmysqlclientを使うプログラムはset namesをutf8であっても使ってはいけない

投稿者: chiba
2009/2/5 木曜日 2:41:22
mysql | perl

mysql_enable_utf8 => 1 で DBIC::UTF8Columns 要らなくなるっぽいComments
上記の記事のブクマに

set namesを直接実行しちゃうのはutf8であってもコンパイルオプションによっては問題起こるのでお勧めできない
http://b.hatena.ne.jp/nihen/20090204#bookmark-11950629

ってことを書かせてもらったんだけど、この最後のset namesはutf8でも使っちゃダメという話を軽く説明します。

まずは、基本的なことはMySQL5開拓団 – 日本語処理の鉄則 / KLab株式会社を読んでください。mysqlの日本語処理についてのドキュメントとしては、私は今一番信頼できるドキュメントだと思っています。

さて、上記のページの< 図3：クライアント側文字コードの指定チャート>を、勝手ながらすべて引用させていただくことにする。(手抜きもいいところだな）

< 図3：クライアント側文字コードの指定チャート>

■初期値の設定
│
├mysqlコマンドの場合
│└【my.cnfの[mysql]にdefault-character-setで指定する】
│
└my.cnfを読めてdefault-character-setを解釈することができるクライアントか？
  ├(yes)→【my.cnfの[client]にdefault-character-setで指定する】
  └(no )→「SET NAMESコース」へ

■途中で変更したい
│
├【途中で変更しなければならないような構成はやめて、初期値だけに頼るようにする】
│
├mysqlコマンドの場合
│└5.0.25以上か？
│  ├(yes)→【charset命令で指定する】
│  └(no) →「SET NAMESコース」へ
│
└C言語APIのmysql_set_character_set()かmysql_options()が使えるクライアントか？
  ├(yes)→【mysql_set_character_set()かmysql_options()で指定する】
  └(no) →「SET NAMESコース」へ

■SET NAMESコース
│
└指定したいのはシフトJIS（cp932かsjis）か？
  │
  ├(no )→【SET NAMES文で指定する】
  └(yes)→【MyNAパッチ（注5）を当てた上で、SET NAMES文で指定する】
http://www.klab.jp/media/mysql/index6.html

で、今回のDBD::mysqlはどこにあてはまるかというと、、

my.cnfを読めてdefault-character-setを解釈することができるクライアントか？

=> yes
mysql_read_default_fileを指定することにより読み込めます。

use DBI;
my $dbh = DBI->connect(
    'DBI:mysql:database=sandbox;host=localhost;mysql_read_default_file=/etc/mysql/my.cnf',
    qw/id password/,
);

下記のように一時ファイルを使うこともできるようにファイルの場所はどこでもかまわなかったりする。

use DBI;
use File::Temp qw/tempfile/;

my ($fh, $filename) = tempfile();
print {$fh} "[client]\ndefault-character-set=utf8\n";
close $fh;

my $dbh = DBI->connect(
    'DBI:mysql:database=sandbox;host=localhost;mysql_read_default_file=' . $filename,
    qw/id password/,
);

さて、実は上記のようにmysql_read_default_fileを使う方法がDBD::mysqlで日本語を扱う際のほぼ唯一の”正しい”接続方法になる。

なぜか？

とりあえず上記のチャートを進めてみよう。

└C言語APIのmysql_set_character_set()かmysql_options()が使えるクライアントか？

=> 半分 yes /半分no
mysql_set_character_set()は使えない。
mysql_options()は直接は使えないが、上記で話題にあがったmysql_enable_utf8を使うとmysql_optionsでMYSQL_SET_CHARSET_NAMEが設定できる。のでutf8の場合のみこのオプションを使うことで”正しい”接続ができる。experimentalだけどね。ちなみにphpではmysql_set_charsetというのがちゃんと用意されていたりする。

さて、最後に問題のSET NAMESコースである。

■SET NAMESコース
│
└指定したいのはシフトJIS（cp932かsjis）か？
  │
  ├(no )→【SET NAMES文で指定する】
  └(yes)→【MyNAパッチ（注5）を当てた上で、SET NAMES文で指定する】

さて、ここではShift_JIS以外はSET NAMESでも問題ないと書いてあるように読める。実際ほとんどの場合はそうなのだ。
しかし、これはlatin-1がlibmysqlclientのデフォルトキャラクタセットの場合だけであって（実際コンパイルのデフォルトはそうなっている）libmysqlclientを–witth-charset=cp932オプション付きでコンパイルしていた場合なんかだとその限りではない。
–with-charset=cp932でコンパイルしているlibmysqlclientの場合、下記のコードにおいて\x5cがエスケープされず、SQLインジェクションの危険性が発生する。(手軽に確認するならmysql_read_default_fileを使ってdefault-character-setにcp932をセットすればいい)

my $dbh = DBI->connect(
    'DBI:mysql:database=sandbox;host=localhost',
    qw/id password/,
);

$dbh->do('set names utf8');
my $sth = $dbh->prepare('insert into sandbox (name) values(?)');
$sth->execute("\xe3\x81\x95\x5c");

“\xe3\x81\x95\x5c”というのは分解すると
\xe3\x81\x95 == “さ”(UTF-8)
\xe3\x81 == “縺”(Shift_JIS)
\x95\x5c == “表”(Shift_JIS)
になる。
libmysqlclientは”\xe3\x81\x95\x5c”をShift_JISとしてparseしてエスケープするため最後の\x5cはエスケープされない。しかしset names utf8しているのでmysqlサーバ側はutf8としてパースし、最後の\x5cをバックスラッシュとして処理してしまう。

長々と書いてしまったが、libmysqlclientを(間接的にでも)使っているプログラマは、libmysqlclientが認識している文字コードとmysqldが認識している文字コードはかならず一致させるという原則を守るといいと思うよ。もちろん今回出したケースはかなり特殊なケースでコンパイルオプションをきちんと管理していれば起こらない問題だけどプログラム側で回避できる問題でもあるのできちんと対策しとくべき。

まぁ、これいっちゃうとlatin-1使う場合でもちゃんとdefault-character-set指定しないと駄目ってことでなんか極論な気がしないでもないけどね・・・。

追記1; ああ、重要なことを1点言い忘れてるね。mysql_server_prepareを使えばこういう問題はもちろん起きません。なのでこっちの対策を個人的には推奨したい。

はてなブックマーク - libmysqlclientを使うプログラムはset namesをutf8であっても使ってはいけない

コメント: 0
トラックバック: 2

latin-1 is not utf-8

投稿者: chiba
2009/1/31 土曜日 18:31:14
perl

use utf8 環境下で => オペレータの左辺が UTF8 flag on になってしまう – daily dayflower
ふむふむ。これは興味深い現象ですね。

まず初めにはっきりさせておくべきなのは、
latin-1の\x{a4} == U+00a4 == utf-8の\x{c2}\x{a4}
であるということですね。

use URI;

my $uri = URI->new('http://example.com/');

use utf8;

$uri->query_form( bytes => "\x{a4}" );
print $uri, "\n";
#=> http://example.com/?bytes=%C2%A4

で、それを踏まえると上記の結果の何が問題なのか分からないですよね。utf-8エンコーディングの正しい結果のような気がします。

結果をlatin-1で出したいのかutf8で出したいのかがはっきりしていないから期待する結果にならないのではないでしょうか。

utf8で出したいのであれば

use URI;

my $uri = URI->new('http://example.com/');

use utf8;
use Encode;

# utf8バイト列でエスケープしたい場合だけflagged utf8が許されると考えておｋ
$uri->query_form( bytes => decode('latin-1', "\x{a4}") );
# もしくはbyte列だけで行うほうが安全かな
# { use bytes;
#    $uri->query_form( bytes => "\xc2\xa4") );
# }
print $uri, "\n";
#=> http://example.com/?bytes=%C2%A4

と書くべきだろうし、latin-1で出したいのであれば
(すべてutf-8でプログラムコードを書いているという前提で）

use URI;

my $uri = URI->new('http://example.com/');

use utf8;
use Encode;

$uri->query_form( Encode::encode('latin-1', 'bytes') => "\x{a4}" );
print $uri, "\n";
#=> http://example.com/?bytes=%A4

と書くべきではないでしょうか。

思うに、このような出力するバイト列に操作を加えるメソッドにutf8 flaggedな文字列を渡すようにするには出力の文字コードも一緒に渡せるようになっていないと意味がないような気がします。なので、唯一安全なのは、あらかじめ出力する文字コードのバイト列に変換した値を渡すことではないでしょうか。むしろモジュール側がそう推奨すべきなのかな。

コメント: 0
トラックバック: 0

mysqlでcharsetをucs2にした場合のasciiのみのフィールド

投稿者: chiba
2009/1/15 木曜日 2:22:59
javascript | mysql | perl | vimperator

前回のエントリで書いたように、ucs2をcharsetにするとascii部分がutf8に比べ2倍になってしまう。これを一部解決するために、asciiのみで構成されることが保障されているフィールドはそのフィールドだけasciiにしてしまえばいい。そうすることによって、asciiのみで構成されたフィールドをkeyに検索といった場合でもutf8に比べ遜色のないパフォーマンスを得ることができる。

まずは、ベンチコードを例の如くはりつけ。環境等は前エントリと同一。

ベンチコード

insert用

#!/usr/bin/perl
use strict;
use warnings;

use feature qw/say/;
use utf8;

use DBI;
use Encode;

use List::MoreUtils qw/apply/;
use Benchmark;

use Digest::MD5;

my @engines = qw/myisam innodb/;
my @charsets = qw/utf8 ucs2/;
my $bench_count = 100000;

my $client_dsn = 'DBI:mysql:database=sandbox;host=localhost;mysql_read_default_file=/etc/mysql/my.cnf';
my $server_dsn = $client_dsn . ';mysql_server_prepare=1';
my @idpass = qw/sandbox sandbox/;

my @init_query = (
    'drop table if exists %table',
    q{
        create table %table(
            id int unsigned not null auto_increment,
            name char(3) not null,
            b64digest char(22) %field_charset not null,
            created datetime not null,                                                                                                                                                                                        primary key(id),
            index(b64digest)
        ) ENGINE=%engine ROW_FORMAT=%row_format DEFAULT CHARSET=%charset
    },
);
my $query = Encode::encode('utf8', q{insert into %table values(null, 'たなか', ?, now())});

my %target = ();
for my $engine ( @engines ) {
    for my $charset ( @charsets ) {
        for my $field_charset ( 'ascii', '' ) {
            for my $row_format ( 'DEFAULT', 'REDUNDANT') {
                if ( $engine eq 'myisam' && $row_format ne 'DEFAULT' ) {
                    next;
                }
                my $key = join '_', ($engine, $charset, $field_charset || 'non', $row_format);
                $target{$key} = {
                    dbh   => DBI->connect(
                        $server_dsn,
                        @idpass,
                        {
                            AutoCommit => 0,
                        },
                    ),
                    engine  => $engine,
                    charset => $charset,
                    field_charset => $field_charset,
                    row_format => $row_format,
                    table => 'sandbox_' . $key,
                };
                for my $query ( @init_query ) {
                    $target{$key}->{dbh}->do(apply { s/%(table|charset|engine|field_charset|row_format)/$target{$key}->{$1}/g } $query);
                }
                $target{$key}->{sth} = $target{$key}->{dbh}->prepare(apply { s/%(table|charset|engine)/$target{$key}->{$1}/g } $query);
            }
        }
    }
}

my $md5 = Digest::MD5->new();
Benchmark::cmpthese($bench_count,{
    map {
        my $target_sth = $target{$_}->{sth};
        $_ => sub {
            $md5->reset;
            $md5->add(int(rand(10000)));
            $target_sth->execute($md5->b64digest);
        }
    } keys %target
});
$_->{dbh}->commit for values %target;
$_->{dbh}->disconnect for values %target;

select用

#!/usr/bin/perl
use strict;
use warnings;

use feature qw/say/;
use utf8;

use DBI;

use List::MoreUtils qw/apply/;
use Benchmark;

use Digest::MD5;                                                                                                                                                                                                  
my @engines = qw/myisam innodb/;
my @charsets = qw/utf8 ucs2/;
my $bench_count = 30000;

my $client_dsn = 'DBI:mysql:database=sandbox;host=localhost;mysql_read_default_file=/etc/mysql/my.cnf';
my $server_dsn = $client_dsn . ';mysql_server_prepare=1';
my @idpass = qw/sandbox sandbox/;

my $query = q{select * from %table where b64digest = ?};

my %target = ();
for my $engine ( @engines ) {
    for my $charset ( @charsets ) {
        for my $field_charset ( 'ascii', '' ) {
            for my $row_format ( 'DEFAULT', 'REDUNDANT') {
                if ( $engine eq 'myisam' && $row_format ne 'DEFAULT' ) {
                    next;
                }
                my $key = join '_', ($engine, $charset, $field_charset || 'non', $row_format);
                $target{$key} = {
                    dbh   => DBI->connect(
                        $server_dsn,
                        @idpass,
                    ),
                    engine  => $engine,
                    charset => $charset,
                    field_charset => $field_charset,
                    row_format => $row_format,
                    table => 'sandbox_' . $key,
                };
                $target{$key}->{sth} = $target{$key}->{dbh}->prepare(apply { s/%(table|charset|engine)/$target{$key}->{$1}/g } $query);
            }
        }
    }
}
                                                                                                                                                                                                                  
my $md5 = Digest::MD5->new();
Benchmark::cmpthese($bench_count,{
    map {
        my $target_sth = $target{$_}->{sth};
        $_ => sub {
            $md5->reset;
            $md5->add(int(rand(10000)));
            $target_sth->execute($md5->b64digest);
            $target_sth->fetch;
        }
    } keys %target
});

結果

b64digest char(22)

insert

                               Rate innodb_utf8_ascii_DEFAULT innodb_ucs2_non_DEFAULT myisam_ucs2_ascii_DEFAULT innodb_utf8_non_REDUNDANT innodb_ucs2_non_REDUNDANT innodb_ucs2_ascii_DEFAULT myisam_utf8_non_DEFAULT myisam_ucs2_non_DEFAULT myisam_utf8_ascii_DEFAULT innodb_utf8_ascii_REDUNDANT innodb_ucs2_ascii_REDUNDANT innodb_utf8_non_DEFAULT
innodb_utf8_ascii_DEFAULT   32895/s                        --                     -1%                       -2%                       -2%                       -2%                       -8%                    -10%                    -10%                      -13%                        -14%                        -14%                    -15%
innodb_ucs2_non_DEFAULT     33113/s                        1%                      --                       -1%                       -1%                       -1%                       -8%                     -9%                    -10%                      -12%                        -13%                        -13%                    -15%
myisam_ucs2_ascii_DEFAULT   33445/s                        2%                      1%                        --                       -0%                       -0%                       -7%                     -8%                     -9%                      -11%                        -12%                        -12%                    -14%
innodb_utf8_non_REDUNDANT   33445/s                        2%                      1%                        0%                        --                       -0%                       -7%                     -8%                     -9%                      -11%                        -12%                        -12%                    -14%
innodb_ucs2_non_REDUNDANT   33557/s                        2%                      1%                        0%                        0%                        --                       -6%                     -8%                     -8%                      -11%                        -12%                        -12%                    -13%
innodb_ucs2_ascii_DEFAULT   35842/s                        9%                      8%                        7%                        7%                        7%                        --                     -1%                     -2%                       -5%                         -6%                         -6%                     -8%
myisam_utf8_non_DEFAULT     36364/s                       11%                     10%                        9%                        9%                        8%                        1%                      --                     -1%                       -4%                         -5%                         -5%                     -6%
myisam_ucs2_non_DEFAULT     36630/s                       11%                     11%                       10%                       10%                        9%                        2%                      1%                      --                       -3%                         -4%                         -4%                     -5%
myisam_utf8_ascii_DEFAULT   37736/s                       15%                     14%                       13%                       13%                       12%                        5%                      4%                      3%                        --                         -1%                         -1%                     -3%
innodb_utf8_ascii_REDUNDANT 38168/s                       16%                     15%                       14%                       14%                       14%                        6%                      5%                      4%                        1%                          --                         -0%                     -2%
innodb_ucs2_ascii_REDUNDANT 38168/s                       16%                     15%                       14%                       14%                       14%                        6%                      5%                      4%                        1%                          0%                          --                     -2%
innodb_utf8_non_DEFAULT     38760/s                       18%                     17%                       16%                       16%                       16%                        8%                      7%                      6%                        3%                          2%                          2%                      --

まーinsertはたいして差なし。変換コストなんかも気にしないでいいかなぁ？

select

innodb_utf8_non_REDUNDANT    8333/s                        --                      -11%                    -19%                      -32%                        -35%                    -35%                      -38%                        -40%                    -59%                    -64%                      -65%                      -65%
innodb_ucs2_non_REDUNDANT    9317/s                       12%                        --                    -10%                      -24%                        -27%                    -28%                      -30%                        -33%                    -55%                    -60%                      -61%                      -61%
innodb_ucs2_non_DEFAULT     10309/s                       24%                       11%                      --                      -16%                        -20%                    -20%                      -23%                        -26%                    -50%                    -55%                      -57%                      -57%
innodb_ucs2_ascii_DEFAULT   12295/s                       48%                       32%                     19%                        --                         -4%                     -5%                       -8%                        -12%                    -40%                    -47%                      -48%                      -48%
innodb_ucs2_ascii_REDUNDANT 12821/s                       54%                       38%                     24%                        4%                          --                     -0%                       -4%                         -8%                    -38%                    -44%                      -46%                      -46%
innodb_utf8_non_DEFAULT     12876/s                       55%                       38%                     25%                        5%                          0%                      --                       -4%                         -8%                    -37%                    -44%                      -46%                      -46%
innodb_utf8_ascii_DEFAULT   13393/s                       61%                       44%                     30%                        9%                          4%                      4%                        --                         -4%                    -35%                    -42%                      -44%                      -44%
innodb_utf8_ascii_REDUNDANT 13953/s                       67%                       50%                     35%                       13%                          9%                      8%                        4%                          --                    -32%                    -40%                      -41%                      -41%
myisam_utf8_non_DEFAULT     20548/s                      147%                      121%                     99%                       67%                         60%                     60%                       53%                         47%                      --                    -11%                      -14%                      -14%
myisam_ucs2_non_DEFAULT     23077/s                      177%                      148%                    124%                       88%                         80%                     79%                       72%                         65%                     12%                      --                       -3%                       -3%
myisam_ucs2_ascii_DEFAULT   23810/s                      186%                      156%                    131%                       94%                         86%                     85%                       78%                         71%                     16%                      3%                        --                       -0%
myisam_utf8_ascii_DEFAULT   23810/s                      186%                      156%                    131%                       94%                         86%                     85%                       78%                         71%                     16%                      3%                        0%                        --

5.x以降のinnodbのデフォルトROW_FORMATはCOMPACTなんですが、やはりutf8のchar型の場合においてREDUNDANTに比べうまいこと動いていることがわかるベンチ結果ですね。でもって問題のasciiですが基本常にascii指定のほうがパフォーマンスが上になりましたね。

b64digest varchar(22)
フィールドをvarcharにしてみる。
insert

                               Rate innodb_ucs2_non_REDUNDANT innodb_ucs2_ascii_DEFAULT innodb_ucs2_non_DEFAULT myisam_ucs2_ascii_DEFAULT innodb_utf8_ascii_DEFAULT myisam_utf8_non_DEFAULT myisam_ucs2_non_DEFAULT innodb_utf8_ascii_REDUNDANT innodb_utf8_non_REDUNDANT innodb_utf8_non_DEFAULT myisam_utf8_ascii_DEFAULT innodb_ucs2_ascii_REDUNDANT
innodb_ucs2_non_REDUNDANT   34247/s                        --                       -3%                     -5%                       -5%                       -6%                     -7%                     -8%                         -9%                       -9%                     -9%                      -10%                        -11%
innodb_ucs2_ascii_DEFAULT   35336/s                        3%                        --                     -2%                       -2%                       -3%                     -4%                     -5%                         -6%                       -6%                     -6%                       -7%                         -8%
innodb_ucs2_non_DEFAULT     35971/s                        5%                        2%                      --                       -1%                       -1%                     -3%                     -4%                         -4%                       -4%                     -5%                       -5%                         -6%
myisam_ucs2_ascii_DEFAULT   36232/s                        6%                        3%                      1%                        --                       -0%                     -2%                     -3%                         -4%                       -4%                     -4%                       -5%                         -5%
innodb_utf8_ascii_DEFAULT   36364/s                        6%                        3%                      1%                        0%                        --                     -1%                     -3%                         -3%                       -3%                     -4%                       -4%                         -5%
myisam_utf8_non_DEFAULT     36900/s                        8%                        4%                      3%                        2%                        1%                      --                     -1%                         -2%                       -2%                     -2%                       -3%                         -4%
myisam_ucs2_non_DEFAULT     37313/s                        9%                        6%                      4%                        3%                        3%                      1%                      --                         -1%                       -1%                     -1%                       -2%                         -3%
innodb_utf8_ascii_REDUNDANT 37594/s                       10%                        6%                      5%                        4%                        3%                      2%                      1%                          --                       -0%                     -0%                       -1%                         -2%
innodb_utf8_non_REDUNDANT   37594/s                       10%                        6%                      5%                        4%                        3%                      2%                      1%                          0%                        --                     -0%                       -1%                         -2%
innodb_utf8_non_DEFAULT     37736/s                       10%                        7%                      5%                        4%                        4%                      2%                      1%                          0%                        0%                      --                       -1%                         -2%
myisam_utf8_ascii_DEFAULT   38023/s                       11%                        8%                      6%                        5%                        5%                      3%                      2%                          1%                        1%                      1%                        --                         -1%
innodb_ucs2_ascii_REDUNDANT 38314/s                       12%                        8%                      7%                        6%                        5%                      4%                      3%                          2%                        2%                      2%                        1%                          --

ほぼ変化なし

select

                               Rate innodb_ucs2_non_REDUNDANT innodb_ucs2_non_DEFAULT innodb_ucs2_ascii_REDUNDANT innodb_utf8_ascii_REDUNDANT innodb_utf8_ascii_DEFAULT innodb_utf8_non_DEFAULT innodb_ucs2_ascii_DEFAULT innodb_utf8_non_REDUNDANT myisam_utf8_non_DEFAULT myisam_ucs2_non_DEFAULT myisam_utf8_ascii_DEFAULT myisam_ucs2_ascii_DEFAULT
innodb_ucs2_non_REDUNDANT    8902/s                        --                     -3%                      -24%                        -25%                      -27%                        -27%                      -28%                    -32%                      -58%                    -60%                      -60%                    -62%
innodb_ucs2_non_DEFAULT      9202/s                        3%                      --                      -21%                        -23%                      -24%                        -25%                      -25%                    -30%                      -57%                    -58%                      -59%                    -60%
innodb_utf8_non_REDUNDANT   11719/s                       32%                     27%                        --                         -2%                       -4%                         -4%                       -5%                    -11%                      -45%                    -47%                      -47%                    -50%
innodb_utf8_ascii_REDUNDANT 11905/s                       34%                     29%                        2%                          --                       -2%                         -2%                       -4%                     -9%                      -44%                    -46%                      -46%                    -49%
innodb_ucs2_ascii_DEFAULT   12146/s                       36%                     32%                        4%                          2%                        --                         -0%                       -2%                     -7%                      -43%                    -45%                      -45%                    -48%
innodb_ucs2_ascii_REDUNDANT 12195/s                       37%                     33%                        4%                          2%                        0%                          --                       -1%                     -7%                      -43%                    -45%                      -45%                    -48%
innodb_utf8_ascii_DEFAULT   12346/s                       39%                     34%                        5%                          4%                        2%                          1%                        --                     -6%                      -42%                    -44%                      -44%                    -47%
innodb_utf8_non_DEFAULT     13100/s                       47%                     42%                       12%                         10%                        8%                          7%                        6%                      --                      -39%                    -41%                      -41%                    -44%
myisam_utf8_ascii_DEFAULT   21429/s                      141%                    133%                       83%                         80%                       76%                         76%                       74%                     64%                        --                     -3%                       -4%                     -8%
myisam_utf8_non_DEFAULT     22059/s                      148%                    140%                       88%                         85%                       82%                         81%                       79%                     68%                        3%                      --                       -1%                     -5%
myisam_ucs2_ascii_DEFAULT   22222/s                      150%                    141%                       90%                         87%                       83%                         82%                       80%                     70%                        4%                      1%                        --                     -4%
myisam_ucs2_non_DEFAULT     23256/s                      161%                    153%                       98%                         95%                       91%                         91%                       88%                     78%                        9%                      5%                        5%                      --

varcharだとutf8のままが若干早い。変換コストかな。

まとめ

innodb (defaultが)utf8 varchar ROW_FORMAT=COMPACT以外の組み合わせの場合は、できるだけasciiを指定しておくといいかんじになりそう。

参考
[mysql 14237] フィールドごとのキャラクタセットの異なる指定

はてなブックマーク - mysqlでcharsetをucs2にした場合のasciiのみのフィールド

コメント: 0
トラックバック: 0

mysqlの内部キャラセットはucs2にするといいんじゃないだろうか

投稿者: chiba
2009/1/12 月曜日 2:06:01
mysql | perl

mysqlの保存用のcharsetをucs2にすると、utf8の場合に比べて速くなる状況があるんじゃないのだろうかとふと思い立ちいろいろとベンチマークとってみた。視点は以下の三つ。

・myisam || innodb
・ucs2 || utf8
・client-prepare || serverside-prepare

mysqlのバージョンはmysql-server-5.0.51a-20(debian lenny)
perl v5.10.0
DBI 1.607
DBD::mysqlは4.0.10にBug #41815のパッチあてたもの。

ベンチコード

insert用

#!/usr/bin/perl
use strict;
use warnings;

use feature qw/say/;
use utf8;

use DBI;

use List::MoreUtils qw/apply/;
use Benchmark;

my @engines = qw/myisam innodb/;
my @charsets = qw/utf8 ucs2/;
my $bench_count = 100000;

my $insert_str = 'ほげ' x 100;

my $client_dsn = 'DBI:mysql:database=sandbox;host=localhost;mysql_read_default_file=/etc/mysql/my.cnf';
my $server_dsn = $client_dsn . ';mysql_server_prepare=1';
my @idpass = qw/sandbox sandbox/;

my @init_query = (
    'drop table if exists %table',
    q{
        create table %table(
            id int unsigned not null auto_increment,
            name varchar(1024) not null,
            created datetime not null,
            primary key(id)
        ) ENGINE=%engine DEFAULT CHARSET=%charset
    },
);
my $query = q{insert into %table values(null, ?, now())};

my %target = ();
for my $engine ( @engines ) {
    for my $charset ( @charsets ) {
        for my $prepare_type ( qw/client server/ ) {
            my $key = join '_', ($engine, $charset, $prepare_type);
            $target{$key} = {
                dbh   => DBI->connect(
                    $prepare_type eq 'client' ? $client_dsn : $server_dsn,
                    @idpass,
                    {
                        AutoCommit => 0,
                    },
                ),
                engine  => $engine,
                charset => $charset,
                table => 'sandbox_' . $key,
            };
            for my $query ( @init_query ) {
                $target{$key}->{dbh}->do(apply { s/%(table|charset|engine)/$target{$key}->{$1}/g } $query);
            }
            $target{$key}->{sth} = $target{$key}->{dbh}->prepare(apply { s/%(table|charset|engine)/$target{$key}->{$1}/g } $query);
        }
    }
}

Benchmark::cmpthese($bench_count,{
    map {
        my $target_sth = $target{$_}->{sth};
        $_ => sub {
            $target_sth->execute($insert_str);
        }
    } keys %target
});
$_->{dbh}->commit for values %target;
$_->{dbh}->disconnect for values %target;

select用

#!/usr/bin/perl
use strict;
use warnings;

use feature qw/say/;
use utf8;

use DBI;

use List::MoreUtils qw/apply/;
use Benchmark;

my @engines = qw/myisam innodb/;
my @charsets = qw/utf8 ucs2/;
my $bench_count = 100000;

my $client_dsn = 'DBI:mysql:database=sandbox;host=localhost;mysql_read_default_file=/etc/mysql/my.cnf';
my $server_dsn = $client_dsn . ';mysql_server_prepare=1';
my @idpass = qw/sandbox sandbox/;

my $query = q{select * from %table where id = ?};

my %target = ();
for my $engine ( @engines ) {
    for my $charset ( @charsets ) {
        for my $prepare_type ( qw/client server/ ) {
            my $key = join '_', ($engine, $charset, $prepare_type);
            $target{$key} = {
                dbh   => DBI->connect(
                    $prepare_type eq 'client' ? $client_dsn : $server_dsn,
                    @idpass
                ),
                engine  => $engine,
                charset => $charset,
                table => 'sandbox_' . $key,
            };

            $target{$key}->{sth} = $target{$key}->{dbh}->prepare(apply { s/%(table|charset|engine)/$target{$key}->{$1}/g } $query);
        }
    }
}


Benchmark::cmpthese($bench_count,{
    map {
        my $target_sth = $target{$_}->{sth};
        $_ => sub {
            $target_sth->execute(150000);
            $target_sth->fetch;
        }
    } keys %target
});

/etc/mysql/my.cnfには

[client]
default-character-set=utf8

[mysqld]
query_cache_size        = 0

がかいてある。

で、ベンチ結果を以下にはっつけて少し解釈してみますが、すべての解釈には「今回の環境でこのコードにおいては」という前提付きです。

insert

my $insert_str = ‘ほげ’ x 10;

                      Rate myisam_utf8_client innodb_ucs2_client myisam_ucs2_client innodb_utf8_client innodb_utf8_server myisam_utf8_server innodb_ucs2_server myisam_ucs2_server
myisam_utf8_client 36101/s                 --                -0%                -6%                -7%               -10%               -15%               -16%               -16%
innodb_ucs2_client 36232/s                 0%                 --                -5%                -7%                -9%               -15%               -16%               -16%
myisam_ucs2_client 38314/s                 6%                 6%                 --                -2%                -4%               -10%               -11%               -11%
innodb_utf8_client 38911/s                 8%                 7%                 2%                 --                -3%                -9%                -9%               -10%
innodb_utf8_server 40000/s                11%                10%                 4%                 3%                 --                -6%                -7%                -7%
myisam_utf8_server 42553/s                18%                17%                11%                 9%                 6%                 --                -1%                -1%
innodb_ucs2_server 42918/s                19%                18%                12%                10%                 7%                 1%                 --                -0%
myisam_ucs2_server 43103/s                19%                19%                13%                11%                 8%                 1%                 0%                 --

my $insert_str = ‘ほげ’ x 100;

                      Rate innodb_ucs2_client innodb_utf8_client myisam_ucs2_client myisam_utf8_client innodb_utf8_server myisam_utf8_server innodb_ucs2_server myisam_ucs2_server
innodb_ucs2_client 29326/s                 --                -5%                -8%                -9%               -12%               -25%               -28%               -30%
innodb_utf8_client 30769/s                 5%                 --                -3%                -5%                -7%               -21%               -24%               -27%
myisam_ucs2_client 31746/s                 8%                 3%                 --                -2%                -4%               -19%               -22%               -25%
myisam_utf8_client 32362/s                10%                 5%                 2%                 --                -3%               -17%               -20%               -23%
innodb_utf8_server 33223/s                13%                 8%                 5%                 3%                 --               -15%               -18%               -21%
myisam_utf8_server 39063/s                33%                27%                23%                21%                18%                 --                -4%                -7%
innodb_ucs2_server 40486/s                38%                32%                28%                25%                22%                 4%                 --                -4%
myisam_ucs2_server 42194/s                44%                37%                33%                30%                27%                 8%                 4%                 --

my $insert_str = ‘ほげ’ x 500;

                      Rate innodb_utf8_client innodb_ucs2_client myisam_utf8_client myisam_ucs2_client innodb_utf8_server innodb_ucs2_server myisam_utf8_server myisam_ucs2_server
innodb_utf8_client 14948/s                 --                -4%               -22%               -26%               -44%               -47%               -55%               -58%
innodb_ucs2_client 15552/s                 4%                 --               -19%               -23%               -42%               -45%               -53%               -56%
myisam_utf8_client 19084/s                28%                23%                 --                -6%               -29%               -32%               -43%               -46%
myisam_ucs2_client 20202/s                35%                30%                 6%                 --               -25%               -28%               -39%               -43%
innodb_utf8_server 26810/s                79%                72%                40%                33%                 --                -5%               -19%               -25%
innodb_ucs2_server 28249/s                89%                82%                48%                40%                 5%                 --               -15%               -21%
myisam_utf8_server 33223/s               122%               114%                74%                64%                24%                18%                 --                -7%
myisam_ucs2_server 35587/s               138%               129%                86%                76%                33%                26%                 7%                 --

・myisam || innodb
当然myisamのほうが早い

・ucs2 || utf8
ucs2が若干早い気がするけど誤差の範囲

・client-prepare || serverside-prepare
serversideはデータ量が増えるにつれ明らかにclientに比べ早くなる。

select

上記の’ほげ’ x 500のinsert後に実行

                      Rate myisam_utf8_client innodb_ucs2_client myisam_utf8_server myisam_ucs2_client innodb_utf8_client myisam_ucs2_server innodb_utf8_server innodb_ucs2_server
myisam_utf8_client 25381/s                 --                -0%                -4%                -6%               -13%               -16%               -16%               -17%
innodb_ucs2_client 25381/s                 0%                 --                -4%                -6%               -13%               -16%               -16%               -17%
myisam_utf8_server 26455/s                 4%                 4%                 --                -2%               -10%               -12%               -12%               -14%
myisam_ucs2_client 27027/s                 6%                 6%                 2%                 --                -8%               -11%               -11%               -12%
innodb_utf8_client 29240/s                15%                15%                11%                 8%                 --                -3%                -3%                -5%
myisam_ucs2_server 30211/s                19%                19%                14%                12%                 3%                 --                -0%                -2%
innodb_utf8_server 30211/s                19%                19%                14%                12%                 3%                 0%                 --                -2%
innodb_ucs2_server 30675/s                21%                21%                16%                13%                 5%                 2%                 2%                 --

うーん・・・。たいした差はなし。

思いつくutf8に対するucs2のメリット

・日本語だけで埋め尽くされているフィールドの実容量が約2/3になる(＝onメモリ可能なレコード数やページあたりのレコード数が増える)
・バイト数 / 2 = 文字数になる。よって下記のようにchar_lengthをとるものだとかなり早くなる。まぁそんなにうれしいってほどではないけど。

mysql> select max(char_length(name)) from sandbox_innodb_utf8_client;
+------------------------+
| max(char_length(name)) |
+------------------------+
|                   1000 |
+------------------------+
1 row in set (18.02 sec)

mysql> select max(char_length(name)) from sandbox_innodb_ucs2_client;
+------------------------+
| max(char_length(name)) |
+------------------------+
|                   1000 |
+------------------------+
1 row in set (12.46 sec)

mysql> select max(char_length(name)) from sandbox_myisam_utf8_client;
+------------------------+
| max(char_length(name)) |
+------------------------+
|                   1000 |
+------------------------+
1 row in set (14.99 sec)

mysql> select max(char_length(name)) from sandbox_myisam_ucs2_client;
+------------------------+
| max(char_length(name)) |
+------------------------+
|                   1000 |
+------------------------+
1 row in set (7.51 sec)

ucs2のデメリット

・ASCII範囲の容量が2倍になる(myisamのCHAR型とinnodb(ROW_FORMAT=REDUNDANT)のCHAR型を除く）
ぐらい？結果的にucs2に決定的なメリットがあるわけでもないけれど、デメリットもないならばutf8にしてる人はみんなucs2にすればいいよ！とかおもうのだけどどうなんだろう。誰かデメリットが分かる人がいればおしえてください。ただ、mysql6.0.4からはutf8がサロゲートペアの範囲もサポートする(5.xはBMPのみのサポート)ので互換性がなくなるというデメリットは発生しそう。その場合は同様にサポートされたutf32を使うといいんだろうけど容量的なメリットはなくなってしまう。utf16にすると容量的なメリットは維持できるけど、バイト数と文字数の固定関係は得られない。悩みどころ。

クライアントキャラクタセットにucs2が使えないので、utf8からの変換コストが問題になりそうだなぁとおもったけどベンチとってみるとそうでもなかったのが少し意外だった。

番外編:myisamのfixed

myisamはすべてnot nullで固定長のフィールドだと、ROW_FORMATがFIXEDになって早くなる。この話とucs2のバイト数固定性能が関連するんじゃないかと少し探ったけど今のところ関連は見いだせず。

insert

my $insert_str = ‘ほげ’ x 100;
char(255)

                      Rate innodb_ucs2_client myisam_ucs2_client myisam_utf8_client innodb_utf8_client innodb_utf8_server innodb_ucs2_server myisam_utf8_server myisam_ucs2_server
innodb_ucs2_client 29499/s                 --                -2%                -6%                -6%                -8%               -21%               -26%               -28%
myisam_ucs2_client 30211/s                 2%                 --                -4%                -4%                -6%               -19%               -24%               -27%
myisam_utf8_client 31348/s                 6%                 4%                 --                -0%                -2%               -16%               -21%               -24%
innodb_utf8_client 31447/s                 7%                 4%                 0%                 --                -2%               -16%               -21%               -24%
innodb_utf8_server 32051/s                 9%                 6%                 2%                 2%                 --               -14%               -19%               -22%
innodb_ucs2_server 37313/s                26%                24%                19%                19%                16%                 --                -6%                -9%
myisam_utf8_server 39683/s                35%                31%                27%                26%                24%                 6%                 --                -4%
myisam_ucs2_server 41152/s                40%                36%                31%                31%                28%                10%                 4%                 --

my $insert_str = ‘ほげ’ x 100;
varchar(255)

                      Rate innodb_ucs2_client myisam_ucs2_client innodb_utf8_client myisam_utf8_client innodb_utf8_server myisam_utf8_server innodb_ucs2_server myisam_ucs2_server
innodb_ucs2_client 30395/s                 --                -8%               -12%               -13%               -19%               -21%               -24%               -25%
myisam_ucs2_client 33003/s                 9%                 --                -4%                -5%               -12%               -14%               -18%               -18%
innodb_utf8_client 34483/s                13%                 4%                 --                -1%                -8%               -10%               -14%               -15%
myisam_utf8_client 34843/s                15%                 6%                 1%                 --                -7%                -9%               -13%               -14%
innodb_utf8_server 37453/s                23%                13%                 9%                 7%                 --                -3%                -7%                -7%
myisam_utf8_server 38462/s                27%                17%                12%                10%                 3%                 --                -4%                -5%
innodb_ucs2_server 40161/s                32%                22%                16%                15%                 7%                 4%                 --                -1%
myisam_ucs2_server 40486/s                33%                23%                17%                16%                 8%                 5%                 1%                 --

select

char(255)

                      Rate innodb_ucs2_client innodb_utf8_server myisam_utf8_client innodb_utf8_client innodb_ucs2_server myisam_ucs2_client myisam_utf8_server myisam_ucs2_server
innodb_ucs2_client 26385/s                 --                -3%                -6%                -6%                -9%               -11%               -15%               -18%
innodb_utf8_server 27174/s                 3%                 --                -3%                -3%                -7%                -8%               -13%               -15%
myisam_utf8_client 28011/s                 6%                 3%                 --                -0%                -4%                -6%               -10%               -13%
innodb_utf8_client 28090/s                 6%                 3%                 0%                 --                -4%                -5%               -10%               -12%
innodb_ucs2_server 29155/s                10%                 7%                 4%                 4%                 --                -2%                -6%                -9%
myisam_ucs2_client 29674/s                12%                 9%                 6%                 6%                 2%                 --                -4%                -7%
myisam_utf8_server 31056/s                18%                14%                11%                11%                 7%                 5%                 --                -3%
myisam_ucs2_server 32051/s                21%                18%                14%                14%                10%                 8%                 3%                 --

varchar(255)

                      Rate myisam_ucs2_server myisam_utf8_server innodb_ucs2_client innodb_ucs2_server myisam_utf8_client innodb_utf8_server myisam_ucs2_client innodb_utf8_client
myisam_ucs2_server 27855/s                 --                -3%                -3%                -4%                -8%                -8%               -12%               -14%
myisam_utf8_server 28736/s                 3%                 --                -0%                -1%                -5%                -5%                -9%               -11%
innodb_ucs2_client 28736/s                 3%                 0%                 --                -1%                -5%                -5%                -9%               -11%
innodb_ucs2_server 29070/s                 4%                 1%                 1%                 --                -4%                -4%                -8%               -10%
myisam_utf8_client 30211/s                 8%                 5%                 5%                 4%                 --                -1%                -5%                -7%
innodb_utf8_server 30395/s                 9%                 6%                 6%                 5%                 1%                 --                -4%                -6%
myisam_ucs2_client 31646/s                14%                10%                10%                 9%                 5%                 4%                 --                -3%
innodb_utf8_client 32468/s                17%                13%                13%                12%                 7%                 7%                 3%                 --

はてなブックマーク - mysqlの内部キャラセットはucs2にするといいんじゃないだろうか

コメント: 0
トラックバック: 0

正月早々DBD::mysqlのデバッグやってた

投稿者: chiba
2009/1/1 木曜日 14:33:56
mysql | perl

DBD::mysqlのベンチマークとってたらメモリリークが発生。

どうやらサーバサイドprepareなStatementHandlerを使いまわしているとリークする模様。rt.cpan.orgとどっちが反応よいのかよくわからなかったがとりあえずbugs.mysql.comのほうにバグリポート。Bug #41815

というわけで、今年もよろしくおねがいします。

コメント: 0
トラックバック: 0

はてぶタグとともに2008年を振り返る

投稿者: chiba
2008/12/31 水曜日 19:12:50
perl | webservice

はてぶのタグを年と月ごとにカウントするやつつくりました。
明らかに車輪の再発明ぽいですね。誰かやってるんでしょうけどXML::LibXMLのリハビリがてらに作ってみました。
2007年と2008年のぶんをとってくるです。

$tags = {
    year => {
        month => {
            'TAG' => count,
        },
    },
};

↑とってくるデータ構造はこんなかんじ。

んでコードと。

#!/usr/bin/perl
use strict;
use warnings;

use feature qw/say/;
use utf8;

use Encode;
use LWP::Simple qw/get/;
use XML::LibXML;
use XML::LibXML::XPathContext;

use DateTime;
use DateTime::Format::ISO8601;

use Storable;

my $base_url = 'http://b.hatena.ne.jp/nihen/atomfeed';
my $parser = XML::LibXML->new();
my $xc = XML::LibXML::XPathContext->new();
my $dt_fmt = DateTime::Format::ISO8601->new;

my $page = 0;
my $tags = {
    2007 => {map { $_ => +{} } 1..12 },
    2008 => {map { $_ => +{} } 1..12 },
};
SEARCH_2007_2008:
    while (1) {
        my $content = get($base_url . '?of=' . ( ++$page - 1 ) * 20);
        say 'entry of ' . ($page - 1) * 20 . '-' . $page * 20;
        my $doc = $parser->parse_string($content);

        my $namespace = $doc->documentElement()->getAttribute('xmlns');
        my $namespace_dc = $doc->documentElement()->getAttribute('xmlns:dc');
        $xc->registerNs('x', $namespace);
        $xc->registerNs('xdc', $namespace_dc);

         my @entries = $xc->findnodes('/x:feed/x:entry', $doc);

        if ( !@entries ) {
            last SEARCH_2007_2008;
        }

        foreach my $entry ( @entries ) {
            my ($issued) = $xc->findnodes('x:issued', $entry);
            my $dt = $dt_fmt->parse_datetime($issued->textContent);

            if ( $dt->year < 2007 ) {
                last SEARCH_2007_2008;
            }

            my ($title) = $xc->findnodes('x:title', $entry);
            my @tags = $xc->findnodes('xdc:subject', $entry);
            for my $tag_el ( @tags ) {
                my $tag = $tag_el->textContent;
                if ( !defined $tags->{$dt->year}{$dt->month}{$tag} ) {
                    $tags->{$dt->year}{$dt->month}{$tag} = 1;
                }
                else {
                    $tags->{$dt->year}{$dt->month}{$tag}++;
                }
            }
        }
        store $tags, 'tagsstorefile';
    }

store $tags, 'tagsstorefile';

コメント: 0
トラックバック: 0

UTF-16時代のエスケープ処理

投稿者: chiba
2008/12/29 月曜日 6:12:08
java | perl | security

Connector/JのSQLインジェクション脆弱性

えーと、変換後がUCS-2とかだとマズいかな？

…(実験中)…

試作パッチ以前に、Connector/Jでcharacter_set_server=ucs2のサーバに繋がらないんですけど。
Connector/J側でcharacterEncoding=UTF-8などとしておけば繋がりますが、こんな仕様あったかな…。

そもそもucs2は現在クライアントキャラクタセットとしては使えないんじゃないでしょうか。

9.1.9. Unicode Support

UCS-2 cannot be used as a client character set,
which means that SET NAMES 'ucs2' does not work.
(See Section 9.1.4, “Connection Character Sets and Collations”.)

ちなみにUCS-2とかUTF-16のようにASCIIな部分までマルチバイトなキャラクタセットが使えるとなると、Cのmysql_real_escape_stringも問題が発生しそうですね。マルチバイトはエスケープ対象外という処理でShift_JIS(やGBK)の5c問題を解決しているようなので。これは現在のサポート文字セットに依存した脆弱な実装ということになるんでしょうか。

そんな時代がくるのか分かりませんが、perl的にUTF-16時代の正しいescape処理を考えてみるとこんな感じ？

#!/usr/bin/perl
use strict;
use warnings;

use utf8;
use Encode;

# from DBD::mysqlPP::quote
my @quote_target = (
    "\\"   => '\\\\',
    "\0"   => '\\0',
    "\n"   => '\\n',
    "\r"   => '\\r',
    "'"    => q{\\'},
    '"'    => '\\"',
    "\x1a" => '\\Z',
);

my $str = "I'm sorry\nok.";

print quote($str, 'utf-16LE');

sub quote {
    my ($str, $charset) = @_;

    my %quote_target_for_charset = map { Encode::encode($charset, $_) } @quote_target;

    $str =~ s{
        (.)
    }{
        my $bytes = Encode::encode($charset, $1);
        $quote_target_for_charset{$bytes} // $bytes;
    }exmsg;

    return $str;
}

Yen markのhtmlでのエスケープもなんかへんな感じになってるのでファイルごとup
1文字ずつencodeしてるのはなんかやーなかんじですね。試してないけどRubyだと$KCODEの切り替えを使うとencode的な事は1回で済むのかも。あとで試してみよう。

コメント: 0
トラックバック: 0

perl-mongers.orgのメール送信の記事にちとつっこみ

投稿者: chiba
2008/5/30 金曜日 2:58:51
perl

はてブのほうでちょろっと書いたんだけど、
Mail::Sendmail でメール送信の以下のコードについて。

use Encode;
use utf8;

my $subject = "テスト";
my $message = "テストです。ほげほげ。";

foreach my $str ( $subject, $message ) {
    utf8::encode($str);
    Encode::from_to( $str, "utf8", "iso-2022-jp" );
}

ここでutf8::encode => Encode::from_toするのは冗長

encode('iso-2022-jp', $str);

だけでおｋ。
use utf8してなければ（で、ソースの文字コードがutf8であれば)

Encode::from_to( $str, "utf8", "iso-2022-jp" );

だけでおｋ。
http://b.hatena.ne.jp/kamawada/20080530#bookmark-8776610

encode('iso-2022-jp', $str); しても結局メールを送るときにutf8フラグを落とさなきゃいけないので

といってますがそれは勘違いかと・・・。
EncodeのPODのencodeの説明のところにも

the UTF8 flag for $octets is always off

とかかれていますし。

以下確認コードでっす。

#!/usr//bin/perl
use strict;
use utf8;

use Encode;
use Devel::Peek;

my $str = 'あああいうえお';

print Dump($str);
print Dump(simple_encode($str));
print Dump(encode_from_to($str));

sub simple_encode {
    my $str = shift;

    return encode('iso-2022-jp', $str);
}

sub encode_from_to {
    my $str = shift;

    $str = encode('utf8', $str);
    Encode::from_to($str, 'utf8' => 'iso-2022-jp');
    return $str;
}

結果はこちら

SV = PV(0x8151b00) at 0x815160c
  REFCNT = 1
  FLAGS = (PADBUSY,PADMY,POK,pPOK,UTF8)
  PV = 0x81fefb0 "343201202343201202343201202343201204343201206343201210343201212"� [UTF8 "x{3042}x{3042}x{3042}x{3044}x{3046}x{3048}x{304a}"]
  CUR = 21
  LEN = 24
SV = PVMG(0x81e1fc8) at 0x8230ca0
  REFCNT = 1
  FLAGS = (TEMP,POK,pPOK)
  IV = 0
  NV = 0
  PV = 0x8236630 "33$B$\"$\"$\"$$$&$($*33(B"�
  CUR = 20
  LEN = 24
SV = PVMG(0x81e2028) at 0x8230ca0
  REFCNT = 1
  FLAGS = (TEMP,POK,pPOK)
  IV = 0
  NV = 0
  PV = 0x8236630 "33$B$\"$\"$\"$$$&$($*33(B"�
  CUR = 20
  LEN = 24

ま、まったくおなじものですねー。

はてなブックマーク - perl-mongers.orgのメール送信の記事にちとつっこみ

コメント: 0
トラックバック: 0

CodeResposに初めてコミットしたモジュールに初めてpatchを送ってくれたのが異国の方だった件

投稿者: chiba
2008/2/25 月曜日 20:51:44
coderepos | perl

Codereposにコミットした初めてのモジュールである、「深度が一定ではない階層化されたデータをDBIx::Classで扱えるようにする」DBIx::Class::Tree::NestedSet。その使いどころがニッチすぎるためか、今までこれといったフィードバックは(予想通り)特にありませんでした。

ところが今日このモジュールに初めてpatchを送ってきてくれたのがなんと異国のKevin Bakerなる人物。コードをオープンにするっていうのはこういうことなんだなぁとちょっと感動。英語がからきし駄目な自分ですがKevinさんとコードの中で意思疎通はできたのかなぁと。

今まで異国の方にpatchを送ったことは何度かあったけど送ってもらったのは初めてで、うれしかったのでこんなエントリをたててしまった。今は反省している。

ところで、そのpatchはというとCodeReposのr7130で取り込んであります。見てもらうと分かるのですが恥ずかしいスペルミスとかを直されている部分もあったりするわけですが・・・。