Split a String Based on Byte Counts in a Specific Encoding


FME treats the number of characters as the length of a character string; it does not consider the byte count of multi-byte characters such as those used in the Japanese locale. I believe that's reasonable and convenient in almost all cases, since you don't need to think about how the number of bytes differs among encodings.

However, I sometimes need to split a string based on byte counts in a specific encoding, when reading datasets in legacy formats that define the column widths of a record in bytes. Not only Japanese but also Korean and Chinese users may encounter a similar situation.
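
As a quick illustration of the difference between character count and byte count, here is a minimal sketch in plain Tcl (run in tclsh, outside FME). The helper name byteLength is hypothetical and only for illustration; the sketch assumes the cp932 encoding is available in your Tcl installation, as it is in standard distributions.

```tcl
# Hypothetical helper: number of bytes a string occupies in a given encoding.
proc byteLength {text enc} {
    # "encoding convertto" returns the encoded bytes; their count is the byte length.
    return [string length [encoding convertto $enc $text]]
}

# "あいうえお" written with Unicode escapes so the script does not depend on file encoding.
set s "\u3042\u3044\u3046\u3048\u304A"

puts [string length $s]      ;# 5 characters
puts [byteLength $s cp932]   ;# 10 bytes (2 bytes per character)
puts [byteLength $s utf-8]   ;# 15 bytes (3 bytes per character)
```

The same five characters occupy a different number of bytes depending on the encoding, which is why a byte-based column width only makes sense together with a specific encoding.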

The Column Aligned Text (CAT) reader can be used to read a dataset in a fixed-length column format. Unfortunately, it seems to always treat one character as one byte, so it cannot be used for a dataset that may contain multi-byte characters.
So how can you split a string based on the byte count of each column?

2014-11-03: The above description of the CAT reader is not accurate. If the source text file contains a line consisting only of ASCII characters, the proper field boundaries can be set for the CAT reader in the parameters dialog, and the reader can read the data as expected. Otherwise, the field boundaries cannot be set properly. That's a limitation of the current CAT reader when reading data that includes multi-byte characters.

Tcl scripting is one quick way. For example, a TclCaller with this script splits a string into two tokens of 4 bytes and 6 bytes, counted in cp932 encoding. If "src" holds "あいうえお", the new attributes "col0" and "col1" will store "あい" and "うえお", since each of these Japanese characters is represented by 2 bytes in cp932. cp932 is the default encoding of Japanese Windows.
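proc byteSplit {} {
    # Encode to cp932 so that string indices count bytes, not characters.
    set src [encoding convertto cp932 [FME_GetAttribute "src"]]
    # Bytes 0-3 (4 bytes) and bytes 4-9 (6 bytes), each decoded back to text.
    FME_SetAttribute "col0" [encoding convertfrom cp932 [string range $src 0 3]]
    FME_SetAttribute "col1" [encoding convertfrom cp932 [string range $src 4 9]]
}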
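
The same idea can be generalized to any number of columns. The following is only a sketch under stated assumptions: splitByBytes and its parameters are hypothetical names chosen for illustration, it runs in plain tclsh rather than in a TclCaller, and it does not check whether a boundary falls in the middle of a multi-byte character. It is not the implementation of any published transformer.

```tcl
# Hypothetical helper: split text into fields whose widths are given in bytes,
# counted in the specified encoding (cp932 by default).
proc splitByBytes {text widths {enc cp932}} {
    # Encode first, so that indices below refer to bytes rather than characters.
    set bytes [encoding convertto $enc $text]
    set fields {}
    set start 0
    foreach w $widths {
        set end [expr {$start + $w - 1}]
        # Cut the byte range for this column and decode it back to text.
        lappend fields [encoding convertfrom $enc [string range $bytes $start $end]]
        set start [expr {$start + $w}]
    }
    return $fields
}

# "あいうえお" split into a 4-byte column and a 6-byte column.
set result [splitByBytes "\u3042\u3044\u3046\u3048\u304A" {4 6}]
puts $result   ;# あい うえお
```

Inside a TclCaller, each returned field could then be written to an attribute with FME_SetAttribute, in the same way as the script above.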

Based on this idea, I created a custom transformer named MbStringByteSplitter and published it in the FME Store. You can find it in this folder:
Transformer Gallery / FME Store / Pragmatica

In theory, the transformer should work in any locale, but I cannot test it in locales other than Japanese. If you find any issues in your locale, please let me know.

FME 2014 SP4 build 14433

2014-11-07: I thought that splitting a string by byte count was a special requirement for a few specific cultures such as Japanese, Korean, and Chinese, so I have never asked Safe for any improvement on this issue (there are many things that have to be improved before this!).
However, surprisingly, this article was accessed many times from around the world, including Europe and America, within a few days. Perhaps there are similar requirements in alphabetic cultures (Latin, Cyrillic, Greek, Arabic, Hebrew, etc.)?
If so, the CAT reader (Field Boundaries setting) should be improved as soon as possible, and adding new transformers that process a character string based on byte count in a specified encoding might also be helpful to many users.
What do you think?
