2014-01-26

Efficiency of List Attribute Manipulation with Python

Important
2014-01-29: I heard that Safe is planning to change the implementation of the Python API method fmeobjects.FMEFeature.getAttribute() so that it returns an empty string when the specified attribute stores <null>. Currently - FME 2014 build 14234 - it returns "None" in that case.
After confirming the change, I will revise related descriptions (underlined) in this article.
-----
2014-02-14: I noticed that the method in FME 2014 SP1 Beta (build 14255) returns an empty string for <null>. The implementation change seems to have been made for SP1.
-----
2014-02-25: The change to the FME Objects Python API has been announced. I have revised the related descriptions in this article (underlined).

(FME 2014 build 14234)
=====
2015-12-23: I tested the same scripts shown in this article again with FME 2015.1.3.1 build 15573. Contrary to my expectations, "copyList1" was always faster than the others in both Result 1 and Result 2. The order of the others was "copyList2" < "copyList3" < "copyList4" in almost all runs, but not always.
=====

There are several ways in Python to get or set the values of list attribute elements. I tested the differences in processing time among them.
Note: This article describes just an experimental result in my environment (FME 2014 32-bit build 14234, Windows 7 64-bit). The result may differ in other environments and/or conditions.

First, create a feature that has a list attribute, using this script in a PythonCreator. The number of elements is 100000 and every element value is a character string.
-----
import fmeobjects
class FeatureCreator(object):
    def __init__(self):
        pass
     
    def close(self):
        feature = fmeobjects.FMEFeature()
        for i in range(100000):
            feature.setAttribute('_src{%d}' % i, 'string value')
        self.pyoutput(feature)
-----

Then, I measured the processing time of these 4 scripts in a PythonCaller. All of them copy every element of "_src{}" to "_dest{}". Assume the source list has no <missing> elements.
-----
import fmeobjects, time
def copyList1(feature):
    s = time.clock()
    feature.setAttribute('_dest{}', feature.getAttribute('_src{}'))
    feature.setAttribute('_time_diff', '%.3f' % (time.clock() - s))
-----
import fmeobjects, time
def copyList2(feature):
    s = time.clock()
    for i, v in enumerate(feature.getAttribute('_src{}')):
        feature.setAttribute('_dest{%d}' % i, v)
    feature.setAttribute('_time_diff', '%.3f' % (time.clock() - s))
-----
import fmeobjects, time
def copyList3(feature):
    s = time.clock()
    i = 0
    while True:
        value = feature.getAttribute('_src{%d}' % i)
        if value is None:
            break
        feature.setAttribute('_dest{%d}' % i, value)
        i += 1
    feature.setAttribute('_time_diff', '%.3f' % (time.clock() - s))
-----
import fmeobjects, time
def copyList4(feature):
    s = time.clock()
    i = 0
    while True:
        isNull, isMissing, type = feature.getAttributeNullMissingAndType('_src{%d}' % i)
        if isMissing:
            break
        elif isNull:
            feature.setAttributeNullWithType('_dest{%d}' % i, type)
        else:
            feature.setAttribute('_dest{%d}' % i, feature.getAttribute('_src{%d}' % i))
        i += 1
    feature.setAttribute('_time_diff', '%.3f' % (time.clock() - s))
-----
FME 2014 SP1+ (build 14252 or later):
Note: If there are <null> elements in the source list, copyList1, copyList2 and copyList3 will copy the <null> elements as empty strings.
FME 2014 without SP*:
Note: If there are <null> elements in the source list, copyList1 and copyList2 will copy the <null> elements as empty strings; copyList3 stops copying when it encounters the first <null> element.
See also "Null in FME 2014: Handling Null with Python / Tcl".

Result 1:  faster <== copyList3 < copyList2 < copyList4 < copyList1 ==> slower












Next, I tested the case where every element is a numeric value (an actual number, not a string that represents a number). The source list was created by this script in the PythonCreator.
-----
import fmeobjects
class FeatureCreator(object):
    def __init__(self):
        pass
     
    def close(self):
        feature = fmeobjects.FMEFeature()
        for i in range(100000):
            feature.setAttribute('_src{%d}' % i, 100)
        self.pyoutput(feature)
-----

Result 2: faster <== copyList3 < copyList4 < copyList2 < copyList1 ==> slower











Looking at the results, "copyList3" always appears faster than the others.
In FME 2014 (without SP*), be aware that "copyList3" cannot be used if the source list can contain <null> elements.

-----
When writing a Python script in FME 2014 (without SP*), we always have to be conscious of <null>. It can be a little troublesome. It might be better to convert <null> to a certain value (e.g. an empty string) before processing with Python, if possible. I think the NullAttributeMapper is effective for that.
In FME 2014 SP1+ (build 14252 or later), be aware that the FMEFeature.getAttribute method returns an empty string for <null>. When it's necessary to distinguish <null> from an empty string in a Python script, consider using the FMEFeature.getAttributeNullMissingAndType method.
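For example, a minimal sketch of classifying a single attribute in FME 2014 SP1+ might look like this (the attribute names "some_attr" and "some_attr_state" are just placeholders for illustration):
-----
# Minimal sketch: classify whether an attribute is <missing>, <null>,
# an empty string, or a regular value (FME 2014 SP1+).
import fmeobjects

def classifyAttribute(feature):
    # getAttributeNullMissingAndType returns (isNull, isMissing, attrType).
    isNull, isMissing, attrType = feature.getAttributeNullMissingAndType('some_attr')
    if isMissing:
        state = 'missing'
    elif isNull:
        state = 'null'
    elif feature.getAttribute('some_attr') == '':
        state = 'empty'
    else:
        state = 'value'
    feature.setAttribute('some_attr_state', state)
-----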

2014-01-25

NullAttributeMapper Use Case: Assign Value to Missing Attributes

(FME 2014 build 14234)

From this thread. > Community: Merge and Flag

Assume there are these two tables.
FileA:
Name | Postcode | DOB
Bob | NG1 1AA | 01/01/1970
Sue | WP1 1AA | 01/01/1990
FileB:
Name | Postcode | DOB
Bob | NG1 1AA | 01/01/1970
John | SW1 1AA | 01/01/1980

The question was how to merge the tables; the result table must not contain duplicate rows for the same person, and must have additional columns that indicate existence in the original tables.
The required result table should be:
Name | Postcode | DOB | FileAFlag | FileBFlag
Bob | NG1 1AA | 01/01/1970 | Yes | Yes
Sue | WP1 1AA | 01/01/1990 | Yes | No
John | SW1 1AA | 01/01/1980 | No | Yes

In the Community, an excellent solution using the FeatureMerger has been provided. If I created a workspace according to the solution, the first step for merging the tables would be like this.










Immediately after the tables are merged, every "FileBFlag" of the NotMerged features and every "FileAFlag" of the Unreferenced features are <missing>, i.e. they do not exist. To satisfy the requirement, those <missing> attributes should finally be given a specific value - "No".
The subject here is how to assign "No" to the <missing> attributes after merging.

One possible way is to use an AttributeCreator with conditional value settings. The "Attribute Is Missing" operator can be used in the settings.











There is another way in FME 2014. That is to use a NullAttributeMapper in order to assign "No" to <missing> FileAFlag and FileBFlag.










Both of them work fine, so I cannot say which way is better. But the NullAttributeMapper parameter settings look simpler than the conditional value settings in the AttributeCreator.
I would use the NullAttributeMapper if I encountered a similar requirement.

-----
It's also possible to add "FileBFlag" (= "No") to the NotMerged output and "FileAFlag" (= "No") to the Unreferenced output of the FeatureMerger using the "Edit Attribute" tool.
If I did so, neither the AttributeCreator nor the NullAttributeMapper would be necessary any longer, but the process would become harder to see at a glance. Whether that's good or not ...

2014-01-21

Edit Attribute Tool for Transformers

(FME 2014 build 14232)

I found a nice update in FME 2014: the "Edit Attribute" tool for Transformers.
This article describes the tool for Writers.
> FME 2014 Sneak Peek: Top 10 Workbench Optimizations > 5 Edit Attribute Value

But this tool can also be used for every Transformer.
=====
2014-02-07: Related documentation > Inserting Attributes to Transformer Outputs
=====
First of all, let's confirm when new attributes created by this tool will be added to the input feature: before or after processing by the transformer (pre-process or post-process)?
By analogy with the previous functionality, i.e. "Add Attribute" + "Insert Constant", it's expected that new attributes will be added after processing.














Result:







OK, indeed the new attributes have been added after processing by the transformer.

The following are practical use cases that came to mind immediately. Of course all of them can be replaced with existing transformers, but using the "Edit Attribute" tool can make the workspace more concise.

Calculate Width and Height of Bounding Box in a BoundsExtractor
-- surrogate for an AttributeCreator or two ExpressionEvaluators etc.















Extract First Element of List Attribute in an AttributeSplitter
-- surrogate for a ListIndexer etc.
















Calculate Length of Line Geometry in a VertexCreator
-- surrogate for a LengthCalculator etc.











Interesting. The "Edit Attribute" tool can be used as something like Expanding Transformer Functionality. There should be many effective use cases.
=====
The "Edit Attribute" tool looks like a pretty baby of the AttributeCreator. I love it :-)

2014-01-18

The FMEUserConferencer

The FMEUserConferencer is one of my most favorite Transformers.
I'm planning to join the "transformer". You?

Range-based Attribute Mapping with InlineQuerier

(FME 2014 build 14232)

I remembered by chance that an issue on range-based attribute mapping had been discussed in this thread. > Community: Schema mapper attribute range

Of course the AttributeRangeMapper transformer could be used if the configuration of the range-value mapping were static. However, the subject here is a case where the range-value mapping will be configured dynamically via an external table at run-time; it's unknown when the workspace is created.
I provided a solution using a Python script at that time, but I noticed that the InlineQuerier could be another solution.

Assume "Data" table contains attributes of target features as followings.
ID | Height
1 | -5
2 | 5
3 | 15
4 | 25
5 | 35

And "RangeToLevel" table defines mapping rule of height range and value. For example:
MinHeight | MaxHeight | NewValue
0 | 10 | Level1
10 | 20 | Level2
20 | 30 | Level3

Then, merge those tables using an InlineQuerier.












This is the SQL statement specified in the InlineQuerier. The point is the "cross join".
-----
select s.ID, s.Height, t.NewValue
from Data as s left outer join (select a.ID, b.NewValue
    from Data as a cross join RangeToLevel as b
    where b.MinHeight <= a.Height and a.Height < b.MaxHeight) as t on t.ID = s.ID
-----
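To see what the cross join does with the sample data: the subquery pairs every Data row with every RangeToLevel row and keeps only the pairs whose Height falls in the range, so ID 2 (Height 5) gets Level1, ID 3 (Height 15) gets Level2 and ID 4 (Height 25) gets Level3, while ID 1 (Height -5) and ID 5 (Height 35) match no range and therefore come out of the left outer join with NewValue = <null>.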
Result Table:










Here, <null>s have appeared :-)
Yes, the NullAttributeMapper transformer can be used to map every <null> to a preferable value (e.g. "Undefined Level") if necessary.
I think this is also a typical use case of the NullAttributeMapper.

But this SQL statement can also do that. There is always more than one way!
-----
select s.ID, s.Height,
case
    when t.NewValue is null then 'Undefined Level'
    else t.NewValue
end as NewValue
from Data as s left outer join (select a.ID, b.NewValue
    from Data as a cross join RangeToLevel as b
    where b.MinHeight <= a.Height and a.Height < b.MaxHeight) as t on t.ID = s.ID
=====
2014-01-19: If features whose height is out of range can be discarded, or if it's guaranteed that there are no "out of range" features, this simpler SQL statement can be used.
-----
select a.ID, a.Height, b.NewValue
from Data as a cross join RangeToLevel as b
where b.MinHeight <= a.Height and a.Height < b.MaxHeight
-----
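For reference, a PythonCaller could also do the range mapping. The following is only a minimal sketch under assumed conditions (not the script I posted in the thread): it assumes the mapping rules are supplied as a CSV file whose path is given by a hypothetical published parameter RANGE_TABLE, with columns MinHeight, MaxHeight and NewValue.
-----
# Minimal sketch: range-based attribute mapping in a PythonCaller.
# RANGE_TABLE is a hypothetical published parameter holding the path to a CSV
# file with columns MinHeight, MaxHeight and NewValue.
import csv
import fmeobjects

class RangeMapper(object):
    def __init__(self):
        # Load the mapping rules once, as (min, max, value) tuples.
        self.rules = []
        with open(FME_MacroValues['RANGE_TABLE']) as f:
            for row in csv.DictReader(f):
                self.rules.append((float(row['MinHeight']),
                                   float(row['MaxHeight']),
                                   row['NewValue']))

    def input(self, feature):
        height = float(feature.getAttribute('Height'))
        newValue = 'Undefined Level'
        for lo, hi, value in self.rules:
            if lo <= height < hi:  # same condition as the SQL above
                newValue = value
                break
        feature.setAttribute('NewValue', newValue)
        self.pyoutput(feature)

    def close(self):
        pass
-----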

2014-01-16

Sample Last N Features

(FME 2014 build 14230)

Congrats FME 2014 release!

The Sampler transformer has "First N Features" option as "Sampling Type" parameter.
Well, is it possible to sample "Last N Features"?

The first idea is to reverse the order of the input features using a Sorter, and then apply a Sampler with the "First N Features" option.
It's easy, but I'm afraid it would consume a huge amount of memory when there are very many features, because the Sorter stores all input features while processing. Also, if the input order of the features has to be preserved, it will have to be reversed again after sampling.

Creating a custom transformer using an AttributeCreator with "Multiple Feature Attribute Support" option could be a workaround.







"SAMPLING_AMOUNT" is a Published Parameter defined in the custom transformer.
Since the back ground of "Value" setting which used the parameter with the manner as shown is colored by pink, it might be an unexpected usage. But in my test, it worked fine without any error or warning. Sometimes an unexpected usage is also effective.

Test and Result:







-----
# LastNSampler PythonCaller Edition
# Using Queue module
import fmeobjects, Queue

class LastNSampler(object):
    def __init__(self):
        # Create a FIFO (First In, First Out) queue.
        self.q = Queue.Queue(int(FME_MacroValues['SAMPLING_AMOUNT']))
     
    def pop(self, flag): # flag is a value indicating sampled or not.
        feature = self.q.get()
        feature.setAttribute('_sampled', flag)
        self.pyoutput(feature)
     
    def input(self, feature):
        if self.q.full():
            self.pop('no')
        self.q.put(feature)
     
    def close(self):
        while not self.q.empty():
            self.pop('yes')
-----
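As a side note, the standard collections.deque could replace the Queue module with essentially the same logic. This is only a minimal sketch, assuming the same SAMPLING_AMOUNT published parameter and the same "_sampled" flag:
-----
# LastNSampler sketch using collections.deque instead of the Queue module.
import fmeobjects
from collections import deque

class LastNSamplerDeque(object):
    def __init__(self):
        self.n = int(FME_MacroValues['SAMPLING_AMOUNT'])
        self.buffer = deque()

    def input(self, feature):
        if len(self.buffer) >= self.n:
            # The oldest feature falls out of the window; output it as not sampled.
            dropped = self.buffer.popleft()
            dropped.setAttribute('_sampled', 'no')
            self.pyoutput(dropped)
        self.buffer.append(feature)

    def close(self):
        # The features remaining in the buffer are the last N.
        while self.buffer:
            feature = self.buffer.popleft()
            feature.setAttribute('_sampled', 'yes')
            self.pyoutput(feature)
-----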
For what it's worth...

2014-01-13

Detect Invalid DateTime String with DateFormatter

(FME 2014 Beta build 14229)

"Set Invalid Date Attributes To" parameter has been added to the DateFormatter transformer in FME 2014.
In the previous version, the DateFormetter just does nothing if a datetime string is stored in invalid format. So it was necessary to use a regular expression or similar mechanism to determine whether the format is valid or not. Now it can be determined with simpler way.

This is part of a CSV table describing daily rainfall records in Tokyo, 2013.
Note: An <empty> Rainfall indicates no rain; 0.0 indicates that it rained but the amount was less than 0.05 mm/day. These two representations have to be strictly distinguished.
Date | Rainfall
2013-04-01 | 0.0
2013-04-02 | 53.5
2013-04-03 | 60.0
2013-04-04 |
2013-04-05 | 0.0
2013-04-06 | 99.5
2013-04-07 | 9.5
2013-04-08 |
2013-04-09 |
2013-04-10 |

The correct date format is "%Y-%m-%d". Assume that I need to validate the dates and change the date format for subsequent processing - e.g. writing the records into a database table.
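(Just for comparison, in earlier versions the validation would have needed a small script. A minimal sketch using the standard library might look like this; "_date_valid" is an attribute name I made up for illustration.)
-----
# Sketch of the pre-FME 2014 approach: flag features whose "Date" attribute
# is not a valid "%Y-%m-%d" string. "_date_valid" is a made-up attribute name.
import fmeobjects
from datetime import datetime

def validateDate(feature):
    value = feature.getAttribute('Date')
    try:
        datetime.strptime(value, '%Y-%m-%d')
        feature.setAttribute('_date_valid', 'yes')
    except (TypeError, ValueError):
        feature.setAttribute('_date_valid', 'no')
-----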

I think a combination of a DateFormatter and a Tester can be used effectively in this case. That is to say, a Tester can easily detect invalid records if the "Set Invalid Date Attributes To" parameter is specified like this.
=====
2014-01-19: Of course an invalid date string can also be replaced with <null> in the DateFormatter, and the Tester can be used to determine whether an attribute value is <null>. Using <null> might be more intuitive than replacing it with a specific string value.
=====

Adding a NullAttributeMapper
When the MDB Writer receives an <empty> Rainfall, it writes <null> into the associated field of the destination table, because a non-numeric attribute value (including an empty string) cannot be converted to a decimal type value. The writer then logs a warning message like this.

MS Access Writer: 179 attribute value(s) failed conversion, and were written as NULL values

Although the destination table would be written as expected, I would insert a NullAttributeMapper between the Tester and the Writer to convert <empty> to <null> explicitly, so that other more important warnings that have to be detected will not be overlooked.
I think this is one of the general NullAttributeMapper use cases.

2014-01-11

xfMap: Generate Common Feature Type for Different Schemas

Many fundamental geographic datasets are provided by Japanese government in XML format.
> 基盤地図情報 Fundamental Geospatial Data
> 国土数値情報 National  Land Numerical Information (English)
Naturally, the XML schema for every data type is strictly defined, but in fact some datasets were created with a different (incorrect) schema.

This is a simplified example describing administrative areas in a certain year. The actual data contains more attributes and geometry elements (and of course contains Japanese characters), but those are omitted for simplicity. Assume this is the correct schema.
-----
<?xml version="1.0"?>
<Dataset>
    <AdministrativeBoundary>
        <prefectureName>Chiba</prefectureName>
        <cityName>Matsudo</cityName>
        <administrativeAreaCode>12207</administrativeAreaCode>
    </AdministrativeBoundary>
    <AdministrativeBoundary>
        <prefectureName>Chiba</prefectureName>
        <cityName>Kashiwa</cityName>
        <administrativeAreaCode>12217</administrativeAreaCode>
    </AdministrativeBoundary>
</Dataset>
-----
# I'm living in Kashiwa city :-)

This is the same area data created in a different year. The contents are the same as above, but the schema is incorrect.
-----
<?xml version="1.0"?>
<Dataset>
    <AdministrativeArea>
        <prn>Chiba</prn>
        <cn2>Matsudo</cn2>
        <acc>12207</acc>
    </AdministrativeArea>
    <AdministrativeArea>
        <prn>Chiba</prn>
        <cn2>Kashiwa</cn2>
        <acc>12217</acc>
    </AdministrativeArea>
</Dataset>
-----

I need to read both datasets as the same feature type with an XML Reader. Although these schemas are simple, the XML Reader with the "Feature Paths" option cannot generate a common feature type from different schemas. How can I do that?

I solved the issue by defining an xfMap like this.
-----
<?xml version="1.0"?>
<xfMap>
    <feature-map>
        <!-- For correct schema -->
        <mapping match="AdministrativeBoundary">
            <feature-type>
                <literal expr="AdministrativeBoundary" />
            </feature-type>
            <structure matched-prefix="no" cardinality="*/+" />
        </mapping>

        <!-- For incorrect schema -->
        <mapping match="AdministrativeArea">
            <feature-type>
                <literal expr="AdministrativeBoundary" />
            </feature-type>
            <attributes>
                <attribute>
                    <name><literal expr="prefectureName" /></name>
                    <value><extract expr="./prn" /></value>
                </attribute>
                <attribute>
                    <name><literal expr="cityName" /></name>
                    <value><extract expr="./cn2" /></value>
                </attribute>
                <attribute>
                    <name><literal expr="administrativeAreaCode" /></name>
                    <value><extract expr="./acc" /></value>
                </attribute>
            </attributes>
        </mapping>
    </feature-map>
</xfMap>
-----

The first <mapping> element is for the correct schema. Its functionality is similar to flattening the <AdministrativeBoundary> element with the "Feature Paths" option.
The second <mapping> element maps the XML element values described with the incorrect element names to the correct schema.

The XML Reader with this xfMap generates a feature type named "AdministrativeBoundary" like this (FME 2014 Beta build 14227). It can read any dataset created in either the correct schema or the incorrect schema.











I'm not sure why the government published the data with an incorrect schema, and I wonder whether nobody has noticed it. In fact, the government also provides the datasets in Esri Shape format simultaneously, so I suspect nobody uses the XML data. But some parts of the original XML data cannot be stored in Shape datasets because of limitations of that format's specification. I think it is a loss to the nation if those data cannot be used effectively.
FME can read them flexibly with no loss even though the schema is not correct. That's great.

In addition, a <schema-type> element can be added to the xfMap to define the data types of the attributes and their order explicitly. For example:
-----
<?xml version="1.0"?>
<xfMap>
    <schema-type>
        <inline>
            <schema-feature type="AdministrativeBoundary">
                <schema-attribute name="fme_geometry{0}" type="xml_no_geom" />
                <schema-attribute name="prefectureName" type="xml_buffer" />
                <schema-attribute name="cityName" type="xml_buffer" />
                <schema-attribute name="administrativeAreaCode" type="xml_buffer" />
            </schema-feature>
        </inline>
    </schema-type>
    <feature-map>
        ...
    </feature-map>
</xfMap>
-----

2014-01-10

Create Raster from Serialized Cell Values

(FME 2014 Beta build 14227)

I have a lot of DEM datasets, each of which contains grid cell values (altitudes) for a standard mesh, i.e. a rectangular area.
The mesh size and the numbers of columns and rows per mesh are fixed as follows:
- mesh width = 45 seconds
- mesh height = 30 seconds
- number of rows = 150
- number of columns = 225
Therefore the number of cells is 33750 (= 150 x 225) per mesh and the cell size is 0.2 x 0.2 seconds.
The cell values are stored in a list attribute named "_alti{}" - an array containing sequenced altitude values from the south-west cell (0, 0) to the north-east cell (149, 224). Each mesh feature also has the coordinates of its south-west corner as attributes named "_xmin" and "_ymin".

Consider creating a raster geometry for each mesh under these conditions.
My strategy is:
1) Initially create a raster whose origin is (0, 0) and cell size is (1 x 1).
2) Scale the raster by appropriate ratio.
3) Move the raster so that the origin is located at the correct coordinate.

First, extract the cell values with a ListExploder, then create the cell center points in 3D with a VertexCreator (the 2D/3DPointReplacer/Adder transformers have been integrated into it). Since the cell size of the initial raster is (1 x 1), the expressions for calculating the (x, y) of each point are simple.
Then create a raster with a NumericRasterizer. The parameter settings are also simple.
After creating the raster, recover the original attributes using a FeatureMerger.
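For example (consistent with the Python script shown later), the list element with index n corresponds to the cell center (n mod 225 + 0.5, n div 225 + 0.5) on the initial (1 x 1) grid, with the altitude value as the z coordinate.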















In the next Bookmark, scale the raster by the appropriate ratio using a Scaler and move it to the correct location using an Offsetter. Done!










Of course the raster could be created with the final cell size and location from the beginning. But in my trial, the way above was a little faster, even though it adds transformers for scaling and offsetting.
Above all, I think it's nice that the calculations and parameter settings are very simple.

P.S. I feel the "VertexCreator and NumericRasterizer" method is not so efficient, but it seems to be the only way to create a raster geometry based on serialized grid cell values using existing transformers only. I looked for a way to create a raster more efficiently by Python scripting, but unfortunately couldn't find functions to do that in the FME Objects Python API.
Note (2014-01-14): Inserting a PointCloudCombiner transformer between the VertexCreator and the NumericRasterizer could be effective for improving efficiency, as mentioned below.

=====
2014-01-11: There are examples in which "Scaler and Offsetter" method can be used effectively.
> Community: Shrink / fit an object to another
> Community: Scale by a point
> Advanced Geometric Operation: Create Normal Lines of a Line Segment
=====
2014-01-14: Considerations on improving efficiency
1) Create 3D points using a Python script
As mentioned above, the FME Objects Python API doesn't provide functions to create a raster. But 3D points can be created easily with Python, and it could also be more efficient. For example, the ListExploder and the VertexCreator can be replaced with a PythonCaller performing this script.
-----
# Example 1: Create 3D Points
import fmeobjects
class GridCellPointCreator(object):
    def __init__(self):
        pass
     
    def input(self, feature):
        id = feature.getAttribute('_feature_id')
        for i, a in enumerate(feature.getAttribute('_alti{}')):
            x, y, z = i % 225 + 0.5, i / 225 + 0.5, float(a)
            point = fmeobjects.FMEFeature()
            point.setGeometry(fmeobjects.FMEPoint(x, y, z))
            point.setAttribute('_feature_id', id)
            self.pyoutput(point)
     
    def close(self):
        pass
-----
Depending on the conditions, creating a 3D multipoint geometry could be more efficient.
-----
# Example 2: Create 3D Multi Point
# Unnecessary attributes should be removed before rasterizing.
import fmeobjects
def createGridCellPoints(feature):
    points = fmeobjects.FMEMultiPoint()
    for i, a in enumerate(feature.getAttribute('_alti{}')):
        x, y, z = i % 225 + 0.5, i / 225 + 0.5, float(a)
        points.appendPart(fmeobjects.FMEPoint(x, y, z))
    feature.setGeometry(points)
-----
2) Transform vector geometry into a point cloud before rasterizing
The NumericRasterizer can also create a raster based on a point cloud. In my trial, transforming the 3D points (or the 3D multipoint) into a point cloud using a PointCloudCombiner transformer before creating the raster was indeed more efficient. Furthermore, since the PointCloudCombiner can transform any geometry other than points into a point cloud, creating a 3D line geometry with Python could also be an option.
Dale and Chris suggested the capability of the point cloud technology to me. Thanks!
=====
2014-01-15
-----
# Example 3: Create 3D Line
import fmeobjects
def create3DLine(feature):
    coords = []
    for i, a in enumerate(feature.getAttribute('_alti{}')):
        coords.append((i % 225 + 0.5, i / 225 + 0.5, float(a)))
    feature.setGeometry(fmeobjects.FMELine(coords))
-----

2014-01-04

Null in FME 2014: Magnitude Relationship

(FME 2014 Beta build 14223)

Happy New Year!
I'll continue to explore the <null> world. My first question this year is:
Are there differences in magnitude among the "nothing" values - <null>, <empty> and <missing>?

The following image is the result of testing with the Sorter. Each input feature has an attribute named "value" which holds <null> or <empty>, or is missing the attribute (i.e. <missing>), and an "index" attribute indicates the input order.
The left table shows the input order; the right table shows the result of sorting by "value". As the image shows, the Sorter had no influence on the order of the features; it seems that there is no difference in magnitude among them, at least in sorting.













Then, I tried cases in which some features have a "non-nothing" value for the attribute.

The result of Alphabetic ascending sorting is shown in this image.
All the "nothing" values are less than any "non-nothing" value. That's as expected.













The next image shows the result of Numeric ascending sorting.
All the "nothing" values seem to be treated as equivalent to 0 in numeric sorting.













The results are similar to the result of sorting non-empty values and empty strings in FME 2013.
I think the Sorter keeps logical consistency with the previous version and with other transformers - e.g. the Tester. But the behavior, especially in numeric sorting, may look strange intuitively. There may be cases in which we should be mindful of that.

=====
2014-01-06: The help documentation on the Sorter in FME 2014 (build 14223) describes it like this.
"In Alphabetic mode, missing and null attribute values are treated as the empty string when they are sorted. In Numeric mode, non-numeric attribute values are treated as the value 0 when they are sorted. For example, missing, null, the empty string, and the string "apple" are all treated as the value 0 in Numeric mode."
=====

Tip: If it's necessary to distinguish each "nothing" type in sorting, create another attribute which represents the type of value, so that features can be sorted by the type and value.