Software engineers at the University of Kansas have developed SmartXAutofill, an intelligent data-entry assistant that predicts and automates inputs for eXtensible Markup Language (XML) and other text forms based on the contents of historical documents in the same domain. SmartXAutofill uses an ensemble classifier: a collection of internal classifiers, each of which predicts the optimal value for a particular data field. As the system operates, the ensemble learns which internal classifier works best for a given domain and adapts to that domain without the need to develop special-purpose classifiers. The ensemble has been shown to perform at least as well as the best individual internal classifier. To fill a data field, the ensemble combines its internal classifiers' predictions through a voting and weighting scheme.
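The voting-and-weighting idea can be illustrated with a short sketch. The class and method names below are hypothetical, not SmartXAutofill's actual API: each internal classifier casts a vote for a field value, votes are scaled by per-classifier weights, and the weights adapt as the system learns which classifier suits the domain.

```python
from collections import defaultdict

class WeightedVotingEnsemble:
    """Minimal sketch of a voting-and-weighting ensemble: each internal
    classifier votes for a field value, votes are weighted, and weights
    adapt toward whichever classifier fits the domain best."""

    def __init__(self, classifiers):
        # classifiers: dict of name -> callable(record) -> predicted value
        self.classifiers = classifiers
        self.weights = {name: 1.0 for name in classifiers}

    def predict(self, record):
        # Sum the weight behind each candidate value; highest total wins.
        votes = defaultdict(float)
        for name, clf in self.classifiers.items():
            votes[clf(record)] += self.weights[name]
        return max(votes, key=votes.get)

    def update(self, record, true_value, lr=0.1):
        # Reward classifiers that were right and penalize those that were
        # wrong, so the ensemble adapts without special-purpose classifiers.
        for name, clf in self.classifiers.items():
            correct = clf(record) == true_value
            self.weights[name] *= (1 + lr) if correct else (1 - lr)
```

Under this scheme the ensemble tracks its best member: a consistently correct classifier accumulates weight and dominates the vote, which is one simple way an ensemble can match the best individual classifier.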
Because the existing technology can predict, suggest, and automate data fields, the investigator tested whether the same technology can also identify incorrect data. Given existing data transmitted by sensors and other instruments, the investigator studied whether the ensemble technology can detect abnormalities, and thereby assess correctness, in future sensor data transmissions.
The solution would be applied in a project funded by the National Science Foundation, Polar Radar for Ice Sheet Measurements (PRISM), which uses innovative sensors to measure the thickness and characteristics of the ice sheets in Greenland and Antarctica, with the goal of understanding how the ice sheets are being affected by global climate change.
PRISM sensors continuously send information that is collected and catalogued. The ensemble classifier will check this data for correctness by predicting which values should be present; if the actual values differ, it will flag the data as possibly corrupted so that an operator can later examine it and determine whether it is correct. This technology will allow the PRISM intelligent systems to automatically assess the correctness of sensor and other data, and it contributes to the PRISM project by adding a level of intelligence and prediction to the sensor suite.
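A minimal sketch of this flagging step, assuming numeric readings and a stand-in moving-average predictor in place of the real ensemble (the function names and tolerance threshold are illustrative, not PRISM's implementation):

```python
def flag_suspect_readings(readings, predict, tolerance=0.05):
    """Compare each incoming value against the predicted value; flag
    readings that deviate beyond the tolerance as possibly corrupted,
    for later review by an operator."""
    flagged = []
    for i, actual in enumerate(readings):
        expected = predict(readings[:i])  # predict from the history so far
        if expected is not None and abs(actual - expected) > tolerance:
            flagged.append((i, actual, expected))
    return flagged

def moving_average_predictor(history, window=3):
    # Stand-in predictor: average of recent values. The real system
    # would substitute the ensemble classifier's prediction here.
    if len(history) < window:
        return None
    return sum(history[-window:]) / window
```

For example, a stream of steady readings interrupted by one spike would yield a flag at the spike's index, leaving the final accept/reject decision to a human operator.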
The ensemble of classification algorithms was trained and tested on all classification nodes of text collections drawn from commercial data related to the sensor data produced by the PRISM project.
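The train-and-test procedure behind per-split results of this kind resembles k-fold cross validation. A generic sketch follows; the function names are hypothetical, and the choice of k=5 mirrors the five splits S1-S5 reported for each collection, not a documented detail of the PRISM evaluation harness.

```python
def k_fold_accuracy(examples, train, predict, k=5):
    """Hold out each fold in turn, train on the remaining folds, and
    report the per-fold accuracy of the trained predictor."""
    folds = [examples[i::k] for i in range(k)]
    accuracies = []
    for i in range(k):
        held_out = folds[i]
        training = [ex for j, fold in enumerate(folds) if j != i for ex in fold]
        model = train(training)
        correct = sum(predict(model, x) == y for x, y in held_out)
        accuracies.append(correct / len(held_out))
    return accuracies
```

Averaging the returned per-fold accuracies gives a single score per classifier per collection, which is the shape of the tables that follow.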
Results
Key: 0 = Cross Validation; ! = Number of Suggestions; * = Naïve Bayes; % = KNN-1; @ = KNN-3; ^ = Ensemble

Tables 1-4: Accuracy of Suggestors and Ensemble for Various Rewards
Table 1 displays the results from the Report Generator after the ensemble of classification algorithms was trained and tested. The ensemble classifier performed as well as the best individual internal classifier, with an accuracy of approximately 90%.
CB1
Cross Validation | Suggestions | Naïve Bayes | KNN-1 | KNN-3 | Ensemble
S1               | 3           | .7241       | .9063 | .9003 | .9063
S2               | 3           | .7217       | .8953 | .8920 | .8981
S3               | 3           | .7224       | .9052 | .8990 | .9061
S4               | 3           | .7231       | .9070 | .9024 | .9072
S5               | 3           | .7228       | .9058 | .9001 | .9064
Average          |             | .7228       | .9039 | .8987 | .9048
Table 2 displays the results from the Report Generator after the ensemble of classification algorithms was trained and tested. The ensemble classifier performed as well as the best individual internal classifier, with an accuracy of approximately 70%.
CB2
Cross Validation | Suggestions | Naïve Bayes | KNN-1 | KNN-3 | Ensemble
S1               | 3           | .5345       | .7296 | .7167 | .7235
S2               | 3           | .5459       | .7304 | .7228 | .7278
S3               | 3           | .5335       | .7322 | .7219 | .7257
S4               | 3           | .5387       | .7344 | .7272 | .7296
S5               | 3           | .5393       | .7337 | .7281 | .7037
Average          |             | .5383       | .7320 | .7233 | .72066
Table 3 displays the results from the Report Generator after the ensemble of classification algorithms was trained and tested. The ensemble classifier performed as well as the best individual internal classifier, with an accuracy of approximately 60%.
CB3
Cross Validation | Suggestions | Naïve Bayes | KNN-1 | KNN-3 | Ensemble
S1               | 3           | .5128       | .6266 | .5933 | .6109
S2               | 3           | .5179       | .6341 | .6030 | .6220
S3               | 3           | .5076       | .6384 | .6020 | .6210
S4               | 3           | .5139       | .6399 | .6104 | .6265
S5               | 3           | .5056       | .6369 | .5904 | .6177
Average          |             | .5115       | .6351 | .5998 | .6196
Table 4 displays the results from the Report Generator after the ensemble of classification algorithms was trained and tested. The ensemble classifier did not perform as well as the best individual internal classifier; its average accuracy for this node was approximately 35%.
CB4
Cross Validation | Suggestions | Naïve Bayes | KNN-1 | KNN-3 | Ensemble
S1               | 3           | .3183       | .3504 | .3579 | .3485
S2               | 3           | .3179       | .3503 | .3584 | .3462
S3               | 3           | .3168       | .3509 | .3584 | .3465
S4               | 3           | .3132       | .3495 | .3576 | .3467
S5               | 3           | .3093       | .3426 | .3501 | .3376
Average          |             | .3151       | .3487 | .3564 | .3451
The ensemble classifier performed as well as the best classification algorithm in three out of four domains. Table 4 shows lower accuracy for predicting and suggesting across all classification nodes; this may result from suggestions being generated for nodes containing unique values, for which the historical documents offer little guidance.