Software engineers at the University of Kansas have developed SmartXAutofill, an intelligent data-entry assistant that predicts and automates inputs for eXtensible Markup Language (XML) and other text forms based on the contents of historical documents in the same domain. SmartXAutofill uses an ensemble classifier: a collection of internal classifiers, each of which predicts the optimal value for a particular data field. As the system operates, the ensemble learns which internal classifier works best for a given domain and adapts to that domain without the need to develop special-purpose classifiers. The ensemble has been shown to perform at least as well as the best individual internal classifier. To fill a data field, the ensemble combines its internal classifiers' predictions through a voting and weighting scheme.
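The voting-and-weighting idea can be illustrated with a short sketch. The class and method names below are hypothetical, not SmartXAutofill's actual API: each internal classifier casts a vote for a field value, votes are scaled by per-classifier weights, and the weights adapt as the system learns which classifier suits the domain.

```python
from collections import defaultdict

class WeightedVotingEnsemble:
    """Minimal sketch of a voting-and-weighting ensemble: each internal
    classifier votes for a field value, votes are weighted, and weights
    adapt toward whichever classifier fits the domain best."""

    def __init__(self, classifiers):
        # classifiers: dict of name -> callable(record) -> predicted value
        self.classifiers = classifiers
        self.weights = {name: 1.0 for name in classifiers}

    def predict(self, record):
        # Sum the weight behind each candidate value; highest total wins.
        votes = defaultdict(float)
        for name, clf in self.classifiers.items():
            votes[clf(record)] += self.weights[name]
        return max(votes, key=votes.get)

    def update(self, record, true_value, lr=0.1):
        # Reward classifiers that were right and penalize those that were
        # wrong, so the ensemble adapts without special-purpose classifiers.
        for name, clf in self.classifiers.items():
            correct = clf(record) == true_value
            self.weights[name] *= (1 + lr) if correct else (1 - lr)
```

Under this scheme the ensemble tracks its best member: a consistently correct classifier accumulates weight and dominates the vote, which is one simple way an ensemble can match the best individual classifier.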
Because the existing technology can predict, suggest, and automate data fields, the investigator tested whether the same technology can also identify incorrect data. Given existing data transmitted by sensors and other instruments, the investigator studied whether the ensemble technology can detect abnormalities, and thereby assess correctness, in future sensor data transmissions.
The solution would be applied in a project funded by the National Science Foundation, Polar Radar for Ice Sheet Measurements (PRISM), which uses innovative sensors to measure the thickness and characteristics of the ice sheets in Greenland and Antarctica, with the goal of understanding how the ice sheets are being affected by global climate change.
PRISM sensors continuously send information that is collected and catalogued. The ensemble classifier will check this data for correctness by predicting which values should be present; if the actual values differ, it will flag the data as possibly corrupted so that an operator can later examine it and determine whether it is correct. This technology will allow the PRISM intelligent systems to automatically assess the correctness of sensor and other data, and it contributes to the PRISM project by adding a level of intelligence and prediction to the sensor suite.
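A minimal sketch of this flagging step, assuming numeric readings and a stand-in moving-average predictor in place of the real ensemble (the function names and tolerance threshold are illustrative, not PRISM's implementation):

```python
def flag_suspect_readings(readings, predict, tolerance=0.05):
    """Compare each incoming value against the predicted value; flag
    readings that deviate beyond the tolerance as possibly corrupted,
    for later review by an operator."""
    flagged = []
    for i, actual in enumerate(readings):
        expected = predict(readings[:i])  # predict from the history so far
        if expected is not None and abs(actual - expected) > tolerance:
            flagged.append((i, actual, expected))
    return flagged

def moving_average_predictor(history, window=3):
    # Stand-in predictor: average of recent values. The real system
    # would substitute the ensemble classifier's prediction here.
    if len(history) < window:
        return None
    return sum(history[-window:]) / window
```

For example, a stream of steady readings interrupted by one spike would yield a flag at the spike's index, leaving the final accept/reject decision to a human operator.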
The ensemble of classification algorithms was trained and tested on all classification nodes of text collections drawn from commercial data related to the sensor data produced by the PRISM project.
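The train-and-test procedure behind per-split results of this kind resembles k-fold cross validation. A generic sketch follows; the function names are hypothetical, and the choice of k=5 mirrors the five splits S1-S5 reported for each collection, not a documented detail of the PRISM evaluation harness.

```python
def k_fold_accuracy(examples, train, predict, k=5):
    """Hold out each fold in turn, train on the remaining folds, and
    report the per-fold accuracy of the trained predictor."""
    folds = [examples[i::k] for i in range(k)]
    accuracies = []
    for i in range(k):
        held_out = folds[i]
        training = [ex for j, fold in enumerate(folds) if j != i for ex in fold]
        model = train(training)
        correct = sum(predict(model, x) == y for x, y in held_out)
        accuracies.append(correct / len(held_out))
    return accuracies
```

Averaging the returned per-fold accuracies gives a single score per classifier per collection, which is the shape of the tables that follow.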
Results
Key: 0 = Cross Validation; ! = Number of Suggestions; * = Naïve Bayes; % = KNN-1; @ = KNN-3; ^ = Ensemble

Tables 1-4: Accuracy of Suggestors and Ensemble for Various Rewards
Table 1 displays the results from the Report Generator after the ensemble of classification algorithms was trained and tested. The ensemble classifier performed as well as the best individual internal classifier, with an accuracy of approximately 90%.
CB1
Cross Validation | Suggestions | Naïve Bayes | KNN-1 | KNN-3 | Ensemble
S1               | 3           | .7241       | .9063 | .9003 | .9063
S2               | 3           | .7217       | .8953 | .8920 | .8981
S3               | 3           | .7224       | .9052 | .8990 | .9061
S4               | 3           | .7231       | .9070 | .9024 | .9072
S5               | 3           | .7228       | .9058 | .9001 | .9064
Average          |             | .7228       | .9039 | .8987 | .9048
Table 2 displays the results from the Report Generator after the ensemble of classification algorithms was trained and tested. The ensemble classifier performed as well as the best individual internal classifier, with an accuracy of approximately 70%.
CB2
Cross Validation | Suggestions | Naïve Bayes | KNN-1 | KNN-3 | Ensemble
S1               | 3           | .5345       | .7296 | .7167 | .7235
S2               | 3           | .5459       | .7304 | .7228 | .7278
S3               | 3           | .5335       | .7322 | .7219 | .7257
S4               | 3           | .5387       | .7344 | .7272 | .7296
S5               | 3           | .5393       | .7337 | .7281 | .7037
Average          |             | .5383       | .7320 | .7233 | .72066
Table 3 displays the results from the Report Generator after the ensemble of classification algorithms was trained and tested. The ensemble classifier performed as well as the best individual internal classifier, with an accuracy of approximately 60%.
CB3
Cross Validation | Suggestions | Naïve Bayes | KNN-1 | KNN-3 | Ensemble
S1               | 3           | .5128       | .6266 | .5933 | .6109
S2               | 3           | .5179       | .6341 | .6030 | .6220
S3               | 3           | .5076       | .6384 | .6020 | .6210
S4               | 3           | .5139       | .6399 | .6104 | .6265
S5               | 3           | .5056       | .6369 | .5904 | .6177
Average          |             | .5115       | .6351 | .5998 | .6196
Table 4 displays the results from the Report Generator after the ensemble of classification algorithms was trained and tested. The ensemble classifier did not perform as well as the best individual internal classifier; its average accuracy for this node was approximately 35%.
CB4
Cross Validation | Suggestions | Naïve Bayes | KNN-1 | KNN-3 | Ensemble
S1               | 3           | .3183       | .3504 | .3579 | .3485
S2               | 3           | .3179       | .3503 | .3584 | .3462
S3               | 3           | .3168       | .3509 | .3584 | .3465
S4               | 3           | .3132       | .3495 | .3576 | .3467
S5               | 3           | .3093       | .3426 | .3501 | .3376
Average          |             | .3151       | .3487 | .3564 | .3451
The ensemble classifier performed as well as the best classification algorithm in three out of four domains. Table 4 shows lower accuracy for predicting and suggesting across all classification nodes; this may result from suggestions being generated for nodes containing unique values, for which the historical documents offer little guidance.