
Using Ensemble Learning to Detect Data Abnormalities in Databases
2005 Research Experience for Undergraduates (REU)
The University of Kansas, Department of Computer Science and Electrical Engineering
Mentors: Drs. P. Gogineni, C. Tsatsoulis, and Ms. D. Lee
Personal Page   ::   Research Paper (PDF)

Poster Presentation

Software engineers at the University of Kansas have developed SmartXAutofill, an intelligent data entry assistant that predicts and automates inputs for eXtensible Markup Language (XML) and other text forms based on the contents of historical documents in the same domain. SmartXAutofill uses an ensemble classifier: a collection of internal classifiers, each of which predicts the optimum value for a particular data field. As the system operates, the ensemble learns which internal classifier works best for a particular domain and adapts to that domain without the need to develop special classifiers; it has been shown to perform at least as well as its best individual internal classifier. The ensemble combines its internal classifiers' predictions through a voting and weighting scheme to fill in each data field.
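The voting and weighting scheme described above can be sketched in a few lines. This is a minimal illustration, not the SmartXAutofill implementation: the class name, the multiplicative reward/penalty factors, and the classifier interface are all assumptions made for the example.

```python
# Hedged sketch of a weighted-voting ensemble in the spirit of SmartXAutofill.
# Each internal classifier is a callable (field, history) -> predicted value.
from collections import defaultdict


class VotingEnsemble:
    def __init__(self, classifiers):
        # All internal classifiers start with equal weight; the weights
        # adapt as the ensemble learns which classifier suits the domain.
        self.classifiers = classifiers
        self.weights = {name: 1.0 for name in classifiers}

    def predict(self, field, history):
        # Each internal classifier casts a vote for a value, weighted by
        # its track record; the value with the most weight wins.
        votes = defaultdict(float)
        for name, clf in self.classifiers.items():
            votes[clf(field, history)] += self.weights[name]
        return max(votes, key=votes.get)

    def update(self, field, history, true_value):
        # Reward classifiers that predicted the accepted value and
        # penalize the rest (an assumed multiplicative update rule).
        for name, clf in self.classifiers.items():
            if clf(field, history) == true_value:
                self.weights[name] *= 1.1
            else:
                self.weights[name] *= 0.9
```

Under this scheme the ensemble tracks whichever internal classifier is most accurate in the current domain, which is consistent with the claim that it performs at least as well as its best member.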

Because the existing technology can predict, suggest, and automate data fields, the investigator tested whether the same technology can identify incorrect data. Given existing data transmitted by sensors and other instruments, the investigator studied whether the ensemble technology can identify abnormalities and assess correctness in future sensor data transmissions. The solution would be applied in a project funded by the National Science Foundation, Polar Radar for Ice Sheet Measurements (PRISM), which uses innovative sensors to measure the thickness and characteristics of the ice sheets in Greenland and Antarctica, with the goal of understanding how the ice sheets are affected by global climate change.

PRISM sensors continuously send information that is collected and catalogued. The ensemble classifier checks the data for correctness by predicting which values should be present; if the actual values differ, it flags the data as possibly corrupted so that an operator can later study it and determine whether it is correct. This technology allows the PRISM intelligent systems to automatically assess the correctness of sensor and other data, and contributes to the PRISM project by adding a level of intelligence and prediction to the sensor suite.
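The flag-and-review loop just described amounts to comparing each incoming value with the ensemble's prediction. The sketch below shows that idea only; the record layout and the `predict` callable are assumptions for illustration, not the PRISM system's actual interface.

```python
# Hedged sketch: flag possibly corrupted readings by comparing each
# actual value with the value the ensemble predicts for that field.
def flag_anomalies(records, predict):
    """Return records whose actual value disagrees with the prediction,
    so an operator can review them later."""
    flagged = []
    for record in records:
        expected = predict(record["field"], record["context"])
        if record["value"] != expected:
            # Possibly corrupted: keep for review rather than discard.
            flagged.append(record)
    return flagged
```

Note that flagged records are retained rather than dropped, matching the design above in which a human operator makes the final correctness judgment.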

The ensemble of classification algorithms was trained and tested on all classification nodes of text collections drawn from commercial data that is analogous to the sensor data produced by the PRISM project.



Results
Key:  0 = Cross Validation    ! = Number of Suggestions    * = Naïve Bayes    % = KNN-1    @ = KNN-3    ^ = Ensemble

Tables 1-4: Accuracy of Suggestors & Ensemble for Various Rewards
 
Table 1 displays the results from the Report Generator after the ensemble of classification algorithms was trained and tested. The ensemble classifier performed as well as the best individual internal classifier, with an accuracy of about 90%.

CB1
0         !     *       %       @       ^
S1        3     .7241   .9063   .9003   .9063
S2        3     .7217   .8953   .8920   .8981
S3        3     .7224   .9052   .8990   .9061
S4        3     .7231   .9070   .9024   .9072
S5        3     .7228   .9058   .9001   .9064
Average         .7228   .9039   .8987   .9048

Table 2 displays the results from the Report Generator after the ensemble of classification algorithms was trained and tested. The ensemble classifier performed as well as the best individual internal classifier, with an accuracy of about 70%.

CB2
0         !     *       %       @       ^
S1        3     .5345   .7296   .7167   .7235
S2        3     .5459   .7304   .7228   .7278
S3        3     .5335   .7322   .7219   .7257
S4        3     .5387   .7344   .7272   .7296
S5        3     .5393   .7337   .7281   .7037
Average         .5383   .7320   .7233   .72066

 

Table 3 displays the results from the Report Generator after the ensemble of classification algorithms was trained and tested. The ensemble classifier performed as well as the best individual internal classifier, with an accuracy of about 60%.

CB3
0         !     *       %       @       ^
S1        3     .5128   .6266   .5933   .6109
S2        3     .5179   .6341   .6030   .6220
S3        3     .5076   .6384   .6020   .6210
S4        3     .5139   .6399   .6104   .6265
S5        3     .5056   .6369   .5904   .6177
Average         .5115   .6351   .5998   .6196

 

Table 4 displays the results from the Report Generator after the ensemble of classification algorithms was trained and tested. Here the ensemble classifier did not perform as well as the best individual internal classifier; it was 30% accurate for a particular node.

CB4
0         !     *       %       @       ^
S1        3     .3183   .3504   .3579   .3485
S2        3     .3179   .3503   .3584   .3462
S3        3     .3168   .3509   .3584   .3465
S4        3     .3132   .3495   .3576   .3467
S5        3     .3093   .3426   .3501   .3376
Average         .3151   .3487   .3564   .3451

 
The ensemble classifier performed as well as the best classification algorithm in three of the four domains. The Table 4 domain showed lower accuracy for predicting and suggesting across all classification nodes; this may result from suggestions being made for nodes with unique values.