
Accuracy Results

We use the same standard dataset used by PSIST [3] and by other algorithms such as ProgRESS and geometric hashing; please refer to [3] for additional details of the dataset. The dataset consists of 181 superfamilies based on the SCOP [2] classification. Each superfamily contains at least 10 protein structures, chosen so that any two proteins from the same superfamily share less than 30% sequence homology. Our database therefore consists of around 2000 proteins. The query set is a sample of 176 proteins selected randomly from these 2000 proteins (PSIST used a sample of 176 proteins, so we kept the same sample size).

Once the sample is selected, we run both our algorithm and PSIST, and classify each query by the most frequently occurring superfamily and class among the top-20 proteins ranked by each algorithm. We chose the top 20 because we also want to measure the sensitivity of both algorithms. By sensitivity we mean that increasing the number of top-ranked proteins should not affect the classification, since the classification is based on the most frequently occurring superfamily and class. We ran the experiment several times to measure the average number of correct and false classifications.

The results indicate that our algorithm achieves an average accuracy of 84.09% (superfamily) and 86.93% (class); see Table 1 for additional details. Tables 2 and 3 show the top-ranked proteins for the query protein 1c2n: for this example our algorithm gives a positive (+ve) class classification, whereas PSIST gives a false (-ve) classification for both class and superfamily. Similarly, for the protein 1hsm (Tables 4 and 5) our algorithm achieves both a positive class and a positive superfamily classification, whereas PSIST achieves only a positive class classification. These are only a few examples from individual runs made while computing the average classification accuracy for a random sample of 176 proteins from the 2000-protein database, as described above.
Please see the following URLs for the complete lists of results, at

(for the accuracy of our algorithm) and

(for the accuracy of PSIST). The reader is encouraged to verify these results; note that the dataset used for both algorithms is the same.
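The classification rule described above (take the top-K ranked proteins and assign the most frequently occurring superfamily and class) can be sketched as follows. This is a minimal illustration rather than the code used in the experiments: the function names `majority_label` and `classify_query`, the input format, and the tie-breaking behaviour (the earliest-seen label wins) are assumptions, and the sample hits are made-up values used only to show the call.

```python
from collections import Counter


def majority_label(labels):
    """Return the most frequent label in a non-empty list.

    Ties are resolved in favour of the label seen first; the text does
    not say how ties are handled, so this is an assumption.
    """
    return Counter(labels).most_common(1)[0][0]


def classify_query(hits, k=20):
    """Classify a query from its ranked hits.

    hits: list of (superfamily_id, class_id) pairs, best-ranked first.
    Returns the majority superfamily and majority class of the top k.
    """
    top = hits[:k]
    predicted_sf = majority_label([sf for sf, _ in top])
    predicted_class = majority_label([cl for _, cl in top])
    return predicted_sf, predicted_class


# Illustrative (hypothetical) ranked hits for a single query.
example_hits = [(47095, 46456), (47240, 46456), (47095, 46456), (55469, 53931)]
predicted_sf, predicted_class = classify_query(example_hits, k=20)
print(predicted_sf, predicted_class)
```

A query counts as a correct superfamily (or class) classification when the predicted identifier equals the query's own SCOP superfamily (or class) identifier.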

Table 1: Accuracy comparison between PSIST and CG_ALGO

Algorithm	Correct (SF)	Correct (Class)	Top-K	Accuracy (SF)	Accuracy (Class)
PSIST	120	129	K=20	$\frac{120}{176}=68.18\%$	$\frac{129}{176}=73.29\%$
CG_ALGO	148	153	K=20	$\frac{148}{176}=84.09\%$	$\frac{153}{176}=86.93\%$
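The accuracy columns of Table 1 are the number of correctly classified queries divided by the 176 queries in the sample. The short sketch below recomputes them from the counts in the table; the names `correct_counts` and `accuracy_percent` are illustrative only. The last digit may differ from the table depending on whether a value is rounded or truncated (for instance, 129/176 is about 73.295%).

```python
NUM_QUERIES = 176  # size of the query sample

# Correct classifications from Table 1: (superfamily, class).
correct_counts = {
    "PSIST": (120, 129),
    "CG_ALGO": (148, 153),
}


def accuracy_percent(num_correct, num_queries):
    """Percentage of queries that were classified correctly."""
    return 100.0 * num_correct / num_queries


for name, (sf_correct, class_correct) in correct_counts.items():
    sf_acc = accuracy_percent(sf_correct, NUM_QUERIES)
    class_acc = accuracy_percent(class_correct, NUM_QUERIES)
    print(f"{name}: SF {sf_acc:.2f}%, Class {class_acc:.2f}%")
```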

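The following tables list the top-ranked proteins returned by our algorithm and by PSIST for two example queries. In the Classification column, c marks a hit whose class matches that of the query, csf marks a hit whose class and superfamily both match, and a blank entry marks a hit that matches neither.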

Table 2: Top-scored proteins for query 1c2n sf(46626) cl(46456) using our algorithm

Classification	Length	Cost	pdb	SF	Class
c	44	24.128731	pdb1mbj-	46689	46456
c	34	21.989410	pdb2bby-	46785	46456
c	40	26.387238	pdb1jtb-	47699	46456
c	46	31.961258	pdb1hsn-	47095	46456
c	36	25.842737	pdb1nhm-	47095	46456
c	32	23.051079	pdb1mbe-	46689	46456
	37	26.797998	pdb2cjo-	54292	53931
c	35	25.395412	pdb1uxd-	47413	46456
c	36	26.392666	pdb1aab-	47095	46456
c	39	28.690279	pdb1mbk-	46689	46456
	40	29.537001	pdb1eot-	54117	53931
c	37	27.860340	pdb1etd-	46785	46456
c	46	35.429729	pdb1nhn-	47095	46456
	47	36.750553	pdb1e09-A	55961	53931
c	47	38.046467	pdb2new-	48695	46456
	45	36.703476	pdb1bt7-	50494	48724
c	50	41.156513	pdb1gjt-A	46997	46456
	32	26.897558	pdb4ull-	50203	48724
c	37	31.491304	pdb1a2i-	48695	46456
c	47	40.159309	pdb1wjd-B	46919	46456
c	47	40.571198	pdb1wjd-A	46919	46456
+ve class classification (46456) occurs 16 times; false sf classification

Table 3: Top-scored proteins for query 1C2N sf(46626) cl(46456) using PSIST

Classification	Score	pdb	SF	Class
csf	114	1C2N_	46626	46456
csf	84	1COT_	46626	46456
c	63	1PFRA	47240	46456
c	61	1PFRB	47240	46456
	60	1GEQB	51366	51349
c	60	2AV8A	47240	46456
	60	1GEQA	51366	51349
c	60	2AV8B	47240	46456
c	60	1AV8A	47240	46456
c	60	1AV8B	47240	46456
	59	1BL5_	53659	51349
	58	1GRP_	53659	51349
	58	1F8IA	51621	51349
	58	1D3GA	51395	51349
	58	7CEL_	49899	48724
	58	1QOQA	51366	51349
	58	1CW2A	51366	51349
	58	1D3HA	51395	51349
	58	1C8VA	51366	51349
	58	1QOPA	51366	51349
False class classification; false sf classification

Table 4: Top-scored proteins for query 1hsm sf(47095) cl(46456) using our algorithm

Classification	Length	Cost	pdb	SF	Class
csf	168	72.649147	pdb1hsn-	47095	46456
c	33	19.644470	pdb2bby-	46785	46456
	31	19.394936	pdb2cjo-	54292	53931
csf	145	95.475159	pdb1nhm-	47095	46456
c	38	26.182356	pdb1ba5-	46689	46456
c	38	28.572844	pdb1edj-	46997	46456
c	39	30.662464	pdb1wtu-B	47729	46456
csf	127	102.177414	pdb1nhn-	47095	46456
csf	107	86.437531	pdb1hmf-	47095	46456
csf	110	89.275589	pdb1hme-	47095	46456
	33	27.142843	pdb1bc6-	54862	53931
	35	29.192152	pdb1grx-	52833	51349
	42	35.985783	pdb2cjn-	54292	53931
c	37	31.731483	pdb1tnt-	46785	46456
	41	35.815174	pdb1mit-	54654	53931
csf	52	46.024429	pdb1hma-	47095	46456
c	36	32.188934	pdb1bqv-	47769	46456
c	46	41.358707	pdb1hue-A	47729	46456
c	42	38.252728	pdb1mbg-	46689	46456
c	35	31.920086	pdb1bdc-	46997	46456
	33	30.160261	pdb1svq-	55753	53931
+ve class classification (46456) occurs 15 times; +ve sf classification (47095) occurs 6 times
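The notion of sensitivity used in this section (the classification should not change as more top-ranked proteins are considered) can be checked on the Table 4 data. The sketch below recomputes the majority superfamily and class for several cutoffs K. It assumes that the rows of Table 4 are listed in rank order; the names `TABLE4_HITS`, `majority` and `prediction_at_k`, and the particular cutoffs, are illustrative choices and not part of the original experiments.

```python
from collections import Counter

# (superfamily, class) pairs copied from Table 4, in the order listed.
TABLE4_HITS = [
    (47095, 46456), (46785, 46456), (54292, 53931), (47095, 46456),
    (46689, 46456), (46997, 46456), (47729, 46456), (47095, 46456),
    (47095, 46456), (47095, 46456), (54862, 53931), (52833, 51349),
    (54292, 53931), (46785, 46456), (54654, 53931), (47095, 46456),
    (47769, 46456), (47729, 46456), (46689, 46456), (46997, 46456),
    (55753, 53931),
]
QUERY_SF, QUERY_CLASS = 47095, 46456  # labels of the query 1hsm


def majority(labels):
    """Most frequent label; ties go to the label seen first (assumption)."""
    return Counter(labels).most_common(1)[0][0]


def prediction_at_k(hits, k):
    """Majority superfamily and class among the first k hits."""
    top = hits[:k]
    return majority([sf for sf, _ in top]), majority([cl for _, cl in top])


for k in (5, 10, 15, 20):
    sf, cl = prediction_at_k(TABLE4_HITS, k)
    print(f"K={k}: sf correct={sf == QUERY_SF}, class correct={cl == QUERY_CLASS}")
```

For this query the majority superfamily and class stay equal to those of 1hsm at each of the cutoffs tried, which is the behaviour the sensitivity criterion asks for.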

Table 5: Top-scored proteins for query 1HSM sf(47095) cl(46456) using PSIST

Classification	Score	pdb	SF	Class
csf	77	1HSM_	47095	46456
csf	60	1NHN_	47095	46456
c	56	1PFRB	47240	46456
c	56	1PFRA	47240	46456
c	54	1AV8A	47240	46456
c	54	2AV8A	47240	46456
c	54	2AV8B	47240	46456
c	53	1AV8B	47240	46456
c	52	1I4ZE	47188	46456
c	52	1I4ZC	47188	46456
c	51	1I4ZD	47188	46456
c	51	1I4ZG	47188	46456
c	51	1ITF_	47266	46456
c	50	1I4ZB	47188	46456
c	50	1VLK_	47266	46456
c	50	2LBD_	48508	46456
c	50	4LBD_	48508	46456
	50	1ICRB	55469	53931
	50	1ICUB	55469	53931
	50	1ICUA	55469	53931
+ve class classification (46456) occurs 17 times; false sf classification


Vamsi Kundeti 2007-10-10