Introduction
Methods
Rank-based trees
Random Rank Forest
Boosting with the LogitBoost cost
Ensemble Algorithm with reduced dimension
Gene expression data and evaluation methods
Datasets\(^a\) | Platform | \(N\)\(^b\) | \(P\)\(^b\) | \(K\)\(^b\) | Class sample size | References |
---|---|---|---|---|---|---|
Liver | cDNA | 180 | 85 | 2 | HCC/liver = 104/76 | [41] |
CNS | Affy | 34 | 857 | 2 | CMD/DMD = 25/9 | [42] |
Glioblastoma | Affy | 22 | 1152 | 2 | CO/NO = 7/15 | [43] |
Prostate | Affy | 77 | 339 | 2 | PR/N = 58/19 | [44] |
NHL | cDNA | 42 | 1095 | 2 | DLBCL\(_1\)/DLBCL\(_2\) = 21/21 | [45] |
Breast | Affy | 49 | 1198 | 2 | ER+/ER− = 25/24 | [46] |
SRBCTs | cDNA | 83 | 1069 | 4 | BL/EWS/NB/RMS = 29/11/18/25 | [47] |
Leukemia | Affy | 72 | 2194 | 3 | MLL/ALL/AML = 24/20/28 | [48] |
Lung | Affy | 203 | 1543 | 5 | ADE/SQU/SCC/NO = 139/17/6/21/20 | [49] |
Bladder | Affy | 40 | 1203 | 3 | C1/C2/NO = 9/20/11 | [50] |
ALL | Affy | 248 | 2526 | 6 | TALL/E2A/BCR/TEL/MLL/NO = 15/27/64/20/79/43 | [51] |
TNBC | Affy & RNAseq\(^c\) | 375 | 2188 | 4 | BL1/BL2/M/LAR = 125/80/67/103 | [37] |
Other SSP methods and algorithm implementation
We implemented k-TSP with the “switchbox” R package [52], in which the optimal number of gene pairs was selected from the range 2 to 10 by fivefold cross-validation. For multi-class classification, a one-vs-one scheme was used and a classifier was trained for each pair of subclasses [53]. To avoid ties in majority voting, only odd numbers of gene pairs were considered during training. We implemented the NTP method with the “CMScaller” package [54], which was originally created for classifying colorectal cancer pre-clinical models [4, 55]. Each sample was assigned to the template with the smallest cosine distance to that sample. We used the “gbm” R package [56] to implement our boosting algorithm and the “randomForestSRC” R package [57] to implement our random forest algorithm. For the random forest implementation, we adopted the multi-class tree with class-balanced sampling instead of fitting separate one-versus-rest models for each class [27], to improve computational efficiency and prediction performance. Other classical methods are also available, such as k-nearest neighbor (KNN) and support vector machines (SVM). We do not present their results in Section 4 because the comparison was already reported in Tan et al. [21], which showed that k-TSP performs better than or comparably to KNN and SVM (see Tables 3 and 4 in Tan et al. [21]); our own experiments led to the same conclusion. We also tried random forests and boosted trees using single-gene features. Their results were similar to those of SVM and are omitted due to limited space.
Performance measures
Results
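Dataset | Random Rank Forest: # of genes\(^a\) (Mean\(^c\)) | Random Rank Forest: # of genes (SD\(^c\)) | Random Rank Forest: # of gene pairs\(^b\) (Mean) | Random Rank Forest: # of gene pairs (SD) | Boosting: # of genes (Mean) | Boosting: # of genes (SD) | Boosting: # of gene pairs (Mean) | Boosting: # of gene pairs (SD) |
---|---|---|---|---|---|---|---|---|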
Liver | 84.98 | 0.14 | 728.78 | 36.37 | 70.02 | 13.03 | 159.28 | 48.74 |
CNS | 89.60 | 3.27 | 207.18 | 18.86 | 40.82 | 10.51 | 1.54 | 3.68 |
Glioblastoma | 167.94 | 11.61 | 192.86 | 16.41 | 53.34 | 37.35 | 45.66 | 42.36 |
Prostate | 253.46 | 10.90 | 622.16 | 44.49 | 126.46 | 62.02 | 112.44 | 65.86 |
NHL | 200.56 | 12.09 | 200.80 | 20.51 | 53.78 | 50.04 | 19.54 | 27.68 |
Breast | 231.18 | 15.02 | 256.22 | 25.39 | 50.08 | 35.36 | 33.74 | 35.75 |
SRBCTs | 461.50 | 16.52 | 501.48 | 15.54 | 206.77 | 92.36 | 282.57 | 121.13 |
Leukemia | 327.90 | 18.24 | 316.14 | 23.91 | 109.88 | 49.24 | 117.00 | 49.49 |
Lung | 588.84 | 20.05 | 971.28 | 38.90 | 685.8 | 164.56 | 920.22 | 131.11 |
Bladder | 342.88 | 15.18 | 421.02 | 19.29 | 91.84 | 52.94 | 76.90 | 46.89 |
ALL | 1782.52 | 25.62 | 2923.06 | 59.08 | 379.68 | 104.62 | 407.97 | 150.75 |
TNBC | 49.00 | 0.00 | 1021.98 | 7.79 | 49.00 | 0.00 | 427.84 | 41.63 |
# | If | Then | Else |
---|---|---|---|
1 | NCOR1 > BNIP2 and DEF6 < LY6E | Liver | HCC |
2 | LSM8 > NFS1 and OLFML2B > SMAD7 and SDF2 < MAPK14 | Liver | HCC |
3 | LY6E < NMT1-PLCD3 and BNIP2 < HPGDS | Liver | HCC |
4 | LY6E < TCF4 and DEF6 < B3GNT5 | Liver | HCC |
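To make the If/Then/Else reading of the table concrete, the following is a minimal Python sketch of how a single rule could be evaluated on one sample. It assumes each rule is a plain conjunction of within-sample gene-pair comparisons. The names `RULES`, `apply_rule`, and `sample`, and all expression values, are hypothetical and chosen only for illustration. The sketch evaluates each rule on its own and does not model how the forest or the boosting procedure combines rules.

```python
from typing import Dict, List, Tuple

# One comparison is (gene_a, operator, gene_b); a rule is a conjunction of them.
Comparison = Tuple[str, str, str]
Rule = List[Comparison]

# Rules 1 and 2 transcribed from the table above.
RULES: Dict[int, Rule] = {
    1: [("NCOR1", ">", "BNIP2"), ("DEF6", "<", "LY6E")],
    2: [("LSM8", ">", "NFS1"), ("OLFML2B", ">", "SMAD7"), ("SDF2", "<", "MAPK14")],
}


def apply_rule(rule: Rule, expr: Dict[str, float],
               then_label: str = "Liver", else_label: str = "HCC") -> str:
    """Return then_label if every comparison in the rule holds, else else_label."""
    def holds(a: str, op: str, b: str) -> bool:
        if op == ">":
            return expr[a] > expr[b]
        if op == "<":
            return expr[a] < expr[b]
        raise ValueError(f"unsupported operator: {op}")

    return then_label if all(holds(a, op, b) for a, op, b in rule) else else_label


# Hypothetical expression values for one sample (illustrative, not from the paper).
sample = {
    "NCOR1": 5.2, "BNIP2": 3.1, "DEF6": 1.0, "LY6E": 4.4,
    "LSM8": 2.0, "NFS1": 2.5, "OLFML2B": 6.3, "SMAD7": 1.8,
    "SDF2": 0.9, "MAPK14": 3.3,
}

# Rescaling a sample by a positive constant keeps every within-sample ordering.
rescaled = {gene: 10.0 * value for gene, value in sample.items()}

predictions = {k: apply_rule(rule, sample) for k, rule in RULES.items()}
predictions_rescaled = {k: apply_rule(rule, rescaled) for k, rule in RULES.items()}
print(predictions)
print(predictions_rescaled)
```

The two printed dictionaries agree because each rule depends only on the ordering of gene pairs within a sample, not on the absolute expression scale.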