This is the Naive Bayes Classifier, developed by the genomic signal
processing lab led by Professor Gail Rosen at Drexel University.
It uses a method similar to the one used in many email spam filters to
score a genetic sample against different genomes and identify the
likely closest match. The method is described in the paper <>
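The scoring idea can be illustrated with a small Python sketch. This is not the code in this package (which is written in Standard ML); it is a minimal illustration that assumes add-one smoothing, and the function names and toy sequences are invented for the example.

```python
import math
from collections import Counter


def count_words(sequence, k):
    """Count every overlapping word of length k in a sequence."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def log_score(read, genome_counts, k):
    """Naive Bayes log-likelihood of a read under one genome's word counts.

    Add-one smoothing is an assumption of this sketch; it keeps words that
    never occur in the genome from driving the score to minus infinity.
    """
    total = sum(genome_counts.values())
    vocabulary = 4 ** k  # number of possible DNA words of length k
    score = 0.0
    for word, n in count_words(read, k).items():
        p = (genome_counts[word] + 1) / (total + vocabulary)
        score += n * math.log(p)
    return score


# Toy data, invented for illustration only.
genomes = {
    "Unicorn": "ACGTACGTACGTACGTACGT",
    "Wumpus": "TTGACCATTGACCATTGACC",
}
k = 3
counts = {name: count_words(seq, k) for name, seq in genomes.items()}
read = "ACGTACG"
scores = {name: log_score(read, c, k) for name, c in counts.items()}
best = max(scores, key=scores.get)  # genome with the highest log-likelihood
print(best)
```

Working in log space avoids numeric underflow when many small word probabilities are multiplied together.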
REQUIREMENTS
To compile this code you need the following (versions given are the
versions we used, but slightly older or newer versions probably work too):
MLton 20100608
GNU binutils 2.18.1
GNU C Compiler 4.3.2
GNU Make 3.81
Judy 1.0.5
zlib 1.2.3.3
It has been tested extensively on Mac OS X and Linux, on both 32-bit
and 64-bit processors. It probably works on other, similar operating
systems without any changes. The 64-bit build uses more memory but also
allows larger genomes to be used. No other differences between the 32-bit
and 64-bit builds have been observed. It may work on Windows, but that
has not been attempted.
Since it is now written in Standard ML, it may in theory be compilable
with other Standard ML compilers, such as Standard ML of New Jersey,
MLKit, PolyML, etc. We have not attempted this. Some changes would
probably be necessary, since the MLton foreign function interface (used
for Judy array and gzip support) is different from the interfaces used
by other compilers.
BUILDING
In all the example commands, the $ indicates the shell prompt. Don't type
the $, only what follows it. Most of these examples should not be typed
in verbatim (unless you happen to have the genomes for a unicorn and a
wumpus lying around - in that case, lucky you!). Instead, modify the
examples to suit your particular circumstances.
Run "make" to build:
$ make
Assuming it completes without any problems, you will have three
programs: count, score, and tabulate. Install them somewhere in your path:
$ cp count score tabulate /usr/local/bin
SETUP
The first step is to set up your genome data. Create a new directory,
for example "genomes", and inside that directory, create a directory
for each genome:
$ mkdir genomes
$ mkdir genomes/Unicorn
$ mkdir genomes/Wumpus
Then run count on the FASTA files containing the genome (and any
plasmids), once for each word size you want to score against:
$ count -w genomes/Unicorn/15perword.gz -t genomes/Unicorn/15total \
    -r 15 Unicorn.fasta Unicorn_plasmid.fasta
$ count -w genomes/Unicorn/13perword.gz -t genomes/Unicorn/13total \
    -r 13 Unicorn.fasta Unicorn_plasmid.fasta
$ count -w genomes/Wumpus/15perword.gz -t genomes/Wumpus/15total \
    -r 15 Wumpus.fasta
$ count -w genomes/Wumpus/13perword.gz -t genomes/Wumpus/13total \
    -r 13 Wumpus.fasta
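With more than a couple of genomes, typing each count command by hand gets tedious. The following Python sketch generates the same command lines as the examples above. It relies only on the -w, -t, and -r options shown there; the helper name count_command and the dictionary of FASTA files are hypothetical and should be adapted to your own data.

```python
from pathlib import Path


def count_command(genome_dir, name, order, fasta_files):
    """Build the argument list for one count run, mirroring the examples."""
    out = Path(genome_dir) / name
    return [
        "count",
        "-w", (out / f"{order}perword.gz").as_posix(),
        "-t", (out / f"{order}total").as_posix(),
        "-r", str(order),
        *fasta_files,
    ]


# Hypothetical inputs mirroring the Unicorn and Wumpus examples.
fasta = {
    "Unicorn": ["Unicorn.fasta", "Unicorn_plasmid.fasta"],
    "Wumpus": ["Wumpus.fasta"],
}
commands = [
    count_command("genomes", name, order, files)
    for name, files in fasta.items()
    for order in (15, 13)
]
for cmd in commands:
    # Dry run: only print. To execute, pass cmd to subprocess.run(cmd, check=True).
    print(" ".join(cmd))
```

Printing the commands first makes it easy to check paths and word sizes before spending time and memory on real genomes.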
SCORING
Now, run score on your input file. Order 15 usually gives the best
results so we'll try that first:
$ score -a semen_sample.fasta -r 15 -j genomes
For this example, you would get two files:
semen_sample-15-Unicorn.txt
semen_sample-15-Wumpus.txt
TABULATION
For easy import into a spreadsheet, you can run tabulate to put the
results in CSV format:
$ tabulate semen_sample-15-Unicorn.txt semen_sample-15-Wumpus.txt
This will create the files:
semen_sample-15-0.csv.gz
semen_sample-15-1.csv.gz
semen_sample-15-2.csv.gz
and so on. The exact number of files will depend on how big your input
file is.
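The output files are gzip-compressed, so they cannot be read as plain text directly. To peek at one without decompressing it to disk, a short Python sketch such as the following may help. It makes no assumption about the column layout, and the helper name head_csv_gz is invented for this example.

```python
import csv
import gzip
from itertools import islice


def head_csv_gz(path, n=5):
    """Return the first n rows of a gzip-compressed CSV file as lists of strings."""
    with gzip.open(path, "rt", newline="") as handle:
        return list(islice(csv.reader(handle), n))


# Example use (the file name follows the pattern shown above):
#   for row in head_csv_gz("semen_sample-15-0.csv.gz"):
#       print(row)
```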
FURTHER INFORMATION
Each command has a --help option that describes its usage.
BUGS
count and score load the entire genome into memory. For large genomes this
requires a stupendous amount of memory.
LICENSE
This software is licensed under the <> license.
See the LICENSE file for details.
FEEDBACK
Any feedback should be directed to gailr@gmail.com.