From b632667ce57af89691407bb8668e1512775278ae Mon Sep 17 00:00:00 2001
From: Calvin
Date: Fri, 15 Mar 2013 15:26:20 -0400
Subject: nbc added

---
 src/nbc/README | 114 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 114 insertions(+)
 create mode 100644 src/nbc/README

(limited to 'src/nbc/README')

diff --git a/src/nbc/README b/src/nbc/README
new file mode 100644
index 0000000..d2a3688
--- /dev/null
+++ b/src/nbc/README
@@ -0,0 +1,114 @@
+This is the Naive Bayes Classifier, developed by the genomic signal
+processing lab led by Professor Gail Rosen at Drexel University.
+
+It uses a method similar to that used in many email spam filters to
+score a genetic sample against different genomes, to possibly identify
+the closest match. The method is described in the paper <>
+
+REQUIREMENTS
+
+To compile this code you need the following (versions given are the
+versions we used, but slightly older or newer versions probably work too):
+
+MLton 20100608
+GNU binutils 2.18.1
+GNU C Compiler 4.3.2
+GNU Make 3.81
+Judy 1.0.5
+zlib 1.2.3.3
+
+It has been tested extensively on Mac OS X and Linux, on both 32-bit
+and 64-bit processors. It probably works on other, similar operating
+systems without any changes. 64-bit uses more memory but also allows
+larger genomes to be used. No other differences between the 32-bit and
+64-bit versions have been observed. It may work on Windows, but that has
+not been attempted.
+
+Since it is now written in Standard ML, it may in theory be compilable
+with other Standard ML compilers, such as Standard ML of New Jersey,
+MLKit, PolyML, etc. We have not attempted this. Some changes would
+probably be necessary since the MLton foreign function interface (used
+for judy array and gzip support) is different from the interface used
+by other compilers.
+
+BUILDING
+
+For all the example commands, the $ indicates the shell prompt. Don't type
+the $, just everything after the $.
+Most of these examples should not be typed in verbatim (unless you happen
+to have the genomes for a unicorn and a wumpus lying around, in which case,
+lucky you!). Instead, modify them to suit your particular circumstances.
+
+Run "make" to build:
+
+$ make
+
+Assuming it completes without any problems, you will have three
+programs: count, score, and tabulate. Install them somewhere in your path:
+
+$ cp count score tabulate /usr/local/bin
+
+SETUP
+
+The first step is to set up your genome data. Create a new directory,
+for example "genomes", and inside that directory, create a directory
+for each genome:
+
+$ mkdir genomes
+$ mkdir genomes/Unicorn
+$ mkdir genomes/Wumpus
+
+Then run count on the FASTA files containing the genome (and any
+plasmids), once for each word size you want to score against:
+
+$ count -w genomes/Unicorn/15perword.gz -t genomes/Unicorn/15total \
+    -r 15 Unicorn.fasta Unicorn_plasmid.fasta
+$ count -w genomes/Unicorn/13perword.gz -t genomes/Unicorn/13total \
+    -r 13 Unicorn.fasta Unicorn_plasmid.fasta
+$ count -w genomes/Wumpus/15perword.gz -t genomes/Wumpus/15total \
+    -r 15 Wumpus.fasta
+$ count -w genomes/Wumpus/13perword.gz -t genomes/Wumpus/13total \
+    -r 13 Wumpus.fasta
+
+SCORING
+
+Now, run score on your input file. Order 15 usually gives the best
+results, so we'll try that first:
+
+$ score -a semen_sample.fasta -r 15 -j genomes
+
+For this example, you would get two files:
+    semen_sample-15-Unicorn.txt
+    semen_sample-15-Wumpus.txt
+
+TABULATION
+
+For easy import into a spreadsheet, you can run tabulate to put the
+results in CSV format:
+
+$ tabulate semen_sample-15-Unicorn.txt semen_sample-15-Wumpus.txt
+
+This will create the files:
+    semen_sample-15-0.csv.gz
+    semen_sample-15-1.csv.gz
+    semen_sample-15-2.csv.gz
+and so on. The exact number of files will depend on how big your input
+file is.
+
+FURTHER INFORMATION
+
+Each command has a --help option, which may be helpful.
+
+BUGS
+
+count and score load the entire genome into memory. For large genomes this
+requires a stupendous amount of memory.
+
+LICENSE
+
+It has been licensed under the <> license.
+See the LICENSE file for details.
+
+FEEDBACK
+
+Any feedback should be directed to gailr@gmail.com.
--
cgit v1.2.3
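
The count invocations in the README's SETUP section follow a regular pattern: one directory per genome, and one perword/total pair per word size. As a rough sketch of that pattern, the shell function below prints the corresponding commands instead of running them. The function name print_count_cmds is hypothetical, and the sketch assumes exactly one <Genome>.fasta file per genome in the current directory, so extra files such as plasmids would have to be added by hand. Only the -w, -t and -r options shown in the README are used.

```shell
#!/bin/sh
# Sketch only: prints (does not run) the SETUP commands from the README
# for several genomes and word sizes.
# Assumptions not taken from the README: the function name, and the
# layout of one <Genome>.fasta file per genome in the current directory.

# Usage: print_count_cmds ORDERS GENOME...
#   ORDERS is a space-separated list of word sizes, for example "15 13".
print_count_cmds() {
    orders=$1
    shift
    for genome in "$@"; do
        # One directory per genome, as in the README's mkdir examples.
        printf 'mkdir -p genomes/%s\n' "$genome"
        # $orders is left unquoted on purpose so it splits into words.
        for order in $orders; do
            printf 'count -w genomes/%s/%sperword.gz -t genomes/%s/%stotal -r %s %s.fasta\n' \
                "$genome" "$order" "$genome" "$order" "$order" "$genome"
        done
    done
}
```

Calling print_count_cmds "15 13" Unicorn Wumpus prints six lines: a mkdir and two count commands per genome. Printing rather than executing keeps the sketch safe to try; the output can be reviewed and then piped to sh once it looks right.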