Diffstat (limited to 'src')
-rw-r--r--  src/nbc/README  115
1 files changed, 2 insertions, 113 deletions
diff --git a/src/nbc/README b/src/nbc/README
index d2a3688..c4d7853 100644
--- a/src/nbc/README
+++ b/src/nbc/README
@@ -1,114 +1,3 @@
-This is the Naive Bayes Classifier, developed by the genomic signal
-processing lab led by Professor Gail Rosen at Drexel University.
+This folder contains code to count kmers, from the Naive Bayesian Classifier:
 
-It uses a method similar to that used in many email spam filters to
-score a genetic sample against different genomes, to possibly identify
-the closest match. The method is described in the paper <>
-
-REQUIREMENTS
-
-To compile this code you need the following (versions given are the
-versions we used, but slightly older or newer versions probably work too):
-
-MLton 20100608
-GNU binutils 2.18.1
-GNU C Compiler 4.3.2
-GNU Make 3.81
-Judy 1.0.5
-zlib 1.2.3.3
-
-It has been tested extensively on Mac OS X and Linux, on both 32-bit
-and 64-bit processors. It probably works on other, similar operating
-systems without any changes. 64-bit uses more memory but also allows
-larger genomes to be used. No other differences between the 32-bit and
-64-bit versions have been observed. It may work on Windows, but that has
-not been attempted.
-
-Since it is now written in Standard ML, it may in theory be compilable
-with other Standard ML compilers, such as Standard ML of New Jersey,
-MLKit, PolyML, etc. We have not attempted this. Some changes would
-probably be necessary since the MLton foreign function interface (used
-for judy array and gzip support) is different from the interface used
-by other compilers.
-
-BUILDING
-
-For all the example commands, the $ indicates the shell prompt. Don't type
-the $, just everything after the $. And most of these examples should
-not be typed in verbatim (unless you happen to have the genomes for a
-unicorn and a wumpus lying around - in that case, lucky you!). Instead
-modify the examples to suit your particular circumstances.
-
-Run "make" to build:
-
-$ make
-
-Assuming it completes without any problems, you will have three
-programs: count, score, and tabulate. Install them somewhere in your path:
-
-$ cp count score tabulate /usr/local/bin
-
-SETUP
-
-The first step is to set up your genome data. Create a new directory,
-for example "genomes", and inside that directory, create a directory
-for each genome:
-
-$ mkdir genomes
-$ mkdir genomes/Unicorn
-$ mkdir genomes/Wumpus
-
-Then you run count on the FASTA files containing the genome (and any
-plasmids), for each word size you want to score against:
-
-$ count -w genomes/Unicorn/15perword.gz -t genomes/Unicorn/15total \
-	-r 15 Unicorn.fasta Unicorn_plasmid.fasta
-$ count -w genomes/Unicorn/13perword.gz -t genomes/Unicorn/13total \
-	-r 13 Unicorn.fasta Unicorn_plasmid.fasta
-$ count -w genomes/Wumpus/15perword.gz -t genomes/Wumpus/15total \
-	-r 15 Wumpus.fasta
-$ count -w genomes/Wumpus/13perword.gz -t genomes/Wumpus/13total \
-	-r 13 Wumpus.fasta
-
-SCORING
-
-Now, run score on your input file. Order 15 usually gives the best
-results so we'll try that first:
-
-$ score -a semen_sample.fasta -r 15 -j genomes
-
-For this example, you would get two files:
-	semen_sample-15-Unicorn.txt
-	semen_sample-15-Wumpus.txt
-
-TABULATION
-
-For easy import into a spreadsheet, you can run tabulate to put it in
-CSV format:
-
-$ tabulate semen_sample-15-Unicorn.txt semen_sample-15-Wumpus.txt
-
-This will create the files:
-	semen_sample-15-0.csv.gz
-	semen_sample-15-1.csv.gz
-	semen_sample-15-2.csv.gz
-and so on. The exact number of files will depend on how big your input
-file is.
-
-FURTHER INFORMATION
-
-Each command has a --help option, which may be helpful.
-
-BUGS
-
-count and score load the entire genome into memory. For large genomes this
-requires a stupendous amount of memory.
-
-LICENSE
-
-It has been licensed under the <> license.
-See the LICENSE file for details.
-
-FEEDBACK
-
-Any feedback should be directed to gailr@gmail.com.
+http://nbc.ece.drexel.edu/
