Field | Value | Date |
---|---|---|
author | Calvin <calvin@EESI> | 2013-05-28 10:42:44 -0400 |
committer | Calvin <calvin@EESI> | 2013-05-28 10:42:44 -0400 |
commit | 2f33e34ae06b96c3f3e4456ce960172903f60bfb | |
tree | 04f46783a85dbbab551b4f249b613c51c4ca2a7a /src | |
parent | c5fe7c606d746275bfaf168748259ca151fb4434 | |
removed unused nbc code
Diffstat (limited to 'src')
-rw-r--r-- | src/nbc/README | 115 |
1 files changed, 2 insertions, 113 deletions
diff --git a/src/nbc/README b/src/nbc/README
index d2a3688..c4d7853 100644
--- a/src/nbc/README
+++ b/src/nbc/README
@@ -1,114 +1,3 @@
-This is the Naive Bayes Classifier, developed by the genomic signal
-processing lab led by Professor Gail Rosen at Drexel University.
+This folder contains code to count kmers, from the Naive Bayesian Classifier:
 
-It uses a method similar to that used in many email spam filters to
-score a genetic sample against different genomes, to possibly identify
-the closest match. The method is described in the paper <>
-
-REQUIREMENTS
-
-To compile this code you need the following (versions given are the
-versions we used, but slightly older or newer versions probably work too):
-
-MLton 20100608
-GNU binutils 2.18.1
-GNU C Compiler 4.3.2
-GNU Make 3.81
-Judy 1.0.5
-zlib 1.2.3.3
-
-It has been tested extensively on Mac OS X and Linux, on both 32-bit
-and 64-bit processors. It probably works on other, similar operating
-systems without any changes. 64-bit uses more memory but also allows
-larger genomes to be used. No other differences between the 32-bit and
-64-bit versions have been observed. It may work on Windows, but that has
-not been attempted.
-
-Since it is now written in Standard ML, it may in theory be compilable
-with other Standard ML compilers, such as Standard ML of New Jersey,
-MLKit, PolyML, etc. We have not attempted this. Some changes would
-probably be necessary since the MLton foreign function interface (used
-for judy array and gzip support) is different from the interface used
-by other compilers.
-
-BUILDING
-
-For all the example commands, the $ indicates the shell prompt. Don't type
-the $, just everything after the $. And most of these examples should
-not be typed in verbatim (unless you happen to have the genomes for a
-unicorn and a wumpus lying around - in that case, lucky you!). Instead
-modify the examples to suit your particular circumstances.
-
-Run "make" to build:
-
-$ make
-
-Assuming it completes without any problems, you will have three
-programs: count, score, and tabulate. Install them somewhere in your path:
-
-$ cp count score tabulate /usr/local/bin
-
-SETUP
-
-The first step is to set up your genome data. Create a new directory,
-for example "genomes", and inside that directory, create a directory
-for each genome:
-
-$ mkdir genomes
-$ mkdir genomes/Unicorn
-$ mkdir genomes/Wumpus
-
-Then you run count on the FASTA files containing the genome (and any
-plasmids), for each word size you want to score against:
-
-$ count -w genomes/Unicorn/15perword.gz -t genomes/Unicorn/15total \
- -r 15 Unicorn.fasta Unicorn_plasmid.fasta
-$ count -w genomes/Unicorn/13perword.gz -t genomes/Unicorn/13total \
- -r 13 Unicorn.fasta Unicorn_plasmid.fasta
-$ count -w genomes/Wumpus/15perword.gz -t genomes/Wumpus/15total \
- -r 15 Wumpus.fasta
-$ count -w genomes/Wumpus/13perword.gz -t genomes/Wumpus/13total \
- -r 13 Wumpus.fasta
-
-SCORING
-
-Now, run score on your input file. Order 15 usually gives the best
-results so we'll try that first:
-
-$ score -a semen_sample.fasta -r 15 -j genomes
-
-For this example, you would get two files:
- semen_sample-15-Unicorn.txt
- semen_sample-15-Wumpus.txt
-
-TABULATION
-
-For easy import into a spreadsheet, you can run tabulate to put it in
-CSV format:
-
-$ tabulate semen_sample-15-Unicorn.txt semen_sample-15-Wumpus.txt
-
-This will create the files:
- semen_sample-15-0.csv.gz
- semen_sample-15-1.csv.gz
- semen_sample-15-2.csv.gz
-and so on. The exact number of files will depend on how big your input
-file is.
-
-FURTHER INFORMATION
-
-Each command has a --help option, which may be helpful.
-
-BUGS
-
-count and score load the entire genome into memory. For large genomes this
-requires a stupendous amount of memory.
-
-LICENSE
-
-It has been licensed under the <> license.
-See the LICENSE file for details.
-
-FEEDBACK
-
-Any feedback should be directed to gailr@gmail.com.
+http://nbc.ece.drexel.edu/