-rw-r--r-- CHANGELOG 6
-rw-r--r-- src/nbc/README 115
2 files changed, 7 insertions, 114 deletions
diff --git a/CHANGELOG b/CHANGELOG
index ff8cdfb..9768f10 100644
--- a/CHANGELOG
+++ b/CHANGELOG
@@ -5,7 +5,11 @@ May 16th 2013, 1.0.1 Released
 - remove old multifasta_to_otu matlab scripts
 - minor documentation changes
 
-May 20th 2013, 1.0.1 Released
+May 20th 2013, 1.0.2 Released
 - correted usage strings
 - install manpages globally (to {PREFIX}/share/man/man1)
 - added native gzip decoding for quikr and multifasta_to_otu
+
+Current Development Branch
+- Licensing files and documentation added
+- Removed unused code from the NBC project
diff --git a/src/nbc/README b/src/nbc/README
index d2a3688..c4d7853 100644
--- a/src/nbc/README
+++ b/src/nbc/README
@@ -1,114 +1,3 @@
-This is the Naive Bayes Classifier, developed by the genomic signal
-processing lab led by Professor Gail Rosen at Drexel University.
+This folder contains code to count kmers, from the Naive Bayesian Classifier:
-It uses a method similar to that used in many email spam filters to
-score a genetic sample against different genomes, to possibly identify
-the closest match. The method is described in the paper <>
-
-REQUIREMENTS
-
-To compile this code you need the following (versions given are the
-versions we used, but slightly older or newer versions probably work too):
-
-MLton 20100608
-GNU binutils 2.18.1
-GNU C Compiler 4.3.2
-GNU Make 3.81
-Judy 1.0.5
-zlib 1.2.3.3
-
-It has been tested extensively on Mac OS X and Linux, on both 32-bit
-and 64-bit processors. It probably works on other, similar operating
-systems without any changes. 64-bit uses more memory but also allows
-larger genomes to be used. No other differences between the 32-bit and
-64-bit versions have been observed. It may work on Windows, but that has
-not been attempted.
-
-Since it is now written in Standard ML, it may in theory be compilable
-with other Standard ML compilers, such as Standard ML of New Jersey,
-MLKit, PolyML, etc. We have not attempted this. Some changes would
-probably be necessary since the MLton foreign function interface (used
-for judy array and gzip support) is different from the interface used
-by other compilers.
-
-BUILDING
-
-For all the example commands, the $ indicates the shell prompt. Don't type
-the $, just everything after the $. And most of these examples should
-not be typed in verbatim (unless you happen to have the genomes for a
-unicorn and a wumpus lying around - in that case, lucky you!). Instead
-modify the examples to suit your particular circumstances.
-
-Run "make" to build:
-
-$ make
-
-Assuming it completes without any problems, you will have three
-programs: count, score, and tabulate. Install them somewhere in your path:
-
-$ cp count score tabulate /usr/local/bin
-
-SETUP
-
-The first step is to set up your genome data. Create a new directory,
-for example "genomes", and inside that directory, create a directory
-for each genome:
-
-$ mkdir genomes
-$ mkdir genomes/Unicorn
-$ mkdir genomes/Wumpus
-
-Then you run count on the FASTA files containing the genome (and any
-plasmids), for each word size you want to score against:
-
-$ count -w genomes/Unicorn/15perword.gz -t genomes/Unicorn/15total \
- -r 15 Unicorn.fasta Unicorn_plasmid.fasta
-$ count -w genomes/Unicorn/13perword.gz -t genomes/Unicorn/13total \
- -r 13 Unicorn.fasta Unicorn_plasmid.fasta
-$ count -w genomes/Wumpus/15perword.gz -t genomes/Wumpus/15total \
- -r 15 Wumpus.fasta
-$ count -w genomes/Wumpus/13perword.gz -t genomes/Wumpus/13total \
- -r 13 Wumpus.fasta
-
-SCORING
-
-Now, run score on your input file. Order 15 usually gives the best
-results so we'll try that first:
-
-$ score -a semen_sample.fasta -r 15 -j genomes
-
-For this example, you would get two files:
- semen_sample-15-Unicorn.txt
- semen_sample-15-Wumpus.txt
-
-TABULATION
-
-For easy import into a spreadsheet, you can run tabulate to put it in
-CSV format:
-
-$ tabulate semen_sample-15-Unicorn.txt semen_sample-15-Wumpus.txt
-
-This will create the files:
- semen_sample-15-0.csv.gz
- semen_sample-15-1.csv.gz
- semen_sample-15-2.csv.gz
-and so on. The exact number of files will depend on how big your input
-file is.
-
-FURTHER INFORMATION
-
-Each command has a --help option, which may be helpful.
-
-BUGS
-
-count and score load the entire genome into memory. For large genomes this
-requires a stupendous amount of memory.
-
-LICENSE
-
-It has been licensed under the <> license.
-See the LICENSE file for details.
-
-FEEDBACK
-
-Any feedback should be directed to gailr@gmail.com.
+http://nbc.ece.drexel.edu/
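
Note on the retained code: the new README says this folder now only counts kmers. As a rough illustration of what that means, here is a minimal Python sketch. It is not the NBC implementation (which the removed README describes as Standard ML with Judy arrays); the function names `read_fasta` and `count_kmers` are hypothetical, and skipping words that contain bases other than A, C, G, T is an assumption. It mirrors the removed usage text only loosely: a word size (the `-r` value), per-word counts, and a total.

```python
from collections import Counter


def read_fasta(lines):
    """Yield each sequence from FASTA-formatted lines, joining wrapped lines."""
    chunks = []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            # A header line starts a new record; emit the previous one, if any.
            if chunks:
                yield "".join(chunks)
            chunks = []
        elif line:
            chunks.append(line)
    if chunks:
        yield "".join(chunks)


def count_kmers(sequences, k):
    """Return (per-word counts, total word count) for all length-k words."""
    counts = Counter()
    valid = set("ACGT")
    for seq in sequences:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            word = seq[i:i + k]
            # Assumption: ignore words containing ambiguous bases such as N.
            if set(word) <= valid:
                counts[word] += 1
    return counts, sum(counts.values())


if __name__ == "__main__":
    example = [">unicorn", "ACGTAC", "GT"]
    per_word, total = count_kmers(read_fasta(example), 3)
    print(per_word.most_common(), total)
```

Like the tools described in the removed BUGS section, this sketch holds every count in memory, so it is only suitable for small inputs.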