README.md


1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299

SelectiveWholeGenomeAmplification
============================

SWGA is a tool for choosing primers for the selective amplification of a
target genome from a sample containing a mixture of target and contaminating
DNA (i.e. pathogen genome from infected host blood) [cite relevant paper]. It
does so by identifying short, recurring motifs in a target sequence file and
scoring sets of motifs based on selectivity for and even distribution in the
target sequence against a background sequence file.

PI: http://brisson.bio.upenn.edu/

## Table of Contents

* [Requirements](#requirements)
* [Setup](#setup)
* [Example Usage](#example-usage)
  * [SWGA User Interface](#sga-user-interface)
  * [Setting Tunable Parameters](#setting-tunable-parameters)
  * [Running Individual Steps](#running-individual-steps)
  * [Manually Scoring Specific Mer Combinations From List ](#manually-scoring-specific-mer-combinations-from-list)
  * [Manually Score All Combinations From List](#manually-score-all-combinations-from-list)
  * [Manually Rescore All Combinations From Previously Scored File](#manually-rescore-all-combinations-from-previously-scored-file)
* [Table of Tunable Parameters](#tunable-parameters)
* [Equations](#equations)
  * [Mer Selectivity](#mer-selectivity)
  * [Scoring Combinations](#score-combinations)
    * [Default Scoring Function](#default-scoring-function)
    * [Custom Scoring Function](#custom-scoring-function)
* [Filters](#filters)
* [Output](#output)
  * [Select Mers](#select_merspy-output)
  * [Score Mers](#score_merspy-output)
* [Post Processing](#post-processing)

## Requirements
To use this you'll need:

 - A Unix environment
 	- GNU/Linux works out of the box. Debian + SUSE tested.
	- Cygwin has been tested. gcc and python are required, might need to run rebasall
	- OS/X support. not yet!	
 - [dna-utils](http://github.com/mutantturkey/dna-utils/)
 - bash or compliant shell.
 - python 2.7.x
 
## Setup

    git clone git@github.com:mutantturkey/SelectiveWholeGenomeAmplification.git
    cd SelectiveWholeGenomeAmplification
    make
    sudo make install

## Example Usage
Standard use of (SGA) SelectiveWholeGenomeAmplification is easy. it takes two arguments,
the foreground and background


    SelectiveWholeGenomeAmplification PfalciparumGenome.fasta HumanGenome.fasta;
    less PfalciparumGenome_HumanGenome/final_mers

### SWGA User Interface
SWGA also comes with a easy to use user prompt called SelectiveWholeGenomeAmplificationUI.
It allows for a less experienced user to use SWGA without issue. to run this
all you need to do is run SelectiveGenomeAmiplifcationUI and you'll see a
series of prompts asking the user about tunables like below

    Where would you like to temporary files to be stored? (Default=$output_directory/.tmp): 
    Where would you like to count files to be stored? (Default=$output_directory/.tmp): 
    maximum mer size you would like to pick? (Default=12): 10
    minimum mer size you would like to pick? (Default=6): 7
    eliminate mers that appear less frequently on average than this number ? (Default=50000): 25000
    .....
    Input the path to your foreground file:target.fa  
    Input the path to your background file:humangenome.fa 
    Would you like to output your inserted variables to a string you can later paste? (Y/N/Default=y): n
    Run SelectiveWholeGenomeAmplification? (Y/N/Default=y): y

### Setting Tunable Parameters

SGA allows for many tunable parameters, which are all explained in the chart
below.  For user customizable variables, they need to be passed in as
environmental variables like so:

    max_mer_distance=5000 max_select=6 min_mer_range=6 max_mer_range=12 \
    SelectiveWholeGenomeAmplification.sh PfalciparumGenome.fasta half.fasta 


### Running individual steps

By default SelectiveWholeGenomeAmplification runs all four steps, but you can
specify the program to run other steps, like in these examples.

    current_run=run_1 SelectiveWholeGenomeAmplification target.fasta bg.fasta score

    current_run=run_1 SelectiveWholeGenomeAmplification target.fasta bg.fasta select score

    current_run=run_1 SelectiveWholeGenomeAmplification target.fasta bg.fasta 3 4 

valid steps are these:

- count (1)
- filter (2)
- select (3)
- score (4)

This function does not try to be smart, so use it wisely.

### Manually scoring specific mer combinations from list

Users can manually score combinations of mers they choose using the
score\_mers.py script.

    score_mers.py -f foreground.fa -b background.fa -c combination file -o output


The combination file should look like this:

    ACGATATAT TACATAGA TATATATAT ACGTACCAT ATATTA
    AAATTATCAGT ATACATA ATATACAT ATATACATA ACATA
    ATATACATA ATCATGATA CCAGATACATAT

each row is combination to be scored.


### Manually score all combinations from list

Users can manually score all  combinations of mers they choose using the
score\_mers.py script.

    score_mers.py -f foreground.fa -b background.fa -m mer file -o output


The mer file should look like this:

    ATATAT
    TACATA
    TACATAGCA
    TATAGAATAC
    CGTAGATA
    TAGAAT

each row is a separate mer. do not put multiple mers on one line.

### Manually rescore all combinations from previously scored file

Users can manually rescore all combinations of mers they previously used in the
score\_mers.py script. This allows users to test different score functions easily
with the same combinations.

An example would be this:

    score_func=nb_primers**2 score_mers.py -f fg.fa -b bg.fa -r fg_bg/run_1/all-scores -o primers_squared_scores


## Tunable Parameters

variable | default | notes
:---- | :---- | ---- | :----
current\_run | Not Enabled | specify the run you want to run steps on
min\_mer\_range | 6  | minimum mer size to use
max\_mer\_range | 12 | maximum mer size to use 
max\_mer\_distance | 5000 | maximum distance between mers in foreground
min\_melting\_temp | 0° | minimum melting temp of mers
max\_melting\_temp | 30° | maximum melting temp of mers
min\_foreground\_binding\_average | 50000 | eliminate mers that appear less frequently than the average  (length of foreground / # of occurrances)
min\_bg\_ratio | Not Enabled | eliminate mers where the background ration is less than the minimum
ignore\_mers | Not Enabled | mers to explicitly ignore, space separated ex. ignore\_mers="ACAGTA ACCATAA ATATATAT"
ignore\_all\_mers\_from\_files | Not Enabled | ignore any mers found in these files. space separated.
output\_directory | $foreground\_$background/ | ex. if fg is Bacillus.fasta and  bg is HumanGenome.fasta then folder would be $PWD/Bacillus.fasta\_HumanGenome\_output.fasta/
counts\_directory | $output\_directory/.tmp | directory for counts directory
tmp\_directory | $output\_directory/.tmp | temporary files directory
max\_select | 15 | maximum number of mers to pick
max\_check | 35  | maximum number of mers to select (check the top #)
foreground | Not Enabled | path of foreground file
background | Not Enabled | path of background file
max\_consecutive\_binding | 4 | The maximum number of consecutive binding nucleotides in homodimer and heterodimers
fg\_weight | 0 | How much extra weight to give higher frequency mers in fg. see "equations" (between 0 and 1)
primer\_weight | 0 | How much extra weight to give to sets with a higher number of primers. (between 0 and 1)
output\_top\_nb | 10000 | How many scores do you want to output in your sorted output file?
score\_func | Not Enabled | see the [custom scoring](#custom-scoring-function) section
sort\_by | min | How do you want to rank top-scores? min means smaller is better, max is larger. 'min' or 'max'

## Equations

Here's what we are using to determine our scoring and selectivity

### Mer Selectivity

Our selectivity is what we use to determine what top $max\_check mers are checked later
on in our scoring function. Currently we use this formula:

By default our fg\_weight is zero. This gives no extra weight to more
frequently occurring mers, but can be set higher with the fg\_weight
environmental variable if you wish to do so.

    hit = abundance of primer X (ex. 'ATGTA') in background

    (foreground hit / background hit) * (foreground hit ^ fg_weight)


### Scoring combinations 

All variables used in our scoring function are described here:

    fg_pts = an array of all the points of each mer in the combination, and sequence ends
    fg_mean_dist = mean distance between each point in fg_pts
    fg_stddev = standard deviation of distance between each point in fg_pts

    nb_primers = number of primers in a combination
    primer_weight = extra weight for sets with higher primers

    bg_ratio = length of background / number of times primer was in background

#### Default scoring function

The default scoring function is this:

    mer_score = (nb_primers**primer_weight) * (fg_mean_dist * fg_std_dist) / bg_ratio

#### Custom scoring function

We support custom scoring via python's exec methods. This means that you can
destroy your system, blow up the universe, implode your hard drive, all within
the confines of this exec. That means don't do anything crazy. Stick to basic arithmetic.
 
This is a security hole.

you can specify it like any other parameter like so:

    # the default function
    score_func="(nb_primers**primer_weight) * (fg_mean_dist * fg_std_dist) / bg_ratio"

You need to use **valid** python code. 

## Filters

There are several filters that our mers go through, to eliminate ones that
won't fit our needs. They are all configurable via the tunable parameters. If
you look in a output directory, you'll see a folder called "passes-filter".
This contains a file for each of the different steps in the pipeline, and the
contents of each file is what 'passes' that filter.

For example, if you ignored the mer 'AAAAA', then in
passes-filter/1-$foreground-ignore-mers there would be no line containing that.

The filter system works like a big pipe, whatever gets filtered out won't make
it to the next step. the order is like this


    All mers -> ignore_mers -> ignore_all_mers -> average_binding -> non_melting -> consecutive_binding


## Output

The file structure outputted by default is this:

    $foreground_$background
    └── run_1 # current_run
        ├── passes-filter # filter folder for filtering steps
        │   ├── 1-$foreground-ignore-mers
        │   ├── 2-$foreground-ignore-all-mers
        │   ├── 3-$foreground-average-binding
        │   ├── 4-$foreground-non-melting
        │   └── 5-$foreground-consecutive-binding
        ├── $foreground-filtered-counts # final filtered mers used for select_mers.py
        ├── parameters # parameters used in the run
        ├── selected-mers # final filtered mers used for select_mers.py
        ├── selected-mers # final filtered mers used for select_mers.py
        ├── all-scores    # file outputted by score_mers.py (all the scores generated)
        └── top-scores    # the sorted top $output_top_nb scores from all-scores

### select\_mers.py output

Select mers outputs a tab delimited file, with 4 columns: mer, foreground count,
background count, and the mer selectivity value. (higher is better)

    CTAACTTAGGTC  1572  155  10.14194
    CTAACATAGGTC  1479  132  11.20455
    GACCTATGTTAG  1479  132  11.20455


### score\_mers.py output

score mers outputs a tab delimited file with 6 columns:

    nb_primers  Combination  Score  FG_mean_dist  FG_stdev_dist  BG_ratio

## Post Processing

To get a more detailed look at each scored combination we provide the
output\_full\_genome.py script. This script will output all of the points in a 
selected set along with some metadata, including position, what sequence it is in,
what strand and what mer it is.

    output_full_genome.py -f fg.fa -s fg.fa_bg.fa/run_12/top-scores -n 15 -o sets

this will output one file for eat of the the top 15 sets in top-scores, in the
folder sets.