README.md


1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172

SelectiveGenomeAmplification
============================

PI: http://brisson.bio.upenn.edu/


## Requirements
To use this you'll need:

 - A unix environment
 - kmer_total_count, a kmer counter available here: http://github.com/mutantturkey/dna-utils/
 - bash or compliant shell.
 
 
## Setup

    git clone git@github.com:mutantturkey/SelectiveGenomeAmplification.git
    cd SelectiveGenomeAmplification
    make
    sudo make install

## Usage Examples
Standard use of (SGA) SelectiveGenomeAmplification is easy. it takes two arguments,
the foreground and background


    SelectiveGenomeAmplification PfalciparumGenome.fasta HumanGenome.fasta;
    less PfalciparumGenome_HumanGenome/final_mers

SGA allows for many tunable parameters, which are all explained in the chart
below.  For user customizable variables, they need to be passed in as
environmental variables like so:

    max_mer_distance=5000 max_select=6 min_mer_range=6 max_mer_range=12 \
    SelectiveGenomeAmplification.sh PfalciparumGenome.fasta half.fasta 

SGA also comes with a easy to use user prompt called SelectiveGenomeAmplificationUI.
It allows for a less expereienced user to use
SGA without issue.

### Running individual steps

By default SelectiveGenomeAmplification runs all four steps, but you can
specify the program to run other steps, like in these examples.

    current_run=run_1 SelectiveGenomeAmplification target.fasta bg.fasta score

    current_run=run_1 SelectiveGenomeAmplification target.fasta bg.fasta select score

    current_run=run_1 SelectiveGenomeAmplification target.fasta bg.fasta 3 4 

valid steps are these:

- count (1)
- filter (2)
- select (3)
- score (4)

This function does not try to be smart, so use it wisely.

### Manually scoring specific mer combinations from file

Users can manually score combinations of mers they choose using the
score\_mers.py script.

    score_mers.py -f foreground.fa -b background.fa -c combination file -o output


The combination file should look like this:

    ACGATATAT TACATAGA TATATATAT ACGTACCAT ATATTA
    AAATTATCAGT ATACATA ATATACAT ATATACATA ACATA
		ATATACATA ATCATGATA CCAGATACATAT

each row is combination to be scored.


### Manually score all combinations from file
Users can manually score all  combinations of mers they choose using the
score\_mers.py script.

    score_mers.py -f foreground.fa -b background.fa -m mer file -o output


The mer file should look like this:

    ATATAT
		TACATA
		TACATAGCA
		TATAGAATAC
		CGTAGATA
		TAGAAT

each row is a seperate mer. do not put multiple mers on one line.


## Customizable variables

range of mers, min and max 

variable | default | notes
:---- | :---- | ---- | :----
current\_run | Not Enabled | specify the run you want to run steps on
min\_mer\_range | 6  | minimum mer size to use
max\_mer\_range | 12 | maximum mer size to use 
max\_mer\_distance | 5000 | maximum distance between mers in foreground
output\_directory | $foreground\_$background/ | ex. if fg is Bacillus.fasta and  bg is HumanGenome.fasta then folder would be $PWD/Bacillus.fasta\_HumanGenome\_output.fasta/
counts\_directory | $output\_directory/.tmp | directory for counts directory
tmp\_directory | $output\_directory/.tmp | temporary files directory
max\_melting\_temp | 30° | maximum melting temp of mers
min\_melting\_temp | 0° | minimum melting temp of mers
min\_foreground\_binding\_average | 50000 | elminate mers that appear less frequently than the average  (length of foreground / # of occurances)
max\_select | 15 | maximum number of mers to pick
max\_check | 35  | maximum number of mers to select (check the top #)
ignore\_mers | Not Enabled | mers to explicitly ignore, space seperated ex. ignore\_mers="ACAGTA ACCATAA ATATATAT"
ignore\_all\_mers\_from\_files | Not Enabled | ignore any mers found in these files. space seperated.
foreground | Not Enabled | path of foreground file
background | Not Enabled | path of background file
max\_consecutive\_binding | 4 | The maxium number of consecutive binding nucleotides in homodimer and heterodimers
fg\_weight | 0 | How much extra weight to give higher frequency mers in fg. see "equations" (between 0 and 1)
primer\_weight | 0 | How much extra weight to give to sets with a higher number of priemrs. (between 0 and 1)

## Equations

Here's what we are using to determine our scoring and selectivity

### Selecivity

Our selectivity is what we use to determine what top $max\_check mers are checked later
on in our scoring function. Currently we use this formula:

By default our fg\_weight is zero. This gives no extra weight to more
frequently occuring mers, but can be set higher with the fg\_weight
environmental variable if you wish to do so.

    hit = abundance of primer X (ex. 'ATGTA') in background

    (foreground hit / background hit) * (foreground hit ^ fg_weight)


### Score function

The scoring function is this:

    fg_pts = all the points of each mer in the combination, and sequence ends4
		fg_mean_dist = mean distance between each point in fg_pts
		fg_stddev = standard deviation of distance between each point in fg_pts

		nb_primers = number of primers in a combination
		primer_weight = extra weight for sets with higher primers

		bg_ratio = length of background / number of times primer was in background

    mer_score = (nb_primers**primer_weight) * (fg_mean_dist * fg_std_dist) / bg_ratio

## Output

The file structure outputted by default is this:
		$foreground_$background/	
		└── run_1 # current_run
		    ├── filter # filter folder for filtering steps
		    │   ├── 1-$foreground-ignore-mers
		    │   ├── 2-$foreground-ignore-all-mers
		    │   ├── 3-$foreground-average-binding
		    │   ├── 4-$foreground-non-melting
		    │   └── 5-$foreground-consecutive-binding
		    ├── $foreground-filtered-counts # final filtered mers used for selected_mers.py
		    ├── parameters # parameters used in the run
		    ├── selected-mers # final filtered mers used for selected_mers.py
		    └── scores-output # file outputted by score_mers.py