GetBEDnarrowPeakgz
Reads a gz compressed BED narrow peak file from a web address or local file and returns a pandas dataframe.
GetBEDnarrowPeakgz(URL_or_PATH_TO_file)
-
URL_or_PATH_TO_file
source of input bed. Either a web link or a path to a local file. -
returns
a pandas dataframe of the inpud bed.
>>> import AGEpy as age
>>> eCLIP_1_bednarrowPeak="https://www.encodeproject.org/files/ENCFF066PCT/@@download/ENCFF066PCT.bed.gz"
>>> bed=age.GetBEDnarrowPeakgz(eCLIP_1_bednarrowPeak)
>>> print bed.head()
chrom chromStart chromEnd name score strand signalValue \
0 chr7 139371278 139371296 Peak_0 1000 + 5.09062636514014
1 chr7 139371257 139371278 Peak_1 1000 + 5.0840236303159
2 chr7 155781335 155781431 Peak_2 1000 + 3.70481328524336
3 chr7 87156569 87156676 Peak_3 1000 + 3.95023151551588
4 chr7 105073472 105073521 Peak_4 1000 + 4.14556204062503
-log10(pValue) -log10(qvalue) peak
0 48.9834262537309 -1 -1
1 48.7463712698062 -1 -1
2 42.6519289009201 -1 -1
3 37.7848384917051 -1 -1
4 34.0756845242392 -1 -1
writeBED
Writes a bed dataframe into a bed file.
writeBED(inBED, file_path)
inBED
a pandas dataframe with the contents of the bed file to be written.-
file_path
path to target file. -
returns
nothing.
>>> import AGEpy as age
>>> print bed.head()
chrom chromStart chromEnd name score strand signalValue \
0 chr7 139371278 139371296 Peak_0 1000 + 5.09062636514014
1 chr7 139371257 139371278 Peak_1 1000 + 5.0840236303159
2 chr7 155781335 155781431 Peak_2 1000 + 3.70481328524336
3 chr7 87156569 87156676 Peak_3 1000 + 3.95023151551588
4 chr7 105073472 105073521 Peak_4 1000 + 4.14556204062503
-log10(pValue) -log10(qvalue) peak
0 48.9834262537309 -1 -1
1 48.7463712698062 -1 -1
2 42.6519289009201 -1 -1
3 37.7848384917051 -1 -1
4 34.0756845242392 -1 -1
>>> age.writeBED(bed,"/path/to/file.bed")
dfTObedtool
Transforms a pandas dataframe into a bedtool. Requires bedtools
to be in your path
.
dfTObedtool(df)
df
a pandas dataframe.returns
a bedtool.
>>> import AGEpy as age
>>> print bed.head()
chrom chromStart chromEnd name score strand signalValue \
0 chr7 139371278 139371296 Peak_0 1000 + 5.09062636514014
1 chr7 139371257 139371278 Peak_1 1000 + 5.0840236303159
2 chr7 155781335 155781431 Peak_2 1000 + 3.70481328524336
3 chr7 87156569 87156676 Peak_3 1000 + 3.95023151551588
4 chr7 105073472 105073521 Peak_4 1000 + 4.14556204062503
-log10(pValue) -log10(qvalue) peak
0 48.9834262537309 -1 -1
1 48.7463712698062 -1 -1
2 42.6519289009201 -1 -1
3 37.7848384917051 -1 -1
4 34.0756845242392 -1 -1
>>> bedtool=age.dfTObedtool(bed)
>>> print bedtool.head()
chr7 139371278 139371296 Peak_0 1000 + 5.09062636514014 48.9834262537309 -1 -1
chr7 139371257 139371278 Peak_1 1000 + 5.0840236303159 48.7463712698062 -1 -1
chr7 155781335 155781431 Peak_2 1000 + 3.70481328524336 42.6519289009201 -1 -1
chr7 87156569 87156676 Peak_3 1000 + 3.95023151551588 37.7848384917051 -1 -1
chr7 105073472 105073521 Peak_4 1000 + 4.14556204062503 34.0756845242392 -1 -1
chr7 128761857 128761952 Peak_5 1000 + 4.02131461357736 33.9350181783027 -1 -1
chr7 121296414 121296454 Peak_6 1000 + 3.50632247892067 30.2512926812531 -1 -1
chr7 139368342 139368352 Peak_7 1000 + 4.41912711395099 29.6666535015756 -1 -1
chr7 87155583 87155635 Peak_8 1000 + 4.08769554637519 29.3752024210392 -1 -1
chr7 105540000 105540028 Peak_9 1000 + 4.2212263105571 29.0451450847765 -1 -1
>>> print type(bed)
<class 'pandas.core.frame.DataFrame'>
>>> print type(bedtool)
<class 'pybedtools.bedtool.BedTool'>
GetPeaksExons
Annotates a bedtool, BED narrow peak.
GetPeaksExons(bed,parsedGTF)
bed
a pandas dataframe in bed format-
parsedGTF
a parsed GTF file as outputed by parseGTF() -
returns
a Pandas dataframe
>>> import AGEpy as age
>>> print bed.head()
chrom chromStart chromEnd name score strand signalValue \
0 chr7 139371278 139371296 Peak_0 1000 + 5.09062636514014
1 chr7 139371257 139371278 Peak_1 1000 + 5.0840236303159
2 chr7 155781335 155781431 Peak_2 1000 + 3.70481328524336
3 chr7 87156569 87156676 Peak_3 1000 + 3.95023151551588
4 chr7 105073472 105073521 Peak_4 1000 + 4.14556204062503
-log10(pValue) -log10(qvalue) peak
0 48.9834262537309 -1 -1
1 48.7463712698062 -1 -1
2 42.6519289009201 -1 -1
3 37.7848384917051 -1 -1
4 34.0756845242392 -1 -1
>>> GTF=age.readGTF("/beegfs/group_bit/data/projects/departments/Bioinformatics/bit_RNAseq_eCLIP/downloads/gencode.v24.primary_assembly.annotation.gtf")
>>> print GTF.head()
seqname source feature start end score strand frame \
0 chr1 HAVANA gene 11869 14409 . + .
1 chr1 HAVANA transcript 11869 14409 . + .
2 chr1 HAVANA exon 11869 12227 . + .
3 chr1 HAVANA exon 12613 12721 . + .
4 chr1 HAVANA exon 13221 14409 . + .
attribute
0 gene_id "ENSG00000223972.5"; gene_type "transc..."
1 gene_id "ENSG00000223972.5"; transcript_id "EN..."
2 gene_id "ENSG00000223972.5"; transcript_id "EN..."
3 gene_id "ENSG00000223972.5"; transcript_id "EN..."
4 gene_id "ENSG00000223972.5"; transcript_id "EN..."
>>> GTFpa=age.parseGTF(GTF)
>>> print GTFpa.head()
seqname source feature start end score strand frame gene_status \
0 chr1 HAVANA gene 11869 14409 . + . KNOWN
1 chr1 HAVANA transcript 11869 14409 . + . KNOWN
2 chr1 HAVANA exon 11869 12227 . + . KNOWN
3 chr1 HAVANA exon 12613 12721 . + . KNOWN
4 chr1 HAVANA exon 13221 14409 . + . KNOWN
havana_gene ... exon_id transcript_id \
0 OTTHUMG00000000961.2 ... NaN NaN
1 OTTHUMG00000000961.2 ... NaN ENST00000456328.2
2 OTTHUMG00000000961.2 ... ENSE00002234944.1 ENST00000456328.2
3 OTTHUMG00000000961.2 ... ENSE00003582793.1 ENST00000456328.2
4 OTTHUMG00000000961.2 ... ENSE00002312635.1 ENST00000456328.2
exon_number ont havana_transcript ccdsid transcript_name \
0 NaN NaN NaN NaN NaN
1 NaN NaN OTTHUMT00000362751.1 NaN DDX11L1-002
2 1 NaN OTTHUMT00000362751.1 NaN DDX11L1-002
3 2 NaN OTTHUMT00000362751.1 NaN DDX11L1-002
4 3 NaN OTTHUMT00000362751.1 NaN DDX11L1-002
gene_type transcript_status gene_name
0 transcribed_unprocessed_pseudogene NaN DDX11L1
1 transcribed_unprocessed_pseudogene KNOWN DDX11L1
2 transcribed_unprocessed_pseudogene KNOWN DDX11L1
3 transcribed_unprocessed_pseudogene KNOWN DDX11L1
4 transcribed_unprocessed_pseudogene KNOWN DDX11L1
>>> bedAn=age.GetPeaksExons(bed,GTFpa)
>>> print bedAn.head()
chrom chromStart chromEnd name score strand signalValue \
0 chr7 155781335 155781431 Peak_2 1000 + 3.704813
1 chr7 155781335 155781431 Peak_2 1000 + 3.704813
2 chr7 121296414 121296454 Peak_6 1000 + 3.506322
3 chr7 87155538 87155583 Peak_16 1000 + 4.077391
4 chr7 107904733 107904812 Peak_17 1000 + 3.674368
-log10(pValue) -log10(qvalue) peak ... \
0 42.651929 -1 -1 ...
1 42.651929 -1 -1 ...
2 30.251293 -1 -1 ...
3 22.798739 -1 -1 ...
4 21.118496 -1 -1 ...
gene_id exon_id_count exon_id norm. mean -log10(pValue) \
0 ENSG00000184863.10 1 42.651929
1 ENSG00000184863.10 1 42.651929
2 ENSG00000106034.17 1 30.251293
3 ENSG00000135164.18 3 2951.868281
4 ENSG00000091140.12 1 21.118496
exon_id signalValue transcript_id_count \
0 3.704813 1
1 3.704813 1
2 3.506322 1
3 42.703999 3
4 3.674368 1
transcript_id norm. mean -log10(pValue) transcript_id signalValue \
0 42.651929 3.704813
1 42.651929 3.704813
2 30.251293 3.506322
3 2951.868281 42.703999
4 21.118496 3.674368
gene_id_count gene_id norm. mean -log10(pValue) gene_id signalValue
0 4 116.619012 17.830941
1 4 116.619012 17.830941
2 2 30.251293 2.144090
3 8 3300.707425 73.902289
4 5 135.139064 22.210269
gene_id_count: number of intervals overlapping this gene
transcript_id_count: number of intervals overlapping this transcript
exon_id_count: number of intervals overlapping this exon
AnnotateBED
Annotates a bedtool, BED narrow peak.
AnnotateBED(bed,GTF, genome_file, bedcols=None, promoter=[1000,200])
bed
either a /path/to/file.bed or a Pandas dataframe in bed format. /path/to/file.bed implies bedcols.GTF
/path/to/file.gtfgenome_file
/path/to/file.genome - a tab separated values of chr name and size informationbedcols
a comma separated string of column headers to use when reading in a bed file. eg: "chr,start,end,name"-
promoter
a list containing the upstream start of the promoter region from the TSS and the downstream end of the promoter region from the TSS. -
returns
a Pandas dataframe with the annotated bed file. exons and promoters will be reported as well in the annotated_gene_features column.
```python
>>> import AGEpy as age
>>> print bed.head()
chr start end name signal value strand fold change \
0 2 175167300 175167740 chip1_peak_2 58.993528 . 2.965444
1 2 27052080 27052410 chip1_peak_3 154.897096 . 7.786255
2 1 243719300 243719630 chip1_peak_4 99.776458 . 5.015490
3 17 2564650 2564980 chip1_peak_5 72.892502 . 3.664107
4 7 44999240 44999570 chip1_peak_6 106.434435 . 5.350169
p-value Benjamini-Hochberg FDR enriched in marker
0 5.747544e-15 4.044835e-08 control K27AC
1 2.197614e-14 8.934691e-08 control K27AC
2 2.915657e-14 8.934691e-08 control K27AC
3 3.173957e-14 8.934691e-08 control K27AC
4 3.871249e-14 9.081308e-08 control K27AC
>>> bed=AnnotateBED(bed,"hg38.83.gtf","hg38.83.genome")
>>> print bed.head()
chr start end name signal value strand fold change \
0 1 789880 791350 chip2_peak_44728 172.757426 . 8.473977
1 1 820750 822710 chip1_peak_22461 148.812870 . 11.672676
2 1 905550 905850 chip1_peak_1792 289.437404 . 13.231699
3 1 913500 913800 chip1_peak_4243 43.508330 . 1.988994
4 1 960150 960450 chip1_peak_1666 67.008675 . 3.063317
p-value Benjamini-Hochberg FDR enriched in marker \
0 6.043877e-06 0.000314 stretch H3K9
1 5.292319e-07 0.000057 control H3K9
2 9.544848e-07 0.004798 control H3K27
3 4.932846e-06 0.010117 control H3K27
4 8.347840e-07 0.004535 control H3K27
annotated_gene_features
0 RP5-857K21.4; RP11-206L10.9
1 RP5-857K21.4
2 RP11-54O7.16
3 RP11-54O7.1: exon; RP11-54O7.2: promoter; RP11...
4 NOC2L: promoter; KLHL17: promoter