
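COnTORT (COmprehensive Transcriptomic ORganizational Tool) is a program that downloads and organizes all expression data in GEO related to a search result (typically an organism).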

Project description

COnTORT

Purpose:

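COnTORT (COmprehensive Transcriptomic ORganizational Tool) is a program that downloads and organizes all expression data in GEO related to a search result. It identifies the GEO GDS, GSE, GSM, and GPL directories and files from an NCBI GDS search result and downloads them from the NCBI FTP site. The series data are then organized to retain matches to any gene annotations present in the provided GenBank file. The data are organized, mean-centered, and then joined using the gene annotations into a single text file that can be easily manipulated or opened in Excel.

The result is two files: one with the data mean-centered and one without mean-centering.

The user must provide a GenBank file from NCBI to be used for annotation.

COnTORT will request a search term, use it to search NCBI GEO, and download the NCBI GEO results.

Please note that COnTORT can only download and combine the normalized and processed data deposited in NCBI GEO.

COnTORT has been tested on Linux (Ubuntu and CentOS), MacOS (>10.13), and Windows 10.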

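Input:

  • GenBank file (.gbff) for your organism, downloaded from NCBI.

We encourage you to create a new directory each time you run this script.

Usage:

  1. You can install with pip and then run: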

     pip install contort
     contort
    
  2. You can download the git repository and run the raw script:

     git clone https://github.com/GLBRC/contort.git
     python contort.py
    

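Requirements:

  • Python 3
  • Python modules: argparse, Bio, ftplib, functools, GEOparse, os, pandas, re, shutil, subprocess, sys, tkinter, time, urllib

Output:

COnTORT_organized_transcriptomic_data.txt is the main output.

The main output is a text file with gene annotations in the first five columns and mean-centered expression data in the remaining columns. The first row contains headers that reference the specific GEO ID of each experiment. Note that annotations not present in the GenBank file are listed as "N/A".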

Locus_Tag    Old_Locus_Tag    Gene_Name    Gene_Synonyms    Product    GSM_ID_1    GSM_ID_2
RSP_0002     N/A              spbB         N/A              H-NS       12.0         4.0

Two files are produced: one in which the results are mean-centered for each experiment and one in which no mean-centering is performed.
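As a minimal sketch of working with the main output outside Excel, the snippet below loads a table in this layout with pandas and separates the five annotation columns from the expression columns. It assumes the file is tab-delimited and that missing annotations appear as the literal string "N/A". The helper name `load_contort_table` is hypothetical and not part of COnTORT.

```python
import io

import pandas as pd

# The five annotation columns described above.
ANNOTATION_COLUMNS = [
    "Locus_Tag", "Old_Locus_Tag", "Gene_Name", "Gene_Synonyms", "Product",
]


def load_contort_table(source):
    """Load a COnTORT-style table (assumed tab-delimited).

    Returns (annotations, expression); expression is indexed by Locus_Tag.
    """
    # keep_default_na=False keeps the literal "N/A" strings in the annotations
    # instead of silently converting them to NaN.
    table = pd.read_csv(source, sep="\t", keep_default_na=False)
    annotations = table[ANNOTATION_COLUMNS]
    # Blank expression cells become NaN once coerced to numbers.
    expression = table.drop(columns=ANNOTATION_COLUMNS).apply(
        pd.to_numeric, errors="coerce"
    )
    expression.index = table["Locus_Tag"]
    return annotations, expression


# In-memory stand-in for the output file, using the example row shown above.
example = io.StringIO(
    "Locus_Tag\tOld_Locus_Tag\tGene_Name\tGene_Synonyms\tProduct\tGSM_ID_1\tGSM_ID_2\n"
    "RSP_0002\tN/A\tspbB\tN/A\tH-NS\t12.0\t4.0\n"
)
annotations, expression = load_contort_table(example)
print(expression)
```

To read a real run, pass the path to COnTORT_organized_transcriptomic_data.txt instead of the in-memory example.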

Subdirectories are created for all files and the directory is organized:

Directories created:

- geo                     - the downloaded GDS, GSE, GSM and GPL directories
- GEOannotate_results     - the results of running GEOparse to organize the annotation and data
- GeneOrf_match_output    - the results of matching the GEOparse results to the gene IDs from the GFF
- mean_centered_results   - the mean centered expression data for each experiment
- FTP_files               - files used to download the data from GEO via FTP
- log_files               - all log files from each step as well as other saved files

Overview of the steps and commands used in the pipeline:

runFindGEOAddresses(GDSfile):

Parse GEOfile

Opens and parses the GDSfile result txt file from the search term provided by the user.

Creates new files:
    - GEO_FTP_Addresses.txt     - all GDS, GSE, GSM, GPL addresses in the file
    - GEO_FTP_directories.txt   - GDS, GSE, GSM, GPL directories that will be downloaded
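The address-collection step might look roughly like the sketch below, which pulls the NCBI FTP paths out of a GDS search-result text. It assumes each record carries an "FTP download:" line containing an `ftp://ftp.ncbi.nlm.nih.gov/...` URL. The sample record text and the function name `find_geo_ftp_directories` are invented for illustration and are not taken from COnTORT.

```python
import re

# Matches NCBI FTP URLs and captures the path part after the host name.
FTP_URL = re.compile(r"ftp://ftp\.ncbi\.nlm\.nih\.gov(/\S*)")


def find_geo_ftp_directories(gds_text):
    """Return the unique FTP directory paths found in the text, in first-seen order."""
    directories = []
    for match in FTP_URL.finditer(gds_text):
        path = match.group(1).rstrip("/")
        if path and path not in directories:
            directories.append(path)
    return directories


# Invented example records; real GDS result files contain more fields.
sample_gds_text = (
    "1. Example series title\n"
    "FTP download: GEO (TXT) ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE1nnn/GSE1234/\n"
    "Series\t\tAccession: GSE1234\n"
    "\n"
    "2. Example platform title\n"
    "FTP download: GEO ftp://ftp.ncbi.nlm.nih.gov/geo/platforms/GPLnnn/GPL162/\n"
    "Platform\tAccession: GPL162\n"
    "\n"
    "3. Repeated reference to the same series\n"
    "FTP download: GEO (TXT) ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE1nnn/GSE1234/\n"
)

for directory in find_geo_ftp_directories(sample_gds_text):
    print(directory)
```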

run_is_ftp_dir(ftp_handle, name, guess_by_extension=True):

QC for FTP

Determines if an item listed on the FTP server is a directory or not by 
looking for a "." in the fourth position. If it has that, it is nearly 
always a file and not a directory.
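A directory check of this kind can be sketched as follows. The sketch reads "fourth position" as the fourth character from the end of the name (a dot followed by a three-letter extension, as in `notes.txt`). That reading is an assumption, as is the fallback of trying to change into the entry. The function name `is_ftp_dir` is hypothetical.

```python
import ftplib


def is_ftp_dir(ftp_handle, name, guess_by_extension=True):
    """Guess whether a remote entry is a directory.

    ftp_handle is expected to behave like ftplib.FTP (pwd and cwd methods).
    """
    # Cheap heuristic first: a "." four characters from the end looks like
    # a three-letter file extension, so treat the entry as a file.
    if guess_by_extension and len(name) >= 4 and name[-4] == ".":
        return False
    # Fallback: only directories can become the working directory.
    original_cwd = ftp_handle.pwd()
    try:
        ftp_handle.cwd(name)
    except ftplib.error_perm:
        return False
    # Restore the working directory so callers are not affected.
    ftp_handle.cwd(original_cwd)
    return True
```

The heuristic saves a round trip to the server for most files. Names such as `family.soft.gz`, where the dot is not in that position, still reach the slower `cwd` check.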

run_make_parent_dir( fpath ):

Make directories to match the FTP
Creates the directories in the local directory to match the FTP directories

run_download_FTP_file(ftp_handle, name, dest, overwrite):

Copy the FTP files to the local directory
Copy FTP files into the respective directories on the local directory
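The two steps above (creating the matching local directory, then copying a file into it) could be sketched like this. `ftp_handle` is assumed to be an `ftplib.FTP`-like object, and both function names are hypothetical stand-ins rather than COnTORT's own code.

```python
import os


def make_parent_dir(fpath):
    """Create the local directory that will contain fpath, if it is missing."""
    dirname = os.path.dirname(fpath)
    if dirname:
        os.makedirs(dirname, exist_ok=True)


def download_ftp_file(ftp_handle, name, dest, overwrite=False):
    """Copy the remote file `name` to the local path `dest`.

    Returns True if the file was downloaded and False if an existing
    local copy was kept because overwrite is False.
    """
    if os.path.exists(dest) and not overwrite:
        return False
    make_parent_dir(dest)
    with open(dest, "wb") as local_file:
        # retrbinary streams the remote file in blocks to the callback.
        ftp_handle.retrbinary("RETR " + name, local_file.write)
    return True
```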

run_mirror_ftp_dir(ftp_handle, name, overwrite, guess_by_extension):

Replicate the directories
Replicates a directory from the FTP server onto the local drive recursively

download_ftp_tree(ftp_handle, path, destination, overwrite=False, guess_by_extension=True):

Performs the actions to download files from the NCBI FTP

Perform the actions
Downloads an entire directory tree from an ftp server to the local destination
Will NOT overwrite files if present in the local directory
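A recursive download that skips files already present locally can be sketched as below. This is a simplified illustration, not COnTORT's implementation. It assumes `ftp` behaves like `ftplib.FTP` and that `nlst` returns entries whose last path component is the entry name. The names `is_remote_dir` and `mirror_tree` are hypothetical.

```python
import ftplib
import os


def is_remote_dir(ftp, path):
    """Return True if the server lets us change into `path`."""
    original = ftp.pwd()
    try:
        ftp.cwd(path)
    except ftplib.error_perm:
        return False
    ftp.cwd(original)
    return True


def mirror_tree(ftp, path, destination, overwrite=False):
    """Recursively copy remote `path` under the local `destination`.

    Returns the list of local files written during this call.
    """
    written = []
    for entry in ftp.nlst(path):
        name = entry.rsplit("/", 1)[-1]
        if name in ("", ".", ".."):
            continue
        remote = path.rstrip("/") + "/" + name
        if is_remote_dir(ftp, remote):
            written += mirror_tree(ftp, remote, destination, overwrite)
            continue
        # Mirror the remote path below the destination directory.
        local = os.path.join(destination, *remote.strip("/").split("/"))
        if os.path.exists(local) and not overwrite:
            continue  # existing local files are left untouched
        os.makedirs(os.path.dirname(local), exist_ok=True)
        with open(local, "wb") as local_file:
            ftp.retrbinary("RETR " + remote, local_file.write)
        written.append(local)
    return written
```

Because existing files are skipped by default, an interrupted run can be repeated in the same directory without fetching everything again.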

Default FTP settings for NCBI GEO:

server = 'ftp.ncbi.nlm.nih.gov'
user = 'anonymous'
password = ''
destination = user input
sources of files = from the runFindGEOAddresses module
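With the standard library's `ftplib`, an anonymous connection using these settings could be opened roughly as follows. The dictionary layout and the `connect_to_geo` helper are illustrative assumptions. The `ftp_factory` parameter exists only so the function can be exercised without network access.

```python
import ftplib

# Default settings listed above.
GEO_FTP = {
    "server": "ftp.ncbi.nlm.nih.gov",
    "user": "anonymous",
    "password": "",
}


def connect_to_geo(settings=GEO_FTP, ftp_factory=ftplib.FTP):
    """Open an FTP connection and log in with the given settings."""
    # ftplib.FTP(host) connects immediately when a host is supplied.
    ftp = ftp_factory(settings["server"])
    ftp.login(settings["user"], settings["password"])
    return ftp
```

Calling `connect_to_geo()` with no arguments contacts the NCBI server, so it needs network access.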

GEOannotate():

Organize the GEO series files

This first finds and copies all soft.gz files for all GSE (series) data in the 
GEO download from the previous steps. Using GEOparse, the metadata for each gene is 
collected and concatenated with the normalized data present for each series.

The new files are written for downstream steps and the copied soft.gz files are deleted.
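The concatenation of per-gene metadata with per-sample values can be illustrated with plain pandas. The sketch below uses small hand-built tables as stand-ins for the platform and sample tables that GEOparse exposes as DataFrames. The column names `ID`, `ID_REF`, and `VALUE` follow the usual SOFT conventions, while the `ORF` column and the function name `annotate_series` are assumptions made for the example.

```python
import pandas as pd


def annotate_series(platform_table, sample_tables):
    """Join probe annotations with each sample's values.

    platform_table: DataFrame with an "ID" column plus annotation columns.
    sample_tables: dict mapping a sample ID to a DataFrame with
                   "ID_REF" and "VALUE" columns.
    """
    # One column per sample, aligned on the probe identifier.
    values = pd.DataFrame({
        sample_id: table.set_index("ID_REF")["VALUE"]
        for sample_id, table in sample_tables.items()
    })
    # Left join keeps every annotated probe; missing values stay NaN.
    return platform_table.set_index("ID").join(values, how="left")


# Invented stand-in data for one series with two samples.
platform = pd.DataFrame({"ID": ["p1", "p2"], "ORF": ["RSP_0002", "RSP_0003"]})
samples = {
    "GSM1": pd.DataFrame({"ID_REF": ["p1", "p2"], "VALUE": [12.0, 8.0]}),
    "GSM2": pd.DataFrame({"ID_REF": ["p1"], "VALUE": [4.0]}),
}
annotated = annotate_series(platform, samples)
print(annotated)
```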

gffMatch(GBFF):

Match the gene annotations from the GenBank file to the organized expression data

Using dictionaries created from the GenBank file for the organism, this
script will search for matches to the gene annotation in the GenBank file and
retain only those data with matches. This will make new files for each GEOquery
output with columns representing gene annotations and then the log2 normalized data
from the GEO series files. The data are then mean centered and joined together
with the gene annotation as the key. All blanks are retained for consistency.
This file is written and can be used in Excel or R for further analysis.
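The centering and joining steps can be sketched with pandas as below. The text does not say along which axis the mean is taken. This sketch assumes each gene's mean across the samples of one series is subtracted, so swap the axes if per-sample centering is intended. The function names and the example values are invented for illustration.

```python
import pandas as pd


def mean_center(series_table):
    """Subtract each gene's (row's) mean across the samples of one series.

    Missing values are ignored when computing the mean and stay missing.
    """
    return series_table.sub(series_table.mean(axis=1), axis=0)


def join_series(tables):
    """Outer-join series tables on their locus-tag index, keeping blanks as NaN."""
    return pd.concat(tables, axis=1, join="outer")


# Two invented series, indexed by locus tag; the second lacks RSP_0003.
series_a = pd.DataFrame(
    {"GSM1": [12.0, 6.0], "GSM2": [4.0, 2.0]},
    index=["RSP_0002", "RSP_0003"],
)
series_b = pd.DataFrame({"GSM3": [1.0], "GSM4": [3.0]}, index=["RSP_0002"])

combined = join_series([mean_center(series_a), mean_center(series_b)])
print(combined)
```

The outer join retains genes that are missing from some series, which corresponds to the blanks kept in the combined file.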

cleanUp(cwd):

Clean up the directory

Organize the files into folders for a cleaner directory
