
Data Downloader

Make downloading scientific data much easier

Introduction

data-downloader is a convenient and powerful package for retrieving files over HTTP and HTTPS. It currently contains a download module, downloader, and a url-parsing module, parse_urls. Built on httpx, it can access websites both synchronously and asynchronously, so you can download many files at the same time.

data-downloader has many features that make retrieving files easy, including:

  • If the website supports resuming, an aborted download automatically resumes when the code is rerun (detected by sending a HEAD request with a Range header; the server answers with status code 206 or 416)
  • When downloading a single file is slow, multiple files can be downloaded simultaneously. Two ways are provided:
    • The async_download_datas function (recommended) can download more than 100 files simultaneously using asynchronous requests via httpx
    • The mp_download_datas function uses multiprocessing and scales with the number of CPU cores on your machine
  • Offers a convenient way to manage your usernames and passwords via the .netrc file or the authorize_from_browser parameter, so there is no need to supply credentials for every download when a website requires them
  • Offers a convenient way to parse urls:
    • from_urls_file: parse data urls from a file that contains only urls
    • from_sentinel_meta4: parse urls from a Sentinel products.meta4 file downloaded from https://scihub.copernicus.eu/dhus
    • from_EarthExplorer_order: parse urls from an order in EarthExplorer (similar to bulk-downloader)
    • from_html: parse urls from an html website
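
The resume feature in the first bullet can be sketched in plain Python. build_resume_headers below is a hypothetical helper, not part of the package; it only illustrates the Range-header mechanism described above:

```python
import os

def build_resume_headers(path):
    """Build the Range header used to resume a partial download.

    If a partial file already exists locally, request only the
    remaining bytes. A 206 response means the server honored the
    range; 416 means the local file already covers the full size.
    """
    if os.path.exists(path) and os.path.getsize(path) > 0:
        return {"Range": f"bytes={os.path.getsize(path)}-"}
    return {}  # no partial file: request the whole resource
```

A server that ignores the Range header simply resends the whole file, which is why resuming only works when the site supports it.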

1. Installation

It is recommended to install data_downloader with the latest version of pip:

pip install data_downloader

2. Usage of downloader

All download functions are in data_downloader.downloader, so import downloader first:

from data_downloader import downloader

2.1 Netrc

If a website requires login, you can add a record with your login information to the .netrc file in your home directory, to avoid providing a username and password every time you download data.

View the existing hosts in the .netrc file:

netrc = downloader.Netrc()
print(netrc.hosts)

Add a record:

netrc.add(host, login, password, account=None, overwrite=False)

If you want to update an existing record, set the parameter overwrite=True.

For NASA data users:

netrc.add('urs.earthdata.nasa.gov','your_username','your_password')

When you don't know the host of a website, you can use downloader.get_url_host(url) to get the hostname:

host = downloader.get_url_host(url)
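
If the package is not at hand, the hostname can also be extracted with the standard library. The sketch below is an assumption about what get_url_host returns, not its actual implementation:

```python
from urllib.parse import urlparse

def url_host(url):
    """Return the network location (host) part of a url."""
    return urlparse(url).netloc

print(url_host('https://gpm1.gesdisc.eosdis.nasa.gov/daac-bin/OTF/HTTP_services.cgi'))
# gpm1.gesdisc.eosdis.nasa.gov
```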

Remove a record:

netrc.remove(host)

Clear all records:

netrc.clear()

Example:

In [2]: netrc = downloader.Netrc()
In [3]: netrc.hosts
Out[3]: {}

In [4]: netrc.add('urs.earthdata.nasa.gov','username','passwd') 

In [5]: netrc.hosts
Out[5]: {'urs.earthdata.nasa.gov': ('username', None, 'passwd')}

In [6]: netrc
Out[6]:
machine urs.earthdata.nasa.gov
	login username
	password passwd

# This command only for linux user
In [7]: !cat ~/.netrc
machine urs.earthdata.nasa.gov
	login username
	password passwd

In [8]: url = 'https://gpm1.gesdisc.eosdis.nasa.gov/daac-bin/OTF/HTTP_services.cgi?FILENAME=%2Fdata%2FGPM_L3%2FGPM_3IMERGM.06%2F2000%2F3B-MO.MS.MRG.3IMERG.20000601-S000000-E235959.06.V06B.HDF5&FORMAT=bmM0Lw&BBOX=31.904%2C99.492%2C35.771%2C105.908&LABEL=3B-MO.MS.MRG.3IMERG.20000601-S000000-E235959.06.V06B.HDF5.SUB.nc4&SHORTNAME=GPM_3IMERGM&SERVICE=L34RS_GPM&VERSION=1.02&DATASET_VERSION=06&VARIABLES=precipitation'

In [9]: downloader.get_url_host(url)
Out[9]: 'gpm1.gesdisc.eosdis.nasa.gov'

In [10]: netrc.add(downloader.get_url_host(url),'username','passwd')

In [11]: netrc
Out[11]:
machine urs.earthdata.nasa.gov
        login username
        password passwd
machine gpm1.gesdisc.eosdis.nasa.gov
        login username
        password passwd

In [12]: netrc.add(downloader.get_url_host(url),'username','new_passwd')
>>> Warning: gpm1.gesdisc.eosdis.nasa.gov existed, nothing will be done. If you want to overwrite the existed record, set overwrite=True

In [13]: netrc
Out[13]:
machine urs.earthdata.nasa.gov
        login username
        password passwd
machine gpm1.gesdisc.eosdis.nasa.gov
        login username
        password passwd

In [14]: netrc.add(downloader.get_url_host(url),'username','new_passwd',overwrite=True)

In [15]: netrc
Out[15]:
machine urs.earthdata.nasa.gov
        login username
        password passwd
machine gpm1.gesdisc.eosdis.nasa.gov
        login username
        password new_passwd

In [16]: netrc.remove(downloader.get_url_host(url))

In [17]: netrc
Out[17]:
machine urs.earthdata.nasa.gov
        login username
        password passwd

In [18]: netrc.clear()

In [19]: netrc.hosts
Out[19]: {}
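
The ~/.netrc file written above is the standard netrc format, so Python's built-in netrc module (independent of data_downloader) can read it back. The sketch writes a sample file to a temporary location instead of your real ~/.netrc:

```python
import netrc
import os
import tempfile

# Write a sample netrc file (normally this would be ~/.netrc)
content = "machine urs.earthdata.nasa.gov\n\tlogin username\n\tpassword passwd\n"
path = os.path.join(tempfile.mkdtemp(), "netrc")
with open(path, "w") as f:
    f.write(content)

# Parse it with the standard library and look up the host's credentials
info = netrc.netrc(path)
login, account, password = info.authenticators("urs.earthdata.nasa.gov")
print(login, password)  # username passwd
```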

2.2 download_data

This function is designed for downloading a single file. If you have many files to download, try download_datas, mp_download_datas, or async_download_datas.

downloader.download_data(url, folder=None, authorize_from_browser=False, file_name=None, client=None, allow_redirects=False, retry=0)

Parameters:

url: str
    url of web file
folder: str
    the folder to store output files. Default is current folder.
authorize_from_browser: bool
    whether to load cookies used by your web browser for authorization.
    This means you can use python to download data by logging in to the website
    via a browser (so far the following browsers are supported: Chrome, Firefox,
    Opera, Edge, Chromium). It will be very useful when the website doesn't support
    "HTTP Basic Auth". Default is False.
file_name: str
    the file name. If None, will parse from web response or url.
    file_name can be the absolute path if folder is None.
client: httpx.Client() object
    client maintaining connection. Default is None
allow_redirects: bool
    Enables or disables HTTP redirects
retry: int 
    number of reconnects when status code is 503

Example:

In [6]: url = 'http://gws-access.ceda.ac.uk/public/nceo_geohazards/LiCSAR_products/106/106D_05049_131313/interferograms/20141117_20141211/20141117_201
   ...: 41211.geo.unw.tif'
   ...:  
   ...: folder = 'D:\\data'
   ...: downloader.download_data(url,folder)

20141117_20141211.geo.unw.tif:   2%|                   | 455k/22.1M [00:52<42:59, 8.38kB/s]
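
When file_name is None, the name is parsed from the server response or, failing that, from the url itself. The url fallback is presumably equivalent to taking the last path segment; this is a sketch of that behavior, not the package's actual code:

```python
from urllib.parse import urlparse

def name_from_url(url):
    """Guess a file name from the last segment of the url path."""
    return urlparse(url).path.rsplit('/', 1)[-1]

url = ('http://gws-access.ceda.ac.uk/public/nceo_geohazards/LiCSAR_products/'
       '106/106D_05049_131313/interferograms/20141117_20141211/'
       '20141117_20141211.geo.unw.tif')
print(name_from_url(url))  # 20141117_20141211.geo.unw.tif
```

Passing an explicit file_name (or an absolute path when folder is None) overrides this guess.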

2.3 download_datas

Download data from an iterable of urls. Files will be downloaded one by one.

downloader.download_datas(urls, folder=None, authorize_from_browser=False, file_names=None)

Parameters:

urls:  iterator
    iterator contains urls
folder: str
    the folder to store output files. Default is current folder.
authorize_from_browser: bool
    whether to load cookies used by your web browser for authorization.
    This means you can use python to download data by logging in to the website
    via a browser (so far the following browsers are supported: Chrome, Firefox,
    Opera, Edge, Chromium). It will be very useful when the website doesn't support
    "HTTP Basic Auth". Default is False.
file_names: iterator
    iterator contains names of files. Leave it None if you want the program to parse
    them from the website. file_names can contain absolute paths if folder is None.

Example:

In [12]: from data_downloader import downloader 
    ...:  
    ...: urls=['http://gws-access.ceda.ac.uk/public/nceo_geohazards/LiCSAR_products/106/106D_05049_131313/interferograms/20141117_20141211/20141117_20
    ...: 141211.geo.unw.tif', 
    ...: 'http://gws-access.ceda.ac.uk/public/nceo_geohazards/LiCSAR_products/106/106D_05049_131313/interferograms/20141024_20150221/20141024_20150221
    ...: .geo.unw.tif', 
    ...: 'http://gws-access.ceda.ac.uk/public/nceo_geohazards/LiCSAR_products/106/106D_05049_131313/interferograms/20141024_20150128/20141024_20150128
    ...: .geo.cc.tif', 
    ...: 'http://gws-access.ceda.ac.uk/public/nceo_geohazards/LiCSAR_products/106/106D_05049_131313/interferograms/20141024_20150128/20141024_20150128
    ...: .geo.unw.tif', 
    ...: 'http://gws-access.ceda.ac.uk/public/nceo_geohazards/LiCSAR_products/106/106D_05049_131313/interferograms/20141211_20150128/20141211_20150128
    ...: .geo.cc.tif', 
    ...: 'http://gws-access.ceda.ac.uk/public/nceo_geohazards/LiCSAR_products/106/106D_05049_131313/interferograms/20141117_20150317/20141117_20150317
    ...: .geo.cc.tif', 
    ...: 'http://gws-access.ceda.ac.uk/public/nceo_geohazards/LiCSAR_products/106/106D_05049_131313/interferograms/20141117_20150221/20141117_20150221
    ...: .geo.cc.tif']  
    ...:  
    ...: folder = 'D:\\data' 
    ...: downloader.download_datas(urls,folder)

20141117_20141211.geo.unw.tif:   6%|           | 1.37M/22.1M [03:09<2:16:31, 2.53kB/s]

2.4 mp_download_datas

Download files simultaneously using multiprocessing. Downloads from websites that don't support resuming may be incomplete; you can use download_datas instead in that case.

downloader.mp_download_datas(urls, folder=None,  authorize_from_browser=False, file_names=None,ncore=None, desc='')

Parameters:

urls:  iterator
    iterator contains urls
folder: str
    the folder to store output files. Default is current folder.
authorize_from_browser: bool
    whether to load cookies used by your web browser for authorization.
    This means you can use python to download data by logging in to the website
    via a browser (so far the following browsers are supported: Chrome, Firefox,
    Opera, Edge, Chromium). It will be very useful when the website doesn't support
    "HTTP Basic Auth". Default is False.
file_names: iterator
    iterator contains names of files. Leave it None if you want the program to parse
    them from the website. file_names can contain absolute paths if folder is None.
ncore: int
    Number of cores for parallel processing. If ncore is None then the number returned
    by os.cpu_count() is used. Default is None.
desc: str
    description of data downloading

Example:

In [12]: from data_downloader import downloader 
    ...:  
    ...: urls=['http://gws-access.ceda.ac.uk/public/nceo_geohazards/LiCSAR_products/106/106D_05049_131313/interferograms/20141117_20141211/20141117_20
    ...: 141211.geo.unw.tif', 
    ...: 'http://gws-access.ceda.ac.uk/public/nceo_geohazards/LiCSAR_products/106/106D_05049_131313/interferograms/20141024_20150221/20141024_20150221
    ...: .geo.unw.tif', 
    ...: 'http://gws-access.ceda.ac.uk/public/nceo_geohazards/LiCSAR_products/106/106D_05049_131313/interferograms/20141024_20150128/20141024_20150128
    ...: .geo.cc.tif', 
    ...: 'http://gws-access.ceda.ac.uk/public/nceo_geohazards/LiCSAR_products/106/106D_05049_131313/interferograms/20141024_20150128/20141024_20150128
    ...: .geo.unw.tif', 
    ...: 'http://gws-access.ceda.ac.uk/public/nceo_geohazards/LiCSAR_products/106/106D_05049_131313/interferograms/20141211_20150128/20141211_20150128
    ...: .geo.cc.tif', 
    ...: 'http://gws-access.ceda.ac.uk/public/nceo_geohazards/LiCSAR_products/106/106D_05049_131313/interferograms/20141117_20150317/20141117_20150317
    ...: .geo.cc.tif', 
    ...: 'http://gws-access.ceda.ac.uk/public/nceo_geohazards/LiCSAR_products/106/106D_05049_131313/interferograms/20141117_20150221/20141117_20150221
    ...: .geo.cc.tif']  
    ...:  
    ...: folder = 'D:\\data' 
    ...: downloader.mp_download_datas(urls,folder)

 >>> 12 parallel downloading
 >>> Total | :   0%|                                         | 0/7 [00:00<?, ?it/s]
20141211_20150128.geo.cc.tif:  15%|██▊                | 803k/5.44M [00:00<?, ?B/s]

2.5 async_download_datas

Download files simultaneously in asynchronous mode. Downloads from websites that don't support resuming may be incomplete; you can use download_datas instead in that case.

downloader.async_download_datas(urls, folder=None, authorize_from_browser=False, file_names=None, limit=30, desc='', allow_redirects=False,  retry=0)

Parameters:

urls:  iterator
    iterator contains urls
folder: str
    the folder to store output files. Default is current folder.
authorize_from_browser: bool
    whether to load cookies used by your web browser for authorization.
    This means you can use python to download data by logging in to the website
    via a browser (so far the following browsers are supported: Chrome, Firefox,
    Opera, Edge, Chromium). It will be very useful when the website doesn't support
    "HTTP Basic Auth". Default is False.
file_names: iterator
    iterator contains names of files. Leave it None if you want the program
    to parse them from the website. file_names can contain absolute paths if folder is None.
limit: int
    the number of files downloading simultaneously
desc: str
    description of the downloading task
allow_redirects: bool
    Enables or disables HTTP redirects
retry: int
    number of reconnections when status code is 503

Example:

In [3]: from data_downloader import downloader 
   ...:  
   ...: urls=['http://gws-access.ceda.ac.uk/public/nceo_geohazards/LiCSAR_products/106/106D_05049
   ...: _131313/interferograms/20141117_20141211/20141117_20141211.geo.unw.tif', 
   ...: 'http://gws-access.ceda.ac.uk/public/nceo_geohazards/LiCSAR_products/106/106D_05049_13131
   ...: 3/interferograms/20141024_20150221/20141024_20150221.geo.unw.tif', 
   ...: 'http://gws-access.ceda.ac.uk/public/nceo_geohazards/LiCSAR_products/106/106D_05049_13131
   ...: 3/interferograms/20141024_20150128/20141024_20150128.geo.cc.tif', 
   ...: 'http://gws-access.ceda.ac.uk/public/nceo_geohazards/LiCSAR_products/106/106D_05049_13131
   ...: 3/interferograms/20141024_20150128/20141024_20150128.geo.unw.tif', 
   ...: 'http://gws-access.ceda.ac.uk/public/nceo_geohazards/LiCSAR_products/106/106D_05049_13131
   ...: 3/interferograms/20141211_20150128/20141211_20150128.geo.cc.tif', 
   ...: 'http://gws-access.ceda.ac.uk/public/nceo_geohazards/LiCSAR_products/106/106D_05049_13131
   ...: 3/interferograms/20141117_20150317/20141117_20150317.geo.cc.tif', 
   ...: 'http://gws-access.ceda.ac.uk/public/nceo_geohazards/LiCSAR_products/106/106D_05049_13131
   ...: 3/interferograms/20141117_20150221/20141117_20150221.geo.cc.tif']  
   ...:  
   ...: folder = 'D:\\data' 
   ...: downloader.async_download_datas(urls,folder,limit=3,desc='interferograms')

>>> Total | Interferograms :   0%|                          | 0/7 [00:00<?, ?it/s]
    20141024_20150221.geo.unw.tif:  11%|    | 2.41M/21.2M [00:11<41:44, 7.52kB/s]
    20141117_20141211.geo.unw.tif:   9%|    | 2.06M/22.1M [00:11<25:05, 13.3kB/s]
    20141024_20150128.geo.cc.tif:  36%|██▏   | 1.98M/5.42M [00:12<04:17, 13.4kB/s] 
    20141117_20150317.geo.cc.tif:   0%|               | 0.00/5.44M [00:00<?, ?B/s]
    20141117_20150221.geo.cc.tif:   0%|               | 0.00/5.47M [00:00<?, ?B/s]
    20141024_20150128.geo.unw.tif:   0%|              | 0.00/23.4M [00:00<?, ?B/s]
    20141211_20150128.geo.cc.tif:   0%|               | 0.00/5.44M [00:00<?, ?B/s]
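
The limit parameter caps how many coroutines download at once. Conceptually this works like an asyncio.Semaphore; the sketch below uses dummy tasks in place of real downloads to show the pattern (fake_download and download_all are hypothetical names, not package functions):

```python
import asyncio

async def fake_download(sem, name, state):
    async with sem:  # at most `limit` coroutines run this section at once
        state["active"] += 1
        state["peak"] = max(state["peak"], state["active"])
        await asyncio.sleep(0.01)  # stand-in for the real network transfer
        state["active"] -= 1
        return name

async def download_all(names, limit=3):
    sem = asyncio.Semaphore(limit)
    state = {"active": 0, "peak": 0}
    done = await asyncio.gather(*(fake_download(sem, n, state) for n in names))
    return done, state["peak"]

done, peak = asyncio.run(download_all([f"file_{i}" for i in range(7)], limit=3))
print(len(done), peak)  # 7 tasks finished; at most 3 ran concurrently
```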

2.6 status_ok

Check simultaneously whether the given links are accessible.

downloader.status_ok(urls, limit=200, authorize_from_browser=False, timeout=60)

Parameters:

urls: iterator
    iterator contains urls
limit: int
    the number of urls connecting simultaneously
authorize_from_browser: bool
    whether to load cookies used by your web browser for authorization.
    This means you can use python to download data by logging in to the website
    via a browser (so far the following browsers are supported: Chrome, Firefox,
    Opera, Edge, Chromium). It will be very useful when the website doesn't support
    "HTTP Basic Auth". Default is False.
timeout: int
    stop waiting for a response after the given number of seconds

Returns:

A list of results (True or False)

Example:

In [1]: from data_downloader import downloader
   ...: import numpy as np
   ...:
   ...: urls = np.array(['https://www.baidu.com',
   ...: 'https://www.bai.com/wrongurl',
   ...: 'https://cn.bing.com/',
   ...: 'https://bing.com/wrongurl',
   ...: 'https://bing.com/'])
   ...:
   ...: status_ok = downloader.status_ok(urls)
   ...: urls_accessable = urls[status_ok]
   ...: print(urls_accessable)

['https://www.baidu.com' 'https://cn.bing.com/' 'https://bing.com/']

3. Usage of parse_urls

Provides a very simple way to get urls from various media.

Import:

from data_downloader import parse_urls

3.1 from_urls_file

Parse urls from a file that contains only urls.

parse_urls.from_urls_file(url_file)

Parameters:

url_file: str
    path to file which only contains urls 

Returns:

A list of urls
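
The expected file format is one url per line. The stdlib-only sketch below is an assumption about from_urls_file's behavior (parse_urls_file is a hypothetical stand-in), not the package's actual code:

```python
import os
import tempfile

def parse_urls_file(url_file):
    """Read a text file that contains one url per line, skipping blank lines."""
    with open(url_file) as f:
        return [line.strip() for line in f if line.strip()]

# Build a small sample urls file for demonstration
path = os.path.join(tempfile.mkdtemp(), "urls.txt")
with open(path, "w") as f:
    f.write("https://example.com/a.nc\n\nhttps://example.com/b.nc\n")

print(parse_urls_file(path))
# ['https://example.com/a.nc', 'https://example.com/b.nc']
```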

3.2 from_sentinel_meta4

Parse urls from a Sentinel products.meta4 file downloaded from https://scihub.copernicus.eu/dhus.

parse_urls.from_sentinel_meta4(url_file)

Parameters:

url_file: str
    path to products.meta4

Returns:

A list of urls

3.3 from_html

Parse urls from an html website.

parse_urls.from_html(url, suffix=None, suffix_depth=0, url_depth=0)

Parameters:

url: str
    the website that contains the data
suffix: list, optional
    data format. suffix should be a list that contains multiple parts.
    If suffix_depth is 0, all '.' parts will be parsed.
    Examples:
        when 'suffix_depth=0' is set:
            the suffix of 'xxx8.1_GLOBAL.nc' should be ['.1_GLOBAL', '.nc']
            the suffix of 'xxx.tar.gz' should be ['.tar', '.gz']
        when 'suffix_depth=1' is set:
            the suffix of 'xxx8.1_GLOBAL.nc' should be ['.nc']
            the suffix of 'xxx.tar.gz' should be ['.gz']
suffix_depth: integer
    number of suffixes to keep
url_depth: integer
    depth of urls in the website to be parsed

Returns:

A list of urls
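
The suffix semantics described in the parameters match pathlib's notion of multi-part suffixes. The helper below (suffixes_of is hypothetical, not a package function) illustrates how suffix_depth selects them:

```python
from pathlib import PurePosixPath

def suffixes_of(name, suffix_depth=0):
    """Return the suffix list of a file name.

    suffix_depth=0 keeps every '.'-separated suffix;
    suffix_depth=n keeps only the last n of them.
    """
    parts = PurePosixPath(name).suffixes
    return parts if suffix_depth == 0 else parts[-suffix_depth:]

print(suffixes_of('xxx8.1_GLOBAL.nc'))                  # ['.1_GLOBAL', '.nc']
print(suffixes_of('xxx8.1_GLOBAL.nc', suffix_depth=1))  # ['.nc']
print(suffixes_of('xxx.tar.gz'))                        # ['.tar', '.gz']
```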

Example:

from data_downloader import parse_urls

url = 'https://cds-espri.ipsl.upmc.fr/espri/pubipsl/iasib_CH4_2014_uk.jsp'
urls = parse_urls.from_html(url, suffix=['.nc'], suffix_depth=1)
urls_all = parse_urls.from_html(url, suffix=['.nc'], suffix_depth=1, url_depth=1)
print(len(urls_all)-len(urls))

3.4 from_EarthExplorer_order

Parse urls from an order in EarthExplorer.

Reference: bulk-downloader

parse_urls.from_EarthExplorer_order(username=None, passwd=None, email=None,
                                    order=None, url_host=None)

Parameters:

username, passwd: str, optional
    your username and passwd to log in to EarthExplorer. Could be
    None when you have saved them in .netrc
email: str, optional
    email address for the user that submitted the order
order: str or dict
    which order to download. If None, all orders retrieved from
    EarthExplorer will be used.
url_host: str
    the url host to use if the host is not USGS ESPA

Returns:

A dict in the format {orderid: urls}

Example:

from pathlib import Path
from data_downloader import downloader, parse_urls
folder_out = Path('D:\\data')
urls_info = parse_urls.from_EarthExplorer_order(
            'your username', 'your passwd')
for odr in urls_info.keys():
    folder = folder_out.joinpath(odr)
    if not folder.exists():
        folder.mkdir()
    urls = urls_info[odr]
    downloader.download_datas(urls, folder)
