Data Downloader
Make downloading scientific data much easier
Introduction
data_downloader is a convenient and powerful package for retrieving files over HTTP and HTTPS. It currently contains a download module, downloader, and a url-parsing module, parse_urls. Built on httpx, it can access websites both synchronously and asynchronously, so you can download multiple files at the same time.
data_downloader has many features that make retrieving files easy, including:
- Automatically resuming aborted downloads when the code is re-run, if the website supports resuming (the status code is 216 or 416 when a HEAD request containing a Range header is sent to the server)
- Downloading multiple files simultaneously when downloading a single file is slow. Two functions are provided for this:
  - async_download_datas (recommended): can download more than 100 files at the same time using asynchronous requests via httpx
  - mp_download_datas: uses the multiprocessing package, so throughput depends on the number of CPU cores of your computer
- A convenient way to manage your usernames and passwords through the .netrc file or the authorize_from_browser parameter, so there is no need to provide them for every download when a website requires login
- A convenient way to parse urls:
  - from_urls_file: parse urls from a file that contains only urls
  - from_sentinel_meta4: parse urls from a Sentinel products.meta4 file downloaded from https://scihub.copernicus.eu/dhus
  - from_EarthExplorer_order: parse urls from orders in EarthExplorer (similar to bulk-downloader)
  - from_html: parse urls from an html website
1. Installation
It is recommended to install data_downloader with the latest version of pip:
pip install data_downloader
2. downloader usage
All download functions are in data_downloader.downloader, so import downloader first:
from data_downloader import downloader
2.1 Netrc
If a website requires login, you can add a record containing your login information to the .netrc file in your home directory, to avoid providing a username and password every time you download data.
View the existing hosts in the .netrc file:
netrc = downloader.Netrc()
print(netrc.hosts)
Add a record:
netrc.add(self, host, login, password, account=None, overwrite=False)
If you want to update a record, set the parameter overwrite=True.
For NASA data users:
netrc.add('urs.earthdata.nasa.gov','your_username','your_password')
When you don't know the host of a website, you can use downloader.get_url_host(url) to get the hostname:
host = downloader.get_url_host(url)
Remove a record:
netrc.remove(self, host)
Clear all records:
netrc.clear()
Example:
In [2]: netrc = downloader.Netrc()
In [3]: netrc.hosts
Out[3]: {}
In [4]: netrc.add('urs.earthdata.nasa.gov','username','passwd')
In [5]: netrc.hosts
Out[5]: {'urs.earthdata.nasa.gov': ('username', None, 'passwd')}
In [6]: netrc
Out[6]:
machine urs.earthdata.nasa.gov
login username
password passwd
# This command is only for linux users
In [7]: !cat ~/.netrc
machine urs.earthdata.nasa.gov
login username
password passwd
In [8]: url = 'https://gpm1.gesdisc.eosdis.nasa.gov/daac-bin/OTF/HTTP_services.cgi?FILENAME=%2Fdata%2FGPM_L3%2FGPM_3IMERGM.06%2F2000%2F3B-MO.MS.MRG.3IMERG.20000601-S000000-E235959.06.V06B.HDF5&FORMAT=bmM0Lw&BBOX=31.904%2C99.492%2C35.771%2C105.908&LABEL=3B-MO.MS.MRG.3IMERG.20000601-S000000-E235959.06.V06B.HDF5.SUB.nc4&SHORTNAME=GPM_3IMERGM&SERVICE=L34RS_GPM&VERSION=1.02&DATASET_VERSION=06&VARIABLES=precipitation'
In [9]: downloader.get_url_host(url)
Out[9]: 'gpm1.gesdisc.eosdis.nasa.gov'
In [10]: netrc.add(downloader.get_url_host(url),'username','passwd')
In [11]: netrc
Out[11]:
machine urs.earthdata.nasa.gov
login username
password passwd
machine gpm1.gesdisc.eosdis.nasa.gov
login username
password passwd
In [12]: netrc.add(downloader.get_url_host(url),'username','new_passwd')
>>> Warning: gpm1.gesdisc.eosdis.nasa.gov existed, nothing will be done. If you want to overwrite the existed record, set overwrite=True
In [13]: netrc
Out[13]:
machine urs.earthdata.nasa.gov
login username
password passwd
machine gpm1.gesdisc.eosdis.nasa.gov
login username
password passwd
In [14]: netrc.add(downloader.get_url_host(url),'username','new_passwd',overwrite=True)
In [15]: netrc
Out[15]:
machine urs.earthdata.nasa.gov
login username
password passwd
machine gpm1.gesdisc.eosdis.nasa.gov
login username
password new_passwd
In [16]: netrc.remove(downloader.get_url_host(url))
In [17]: netrc
Out[17]:
machine urs.earthdata.nasa.gov
login username
password passwd
In [18]: netrc.clear()
In [19]: netrc.hosts
Out[19]: {}
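The .netrc file written above is plain text, so it can also be inspected with only the standard library. The sketch below uses a hypothetical helper (not part of data_downloader) to append a record in the same format, then reads it back with Python's built-in netrc module:

```python
import netrc
import os
import tempfile

# Hypothetical helper, not the package's implementation: append a record
# in the plain-text .netrc format shown in the session above.
def add_record(path, host, login, password):
    with open(path, 'a') as f:
        f.write(f'machine {host}\nlogin {login}\npassword {password}\n')

# Use a temporary file here instead of ~/.netrc to keep the sketch safe
path = os.path.join(tempfile.mkdtemp(), 'netrc')
open(path, 'w').close()
add_record(path, 'urs.earthdata.nasa.gov', 'username', 'passwd')

# The standard library parses the same (login, account, password) triple
# that netrc.hosts displayed above
auth = netrc.netrc(path).authenticators('urs.earthdata.nasa.gov')
print(auth)  # ('username', None, 'passwd')
```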
2.2 download_data
This function is designed for downloading a single file. If you have many files to download, try download_datas, mp_download_datas, or async_download_datas instead.
downloader.download_data(url, folder=None, authorize_from_browser=False, file_name=None, client=None, allow_redirects=False, retry=0)
Parameters:
url: str
url of web file
folder: str
the folder to store output files. Default is current folder.
authorize_from_browser: bool
whether to load cookies used by your web browser for authorization.
This means you can use python to download data by logging in to the website
via your browser (so far the following browsers are supported: Chrome, Firefox,
Opera, Edge, Chromium). It is very useful when the website doesn't support
"HTTP Basic Auth". Default is False.
file_name: str
the file name. If None, will parse from web response or url.
file_name can be the absolute path if folder is None.
client: httpx.Client() object
client maintaining connection. Default is None
allow_redirects: bool
Enables or disables HTTP redirects
retry: int
number of reconnects when status code is 503
Example:
In [6]: url = 'http://gws-access.ceda.ac.uk/public/nceo_geohazards/LiCSAR_products/106/106D_05049_131313/interferograms/20141117_20141211/20141117_201
...: 41211.geo.unw.tif'
...:
...: folder = 'D:\\data'
...: downloader.download_data(url,folder)
20141117_20141211.geo.unw.tif: 2%|▌ | 455k/22.1M [00:52<42:59, 8.38kB/s]
2.3 download_datas
Download data from a list of urls. This function will download the files one by one.
downloader.download_datas(urls, folder=None, authorize_from_browser=False, file_names=None):
Parameters:
urls: iterator
iterator contains urls
folder: str
the folder to store output files. Default is current folder.
authorize_from_browser: bool
whether to load cookies used by your web browser for authorization.
This means you can use python to download data by logining in to website
via browser (So far the following browsers are supported: Chrome,Firefox,
Opera, Edge, Chromium"). It will be very usefull when website doesn't support
"HTTP Basic Auth". Default is False.
file_names: iterator
iterator contains names of files. Leave it None if you want the program to parse
them from the website. file_names can contain the absolute paths if folder is None.
Example:
In [12]: from data_downloader import downloader
...:
...: urls=['http://gws-access.ceda.ac.uk/public/nceo_geohazards/LiCSAR_products/106/106D_05049_131313/interferograms/20141117_20141211/20141117_20
...: 141211.geo.unw.tif',
...: 'http://gws-access.ceda.ac.uk/public/nceo_geohazards/LiCSAR_products/106/106D_05049_131313/interferograms/20141024_20150221/20141024_20150221
...: .geo.unw.tif',
...: 'http://gws-access.ceda.ac.uk/public/nceo_geohazards/LiCSAR_products/106/106D_05049_131313/interferograms/20141024_20150128/20141024_20150128
...: .geo.cc.tif',
...: 'http://gws-access.ceda.ac.uk/public/nceo_geohazards/LiCSAR_products/106/106D_05049_131313/interferograms/20141024_20150128/20141024_20150128
...: .geo.unw.tif',
...: 'http://gws-access.ceda.ac.uk/public/nceo_geohazards/LiCSAR_products/106/106D_05049_131313/interferograms/20141211_20150128/20141211_20150128
...: .geo.cc.tif',
...: 'http://gws-access.ceda.ac.uk/public/nceo_geohazards/LiCSAR_products/106/106D_05049_131313/interferograms/20141117_20150317/20141117_20150317
...: .geo.cc.tif',
...: 'http://gws-access.ceda.ac.uk/public/nceo_geohazards/LiCSAR_products/106/106D_05049_131313/interferograms/20141117_20150221/20141117_20150221
...: .geo.cc.tif']
...:
...: folder = 'D:\\data'
...: downloader.download_datas(urls,folder)
20141117_20141211.geo.unw.tif: 6%|█ | 1.37M/22.1M [03:09<2:16:31, 2.53kB/s]
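If you want to control the output names instead of letting the program parse them from the website, the file_names parameter can be derived from the urls themselves. A small standard-library sketch (illustrative; the variable names are arbitrary):

```python
from urllib.parse import urlparse
from os.path import basename

urls = [
    'http://gws-access.ceda.ac.uk/public/nceo_geohazards/LiCSAR_products/106/106D_05049_131313/interferograms/20141117_20141211/20141117_20141211.geo.unw.tif',
    'http://gws-access.ceda.ac.uk/public/nceo_geohazards/LiCSAR_products/106/106D_05049_131313/interferograms/20141024_20150221/20141024_20150221.geo.unw.tif',
]

# Use the last path segment of each url as the local file name
file_names = [basename(urlparse(u).path) for u in urls]
print(file_names)
# Then: downloader.download_datas(urls, folder, file_names=file_names)
```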
2.4 mp_download_datas
Download files simultaneously using multiprocessing. Downloads from websites that don't support resuming may be incomplete; in that case you can use download_datas instead.
downloader.mp_download_datas(urls, folder=None, authorize_from_browser=False, file_names=None,ncore=None, desc='')
Parameters:
urls: iterator
iterator contains urls
folder: str
the folder to store output files. Default is current folder.
authorize_from_browser: bool
whether to load cookies used by your web browser for authorization.
This means you can use python to download data by logging in to the website
via your browser (so far the following browsers are supported: Chrome, Firefox,
Opera, Edge, Chromium). It is very useful when the website doesn't support
"HTTP Basic Auth". Default is False.
file_names: iterator
iterator contains names of files. Leave it None if you want the program to parse
them from the website. file_names can contain the absolute paths if folder is None.
ncore: int
Number of cores for parallel processing. If ncore is None then the number returned
by os.cpu_count() is used. Default is None.
desc: str
description of data downloading
Example:
In [12]: from data_downloader import downloader
...:
...: urls=['http://gws-access.ceda.ac.uk/public/nceo_geohazards/LiCSAR_products/106/106D_05049_131313/interferograms/20141117_20141211/20141117_20
...: 141211.geo.unw.tif',
...: 'http://gws-access.ceda.ac.uk/public/nceo_geohazards/LiCSAR_products/106/106D_05049_131313/interferograms/20141024_20150221/20141024_20150221
...: .geo.unw.tif',
...: 'http://gws-access.ceda.ac.uk/public/nceo_geohazards/LiCSAR_products/106/106D_05049_131313/interferograms/20141024_20150128/20141024_20150128
...: .geo.cc.tif',
...: 'http://gws-access.ceda.ac.uk/public/nceo_geohazards/LiCSAR_products/106/106D_05049_131313/interferograms/20141024_20150128/20141024_20150128
...: .geo.unw.tif',
...: 'http://gws-access.ceda.ac.uk/public/nceo_geohazards/LiCSAR_products/106/106D_05049_131313/interferograms/20141211_20150128/20141211_20150128
...: .geo.cc.tif',
...: 'http://gws-access.ceda.ac.uk/public/nceo_geohazards/LiCSAR_products/106/106D_05049_131313/interferograms/20141117_20150317/20141117_20150317
...: .geo.cc.tif',
...: 'http://gws-access.ceda.ac.uk/public/nceo_geohazards/LiCSAR_products/106/106D_05049_131313/interferograms/20141117_20150221/20141117_20150221
...: .geo.cc.tif']
...:
...: folder = 'D:\\data'
...: downloader.mp_download_datas(urls,folder)
>>> 12 parallel downloading
>>> Total | : 0%| | 0/7 [00:00<?, ?it/s]
20141211_20150128.geo.cc.tif: 15%|██▊ | 803k/5.44M [00:00<?, ?B/s]
2.5 async_download_datas
Download files simultaneously in asynchronous mode. Downloads from websites that don't support resuming may be incomplete; in that case you can use download_datas instead.
downloader.async_download_datas(urls, folder=None, authorize_from_browser=False, file_names=None, limit=30, desc='', allow_redirects=False, retry=0)
Parameters:
urls: iterator
iterator contains urls
folder: str
the folder to store output files. Default is current folder.
authorize_from_browser: bool
whether to load cookies used by your web browser for authorization.
This means you can use python to download data by logging in to the website
via your browser (so far the following browsers are supported: Chrome, Firefox,
Opera, Edge, Chromium). It is very useful when the website doesn't support
"HTTP Basic Auth". Default is False.
file_names: iterator
iterator contains names of files. Leave it None if you want the program
to parse them from the website. file_names can contain the absolute paths if folder is None.
limit: int
the number of files downloading simultaneously
desc: str
description of the data being downloaded
allow_redirects: bool
Enables or disables HTTP redirects
retry: int
number of reconnections when status code is 503
Example:
In [3]: from data_downloader import downloader
...:
...: urls=['http://gws-access.ceda.ac.uk/public/nceo_geohazards/LiCSAR_products/106/106D_05049
...: _131313/interferograms/20141117_20141211/20141117_20141211.geo.unw.tif',
...: 'http://gws-access.ceda.ac.uk/public/nceo_geohazards/LiCSAR_products/106/106D_05049_13131
...: 3/interferograms/20141024_20150221/20141024_20150221.geo.unw.tif',
...: 'http://gws-access.ceda.ac.uk/public/nceo_geohazards/LiCSAR_products/106/106D_05049_13131
...: 3/interferograms/20141024_20150128/20141024_20150128.geo.cc.tif',
...: 'http://gws-access.ceda.ac.uk/public/nceo_geohazards/LiCSAR_products/106/106D_05049_13131
...: 3/interferograms/20141024_20150128/20141024_20150128.geo.unw.tif',
...: 'http://gws-access.ceda.ac.uk/public/nceo_geohazards/LiCSAR_products/106/106D_05049_13131
...: 3/interferograms/20141211_20150128/20141211_20150128.geo.cc.tif',
...: 'http://gws-access.ceda.ac.uk/public/nceo_geohazards/LiCSAR_products/106/106D_05049_13131
...: 3/interferograms/20141117_20150317/20141117_20150317.geo.cc.tif',
...: 'http://gws-access.ceda.ac.uk/public/nceo_geohazards/LiCSAR_products/106/106D_05049_13131
...: 3/interferograms/20141117_20150221/20141117_20150221.geo.cc.tif']
...:
...: folder = 'D:\\data'
...: downloader.async_download_datas(urls,folder,limit=3,desc='interferograms')
>>> Total | Interferograms : 0%| | 0/7 [00:00<?, ?it/s]
20141024_20150221.geo.unw.tif: 11%|▌ | 2.41M/21.2M [00:11<41:44, 7.52kB/s]
20141117_20141211.geo.unw.tif: 9%|▍ | 2.06M/22.1M [00:11<25:05, 13.3kB/s]
20141024_20150128.geo.cc.tif: 36%|██▏ | 1.98M/5.42M [00:12<04:17, 13.4kB/s]
20141117_20150317.geo.cc.tif: 0%| | 0.00/5.44M [00:00<?, ?B/s]
20141117_20150221.geo.cc.tif: 0%| | 0.00/5.47M [00:00<?, ?B/s]
20141024_20150128.geo.unw.tif: 0%| | 0.00/23.4M [00:00<?, ?B/s]
20141211_20150128.geo.cc.tif: 0%| | 0.00/5.44M [00:00<?, ?B/s]
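The effect of the limit parameter above (only 3 of the 7 files transferring at once) can be pictured as a semaphore that gates concurrent downloads. The following is an illustrative sketch of that pattern with a fake download coroutine, not the package's actual implementation:

```python
import asyncio

# Illustrative only: a semaphore caps how many "downloads" run at once,
# which is the behaviour the `limit` parameter exposes.
async def fake_download(sem, state):
    async with sem:
        state['running'] += 1
        state['peak'] = max(state['peak'], state['running'])
        await asyncio.sleep(0.01)  # stands in for the real network transfer
        state['running'] -= 1

async def main(limit=3, n_files=10):
    sem = asyncio.Semaphore(limit)
    state = {'running': 0, 'peak': 0}
    await asyncio.gather(*(fake_download(sem, state) for _ in range(n_files)))
    return state['peak']

peak = asyncio.run(main())
print('peak concurrency:', peak)  # never exceeds limit=3
```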
2.6 status_ok
Check whether the given urls are accessible; the connections are made simultaneously.
downloader.status_ok(urls, limit=200, authorize_from_browser=False, timeout=60)
Parameters:
urls: iterator
iterator contains urls
limit: int
the number of urls connecting simultaneously
authorize_from_browser: bool
whether to load cookies used by your web browser for authorization.
This means you can use python to download data by logging in to the website
via your browser (so far the following browsers are supported: Chrome, Firefox,
Opera, Edge, Chromium). It is very useful when the website doesn't support
"HTTP Basic Auth". Default is False.
timeout: int
Request to stop waiting for a response after a given number of seconds
Returns:
list of results (True or False)
Example:
In [1]: from data_downloader import downloader
   ...: import numpy as np
   ...:
   ...: urls = np.array(['https://www.baidu.com',
   ...: 'https://www.bai.com/wrongurl',
   ...: 'https://cn.bing.com/',
   ...: 'https://bing.com/wrongurl',
   ...: 'https://bing.com/'])
   ...:
   ...: status_ok = downloader.status_ok(urls)
   ...: urls_accessable = urls[status_ok]
   ...: print(urls_accessable)
['https://www.baidu.com' 'https://cn.bing.com/' 'https://bing.com/']
3. parse_urls usage
Provides a very simple way to get urls from various media.
Import:
from data_downloader import parse_urls
3.1 from_urls_file
Parse urls from a file that contains only urls.
parse_urls.from_urls_file(url_file)
Parameters:
url_file: str
path to file which only contains urls
Returns:
list of urls
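Such a url file is just plain text with one url per line. As an illustration of the expected file layout, here is a minimal standard-library equivalent (a hypothetical helper, not the package's implementation):

```python
import os
import tempfile

# Hypothetical stdlib equivalent of from_urls_file: read one url per line,
# skipping blank lines. Not the package's actual code.
def urls_from_file(url_file):
    with open(url_file) as f:
        return [line.strip() for line in f if line.strip()]

# Build a small example url file
path = os.path.join(tempfile.mkdtemp(), 'urls.txt')
with open(path, 'w') as f:
    f.write('https://example.com/a.tif\n\nhttps://example.com/b.tif\n')

print(urls_from_file(path))
```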
3.2 from_sentinel_meta4
Parse urls from the Sentinel products.meta4 file downloaded from https://scihub.copernicus.eu/dhus.
parse_urls.from_sentinel_meta4(url_file)
Parameters:
url_file: str
path to products.meta4
Returns:
list of urls
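A products.meta4 file is Metalink 4 XML, where each file entry carries one or more url elements. A minimal parsing sketch with the standard library (illustrative only; the sample document below is made up and this is not the package's implementation):

```python
import xml.etree.ElementTree as ET

# Metalink 4 uses this XML namespace (RFC 5854)
NS = {'ml': 'urn:ietf:params:xml:ns:metalink'}

# Made-up minimal products.meta4 content for illustration
meta4 = '''<?xml version="1.0" encoding="UTF-8"?>
<metalink xmlns="urn:ietf:params:xml:ns:metalink">
  <file name="S1A_example.zip">
    <url>https://scihub.copernicus.eu/dhus/odata/v1/Products('uuid')/$value</url>
  </file>
</metalink>'''

# Collect every <url> element, which is the list a downloader needs
root = ET.fromstring(meta4)
urls = [u.text for u in root.findall('.//ml:url', NS)]
print(urls)
```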
3.3 from_html
Parse urls from an html website.
parse_urls.from_html(url, suffix=None, suffix_depth=0, url_depth=0)
Parameters:
url: str
the website that contains the data
suffix: list, optional
data format. suffix should be a list that can contain multiple parts.
if suffix_depth is 0, all '.' segments will be parsed.
Examples:
when set 'suffix_depth=0':
suffix of 'xxx8.1_GLOBAL.nc' should be ['.1_GLOBAL', '.nc']
suffix of 'xxx.tar.gz' should be ['.tar', '.gz']
when set 'suffix_depth=1':
suffix of 'xxx8.1_GLOBAL.nc' should be ['.nc']
suffix of 'xxx.tar.gz' should be ['.gz']
suffix_depth: integer
number of suffixes to keep
url_depth: integer
depth of urls in the website to be parsed
Returns:
list of urls
Example:
from data_downloader import parse_urls
url = 'https://cds-espri.ipsl.upmc.fr/espri/pubipsl/iasib_CH4_2014_uk.jsp'
urls = parse_urls.from_html(url, suffix=['.nc'], suffix_depth=1)
urls_all = parse_urls.from_html(url, suffix=['.nc'], suffix_depth=1, url_depth=1)
print(len(urls_all)-len(urls))
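The suffix examples given for suffix_depth above happen to match how pathlib splits multi-part suffixes, which makes them easy to check (a quick standard-library illustration, independent of the package):

```python
from pathlib import PurePath

# With suffix_depth=0 every '.' segment counts as a suffix;
# PurePath.suffixes produces the same multi-part view.
name = PurePath('xxx8.1_GLOBAL.nc')
print(name.suffixes)       # ['.1_GLOBAL', '.nc']  (suffix_depth=0 view)
print(name.suffixes[-1:])  # ['.nc']               (suffix_depth=1 view)

print(PurePath('xxx.tar.gz').suffixes)  # ['.tar', '.gz']
```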
3.4 from_EarthExplorer_order
Parse urls from orders in EarthExplorer.
Reference: bulk-downloader
parse_urls.from_EarthExplorer_order(username=None, passwd=None, email=None,
order=None, url_host=None)
Parameters:
username, passwd: str, optional
your username and passwd to log in to EarthExplorer. Could be
None if you have saved them in .netrc
email: str, optional
email address for the user that submitted the order
order: str or dict
which order to download. If None, all orders retrieved from
EarthExplorer will be used.
url_host: str
if host is not USGS ESPA
Returns:
dict in the format of {orderid: urls}
Example:
from pathlib import Path
from data_downloader import downloader, parse_urls
folder_out = Path('D:\\data')
urls_info = parse_urls.from_EarthExplorer_order(
'your username', 'your passwd')
for odr in urls_info.keys():
folder = folder_out.joinpath(odr)
if not folder.exists():
folder.mkdir()
urls = urls_info[odr]
downloader.download_datas(urls, folder)