2020-06-03
The following robots.txt file tells Baiduspider (baidu.com), 360Spider (so.com), Yisouspider (sm.cn), PetalBot (Huawei), Bytespider (bytedance.com), and the Sogou spiders (sogou.com) not to crawl any part of the website.
    User-agent: Baiduspider
    User-agent: 360Spider
    User-agent: Yisouspider
    User-agent: PetalBot
    User-agent: Bytespider
    User-agent: Sogou web spider
    User-agent: Sogou inst spider
    Disallow: /
However, I still saw Bytespider and Sogou web spider/4.0 in the nginx access logs.
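robots.txt is purely advisory, so a crawler that chooses to ignore it can only be stopped at the server. A minimal nginx sketch that refuses these user agents outright (the `map` and `return` directives are standard nginx; the server name and the pattern list are illustrative, mirroring the robots.txt above):

```nginx
# In the http context: map the User-Agent header to a flag.
# ~* makes the pattern match case-insensitively.
map $http_user_agent $blocked_crawler {
    default                       0;
    ~*Bytespider                  1;
    "~*Sogou (web|inst) spider"   1;
}

server {
    listen      80;
    server_name example.com;

    # Refuse flagged crawlers before serving anything.
    if ($blocked_crawler) {
        return 403;
    }
}
```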
It seems Bytespider and the Sogou spiders are not fully compatible with the Robots Exclusion Standard: they apparently do not handle a group in which several User-agent lines share one Disallow rule. These crawlers magically disappeared one week after I created a separate block for each misbehaving user agent in robots.txt.
    User-agent: Baiduspider
    User-agent: 360Spider
    User-agent: Yisouspider
    User-agent: PetalBot
    Disallow: /

    User-agent: Bytespider
    Disallow: /

    User-agent: Sogou web spider
    Disallow: /

    User-agent: Sogou inst spider
    Disallow: /
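For what it's worth, both layouts are equivalent under the Robots Exclusion Standard: consecutive User-agent lines form a single group that shares the rules that follow. Python's `urllib.robotparser`, for instance, applies a shared group's Disallow to every listed agent (the URLs here are placeholders):

```python
from urllib import robotparser

# The original shared-group form: two User-agent lines, one Disallow.
rules = """\
User-agent: Baiduspider
User-agent: Bytespider
Disallow: /
"""

rp = robotparser.RobotFileParser()
rp.parse(rules.splitlines())
rp.modified()  # record a load time; can_fetch() treats never-loaded files as all-disallowed

# Both listed agents are excluded from the whole site...
print(rp.can_fetch("Bytespider", "https://example.com/any/page"))
print(rp.can_fetch("Baiduspider", "https://example.com/"))
# ...while an unlisted agent is unaffected.
print(rp.can_fetch("Googlebot", "https://example.com/any/page"))
```

So the separate-block rewrite should not have been necessary; it just works around parsers that only honor one User-agent line per group.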
| Search Engine | User Agent | Reverse DNS Lookup |
|---|---|---|
| Baidu | Baiduspider | baiduspider-*-*-*-*.crawl.baidu.com |
| Sogou | Sogou web spider | sogouspider-*-*-*-*.crawl.sogou.com |
| ByteDance | Bytespider | bytespider-*-*-*-*.crawl.bytedance.com |
| Shenma | Yisouspider | shenmaspider-*-*-*-*.crawl.sm.cn |
| Huawei | PetalBot | petalbot-*-*-*-*.aspiegel.com |
| LINE (Naver) | Linespider | crawl.*-*-*-*.search.line-apps.com |
| Naver | Yeti | crawl.*-*-*-*.web.naver.com |
| Cốc Cốc | coccocbot | bot-*-*-*-*.coccoc.com |
| Qwant | Qwantify | qwantbot-*-*-*-*.qwant.com |
| Apple | Applebot | *-*-*-*.applebot.apple.com |
| Twitter | Twitterbot | r-*-*-*-*.twttr.com |
| Facebook | facebookexternalhit | fwdproxy-*-*.fbsv.net |
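User-agent strings are trivially spoofed, so the reverse DNS column above is the reliable signal. The usual check is forward-confirmed reverse DNS: resolve the client IP to a hostname, verify the hostname belongs to the crawler's domain, then resolve that hostname back and confirm the original IP is among the results. A minimal Python sketch using the standard `socket` module (the helper names are mine, and the lookups need network access):

```python
import socket

def hostname_matches(hostname, expected_suffixes):
    """True if hostname is one of the expected domains or a subdomain of one."""
    h = hostname.rstrip(".").lower()
    return any(h == s or h.endswith("." + s) for s in expected_suffixes)

def verify_crawler(ip, expected_suffixes):
    """Forward-confirmed reverse DNS: IP -> hostname -> IPs must round-trip."""
    try:
        hostname, _, _ = socket.gethostbyaddr(ip)        # reverse lookup
    except socket.herror:
        return False
    if not hostname_matches(hostname, expected_suffixes):
        return False
    try:
        _, _, addrs = socket.gethostbyname_ex(hostname)  # forward lookup
    except socket.gaierror:
        return False
    return ip in addrs

# e.g. verify_crawler("<client ip from access log>", ["crawl.baidu.com"])
```

A request claiming to be Baiduspider whose IP does not round-trip through crawl.baidu.com can be treated as an impostor.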