2020-06-03
The following robots.txt file tells Baiduspider (baidu.com), 360Spider (so.com), Yisouspider (sm.cn), PetalBot (Huawei), Bytespider (bytedance.com), and the Sogou spiders (sogou.com) not to crawl any part of the website.
```
User-agent: Baiduspider
User-agent: 360Spider
User-agent: Yisouspider
User-agent: PetalBot
User-agent: Bytespider
User-agent: Sogou web spider
User-agent: Sogou inst spider
Disallow: /
```
However, I still see Bytespider and Sogou web spider/4.0 in the nginx access logs.
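To see how often they still come by, here is a short sketch that tallies bot hits in the access log. It assumes nginx's default combined log format, in which the user agent is the last double-quoted field; the log path is an assumption and may differ on your system.

```python
import re
from collections import Counter

# User agents we told to go away via robots.txt.
BOTS = ("Bytespider", "Sogou web spider", "Baiduspider",
        "360Spider", "Yisouspider", "PetalBot")

counts = Counter()
with open("/var/log/nginx/access.log") as log:   # path is an assumption
    for line in log:
        # In the combined log format, the user agent is the last quoted field.
        fields = re.findall(r'"([^"]*)"', line)
        ua = fields[-1] if fields else ""
        for bot in BOTS:
            if bot in ua:
                counts[bot] += 1

for bot, n in counts.most_common():
    print(f"{bot}: {n}")
```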
It seems Bytespider and the Sogou spiders are not fully compliant with the Robots Exclusion Standard; apparently their parsers do not understand a group that lists several User-agent lines before a single Disallow rule. These crawlers magically disappeared one week after I created a separate block for each of these user agents in robots.txt.
```
User-agent: Baiduspider
User-agent: 360Spider
User-agent: Yisouspider
User-agent: PetalBot
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: Sogou web spider
Disallow: /

User-agent: Sogou inst spider
Disallow: /
```
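Under the standard the two files are equivalent, and a quick sanity check with Python's urllib.robotparser (a compliant parser) confirms it; the fault lies with the crawlers' own parsers, not the original file.

```python
from urllib.robotparser import RobotFileParser

# The grouped file (one Disallow for all agents) and the separated file
# (one block per stubborn agent), as served on this site.
GROUPED = """\
User-agent: Baiduspider
User-agent: 360Spider
User-agent: Yisouspider
User-agent: PetalBot
User-agent: Bytespider
User-agent: Sogou web spider
User-agent: Sogou inst spider
Disallow: /
"""

SEPARATED = """\
User-agent: Baiduspider
User-agent: 360Spider
User-agent: Yisouspider
User-agent: PetalBot
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: Sogou web spider
Disallow: /

User-agent: Sogou inst spider
Disallow: /
"""

for name, text in (("grouped", GROUPED), ("separated", SEPARATED)):
    rp = RobotFileParser()
    rp.parse(text.splitlines())
    for ua in ("Bytespider", "Sogou web spider/4.0"):
        allowed = rp.can_fetch(ua, "https://example.com/")
        print(f"{name}: {ua!r} allowed={allowed}")
```

Every combination prints allowed=False, so a compliant crawler would behave identically with either file.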
The table below lists reverse DNS patterns for common crawlers, which can be used to verify that a request really comes from the search engine its user agent claims.

| Search Engine | User Agent | Reverse DNS Lookup |
|---|---|---|
| Baidu | Baiduspider | baiduspider-*-*-*-*.crawl.baidu.com |
| Sogou | Sogou web spider | sogouspider-*-*-*-*.crawl.sogou.com |
| ByteDance | Bytespider | bytespider-*-*-*-*.crawl.bytedance.com |
| Shenma | Yisouspider | shenmaspider-*-*-*-*.crawl.sm.cn |
| Huawei | PetalBot | petalbot-*-*-*-*.aspiegel.com |
| LINE (Naver) | Linespider | crawl.*-*-*-*.search.line-apps.com |
| Naver | Yeti | crawl.*-*-*-*.web.naver.com |
| Cốc Cốc | coccocbot | bot-*-*-*-*.coccoc.com |
| Qwant | Qwantify | qwantbot-*-*-*-*.qwant.com |
| Apple | Applebot | *-*-*-*.applebot.apple.com |
| Twitter | Twitterbot | r-*-*-*-*.twttr.com |
| Facebook | facebookexternalhit | fwdproxy-*-*.fbsv.net |
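A minimal sketch for checking an address from the access log against these patterns: do a reverse DNS lookup, check the hostname suffix, then resolve the hostname forward again to make sure the PTR record is not spoofed. The IP below is a documentation placeholder, not a real crawler address.

```python
import socket

def verify_crawler(ip: str, expected_suffix: str) -> bool:
    """Forward-confirmed reverse DNS check for a suspected crawler IP."""
    try:
        hostname, _, _ = socket.gethostbyaddr(ip)   # reverse (PTR) lookup
    except socket.herror:
        return False
    if not hostname.endswith(expected_suffix):
        return False
    try:
        # The hostname must resolve back to the original IP;
        # otherwise the PTR record could be spoofed.
        forward_ips = {info[4][0] for info in socket.getaddrinfo(hostname, None)}
    except socket.gaierror:
        return False
    return ip in forward_ips

# Placeholder IP (TEST-NET-3); substitute one from your access log.
print(verify_crawler("203.0.113.7", ".crawl.baidu.com"))
```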