How to Block Sogou Spiders and Bytespider with robots.txt

2020-06-03

The following robots.txt file tells Baiduspider (baidu.com), 360Spider (so.com), Yisouspider (sm.cn), PetalBot (Huawei), Bytespider (bytedance.com), and the Sogou spiders (sogou.com) not to crawl any part of the website.

User-agent: Baiduspider
User-agent: 360Spider
User-agent: Yisouspider
User-agent: PetalBot
User-agent: Bytespider
User-agent: Sogou web spider
User-agent: Sogou inst spider
Disallow: /

However, I still saw Bytespider and Sogou web spider/4.0 in the nginx access logs.
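
A quick way to check whether these crawlers keep coming back is to count their hits in the access log. Here is a minimal Python sketch, assuming the default nginx "combined" log format (user agent in the last quoted field) and a log at /var/log/nginx/access.log; both the path and the bot list are examples to adjust for your setup.

# count_bot_hits.py - rough count of requests per bot user agent.
# Assumes the nginx "combined" log format, where the user agent is the
# last double-quoted field on each line. Path and bot names are examples.
from collections import Counter

BOTS = ["Baiduspider", "360Spider", "Yisouspider", "PetalBot",
        "Bytespider", "Sogou web spider", "Sogou inst spider"]

counts = Counter()
with open("/var/log/nginx/access.log", encoding="utf-8", errors="replace") as log:
    for line in log:
        # The user agent is the text between the final pair of quotes.
        ua = line.rsplit('"', 2)[-2] if line.count('"') >= 2 else ""
        for bot in BOTS:
            if bot.lower() in ua.lower():
                counts[bot] += 1

for bot, n in counts.most_common():
    print(f"{n:8d}  {bot}")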

It seems Bytespider and the Sogou spiders do not fully comply with the Robots Exclusion Standard, possibly because they cannot handle a group that lists multiple User-agent lines. These crawlers magically disappeared one week after I gave each user agent its own block in robots.txt.

User-agent: Baiduspider
User-agent: 360Spider
User-agent: Yisouspider
User-agent: PetalBot
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: Sogou web spider
Disallow: /

User-agent: Sogou inst spider
Disallow: /

How to verify bots by reverse DNS lookup

Search Engine   User Agent           Reverse DNS hostname
Baidu           Baiduspider          baiduspider-*-*-*-*.crawl.baidu.com
Sogou           Sogou web spider     sogouspider-*-*-*-*.crawl.sogou.com
ByteDance       Bytespider           bytespider-*-*-*-*.crawl.bytedance.com
Shenma          Yisouspider          shenmaspider-*-*-*-*.crawl.sm.cn
Huawei          PetalBot             petalbot-*-*-*-*.aspiegel.com
LINE (Naver)    Linespider           crawl.*-*-*-*.search.line-apps.com
Naver           Yeti                 crawl.*-*-*-*.web.naver.com
Cốc Cốc         coccocbot            bot-*-*-*-*.coccoc.com
Qwant           Qwantify             qwantbot-*-*-*-*.qwant.com
Apple           Applebot             *-*-*-*.applebot.apple.com
Twitter         Twitterbot           r-*-*-*-*.twttr.com
Facebook        facebookexternalhit  fwdproxy-*-*.fbsv.net
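
To verify a bot, resolve the requesting IP back to a hostname, check that it matches the pattern in the table above, then resolve that hostname forward again and confirm it returns the original IP (forward-confirmed reverse DNS). Here is a minimal Python sketch using the standard socket module; the sample IP at the bottom is only a placeholder, not a confirmed Sogou address.

# verify_bot.py - forward-confirmed reverse DNS check for a crawler IP.
import socket

def verify_bot(ip: str, domain_suffix: str) -> bool:
    """Return True if ip reverse-resolves to a host under domain_suffix
    and that host resolves forward to the same ip."""
    try:
        host, _, _ = socket.gethostbyaddr(ip)          # reverse (PTR) lookup
    except socket.herror:
        return False
    if not host.rstrip(".").endswith(domain_suffix):
        return False
    try:
        _, _, addrs = socket.gethostbyname_ex(host)    # forward (A) lookup
    except socket.gaierror:
        return False
    return ip in addrs

# Example: check an IP that claims to be Sogou web spider (placeholder IP).
print(verify_bot("123.126.113.100", ".crawl.sogou.com"))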