kloia Blog

Trick Crawlers by Providing False Data

Written by Mehmet Taskiner | Feb 24, 2020 6:13:17 AM

In today’s world, more than half of the internet traffic comes from bots, and those bots are not always friendly. Some bots are working for search engines and digital assistants, but some of your competitors may be running crawlers to get insights about your business. Let's look at how you can mitigate malicious bots.

In the past, people injected JavaScript into their pages to detect and act on malicious bots. As the industry moves towards serverless architectures, we can instead leverage Cloudflare Workers together with Cloudflare Bot Management, not just to mitigate bots but also to prepare honeypots for them. The most crucial point of this solution is that bots don’t even reach your origin servers.

How does it work?

Bot detection is possible because Cloudflare operates one of the world’s biggest networks. Cloudflare helps us act on bots by analyzing their behaviour and the anomalies in your traffic. Machine learning is another approach, used to learn bot patterns without you writing a single line of code. Lastly, Cloudflare uses fingerprinting across all of its infrastructure to classify bots properly.

All you do is turn a knob to activate Bot Management; after that, Cloudflare appends the bot score to each request.

Once Bot Management is enabled, Cloudflare exposes bot-related fields on the request. You can check the bot score in firewall rules, and these values are also available in Workers as request.cf.clientTrustScore and request.cf.verifiedBot.

Don’t forget to allow verified bots, or your site may not be indexed by Google, Yandex, Siri, etc.
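As a minimal sketch of how a Worker might read these two fields, the helper below sorts a request into a few coarse buckets and exempts verified bots first. The function name, the bucket labels, and the handling of a missing cf object are assumptions made for illustration; the threshold of 30 is borrowed from the use case later in this post and should be tuned against your own traffic.

```javascript
// Hypothetical helper, not part of any Cloudflare API.
// It only reads the two fields named in this post:
// request.cf.clientTrustScore and request.cf.verifiedBot.
const BOT_SCORE_THRESHOLD = 30; // assumption: same cut-off as the use case below

function classifyRequest(cf) {
  // cf can be missing (for example outside the Cloudflare edge), so fail open
  if (!cf) return "unknown";
  // Verified bots (search engines, assistants) are always let through
  if (cf.verifiedBot === true) return "verified-bot";
  if (typeof cf.clientTrustScore !== "number") return "unknown";
  return cf.clientTrustScore < BOT_SCORE_THRESHOLD
    ? "suspected-bot"
    : "likely-human";
}
```

Inside a fetch handler this would be called as classifyRequest(request.cf). Treating a missing score as "unknown" rather than as a bot is a deliberate choice here, so that a misconfiguration never penalizes real visitors.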

The most important thing in firewall/WAF configuration is monitoring; you should always choose logging first to understand how your traffic is shaped. After doing some research on your traffic logs, you can start challenging or blocking requests.
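The same log-first idea can be applied inside a Worker before it changes any response. The sketch below only builds a log entry describing what an enforcing version would do; the function name, the field names of the entry, and the default threshold of 30 are assumptions, and where the entry is sent (console, Sentry, an external log service) is left open.

```javascript
// Hypothetical log-only helper: it describes the decision without acting on it.
function buildBotLogEntry(url, cf, threshold = 30) {
  const score =
    cf && typeof cf.clientTrustScore === "number" ? cf.clientTrustScore : null;
  const verifiedBot = Boolean(cf && cf.verifiedBot);
  return {
    url,
    score,
    verifiedBot,
    // true when an enforcing Worker would have acted on this request
    wouldAct: score !== null && !verifiedBot && score < threshold,
  };
}

// Possible use inside a fetch handler (the response itself stays untouched):
//   console.log(JSON.stringify(buildBotLogEntry(request.url, request.cf)));
```

Running in this mode for a while shows how many requests fall under the threshold, which helps you pick a cut-off before any response is modified.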

We can’t dive deep into Bot Management without mentioning Cloudflare Workers. It is Cloudflare’s serverless platform, and it lets you inspect and modify requests, or serve content directly from the edge.

Use Case

Let’s assume you have an e-commerce platform, and you realize your competitor is crawling your website to match their prices. It would not be easy to build a countermeasure in house, but with the help of Cloudflare Workers, we can write a few lines of JavaScript to modify the API response.

After activating Bot Management, we will use the request.cf.clientTrustScore to determine if the bot is harmful or not. 

Implementation

Cloudflare has a neat tool for working with Workers called wrangler. You can install it with npm as follows: 

npm i @cloudflare/wrangler -g

You should always check the Template Gallery to see whether something similar to what you want already exists.

There is a template that modifies response headers, but we are not going to use any template for this project. We create a new wrangler project with wrangler generate my-worker. Once the project is generated, we install an unofficial Sentry SDK, because you cannot effectively track errors in your Workers without implementing a custom solution.

The cf-sentry project does not give you the familiar Sentry experience; tlianza/pigeon is much more compliant with Sentry conventions. You can add it to your project with npm install --save tlianza/pigeon. More information is available in its GitHub repo.

For this PoC, we set up an S3 bucket with website hosting enabled and uploaded a sample products.json to it. You can find products.json in our GitHub repo.

Below is what a legitimate user sees on the /products.json endpoint:


[
  {
    "title": "Brown eggs",
    "type": "dairy",
    "description": "Raw organic brown eggs in a basket",
    "filename": "0.jpg",
    "height": 600,
    "width": 400,
    "price": 58.294328241875824,
    "rating": 4
  },
  {
    "title": "Sweet fresh stawberry",
    "type": "fruit",
    "description": "Sweet fresh stawberry on the wooden table",
    "filename": "1.jpg",
    "height": 450,
    "width": 299,
    "price": 64.84671958098632,
    "rating": 4
  },
  {
    "title": "Asparagus",
    "type": "vegetable",
    "description": "Asparagus with ham on the wooden table",
    "filename": "2.jpg",
    "height": 450,
    "width": 299,
    "price": 42.730711516401065,
    "rating": 3
  }
]
    

For this scenario, we rewrite the price of each item in the response if the request's trust score is less than 30.
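The following is a minimal sketch of such a Worker, not the code from our repository. The helper names, the size of the price distortion, and the decision to leave verified bots and non-JSON paths untouched are all assumptions made for illustration; only request.cf.clientTrustScore, request.cf.verifiedBot, the threshold of 30, and the x-client-trust-score header come from this post.

```javascript
const BOT_SCORE_THRESHOLD = 30; // from the scenario described in this post

// Hypothetical helper: return a copy of the product list with inflated prices.
// The 10% to 60% markup is an arbitrary choice for this sketch.
function falsifyPrices(products, random = Math.random) {
  return products.map((product) => ({
    ...product,
    price: product.price * (1.1 + random() * 0.5),
  }));
}

// Hypothetical helper: only act on unverified clients with a low score.
function shouldFalsify(cf) {
  if (!cf || cf.verifiedBot) return false;
  return (
    typeof cf.clientTrustScore === "number" &&
    cf.clientTrustScore < BOT_SCORE_THRESHOLD
  );
}

async function handleRequest(request) {
  // Always fetch the real response from the origin first
  const originResponse = await fetch(request);
  const isProducts = new URL(request.url).pathname.endsWith("/products.json");
  if (!isProducts || !originResponse.ok || !shouldFalsify(request.cf)) {
    return originResponse;
  }

  const products = await originResponse.json();
  const headers = new Headers(originResponse.headers);
  headers.delete("content-length"); // the body length changes
  headers.set("x-client-trust-score", String(request.cf.clientTrustScore));

  return new Response(JSON.stringify(falsifyPrices(products)), {
    status: originResponse.status,
    headers,
  });
}

// Register the handler only where a Workers-style runtime provides the hook
if (typeof addEventListener === "function") {
  addEventListener("fetch", (event) => {
    event.respondWith(handleRequest(event.request));
  });
}
```

Keeping falsifyPrices and shouldFalsify as pure functions makes them easy to unit test outside the Workers runtime, and passing a custom random function makes the distortion reproducible in tests.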

You can find the full source code in our GitHub repository.

After publishing, you have to assign the Worker to a route on your Cloudflare zone.

Below is the modified response for bots from Cloudflare:


➜  ~ curl http://products-test.kloia.me/products.json -v
*   Trying 104.18.215.47...
* TCP_NODELAY set
* Connected to products-test.kloia.me (104.18.215.47) port 80 (#0)
> GET /products.json HTTP/1.1
> Host: products-test.kloia.me
> User-Agent: curl/7.54.0
> Accept: */*
>
< HTTP/1.1 200 OK
< Date: Thu, 09 Jan 2020 22:41:57 GMT
< Content-Type: application/json
< Content-Length: 536
< Connection: keep-alive
< Set-Cookie: __cfduid=d2812ecda80c88786a4e035b2e11d2c681578609716; expires=Sat, 08-Feb-20 22:41:56 GMT; path=/; domain=.kloia.me; HttpOnly; SameSite=Lax
< CF-Ray: 5529ebe91f74a792-IST
< ETag: "cbcc2377166e44217f15e49799f5bb98"
< Last-Modified: Wed, 18 Dec 2019 20:00:46 GMT
< CF-Cache-Status: DYNAMIC
< x-amz-id-2: vjek4yfVy/LhJSVot78oXzW1dr9+Vx1ai8ZMthMeXurKmt8pq2IhXJ5JGJf8ToPA6OsE0ZXcKCc=
< x-amz-request-id: 07427183927D94BB
< x-client-trust-score: 1
< Server: cloudflare
<
* Connection #0 to host products-test.kloia.me left intact
[{"title":"Brown eggs","type":"dairy","description":"Raw organic brown eggs in a basket","filename":"0.jpg","height":600,"width":400,"price":66.05054823392972,"rating":4},{"title":"Sweet fresh stawberry","type":"fruit","description":"Sweet fresh stawberry on the wooden table","filename":"1.jpg","height":450,"width":299,"price":84.19805905202143,"rating":4},{"title":"Asparagus","type":"vegetable","description":"Asparagus with ham on the wooden table","filename":"2.jpg","height":450,"width":299,"price":67.96393121019472,"rating":3}]
    

You can see the modified prices and the x-client-trust-score header for that request. This is just the beginning; it shows how you can leverage Cloudflare Workers. You can implement more scenarios, such as generating big responses to choke bots or performing country checks. Your imagination is the limit.
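As one hedged example of a country check, the sketch below tests the two-letter country code that Workers expose as request.cf.country against a list you supply. The function name and the example codes are hypothetical, and what to do on a match (log, challenge, or alter the response) is left to you.

```javascript
// Hypothetical helper: true when the request's country is in the given list.
function isFromListedCountry(cf, countryCodes) {
  // Without a country code there is nothing to match against
  if (!cf || typeof cf.country !== "string") return false;
  return countryCodes.includes(cf.country.toUpperCase());
}

// Possible use inside a fetch handler, with example codes only:
//   if (isFromListedCountry(request.cf, ["XX", "YY"])) { /* log or act */ }
```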

Last Warning

It bears repeating: before implementing this solution, you should always monitor your traffic logs to understand how bots are scored. Even then, you still risk giving a wrong response to valid clients.