2021-05-25 - Let's make a CLI web bot!

Test your coding skillz0rs!
16BitMiker.com
Site Admin
Posts: 453
Joined: Tue Dec 22, 2020 5:34 pm
Location: Toronto

For this code challenge, let's create a command-line web bot that continually surfs a specific website at random. Why would you want that? So those pesky system administrators who are monitoring your Internet usage think that you are hard at work LOL.

You can use any language, so long as it's a command-line-driven app. Post your code below when you're finished!

Hint: if you need a place to start, here's something to think about:
  • You'll need a way to extract a valid list of URLs.
  • You'll need an infinite loop that randomly selects one of those URLs to visit.
  • You'll need a way to validate that the URL you're visiting is an HTML markup file.
  • You'll need a timer so your web crawling looks more 'natural' and humanlike.
Good luck! Any questions, post 'em below.
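
Three of the hints (link extraction, random selection, and the timer) reduce to small pure functions. Here is a minimal sketch, assuming Node.js; the helper names and the `https://example.test` origin are made up for illustration and are not part of the challenge.

```javascript
// Hypothetical helpers; the names are illustrative only.

// Pull absolute same-site links out of an HTML string.
function extractLinks(html, origin) {
  // Escape regex metacharacters so "." in the origin matches a literal dot.
  const escaped = origin.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
  const pattern = new RegExp(`(?<=href=")${escaped}/[^"]*`, 'g');
  // A Set drops duplicates so heavily linked pages are not over-weighted.
  return [...new Set(html.match(pattern) || [])];
}

// Choose one element uniformly at random (rng is injectable for testing).
function pickRandom(items, rng = Math.random) {
  return items[Math.floor(rng() * items.length)];
}

// Random pause between minMs and maxMs (inclusive), in milliseconds.
function randomDelayMs(minMs, maxMs, rng = Math.random) {
  return minMs + Math.floor(rng() * (maxMs - minMs + 1));
}

// Example with a made-up page fragment.
const sample = '<a href="https://example.test/about">About</a>';
console.log(extractLinks(sample, 'https://example.test'));
```

A main loop could then fetch a page, call `extractLinks`, wait `randomDelayMs(1000, 5000)` via `setTimeout`, and repeat with `pickRandom`. The 1 to 5 second range is an arbitrary choice.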
16BitMiker.com

My Perl solution is below.

Open a Linux terminal and run the following. You may need to install LWP::Simple first.

FINAL ANSWER v2 (178 characters)
yes | perl -M'LWP::Simple' -plse '$u=$_=@{[keys %{{${\get($u?$u:$U)}=~m`(?<=")$U/[^"]+`g}}]}[0];($u=$U) && redo if !("@{[head $u]}"=~m~t/html~)' -- -U='https://geekalicious.blog'
Full breakdown @ https://geekalicious.blog/wordpress-inf ... wp-simple/
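
The `head $u` call in the one-liner comes from LWP::Simple. It returns the content type as the first element of its result, which the one-liner then matches loosely against `t/html`. A rough JavaScript equivalent of that validation step might look like the sketch below. `looksLikeHtml` and `headContentType` are hypothetical names, not part of the posted solution.

```javascript
const https = require('https');

// Same loose match the one-liner uses: any value containing "t/html".
function looksLikeHtml(contentType) {
  return /t\/html/.test(String(contentType || ''));
}

// Hypothetical helper: send a HEAD request and resolve with the
// Content-Type response header (undefined if the server sends none).
function headContentType(url) {
  return new Promise((resolve, reject) => {
    const req = https.request(url, { method: 'HEAD' }, res => {
      res.resume(); // discard any body so the socket is released
      resolve(res.headers['content-type']);
    });
    req.on('error', reject);
    req.end();
  });
}

// Usage (performs a real network request, so it is left commented out):
// headContentType('https://geekalicious.blog/')
//   .then(type => console.log(looksLikeHtml(type)));
```

A HEAD request avoids downloading the body just to find out that the link points at an image or a feed.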
willow
Posts: 0
Joined: Wed May 26, 2021 2:47 am

Here's my JavaScript solution. To save characters, I decided to break all the rules and just use global variables for everything.

Code:

node -e 'v="https";x=require(v);h="geekalicious.blog";u=v+"://"+h;s=(l)=>{p=l[Math.floor(Math.random()*l.length)].split(u)[1];console.log(u+p);o={hostname:h,port:443,method:"GET",path:p};d="";rq=x.request(o,rs=>{rs.on("data",c=>{d+=c});rs.on("end",()=>{n=d.match(/(?<=href=")https:\/\/geekalicious\.blog[^"]+/g);!n||n.length<=1?(console.log("bad link"),setTimeout(()=>{s(l)},500)):setTimeout(()=>{s(n)},500)})});rq.on("error",e=>{console.error(e)});rq.end()};s([u+"/"]);'
Woohoo! This was suuuuper fun.

I guess if you wanna be able to read it easier, here's the non-one-liner:

Code:

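const https = require('https');

// Host to crawl; links to other hosts are ignored.
const url = 'geekalicious.blog';
// Absolute same-site links inside href attributes (dot escaped so it matches literally).
const regex = /(?<=href=")https:\/\/geekalicious\.blog[^"]+/g;

// Visit one random link from `links`, then continue with the links found on that page.
function scrape(links) {
  // Pick a random link and strip the origin to get the request path.
  const path = links[Math.floor(Math.random() * links.length)].split(`https://${url}`)[1];
  console.log(`https://${url}${path}`);
  const options = {
    hostname: url,
    port: 443,
    method: 'GET',
    path,
  };
  let data = '';
  const req = https.request(options, res => {
    // Accumulate the response body as it streams in.
    res.on('data', chunk => { data += chunk; });
    res.on('end', () => {
      const newLinks = data.match(regex) || [];
      if (newLinks.length <= 1) {
        // Dead end (probably not an HTML page): retry from the previous link list.
        console.log('bad link');
        setTimeout(() => {
          scrape(links);
        }, 500);
      } else {
        // Wait half a second, then continue from the links on this page.
        setTimeout(() => {
          scrape(newLinks);
        }, 500);
      }
    });
  });

  // Log network errors; the crawl stops here if the request fails.
  req.on('error', error => {
    console.error(error);
  });

  req.end();
}

// Start from the home page.
scrape(['https://geekalicious.blog/']);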