When scraping emails, a site will usually have something like:
domain.com/contacts or domain.com/people/contact: one specific directory that holds the emails, while pretty much everything else is useless. I don't need to scrape domain.com/shop or domain.com/products, yet I waste dozens of hours on pages like those.
Yet for large sites, there are thousands of useless pages in those directories. Would it be possible to do something like:
- If a lot of emails are found in a specific directory, ignore the other directories. So if I set the depth to 3, the crawl would go 3 levels deep in that directory only.
In other words, is there a smarter, more efficient way to detect which directories contain contact details and ignore all the rest?
Thanks!
Edit: I normally have 1,000+ URLs, so I can't look up each site manually and enter the directory that holds the contacts. I would like to auto-detect it somehow.
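
To make the idea concrete, here is a rough Python sketch of the behavior I'm describing. It is not a feature of any existing scraper. The names `DirectoryFocus`, `top_directory`, `min_emails` and `min_share` are all made up for illustration, the thresholds are guesses, and it assumes absolute URLs that include the scheme (https://...). It only decides which links are worth following; the fetching, the email extraction and the depth limit would stay with whatever crawler is in use.

```python
from collections import Counter
from urllib.parse import urlparse


def top_directory(url):
    """Return the first path segment of an absolute URL, e.g. '/contacts' ('/' for the root)."""
    segments = [s for s in urlparse(url).path.split("/") if s]
    return "/" + segments[0] if segments else "/"


class DirectoryFocus:
    """Counts emails per top-level directory and locks onto the richest one."""

    def __init__(self, min_emails=5, min_share=0.8):
        self.min_emails = min_emails  # hypothetical: emails needed before deciding
        self.min_share = min_share    # hypothetical: share of all emails in one directory
        self.counts = Counter()
        self.focus = None             # directory the crawl is restricted to, once chosen

    def record(self, page_url, emails_found):
        """Call after scraping a page; emails_found is the number of emails found on it."""
        self.counts[top_directory(page_url)] += emails_found
        total = sum(self.counts.values())
        if self.focus is None and total >= max(self.min_emails, 1):
            directory, hits = self.counts.most_common(1)[0]
            # Never lock onto the site root, or the whole site would be skipped.
            if directory != "/" and hits / total >= self.min_share:
                self.focus = directory

    def should_crawl(self, url):
        """Crawl everything until a focus is chosen, then only that directory."""
        return self.focus is None or top_directory(url) == self.focus


# Example run for one site (one DirectoryFocus instance per domain).
focus = DirectoryFocus()
focus.record("https://domain.com/shop/item-1", 0)
focus.record("https://domain.com/contacts/sales", 4)
focus.record("https://domain.com/contacts/support", 3)
print(focus.focus)                                           # /contacts
print(focus.should_crawl("https://domain.com/shop/item-2"))  # False
print(focus.should_crawl("https://domain.com/contacts/hr"))  # True
```

With 1,000+ sites there would be one such object per domain, so nothing has to be entered by hand. The catch is that the crawl still has to stumble onto the contact directory before it can lock on, so the early pages are not free.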