The perfect database or how to do crawling

Posted: Sun Dec 15, 2024 5:38 am
by chandona
I once saw a company advertising itself with the slogan "The best database is the Internet." At the time, about five years ago, the phrase sounded fantastic, but making it a reality was downright difficult, and I was aware that whoever was saying it had no idea of its technical complexity.

Over the years, my suspicion has been confirmed. What sounded great commercially turned out to be, technically, a very basic way of approaching Internet data.

On the other hand, I have been thinking for years that the best information about a company can be found directly on its website. This is where companies describe who they are, what they do and where they are located.

I have been working on building business databases for years and I know the strengths and weaknesses of accounting records, business activity codes and all the traditional approaches to generating databases and knowledge about companies.



LET'S LAY THE FOUNDATIONS
Although it may sound very critical, 95% of companies that claim to do crawling (downloading data from the Internet) actually work with commercial scraping software. Scraping consists of going to a website and downloading a set of structured data contained within a single domain or URL. For example, going to a public administration website and downloading the addresses and telephone numbers of the town halls.
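To make the town hall example concrete, here is a minimal sketch of scraping in Python, using only the standard library. The HTML sample, the class `TableScraper` and the function `scrape_town_halls` are hypothetical names invented for illustration; a real scraper would first download the page and would have to adapt to that page's actual markup.

```python
from html.parser import HTMLParser

# Hypothetical sample page; a real scraper would fetch this HTML over HTTP.
SAMPLE_HTML = """
<table>
  <tr><th>Town hall</th><th>Address</th><th>Phone</th></tr>
  <tr><td>Town A</td><td>1 Main Square</td><td>555 0100</td></tr>
  <tr><td>Town B</td><td>2 Church Street</td><td>555 0101</td></tr>
</table>
"""


class TableScraper(HTMLParser):
    """Collects the text of every <td> cell, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None       # cells of the row currently being read
        self._in_cell = False  # True while inside a <td> element

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            if self._row:  # header rows only contain <th>, so they stay empty
                self.rows.append(self._row)
            self._row = None

    def handle_data(self, data):
        if self._in_cell:
            self._row[-1] += data.strip()


def scrape_town_halls(html):
    """Return one dict per table row that has exactly three cells."""
    parser = TableScraper()
    parser.feed(html)
    parser.close()
    return [
        {"name": r[0], "address": r[1], "phone": r[2]}
        for r in parser.rows
        if len(r) == 3
    ]


records = scrape_town_halls(SAMPLE_HTML)
```

The point of the sketch is how little is involved: the structure is already in the page, and the scraper only copies it into records.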



A significant portion of these actors do this same work on websites whose content is protected by intellectual property rights; therefore, in addition to having little technical complexity, in many cases it is an activity of dubious legality. There is a whole industry of companies that download profiles from LinkedIn or hotels from Booking.

Crawling is the massive downloading of information in order to subsequently index the content of a set of unstructured web pages. This is extremely technically complex and there is no commercial software, other than Google's tools, that can do it professionally. In fact, a very significant part of the people who really do crawling go to Common Crawl (a project where actors share a database with an index of a relatively current and exhaustive version of the Internet) or to Google.
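By contrast, the idea of crawling followed by indexing can be sketched like this. It is only a toy under strong assumptions: the "web" is a hypothetical in-memory dictionary (`FAKE_WEB`) so that the example runs offline, and the names `PageParser` and `crawl_and_index` are invented for illustration. Everything that makes real crawling hard (scale, politeness, robots.txt, duplicates, broken HTML, languages) is left out.

```python
import re
from collections import defaultdict, deque
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical in-memory "web" so the sketch runs without network access.
FAKE_WEB = {
    "http://a.example/": '<p>We sell industrial pumps.</p><a href="/about">About</a>',
    "http://a.example/about": '<p>Pumps made in Bilbao.</p><a href="http://b.example/">Partner</a>',
    "http://b.example/": "<p>Logistics company in Madrid.</p>",
}


class PageParser(HTMLParser):
    """Extracts outgoing links and visible text from one unstructured page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_data(self, data):
        self.text.append(data)


def crawl_and_index(seed, fetch, max_pages=100):
    """Breadth-first crawl from `seed`; returns an inverted index word -> set of URLs."""
    index = defaultdict(set)
    queue = deque([seed])
    seen = {seed}
    fetched = 0
    while queue and fetched < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:  # page could not be downloaded
            continue
        fetched += 1
        parser = PageParser()
        parser.feed(html)
        parser.close()
        # Index: record every lowercase word together with the page it appears on.
        for word in re.findall(r"[a-z0-9]+", " ".join(parser.text).lower()):
            index[word].add(url)
        # Frontier: follow links, resolving relative URLs against the current page.
        for href in parser.links:
            link = urljoin(url, href)
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return index


index = crawl_and_index("http://a.example/", FAKE_WEB.get)
```

Here `index["pumps"]` holds the two pages of the first site, and the crawler reaches the second domain only by following a link. Unlike scraping, nothing is known in advance about the structure of the pages or which domains will be visited.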

We could say that, in the world of the Internet and data, Google and Amazon are the ones who are ahead, but this is where the beautiful part of the story begins.

When we started downloading from the Internet, we started like everyone else: first we did scraping with basic software, then we moved on to Common Crawl and Google until, out of curiosity, we came across a series of free software projects that emerged around Lucene in the late 90s and that have kept evolving to the present day.