Discovery web crawler - need support for crawling data from protected sites requiring authentication

We have an OEM agreement with IBM for Watson services.

Our internal customers are evaluating crawling content from their web sites into a Watson Discovery collection.

The Watson Discovery connector for web crawling only supports public sites that do not require authentication.

https://cloud.ibm.com/docs/discovery?topic=discovery-sources#connectwebcrawl

Web Crawl

You can use the web crawler to crawl public websites that don’t require a password.

Our customers' web sites hosting knowledge articles require authentication.

Is it possible to consider an enhancement to support this?

Is this on the Watson Discovery roadmap?

In the absence of this support, customers have to use one of the following workarounds.

Copy data from the protected site to a public site that requires no authentication, so that they can use the Watson Discovery connector for web crawling.
Write a custom crawler of their own to pull data from the protected site and publish it to a Discovery collection using the Discovery API.
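
To give a sense of what the second workaround involves, here is a minimal sketch in Python. It assumes the protected site accepts HTTP Basic authentication, which may not match a given customer's site (form login or SSO would need more work). The function names and the `publish` callback are hypothetical; `publish` stands in for whatever call uploads a document to the collection through the Discovery API and is not shown here.

```python
import base64
import urllib.request
from typing import Callable, Iterable


def basic_auth_request(url: str, username: str, password: str) -> urllib.request.Request:
    """Build a GET request carrying an HTTP Basic Authorization header.

    Assumption: the protected site uses HTTP Basic authentication.
    """
    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(url, headers={"Authorization": f"Basic {token}"})


def fetch_page(url: str, username: str, password: str) -> bytes:
    """Download one protected page; urlopen raises HTTPError on 401/403."""
    request = basic_auth_request(url, username, password)
    with urllib.request.urlopen(request, timeout=30) as response:
        return response.read()


def crawl_and_publish(
    urls: Iterable[str],
    fetch: Callable[[str], bytes],
    publish: Callable[[str, bytes], None],
) -> int:
    """Fetch each URL and hand its content to `publish`.

    `publish` is a hypothetical callback that would upload the page to a
    Discovery collection. Returns the number of pages published.
    """
    count = 0
    for url in urls:
        content = fetch(url)
        publish(url, content)
        count += 1
    return count
```

In use, `fetch` would typically be a small wrapper such as `lambda url: fetch_page(url, user, password)`. Even this reduced version leaves link discovery, scheduling, retries, and credential storage to the customer, which is the additional investment described here.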

Both workarounds are tedious and require additional investment from customers.

Customers expect the Watson Discovery connector for web crawling to crawl data from protected sites requiring authentication out of the box.

Adding this support will make it easier for Watson Discovery customers to crawl data from protected sites into a Watson Discovery collection.


Guest

Sep 18, 2022

A web crawler, or spider, is a type of bot that is typically operated by search engines such as Google and Bing. Its purpose is to index the content of websites across the Internet so that those websites can appear in search engine results.


Admin

JOHN PECORARI

Sep 4, 2020

This is planned for Q4 2020.
