# Apache Nutch Reviews
**Vendor:** The Apache Software Foundation  
**Category:** [Java Web Frameworks](https://www.g2.com/categories/java-web-frameworks)  
**Average Rating:** 4.0/5.0  
**Total Reviews:** 20
## About Apache Nutch
Apache Nutch is an extensible and scalable open source web crawler software project. Nutch provides extensible interfaces such as Parse, Index and ScoringFilter's for custom implementations, e.g. Apache Tika for parsing.




## Apache Nutch Reviews
  ### 1. Apache Nutch is Rockstar in terms of huge data crawling.

**Rating:** 5.0/5.0 stars

**Reviewed by:** Narendra A. | Senior Software Engineer, Enterprise (> 1000 emp.)

**Reviewed Date:** August 17, 2020

**What do you like best about Apache Nutch?**

When I used Apache Nutch, I was amazed by the speed at which it crawls data and by the libraries and data structures it provides for customising the crawl and reading the data in the desired format. I was crawling the whole IBM data to get insights and do text analytics on it. The support I got from the forums was also great. Overall, it was a nice experience using the Apache Nutch crawler.

**What do you dislike about Apache Nutch?**

What I disliked was the video support it provides in the Internet.

**Recommendations to others considering Apache Nutch:**

It's nice to use and provides lots of flexibility.

**What problems is Apache Nutch solving and how is that benefiting you?**

I was solving a data analytics problem in my organisation, where we automate the whole bidding process with text analytics.

  ### 2. Very efficient, faster and open source tool for crawler

**Rating:** 4.5/5.0 stars

**Reviewed by:** Jaydip L. | Senior Software Engineer, Small-Business (50 or fewer emp.)

**Reviewed Date:** September 02, 2020

**What do you like best about Apache Nutch?**

* Open source
* Scalable
* Parsing and indexing techniques
* Easy integration with Elasticsearch and Solr
* Different plugins to parse various content types

**What do you dislike about Apache Nutch?**

There is not much on my list of dislikes, because we really enjoyed it and it fulfilled our organization's needs. Based on experience, though, I can name some cons: it requires good infrastructure and consumes a good amount of memory and CPU. We also feel that if Nutch provided a good dashboard and some kind of admin panel, it would have been very helpful to us.

**Recommendations to others considering Apache Nutch:**

When we had a requirement for crawling, we tried different tools such as StormCrawler and Scrapy, but we found this tool very reliable and, most importantly, open source. Its various features, such as automatic crawling, finding inner links to crawl, parsing different kinds of content, and various integrations, made us go for this tool, and believe me, we never regretted using it. Best crawling tool.

**What problems is Apache Nutch solving and how is that benefiting you?**

Our business need was to develop a search engine: we provide a list of URLs to Nutch, and it crawls all those URLs, finds their inner URLs, and crawls them as well. We stored the crawled data in Cassandra, with Elasticsearch in place to serve our search queries. This worked perfectly, and Nutch really helped us with its ability to parse different content types and store them.
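The seed-and-follow pattern this review describes can be sketched in a few lines. This is a minimal illustration, not Nutch code: the class name `SeedCrawlSketch`, the `crawl` method, and the in-memory `linkGraph` (a stand-in for fetching a page and extracting its links) are all hypothetical, and a real crawler also has to handle fetching, politeness, parsing, and storage.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class SeedCrawlSketch {

    /**
     * Breadth-first traversal from the seed URLs, following links for at most
     * maxDepth hops. linkGraph is an assumed stand-in for "fetch and extract links".
     */
    static Set<String> crawl(List<String> seeds, Map<String, List<String>> linkGraph, int maxDepth) {
        Set<String> visited = new LinkedHashSet<>(); // keeps discovery order, blocks re-crawls
        Deque<String> frontier = new ArrayDeque<>(seeds);
        for (int depth = 0; depth <= maxDepth && !frontier.isEmpty(); depth++) {
            Deque<String> next = new ArrayDeque<>();
            for (String url : frontier) {
                if (!visited.add(url)) {
                    continue; // already crawled in this or an earlier round
                }
                // Queue the links found on this page for the next round.
                next.addAll(linkGraph.getOrDefault(url, List.of()));
            }
            frontier = next;
        }
        return visited;
    }

    public static void main(String[] args) {
        Map<String, List<String>> linkGraph = Map.of(
            "https://example.com/", List.of("https://example.com/a", "https://example.com/b"),
            "https://example.com/a", List.of("https://example.com/b", "https://example.com/c"),
            "https://example.com/b", List.of("https://example.com/"));
        // One hop from the seed: the seed itself plus the pages it links to.
        System.out.println(crawl(List.of("https://example.com/"), linkGraph, 1));
    }
}
```

The `visited` set is what keeps a page that links back to the seed from being crawled twice; the depth limit bounds how far the crawl wanders from the seed list.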

  ### 3. Web Crawling Tool

**Rating:** 5.0/5.0 stars

**Reviewed by:** Sinem A. | Quality Assurance Test Engineer, Mid-Market (51-1000 emp.)

**Reviewed Date:** December 14, 2020

**What do you like best about Apache Nutch?**

It is an open source tool that lets you add your own plugins. You can change its code as you wish. It was very easy to use. It can also be run with different tools.

**What do you dislike about Apache Nutch?**

You need to know which version of Nutch is compatible with the other tools you work with.

**What problems is Apache Nutch solving and how is that benefiting you?**

I used it while I was doing my thesis to crawl Turkish web pages for my improved search engine algorithm. I also used it at work in a Turkish search engine project.

  ### 4. I am big data developer in KICS, UET Lahore, Pakistan

**Rating:** 3.5/5.0 stars

**Reviewed by:** Naser A. | Research Officer, Mid-Market (51-1000 emp.)

**Reviewed Date:** August 19, 2020

**What do you like best about Apache Nutch?**

I have been using Apache Nutch for 3 or 4 years. I like it as an open source tool that can run on a system with normal specs and crawl millions and millions of pages.

**What do you dislike about Apache Nutch?**

* I don't like its seed creation algorithm: it makes a cluster and then goes into a loop, crawling the same websites after it has crawled millions of pages.
* Its configuration is not easy.
* Job automation is not provided.
* Documentation is not good.
* Support is not good.

**Recommendations to others considering Apache Nutch:**

Not easy in the early days, but once you set it up it goes beyond your expectations.

**What problems is Apache Nutch solving and how is that benefiting you?**

I have fetched a large number of websites in a specific language to build a local search engine.

  ### 5. Nutch is a light weight scraping tool which has trivial learning curve in its adoption.

**Rating:** 5.0/5.0 stars

**Reviewed by:** Prafulla R. | Technical Architect, Small-Business (50 or fewer emp.)

**Reviewed Date:** December 04, 2020

**What do you like best about Apache Nutch?**

- Easy to configure
- Stable backend store

**What do you dislike about Apache Nutch?**

Use of Java makes it a little bulky.
One has to be careful of heap size, otherwise OOM errors are inevitable.

**Recommendations to others considering Apache Nutch:**

Be careful about the heap size setting in the configuration file. Also, use a NoSQL data store such as HBase to store crawled data.
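How the heap limit is set depends on the Nutch version and on whether it runs locally or on Hadoop, which the review does not specify. A JVM-level check such as the sketch below (the class name `HeapCheck` and helper `toMiB` are hypothetical) can at least confirm what limit a Java process actually received, for example when started with the standard `-Xmx` option.

```java
public class HeapCheck {

    /** Converts a byte count to whole mebibytes (rounds down). */
    static long toMiB(long bytes) {
        return bytes / (1024L * 1024L);
    }

    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        // maxMemory() is the upper bound the JVM will try to use for the heap.
        System.out.println("max heap (MiB):  " + toMiB(rt.maxMemory()));
        // totalMemory() - freeMemory() approximates the heap currently in use.
        System.out.println("used heap (MiB): " + toMiB(rt.totalMemory() - rt.freeMemory()));
    }
}
```

Running it as `java -Xmx2g HeapCheck` should report a maximum close to 2048 MiB; the exact figure can differ slightly between JVMs and garbage collectors.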

**What problems is Apache Nutch solving and how is that benefiting you?**

Implementation of eCommerce product comparison engine.
Nutch enables data crawling in ethical ways.

  ### 6. Extract to the depth

**Rating:** 4.5/5.0 stars

**Reviewed by:** Krishnan S. | Software Engineer, Mid-Market (51-1000 emp.)

**Reviewed Date:** December 05, 2020

**What do you like best about Apache Nutch?**

Crawling a URL is an excellent way to read its content. Nutch is a very useful tool for reading content in documents at various depths.

**What do you dislike about Apache Nutch?**

Bit hard to customize the crawl function.

**Recommendations to others considering Apache Nutch:**

Very nice tool to use.

**What problems is Apache Nutch solving and how is that benefiting you?**

Prepared the content for a search engine for a static web page.

  ### 7. Nutch is a highly scalable open source web crawler. It can be customised according to the requirements.

**Rating:** 4.0/5.0 stars

**Reviewed by:** Ruchika J. | Hadoop Developer, Small-Business (50 or fewer emp.)

**Reviewed Date:** August 18, 2020

**What do you like best about Apache Nutch?**

Plugins for indexing and searching.
Integration with Solr and other tools.
It also works fine on Hadoop clusters.

**What do you dislike about Apache Nutch?**

Lack of a community to discuss issues or concerns.
Lack of documentation on implementing and integrating Nutch.

**Recommendations to others considering Apache Nutch:**

For web crawling and data mining, you can easily integrate Nutch with other big data technologies.

**What problems is Apache Nutch solving and how is that benefiting you?**

Crawled and parsed XML data from URLs. Apache Tika was used for parsing; we indexed and filtered data from Solr and created an SEO tool and a PPC tool.
I got domain-specific materials, but it doesn't have batch mode.
It works fine on clusters.

  ### 8. A great web crawler for all crawling needs

**Rating:** 4.5/5.0 stars

**Reviewed by:** Usama T. | Python Developer, Mid-Market (51-1000 emp.)

**Reviewed Date:** July 10, 2020

**What do you like best about Apache Nutch?**

Its ability to crawl the complete web through inlinks and outlinks, which lets it keep crawling indefinitely.

**What do you dislike about Apache Nutch?**

You need very strong knowledge of Apache Hadoop, HBase, ZooKeeper, and the complete environment setup, and you have to be proficient with all of them to use Nutch. Moreover, the HBase data cannot be viewed easily, which also makes things difficult.

**What problems is Apache Nutch solving and how is that benefiting you?**

I am working on a search engine, and crawling is its basic need, which I get from Apache Nutch. I can crawl complete web data by providing a few links and letting it crawl through inlinks and outlinks.

  ### 9. Nutch is reliable, mature open source crawler

**Rating:** 3.5/5.0 stars

**Reviewed by:** Fred Z. | Founder, Enterprise (> 1000 emp.)

**Reviewed Date:** August 19, 2020

**What do you like best about Apache Nutch?**

I have deployed Nutch several times when I needed to stand up a crawler quickly. It is free, straightforward, reliable, well documented, and comes with an OTS integration with Apache Solr for search.

**What do you dislike about Apache Nutch?**

The directory and file partitioning scheme for the crawler can be a bit confusing.

**Recommendations to others considering Apache Nutch:**

Consider Google Programmable Search Engine.

**What problems is Apache Nutch solving and how is that benefiting you?**

It is an excellent solution if you need a quick, simple, free crawler.

  ### 10. Best for web crawling

**Rating:** 5.0/5.0 stars

**Reviewed by:** Verified User in Pharmaceuticals | Small-Business (50 or fewer emp.)

**Reviewed Date:** December 14, 2020

**What do you like best about Apache Nutch?**

I like the default index generation for the crawler.

**What do you dislike about Apache Nutch?**

When working with Ubuntu, I find it hard to set the directory paths.

**What problems is Apache Nutch solving and how is that benefiting you?**

I have successfully integrated Apache Nutch with the Hadoop and Hive ecosystems and set rule-based content in the web pages.

  ### 11. Really good experience using Apache Nutch. Crawling capabilities are really good.

**Rating:** 5.0/5.0 stars

**Reviewed by:** Navom S. | Software Developer, Enterprise (> 1000 emp.)

**Reviewed Date:** July 25, 2020

**What do you like best about Apache Nutch?**

Multidepth crawling capabilities are really good. Data extraction from web pages is remarkable.

**What do you dislike about Apache Nutch?**

Based on MapReduce, hence slower. Adding customisations means writing plugins and building them; there is no feature for dependency injection.

**Recommendations to others considering Apache Nutch:**

The MapReduce-based implementation in previous versions is slower.

**What problems is Apache Nutch solving and how is that benefiting you?**

Crawling web pages and government websites to get insight of data related to geographical change.

  ### 12. Comprehensive tool for web scraping and crawling

**Rating:** 4.0/5.0 stars

**Reviewed by:** Verified User in Internet | Mid-Market (51-1000 emp.)

**Reviewed Date:** November 02, 2020

**What do you like best about Apache Nutch?**

Provides an in-depth list of features, html tags, site maps

**What do you dislike about Apache Nutch?**

Didn't have a lot of documentation at the time I was using it which made it hard to use.

**What problems is Apache Nutch solving and how is that benefiting you?**

Crawled our domain URLs and got useful, relevant information.

  ### 13. Powerful but not recommended

**Rating:** 1.5/5.0 stars

**Reviewed by:** Imtiaz S. | Senior Software Engineer, Small-Business (50 or fewer emp.)

**Reviewed Date:** July 10, 2020

**What do you like best about Apache Nutch?**

* Easy to use.
* Can crawl almost all kinds of content.
* Excellent plugin system.
* Supports different storage backends.

**What do you dislike about Apache Nutch?**

Hard to master. Steep learning curve.

Poor documentation. Many are outdated or broken.
Difficult to set up for a production system.

**Recommendations to others considering Apache Nutch:**

Use Apache Storm Crawler instead.

**What problems is Apache Nutch solving and how is that benefiting you?**

We used Apache Nutch to crawl websites and index them with Solr.

  ### 14. Used apache nutch for a crawling project

**Rating:** 3.0/5.0 stars

**Reviewed by:** Verified User in Computer Software | Enterprise (> 1000 emp.)

**Reviewed Date:** July 10, 2020

**What do you like best about Apache Nutch?**

I used Apache Nutch for crawling under Cygwin. It could be configured in easy steps and helped in collecting the desired data.

**What do you dislike about Apache Nutch?**

I didn't see any disadvantage of it to be honest.

**What problems is Apache Nutch solving and how is that benefiting you?**

It helped to configure the database in easy steps

  ### 15. Using Apache Nutch for my Thesis Research

**Rating:** 3.5/5.0 stars

**Reviewed by:** Verified User in Computer & Network Security | Small-Business (50 or fewer emp.)

**Reviewed Date:** August 24, 2020

**What do you like best about Apache Nutch?**

Apache Nutch is an easy-to-configure application that we can use for research.

**What do you dislike about Apache Nutch?**

It's very difficult to find articles about Apache Nutch.

**What problems is Apache Nutch solving and how is that benefiting you?**

Resources are very difficult to find, mostly about configuration.

  ### 16. Nutch review

**Rating:** 4.0/5.0 stars

**Reviewed by:** Verified User in Higher Education | Enterprise (> 1000 emp.)

**Reviewed Date:** August 14, 2020

**What do you like best about Apache Nutch?**

Easy to use, with support from a big community of devs.

**What do you dislike about Apache Nutch?**

The default interface of the search engine is very outdated

**What problems is Apache Nutch solving and how is that benefiting you?**

Building an Arabic search engine

  ### 17. Great web crawler

**Rating:** 4.0/5.0 stars

**Reviewed by:** Verified User in Newspapers | Mid-Market (51-1000 emp.)

**Reviewed Date:** March 14, 2019

**What do you like best about Apache Nutch?**

Nutch supports distributed fetching and Hadoop, so fetching, storage, and indexing can be distributed across multiple machines.
Another attractive point is its plug-in framework, which makes functions such as parsing all kinds of web content, data collection, querying, clustering, and filtering convenient to extend. Because of this framework, Nutch plug-in development is very easy, and third-party plug-ins keep emerging, which has greatly enhanced Nutch's functionality and reputation.

**What do you dislike about Apache Nutch?**

Nutch's crawler customization ability is relatively weak.
If secondary development of the Nutch crawler is carried out, compiling and debugging the crawler takes a lot of time.

**What problems is Apache Nutch solving and how is that benefiting you?**

Massive amounts of data can be obtained from specific websites, which can be screened and analyzed purposefully, and the results of these data can be clearly displayed in front of us through a certain service.


  ### 18. Incredibly Performant Web Crawling

**Rating:** 3.0/5.0 stars

**Reviewed by:** Justin C. | CTO, Small-Business (50 or fewer emp.)

**Reviewed Date:** March 19, 2019

**What do you like best about Apache Nutch?**

I love how easy to configure and run it is and how it performs at scale. Storing in Hadoop is a breeze.

**What do you dislike about Apache Nutch?**

Not quite as easy to use as tools like Scrapy.

**What problems is Apache Nutch solving and how is that benefiting you?**

Distributed batch web crawling. 

  ### 19. Nice opensource crawler used in production at DARPA

**Rating:** 4.0/5.0 stars

**Reviewed by:** Verified User in Computer & Network Security | Small-Business (50 or fewer emp.)

**Reviewed Date:** January 31, 2019

**What do you like best about Apache Nutch?**

* HTTP proxy support, so my IP does not get blocked
* Nice file size filter with advanced control of network bandwidth
* I heard that many big companies and government agencies are using Nutch in production
* Nutch has a parallel reducer to make use of multiple network connections and multi-core CPUs

**What do you dislike about Apache Nutch?**

I wish Nutch had built-in rate limiting support.
Implemented in Java, which is a bit memory hungry.

**Recommendations to others considering Apache Nutch:**

Use the parallel reducer to decrease crawling time.

**What problems is Apache Nutch solving and how is that benefiting you?**

Crawling leaked credentials on GitHub.

  ### 20. Apache Nutch by Apache Review

**Rating:** 4.0/5.0 stars

**Reviewed by:** Verified User in Information Technology and Services | Mid-Market (51-1000 emp.)

**Reviewed Date:** April 27, 2018

**What do you like best about Apache Nutch?**

Fetching and parsing are done separately by default; this reduces the risk of an error corrupting the fetch parse stage of a crawl with Nutch.

* Plugins have been overhauled as a direct result of removal of legacy Lucene dependency for indexing and search.
* The number of plugins for processing various document types being shipped with Nutch has been refined.

The only parser plugins shipped with Nutch now are Feed (RSS/Atom), HTML, Ext, JavaScript, SWF, Tika & ZIP.

Nutch has had scoring plugins for quite a while, and has supported things like Adaptive Fetch schedules, and all of the Nutch data is in databases and so forth that are interrogated through the command line tools, Java, and now there is an emerging REST interface and also work to create a Python client for this as well.

**What do you dislike about Apache Nutch?**

Nutch doesn't have to be batch mode.
So let's say that as a Nutch crawl administrator your client has tasked you with the following: "Get me domain specific material from a database such as NTIS" (NTIS, the National Technical Information Service, serves as the largest central resource for government-funded scientific, technical, engineering, and business related information available today). What this really translates to is the following:



**What problems is Apache Nutch solving and how is that benefiting you?**

This page provides commentary and thoughts on adapting Nutch not only to fetch AJAX/JavaScript driven dynamic HTML content, but also for interacting with that content (potentially a number of times) within a fetching scenario.




## Apache Nutch Discussions
  - [How to make use of apache nuts more easy ?](https://www.g2.com/discussions/34687-how-to-make-use-of-apache-nuts-more-easy) - 1 upvote
  - [How can i programatically create new crawl jobs and control them?](https://www.g2.com/discussions/31744-how-can-i-programatically-create-new-crawl-jobs-and-control-them) - 1 upvote



## Top Apache Nutch Alternatives
  - [spring.io](https://www.g2.com/products/spring-io/reviews) - 4.5/5.0 (290 reviews)
  - [Apache Tika](https://www.g2.com/products/apache-tika/reviews) - 4.7/5.0 (13 reviews)
  - [JHipster](https://www.g2.com/products/jhipster/reviews) - 4.4/5.0 (83 reviews)

