
Hijack navigation, return custom HTML and setup the page interactions (on_clicks, on_submits...) #841

Closed
Nikola-Milovic opened this issue Mar 21, 2023 · 5 comments
Labels
question Questions related to rod

Comments

@Nikola-Milovic

Nikola-Milovic commented Mar 21, 2023

Rod Version: v0.112.6

Description

I am building a testing library for my web crawler using go-rod, and I want to stub the real pages with custom HTML that I control. My goal is to hook into a page, listen for navigation events, and then substitute the real page with my own copy of the page. After substituting the page, I want to call the setupPageInteractions function that hooks into the elements on the page and sets up onClick events, onSubmit and other interactions.

Problem

I am facing an issue while trying to achieve this. When I stub the page using HijackRequests, I cannot block the request; I have to let it finish. This means I cannot inject my custom code and call the setupPageInteractions function at the right time. If I call setupPageInteractions directly after stubbing the page, the request never completes and execution blocks indefinitely. I need to set up the page and the interactions before WaitLoad triggers on the crawler side.

crawler.Navigate("desired_page") -> crawler.WaitLoad() -> stop the request -> inject custom HTML -> setup page interactions -> crawler.WaitLoad() and the crawler.Page is now our custom page
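The ordering in this flow can be modelled without a browser. The sketch below is plain Go, not rod API: `runFlow`, `onRequest`, `stubbed`, and `loaded` are hypothetical names. The handler only fulfills the request and reports which page it stubbed; the interactions are set up by a separate goroutine once the load signal arrives, so the handler never waits on the page it is still serving.

```go
package main

import (
	"fmt"
	"sync"
)

// recorder collects the order of steps so the flow can be checked.
type recorder struct {
	mu    sync.Mutex
	steps []string
}

func (r *recorder) add(step string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.steps = append(r.steps, step)
}

// runFlow models: navigate -> stub the response -> page loads -> set up interactions.
// All names are hypothetical; no browser is involved.
func runFlow() []string {
	rec := &recorder{}
	stubbed := make(chan string, 1) // handler -> setup goroutine; buffered so the handler never blocks
	loaded := make(chan struct{})   // closed when the simulated page finishes loading
	ready := make(chan struct{})    // closed when the interactions are set up

	// Request handler: fulfill and return immediately.
	onRequest := func(url string) {
		rec.add("fulfill " + url)
		stubbed <- url
	}

	// Setup goroutine: waits for the stub and for the load before touching the page.
	go func() {
		url := <-stubbed
		<-loaded
		rec.add("setup " + url)
		close(ready)
	}()

	rec.add("navigate")
	onRequest("/login") // the browser pauses the request and calls the handler
	rec.add("loaded")   // the stubbed response has been served and the page has loaded
	close(loaded)
	<-ready
	return rec.steps
}

func main() {
	for _, step := range runFlow() {
		fmt.Println(step)
	}
}
```

The point of the design is that nothing inside the handler depends on the page being loaded; the dependency is expressed through channels instead.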

Example Code

Here is the code snippet showcasing my issue:

func hijackRequests(router *rod.HijackRouter, page *rod.Page, config *Config) {
	router.MustAdd("*", func(ctx *rod.Hijack) {
		stubbedPageInfo := stubPage(ctx, page, config)
		if stubbedPageInfo != nil {
			// We stubbed the page, now we need to wait for the browser to load this page
			// and then hook into the page and select elements that we need.
			// If we call the setupPageInteractions here directly, then the request won't complete
			// and we can't keep going. Another solution might be to directly set the page contents,
			// but that seems to block indefinitely.
		}
	})
}

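func stubPage(hijack *rod.Hijack, page *rod.Page, config *Config) *Page {
	// Look up the requested URL among the configured pages
	// (config.Pages with URL and File fields, as in the later snippet in this thread).
	for i := range config.Pages {
		p := &config.Pages[i]
		if p.URL == hijack.Request.URL().String() && hijack.Request.Method() == "GET" {
			fmt.Println("stubbing navigation: " + p.URL)
			file, err := ioutil.ReadFile(p.File)
			if err != nil {
				log.Fatalf("failed to load file %v", err)
			}

			// Serve the local file instead of the real page.
			hijack.Response.SetBody(file)
			hijack.Response.SetHeader("Content-Type", "text/html")
			return p
		}
	}
	return nil
}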

// Call this function once the request has completed
// If necessary, make the page sleep for 200ms so we have enough time to complete
func setupPageInteractions(page *rod.Page) {
    // ...
}

Expected Behavior

I expect to be able to inject my custom HTML and set up interactions after hijacking the request without blocking the request or the execution of the code.

Actual Behavior

I am not able to inject my custom code and call the setupPageInteractions function without blocking the request or the execution of the code.

Any guidance or suggestions on how to resolve this issue would be greatly appreciated.

TestRod reproduction code

rod_test.go

package rod_test

import (
	"fmt"
	"testing"

	"github.com/go-rod/rod"
)

// This is the template to demonstrate how to test Rod.
func TestRod(t *testing.T) {
	g := setup(t)
	g.cancelTimeout() // Cancel timeout protection

	_, page := g.browser, g.page

	router := page.HijackRequests()
	defer router.MustStop()

	//setup
	stubNavigation(router, page)

	go router.Run()

	// Perform some navigation
	page.MustNavigate("https://go-rod.github.io/")

	fmt.Println("done")
}

func stubNavigation(router *rod.HijackRouter, page *rod.Page) {
	router.MustAdd("https://go-rod.github.io/", func(ctx *rod.Hijack) {
		if ctx.Request.Req().Method == "GET" {
			fmt.Println("stubbing navigation for " + ctx.Request.URL().String())

			// Set custom HTML content
			customHTML := `<html><head><title>Hi</title></head><body><h1 id="test-id">Hello, custom REQUEST!</h1></body></html>`

			ctx.Response.SetBody(customHTML)
			ctx.Response.SetHeader("Content-Type", "text/html")

			// Blocks
			setupPageInteractions(page)
			
			// Alternative solution, this would maybe solve my problems because the page contents would immediately be present but this blocks indefinitely so I cannot use it
			// page.MustSetDocumentContent(`<html><head><title>Hi</title></head><body><h1 id="test-id">Hello, custom CONTENT!</h1></body></html>`)
		}
	})
}

func setupPageInteractions(page *rod.Page) {
	// Here we should setup the interactions on the page and we need the loaded page from our stubbed request
	// This blocks indefinitely
	fmt.Println(page.MustHTML())

}

I achieved this in chromedp using fetch.FulfillRequest:

	go func() {
		chromedp.ListenTarget(ctx, func(ev interface{}) {
			switch e := ev.(type) {
			case *fetch.EventRequestPaused:
				go func() {
					c := chromedp.FromContext(ctx)
					ctx := cdp.WithExecutor(ctx, c.Target)

					stubPage(ctx, e, config)
					setupPageInteractions(ctx)
					...
					
func stubPage(ctx context.Context, ev *fetch.EventRequestPaused) {
	if "desired_url" == ev.Request.URL && ev.Request.Method == "GET" {
		fmt.Println("stubbing navigation: " + ev.Request.URL)
		headers := []*fetch.HeaderEntry{{
			Name: "Content-Type", Value: "text/html",
		}}

		customHTML := []byte(`<html><head><title>Hi</title></head><body><h1 id="test-id">Hello, custom REQUEST!</h1></body></html>`)

		// The body is passed to FulfillRequest as a base64 string.
		err := fetch.FulfillRequest(ev.RequestID, 200).
			WithResponseHeaders(headers).
			WithBody(base64.StdEncoding.EncodeToString(customHTML)).
			WithResponsePhrase("OK").
			Do(ctx)
		if err != nil {
			log.Fatalf("failed to stub navigation request %v", err)
		}
	}
}


func setupPageInteractions(ctx context.Context, pageInfo *Page) {
	c := chromedp.FromContext(ctx)
	ctx = cdp.WithExecutor(ctx, c.Target)

	// Setup elements, query for them, setup their actions
	for i, element := range pageInfo.Elements {
		var nodes []*cdp.Node
		if err := chromedp.Nodes(element.Selector, &nodes, chromedp.ByQuery).Do(ctx); err != nil {
			log.Fatalf("Failed to find element %s: %v", element.Selector, err)
		}
@Nikola-Milovic Nikola-Milovic added the question Questions related to rod label Mar 21, 2023
@rod-robot

Please fix the golang code in your markdown:

@@ golang markdown block 1 @@
2:1: expected declaration, found 'go'
4:24: expected 'IDENT', found ')'
4:26: expected type, found '{'
5:34: expected ';', found ':'
6:5: expected declaration, found 'go'
43:3: expected declaration, found 'if'

generated by check-issue
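The messages above have the shape of Go parser errors (`line:column: expected ..., found ...`). How check-issue works is not shown in this thread, so the following is only a sketch under the assumption that it does something similar to `go/parser`. `checkSnippet` is a hypothetical helper. It shows why a `go` statement at the top level of a snippet, as in the chromedp excerpt, is rejected, and that wrapping it in a function makes it parse.

```go
package main

import (
	"fmt"
	"go/parser"
	"go/token"
)

// checkSnippet parses src as a Go source file and returns the parser error, if any.
func checkSnippet(src string) error {
	fset := token.NewFileSet()
	_, err := parser.ParseFile(fset, "", src, parser.AllErrors)
	return err
}

func main() {
	// A statement at file level is not a valid declaration.
	bad := "package main\ngo func() {}()\n"
	fmt.Println(checkSnippet(bad))

	// The same statement inside a function body parses without error.
	good := "package main\nfunc run() {\n\tgo func() {}()\n}\n"
	fmt.Println(checkSnippet(good))
}
```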

@ysmood
Collaborator

ysmood commented Mar 21, 2023

You can do the same thing with rod:

	proto.FetchEnable{
		Patterns: []*proto.FetchRequestPattern{
			{URLPattern: "*"},
		},
	}.Call(page)

	go page.EachEvent(func(e *proto.FetchRequestPaused) {
		fmt.Println("request", e.Request.URL)
	})()

	proto.FetchFulfillRequest{
		ResponseHeaders: []*proto.FetchHeaderEntry{},
		Body:            []byte("Hello World!"),
	}.Call(page)

@Nikola-Milovic
Author

That's beautiful, makes it so much easier. Thank you.

@Nikola-Milovic
Author

Nikola-Milovic commented Mar 21, 2023

@ysmood Sorry to bother you again. I think I am misunderstanding how fetch works with go-rod, and I can't wrap my head around the issue. The page isn't valid after I call FetchFulfillRequest, and it panics when I try to get elements, HTML, or anything else from it:

func main() {
	cfg := getConfig()

	browser := rod.New().MustConnect()
	defer browser.MustClose()

	page := browser.MustPage("")
	defer page.MustClose()

	Setup(page, &cfg)

	done, err := StartCrawler(context.Background(), page)
	if err != nil {
		log.Fatalf("failed to run crawler: %v", err)
	}

	<-done
	
	fmt.Println("success!")
}

func Setup(page *rod.Page, config *Config) {
	fetchEnable := proto.FetchEnable{
		Patterns: []*proto.FetchRequestPattern{
			{URLPattern: "*"},
		},
	}
	if err := fetchEnable.Call(page); err != nil {
		log.Fatalf("failed to enable fetch: %v", err)
	}

	listenEvents(page, config)
}

func listenEvents(page *rod.Page, config *Config) {
	go page.EachEvent(func(e *proto.FetchRequestPaused) {
		stubbedPageInfo := stubPage(page, e, config)

		switch {
		case stubbedPageInfo != nil:
			// Panics here
			html, err := page.HTML()
			if err != nil {
				log.Fatalf("failed to get html: %v", err)
			}
			fmt.Println(html)
			....
			
func stubPage(page *rod.Page, ev *proto.FetchRequestPaused, config *Config) *Page {
	for _, pageConf := range config.Pages {
		if pageConf.URL == ev.Request.URL && ev.Request.Method == "GET" {
			fmt.Println("stubbing navigation: " + ev.Request.URL)
			headers := []*proto.FetchHeaderEntry{{
				Name: "Content-Type", Value: "text/html",
			}}

			file, err := ioutil.ReadFile(pageConf.File)
			if err != nil {
				log.Fatalf("failed to load file %v", err)
			}

			err = proto.FetchFulfillRequest{
				RequestID:       ev.RequestID,
				ResponseHeaders: headers,
				Body:            file,
				ResponsePhrase:  "OK",
				ResponseCode:    200,
			}.Call(page)
			if err != nil {
				log.Fatalf("failed to stub navigation request %v", err)
			}

			return &pageConf
		}
	}

	return nil
}
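The lookup half of stubPage does not need a browser, so it can be pulled out and unit tested on its own. This is a minimal sketch: `findStub` is a hypothetical helper, and the `Page` and `Config` definitions are assumptions inferred from the fields used in this thread (`URL`, `File`, `Pages`), since the real ones are not shown. It indexes the slice instead of taking the address of the loop variable, so the returned pointer refers to the config entry itself.

```go
package main

import "fmt"

// Page and Config mirror the shapes implied by the thread;
// the real definitions are not shown, so treat them as assumptions.
type Page struct {
	URL  string
	File string
}

type Config struct {
	Pages []Page
}

// findStub returns the configured page for a GET request to url, or nil
// when the request should not be stubbed.
func findStub(config *Config, url, method string) *Page {
	if method != "GET" {
		return nil
	}
	for i := range config.Pages {
		if config.Pages[i].URL == url {
			return &config.Pages[i]
		}
	}
	return nil
}

func main() {
	cfg := &Config{Pages: []Page{{URL: "http://localhost/login", File: "login.html"}}}

	if p := findStub(cfg, "http://localhost/login", "GET"); p != nil {
		fmt.Println("stub with", p.File)
	}
	if findStub(cfg, "http://localhost/login", "POST") == nil {
		fmt.Println("POST is not stubbed")
	}
}
```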

html, err := page.HTML() results in a panic

Line 65 is go page.EachEvent(func(e *proto.FetchRequestPaused) {

panic: assignment to entry in nil map

goroutine 26 [running]:
github.com/go-rod/rod.(*Page).setHelper(0x8ff9ab?, {0xc00014e720?, 0x8dde80?}, {0x8f77a0, 0x9}, {0xc00014e900, 0x17})
        /home/nikola/go/pkg/mod/github.com/go-rod/rod@v0.112.6/page_eval.go:319 +0xf0
github.com/go-rod/rod.(*Page).ensureJSHelper(0x44cfd2?, 0xfb1d00)
        /home/nikola/go/pkg/mod/github.com/go-rod/rod@v0.112.6/page_eval.go:263 +0x19b
github.com/go-rod/rod.(*Page).formatArgs(0xc000124ef8?, 0x2?)
        /home/nikola/go/pkg/mod/github.com/go-rod/rod@v0.112.6/page_eval.go:233 +0x21e
github.com/go-rod/rod.(*Page).evaluate(0x7effe9d02ae8?, 0xc000154440)
        /home/nikola/go/pkg/mod/github.com/go-rod/rod@v0.112.6/page_eval.go:149 +0x3b
github.com/go-rod/rod.(*Page).Evaluate(0xc0002cc000, 0xc000154440)
        /home/nikola/go/pkg/mod/github.com/go-rod/rod@v0.112.6/page_eval.go:128 +0x4e
github.com/go-rod/rod.(*Page).ElementByJS.func2()
        /home/nikola/go/pkg/mod/github.com/go-rod/rod@v0.112.6/query.go:172 +0x97
github.com/go-rod/rod/lib/utils.Retry({0xd2ed90, 0xc0002ca000}, 0xc00011d350, 0xc000367500)
        /home/nikola/go/pkg/mod/github.com/go-rod/rod@v0.112.6/lib/utils/sleeper.go:139 +0x37
github.com/go-rod/rod.(*Page).ElementByJS(0xc0002cc000, 0xc000154440)
        /home/nikola/go/pkg/mod/github.com/go-rod/rod@v0.112.6/query.go:167 +0xc5
github.com/go-rod/rod.(*Page).Element(0x668f?, {0x8f5ac7?, 0x2?})
        /home/nikola/go/pkg/mod/github.com/go-rod/rod@v0.112.6/query.go:143 +0x13c
github.com/go-rod/rod.(*Page).HTML(0x0?)
        /home/nikola/go/pkg/mod/github.com/go-rod/rod@v0.112.6/page.go:103 +0x25
main.hijackRequests.func1(0xc00015c140)
        /mnt/hddstorage/files/programming/go/projects/crawl-test/main.go:72 +0x5c
reflect.Value.call({0x825f40?, 0xc0001261f8?, 0x100c0002de668?}, {0x8f58eb, 0x4}, {0xc0002a7f80, 0x1, 0xc00015c140?})
        /usr/local/go/src/reflect/value.go:586 +0xb07
reflect.Value.Call({0x825f40?, 0xc0001261f8?, 0xc00015c140?}, {0xc0002a7f80?, 0x0?, 0x0?})
        /usr/local/go/src/reflect/value.go:370 +0xbc
github.com/go-rod/rod.(*Browser).eachEvent.func1()
        /home/nikola/go/pkg/mod/github.com/go-rod/rod@v0.112.6/browser.go:398 +0x412
created by main.hijackRequests
        /mnt/hddstorage/files/programming/go/projects/crawl-test/main.go:65 +0xa5
exit status 2
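For readers unfamiliar with the first line of the trace: "assignment to entry in nil map" is the Go runtime error raised when code writes to a map that was never initialized. The sketch below only reproduces that runtime behaviour in isolation. It does not explain why the map inside rod is nil at that point, and `writeTo` is a hypothetical helper.

```go
package main

import "fmt"

// writeTo stores a value in m and reports a panic, if one occurs, as an error.
func writeTo(m map[string]string) (err error) {
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("%v", r)
		}
	}()
	m["key"] = "value"
	return nil
}

func main() {
	var uninitialized map[string]string // nil map: reads are allowed, writes panic
	fmt.Println(writeTo(uninitialized))

	fmt.Println(writeTo(make(map[string]string))) // initialized map: the write succeeds
}
```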

I'd expect it to complete the request, and the page should load my stubbed HTML. Maybe a race condition is occurring.

crawler.go

func StartCrawler(ctx context.Context, page *rod.Page) (chan bool, error) {
	done := make(chan bool)
	
	go func() {
		fmt.Println("navigating")
		page.MustNavigate(baseUrl + "/login").WaitLoad()

		done <- true
	}()

	return done, nil
}

@ysmood
Collaborator

ysmood commented Mar 22, 2023

It's a limitation of CDP: page.HTML() uses JS to get the content of the page, but you can't run JS while the request is paused at FetchRequestPaused. You have to call FetchFulfillRequest before you can run JS.

Also, to run JS you have to wait for the navigation to complete. Code like the following works fine for me:

func main() {
	browser := rod.New().MustConnect()

	page := browser.MustPage("")

	utils.E(proto.FetchEnable{
		Patterns: []*proto.FetchRequestPattern{
			{URLPattern: "*"},
		},
	}.Call(page))

	go page.EachEvent(func(e *proto.FetchRequestPaused) {
		utils.E(proto.FetchFulfillRequest{
			RequestID:    e.RequestID,
			ResponseCode: 200,
			Body:         []byte("<html>test</html>"),
		}.Call(page))
	})()

	page.MustNavigate("http://example.com")

	fmt.Println(page.MustHTML())
}
