.\" Man page generated from reStructuredText.
.
.TH "MECHANIZE" "1" "Jan 17, 2020" "0.4.5" "mechanize"
.SH NAME
mechanize \- mechanize Documentation
.
.nr rst2man-indent-level 0
.
.de1 rstReportMargin
\\$1 \\n[an-margin]
level \\n[rst2man-indent-level]
level margin: \\n[rst2man-indent\\n[rst2man-indent-level]]
-
\\n[rst2man-indent0]
\\n[rst2man-indent1]
\\n[rst2man-indent2]
..
.de1 INDENT
.\" .rstReportMargin pre:
. RS \\$1
. nr rst2man-indent\\n[rst2man-indent-level] \\n[an-margin]
. nr rst2man-indent-level +1
.\" .rstReportMargin post:
..
.de UNINDENT
. RE
.\" indent \\n[an-margin]
.\" old: \\n[rst2man-indent\\n[rst2man-indent-level]]
.nr rst2man-indent-level -1
.\" new: \\n[rst2man-indent\\n[rst2man-indent-level]]
.in \\n[rst2man-indent\\n[rst2man-indent-level]]u
..
.sp
Stateful programmatic web browsing in Python. Browse pages programmatically with easy HTML form filling and clicking of links.
.SH FREQUENTLY ASKED QUESTIONS
.SS Contents
.INDENT 0.0
.IP \(bu 2
\fI\%General\fP
.INDENT 2.0
.IP \(bu 2
\fI\%Which version of Python do I need?\fP
.IP \(bu 2
\fI\%What dependencies does mechanize need?\fP
.IP \(bu 2
\fI\%What license does mechanize use?\fP
.UNINDENT
.IP \(bu 2
\fI\%Usage\fP
.INDENT 2.0
.IP \(bu 2
\fI\%I\(aqm not getting the HTML page I expected to see?\fP
.IP \(bu 2
\fI\%Is JavaScript supported?\fP
.IP \(bu 2
\fI\%My HTTP response data is truncated?\fP
.IP \(bu 2
\fI\%Is there any example code?\fP
.UNINDENT
.IP \(bu 2
\fI\%Cookies\fP
.INDENT 2.0
.IP \(bu 2
\fI\%Which HTTP cookie protocols does mechanize support?\fP
.IP \(bu 2
\fI\%What about RFC 2109?\fP
.IP \(bu 2
\fI\%Why don\(aqt I have any cookies?\fP
.IP \(bu 2
\fI\%My response claims to be empty, but I know it\(aqs not?\fP
.IP \(bu 2
\fI\%What\(aqs the difference between the .load() and .revert() methods of CookieJar?\fP
.IP \(bu 2
\fI\%Is it threadsafe?\fP
.IP \(bu 2
\fI\%How do I do X?\fP
.UNINDENT
.IP \(bu 2
\fI\%Forms\fP
.INDENT 2.0
.IP \(bu 2
\fI\%How do I figure out what control names and
values to use?\fP
.IP \(bu 2
\fI\%What do those \(aq*\(aq characters mean in the string representations of list controls?\fP
.IP \(bu 2
\fI\%What do those parentheses (round brackets) mean in the string representations of list controls?\fP
.IP \(bu 2
\fI\%Why doesn\(aqt <some control> turn up in the data returned by .click*() when that control has non\-None value?\fP
.IP \(bu 2
\fI\%Why does mechanize not follow the HTML 4.0 / RFC 1866 standards for RADIO and multiple\-selection SELECT controls?\fP
.IP \(bu 2
\fI\%Why does .click() ing on a button not work for me?\fP
.IP \(bu 2
\fI\%How do I change INPUT TYPE=HIDDEN field values (for example, to emulate the effect of JavaScript code)?\fP
.IP \(bu 2
\fI\%I\(aqm having trouble debugging my code.\fP
.IP \(bu 2
\fI\%I have a control containing a list of integers. How do I select the one whose value is nearest to the one I want?\fP
.UNINDENT
.IP \(bu 2
\fI\%Miscellaneous\fP
.INDENT 2.0
.IP \(bu 2
\fI\%I want to see what my web browser is doing?\fP
.IP \(bu 2
\fI\%JavaScript is messing up my web\-scraping. What do I do?\fP
.UNINDENT
.UNINDENT
.SS General
.SS Which version of Python do I need?
.sp
mechanize works on Python 2 (>= 2.7) and Python 3 (>= 3.5).
.SS What dependencies does mechanize need?
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
html5lib
.ft P
.fi
.UNINDENT
.UNINDENT
.SS What license does mechanize use?
.sp
mechanize is licensed under the \fI\%BSD\-3\-clause\fP license.
.SS Usage
.SS I\(aqm not getting the HTML page I expected to see?
.sp
See debugging\&.
.SS Is JavaScript supported?
.sp
No, sorry. See \fI\%JavaScript is messing up my web\-scraping. What do I do?\fP
.SS My HTTP response data is truncated?
.sp
\fImechanize.Browser\(aqs\fP response objects support the \fI\&.seek()\fP method, and can still be used after \fI\&.close()\fP has been called. Response data is not fetched until it is needed, so navigation away from a URL before fetching all of the response will truncate it.
Call \fIresponse.get_data()\fP before navigation if you don\(aqt want that to happen.
.SS Is there any example code?
.sp
Look in the \fIexamples/\fP directory. Note that the examples on the forms page are executable as\-is. Contributions of example code would be very welcome!
.SS Cookies
.SS Which HTTP cookie protocols does mechanize support?
.sp
Netscape and \fI\%RFC 2965\fP\&. RFC 2965 handling is switched off by default.
.SS What about RFC 2109?
.sp
RFC 2109 cookies are currently parsed as Netscape cookies, and treated by default as RFC 2965 cookies thereafter if RFC 2965 handling is enabled, or as Netscape cookies otherwise.
.SS Why don\(aqt I have any cookies?
.sp
See cookies\&.
.SS My response claims to be empty, but I know it\(aqs not?
.sp
Did you call \fIresponse.read()\fP (e.g., in a debug statement), then forget that all the data has already been read? In that case, you may want to use \fImechanize.response_seek_wrapper\fP\&. \fImechanize.Browser\fP always returns seekable responses, so it\(aqs not necessary to use this explicitly in that case.
.SS What\(aqs the difference between the \fI\&.load()\fP and \fI\&.revert()\fP methods of \fICookieJar\fP?
.sp
\fI\&.load()\fP \fIappends\fP cookies from a file. \fI\&.revert()\fP discards all existing cookies held by the \fICookieJar\fP first (but it won\(aqt lose any existing cookies if the loading fails).
.SS Is it threadsafe?
.sp
See threading\&.
.SS How do I do \fIX\fP?
.sp
Refer to the API documentation in browser_api\&.
.SS Forms
.SS How do I figure out what control names and values to use?
.sp
\fIprint(form)\fP is usually all you need. In your code, things like the \fIHTMLForm.items\fP attribute of \fBmechanize.HTMLForm\fP instances can be useful to inspect forms at runtime.
Note that it\(aqs possible to use item labels instead of item names, which can be useful: use the \fIby_label\fP arguments to the various methods, and the \fI\&.get_value_by_label()\fP / \fI\&.set_value_by_label()\fP methods on \fIListControl\fP\&.
.SS What do those \fI\(aq*\(aq\fP characters mean in the string representations of list controls?
.sp
A \fI*\fP next to an item means that item is selected.
.SS What do those parentheses (round brackets) mean in the string representations of list controls?
.sp
Parentheses \fI(foo)\fP around an item mean that item is disabled.
.SS Why doesn\(aqt \fI<some control>\fP turn up in the data returned by \fI\&.click*()\fP when that control has non\-\fINone\fP value?
.sp
Either the control is disabled, or it is not successful for some other reason. \(aqSuccessful\(aq (see \fI\%HTML 4 specification\fP) means that the control will cause data to get sent to the server.
.SS Why does mechanize not follow the HTML 4.0 / RFC 1866 standards for \fIRADIO\fP and multiple\-selection \fISELECT\fP controls?
.sp
Because by default, it follows browser behaviour when setting the initially\-selected items in list controls that have no items explicitly selected in the HTML.
.SS Why does \fI\&.click()\fP ing on a button not work for me?
.sp
Clicking on a \fIRESET\fP button doesn\(aqt do anything, by design \- this is a library for web automation, not an interactive browser. Even in an interactive browser, clicking on \fIRESET\fP sends nothing to the server, so there is little point in having \fI\&.click()\fP do anything special here.
.sp
Clicking on a \fIBUTTON TYPE=BUTTON\fP doesn\(aqt do anything either, also by design. This time, the reason is that \fIBUTTON\fP exists in the HTML standard only so that one can attach JavaScript callbacks to its events. Their execution may result in information getting sent back to the server.
mechanize, however, knows nothing about these callbacks, so it can\(aqt do anything useful with a click on a \fIBUTTON\fP whose type is \fIBUTTON\fP\&.
.sp
Generally, JavaScript may be messing things up in all kinds of ways. See \fI\%JavaScript is messing up my web\-scraping. What do I do?\fP\&.
.SS How do I change \fIINPUT TYPE=HIDDEN\fP field values (for example, to emulate the effect of JavaScript code)?
.sp
As with any control, set the control\(aqs \fIreadonly\fP attribute to false.
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
form.find_control("foo").readonly = False  # allow changing .value of control foo
form.set_all_readonly(False)  # allow changing the .value of all controls
.ft P
.fi
.UNINDENT
.UNINDENT
.SS I\(aqm having trouble debugging my code.
.sp
See debugging\&.
.SS I have a control containing a list of integers. How do I select the one whose value is nearest to the one I want?
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
def closest_int_value(form, ctrl_name, value):
    # Item names are strings, so convert them to integers before comparing.
    values = [int(item.name) for item in form.find_control(ctrl_name).items]
    # Choose the value with the smallest absolute distance from the target.
    return str(min(values, key=lambda v: abs(v \- value)))

form["distance"] = [closest_int_value(form, "distance", 23)]
.ft P
.fi
.UNINDENT
.UNINDENT
.SS Miscellaneous
.SS I want to see what my web browser is doing?
.sp
Use the developer tools for your browser (you may have to install them first). These provide excellent views into all HTTP requests/responses in the browser.
.SS JavaScript is messing up my web\-scraping. What do I do?
.sp
JavaScript is used in web pages for many purposes \-\- for example: creating content that was not present in the page at load time, submitting or filling in parts of forms in response to user actions, setting cookies, etc. mechanize does not provide any support for JavaScript.
.sp
If you come across this in a page you want to automate, you have a few options.
Here they are, roughly in order of simplicity:
.INDENT 0.0
.INDENT 3.5
.INDENT 0.0
.IP \(bu 2
Figure out what the JavaScript is doing and emulate it in your Python code. The simplest case is if the JavaScript is setting some cookies. In that case you can inspect the cookies in your browser and emulate setting them in mechanize with \fBmechanize.Browser.set_simple_cookie()\fP\&.
.IP \(bu 2
More complex is to use your browser developer tools to see exactly what requests are sent by the browser and emulate them in mechanize by using \fBmechanize.Request\fP to create the request manually and open it with \fBmechanize.Browser.open()\fP\&.
.IP \(bu 2
Third is to use some browser automation framework/library to scrape the site instead of using mechanize. These libraries typically drive a headless version of a full browser that can execute all JavaScript. They are typically much slower than using mechanize and far more resource intensive, but do work as a last resort.
.UNINDENT
.UNINDENT
.UNINDENT
.SH BROWSER API
.sp
API documentation for the mechanize \fBBrowser\fP object. You can create a mechanize \fBBrowser\fP instance as:
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
from mechanize import Browser
br = Browser()
.ft P
.fi
.UNINDENT
.UNINDENT
.SS Contents
.INDENT 0.0
.IP \(bu 2
\fI\%Browser API\fP
.INDENT 2.0
.IP \(bu 2
\fI\%The Browser\fP
.IP \(bu 2
\fI\%The Request\fP
.IP \(bu 2
\fI\%The Response\fP
.IP \(bu 2
\fI\%Miscellaneous\fP
.UNINDENT
.UNINDENT
.SS The Browser
.INDENT 0.0
.TP
.B class mechanize.Browser(history=None, request_class=None, content_parser=None, factory_class=mechanize.Factory, allow_xhtml=False)
Browser\-like class with support for history, forms and links.
.sp
\fBBrowserStateError\fP is raised whenever the browser is in the wrong state to complete the requested operation \- e.g., when \fI\%back()\fP is called when the browser history is empty, or when \fI\%follow_link()\fP is called when the current response does not contain HTML data.
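.sp
A minimal sketch of handling \fBBrowserStateError\fP when stepping back through history. The names \fBback_or_none\fP and \fBdemo\fP are hypothetical and not part of mechanize. The exception class is passed in as a parameter, which is an assumption made here so that the helper can be exercised without mechanize installed.
.INDENT 7.0
.INDENT 3.5
.sp
.nf
.ft C
```python
def back_or_none(browser, state_error, n=1):
    """Go back n steps, or return None if state_error is raised."""
    try:
        return browser.back(n)
    except state_error:
        # Raised by back() when the browser history is empty.
        return None


def demo():
    """Run the helper against a real browser (requires mechanize)."""
    import mechanize

    br = mechanize.Browser()
    # A new browser has no history yet, so back() cannot succeed.
    return back_or_none(br, mechanize.BrowserStateError)
```
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
With this approach a caller can treat a \fBNone\fP result as "nothing to go back to" rather than wrapping every navigation call in its own try block.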
.sp
Public attributes:
.sp
request: current request (\fI\%mechanize.Request\fP)
.sp
form: currently selected form (see \fI\%select_form()\fP)
.INDENT 7.0
.TP
.B Parameters
.INDENT 7.0
.IP \(bu 2
\fBhistory\fP \-\- object implementing the \fI\%mechanize.History\fP interface. Note this interface is still experimental and may change in future. This object is owned by the browser instance and must not be shared among browsers.
.IP \(bu 2
\fBrequest_class\fP \-\- Request class to use. Defaults to \fI\%mechanize.Request\fP
.IP \(bu 2
\fBcontent_parser\fP \-\- A function that is responsible for parsing received html/xhtml content. See the builtin \fI\%mechanize._html.content_parser()\fP function for details on the interface this function must support.
.IP \(bu 2
\fBfactory_class\fP \-\- HTML Factory class to use. Defaults to \fBmechanize.Factory\fP
.UNINDENT
.UNINDENT
.INDENT 7.0
.TP
.B add_client_certificate(url, key_file, cert_file)
Add an SSL client certificate, for HTTPS client auth.
.sp
key_file and cert_file must be filenames of the key and certificate files, in PEM format. You can use e.g. OpenSSL to convert a p12 (PKCS 12) file to PEM format:
.sp
.nf
openssl pkcs12 \-clcerts \-nokeys \-in cert.p12 \-out cert.pem
openssl pkcs12 \-nocerts \-in cert.p12 \-out key.pem
.fi
.sp
Note that client certificate password input is currently very inflexible. It seems to be console only, which is presumably the default behaviour of libopenssl. In future, mechanize may support third\-party libraries that (I assume) allow more options here.
.UNINDENT
.INDENT 7.0
.TP
.B back(n=1)
Go back n steps in history, and return response object.
.sp
n: go back this number of steps (default 1 step)
.UNINDENT
.INDENT 7.0
.TP
.B click(*args, **kwds)
See \fBmechanize.HTMLForm.click()\fP for documentation.
.UNINDENT
.INDENT 7.0
.TP
.B click_link(link=None, **kwds)
Find a link and return a Request object for it.
.sp
Arguments are as for \fI\%find_link()\fP, except that a link may be supplied as the first argument.
.UNINDENT
.INDENT 7.0
.TP
.B cookiejar
Return the current cookiejar (\fBmechanize.CookieJar\fP) or None
.UNINDENT
.INDENT 7.0
.TP
.B find_link(text=None, text_regex=None, name=None, name_regex=None, url=None, url_regex=None, tag=None, predicate=None, nr=0)
Find a link in the current page.
.sp
Links are returned as \fI\%mechanize.Link\fP objects. Examples:
.INDENT 7.0
.INDENT 3.5
.sp
.nf
.ft C
# Return third link that .search()\-matches the regexp "python" (by
# ".search()\-matches", I mean that the regular expression method
# .search() is used, rather than .match()).
find_link(text_regex=re.compile("python"), nr=2)

# Return first http link in the current page that points to
# somewhere on python.org whose link text (after tags have been
# removed) is exactly "monty python".
find_link(text="monty python", url_regex=re.compile("http.*python.org"))

# Return first link with exactly three HTML attributes.
find_link(predicate=lambda link: len(link.attrs) == 3)
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
Links include anchors \fI<a>\fP, image maps \fI<area>\fP, and frames \fI