Combined gas law and Reinforcement Learning: Difference between pages

{{simplify}}
{{Continuum mechanics}}
The '''combined gas law''' is a [[formula]] about [[ideal gas]]es. It comes from putting together three different laws about the [[pressure]], [[volume]], and [[temperature]] of a gas. Each of these laws relates two of those quantities while the third one stays the same. The three laws are:
*[[Charles's law]], which says that volume and temperature are directly [[Proportionality|proportional]] to each other as long as pressure [[constant|stays the same]].
*[[Boyle's law]] says that pressure and volume are inversely proportional to each other at the same temperature.
*[[Gay-Lussac's law]] says that temperature and pressure are directly proportional as long as the volume stays the same.
The combined gas law shows how the three variables are related to each other. It says that:
{{cquote|The ratio between the pressure-volume product and the temperature of a system remains constant.}}

The formula of the combined gas law is:
:<math> \qquad \frac {PV}{T}= k </math>
where:
:{{math|''P''}} is the [[pressure]]
:{{math|''V''}} is the [[volume]]
:{{math|''T''}} is the [[thermodynamic temperature|temperature]] measured in [[kelvin]]
:{{math|''k''}} is a constant (with units of energy divided by temperature).
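For example, one mole of an ideal gas at 273.15&nbsp;K and 101.325&nbsp;kPa occupies about 22.4&nbsp;litres, so for that sample the constant works out to
:<math> k = \frac{PV}{T} \approx \frac{(101\,325\ \mathrm{Pa})(0.0224\ \mathrm{m^3})}{273.15\ \mathrm{K}} \approx 8.31\ \mathrm{J/K}, </math>
which is (for one mole) the value of the [[gas constant]].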


To compare the same gas in two different states, the law can be written as:
:<math> \qquad \frac {P_1V_1}{T_1}= \frac {P_2V_2}{T_2} </math>
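As an illustration (not part of the original article), the two-state form can be solved for any one unknown. The Python sketch below solves it for the final volume; the function name and the numbers are made up for the example.
<syntaxhighlight lang="python">
def final_volume(p1, v1, t1, p2, t2):
    """Solve P1*V1/T1 = P2*V2/T2 for V2.

    Any consistent pressure and volume units work; temperatures must be in kelvin.
    """
    return p1 * v1 * t2 / (t1 * p2)


# Example: 2.0 L of gas at 100 kPa and 300 K is compressed to 150 kPa and warmed to 350 K.
print(round(final_volume(100.0, 2.0, 300.0, 150.0, 350.0), 3))  # 1.556 (litres)
</syntaxhighlight>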
 
By adding [[Avogadro's law]] to the combined gas law, we get what is called the [[ideal gas law]].
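Written out, with ''n'' the number of moles of gas and ''R'' the [[gas constant]], the ideal gas law is
:<math> PV = nRT. </math>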
 
==Derivation from the gas laws==
 
{{main|Gas Laws}}
 
[[Boyle's Law]] states that the pressure-volume product is constant:
:<math>PV = k_1 \qquad (1)</math>
[[Charles's Law]] shows that the volume is proportional to the [[absolute temperature]]:
:<math>\frac{V}{T} = k_2 \qquad (2)</math>
[[Gay-Lussac's Law]] says that the pressure is proportional to the absolute temperature:
:<math>P = k_3T \qquad (3)</math>
where ''P'' is the [[pressure]], ''V'' the volume and ''T'' the absolute temperature of an [[ideal gas]].
 
By combining (1) with either (2) or (3), we can obtain a new equation relating ''P'', ''V'' and ''T''. The "constant" in Boyle's law may still depend on the temperature that is being held fixed, and the constant in Charles's law may depend on the fixed pressure, so we write them as ''k''<sub>1</sub>(''T'') and ''k''<sub>2</sub>(''P''). If we divide equation (1) by temperature and multiply equation (2) by pressure, we get:
:<math>\frac{PV}{T} = \frac{k_1(T)}{T}</math>
:<math>\frac{PV}{T} = k_2(P)\,P</math>.
As the left-hand side of both equations is the same, we arrive at
:<math>\frac{k_1(T)}{T} = k_2(P)\,P</math>.
The left-hand side depends only on ''T'' and the right-hand side only on ''P''. Since ''T'' and ''P'' can be chosen independently, both sides must equal one and the same constant, which means that
:<math>\frac{PV}{T} = \textrm{constant}</math>.
 
Substituting in [[Avogadro's Law]] yields the [[ideal gas equation]].
 
==Physical derivation==
 
A derivation of the combined gas law using only elementary algebra can contain surprises. For example, starting from the three empirical laws
:<math> P = k_V\, T \,\!</math>{{spaces|10}}(1) Gay-Lussac's Law, volume assumed constant
:<math> V = k_P T \,\!</math>{{spaces|10}}(2) Charles's Law, pressure assumed constant
:<math> P V = k_T \,\!</math>{{spaces|10}}(3) Boyle's Law, temperature assumed constant
where ''k<sub>V</sub>'', ''k<sub>P</sub>'', and ''k<sub>T</sub>'' are the constants, one can multiply the three together to obtain
:<math> PVPV = k_V T k_P T k_T \,\!</math>
Taking the square root of both sides and dividing by ''T'' appears to produce the desired result
:<math> \frac {PV}{T} = \sqrt{k_P k_V k_T} \,\!</math>
However, if before applying the above procedure, one merely rearranges the terms in Boyle's Law, ''k<sub>T</sub>''&nbsp;=&nbsp;''PV'', then after canceling and rearranging, one obtains
:<math> \frac{k_T}{k_V k_P} = T^2 \,\!</math>
which is not very helpful if not misleading.
 
A physical derivation, longer but more reliable, begins by realizing that the constant-volume parameter in Gay-Lussac's law will change as the system volume changes. At constant volume ''V''<sub>1</sub>, the law might appear as ''P''&nbsp;=&nbsp;''k''<sub>1</sub>''T'', while at constant volume ''V''<sub>2</sub> it might appear as ''P''&nbsp;=&nbsp;''k''<sub>2</sub>''T''.
Denoting this "variable constant volume" by ''k<sub>V</sub>''(''V''), rewrite the law as
:<math> P = k_V(V) \,T \,\!</math>{{spaces|10}}(4)
The same consideration applies to the constant in Charles's law, which may be rewritten
:<math> V = k_P(P) \,T \,\!</math>{{spaces|10}}(5)
 
In seeking to find ''k<sub>V</sub>''(''V''), one should not unthinkingly eliminate ''T'' between (4) and (5), since ''P'' is varying in the former while it is assumed constant in the latter. Rather, it should first be determined in what sense these equations are compatible with one another. To gain insight into this, recall that any two variables determine the third. Choosing ''P'' and ''V'' to be independent, we picture the ''T'' values forming a surface above the ''PV''-plane. A definite ''V''<sub>0</sub> and ''P''<sub>0</sub> define a ''T''<sub>0</sub>, a point on that surface. Substituting these values in (4) and (5), and rearranging yields
:<math> T_0 = \frac{P_0}{k_V(V_0)} \quad \text{and} \quad T_0 = \frac{V_0}{k_P(P_0)}</math>
Since these both describe what is happening at the same point on the surface, the two numeric expressions can be equated and rearranged
:<math> \frac{k_V(V_0)}{k_P(P_0)} = \frac{P_0}{V_0}\,\!</math>{{spaces|10}}(6)
Note that {{sfrac|1|''k<sub>V</sub>''(''V''<sub>0</sub>)}} and {{sfrac|1|''k<sub>P</sub>''(''P''<sub>0</sub>)}} are the slopes of orthogonal lines parallel to the ''P''-axis/''V''-axis and through that point on the surface above the ''PV'' plane. The ratio of the slopes of these two lines depends only on the value of {{sfrac|''P''<sub>0</sub>|''V''<sub>0</sub>}} at that point.
 
Note that the functional form of (6) did not depend on the particular point chosen. The same formula would have arisen for any other combination of ''P'' and ''V'' values. Therefore, one can write
:<math> \frac{k_V(V)}{k_P(P)} = \frac{P}{V} \quad\forall P, \forall V</math>{{spaces|10}}(7)
This says that each point on the surface has its own pair of orthogonal lines through it, with their slope ratio depending only on that point. Whereas (6) is a relation between specific slopes and variable values, (7) is a relation between slope functions and function variables. It holds true for any point on the surface, i.e. for any and all combinations of ''P'' and ''V'' values. To solve this equation for the function ''k<sub>V</sub>''(''V''), first separate the variables, ''V'' on the left and ''P'' on the right.
:<math> V\,k_V(V) = P\,k_P(P)</math>
Choose any pressure ''P''<sub>1</sub>. The right side evaluates to some arbitrary value, call it ''k''<sub>arb</sub>.
:<math> V\,k_V(V) = k_{\text{arb}} \,\!</math>{{spaces|10}}(8)
This particular equation must now hold true, not just for one value of ''V'', but for '''all''' values of ''V''. The only definition of ''k<sub>V</sub>''(''V'') that guarantees this for all ''V'' and arbitrary ''k''<sub>arb</sub> is
:<math> k_V(V) = \frac{k_{\text{arb}}}{V}</math>{{spaces|10}}(9)
which may be verified by substitution in (8).
 
Finally, substituting (9) in Gay-Lussac's law (4) and rearranging produces the combined gas law
:<math> \frac{PV}{T} = k_{\text{arb}}\,\!</math>
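As a quick check (not part of the original derivation), the last substitution can be verified symbolically. The sketch below assumes the SymPy library is available.
<syntaxhighlight lang="python">
import sympy as sp

V, T, k_arb = sp.symbols("V T k_arb", positive=True)

# Equation (9): the "variable constant" of Gay-Lussac's law, k_V(V) = k_arb / V
k_V = k_arb / V

# Substituting (9) into Gay-Lussac's law (4) gives the pressure P = k_V(V) * T
P = k_V * T

# The combined gas law follows: P*V/T simplifies to the constant k_arb
print(sp.simplify(P * V / T))  # prints: k_arb
</syntaxhighlight>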
 
Note that while Boyle's law was not used in this derivation, it is easily deduced from the result. Generally, any two of the three starting laws are all that is needed in this type of derivation&nbsp;– all starting pairs lead to the same combined gas law.<ref>A similar derivation, one starting from Boyle's law, may be found in Raff, pp.&nbsp;14–15</ref>
 
==Applications==
The combined gas law can be used to explain the mechanics of systems in which pressure, temperature, and volume all change, for example air conditioners, refrigerators, and the formation of clouds. It is also used in fluid mechanics and thermodynamics.
 
==Related pages==
*[[Dalton's law]]
 
==Notes==
{{reflist}}
 
==Sources==
*Raff, Lionel. ''Principles of Physical Chemistry''. New Jersey: Prentice-Hall, 2001.
 
==Other websites==
* [http://chair.pa.msu.edu/applets/pvt/a.htm Interactive Java applet on the combined gas law] by Wolfgang Bauer
 
{{DEFAULTSORT:Combined gas law}}


'''Reinforcement learning''' is teaching a ''[[software agent]]'' how to behave in an environment by telling it how good it's doing. It is an area of [[machine learning]] inspired by [[Behaviorism|behaviorist psychology]].

Reinforcement learning is different from [[supervised learning]] because the correct inputs and outputs are never shown. Also, unlike supervised learning, reinforcement learning usually learns as it goes (online learning). This means an agent has to choose between exploring and sticking with what it knows best.

== Introduction ==

[[File:Rl agent.png|369x369px|thumb]]

A reinforcement learning system is made of a ''policy'' (<math>\pi</math>), a ''reward function'' (<math>R</math>), a ''value function'' (<math>v</math>), and an optional ''model'' of the environment.
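To make these four parts concrete, here is a rough Python sketch (not from the article); the tiny environment, the state names and the reward values are invented for the illustration.
<syntaxhighlight lang="python">
STATES = ["start", "middle", "goal"]   # states of a made-up environment
ACTIONS = ["left", "right"]            # actions the agent can take

def policy(state):
    """Policy: a simple table of rules mapping each state to an action."""
    return {"start": "right", "middle": "right", "goal": "right"}[state]

def reward(state, action):
    """Reward function: a number saying how good taking this action in this state is."""
    return 1.0 if state == "middle" and action == "right" else 0.0

value = {state: 0.0 for state in STATES}   # value function: the agent's learned estimates

def model(state, action):
    """Optional model: the agent's own guess of which state comes next (used for planning)."""
    if action == "right":
        return {"start": "middle", "middle": "goal", "goal": "goal"}[state]
    return "start"
</syntaxhighlight>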

A ''policy'' tells the agent what to do in a certain situation. It can be a simple table of rules, or a complicated search for the correct action. Policies can even be stochastic, which means instead of rules the policy assigns ''probabilities'' to each action. A policy by itself can make an agent do things, but it can't learn on its own.
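For instance, a stochastic policy can be written as a table of probabilities for each action in each state; the numbers below are arbitrary choices for the example.
<syntaxhighlight lang="python">
import random

def stochastic_policy(state):
    """Sample an action from a state-dependent probability table (made-up numbers)."""
    table = {
        "start":  {"left": 0.2, "right": 0.8},
        "middle": {"left": 0.5, "right": 0.5},
        "goal":   {"left": 0.5, "right": 0.5},
    }
    probabilities = table[state]
    return random.choices(list(probabilities), weights=list(probabilities.values()), k=1)[0]
</syntaxhighlight>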

A ''reward function'' defines the goal for an agent. It takes in a state (or a state and the action taken at that state) and gives back a number called the ''reward'', which tells the agent how good it is to be in that state. The agent's job is to get the biggest amount of reward it possibly can in the long run. If an action yields a low reward, the agent will probably take a better action in the future. Biology uses reward signals like pleasure or pain to make sure organisms stay alive to reproduce. Reward signals can also be stochastic, like a [[slot machine]] at a casino, where sometimes they pay and sometimes they don't.
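The slot-machine comparison can be made concrete with a stochastic reward; the payout size and probability below are made up for the example.
<syntaxhighlight lang="python">
import random

def slot_machine_reward(state, action):
    """Stochastic reward: pays 10.0 about one time in ten, otherwise nothing (ignores its inputs)."""
    return 10.0 if random.random() < 0.1 else 0.0
</syntaxhighlight>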

A ''value function'' tells an agent how much reward it will get following a policy <math>\pi</math> starting from state <math>s</math>. It represents how ''desirable'' it is to be in a certain state. Since the value function isn't given to the agent directly, it needs to come up with a good guess or estimate based on the reward it's gotten so far. Value function estimation is the most important part of most reinforcement learning algorithms.
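One simple way to build such an estimate (a sketch, not the article's method) is to move the value of a state toward the average of the returns observed after visiting it.
<syntaxhighlight lang="python">
def update_value_estimate(value, counts, state, observed_return):
    """Move the estimate value[state] toward the running average of observed returns."""
    counts[state] = counts.get(state, 0) + 1
    old = value.get(state, 0.0)
    value[state] = old + (observed_return - old) / counts[state]
</syntaxhighlight>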

A ''model'' is the agent's mental copy of the environment. It's used to ''plan'' future actions.

Knowing this, we can talk about the main loop for a reinforcement learning episode. The agent interacts with the environment in ''discrete time steps''. Think of it like the "tick-tock" of a clock. With discrete time, things only happen during the "ticks" and the "tocks", and not in between. At each time <math>t = 0, 1, 2, 3, \ldots</math>, the agent observes the environment's state <math>S_t</math> and picks an action <math>A_t</math> based on a policy <math>\pi</math>. The next time step, the agent receives a reward signal <math>R_{t+1}</math> and a new observation <math>S_{t+1}</math>. The value function <math>v(S_t)</math> is updated using the reward. This continues until a terminal state <math>S_T</math> is reached.
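A minimal Python sketch of this loop, assuming a made-up four-state environment and a simple one-step (temporal-difference style) value update; the step size, discount factor and exploration probability are arbitrary choices, not values from the article.
<syntaxhighlight lang="python">
import random

def step(state, action):
    """Made-up environment: states 0..3 on a line; reaching state 3 ends the episode."""
    next_state = max(0, min(3, state + action))
    reward = 1.0 if next_state == 3 else 0.0
    return next_state, reward, next_state == 3

def policy(state):
    """Stochastic policy: usually move right, sometimes explore to the left."""
    return 1 if random.random() < 0.8 else -1

value = {s: 0.0 for s in range(4)}   # value function v(s), all estimates start at zero
alpha, gamma = 0.1, 0.9              # step size and discount factor (arbitrary)

for episode in range(500):
    s = 0                                    # observe the starting state
    done = False
    while not done:                          # loop over discrete time steps
        a = policy(s)                        # pick action A_t using the policy
        s_next, r, done = step(s, a)         # receive reward R_{t+1} and observation S_{t+1}
        target = r + (0.0 if done else gamma * value[s_next])
        value[s] += alpha * (target - value[s])   # update the value function using the reward
        s = s_next                           # continue until a terminal state is reached

print({s: round(v, 2) for s, v in value.items()})
</syntaxhighlight>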